University of Calgary PRISM: University of Calgary's Digital Repository
Graduate Studies The Vault: Electronic Theses and Dissertations
2015-06-16 Historical Vehicle Traffic Analysis and Commute Time Prediction Using Web Mining
Kaur, Charanjeet
Kaur, C. (2015). Historical Vehicle Traffic Analysis and Commute Time Prediction Using Web Mining (Unpublished master's thesis). University of Calgary, Calgary, AB. doi:10.11575/PRISM/26373 http://hdl.handle.net/11023/2302 master thesis
University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission.
Downloaded from PRISM: https://prism.ucalgary.ca
UNIVERSITY OF CALGARY
Historical Vehicle Traffic Analysis and Commute Time Prediction Using Web Mining
by
Charanjeet Kaur
A THESIS
SUBMITTED TO THE FACULTY OF GRADUATE STUDIES
IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE
DEGREE OF MASTER OF SCIENCE
GRADUATE PROGRAM IN ELECTRICAL AND COMPUTER ENGINEERING
CALGARY, ALBERTA
MAY, 2015
© Charanjeet Kaur 2015
Abstract
Analyzing historical vehicle traffic data has many applications including urban planning and intelligent in-vehicle route prediction. A common practice to acquire this data is through roadside sensors. This approach is expensive because of infrastructure and planning costs and cannot be easily applied to new routes. A Web mining approach is proposed to address these limitations. The proposed system gathers information about vehicle commute times, accidents, and weather reports from heterogeneous Web sources. This information is combined to support vehicle traffic analytics. Clustering analysis is performed on historical data that investigates the traffic patterns of highways and arterial roads with factors having the most impact on commute time. A commute time prediction model is built on historical vehicle traffic data analytics.
The commute time prediction model is trained on the traffic problems observed in the past and forecasts commute time by incorporating the impact of external factors such as weather and accidents.
Preface
Conference Proceeding:
Kaur, C., Krishnamurthy, D., Far, B.H., Using Web Mining to Support Low Cost Historical
Vehicle Traffic Analytics, 26th International Conference on Software Engineering and
Knowledge Engineering, SEKE 2014.
Acknowledgements
I would like to take the opportunity to thank all the people who made this work possible. My deepest gratitude and appreciation go to my supervisor, Dr. Diwakar Krishnamurthy, for his remarkable guidance. I am sincerely grateful to my co-supervisor, Dr. Behrouz H. Far, for teaching and inspiring me over the past two years. I would like to thank both professors for their valuable time and efforts to make this research possible and also for providing financial support throughout my research.
I would also like to thank my friend, Sukhpreet Dhaliwal, for helping me to refine the thesis and providing suggestions. Last but not least, I am grateful to my parents for supporting me spiritually throughout my life. I thank my husband for standing by me through the good and bad times.
Dedication
I dedicate this thesis to my parents for making me be who I am, and my husband, Amandeep
Sekhon, for supporting me all the way.
Table of Contents
Abstract
Preface
Acknowledgements
Dedication
Table of Contents
List of Tables
List of Figures and Illustrations
List of Symbols, Abbreviations and Nomenclature

CHAPTER ONE: INTRODUCTION
1.1 Background and Motivation
1.2 Research Objectives
1.3 Contributions
1.4 Thesis Organization

CHAPTER TWO: LITERATURE REVIEW
2.1 Background
2.2 Related Work
2.1.1 Sensor Infrastructures for Traffic Monitoring
2.1.2 Traffic Characterization
2.1.3 Travel Time Prediction Models
2.3 Summary

CHAPTER THREE: WEB DATA COLLECTION METHODOLOGY
3.1 Data Collection Process
3.1.1 Google Maps
3.1.2 Twitter Search API
3.1.3 Historical Weather Reports
3.2 Data Overlaying
3.3 Summary

CHAPTER FOUR: HISTORICAL VEHICLE TRAFFIC ANALYSIS OF DEERFOOT TRAIL
4.1 Deerfoot Trail Commute Time Analysis
4.1.1 Q1. What are the peak and off-peak traffic hours?
4.1.2 Q2. What are the traffic conditions at peak hours?
4.1.3 Q3. How are peak hours related to number of accidents?
4.1.4 Q4. What are the bottleneck road segments during peak hours?
4.2 Summary

CHAPTER FIVE: UNIQUE TRAFFIC PATTERNS WITH K-MEANS CLUSTERING
5.1 Clustering Analysis
5.2 Commute time patterns
5.3 Summary

CHAPTER SIX: COMMUTE TIME PREDICTION MODEL
6.1 Prediction Model
6.2 Prediction Performance
6.3 Summary

CHAPTER SEVEN: CONCLUSIONS AND FUTURE WORK
7.1 Conclusions
7.2 Future work

REFERENCES
List of Tables
Table 4.1: Deerfoot Trail - north to south commute time (minutes) statistical results at morning hours
Table 4.2: Deerfoot Trail - north to south commute time (minutes) statistical results at evening hours
Table 4.3: Deerfoot Trail - south to north commute time (minutes) statistical results at morning hours
Table 4.4: Deerfoot Trail - south to north commute time (minutes) statistical results at evening hours
Table 6.1: Accuracy rate of predicted commute time
Table 6.2: Performance of prediction model for each cluster
Table 6.3: Relative percentage difference for cluster C5 for various time differences between predicted and actual commute times
List of Figures and Illustrations
Figure 3.1: Google Maps Data Collection Process
Figure 4.1: Probability distribution of commute times at morning peak hours
Figure 4.2: Probability distribution of commute times at morning peak hours
Figure 4.3: Probability distribution of commute times at afternoon peak hours
Figure 4.4: Probability distribution of commute times at afternoon peak hours
Figure 4.5: Probability distribution of commute times at evening peak hours
Figure 4.6: Probability distribution of commute times at evening peak hours
Figure 4.7: Probability distribution of accidents at Deerfoot Trail during morning peak hours (north to south)
Figure 4.8: Probability distribution of accidents at Deerfoot Trail during morning peak hours (south to north)
Figure 4.9: Probability distribution of accidents at Deerfoot Trail during evening peak hours (north to south)
Figure 4.10: Probability distribution of accidents at Deerfoot Trail during evening peak hours (south to north)
Figure 4.11: Percentage of days having maximum commute time (per KM)
Figure 4.12: Percentage of days having maximum commute time (per KM)
Figure 4.13: Average commute time at each segment
Figure 4.14: Average commute time at each segment
Figure 4.15: Segment 2 - Morning rush hours
Figure 4.16: Segment 2 - Morning rush hours
Figure 4.17: Segment 2 - Evening rush hours
Figure 4.18: Segment 3 - Morning rush hours
Figure 4.19: Segment 3 - Evening rush hours
Figure 4.20: Segment 3 - Evening rush hours
Figure 4.21: Segment 4 - Evening rush hours
Figure 4.22: Segment 5 - Morning rush hours
Figure 5.1: Appropriate number of clusters for North to South direction of Deerfoot Trail
Figure 5.2: Appropriate number of clusters for South to North direction of Deerfoot Trail
Figure 5.3: Distribution of days into categories based on weather and accidents for Deerfoot Trail (north to south)
Figure 5.4: Days with no accidents (North to South)
Figure 5.5: Day with no accidents (South to North)
Figure 5.6: Traffic Pattern for Weekends on Good Weather Days with accidents
Figure 5.7: Traffic Pattern for Weekends on Good Weather Days with accidents
Figure 5.8: Regular Traffic Pattern for Working Days
Figure 5.9: Regular Traffic Pattern for Working Days
Figure 5.10: Traffic Pattern for Working days with more number of accidents
Figure 5.11: Traffic Pattern for Working days with more number of accidents
Figure 5.12: Regular Traffic Pattern for Working days on Bad Weather
Figure 5.13: Regular Traffic Pattern for Working days on Bad Weather
Figure 5.14: Random Traffic Pattern on Snowy Days with more number of accidents (North to South)
Figure 5.15: Random Traffic Pattern on Snowy Days with more number of accidents (South to North)
Figure 6.1: Prediction process
Figure 6.2: Naïve Bayes Classifier
Figure 6.3: Day predicted in cluster C1
Figure 6.4: Day predicted in cluster C2
Figure 6.5: Day predicted in cluster C3
Figure 6.6: Day predicted in cluster C4
Figure 6.7: Day predicted in cluster C5
List of Symbols, Abbreviations and Nomenclature
Symbol Definition
API Application Programming Interface
CCTV Closed Circuit Television
CSV Comma Separated Values
FCMP Fuzzy Clustering Multiple Prototype
GPS Global Positioning System
HTTP Hyper Text Transfer Protocol
IMEI International Mobile Equipment Identity
ICT In Current Traffic
ITS Intelligent Transportation System
IVC Inter-Vehicle Communication
JSON JavaScript Object Notation
PDF Probability Distribution Function
QoE Quality of Experience
WEKA Waikato Environment for Knowledge Analysis
Chapter One: INTRODUCTION
1.1 Background and Motivation
With the rapid increase in traffic volumes on arterial roads and highways, several approaches have been adopted to improve existing traffic conditions. Many enhancements have been made to accommodate the expected increase in traffic volumes, to maintain continuous vehicular travel flow, and to reduce traffic congestion.
Traffic congestion is a problem that countries around the world are coping with. A few studies reviewed the state of congestion in Canadian cities and found that in many of the country's large urban areas it has reached serious levels that are imposing significant costs on drivers, the economy, the environment, and the quality of life of Canadians [1, 2]. Transport Canada [1] calculated the economic cost in major Canadian cities from longer travel times and the additional cost of less reliable travel times, which require people to include contingency time in their travel.
Transport Canada calculated the total economic cost of congestion by multiplying the amount of time that commuters and other drivers lost due to congestion by the assumed value those travelers placed on their time. These costs amount to $5.2 billion per year in Canada's five largest cities: Toronto, Montreal, Vancouver, Calgary, and Ottawa [1].
An efficient and reliable transportation network is required to relieve urban congestion and reduce traffic accidents. One of the approaches is to expand the road capacity. However, evidence shows that building more roads to address congestion in Canada’s largest cities is not only impractical from a cost perspective and restrictions of land area, but also ineffective: the new road space is used up as fast as it is built and congestion remains unaffected [2].
A more viable approach would be using the existing network resources more efficiently to provide better road service level. Intelligent transportation systems (ITS) provide keys to resolve congestion problems. ITSs collect, process, and broadcast information to users across transportation networks in order to improve efficiency and safety [3]. A variety of intelligent transportation systems have already been developed and applied in transportation networks to improve the traffic, reduce traffic congestion and predict the commute times [4, 5, 6].
The main emphasis of this research is to support historical traffic analyses in order to understand the traffic congestion problems of the past. The results of these analyses can be used by traffic controllers to avoid similar congestion problems in the future. The second focus of this research is to build a commute time prediction model that takes the historical analyses into consideration.
Here, commute time prediction is the process of estimating the anticipated commute time at a future time by using historical data. Commute time prediction information can be delivered to road users either for pre-trip planning or during the trip [7]. Pre-trip commute time information enables the user to decide on the best route to take, and commute time provided during the trip gives the user the option to take an alternative route with less commute time, or at least relieves the anxiety resulting from being unaware of the situation [7].
Traffic data needed to conduct these studies can be collected in various ways. A common practice to acquire this data is through roadside sensors. This approach is expensive because of infrastructure and planning costs. Furthermore, it cannot be easily applied to new routes. The most significant motivation of this research is to build a traffic analysis and prediction system which does not require an expensive sensor-based data collection platform.
In this research, Web mining is proposed as an alternative approach which collects traffic data from existing Web applications. This approach leverages the capabilities of existing Web sources and does not require additional infrastructure to capture traffic data. Very few studies have focused on data collection from Web sources. Most of the existing work relies on data collected from roadside sensors and detectors. To the best of my knowledge, there exists one study that focused on collecting low-cost data, in which traffic data is collected from Bing Maps to acquire the traffic flow [8]. The drawback of [8] is that it only includes data from one Web source, i.e., Bing Maps, and does not integrate this information with other types of data such as traffic accidents on those roads or weather conditions of the city. This limitation is addressed in my research by incorporating data from multiple Web sources.
In contrast to the traditional approach of relying on specialized roadside instrumentation, my proposed approach is flexible: it can be adapted with little effort and cost to analyze any road for which such Web data is available, and it can also incorporate new data sources and factors which affect the commute time. Integration of data from multiple Web sources will not only identify the congested areas of the city but will also address the problems leading to the congestion. This data will be useful to perform detailed historical analysis to help identify the traffic congestion issues.
This data will also prove beneficial to build a commute time prediction model as this historical data will reveal the congestion problems which could be incorporated into the prediction model.
Heterogeneous data collection from multiple Web sources will be beneficial for both (i) analyzing historical traffic patterns, and (ii) building commute time prediction models.
1.2 Research Objectives
This research has three main objectives: a) supporting historical vehicle traffic analysis; b) identifying commute time classification patterns; and c) developing a prediction model based on the historical vehicle traffic analyses.
These objectives can help find answers to problems related to road traffic by identifying factors that impact the flow of traffic.
The key metric of interest in this research is commute time. Commonly available Web APIs such as the Google Maps API [9] can be used to capture commute time information. However, there are several challenges that need to be addressed before such data can be used. For example, the Google Maps API provides estimates of commute time for a specified route at the instant at which the API query is made. However, it does not support querying of commute time data for arbitrarily specified days and time instants in the past. This makes it difficult to carry out historical traffic analysis. Furthermore, no single Web source provides information about factors that can have an impact on commute time. For example, information about accidents occurring on a route can be queried by mining Tweets [10] generated by commuters and information about weather conditions can be obtained from several Web sites [11][12].
However, such information needs to be correlated in an automated manner with the commute time information to support traffic analyses. To the best of my knowledge, no other system offers historical commute time data and allows analysts to combine such data with metrics that can help reason about observed commute time patterns. Implementing such a system is a key aim of this research.
The second objective of my research is to demonstrate how data captured by such a system can be used to support historical traffic analysis. Specifically, this study shows how the data collected by the system can be exploited to answer the following research questions (RQs) about a given route:
RQ1. What are the peak and off-peak traffic hours?
RQ2. What are the traffic conditions at peak hours?
RQ3. How are peak hours related to number of accidents?
RQ4. What are the bottleneck road segments during peak hours?
The third objective of my research is to exploit historical trends obtained from such analysis to
develop commute time prediction models. Specifically, the research is interested in predicting
the commute time for an arbitrarily specified route, given the time of travel and the weather
conditions on that day.
1.3 Contributions
Several new contributions have been made in this research. Firstly, no study has exploited Web mining for historical traffic analysis incorporating external factors which impact traffic on the roads. In this study, data on the factors that impact vehicle traffic is collected from multiple free Web sources. For example, the factors which impact the traffic flow may include the following:
Number of cars, speed, occupancy
Time of the day, weekday, weekend
Weather, temperature
Accidents, detours, lane closures
Glare (direction of sun)
The information about these factors can be collected from multiple Web sources such as commute time and route information from Google Maps [9]; accidents and events information from Twitter [10]; and temperature, rain or snow information from the weather websites [11].
A Web mining driven traffic analytics system is developed in this research. The system captures and stores information about the factors that impact traffic at 19 time instants of the day. A detailed analysis is performed on this data to measure the impact of each factor on commute time, e.g., how many minutes does an accident add to the commute time? In order to perform this detailed historical traffic analysis, data is collected from three Web sources, namely Google Maps, Twitter and the climate.weather.gc.ca website. The data from these Web sources is combined on the basis of the common factor of day and time. For example, commute time collected from Google Maps, accident data collected from Twitter and weather data from the climate.weather.gc.ca website are merged on day and time, i.e., what was the commute time when an accident occurred at a given time, and what were the weather conditions on that day? In this way, the impact of external factors such as accidents and weather conditions is measured.
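The merge on day and time described above can be illustrated with a minimal Python sketch. The record layouts, field names, and values below are hypothetical and chosen only for illustration; they are not the schema used by the system in this thesis.

```python
from collections import Counter

# Hypothetical sample records (illustrative values, not data from the study).
commute = [  # (day, time, commute time in minutes), e.g. from a maps service
    ("2014-01-06", "08:00", 41),
    ("2014-01-06", "17:00", 55),
    ("2014-01-07", "08:00", 36),
]
accidents = [  # (day, time) of each accident report, e.g. parsed from tweets
    ("2014-01-06", "17:00"),
    ("2014-01-06", "17:00"),
]
weather = {  # day -> snowfall in cm, e.g. from a daily weather report
    "2014-01-06": 4.0,
    "2014-01-07": 0.0,
}


def overlay(commute, accidents, weather):
    """Join the three sources on the common key of day and time."""
    accident_counts = Counter(accidents)  # (day, time) -> number of accidents
    merged = []
    for day, time, minutes in commute:
        merged.append({
            "day": day,
            "time": time,
            "commute_min": minutes,
            # Counter returns 0 for instants with no reported accident.
            "accidents": accident_counts[(day, time)],
            # Weather is reported per day, so it is joined on the day only.
            "snowfall_cm": weather.get(day),
        })
    return merged


rows = overlay(commute, accidents, weather)
for row in rows:
    print(row)
```

Each merged row then answers the question posed above for one time instant: the commute time, how many accidents were reported at that instant, and the weather on that day.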
In this research I applied a clustering technique to group the days having similar effects of external factors on commute time. Clustering identified 6 unique traffic patterns from the 153 days for which the data was collected. Each traffic pattern represents a group of days with a different combination of impact of external factors on commute time. Thus, each traffic pattern is unique in its combination of external factors such as amount of snowfall and number of accidents. In the next major part of the system, a commute time prediction model is designed which uses the information contained in these clusters. Information on the impact of external factors in the past is used to predict the commute times for future days.
Specifically, the analysis system collects and stores information for 18 heavy arterial roads and highways in the city of Calgary, Alberta, Canada. This information is also collected for the sub-segments of these 18 roads and highways in order to study the traffic problems related to specific intersections of the roads. The system also extracts reports of accidents on these roads by mining Twitter posts, i.e., Tweets, related to these accidents and overlays this information with the commute time data. Finally, detailed information about weather conditions such as temperature, snowfall, and snow accumulation on the ground is mined from the Web and associated with the other pieces of data. The utility of this system is illustrated through a study that investigates the traffic patterns of the busiest highway in Calgary, i.e., Deerfoot Trail, along with the factors having the most impact on commute time on this highway.
As a result, new insights are provided on commute time patterns and the factors that influence them. These insights are found for the whole road as well as for the sub-segments of the road. A few of the results found in this study are:
The evening peak period is significantly longer than the morning peak period: the evening peak lasts 3 hours, whereas the morning peak lasts one and a half hours.
Far more accidents occur in peak hours than in off-peak hours. The highest number recorded in the historical data at a single peak-hour time instant was 9 accidents, whereas off-peak hours never had more than one accident at the same time.
The commute time to travel the studied segment of Deerfoot Trail is 35 minutes in off-peak hours under normal traffic conditions, i.e., no accidents and no inclement weather. However, commute time could go up to one hour with a moderate number of accidents on a day with no snowfall, and could exceed two hours for the same distance under inclement weather conditions and a high number of accidents.
On a blizzard day in Calgary [13] with 25 cm of snow and 9 accidents at 3 PM, commute time went up to 2 hours and 6 minutes. The number of accidents at a single time instant was 5 times the number observed on a clear weather day.
Traffic accidents posted on Twitter are associated with a particular intersection of the highway by text parsing the tweets. Parsing multiple tweets collected over a period of time helped to identify hotspots on the highway. The bottleneck segment in the southbound direction of Deerfoot Trail is the segment near Downtown, that is, 16 Avenue to 17 Avenue, in the time period between 4:30 PM and 5:30 PM. Surprisingly, the Downtown segment is not the bottleneck in the other direction: in the northbound direction, the bottleneck is the segment from 32 Avenue to 64 Avenue, in the time period between 5:00 PM and 5:30 PM.
The above information could be useful for those that manage the highway, researchers who want to calibrate and validate traffic simulators for the highway, and those that are interested in predicting commute times based on historical trends.
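The tweet-parsing step mentioned among the findings above can be sketched as follows. This is only a minimal illustration under stated assumptions: the tweets are invented, the list of cross streets is a hypothetical lookup table, and simple phrase matching stands in for whatever parsing rules the system actually applies.

```python
import re
from collections import Counter

# Hypothetical lookup table of cross streets along the route (assumption).
CROSS_STREETS = ["16 Avenue", "17 Avenue", "32 Avenue", "64 Avenue"]


def locate(tweet):
    """Return the first known cross street mentioned in the tweet, or None."""
    text = tweet.lower()
    for street in CROSS_STREETS:
        # Word boundaries prevent "17 Avenue" from matching "117 Avenue".
        if re.search(r"\b" + re.escape(street.lower()) + r"\b", text):
            return street
    return None


# Invented example tweets, for illustration only.
tweets = [
    "Collision on SB Deerfoot Trail at 17 Avenue SE, expect delays",
    "Accident NB Deerfoot Trail near 32 Avenue NE",
    "Two-vehicle crash on Deerfoot Trail at 17 Avenue",
    "Traffic is moving well on Deerfoot Trail",
]

# Counting located tweets over time points to candidate hotspots.
hotspots = Counter(s for s in map(locate, tweets) if s is not None)
print(hotspots.most_common())
```

Tweets that mention no known cross street are simply left unassigned in this sketch.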
To develop the predictive model, the 202 days of historical data are divided into two sets of 153 days and 49 days. Clustering is performed on the first set of 153 days in order to recognize trends in the commute time on Deerfoot Trail, and this set is used as the training dataset. The other 49 days of data are not clustered and are reserved as the prediction dataset, also known as the testing dataset.
Clustering of the 153 days identified 6 unique traffic patterns of commute times for Deerfoot Trail. These distinct clusters are identified on the basis of time of the day, day of the week, weather conditions and traffic accidents, where each pattern shows different results for the combination of these factors. Each traffic pattern differs from the others in terms of the commute time to travel the same road at 19 time instants of the day. Each traffic pattern is assigned a centroid, which is the mean commute time of the days falling into that particular cluster at the 19 time instants. For example, days with no accidents and clear weather conditions have one unique pattern and are different from the days with a large number of accidents and heavy snowfall, which fall into another traffic pattern.
The other contribution is the development of a commute time prediction model which incorporates the impact of traffic accidents and weather conditions on commute time. Many commute time prediction models have already been proposed and applied in traffic systems [5, 6], and they perform well in case studies. However, incorporating weather information into the prediction model has only recently been studied, and very few researchers have included weather information as part of their prediction models. Most of these studies either use regression models that include weather information as an explanatory variable or are simulation-based approaches [14, 15].
I developed a model that can predict the commute times for a given route given the following inputs: a) day of the week; b) time of the day; and c) weather condition, i.e., bad (snowing or snow on the ground) or good (zero snowfall). The model uses the six clusters identified previously. It first assigns the day for which prediction has to be offered to one of these 6 clusters based on a Naïve-Bayes machine learning classification technique [16]. It then uses the time of the day and the historical accident information for that instant to estimate commute time.
Specifically, the best case commute time to travel the entire stretch of Deerfoot Trail in Calgary is about 35 minutes. Based on historical analysis, each accident inflates commute time by about
7 minutes. Hence, one can estimate the commute time for any given time of the day from the historical probability distribution function of accidents at that instant. A probability distribution function for a random variable provides for each possible value of that random variable the probability of observing that value.
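One way to read the estimate described above is as an expected value: the best-case commute time plus the per-accident delay multiplied by the expected number of accidents at that instant. The Python sketch below illustrates this reading only. The 35-minute base and 7-minute delay come from the text, while the accident distribution is a hypothetical example and the expected-value formulation is an assumption rather than the exact procedure of the thesis.

```python
BASE_MINUTES = 35.0          # best-case commute time stated in the text
MINUTES_PER_ACCIDENT = 7.0   # approximate delay per accident stated in the text


def expected_commute(accident_pdf):
    """Estimate commute time from a distribution {number of accidents: probability}."""
    total = sum(accident_pdf.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Expected number of accidents at this time instant.
    expected_accidents = sum(k * p for k, p in accident_pdf.items())
    return BASE_MINUTES + MINUTES_PER_ACCIDENT * expected_accidents


# Hypothetical distribution of accidents at one time instant (assumption).
pdf_5pm = {0: 0.5, 1: 0.3, 2: 0.2}
estimate = expected_commute(pdf_5pm)
print(round(estimate, 1))  # about 39.9 minutes: 35 + 7 * 0.7
```

With this hypothetical distribution the expected number of accidents is 0.7, so the estimate is roughly 5 minutes above the best case.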
Results show that for 76.8% of days in the prediction dataset, the difference between actual and predicted commute times is less than 5 minutes. The model was particularly effective in predicting commute times for workdays under both good and bad weather conditions.
Furthermore, the model’s predictions are significantly more accurate than a technique that only considers the average commute times at the desired time instant in the assigned cluster.
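The accuracy figure quoted above counts cases where the predicted commute time falls within 5 minutes of the actual one. A minimal sketch of such a measure is shown below; the function name and the sample values are hypothetical, and how the thesis aggregates time instants into days is not assumed here.

```python
def accuracy_within(actual, predicted, tolerance_min=5.0):
    """Percentage of predictions whose absolute error is below the tolerance."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) < tolerance_min)
    return 100.0 * hits / len(actual)


# Hypothetical actual and predicted commute times in minutes.
actual = [36, 48, 62, 41]
predicted = [35, 42, 60, 44]
rate = accuracy_within(actual, predicted)
print(rate)  # 75.0: errors are 1, 6, 2 and 3 minutes
```

The same function can score a baseline, such as the cluster average at the same time instant, so that the two approaches are compared on equal terms.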
1.4 Thesis Organization
Chapter 1 provides the introduction and explains the importance of traffic analyses and commute time prediction. Chapter 2 includes a comprehensive review of the literature on vehicle traffic analysis and its application in existing systems, the impact of traffic accidents and weather on the traffic stream, and related prediction models. Chapter 3 introduces the statement of the problem and the methodology for collecting data from various Web sources, which is tested on a case study in this research. Chapter 4 discusses the results of the historical vehicle traffic analysis conducted on a case study of a major highway in the city of Calgary. Chapter 5 describes the clustering method used to identify the unique commute time patterns from the traffic data collected over a period of six months. Chapter 6 proposes a commute time prediction model based on the historical traffic data analysis; this prediction model also incorporates the impact of traffic accidents and weather conditions on commute time. Finally, Chapter 7 presents the conclusions and directions for future research.
Chapter Two: LITERATURE REVIEW
This chapter presents a detailed description of scholarly work done by other researchers in the field related to this dissertation, along with background on the techniques used for analyzing historical traffic and for predicting commute time. The study conducted in this research concentrates on two parts, namely historical vehicle traffic analysis and a commute time prediction model. Vehicle traffic analysis and commute time prediction are two major components of ITS. Therefore, most studies focus on either historical traffic analysis or traffic prediction models; a few studies address both topics. This chapter discusses applications developed so far related to these two topics.
Section 2.1 is the background section which explains the techniques used for historical vehicle
traffic analysis and commute time prediction in this study. Section 2.2 describes the related work
done in the field of historical traffic analysis and commute time prediction models. Section 2.3
summarizes and compares the current study with the previous studies.
2.1 Background
This research has used data mining techniques in order to infer insights from the Web data
collected from heterogeneous Web sources. Two data mining techniques used in this study are
clustering and classification. A data mining tool, WEKA, is used to apply the techniques on the
data [17].
WEKA (Waikato Environment for Knowledge Analysis) is a collection of state-of-the-art
machine learning algorithms for data mining tasks [17]. The algorithms can be directly applied to
the dataset [17]. WEKA contains tools for data pre-processing, classification, regression,
clustering, association rules, and visualization [18]. The WEKA tool is used in this research for performing clustering and classification.
Clustering is the method of partitioning a set of data points into groups called clusters, where the data points in the same cluster are as similar as possible and dissimilar from the data points in other clusters [19]. There are many clustering techniques available to cluster the data into groups.
In this research, the k-means clustering technique [20] is used to group the traffic data into clusters, each having a unique commute time pattern. K-means clustering partitions n observations (data points) into k clusters, where each data point is assigned to the cluster with the nearest centroid [20]. A cluster centroid is the point whose coordinates equal the mean of the values of all the data points in the cluster [20]. In its simplest form, the k-means method follows these steps [21].
Step 1. Specify the number of clusters and, arbitrarily, the members of each cluster.
Step 2. Calculate each cluster's centroid, and the distances between each observation and
centroid. If an observation is nearer the centroid of a cluster other than the one to
which it currently belongs, re-assign it to the nearer cluster.
Step 3. Repeat Step 2 until all observations are nearest the centroid of the cluster to which
they belong.
Step 4. To study sensitivity to number of clusters, repeat Steps 1 to 3 with a different
number of clusters and evaluate the results.
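Steps 1 to 3 above can be sketched in a few lines of Python. This is a minimal illustration, not the WEKA implementation used in this research: the function names are hypothetical, the initial membership is assigned round-robin as one arbitrary choice, all observations are reassigned in a batch on each pass, and the sample profiles use three time instants with invented values rather than the 19 instants of the study.

```python
def mean_point(points):
    """Centroid: coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Step 1: fix k and assign initial members arbitrarily (round-robin here).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2a: compute each cluster's centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = mean_point(members)
        # Step 2b: move every observation to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Step 3: stop once no observation changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical daily commute-time profiles in minutes at three time instants.
days = [(35, 36, 35), (36, 35, 37), (60, 75, 90),
        (62, 70, 95), (34, 37, 36), (58, 72, 88)]
labels, centroids = k_means(days, k=2)
print(labels)
print(centroids)
```

Step 4 corresponds to calling `k_means` again with a different `k` and comparing the resulting partitions.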
The number of clusters, k, can be verified by calculating the inter-cluster variance and intra-cluster variance. The approach presented by Menasce et al. to determine the appropriate number of clusters for a given dataset [22] is used here. This approach uses βvar (equation 2.3), the ratio of the intra-cluster variance (equation 2.1) to the inter-cluster variance (equation 2.2), to decide on the pertinent k-value, i.e., the number of clusters.