COPYRIGHT AND CITATION CONSIDERATIONS FOR THIS THESIS/ DISSERTATION

o Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

o NonCommercial — You may not use the material for commercial purposes.

o ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

How to cite this thesis

Surname, Initial(s). (2012) Title of the thesis or dissertation. PhD. (Chemistry)/ M.Sc. (Physics)/ M.A. (Philosophy)/ M.Com. (Finance) etc. [Unpublished]: University of Johannesburg. Retrieved from: https://ujcontent.uj.ac.za/vital/access/manager/Index?site_name=Research%20Output (Accessed: Date).

The design of a vehicular traffic flow prediction model for Gauteng freeways using ensemble learning

by

TEBOGO EMMA MAKABA

A dissertation submitted in fulfilment of the requirements for the Degree

of

Magister Commercii

in

Information Technology Management

Faculty of Management

UNIVERSITY OF JOHANNESBURG

Supervisor: Dr B.N. Gatsheni

2016

DECLARATION

I certify that the dissertation submitted by me for the degree Master of Commerce (Information Technology Management) at the University of Johannesburg is my independent work and has not been submitted by me for a degree at another university.

TEBOGO EMMA MAKABA


ACKNOWLEDGEMENTS

I hereby wish to express my gratitude to the following individuals and organisations, who enabled this document to be successfully and timeously completed:

• Firstly, GOD
• My supervisor, Dr B.N. Gatsheni
• Mikros Traffic Monitoring (Pty) Ltd (MTM), for providing me with the traffic flow data
• My family and friends
• Prof M Pillay and Ms N Eland
• The Faculty of Management, for funding me with the NRF supervisor-linked bursary


DEDICATION

This dissertation is dedicated to everyone who supported me during the project, from the start till the end.


ABSTRACT

Traffic congestion is a major problem in the cities of Gauteng Province (GP). There is high vehicle traffic congestion, especially on the Ben Schoeman freeway from Johannesburg to Pretoria, during the peak travelling times of 06:00 to 09:00 and 15:00 to 18:00. The increasing number of vehicles on the freeways of GP often leads to accidents, which in turn worsen traffic congestion. Traffic congestion impacts negatively on commuters and the businesses around Gauteng Province. Several intervention programmes were implemented by the Gauteng authorities to minimize the rapid increase of traffic volume, but these have not solved the traffic congestion problem. The aim of this study was to construct a vehicular traffic prediction model using ensemble learning methods and machine learning algorithms. Vehicle traffic flow data was obtained from Mikros Traffic Monitoring (MTM), a company contracted by the Gauteng Department of Transport (DoT) to collect vehicle traffic data. The vehicle traffic flow data for the freeway that links Johannesburg with Pretoria (i.e. the M1 North extending to the N1 North) was used in this study. The ensemble learning methods used to construct the vehicle traffic prediction models were Bagging, Boosting, Stacking and Random Forest, together with machine learning algorithms that include Decision Trees, Support Vector Machine and Multi-Layer Perceptron. A cross-validation (CV) method was used for evaluating the models. The best prediction model was selected by computing the prediction cost using a combination of a loss matrix and a confusion matrix. The results showed that the models constructed using the Random Forest ensemble method achieved the best prediction for traffic congestion, at 99.991%. Commuters wishing to travel on the Ben Schoeman freeway can predict traffic flow by using an App. The App allows commuters to enter variables such as day of week, travel time and traffic volume, from which it predicts travel conditions in terms of target concepts such as Freeflow, FlowingCongestion and Congested. Commuters will only be able to predict traffic flow, since they will not have full knowledge of the actual vehicle traffic volume on the freeway; in that case they will depend on media (e.g. radio) traffic reports. The implications of the results include improved competitiveness of Gauteng Province as an investment destination. This model can inform commuters of traffic flow patterns ahead of time, enabling them to make appropriate travel arrangements.


Table of Contents

Declaration
Acknowledgements
Dedication
Abstract
LIST OF TABLES
LIST OF FIGURES
DEFINITION OF TERMS
CHAPTER 1: INTRODUCTION
1.1 Introduction
1.2 Research Problem
1.3 Problem Statement
1.4 Aim of the study
1.5 Objectives
1.5.1 Sub-Objectives
1.6 Research Methodology
1.7 Dissertation Contribution
1.8 Structure of the Dissertation
CHAPTER 2: LITERATURE REVIEW
2.1 Introduction
2.2 Related Research
2.3 Theoretical Framework of the study
2.4 Chapter Conclusion
CHAPTER 3: METHODS
3.1 Introduction
3.2 Research Methodology
3.2.1 Research Approach
3.2.2 Research Design
3.2.3 Research Settings
3.2.4 Research Methods
3.2.5 Dataset Collection
3.2.6 The Data Collection Procedure
3.2.7 Validations
3.2.8 Limitations of the study
3.3 Attribute Selection
3.4 Machine Learning Algorithms Used for the Study
3.4.1 Decision Trees C4.5 (J48)
3.4.2 Multi-Layer Perceptron (MLP)
3.4.3 Support Vector Machine (SVM)
3.5 Ensemble Learning Methods Used for the Study
3.5.1 Ensemble Methods
3.5.2 Data Fusion
3.5.3 Bagging ensemble method
3.5.4 Boosting ensemble method
3.5.5 Stacking ensemble method
3.5.6 Random Forest (RF) ensemble method
3.6 Search Methods
3.6.1 Best-First Search (BFS)
3.6.2 Ranker Search (RS)
3.6.3 Greedy-Best Search (GBS)
3.7 Cross-Validation (CV)
3.8 Supervised Learning
3.9 Discussion of the selected algorithms of the study
3.10 Chapter Conclusion
CHAPTER 4: EXPERIMENTS AND RESULTS
4.1 Introduction
4.2 Data pre-processing
4.2.1 Conversion of numeric vehicle traffic flow data to nominal values
4.3 Experiments and Results
4.3.1 Experiments for Attribute Selection
4.3.1.1 Experiment 1: A model constructed using 2 attributes per model
4.3.1.2 Experiment 2: A model constructed using 3 attributes per model
4.3.2 Experiment 3: Training and testing of machine learning algorithms
4.3.2.1 Root mean square error (RMSE)
4.3.3 The ensemble learning method experiments
4.3.3.1 Experiment 4: A model constructed using the Bagging ensemble learning method
4.3.3.2 Experiment 5: A model constructed using the Boosting ensemble learning method
4.3.3.3 Experiment 6: A model constructed using the Stacking ensemble learning method
4.3.3.4 Experiment 7: A model constructed using the Random Forest ensemble learning method
4.3.4 Summary of the prediction performance and the RMSE results
4.4 Data post-processing
4.4.1 Cost of the prediction
4.4.1.1 Computation of the cost prediction
4.5 Chapter Conclusion
CHAPTER 5: DISCUSSION, CONCLUSION AND RECOMMENDATIONS
5.1 Discussion
5.2 Conclusion and Recommendations
REFERENCES
ANNEXURE 3A
ANNEXURE 3B
ANNEXURE 4A
ANNEXURE 4B
ANNEXURE 4C
ANNEXURE 4D


LIST OF TABLES

Table 2.1: Summary of attributes used in this study compared to other studies
Table 3.1: The distribution of instances over three years for the testing and training data and their target concept
Table 4.1: A sample of 3 instances before data pre-processing
Table 4.2: A sample of 3 instances after data pre-processing
Table 4.3: The results after creating different models using combinations of 2 attributes from the dataset
Table 4.4: The confusion matrix of the model constructed from the TV and AS attributes
Table 4.5: The results after creating models using different combinations of 3 attributes from the dataset
Table 4.6: The confusion matrix for a model constructed from the TT, TV and AS attributes
Table 4.7: The results for all the 2013, 2014 and 2015 vehicle traffic flow datasets, where the C4.5 and MLP algorithms were used
Table 4.8: The results for all the 2013, 2014 and 2015 vehicle traffic flow datasets, where the SVM (LibSVM) algorithm was used to train and test the dataset
Table 4.9: The confusion matrix for training data using the C4.5 algorithm
Table 4.10: The results of the Bagging method constructed using C4.5, SVM and MLP
Table 4.11: The confusion matrix for the C4.5 Bagging ensemble learning method
Table 4.12: Results obtained during the construction of models using the Boosting ensemble learning method with the C4.5 algorithm
Table 4.13: The confusion matrix for the Boosting ensemble learning method using the C4.5 algorithm
Table 4.14: Results obtained when constructing a model using Stacking with C4.5, SVM and MLP
Table 4.15: The confusion matrix for the Stacking ensemble learning method combining the C4.5, SVM and MLP algorithms, where C4.5 was used as the meta-classifier
Table 4.16: Results obtained when constructing models using the Random Forest ensemble learning method with numTrees set to 1, 3 and 5
Table 4.17: The confusion matrix for the Random Forest ensemble model when numTrees was set to 1
Table 4.18: The loss matrix for the vehicle traffic prediction model, where a = Freeflow, b = FlowingCongestion and c = Congested
Table 4.19: The confusion matrix for C4.5, used for computing the cost of vehicle traffic congestion prediction
Table 4.20: Summary of prediction performance, RMSE and the total cost for attribute selection, where combinations of two and three attributes per model were used
Table 4.21: Summary of prediction performance, RMSE and the total cost for the C4.5, MLP and SVM algorithms
Table 4.22: Summary of prediction performance, RMSE and the total cost for the Bagging, Boosting, Stacking and Random Forest ensemble learning methods for the vehicle traffic flow data


LIST OF FIGURES

Figure 3.1: Flow chart diagram for the vehicle traffic prediction model
Figure 3.2: A sample report of traffic flow graphs generated from the database server
Figure 3.3: Feature selection process diagram
Figure 3.4: Feature subset evaluation procedure
Figure 3.5: Illustration of a Decision Tree (C4.5)
Figure 3.6: A simple three-layer network consisting of hidden, input and output layers
Figure 3.7: Linear separating hyperplanes for the separable case; support vectors are circled
Figure 3.8: Diagram of the Stacking model
Figure 3.9: Diagram of the Best-First search method, where the numbers are nodes and the links between nodes are branches
Figure 4.1: Screenshot of the nominal traffic data after it has been loaded into WEKA
Figure 4.2: The results of the search method when selecting a combination of the 2 best attributes
Figure 4.3: The 2 attributes selected as the best out of all the attributes
Figure 4.4: Message displayed when trying to use “InfoGainAttributeEval” to select the best-performing attributes
Figure 4.5: Result set for the automation of feature selection using 3 attributes per model to train the dataset
Figure 4.6: Result set from training the model using C4.5, showing the classification performance, confusion matrix and detailed accuracy by class
Figure 4.7: The process for designing the Bagging (“meta”) classifier using the C4.5 algorithm, and the result buffer
Figure 4.8: Results of the Boosting ensemble learning method designed using the C4.5 algorithm; the result buffers are saved to be analysed, and the result outputs for SVM and MLP are shown in Annexure 4D
Figure 4.9: Screen for adding more than one algorithm to the Stacking ensemble prediction model
Figure 4.10: Display of the algorithms added to perform the Stacking ensemble learning method
Figure 4.11: After adding the C4.5 algorithm as the “metaClassifier” to construct the Stacking ensemble prediction model
Figure 4.12: Parameters set when designing the Random Forest models: numTrees set to 1 and numFeatures to 5
Figure 4.13: The result buffer where numTrees was set to 1 and numFeatures to 5
Figure 4.14: The average prediction performance of combinations of 2 attributes per model, using the Decision Trees algorithm. The attributes are represented by the numbers on the x-axis, where 1 = (TV and AS), 2 = (DOW and TT), 3 = (TT and AS), 4 = (DOW and TV), 5 = (DOW and AS) and 6 = (TT and TV)
Figure 4.15: The average RMSE for the selected attributes with the C4.5 algorithm, represented by the numbers on the graph, where 1 = (TV and AS), 2 = (DOW and TT), 3 = (TT and AS), 4 = (DOW and TV), 5 = (DOW and AS) and 6 = (TT and TV)
Figure 4.16: The average prediction performance obtained when combinations of 3 attributes per model were used with the C4.5 algorithm. The numbers on the x-axis represent the selected attributes, where 1 = (TT, TV and AS), 2 = (DOW, TV and AS), 3 = (DOW, TT and AS) and 4 = (DOW, TT and TV)
Figure 4.17: The average RMSE for the selected attributes, represented by the numbers on the x-axis, where 1 = (TT, TV and AS), 2 = (DOW, TV and AS), 3 = (DOW, TT and AS) and 4 = (DOW, TT and TV)
Figure 4.18: The average prediction performance of C4.5, MLP and SVM for the vehicle traffic flow data. The numbers on the x-axis have no significance; they simply index the algorithms used, where 1 = C4.5, 2 = MLP and 3 to 7 = SVM. All of the attributes were used during the design of the models
Figure 4.19: The average RMSE for the C4.5, MLP and SVM algorithms. The numbers on the x-axis have no significance; they simply index the algorithms used. All the attributes were used to construct these models
Figure 4.20: The average prediction performance of the ensemble learning methods (Bagging, Boosting, Random Forest and Stacking) using the three-year traffic data. The numbers on the x-axis represent the methods, where 1 = Bagging (C4.5), 2 = Bagging (MLP), 3 = Bagging (SVM), 4 = Boosting (C4.5), 5 = Boosting (MLP), 6 = Boosting (SVM), 7 = Stacking (C4.5), 8 = Stacking (MLP), 9 = Stacking (SVM), 10 = RF (1), 11 = RF (3) and 12 = RF (5)
Figure 4.21: The average RMSE for the ensemble learning models, namely Bagging, Boosting, Random Forest and Stacking, represented by the numbers on the x-axis, where (1 = C4.5, 2 = MLP and 3 = SVM) is Bagging, (4 = C4.5, 5 = MLP and 6 = SVM) is Boosting, (7 = C4.5, 8 = MLP and 9 = SVM) is Stacking and (10, 11 and 12) is Random Forest
Figure 4.22: The process flow chart used to compute the prediction cost when constructing the vehicle traffic prediction models
Figure 4.23: The prediction cost where a combination of 2 attributes per model was used. The numbers on the x-axis represent the attributes, where 1 = (TV and AS), 2 = (DOW and TT), 3 = (TT and AS), 4 = (DOW and TV), 5 = (DOW and AS) and 6 = (TT and TV)
Figure 4.24: The prediction cost for the combinations of 3 selected attributes, where the numbers on the x-axis represent the groups, namely 1 = (TT, TV and AS), 2 = (DOW, TV and AS), 3 = (DOW, TT and AS) and 4 = (DOW, TT and TV), and the C4.5 algorithm was used
Figure 4.25: The prediction cost for the models constructed using the machine learning algorithms C4.5, MLP and SVM. The numbers on the x-axis represent the models, where 1 = C4.5, 2 = MLP and 3 to 7 = SVM
Figure 4.26: The prediction cost of the ensemble learning methods Bagging, Boosting, Stacking and Random Forest. The numbers on the x-axis represent the methods, where (1 = C4.5, 2 = MLP and 3 = SVM) is Bagging, (4 = C4.5, 5 = MLP and 6 = SVM) is Boosting, (7 = C4.5, 8 = MLP and 9 = SVM) is Stacking and (10 to 12) is the Random Forest method. There are 3 columns each for Bagging, Boosting and Stacking, since C4.5, MLP and SVM were used, and 3 columns for Random Forest, where numTrees is 1, 3 and 5


DEFINITION OF TERMS

1. Algorithm: a procedure or formula for solving a problem.
2. Artificial intelligence (AI): the intelligence exhibited by machines; also the name of the academic field that studies how to create computers and computer software capable of intelligent behaviour.
3. Attribute: a specification that defines a property of an object, element or file; it may also refer to or set the specific value for a given instance of such.
4. Classification: categorization; the process in which ideas and objects are recognized, differentiated and understood.
5. Cross-Validation (CV): a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.
6. Decision Trees (C4.5): a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs and utility; it is one way to display an algorithm.
7. Ensemble method: uses multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms.
8. Machine learning: a type of artificial intelligence that provides computers with the ability to learn without being explicitly programmed.
9. Multi-Layer Perceptron (MLP): a feed-forward artificial neural network model that maps sets of input data onto a set of appropriate outputs.
10. Prediction: a statement about the way things will happen in the future, often but not always based on experiments.
11. Speed: distance travelled per unit time.
12. Support Vector Machine (SVM): performs classification by finding the hyperplane that maximizes the margin between two classes; the vectors that define the hyperplane are the support vectors.
13. Traffic congestion: a condition on road networks that occurs as use increases, characterized by slower speeds, longer trip times and increased queuing.
14. WEKA (Waikato Environment for Knowledge Analysis): a collection of machine learning algorithms for data mining tasks, written in Java.


CHAPTER 1: INTRODUCTION

1.1 INTRODUCTION

Vehicle traffic congestion is an increasing problem affecting the quality of life of millions of people globally. Many countries face extreme traffic due to rapid economic growth that is not accompanied by an increase in planning and infrastructure spending (Loubser and Bester, 2009). The Gauteng Province (GP) of South Africa is currently experiencing serious traffic congestion, especially during the peak hours of 06:00 to 09:00 and 15:00 to 18:00. The growth of traffic volumes in Gauteng over the past decade has outstripped investments in road infrastructure, and the result is traffic gridlock. In addition, the increased number of accidents on the freeways tends to bring traffic to a standstill whilst the accident scenes are being cleared. At certain times, the traffic queues between the cities of Johannesburg (JHB) and Pretoria (PTA), which are connected by the Ben Schoeman freeway, can stretch for 1.5 km to 2.0 km. “Traffic congestion results in increased travel times due to the reduction in speed, excessive fuel consumption per kilometre and vehicle wear and tear, the rescheduling of trips and in the longer run, the costs of relocating residences and jobs” (Lindsey and Verhoef, 2000). The increased costs of travelling, including that of petrol or diesel, are estimated to be in the millions of South African (SA) rands per day. Gauteng Province is the smallest of the nine provinces in SA, yet it contributes 33.9% of the Gross Domestic Product (GDP) of South Africa (Economy of GP, 2014). Studies have shown that traffic congestion has a negative effect on the Gauteng economy and standard of living (Economy of GP, 2014), with a highly negative impact on commuters’ quality of life and on businesses. Traffic congestion is also an increasing problem in the other large metropolises in South Africa, such as eThekwini in KwaZulu-Natal and in the Cape Province. The focus of this study is on Gauteng’s traffic volume, especially on the Ben Schoeman freeway.

The Gauteng Department of Transport (DoT) has implemented some interventions in order to improve the state of vehicle traffic flow. These intervention programmes include the Gauteng Freeway Improvement Project (GFIP), the South African National Roads Agency (SANRAL) e-toll system, bus rapid transit systems that include the Rea Vaya, A Re Yeng and other buses, the Gautrain, the Tshwane express, the Metrorail and cycle tracks. The goal of the GFIP, introduced in 2005, was to upgrade and expand the province’s freeway network with the aim of reducing traffic congestion, stimulating economic growth and improving commuters’ quality of life. The GP authorities also implemented e-tolls, a system that bills any freeway user in the province, to try to reduce traffic volume on the freeways. Public transport systems including the Gautrain, Rea Vaya and A Re Yeng were introduced to encourage commuters to use public transport instead of their own private vehicles and thereby reduce traffic volumes on the road. The Gautrain covers the distance of 58 km between Johannesburg and Pretoria in 30 minutes, compared to a road trip that may take 50 minutes or more. The Bus Rapid Transit (BRT) systems, called Rea Vaya in Johannesburg and A Re Yeng in Pretoria, were introduced to assist with mitigating traffic congestion in these cities. These public transport interventions allow the public to travel quickly around the cities at a lower cost than the use of private vehicles.

Several traffic congestion models have been designed by other researchers to address traffic congestion issues, including the work of Purusothaman and Parasuraman (2013), Thianniwet and Phosaard (2009) and Deng et al. (2009). These models are explained in detail in Chapter 2. This study designed a traffic congestion prediction model for a Gauteng freeway.

Data collected by Mikros Traffic Monitoring (MTM) was used for constructing the vehicle traffic prediction model. MTM uses data loggers (devices used to collect data from the freeways) and cameras on the road to monitor traffic volumes on the GP freeways. MTM specializes in monitoring traffic flow and does not predict traffic conditions (MTM, 1998). The data loggers are fitted with a detection algorithm to count the number of vehicles driving through on different lanes for a specific period. The traffic information is sent to the MTM database server, informing the operators about the number and type of vehicles passing by at a certain time. MTM and their partners use the data for statistical purposes and for monitoring traffic. However, the data is not used for predicting the flow of traffic, which may help with solving traffic congestion. The purpose of this study was to construct a vehicular traffic flow prediction model for the Ben Schoeman freeway, which links Johannesburg with Pretoria. The study was conducted using machine learning algorithms and ensemble learning methods. Machine learning algorithms are used to learn the data patterns and predict traffic flows. Ensemble methods use more than one algorithm to predict and to improve the performance of the constructed model.
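
As a minimal illustration of this approach (a sketch only, not the dissertation's actual code), the fragment below shows how such a traffic dataset could be loaded and a single base learner trained using the WEKA 3.x Java API; the file name traffic.arff and the position of the class attribute are assumptions.

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrafficModelSketch {
    public static void main(String[] args) throws Exception {
        // Load the pre-processed traffic data; "traffic.arff" is a hypothetical
        // file with attributes such as DayOfWeek, TravelTime, TrafficVolume and
        // AverageSpeed, plus a nominal target concept
        // (Freeflow, FlowingCongestion, Congested).
        Instances data = new DataSource("traffic.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // target concept is assumed last

        // Train a single base learner: C4.5 as implemented by WEKA's J48.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree); // prints the learned decision tree
    }
}
```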

Using variables that include travel time, traffic volume, average vehicle speed and day of the week, this study found that the Random Forest (RF) ensemble prediction model provided the best prediction of vehicle traffic flow for the Ben Schoeman freeway. The constructed RF model can be imported into an App that commuters can use to enter variables such as traffic volume, travel time and day of week in order to predict the state of traffic. Commuters can install the App on their smartphones. The model was constructed to improve the way of life of commuters and the businesses that use the freeway between Johannesburg and Pretoria. Part of this study, using the 2013 data, has been published with WASET (World Academy of Science, Engineering and Technology) as a journal paper by Makaba and Gatsheni (2016); the journal paper can be found at the WASET website: https://www.waset.org/.

1.2 RESEARCH PROBLEM

Traffic congestion is a major problem in the greater Gauteng Province (GP). The increasing number of vehicles on the GP freeways often leads to accidents, which in turn worsen traffic flow. According to the Gautrain Management Agency (GMA, 2014), traffic volume between Johannesburg and Pretoria is increasing by 7% each year. At this rate of traffic volume increase, the average travel speed is anticipated to drop from about 90 km/h to 50 km/h. A trip from Johannesburg to Pretoria, a distance of 58 km, can take 50 minutes or more by car when traffic is free-flowing; however, when there is traffic congestion the same trip can take almost 2 hours. Traffic congestion normally occurs during the peak hours, between 06:00 and 09:00 in the morning and between 15:00 and 18:00 in the afternoon. Intervention programmes implemented by the Gauteng Department of Transport (DoT) to minimise the rapid increase of traffic volume have not been successful.

This research will potentially help reduce traffic congestion, reduce journey times and improve environmental conditions through the reduction of exhaust gases and the provision of important and reliable information to drivers.

1.3 PROBLEM STATEMENT

Vehicle traffic congestion in Gauteng Province negatively impacts the economic growth of the province and the quality of life of commuters. Intervention programmes by the Gauteng DoT to mitigate the problem have not been very successful. Vehicle traffic still negatively impacts GP by increasing travel time for commuters and causing delays for the businesses around the province. The vehicle traffic congestion is mainly experienced during the morning and afternoon peak hours on weekdays on the Ben Schoeman freeway that links Johannesburg with Pretoria. A vehicular traffic prediction model is needed to improve the state of traffic on the freeway and to reduce travel time.

1.4 AIM OF THE STUDY

The aim of this study is to construct a vehicular traffic prediction model using ensemble learning methods and machine learning algorithms. The prediction model will be constructed for the Ben Schoeman freeway that links Johannesburg with Pretoria, to improve vehicle traffic flow during peak hours. The constructed model can be imported into an App which can be installed on a smartphone. This App will allow the road user to enter variables such as day of week, travel time and traffic volume. The prediction can then be made using target concepts such as Congested, Freeflow and FlowingCongestion. The results will allow the traveller to know the traffic condition and make a decision regarding travelling on that particular road.

1.5 OBJECTIVES

The objective of the study is as follows:

To construct a vehicular traffic prediction model using ensemble learning methods and machine learning algorithms for the freeway that links Johannesburg with Pretoria.

1.5.1 Sub-Objectives

1. To retrieve the traffic flow data for the Ben Schoeman freeway collected by MTM.
2. To identify the variables that influence vehicular traffic flow on this route.
3. To use the identified variables to construct a vehicular traffic prediction model.
4. To train and test the collected data to construct the vehicular traffic prediction model.

1.6 RESEARCH METHODOLOGY

A quantitative research method was used in this study. A set of questions, shown in Annexure 3A, was designed to interview the MTM analyst during a site visit to collect vehicle traffic data. The data was transformed into a Microsoft Excel spreadsheet to meet the standard required for loading it into the WEKA (Waikato Environment for Knowledge Analysis) software (Hall et al. 2009). Different types of machine learning algorithms and ensemble learning methods were used to construct the vehicle traffic prediction models. Ensemble learning is the process in which multiple algorithms, such as classifiers, are strategically generated and combined to solve a particular problem (Polikar, 2009). The cross-validation method, the root mean square error (RMSE) and the cost of prediction were used to evaluate the vehicle traffic prediction models. These methods are discussed in Chapter 3.
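
To make the evaluation pipeline concrete, the sketch below runs a 10-fold cross-validation and reports the accuracy, RMSE and confusion-matrix outputs used throughout this study, via the WEKA Java API. This is a minimal sketch, not the study's actual procedure (the experiments were run through WEKA's graphical interface, as the screenshots in Chapter 4 show); the file name traffic.arff and the random seed are assumptions.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        // Load the nominal traffic data; "traffic.arff" is a hypothetical file name.
        Instances data = new DataSource("traffic.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // target concept as the class

        // 10-fold cross-validation of a C4.5 (J48) model.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println("Prediction performance (%): " + eval.pctCorrect());
        System.out.println("RMSE: " + eval.rootMeanSquaredError());
        System.out.println(eval.toMatrixString("Confusion matrix:"));
    }
}
```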


1.7 DISSERTATION CONTRIBUTION

This study uses variables, machine learning algorithms and ensemble learning methods to construct the best vehicle traffic flow prediction models for the Ben Schoeman freeway. Attribute selection, the selected machine learning algorithms (Decision Trees (C4.5), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM)) and the ensemble learning methods (Bagging, Boosting, Stacking and Random Forest (RF)) are used to construct the vehicle traffic prediction models. This study would enable Ben Schoeman freeway users to make appropriate travel decisions to avoid traffic congestion.
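
The sketch below shows, under the same assumptions as the earlier sketch, how these four ensemble methods can be assembled around the named base learners in WEKA. It is illustrative only: SMO (WEKA's built-in SVM) stands in for the LibSVM wrapper the study used, and setNumTrees reflects the WEKA 3.6/3.7 API of the time (newer releases use setNumIterations).

```java
import weka.classifiers.Classifier;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.Bagging;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;

public class EnsembleSketch {
    public static Classifier[] buildEnsembles() {
        // Bagging with C4.5 (J48) as the base learner.
        Bagging bagging = new Bagging();
        bagging.setClassifier(new J48());

        // Boosting (AdaBoostM1) with C4.5 as the base learner.
        AdaBoostM1 boosting = new AdaBoostM1();
        boosting.setClassifier(new J48());

        // Stacking of C4.5, an SVM and an MLP, with C4.5 as the meta-classifier,
        // mirroring the combination described for this study.
        Stacking stacking = new Stacking();
        stacking.setClassifiers(new Classifier[] {
                new J48(), new SMO(), new MultilayerPerceptron() });
        stacking.setMetaClassifier(new J48());

        // Random Forest; the study explored numTrees values of 1, 3 and 5.
        RandomForest forest = new RandomForest();
        forest.setNumTrees(1); // WEKA 3.6/3.7 accessor

        return new Classifier[] { bagging, boosting, stacking, forest };
    }
}
```

Each returned classifier can then be trained with buildClassifier(...) or assessed by cross-validation exactly as in the evaluation sketch above.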

1.8 STRUCTURE OF THE DISSERTATION

The dissertation is divided into five chapters:

Chapter one provides an introduction to the study, the research problem, the problem statement, the aim of the study, the objectives of the study, the research methodology, the dissertation contribution and the structure of the dissertation. Chapter two is a presentation of the literature review and the theoretical framework of the study. Chapter three outlines and discusses the research methodology of the study, the data collection process and the algorithms used. Chapter four presents the experiments and results of the study. Finally, chapter five presents the discussion of the results and a conclusion, and also includes recommendations.


CHAPTER 2: LITERATURE REVIEW

2.1 INTRODUCTION

Many methods and algorithms have been proposed by other researchers to solve the problem of traffic congestion on the freeways. This literature review focuses on machine learning algorithms and ensemble learning methods for predicting vehicle traffic flow.

2.2 RELATED RESEARCH

Wu et al. (2004) used the Support Vector Regression (SVR) method to predict travel time in Taipei, Taiwan. The data used for the study was collected by the Intelligent Transportation Web Service Project (ITWSP), a governmental research centre, and the data source was updated once every 3 minutes. The data was collected during rush hours (07:00 to 10:00) between February 5 and March 21, 2003, when there were no special holidays. The model was trained on 28 days of data and tested on 7 days, looking at three different distances (45 km, 178 km and 350 km). Parameters used included vehicle speed, traffic flow and occupancy, which determine travel time. “Occupancy is one of the traffic parameters which is known as the percent of time a detection zone of a detector is occupied by some vehicles” (Klein, 1993). The results showed that SVR can greatly reduce the RME (relative mean error) and RMSE (root mean squared error) for travel time prediction. The study proved that SVR is an applicable algorithm for predicting travel time and performs well for traffic data analysis. However, the study did not consider using the cross-validation method for evaluation, nor long-term data for designing the model; these could have improved the prediction performance of the model.

In 2007, Pattara-atikom et al. estimated road traffic congestion using Cell Dwell Time (CDT) from data collected using phones, including date, time and mobile country code. The traffic congestion was estimated using data collected in Bangkok, Thailand. The results proved that CDT can be used to estimate traffic congestion with an accuracy of up to 85% and a root mean square error of 0.44. The approach was weakened by the fact that it did not include the average speed attribute.

Bagging of radial basis function neural networks (RBFNN) for short-term traffic prediction in Beijing, China was proposed by Chen and Chen (2007). The study was conducted using data collected from 1 to 7 August 2003. The data was collected every 2 minutes (00:00 to 23:59), which led to 720 patterns every day and 5040 patterns in 7 days. The traffic patterns used were traffic flow, occupancy and speed. An ensemble approach such as Bagging demonstrated great potential in improving the capability of unstable procedures like RBFNN. However, the weakness of this approach is that the authors did not consider using both long-term and short-term traffic data.

A traffic model using the Adaboost and Random Forest (RF) algorithms was proposed by Leshem and Ritov (2007). Data collected by the Jerusalem Traffic Flow Management Control directly from the intersections was used to measure vehicle traffic. The parameters included day, time, volume and intersection. The results showed that the algorithm performed relatively well on real data. The work done by Leshem and Ritov (2007) lacked an evaluation method such as cross-validation.

An intelligent transportation system (ITS) to predict traffic congestion on various roads in Chennai, India was first reported by Padiath et al. (2009). Traffic congestion was predicted using real-time data collected using a videographic technique. The prediction tool was used to inform road users about the traffic and provide alternative routes that could be used to avoid traffic congestion. The results were found to be promising compared to the actual numbers obtained during data collection. The authors did not consider using evaluation tools such as cross-validation, which can be used to generalize results obtained from the dataset and validate the model.

Short-term traffic data was used to predict traffic flow using Kalman Filtering and a Radial Basis Function (RBF) in China by Gan and Canghui (2009). The data was collected at intervals of 10 to 15 minutes over three days on the Jing-Hu freeway. The results indicated continuous improvement in the model's prediction accuracy. This approach may require long-term data and peak-hour data to improve the prediction performance.

An Artificial Neural Network (ANN) prediction model based on a Learning Vector Quantization (LVQ) neural network was used by Shen and Chen (2009) to predict traffic congestion in the cities of China. Real-time data was collected using speed, volume and occupancy parameters, which were detected by vehicle detectors. The results of the study showed that road traffic can be predicted by inputting traffic flow data and that the model can be useful for traffic congestion prediction. The study was not specific about the data used (peak or off-peak) and did not mention any evaluation method.

Chen et al. (2009) used Neural Networks (NN) to predict travel time in Shanghai, China. The data was collected from road users via their mobiles. The dataset for the NNs contained 40 days of traffic conditions and was used for both training and testing of the model: 20 days of data were used in the training set and the other 20 days in the testing set. The Root Mean Square Error (RMSE) of the data used was about 10%. The MATLAB toolbox was used since it contains the NN toolbox. The study depended on data that was collected through road users' mobiles, which may have allowed their personal devices to be accessed.

A model which estimated the real-time traffic state using a Multiclass Support Vector Machine (MSVM) was proposed for Los Angeles, California by Deng et al. (2009). The cluster analysis method was used to divide the traffic state into a number of patterns. The data was collected every 5 minutes from 05:00 to 10:00, and the parameters included traffic flow, occupancy and average speed data from 13 to 17 June 2005. The results showed that the proposed approach is promising, with a prediction value of 96.57%. However, the weakness of this approach is that the authors did not compute the cost of the prediction.

An ensemble-based method to predict traffic jams in Warsaw, Poland was developed by He et al. (2010). The ensemble-based method was used to fuse data from several base predictors in order to derive a better prediction model. The study focused on the second task, which was to predict where the next traffic jams would occur during the main phases of the morning peak traffic flow. The study measured speed, the number of cars and the time when the jam first occurred. During the simulation, data samples were generated for an hour. Inputs of the simulation were “given 5 road segments with road work as a sequence of major roads where the first jams occurred during the initial 20 minutes of the simulation”. The goal of the study was to predict where major road jams would occur every 40 minutes. The cross-validation method was used for generalization of the dataset, and prediction performance on the final test set was effective. However, the study did not consider the RMSE (root mean square error) value, which can be used to evaluate the accuracy of the prediction model.

Another study, by Hamner (2010), developed a prediction model for traffic congestion using an ensemble of Random Forests (RF) in Warsaw, Poland. The data was collected using an automated traffic recorder (ATR) signal and GPS real-time data for a 30-minute window. The training data was grouped into datasets A (1,000), B (11,000) and C (55,000). The Matlab toolbox was used to carry out all the experiments, and the results showed that the prediction model was close, with root mean squared errors (RMSE) of 24.62 for dataset A, 22.64 for B and 22.59 for C. The study did not consider computing the cost of prediction, which can be used to penalise models that have a high cost value.

Duan et al. (2011) used Rough Set and Support Vector Machine (RS-SVM) to develop a traffic prediction model using data collected by the De Luce University of Minnesota. The model was built using historical data collected during peak hours. The data used was for weekdays and weekends and included the weather conditions of those days. Cross-validation was used during the development of the model. The proposed methods were found to be effective. The model was well presented, as it included weather conditions, which most studies did not consider.

In 2012, Wisitpongphan and colleagues explored Artificial Neural Networks (ANN) to predict traffic congestion in Bangkok, Thailand. The data was collected using GPS and GPRS technologies over a month. The information obtained for tracking and controlling traffic included speed, time and vehicles as parameters for the study. The ANN model was used to predict travel time during weekdays and weekends and during rush and non-rush hours. The results from the model were able to approximately predict the actual travel time in Bangkok. However, cross-validation and the cost of prediction were not used to improve the performance of the model; a model with a high prediction cost value could have been penalised. The study also did not use the traffic volume parameter, which plays an important role in vehicle traffic data.

A framework for traffic congestion prediction under night conditions was proposed by Chen et al. (2012) for the freeways of Taiwan. The data was collected using video surveillance, with headlights and their grouping as the salient factor for all cars at night. When detecting the headlights, the width, height and edge relationships between them are calculated to identify the make of a passing vehicle. Traffic congestion was classified into five levels (jam, heavy, medium, mild and low) in real time. The estimated accuracy of the traffic congestion prediction was 95.4%. The approach only focused on data collected at night, instead of using both day and night data to strengthen the prediction model.

A technique for traffic congestion prediction in urban traffic in the city of Timisoara in Romania was proposed by Pescaru (2013). The study used event-based route selection and relied on information collected by a sensor network. Data was collected during the mornings and afternoons of weekdays, and the interval selection used was heavy, medium and low. The WEKA toolkit was used for prediction. Simulation experiments were performed with more than 50 patterns, and the demonstration showed promising results for the vehicle prediction model. The study used some variables similar to those of the current study, but did not consider travel time, average speed and time of day as parameters, which could have contributed to the prediction performance of the model.

Gupta et al. (2013) proposed an improvised traffic-jam-detection framework known as Detect Traffic Congestion (DTC). The framework was applied to versatile GPS data that was collected over time and then clustered using the Expectation-Maximization algorithm. Mining of the data helped with the detection of locations that were facing frequent traffic congestion. A Decision Tree (J48) classification model was trained on the collected traffic data to make the prediction more accurate. The cross-validation method was used during the design of the model. The results obtained were 86% accurate compared to the real-time data.

The use of the Decision Trees algorithm (J48) and sliding windows to classify traffic congestion in four cities in Thailand was proposed by Thianniwet and Phosaard (2009). The data was collected using a GPS device, a webcam and an opinion survey of road users. The main parameters used were time, speed, volume, service level and the cycles of the traffic signal where motorists had to wait. The evaluations revealed that the J48 model achieved an overall accuracy of 91.29% in predicting traffic congestion. The disadvantage of both systems is that road users may not provide access to their traffic data. It may also be expensive for road users, as most of them do not own technologies such as GPS for data collection.

Purusothaman and Parasuraman (2013) used the Support Vector Machine (SVM) to estimate vehicle traffic density using data collected during peak hours in Chennai, India. The data was collected using surveillance cameras as sensors to record traffic on the road through image processing. The captured images were processed using texture features to determine the state of the traffic condition. During the experiments, 110 images were used for training. The results showed that the approach was efficient for all kinds of traffic images, with a prediction value of 90%. The approach was weakened by not considering the RMSE (root mean square error) value; a high RMSE value indicates a poor model.

In 2014, Rasyidi and Ryu compared prediction models designed using travel time and travel speed attributes, using data collected by the Transportation Information Service Centre (TISC) of Busan. In this study, models were constructed using M5 base trees combined with the Bagging ensemble method. The study only focused on two parameters: travel time and vehicle speed. The mean absolute percentage error (MAPE) was used for error measurement. The travel time predictors produced more accurate results compared to the travel speed prediction. However, the weakness of the study is that it did not combine both the travel time and travel speed attributes.

Makaba and Gatsheni (2016) focused on using the Bagging ensemble method combined with a Multi-Layer Perceptron (MLP) to construct a traffic prediction model in Gauteng, South Africa. The work was done using data collected over a period of one year. MLP was used individually and then combined with Bagging. WEKA was used to carry out the experiments, with cross-validation set to 10 folds. The results showed that MLP combined with Bagging performs best: the prediction performance of the model was 99.97% and the RMSE (root mean square error) was 0.014. The study could have been improved if data from more than one year had been used, as well as other algorithms to train the data.


The previous section highlighted the different kinds of models developed by other researchers to tackle the problem of vehicle traffic congestion.

2.3 THEORETICAL FRAMEWORK OF THE STUDY

In Gauteng, the “A Re Yeng” and “Rea Vaya” bus rapid transit (BRT) systems have been implemented to reduce traffic congestion. They are popular with commuters and have contributed to relieving the traffic congestion to a certain extent. It is highly unlikely that the construction of new roads will ease the traffic congestion, especially on the very busy freeways. In some countries, vehicles with even number plates are allowed on the roads on certain days and those with odd number plates on other days. In some countries like Singapore, “they levy a heavy tax on the purchase of a new vehicle and the tax can be equivalent to the price of the vehicle” (Goh, 2002). This forces commuters to use public transport. The introduction of e-toll gates is also a way to push motorists towards public transport. Cycle tracks have been introduced to encourage commuters to cycle to work rather than using motor vehicles; however, in Johannesburg, for example, the commuting public is not fully using these cycle tracks. The construction of high-rise accommodation has been recommended in corridors close to the major cities to create a critical mass of commuters, who are expected to use the bus rapid transit system. In Gauteng, there is a move to expand the Gautrain fast train system to include other important nodes in Johannesburg, such as Fourways, Randburg and Westgate. The Gautrain has been very successful and commuters are utilising it en masse. However, the fact that it terminates its services well before midnight (at 20:30) is problematic.

The current research examined the vehicle traffic congestion problem in more detail by not only using some of the parameters mentioned in the literature review, but also using historical data collected over a period of three years. All the constructed models were assessed using the root mean square error (RMSE) and cross-validation (CV), and by calculating the cost of prediction. Vuchic and Kikuchi (1994) define vehicle traffic congestion as follows: “when vehicular volume on a transportation facility (street or highway) exceeds the capacity of that facility, the result is a state of congestion”. The cost of prediction is used so that “the model reduces the number of wrong predictions and thereby minimises the total loss incurred” (Bishop, 1995).

Section 2.2 highlighted the different kinds of models developed to tackle the problem of vehicle traffic congestion. The studies mainly used historical data (Duan et al. 2011; Makaba and Gatsheni, 2016; Deng et al. 2009) and real-time data (Shen and Chen, 2009; Padiath et al. 2009), containing parameters such as time, speed and traffic volume. Duan et al. (2011) constructed a traffic prediction model using data that contained weekends, which may cause inconsistency in the results of the study. The data pattern for weekends is not as stable as that of weekdays, due to the variable number of commuters that travel on any weekend; weekday traffic from Monday to Friday is heavy, and commuting behaviour is almost constant with insignificant variations. Duan et al. (2011) also used parameters such as weather conditions, which can be important since they affect traffic flow prediction. The study by Deng et al. (2009) used traffic data obtained over one week, which is insufficient for fully capturing the vehicle traffic needed to construct a prediction model. A summary of the variables used in this study and in the previous work done by other researchers is listed in Table 2.1.

Thianniwet and Phosaard (2009) used Decision Trees (J48) to design their vehicle traffic congestion model using speed, time and volume parameters. However, the study did not use cross-validation, which is used to assess how well a model generalizes beyond the supplied dataset. The study also did not consider calculating the cost of prediction, which can help identify the best-performing model. Rasyidi and Ryu (2014) compared travel time and travel speed prediction parameters. The weakness of the study is that the authors did not consider combining the attributes to improve the prediction performance of the model.

Road traffic flow prediction reports have also been made in South Africa (Lethatsa, 2012; Marie, 2014; Mbodila and Ekabua, 2013). The current study is unique in that ensemble learning methods, which can combine more than one algorithm to construct a model, were used. In addition, the current study used a three-year historical dataset to construct the prediction models. All the models in the current study were evaluated using cross-validation, and the cost of prediction was computed for each. The study is of value in that the constructed models can be imported into an App that can be installed on any gadget to be used by commuters.

Table 2.1 provides a summary of the variables used in the current study and those used in other studies. The deficiency of the current study is that it did not use parameters such as weather conditions and accidents in constructing the prediction models. Table 2.1 also shows that some studies did not consider using attributes such as TravelTime and AverageSpeed, which are required parameters when constructing vehicle traffic prediction models.


Table 2.1: Summary of attributes used in this study compared to other studies.

Columns: Feature | Author 1 (Shen and Chen, 2009) | Author 2 (Padiath et al. 2009) | Author 3 (Deng et al. 2009) | Author 4 (Thianniwet and Phosaard, 2009) | Current study. “Yes” means the feature was used.

1. Travel Time attribute: Yes, Yes, Yes
2. Traffic Volume attribute: Yes, Yes, Yes, Yes, Yes
3. Average Speed attribute: Yes, Yes, Yes
4. Historical traffic data: Yes, Yes, Yes
5. Real-time traffic data: Yes, Yes
6. Calculating cost: Yes
7. Re-evaluating models using cross-validation: Yes
8. Used RMSE or other measure: Yes, Yes, Yes
9. Occupancy attribute: Yes, Yes, Yes
10. Accidents attribute: not used in any of the listed studies
11. Weather condition attribute: not used in any of the listed studies

2.4 CHAPTER CONCLUSION

A summary of findings from the literature review on vehicle traffic congestion prediction models showed that most of the related work was conducted using machine learning algorithms, and that there are few studies that used ensemble learning methods. Ensemble learning methods can be implemented to train weak classifiers so as to improve their performance. The previous studies showed that different machine learning algorithms and ensemble methods can be used in the prediction of vehicle traffic flow.


CHAPTER 3: METHODS

3.1 INTRODUCTION

This chapter introduces the research methodology, data collection and machine learning tools for constructing the vehicle traffic flow prediction models. The research methods and the algorithms that have been chosen are discussed in detail in this chapter.

3.2 RESEARCH METHODOLOGY

3.2.1 Research Approach

This is a quantitative research study. “Quantitative research is defined as a formal, objective, systematic process to describe and test relationships and examine cause and effect interactions among variables” (Burns and Grove, 1993). A site visit and face-to-face interviews were conducted to collect data; sample questions used are included in Annexure 3A. Bauman and Greenberg (1992) define an interview as “asking questions and getting answers from participants in a study”.

Historical vehicle traffic flow data for the study was obtained from Mikros Traffic Monitoring (MTM) in Gauteng. The data collected comprised the vehicle traffic flow data for 2013, 2014 and 2015. Machine learning algorithms and ensemble learning methods were used to carry out the experiments, and the results were evaluated using the confusion, loss and cost matrices. The data was pre-processed in a Microsoft Excel spreadsheet, converting it from numerical to nominal values.
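
The conversion itself was done in Excel, as described above and in Section 4.2.1. For readers who prefer to keep this step inside WEKA, the hedged sketch below shows an equivalent conversion using WEKA's unsupervised Discretize filter to bin numeric attributes into nominal ranges; the bin count of 3 (echoing the three traffic states) and the attribute range are illustrative assumptions, not the study's exact mapping.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class PreprocessSketch {
    public static Instances toNominal(String arffPath) throws Exception {
        Instances raw = new DataSource(arffPath).getDataSet();

        // Bin every numeric attribute into 3 equal-width nominal ranges,
        // loosely analogous to the Freeflow/FlowingCongestion/Congested
        // states used in this study. The study itself performed the mapping
        // in Excel; the bin count here is an illustrative assumption.
        Discretize disc = new Discretize();
        disc.setBins(3);
        disc.setAttributeIndices("first-last");
        disc.setInputFormat(raw);
        return Filter.useFilter(raw, disc);
    }
}
```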


3.2.2 Research Design

The flow chart in Figure 3.1 shows an overview of the entire process for the construction of the vehicle traffic prediction models.

Figure 3.1: Flow chart diagram for the vehicle traffic prediction model.

Figure 3.1 displays the steps in which the data was collected and pre-processed, the experiments carried out, and finally the post-processing of the results, as explained in Chapter 4. Outliers were removed using a Microsoft Excel spreadsheet, so that the data contained the required attributes for the study and inconsistencies were avoided. The data was converted to nominal (also known as categorical) values, “which have values that are distinct symbols” (Witten et al. 2011).

3.2.3 Research Settings

The research study was conducted with data collected from the Ben Schoeman freeway (the “M1 North extending to the N1” towards Pretoria), which links Johannesburg and Pretoria. This freeway has high traffic congestion and is the most used route in Gauteng Province (Economy of Gauteng Province, 2014).


3.2.4 Research Methods

3.2.4.1 Population

Strydom and Venter (2002) refer to the population as “the sampling frame, the totality of persons, events, organizational units, case records or other sampling units at the time with which the research problem is concerned”. Table 3.1 contains the summary of the training and testing data, including the target concept. The testing dataset was used to re-evaluate all the models constructed in the vehicle traffic flow experiments. The data was collected from Mikros Traffic Monitoring (MTM).

Table 3.1: The distribution of instances over three years for the testing and training data and their Target Concept.

Target Concept                  Training dataset    Testing dataset
Congestion                      1678                841
FlowingCongestion               1756                890
Freeflow                        7179                3576
Total (Training and Testing)    10613               5307
Total Instances                 15920

3.2.4.2 Sampling

Strydom and Venter (2002) refer to sampling as “the type of sample that is based entirely on the judgement of the researcher, in that a sample is composed of elements that contain the most characteristic, representative or typical attributes of a population”. A sample size of 15920 data points was used, split into a training dataset of 10613 instances and a testing dataset of 5307 instances. This data was collected over a period of three years. Weekends and public holidays are excluded from the study, since the purpose of the study is to help commuters reduce travel time. The procedure used for splitting the data follows Witten et al. (2011), who advise that it is better to use more than half of the dataset for training than to train on the entire dataset: two-thirds of the data was used as training data and one-third was held out for testing, in order to evaluate all the constructed models.
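
A minimal sketch of this two-thirds/one-third split, expressed with the WEKA Java API (the shuffling seed and the file name traffic.arff are assumptions; the study's own split counts are given in Table 3.1):

```java
import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SplitSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("traffic.arff").getDataSet(); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(42)); // shuffle before splitting; seed is arbitrary

        // Two-thirds for training, one-third held out for testing,
        // following the Witten et al. (2011) guideline quoted above.
        int trainSize = (int) Math.round(data.numInstances() * 2.0 / 3.0);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        System.out.println("Training instances: " + train.numInstances());
        System.out.println("Testing instances:  " + test.numInstances());
    }
}
```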


3.2.5 Dataset Collection

Quantitative primary data was collected using interviews and site visits in order to observe how data is collected from the freeway by MTM. The interview was conducted to understand the processes used to collect data from the sites on the freeway. The set of questions is found in Annexure 3A.

The data used for the study was collected from Mikro’s Traffic Monitoring (MTM). The data was collected over a period of three years (2013, 2014 and 2015) and was in a Microsoft Excel spreadsheet format. This vehicle traffic flow data was for the freeway that links Johannesburg with Pretoria (M1 North extending to the N1 North); a map of the Ben Schoeman freeway is shown in Annexure 3B. MTM is contracted by the Department of Transport (DoT) to collect vehicle traffic flow data on Gauteng freeways.

MTM has data loggers installed at different location sites on the freeway. The data loggers record the number of vehicles, their average speed and the traffic volume on the freeway; this is the form of the data that was obtained from MTM. The data is wirelessly transmitted to MTM’s server for storage, and is collected and stored in real time. MTM has a software programme that is used to check whether the data loggers are functioning well, and MTM can record and play back incidents on the freeway. MTM uses MonCam cameras for quality assurance of the data collection process. These cameras provide frame-grab images for all recorded vehicles and a video stream synchronised with the recorded traffic data for data analysis and processing. The TrafBase software is designed to validate, store and manage large amounts of collected traffic data. Traffic data is made available to a user either in its original or in summarised form through data files, spreadsheet files and physical reports.

Secondary data from the literature review was used to gain a more in-depth understanding of the findings.


Figure 3.2: A sample report of traffic flow graphs generated from the database server.

Figure 3.2 shows a screen shot taken from the MTM database server when collecting traffic data. The database server is mainly used to store the traffic data collected from the Ben Schoeman freeway and other freeways in the province. The report shows the traffic data collected per day, the site (1863), the day of the week, the date and whether the displayed graph is for light or heavy vehicles.

3.2.6 The Data Collection Procedure

Interviews with an MTM analyst were conducted during the site visit using a set of questions. The vehicle traffic flow data was collected by MTM over a period of three years. The data included parameters for weekdays, weekends and public holidays, and was recorded in a numerical format.

3.2.7 Validations

All of the prediction models constructed for the study were designed using machine learning algorithms and ensemble learning methods with the WEKA software application. The results were evaluated using a combination of the confusion and loss matrices of all the constructed models, covered in section 4.3.


3.2.8 Limitations of the study

There were a number of limitations in this study, and the main ones are as follows. The study relied on a single source of data; in future, other researchers can use several sources of data to improve the performance of the constructed models, and can also include weather and vehicle accident variables. There are few reports in the literature where ensemble learning methods were used to tackle vehicle traffic issues. Ensemble learning methods can only handle data from one source at a time when making predictions; introducing tools such as data fusion, which is used to “fuse data from different sources”, in combination with ensemble learning methods could address this. Finally, commuters using the app on their mobile devices would still have to obtain the actual traffic volume on the freeway from the media.

3.3 ATTRIBUTE SELECTION

Feature selection, also known as feature reduction or attribute selection, “is used for selecting a subset of relevant features for building a robust learning model” (Hall, 1999). In biology, the technique is called discriminative gene selection, which detects influential genes based on DNA microarray experiments (Saeys et al. 2007). By removing the most irrelevant and redundant features from the data, feature selection is able to improve the performance of learning models (Lei and Liu, 2003) by:

1. Alleviating the effect of the curse of dimensionality
2. Enhancing generalization capability
3. Speeding up the learning process
4. Improving model interpretability

Feature selection algorithms typically fall into two categories: Feature Ranking and Subset Selection. Feature Ranking ranks the features by a “metric and eliminates all features that do not achieve an adequate score” (Kantardzic, 2011). Subset Selection searches the set of possible features for an optimal subset. Feature selection also helps in providing a better understanding of the data by identifying important features and their relation to each other. In statistics, the most popular form of feature selection is stepwise regression, a greedy algorithm that adds the best feature (or deletes the worst feature) in each cycle. The main control issue is deciding when to stop training the model; in other words, there is a need to determine the stopping criteria. In machine learning, this is typically done by cross-validation (Michie et al. 1994). A sketch of the ranking approach follows.
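The following is a minimal sketch of Feature Ranking: score each attribute by its information gain (estimated here via mutual information) and list the attributes in order. It uses scikit-learn rather than WEKA's evaluators, and the file and column names are illustrative assumptions.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import mutual_info_classif

data = pd.read_csv("traffic_nominal.csv")   # hypothetical file name
names = ["DayOfWeek", "TravelTime", "TrafficVolume", "AverageSpeed"]
X = OrdinalEncoder().fit_transform(data[names])   # nominal values -> integer codes
y = data["TargetConcept"]

# Estimated information gain of each attribute with respect to the target concept.
scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
for name, score in sorted(zip(names, scores), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")   # attributes ranked from most to least informative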


Feature Selection

“Since each feature used in the construction of a model can increase the cost and running time of a recognition system, designing and implementing systems with a highly discriminative small feature set is the key that ensures the achievement of high prediction rates” (Tiwari and Manu, 2010). There is a significant opportunity for improving the usefulness of the traditional machine learning techniques for automatically generating useful prediction procedures (Saeys et al. 2007).

Feature Selection Architecture

The architecture of a feature selection system is given in Figure 3.3. “It is assumed that an initial set of features will be provided as an input representing positive and negative examples of the various classes for which the prediction is to be performed” (Vafaie and De Jong, 1992). A search procedure is used to explore the space of all features of the given feature set. The performance of each of the selected feature subsets is measured by an evaluation function, shown in Figure 3.3, that measures the specified classification result. The best feature subset that is found is then output as the recommended set of features to be used in the actual design of a model.

[Figure 3.3 components: camera input image, feature extraction technique, feature set, search technique, criterion function measuring the goodness of each feature subset, classification process, and the best feature subset output.]

Figure 3.3: Feature Selection process diagram.

The evaluation procedure in Figure 3.4 is split into two main steps. According to Miao and Hou (2004), after a feature subset is selected, the training data, consisting of the entire set of feature vectors and the class assignments corresponding to instances from each of the given classes, is reduced. This is done by excluding the features that are not effective.


[Figure 3.4 components: training data, reduced feature subset, data reduce set, rule inducer, fitness function, fitness decision.]

Figure 3.4: Feature subset evaluation procedure.

Step two is to “perform prediction (algorithms) on the new reduced training collection in order to make rules which capture the underlying function that describes the solution to the problem domain set” (Dash et al. 2002). “A class description is formed by a set of decision rules describing all the training examples given for that particular class. A decision rule is a set of conjuncts of allowable tests of feature values. The last step is to evaluate (post-processing) the prediction performance of the induced rules on the unseen data” (Dash et al. 2002).

An advantage of using attribute selection is that it improves rule induction techniques. “This is a step towards automating the construction of prediction systems for difficult problems” (Hall, 2000) incorporating hundreds of attributes. The search techniques described in section 3.6 sit in the adaptive feature selection process in Figure 3.3.

3.4 MACHINE LEARNING ALGORITHMS USED FOR THE STUDY

3.4.1 Decision Trees (C4.5)

Figure 3.5: Illustration of a Decision Tree (C4.5).

According to Mitchell (1997) decision tree learning is “a method for approximating discrete- valued target functions, in which the learned function is represented by a decision tree”. Learned trees can also be represented as sets of if-then rules to improve human readability. Decision trees classify instances by sorting them down the tree from the root to some leaf


node, which provides the classification of the instance. Decision Tree learning is “widely used and is also a practical method for inductive inference” (Wang and Hu, 2002). Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, and then moving down the tree branch corresponding to the value of the attribute in the tree as shown in Figure 3.5. This process is then repeated for the sub tree rooted at the new node.

The Decision Tree learning algorithm is suited to problems where:

1. Instances are represented by attribute-value pairs. For example, the attribute 'TrafficVolume' has the values 'Low-Traffic', 'Average-Traffic' and 'Heavy-Traffic'.
2. The target concepts have discrete output values. The algorithm easily deals with instances which are assigned a Boolean decision, such as 'true' and 'false' or 'p (positive)' and 'n (negative)', but in this study the target concepts are 'Freeflow', 'Congested' and 'FlowingCongestion'.
3. The training data may contain errors. These can be dealt with using pruning techniques that are not covered in this study.

The Decision Tree algorithm has the following advantages (Callan, 2003):

1. Decision tree models are easy to understand after a brief explanation.
2. “They have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities and costs) and their preferences for outcomes”.
3. They allow the addition of new possible scenarios.
4. They help determine worst, best and expected values for different scenarios.
5. They can be combined with other decision techniques.

Disadvantages of Decision Trees are as follows:

1. “For data including categorical variables with a different number of levels, information gain in decision trees is biased in favour of those attributes with more levels”.
2. Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked (Khoonsari and Motie, 2012).


Iterative Dichotomiser 3:

The Iterative Dichotomiser 3 (ID3) algorithm “is one of the most used algorithms in machine learning and data mining for learning decision trees, due to its easiness to use and effectiveness” (Quinlan, 1986). J. Ross Quinlan developed it in 1986 “based on the Concept Learning System (CLS) algorithm. It builds a decision tree from some fixed or historic symbolic data in order to learn to classify and predict the classification of new data”. The data must have several attributes with different values, and this data also has to belong to diverse predefined, discrete classes (e.g. Freeflow/Congested). The decision tree chooses the most effective attributes for decision making by using the information gain (IG) measure.

Entropy and Information Gain:

ID3 (Callan, 2003) is the best-known algorithm for learning Decision Trees. Figure 3.5 shows a typical decision-making tree. In this example, people decide whether to travel according to the state of the traffic flow on the road, preferring to travel when there is no traffic congestion.

The basic ID3 method selects the attribute for each node by using the information gain measure, beginning at the top of the tree. The core question of ID3 is how to select the attribute of each node of the tree. A statistical property called information gain is defined to measure the worth of an attribute. The statistical quantity entropy is defined as a measure of how mixed the classes are, and is used to choose the best attribute from the candidate attributes. The definition of information gain is as follows:

$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$    Equation 3.1

where $S$ is a collection of examples, $A$ is an attribute and $S_v$ is the subset of $S$ for which attribute $A$ has value $v$.

$Entropy(S) = -\sum_{i} p_i \log_2 p_i$    Equation 3.2

where $p_i$ is the probability of class $i$.

The reason why the Decision Tree (C4.5) “has been considered for this study is that it handles continuous attributes, training data with missing attribute values and post pruning trees after their creation” (Mitchell, 1997). A small sketch of Equations 3.1 and 3.2 follows.
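The following is a minimal sketch of Equations 3.1 and 3.2: entropy over a class distribution, and the information gain of an attribute computed over a tiny illustrative sample (the instances shown are made up, though the values follow the study's nominal scheme).

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = -sum_i p_i * log2(p_i) over the class distribution (Equation 3.2).
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v) (Equation 3.1).
    labels = [r[target] for r in rows]
    gain = entropy(labels)
    n = len(rows)
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

sample = [
    {"TravelTime": "Peak", "TargetConcept": "Congested"},
    {"TravelTime": "Peak", "TargetConcept": "FlowingCongestion"},
    {"TravelTime": "Off-Peak", "TargetConcept": "Freeflow"},
    {"TravelTime": "Off-Peak", "TargetConcept": "Freeflow"},
]
print(information_gain(sample, "TravelTime", "TargetConcept"))  # 1.0 for this toy split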


3.4.2 Multi-Layer Perceptron (MLP)

The Multi-Layer Perceptron (MLP) is a neural network with one or more layers of hidden units. Roy et al. (2005) describe the MLP “as an information processing paradigm that is inspired by the way the biological nervous system processes information. The MLP is composed of a large number of highly interconnected processing elements organised in layers”. The MLP is applicable to non-linearly separable data: in order to handle such data, the perceptron is extended to a more complex structure through the introduction of one or more hidden layers, yielding the multi-layer perceptron. In the MLP, neuron layers are “stacked in such a way that the output of a neuron in a layer is only allowed to be an input to neurons in the upper layer” (Le and Zuidema, 2014), as shown in Figure 3.6.

Figure 3.6: A very simple three-layer network consisting of input, hidden and output layers.

The advantages of Multi-Layer Perceptron are as follows:

1. Adaptive learning: “an ability to learn how to do tasks based on the data given for training”, also called initial experience (Ruck et al. 1990).
2. They yield the required decision function directly via training.
3. “A two-layer back propagation network with sufficient hidden nodes has been proven to be a universal approximator” (Cybenko, 1989).

The disadvantage of a Multi-Layer Perceptron network is that it finds its own way to work out a problem, “thus its action can be unpredictable”. “If there’s a new concept to be introduced, it retrains the whole model” (Su et al. 1996). It takes a long time to train and come up with a good model. The MLP has been applied in areas such as text-to-phoneme mapping (Turban and Frenzel, 1992), among other applications. A minimal training sketch follows.
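The following is a minimal sketch of training an MLP on nominal traffic attributes, assuming scikit-learn in place of WEKA and one-hot encoding of the nominal inputs; the four instances are illustrative only.

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder

X_nominal = [["Peak", "Heavy-Traffic"], ["Off-Peak", "Low-Traffic"],
             ["Peak", "Average-Traffic"], ["Off-Peak", "Heavy-Traffic"]]
y = ["Congested", "Freeflow", "FlowingCongestion", "Freeflow"]

encoder = OneHotEncoder()            # nominal values -> binary input units
X = encoder.fit_transform(X_nominal)

# One hidden layer of 10 units; weights are learned by backpropagation.
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(encoder.transform([["Peak", "Heavy-Traffic"]])))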


3.4.3 Support Vector Machine (SVM)

Vapnik (1995) introduced Support Vector Machines (SVM), “which are based on statistical learning theory and are developed to perform binary classification”. The SVM is a supervised learning algorithm which can solve linearly and non-linearly separable problems. The SVM selects the one linear decision boundary that leaves the greatest margin between the two classes. “The margin is defined as the sum of the distances to the hyperplane from the closest points of the two classes” (Vapnik, 1995). If the two classes are not linearly separable, slack variables or a kernel trick are introduced in the SVM to find the hyperplane that maximizes the margin. The data points that are closest to the hyperplane are used to measure the margin; hence these data points are named support vectors, as shown in Figure 3.7. The idea of the two-class classification algorithm is as follows (Madzarov et al. 2009):

For a binary problem, there are training data points $\{x_i, y_i\}$, $i = 1, \dots, l$, $y_i \in \{-1, 1\}$. Boser et al. (1992) suggested that there are some hyperplanes which separate the positive from the negative examples. Points $x$ which lie on the hyperplane satisfy $w \cdot x + b = 0$, where $w$ is normal to the hyperplane and $|b|/\|w\|$ in Figure 3.7 is the perpendicular distance from the hyperplane to the origin. Suppose that all the training data satisfy the following constraints (Madzarov et al. 2009):

$x_i \cdot w + b \ge +1 \quad \text{for } y_i = +1,$

$x_i \cdot w + b \le -1 \quad \text{for } y_i = -1.$    Equation 3.3

These constraints can be combined into one set of inequalities. The points that lie on the hyperplanes $H_1$ and $H_2$ in Figure 3.7 are called support vectors; the two hyperplanes are parallel, and no training points fall between them.


Figure 3.7: Linear separating hyperplanes for the separable case, support vectors are circled.

3.4.3.1 Multiclass for support vector machine (MSVM):

The SVM was originally designed for binary classification, using a linear separating hyperplane to separate the data into two classes; SVMs have since been extended to multi-class classifiers which can separate the data into multiple classes. When dealing with a multi-class problem, an appropriate method is needed, such as SMO (Platt, 1998), libsvm (Chang and Lin, 2011) or liblinear (Fan et al. 2008). The difference between binary classification and multi-class problems is that there are $\{1, 2, \dots, k\}$ classes. There are three major methods to extend binary classification: “one-against-all, which constructs $k$ SVM models where $k$ represents the number of target concepts” (Bottou et al. 1994); one-against-one, which “constructs $k(k-1)/2$ classifiers where each one is trained on data from two classes”; and the directed acyclic graph SVM (DAGSVM) (Platt et al. 2000), whose “training phase is the same as the one-against-one method, solving $k(k-1)/2$ binary SVM classifiers”. The one-against-all method was used for the study, and the libsvm (library for support vector machine) tools from WEKA were used to construct the SVM models. The libsvm library automatically detects more than two classes and therefore starts to train a multi-class SVM by default (Chang and Lin, 2011).

There is also the kernel trick in the SVM. The kernel used in this study is the radial basis function (RBF). “This kernel nonlinearly maps samples into a higher dimensional space. RBF kernels can handle the case when the relation between class labels and attributes is nonlinear” (Hsu et al. 2003).


The main advantages of SVM are as follows:

1. SVMs deliver a unique solution, since the optimality problem is convex.
2. “Introducing the kernel in SVMs gains flexibility in the choice of the form of the threshold separating solvent from insolvent companies, which need not be linear” (Auria and Moro, 2008).
3. The SVM prediction accuracy is generally high.
4. “Fast evaluation of the learned target function” (Madzarov et al. 2009).

The main drawbacks of SVMs are that they take a long time when generating training results and that it is difficult to understand the learned functions (Vapnik, 1995). The SVM also requires parameter tuning during training to get accurate results.

The SVM has been applied successfully in applications such as algorithmic trading (Blokker, 2008), object recognition (Wang et al. 2002) and face detection (Kumar and Poggio, 2000), among others. A minimal sketch with the RBF kernel follows.
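The following is a minimal sketch of an RBF-kernel SVM with the gamma and cost (C) parameters discussed in section 3.9; scikit-learn's SVC stands in for WEKA's libsvm wrapper, and the three instances are illustrative only.

from sklearn.svm import SVC
from sklearn.preprocessing import OneHotEncoder

X_nominal = [["Peak", "Heavy-Traffic", "Low-Speed"],
             ["Off-Peak", "Low-Traffic", "High-Speed"],
             ["Peak", "Heavy-Traffic", "Average-Speed"]]
y = ["Congested", "Freeflow", "FlowingCongestion"]

X = OneHotEncoder().fit_transform(X_nominal)

# Keep the cost C constant and vary gamma, as described for the study's experiments.
for gamma in [0.01, 0.1, 1.0]:
    svm = SVC(kernel="rbf", C=1.0, gamma=gamma, decision_function_shape="ovr")
    svm.fit(X, y)
    print(gamma, svm.predict(X))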

3.5 ENSEMBLE LEARNING METHODS USED FOR THE STUDY

3.5.1 Ensemble Methods

An ensemble learning method “selects a whole collection or ensemble of hypotheses from the hypothesis space and combines their predictions” (Russell and Norvig, 2003). Ensemble methods improve the performance of weak classifiers (e.g. decision trees), which were used in the study. Ensemble results are usually combined using a voting or averaging procedure. “Ensembles have been shown to be more accurate in many cases than the individual classifiers or predictors” (Tan et al. 2006). “Thus combining multiple models depends on the level of disagreement between classifiers or predictors and only helps when these models are significantly different from each other” (Wan and Yang, 2013). Ensembles have been used to improve generalization accuracy on a wide variety of problems (Dietterich, 2002), and “ensemble methods have been used for improving prediction performance” (Sewell, 2007). The four ensemble methods used in our study are Bagging, Boosting, Random Forest and Stacking. These ensemble methods are likely to improve the vehicle traffic flow prediction performance.


3.5.2 Data Fusion

Data fusion (DF) (Hall and Llinas, 1997) is a “technique that combines data from multiple predictors. It is also used to combine related information from associated databases”. DF does all this in order to achieve improved accuracy and to make better inferences than could be achieved by the use of a single model or data set. The emergence of new algorithms, advanced processing techniques and hardware will make data fusion more effective. Methodological areas of data fusion systems include artificial intelligence, pattern recognition and statistical inference (Sohn and Lee, 2003). The application areas of data fusion include automated target recognition (Buede and Girardi, 1997) and identification-friend-foe-neutral (IFFN) systems (Hall, 2000), among others.

These are the limitations of data fusion:

1. Downstream processing cannot make up for errors (or failures) in upstream processing. Data fusion “processing cannot correct errors in processing (or lack of pre-processing) of individual sensor data” (Hall and Steinberg, 2001).

Comparison between Data fusion and Ensemble Methods:

Data fusion (DF) is essentially an information integration problem (Bray, 1997): “it integrates data from multiple sensors to provide better analysis and the data can be used for decision making. In other words, integrated data from multiple sensors of different types provides better results because the strength of one type can compensate for the weaknesses of the other” (Bray, 1997). An ensemble learning method (Bühlmann, 2012) “aims to improve the performance of a statistical learning” method. “The general principle of ensemble methods is to construct a linear combination of some model fitting method, instead of using a single fit of the method” (Bühlmann, 2012). DF fuses data from different sources to derive good results by using different types of sensors, while an ensemble method combines multiple learning methods and machine learning algorithms (e.g. Decision Trees) to get good results.


3.5.3 Bagging ensemble method

Bagging, also known as bootstrap aggregating, “uses multiple sets of training data which are generated by randomly sampling, with replacement, $N$ instances, where $N$ represents the size of the training data” (Breiman, 1996). The method reduces variance and combines predictions by majority voting. Bagging is “one of the earliest, most intuitive and simple ensemble based algorithms, with a good performance” (Syarif et al. 2012). Bagging is combined with one base classifier (e.g. decision trees) at a time.

The experiments in section 4.3 are carried out using the bagging method. The procedure for constructing the bagging method is as follows (Tan et al. 2006):

Phase1: Base Classifier

Step 1: Given a training data set, generate new training data sets by bootstrap sampling.

Step 2: “Apply base classifier to each sample of the data set to construct the models”.

Phase 2: Model Prediction

Step 3: Make predictions with each of the constructed models.

Step 4: Select the final decision by majority voting.

Bagging has the following advantages:

1. “Bagging is mainly used for unstable base classifiers.
2. Bagging seeks to reduce the error due to variance of the base classifier.
3. It is noise-tolerant, but not so accurate” (Sweety, 2013).

The disadvantage of Bagging is that its model is not simple to interpret. A minimal sketch of the procedure follows.
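The following is a minimal sketch of the bagging procedure above: bootstrap-sampled training sets, one base classifier (a decision tree) per sample, and majority voting. It assumes scikit-learn rather than WEKA, with X_train, y_train and X_test standing in for the study's encoded data.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging = BaggingClassifier(
    DecisionTreeClassifier(),  # Phase 1: base classifier fitted to each bootstrap sample
    n_estimators=10,           # number of bootstrap training sets
    random_state=0,
)
# bagging.fit(X_train, y_train)   # Phase 2: per-model predictions combined by voting
# bagging.predict(X_test)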


3.5.4 Boosting ensemble method

According to Sewell (2007), “boosting is a general method for improving the performance of any learning algorithm”. In theory, boosting can be used to significantly reduce the error of any “weak” learning algorithm. Weak learning algorithms consistently generate classifiers or predictors which perform slightly better than random guessing (Freund and Schapire, 1996). Despite the potential benefits of boosting promised by the theoretical results, the true practical value of boosting can only be assessed by testing the method on a “real” learning problem (Freund and Schapire, 1996). The boosting method works by weighted voting. The core of this method is to assign a weight to each example in the training set. These weights can be used to inform the training of the weak learner; for instance, decision trees can be grown that favour splitting sets of samples with high weights. “In the beginning, all weights are equal, but in every round, the weights of all misclassified instances are increased while the weights of correctly classified instances are decreased” (Tan et al. 2006). Boosting’s weakness is that it is susceptible to overfitting (Sweety, 2013).

The prediction models using the boosting method were constructed in section 4.3. The procedure for constructing the boosting model is as follows (Tan et al. 2006):

Phase 1: Base classifier

Step 0: “Set the weight value, w = 1, and assign it to each instance in the training data set”.

Step 1: “Apply a base classifier to the weighted data set”.

Step 2: “Compute the classification or prediction error e for the weighted training data set. If e = 0 or e >= 0.5, then terminate the classifier generation process and go to Step 4; otherwise multiply the weight w of each object by e/(1 - e) and then normalize the weights of all objects”.

Phase 2: Model Prediction

Step 3: “Assign weight q = 0 to each target concept to be predicted”.

Step 4: “For each of the t (or fewer) classifiers or predictors, add -log(e/(1 - e)) to the weight of the decision predicted by the classifier, and output the decision with the highest weight”.


Boosting has the following advantages:

1. Boosting can work with any classification algorithm.
2. It tends to achieve higher accuracy than Bagging.

The disadvantage of Boosting is that it can fit the training data too closely, giving very high training accuracy but poor generalization performance. Boosting focuses on misclassified tuples, so it risks overfitting.

What makes boosting interesting is that you can learn a very complex model all at once or learn sequences of simpler models.

Boosting has the following limitations:

Boosting can fail to perform well when given insufficient data; “this observation is consistent with the Boosting theory” (Freund et al. 1999). Boosting also does not perform well when there is a large amount of classification noise (i.e. training and test examples with incorrect class labels) in the data, and it is thus susceptible to noise in the data. A minimal sketch using AdaBoost follows.
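The following is a minimal sketch of the reweighting procedure above, using AdaBoost (a standard boosting algorithm, stated here as an illustration rather than the study's exact WEKA configuration) over shallow decision trees as the weak learner.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
    n_estimators=50,                      # rounds of reweighting (Steps 1-2 above)
    random_state=0,
)
# boosting.fit(X_train, y_train)  # misclassified instances gain weight each round
# boosting.predict(X_test)        # weighted vote of the weak learners (Step 4)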

Comparison between Bagging and Boosting:

1. According to Sweety (2013), “Bagging is noise-tolerant and produces better class probability estimates. It is not so accurate and it is related to random sub-sampling”.
2. Boosting is susceptible to noisy data and produces rather bad class probability estimates.
3. In bagging, each model receives equal weight, whereas in boosting weighting is used to give more influence to the more successful models.

Bagging and boosting both use a sort of voting for categorization in order to combine the outputs of the several classifiers constructed from one algorithm. “In boosting, unlike bagging, each classifier is influenced by the performance of those built before, so the new classifier tries to pay more attention to errors that were made in the previous ones and to their performances” (Sweety, 2013). According to Rokach (2005), “each instance is chosen with equal probability in bagging, while in boosting, instances are chosen with probability proportional to their weight “.


3.5.5 Stacking ensemble method

Stacking, or stacked generalization, “is used for combining multiple predictions” (Ting and Witten, 1999). Unlike bagging and boosting, stacking is usually used to combine various different predictors constructed using different algorithms (e.g. C4.5, MLP and SVM), as shown in Figure 3.8. Stacking consists of two levels: a base learner as level-0 and a stacking model learner as level-1. The base level uses different algorithms to learn from a dataset, as shown in Figure 3.8. The results of each algorithm are collected to create a new meta-level dataset, as shown in Figure 3.8; in the new dataset, each instance holds the base-level predictions for a real value which is to be predicted. That data is then passed through the stacking model learner to provide the output results. “The predicted output from the three chosen base classifiers, which are C4.5, MLP and SVM, can be used as input variables into the stacking model learner, which will attempt to learn from the information how to combine the predictions from these several models to attain the best prediction accuracy” (Syarif et al. 2012).

Figure 3.8: Diagram to show the Stacking model.

The procedure for constructing the Stacking model is as follows (Wolpert, 1992):

Step 0: Separate the training dataset into two “disjoint sets”.

Step 1: On the first part, train several “base learners” (e.g. C4.5 and MLP).

Step 2: On the second part test the base learners.

Step 3: “Using the outputs of the base classifiers from Step 2 as the inputs, and the correct responses as the outputs, train a higher-level learner”.

32

“In Steps 0 to 2, instead of using a winner-takes-all approach, we combine the base learners, possibly in a nonlinear fashion” (Wolpert, 1992).

Stacking has the advantage of improving the predictive accuracy of a model. The disadvantage of stacking “is that it is difficult to understand an ensemble of classifiers or predictors” (Rokach, 2010). This method was used since it greatly improves prediction and can combine more than one classifier to obtain better results. A minimal sketch follows.
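The following is a minimal sketch of the two-level scheme in Figure 3.8, with a decision tree standing in for C4.5, plus MLP and SVM as level-0 base learners and a level-1 combiner; scikit-learn is assumed in place of WEKA, and the choice of logistic regression as the level-1 learner is illustrative.

from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier()),        # level-0: stands in for C4.5
        ("mlp", MLPClassifier(max_iter=2000)),     # level-0
        ("svm", SVC(kernel="rbf", probability=True)),  # level-0
    ],
    final_estimator=LogisticRegression(),          # level-1: stacking model learner
    cv=5,  # base-learner outputs for level-1 come from held-out folds
)
# stack.fit(X_train, y_train)
# stack.predict(X_test)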

3.5.6 Random Forest (RF) ensemble method

Random Forest (RF) “is another ensemble-based method which focuses only on ensembles of decision trees” (Breiman, 2001). RF combines the base learner principles of bagging with random feature selection to add additional diversity to the decision tree models (Lantz, 2013). After the ensemble trees are generated, the model uses a vote to combine the tree predictions. “Also, random selection of features to be used at splitting nodes enables fast training even if the dimensionality of the feature vector is large” (Mishina et al. 2015). In the prediction process a sample is input to all of the decision trees, and the class probabilities of the leaf nodes arrived at are output. The class probabilities $P_t(c \mid x)$ are averaged over all of the decision trees, as shown in Equation 3.4, and the class with the largest average probability gives the classification decision (Mishina et al. 2015).

$P(c \mid x) = \frac{1}{T} \sum_{t=1}^{T} P_t(c \mid x)$    Equation 3.4

The class with the highest probability, $\hat{y}$, is output as the classification result by:

$\hat{y} = \arg\max_{c} P(c \mid x)$    Equation 3.5

Random forest was used in the study because of the following reasons (Lantz, 2013):

1. It can handle noisy or missing data, and categorical or continuous features.
2. It selects only the most important features.
3. It can be used to handle data with an extremely large number of features.
4. “Parameters can be easily set and therefore it eliminates the need for pruning the trees” (Ali et al. 2012).


The disadvantage of Random Forest is that “they may require some work to tune the model to the data and its model may not be easy to interpret” (Lantz, 2013). RF was used in section 4.3, where the step-by-step process of constructing the RF ensemble models is shown. A minimal sketch follows.
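The following is a minimal sketch of a Random Forest, assuming scikit-learn rather than WEKA: bagged trees with a random subset of features considered at each split, and class probabilities averaged across trees as in Equations 3.4 and 3.5.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees T in the ensemble
    max_features="sqrt",   # random subset of features considered per split
    random_state=0,
)
# forest.fit(X_train, y_train)
# forest.predict_proba(X_test)  # per-class averages P(c|x) over all trees (Eq. 3.4)
# forest.predict(X_test)        # argmax over the averaged probabilities (Eq. 3.5)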

3.6 SEARCH METHODS

Search algorithms only have the knowledge provided in the search space. A search space can be a pool of attributes or features. These search methods fit into the feature selection process diagram in Figure 3.3 in section 3.3.

3.6.1 Best-First Search (BFS)

The best-first search technique allows us to switch between paths, thus gaining the benefits of both depth-first and breadth-first search. At each step (level) the most promising node is chosen. Nodes are presented by either the parent node or the child node, as shown in Figure 3.9. “If one of the nodes chosen generates nodes that are less promising it is possible to choose another at the same level and in effect the search changes from depth to breadth first search” (Sharma, 2008). If, on analysis, these nodes are no better, the previously unexpanded node and branch are not forgotten; the search method reverts to the descendants of the first choice and proceeds by backtracking. The best-first search sorts the nodes in the frontier list by increasing values of an evaluation function, $f(n)$, that incorporates domain-specific information; this is a generic way of referring to the class of informed search methods. A key component in best-first algorithms is a heuristic function, $h(n)$, which is the estimated cost of the cheapest path from $n$ to a goal node (Sharma, 2008).

The Best-First Search Algorithm procedure (Sharma, 2008) is shown in Figure 3.9 and processed as follows:

Step 1: Start with OPEN holding the initial state

Step 2: Pick the best node on OPEN

Step 3: Generate its successors

Step 4: For each successor DO

 If it has not been generated before, evaluate it, add it to OPEN and record its parent.
 If it has been generated before, change the parent if this new path is better, and in that case update the cost of getting to any successor nodes.


Step 5: If a goal is found or no nodes are left in OPEN, quit; otherwise return to Step 2.

Figure 3.9: Diagram of Best-First search method. Numbers are the nodes and the links between nodes are branches.

The main advantage of BFS is that “they are saved to enable revisits if an impasse occurs on the apparent best path” (Turban and Frenzel, 1992). BFS was used in section 4.3 when doing attribute selection, to select the best three attributes from a pool of attributes for the study. A minimal sketch of the procedure follows.
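The following is a minimal sketch of the best-first procedure above: an OPEN list kept ordered by the heuristic h(n), with parents recorded for path recovery. The graph and h values are illustrative only.

import heapq

def best_first_search(graph, h, start, goal):
    open_list = [(h[start], start)]         # Step 1: OPEN holds the initial state
    parents, seen = {start: None}, {start}
    while open_list:
        _, node = heapq.heappop(open_list)  # Step 2: pick the best node on OPEN
        if node == goal:                    # Step 5: goal found, rebuild the path
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for succ in graph.get(node, []):    # Steps 3-4: generate and record successors
            if succ not in seen:
                seen.add(succ)
                parents[succ] = node
                heapq.heappush(open_list, (h[succ], succ))
    return None                             # no more nodes left in OPEN

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
h = {"A": 3, "B": 2, "C": 1, "D": 0}
print(best_first_search(graph, h, "A", "D"))  # ['A', 'C', 'D']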

3.6.2 Ranker Search (RS)

According to Dinakaran and Thangaiah (2013), the Ranker method “ranks attributes by their individual evaluations”; it is used in conjunction with attribute evaluators and has a parameter to generate a ranking (true or false). A threshold can be set by which attributes are discarded from the set of attributes when using the Ranker search method. The Ranker search method ranks the attributes of the supplied dataset according to the selected attribute evaluator. “Ranker is providing a rating to the attributes in order by their score to the evaluator” (Kohavi and John, 1997). The Ranker search method was used in section 4.3 when doing attribute selection, to select the best three attributes among the rest of the attributes in the study.


3.6.3 Greedy-Best Search (GBS)

The Greedy-best search uses, as an evaluation function, $f(n) = h(n)$, sorting nodes in the frontier list by increasing values of $f(n)$. “The Greedy-best search tries to expand the node that is closest to the goal, on the grounds that this is likely to lead to a solution quickly” (Russell and Norvig, 2003). Greedy algorithms often perform very well; they tend to find good solutions quickly, although these are not always optimal.

The main advantage of the Greedy-best search method is that it maximizes short-term advantage without worrying about long-term consequences. This search method was used when selecting the best two attributes from a pool of attributes for this study.

3.7 CROSS-VALIDATION (CV)

Cross-validation is a model validation technique; in other words, CV is used to assess how a model will generalize to an independent data set. One decides on a fixed number of folds for the dataset: with 10 folds, a dataset of n instances is broken into 10 sets of size n/10. In each turn, one fold is used for testing and the remaining folds for training. In this study 10-fold cross-validation was used, in which 9 folds of the data were used for training and 1 fold of the data was used for testing (re-evaluating the models).

The procedure for the 10 folds Cross-validation (Tan et al. 2006) is as follows:

Step 1: Break data into 10 sets of size 푛/10.

Step 2: Train on 9 datasets and test on 1.

Step 3: On the next iterations, ensure each of the 10 folds has been used as a test set.

Step 4: Repeat 10 times and take a mean accuracy.

This approach has the advantage “that all the data can be used for training and none has to be held back in a separate test set” (Allen, 1974). A minimal sketch follows.
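The following is a minimal sketch of 10-fold cross-validation, assuming scikit-learn; synthetic stand-in data is generated here, whereas in the study X and y would be the encoded traffic attributes and target concepts.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in for the nominal traffic data.
X, y = make_classification(n_samples=200, n_classes=3, n_informative=4, random_state=0)

# Each of the 10 folds serves once as the test set while the other 9 train the model.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores, scores.mean())   # one accuracy per fold, then the mean (Step 4)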


3.8 SUPERVISED LEARNING

Cunningham et al. (2008) define supervised learning as “learning in which mapping occurs between a set of input and an output data, and using the mapping to predict outputs for unseen data. Based on this training data, the algorithm has to generalize in such a way that it is able to correctly classify or predict all possible inputs”. Supervised learning of actions occurs when the agent is given immediate feedback about the value of each action (Talwar and Kumar, 2013).

The main advantage of supervised learning is that all classes or analogue outputs manipulated by the algorithm are meaningful to humans. “It can be easily used for discriminative pattern classification, and for data regression” (Sathya and Abraham, 2013). In this study, supervised learning was used in all the experiments in section 4.3.

3.9 DISCUSSION OF THE SELECTED ALGORITHMS OF THE STUDY

The Decision Tree (C4.5), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM) algorithms, together with the ensemble learning methods Bagging, Boosting, Stacking and Random Forest, are used as effective tools for this study. These algorithms are used in section 4.3 for the experiments.

Decision Trees, the Multi-Layer Perceptron and the Support Vector Machine are all “supervised learning algorithms” (Maimon and Rokach, 2005). With Decision Trees it is easy to compute expected values, and more than one decision maker can easily be involved in the decision process (Kohavi and Quinlan, 2002). The C4.5 algorithm performs well in the presence of noise (Tan et al. 2006). C4.5 trees, especially smaller-sized ones, are relatively easy to interpret, and C4.5 is able to generate understandable rules.

The MLP is one of the preferred techniques for gesture recognition, and it yields the required decision function directly via training (Su et al. 1996). The drawback of the MLP method is that it is difficult to interpret its model and it is too complex. The MLP and C4.5 can be compared since both are able to handle interactions between variables. The MLP does not make any assumptions regarding the underlying probability density functions (Cybenko, 1989).

The SVM performs well on data sets that have many attributes, even if there are very few instances on which to train the model. Compared to the MLP, the SVM produces results quicker (Kantardzic, 2011). The SVM can model complex, real-world problems such as text and image classification. SVMs allow explicit control over the complexity of the derived models by tuning some


parameters (Kordon, 2009). When constructing the SVM in section 4.3, the parameters that were set were gamma and cost: the cost parameter was kept constant while the gamma value was changed five times to obtain more than one model.

An ensemble method constructs a set of base classifiers from the training data and performs classification by taking a vote on the predictions made by each base classifier (Tan et al. 2006). The ensemble method can be used to combine weak base classifiers to produce good results.

The ensemble methods used in this study begin with Bagging. Bagging aggregates multiple models of a base learner, trained on bootstrap samples, to improve the performance of a model. The Bagging method is recommended for unstable base classifiers; it seeks to reduce the error due to the variance of the base classifiers, although it is not easy to interpret the results of Bagging. Boosting is a general method for improving the performance of classification (Sewell, 2007) by reducing the error of any weak learning algorithm. The boosting method works by weighted majority voting, and tends to fail if given insufficient data. “It is evident that bagging is noise-tolerant and it is not so accurate, while boosting is susceptible to noisy data and can also produce a bad class probability estimate” (Sweety, 2013). Unlike bagging and boosting, which use a single algorithm, stacking is used to combine various different predictors constructed using different algorithms (such as C4.5, SVM and MLP). Stacking improves the prediction accuracy; this ensemble method is used to improve the prediction accuracy of weak learning classifiers (Rokach, 2010). All the algorithms were selected for the study on the strength of their ability to improve the performance of the constructed models.

3.10 CHAPTER CONCLUSION

Quantitative research methods were used in this study. Face-to-face interviews were conducted during the site visit to collect vehicle traffic flow data from the primary source; the interviews were carried out using a set of questionnaires. This chapter described the research methodology, including the population, sampling, data collection methods and the algorithms used to construct the prediction models. The selected algorithms for the study are Decision Trees (C4.5), the Multi-Layer Perceptron (MLP), the Support Vector Machine (SVM) and the ensemble learning methods Bagging, Boosting, Stacking and Random Forest. These algorithms were selected for their strengths, which could help to construct the best model for predicting the vehicle traffic flow. All the selected algorithms and search methods are applied in section 4.3.


CHAPTER 4: EXPERIMENTS AND RESULTS

4.1 INTRODUCTION

This chapter contains all the experiments which were conducted for the study using the Waikato Environment for Knowledge Analysis (WEKA). WEKA is open source software that contains a number of machine learning algorithms. In the experiments, three machine learning algorithms, attribute selection and ensemble learning methods were used. All the results obtained from the construction of the vehicle traffic flow prediction models are compared, and the best prediction model was chosen.

4.2 DATA PRE-PROCESSING

The data used for the study was collected from MTM, as shown in detail in Annexure 4A, and was pre-processed using a Microsoft Excel spreadsheet. The data was in a numeric format, as shown in Table 4.1, which displays a sample of 3 instances of the vehicle traffic flow data.

The number of instances used to build the models was 15920. The data was obtained as a Microsoft Excel spreadsheet and contained numerical values that represented attributes and instances that influence traffic flow. The date is represented as DayOfWeek (DOW), the time of day as TravelTime (TT), the total number of cars on the road as TrafficVolume (TV) and the direction 1 speed as AverageSpeed (AS), as shown in Table 4.1. The study used data for weekdays and excluded data for public holidays and weekends. The sample of the numeric data before pre-processing is shown in detail in Annexure 4B.

Table 4.1: A sample of 3 instances before data pre-processing.

Instance   Date         Time of day   Total no. of vehicles (TrafficVolume)   AverageSpeed (Direction 1)
1          2013-01-28   14:00:00      4169                                     100
2          2013-01-28   15:00:00      6719                                     100
3          2013-01-28   16:00:00      7660                                     100

The data in Table 4.1 was collected from MTM in numerical format. The conversion procedure for converting traffic data from numeric to nominal values is shown in section 4.2.1.


4.2.1 Conversion of numeric vehicle traffic flow data to nominal values

The conversion of the MTM numeric data to nominal data was performed using a Microsoft Excel spreadsheet. The attributes used and their value ranges when converting data to nominal values for the study are as follows: DayOfWeek (28-JAN-2013 to 28-DEC-2015); TravelTime (Peak (06:00-09:00 and 15:00-18:00) and Off-Peak (09:00-15:00 and 18:00-06:00)); AverageSpeed (Low-Speed (<= 98 km/hr), Average-Speed (> 98 and <= 107 km/hr) and High-Speed (> 107 km/hr)); and TrafficVolume (Low-Traffic (< 3500), Average-Traffic (>= 3500 and <= 5000) and Heavy-Traffic (> 5000)). These attributes mean the following: TravelTime is the time at which the vehicles were travelling, either in the peak or the off-peak period; the AverageSpeed attribute is the speed at which the vehicles were travelling; and the TrafficVolume attribute captures the number of vehicles that were passing on the road.

The step-by-step process that was used to convert the traffic data from numeric to nominal values follows Molupe (2014):

1. There are 15920 instances that were converted from numeric data to nominal values.

2. Converting the TravelTime variable to nominal values using an MS Excel spreadsheet formula was done as follows:

“=IF (AND (J3>=1; J3<=5);"Off-Peak";

IF (AND (J3>=6; J3<=9);"Peak";

IF (AND (J3>=10; J3<=14);"Off-Peak";

IF (AND (J3>=15; J3<=18);"Peak"; "Off-Peak")))) Equation 4.1

3. Type Equation 4.1 into an empty cell, P3, referencing cell J3 with the values that have to be converted to nominal values.

4. The first line in Equation 4.1 means that if the value in cell J3 is greater than or equal to 1 and less than or equal to 5, it is converted to “Off-Peak”; the second line means that if cell J3 is greater than or equal to 6 and less than or equal to 9, it is converted to “Peak”; the third line means that if cell J3 is greater than or equal to 10 and less than or equal to 14, it is converted to “Off-Peak”; otherwise, if the value in cell J3 is greater than or equal to 15 and less than or equal to 18, it is converted to “Peak”; and the fourth line means any remaining value is converted to “Off-Peak”.

5. To auto-fill all the instances in column P, drag the formula down to fill over all the instances


with the TravelTime formula.

6. Converting the TrafficVolume variable to nominal values using an MS Excel spreadsheet formula:

=IF (K3<3500;"Low-Traffic";

IF (AND (K3>=3500; K3<=5000); "Average-Traffic"; "Heavy-Traffic")) Equation 4.2

7. Type Equation 4.2 into an empty cell, Q3, referencing cell K3 with the numeric values to convert them to nominal values.

8. The first line in Equation 4.2 means that if the value in cell K3 is less than 3500, then the vehicle traffic volume is classed as “Low-Traffic”; the second line means that if the value in cell K3 is greater than or equal to 3500 and less than or equal to 5000, then the class should be “Average-Traffic”, and if the value of K3 is greater than 5000, then the class is “Heavy-Traffic”.

9. To auto-fill all the instances in column Q, drag the formula down to fill over all the instances with the TrafficVolume formula.

10. Converting the AverageSpeed numeric data to nominal values using an MS Excel spreadsheet formula:

=IF (L3<=98;"Low-Speed";

IF (L3<=107;"Average-Speed";"High-Speed")) Equation 4.3

11. The first line in Equation 4.3 means that if the value in cell L3 is less than or equal to 98, then the vehicles are travelling at “Low-Speed”; the second line means that if the value in cell L3 is less than or equal to 107, then the vehicles are travelling at “Average-Speed”; otherwise, if the value in L3 is greater than 107, the vehicles are travelling at “High-Speed”.

12. To auto-fill all the instances in column R, drag the formula down to fill over all the instances with the AverageSpeed formula. The next step is predicting the TargetConcepts of the 15920 instances from the nominal values:

=IF (AND (P3="Off-Peak"; Q3="Heavy-Traffic"; R3="Low-Speed");"Congested";

IF (AND (P3="Peak"; Q3="Heavy-Traffic"; R3="Low-Speed");"Congested";

IF (AND (P3="Peak"; Q3="Low-Traffic"; R3="High-Speed");"Freeflow";

IF (AND (P3="Peak"; Q3="Heavy-Traffic"; R3="Average-Speed");"FlowingCongestion";

IF (AND (P3="Peak"; Q3="Heavy-Traffic"; R3="High-Speed");"FlowingCongestion";

IF (AND (P3="Peak"; Q3="Average-Traffic"; R3="Low-Speed");"Congested";


IF (AND (P3="Off-Peak"; Q3="Heavy-Traffic"; R3="High-Speed");"Freeflow";

IF (AND (P3="Peak"; Q3="Heavy-Traffic"; R3="Average-Speed"); "FlowingCongestion";" Freeflow")))))))) Equation 4.4

13. In cell S3, type Equation 4.4 to assign the predicted target concepts to the converted values and thus predict the state of the traffic.

14. To auto-fill all the instances in column S, drag the formula down to fill over all the instances with the TargetConcept formula.

15. Each line in Equation 4.4 means that if the values in cells P3, Q3 and R3 match a given combination (e.g. Off-Peak, Heavy-Traffic and Low-Speed), the instance is assigned the corresponding target concept (Congested, FlowingCongestion or Freeflow).

16. Save the file as comma delimited (*.csv). Note that MS Office 2013 does not save the file in the comma-delimited format that is needed; it saves it with semicolon (;) separators. After saving the file in *.csv format, open it with Notepad, then under ‘find’ type a semicolon (;), in the replace textbox type a comma (,), and click the Replace All button. This removes the semicolons and replaces them with commas so that, when the file is uploaded to WEKA, it appears in the expected comma-delimited (*.csv) format. The pandas sketch below performs the equivalent conversion.
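The following is a minimal pandas sketch equivalent to the Excel formulas above (Equations 4.1 to 4.3), converting the numeric columns to nominal values. The file and column names (traffic_numeric.csv, Hour, Volume, Speed) are illustrative assumptions, not the MTM export's actual headings.

import pandas as pd

df = pd.read_csv("traffic_numeric.csv")   # hypothetical export of the MTM data

def travel_time(hour):
    # Equation 4.1: 06:00-09:00 and 15:00-18:00 are Peak, all else Off-Peak.
    return "Peak" if (6 <= hour <= 9) or (15 <= hour <= 18) else "Off-Peak"

def traffic_volume(v):
    # Equation 4.2: < 3500 Low, 3500-5000 Average, > 5000 Heavy.
    if v < 3500:
        return "Low-Traffic"
    return "Average-Traffic" if v <= 5000 else "Heavy-Traffic"

def average_speed(s):
    # Equation 4.3: <= 98 Low, <= 107 Average, otherwise High.
    if s <= 98:
        return "Low-Speed"
    return "Average-Speed" if s <= 107 else "High-Speed"

df["TravelTime"] = df["Hour"].apply(travel_time)
df["TrafficVolume"] = df["Volume"].apply(traffic_volume)
df["AverageSpeed"] = df["Speed"].apply(average_speed)
df.to_csv("traffic_nominal.csv", index=False)   # comma-delimited, ready for WEKA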

The data in Table 4.2 is the data from Table 4.1 converted to nominal values. The vehicle traffic flow data was categorized into three target concepts: Freeflow, which means that cars are travelling at a speed greater than 107 km/hr; FlowingCongestion, which means that cars enter the congestion state at speeds between 98 km/hr and 107 km/hr; and Congested, which means that cars are travelling at a speed less than or equal to 98 km/hr. The sample of the nominal data after pre-processing is shown in detail in Annexure 4C.


Table 4.2: A sample of 3 instances after data pre-processing.

Instance   DayOfWeek   TravelTime   TrafficVolume     AverageSpeed    TargetConcept
1          DOW-Jan     Off-Peak     Average-Traffic   Average-Speed   Freeflow
2          DOW-Jan     Peak         Heavy-Traffic     Average-Speed   FlowingCongestion
3          DOW-Jan     Peak         Heavy-Traffic     Average-Speed   FlowingCongestion

4.3 EXPERIMENTS AND RESULTS

In this section, prediction models were constructed using attribute selection, machine learning algorithms and ensemble learning methods. The prediction models were created in WEKA using the training and testing data. Ensemble learning models of predictors were constructed, and these models were used to predict the status of vehicle traffic on the freeway.

4.3.1 Experiment for Attribute Selections

In this section of the experiments, the “Ranker” and “GreedyStepwise” search methods were used to rank and select attributes. These attributes were used later to construct the vehicle traffic prediction models. The “GreedyStepwise” search method was used to select the best two attributes, and “Ranker” was used to select the best three attributes. These search methods were used as they are determined by WEKA when an attribute evaluator is chosen during attribute selection. Please note that all the confusion matrices that are not shown in the experiments section are included in Annexure 4D for this chapter.


4.3.1.1 Experiment 1: A model constructed using 2 attributes per model.

The procedure for attribute selection allows one to discard irrelevant attributes and thus reduce the dimensionality of the dataset. It also allows a combination of two attributes per model to be selected, as follows:

Step 1: Load data

1. The training dataset was cleaned and organized using MS Excel.
2. Click on the “Explorer” button.
3. On the Pre-process tab, click on the “Open File” button, then select and load the dataset *.CSV file. The “Visualize all” button will bring up a screen showing all the distributions at once, as shown in Figure 4.1. “The Attributes frame allows the user to modify the set of attributes using select and remove options”.
4. Click the Save button to save the dataset file with the “*.arff” extension.

NB: All the experiments that follow start with the same steps (from 1 to 4). Thus in subsequent experiments steps 1 to 4 are assumed to be in place.


Figure 4.1: Screen shot of the traffic nominal data after it has been loaded on WEKA.

Step 2: Attribute Selection (2)

5. Click on the “Select attributes” tab as shown in Figure 4.2.
6. Under the “Attribute Evaluator” frame, click “Choose” and select the relevant evaluation method (e.g. “CfsSubsetEval”).
7. Under the “Search Method” frame, select the relevant search method, in this case “GreedyStepwise”.
8. In the Attribute Selection Mode frame, choose “Use full training set” and then click on the Start button.


Figure 4.2: The results of search method when selecting a combination of 2 best attributes.

Figure 4.3 shows the results of the model constructed from the 2 best attributes (TV and AS) selected by attribute selection.

Step 3: First use the best combination of 2 attributes to build the model, then use the other attribute sets.

=== Attribute Selection on all input data ===

Search Method: Greedy Stepwise (forwards).
Start set: no attributes
Merit of best subset found: 0.806

Attribute Subset Evaluator (supervised, Class (nominal): 5 TargetConcept):
CFS Subset Evaluator
Including locally predictive attributes


Selected attributes: 3,4 : 2
TrafficVolume
AverageSpeed

Figure 4.3: The 2 attributes that have been selected to be the best out of all the attributes.

9. On the Pre-process tab, under “Attributes”, highlight the 2 attributes to be removed, then click the “Remove” button.
10. You will be left with 2 attributes and the TargetConcept.
11. On the Classify tab, click the Choose button under “Classifier”, select a classifier (e.g. C4.5 decision trees), then click on the Start button.
12. Right click on the “Result list” and “Save model”.
13. Repeat the same steps for all the remaining attributes until 6 models are designed using the available attributes.
14. Evaluate all the results and then compare their confusion matrices and RMSE.

Table 4.3 shows the vehicle traffic flow prediction performance when using different combinations of 2 attributes per model. The combination of TT and AS obtained the highest prediction result of 94.667% and also had a good RMSE of 0.153.

Table 4.3: The results after creating models using different combinations of 2 attributes from the dataset.

#   Algorithm   Attributes    Correctly predicted instances (%), Training   RMSE
1   C4.5        TV and AS     83.256                                        0.291
2   C4.5        DOW and TT    80.006                                        0.293
3   C4.5        TT and AS     94.667                                        0.153
4   C4.5        DOW and TV    67.644                                        0.404
5   C4.5        DOW and AS    80.665                                        0.316
6   C4.5        TT and TV     84.189                                        0.261

Table 4.4 shows the confusion matrix of a model constructed when a combination of TV and AS attributes was used with the C4.5 algorithm. Each row in a confusion matrix represents the instances in an actual state of traffic, and columns represent the instances in a predicted target concept. In the confusion matrix ‘Freeflow’, ‘FlowingCongestion’ and ‘Congested’ are represented by a, b and c, respectively and the same format will be used to present all the


following confusion matrices. The values of the confusion matrix show that 7165 Freeflow instances were correctly predicted as Freeflow and 14 were incorrectly predicted as Congested. All 1756 FlowingCongestion instances were incorrectly predicted as Freeflow. Of the Congested instances, 1671 were correctly predicted as Congested and 7 were incorrectly predicted as Freeflow.

Table 4.4: The confusion matrix for a model constructed from the TV and AS attributes.

            Predicted
            a       b       c
Actual  a   7165    0       14
        b   1756    0       0
        c   7       0       1671

4.3.1.2 Experiment 2: A model constructed using 3 attributes per model.

The procedure for attribute selection, where a combination of three attributes per model was used, is as follows.

The procedure for uploading the dataset into WEKA is as follows:

Step 1: Load data

The procedure is the same as in Experiment 1 (Step 1) covered in section 4.3.1.1.

Step 2: Attribute selection

5. Click on the “Select attributes” tab as shown in Figure 4.4. This option allows you to discard irrelevant attributes and reduce the dimensionality of your dataset.
6. Under the “Attribute Evaluator” frame, click “Choose” to select the relevant evaluation method (e.g. “InfoGainAttributeEval”).
7. Under the “Search Method” frame, click on the “Choose” button to choose the relevant search method (e.g. “Ranker”).
8. In the Attribute Selection Mode frame, choose either “Use full training set” or “Cross-validation”, then click on the Start button. In this case the full training set was used.
9. The output result set will be displayed, with the attributes ranked in order of effectiveness.


Figure 4.4: Message displayed when trying to use “InfoGainAttributeEval” to select best performing attributes.

 Figure 4.4 shows a message that is displayed when you have selected “InfoGainAttributeEval” under Attribute Evaluator, alerting you to the search method the attribute evaluator prefers to use, which is “Ranker”.
 The result set shows the attributes ranked in order; TT (TravelTime) is ranked highest, meaning it will give good results when selected in combination with other attributes, as shown in Figure 4.5.

Step 3: Use selected attributes in step 2 or a different attribute combination in building the models

In Figure 4.5, click on the Pre-process tab to remove the lowest-ranked feature: click the check box next to the feature, then click the “Remove” button. Next, click the Classify tab, click “Choose” in the populated drop-down menu to select the preferred algorithm (e.g. C4.5) to classify the dataset, then click the “Start” button.


10. On the “Result list”, right-click and select “Save model”. In the “Classifier output” box you will be able to see the “Confusion Matrix”, “Detailed Accuracy by Class”, the “Root mean squared error” and the “Correctly classified instances”.
11. After saving the model, click on the Pre-process tab and click the “Undo” button to restore all the original features of the dataset.
12. Click on the second-lowest feature as ranked on the “Select attributes” tab, then click the Classify tab to do the classification, choose the classifier, click the Start button and save the model.
13. Click on the Pre-process tab and repeat steps 10, 11 and 12 to perform classification on the dataset using the different 3-attribute sets, so that the results for all the attribute combinations can be analysed.

Figure 4.5: Results of the automated feature selection using 3 attributes per model to train the dataset.

Table 4.5 shows the results when 3 attributes per model were used for the prediction model. The prediction performance and RMSE results show that the attribute set TT, TV and AS had the lowest RMSE of 0.011 and the highest correctly predicted instances of 99.981%.


Table 4.5: The results after creating a model using only 3 attributes per model.

#   Algorithm   Attributes        Correct prediction instances, Training (%)   RMSE
1   C4.5        TT, TV and AS     99.981                                        0.011
2   C4.5        DOW, TV and AS    94.667                                        0.153
3   C4.5        DOW, TT and AS    83.256                                        0.291
4   C4.5        DOW, TT and TV    84.189                                        0.261

Table 4.6 shows the confusion matrix of a C4.5 model constructed from the combination of the TT, TV and AS attributes. The values of the confusion matrix show that 7177 instances were correctly predicted as Freeflow and 2 were incorrectly predicted as Congested. Table 4.6 also shows that all 1756 FlowingCongestion instances were correctly predicted and that 1678 instances were correctly predicted as Congested.

Table 4.6: The Confusion matrix of the model constructed from TT, TV and AS.

               Predicted
               a       b       c
Actual    a    7177    0       2
          b    0       1756    0
          c    0       0       1678

4.3.2 Experiment 3: Training and testing of machine learning algorithms

In this section experiments for training and testing decision trees (C4.5), multi-layer perceptron (MLP) and support vector machine (SVM) are carried out.

Step 1: Load data

The procedure is the same as in Experiment 1 (Step 1) covered in section 4.3.1.1.

Step 2: Data classification

14. Click on the Classify tab after loading the dataset into WEKA.
15. Click the “Choose” button and select the classifier to be applied to the dataset (e.g. the C4.5 unpruned decision tree, SVM or MLP). Note that the “Test options” frame offers four options for testing the model being built: a supplied test set, the training set (you will need to specify the location of the file, in this case where the *.csv file has been saved), cross-validation, or a percentage split. In this case cross-validation was used for the experiment, as shown in Figure 4.6.
16. The resulting model, with additional information, will be displayed after you click the “Start” button. The results also include the confusion matrix, as shown in Figure 4.6.
17. On the “Result list” frame, right-click on the populated menu list and then click “Save model”.
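The same training and cross-validation run can be reproduced programmatically. The sketch below assumes a file named traffic.csv with the target concept as the last attribute; it builds WEKA's C4.5 implementation (J48), evaluates it with 10-fold cross-validation, and prints the correctly classified percentage, the RMSE and the confusion matrix reported in Figure 4.6.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainC45 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("traffic.csv"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();   // WEKA's implementation of C4.5
        tree.setUnpruned(true); // the text mentions an unpruned C4.5 tree

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1)); // 10-fold CV

        System.out.printf("Correctly classified: %.3f%%%n", eval.pctCorrect());
        System.out.printf("RMSE: %.3f%n", eval.rootMeanSquaredError());
        System.out.println(eval.toMatrixString("Confusion matrix:"));
    }
}
```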

Figure 4.6: Results set of training the model using C4.5. It shows the classification performance, confusion matrix and detailed accuracy by class.

4.3.2.1 Root mean square error (RMSE)

RMSE is a “measure of the difference between locations that are known and locations that have been interpolated. RMSE is derived by squaring the differences between known and unknown points, adding those together, dividing that by the number of test points, and then taking the square root” (Chai and Draxler, 2014) of the results as shown in Equation 4.5.

The RMSE of a prediction model with respect to the estimated variable $X_{model}$ is defined as the square root of the mean squared error, as follows:


$\mathrm{RMSE} = \sqrt{\dfrac{\sum_{i=1}^{n}\left(X_{obs,i} - X_{model,i}\right)^{2}}{n}}$    (Equation 4.5)

NB: where $X_{obs,i}$ are the observed values and $X_{model,i}$ are the modelled values at time or place $i$.

The RMSE values “can be used to distinguish model performance in a calibration period with that of a validation period as well as to compare the individual model performance to that of other predictive models” (Chai and Draxler, 2014).
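As a concrete illustration of Equation 4.5, the following small sketch computes the RMSE directly from arrays of observed and modelled values; the numbers used are hypothetical and serve only to show the calculation.

```java
// A direct computation of Equation 4.5 for illustration; the observed and
// modelled values here are hypothetical.
public class Rmse {
    static double rmse(double[] obs, double[] model) {
        double sum = 0;
        for (int i = 0; i < obs.length; i++) {
            double d = obs[i] - model[i];   // difference at point i
            sum += d * d;                   // square and accumulate
        }
        return Math.sqrt(sum / obs.length); // mean, then square root
    }

    public static void main(String[] args) {
        double[] obs   = {1.0, 0.0, 2.0, 1.0};
        double[] model = {1.0, 0.0, 2.0, 0.0};
        System.out.println(rmse(obs, model)); // prints 0.5
    }
}
```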

The results in Tables 4.7 and 4.8 show the prediction values for the C4.5, the SVM and the MLP. The results show that the C4.5 is the best model, as it had a 99.981% prediction performance and the lowest RMSE of 0.011. All the attributes were used during the construction of the models, namely travel time, traffic volume, average speed and day of week.

Table 4.7: The results for all the 2013, 2014 and 2015 vehicle traffic flow datasets. The algorithms that were used are C4.5 and MLP.

#   Algorithm   Correct prediction instances, Training (%)   RMSE Training   Predicted instances, Testing (%)   RMSE Testing
1   C4.5        99.981                                        0.011           100                                 0
2   MLP         98.342                                        0.094           100                                 0.004

Table 4.8: The results for all the 2013, 2014 and 2015 vehicle traffic flow datasets. The algorithm that was used to train and test the dataset was SVM (Libsvm).

#   Algorithm     Gamma   Correct prediction instances, Training (%)   RMSE Training   Predicted instances, Testing (%)   RMSE Testing
1   SVM(Libsvm)   0.01    93.018                                        0.216           97.229                              0.136
2   SVM(Libsvm)   0.02    95.082                                        0.181           97.229                              0.136
3   SVM(Libsvm)   0.03    98.634                                        0.095           97.229                              0.136
4   SVM(Libsvm)   0.04    98.182                                        0.110           97.229                              0.136
5   SVM(Libsvm)   0.05    99.152                                        0.075           97.229                              0.136

The value of c-Cost = 1.0 was kept constant during the experiment. SVM has five sets of results: several results were obtained for SVM by changing the gamma parameter while leaving the cost parameter constant. All the confusion matrices that show the design of the SVM and the MLP models are shown in Annexure 4D.
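The gamma sweep in Table 4.8 can be reproduced with a loop over WEKA's LibSVM wrapper, assuming the LibSVM package is on the classpath; the sketch below holds the cost parameter at 1.0 and varies gamma, as described above. The file name is an assumption.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LibSVM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GammaSweep {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("traffic.csv"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        for (double gamma : new double[]{0.01, 0.02, 0.03, 0.04, 0.05}) {
            LibSVM svm = new LibSVM(); // requires the LibSVM package in WEKA
            svm.setCost(1.0);          // c-Cost held constant at 1.0
            svm.setGamma(gamma);       // only gamma is varied

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(svm, data, 10, new Random(1));
            System.out.printf("gamma=%.2f  acc=%.3f%%  RMSE=%.3f%n",
                    gamma, eval.pctCorrect(), eval.rootMeanSquaredError());
        }
    }
}
```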

Table 4.9 shows the confusion matrix for the C4.5 algorithm on the vehicle traffic flow data. The confusion matrix values in Table 4.9 show that 7177 instances were correctly predicted as Freeflow and 2 instances were incorrectly predicted as Congested when they should have been predicted as Freeflow; 1756 instances were correctly predicted as FlowingCongestion and 1678 were correctly predicted as Congested.

Table 4.9: The Confusion matrix for training data using the C4.5 algorithm.

               Predicted
               a       b       c
Actual    a    7177    0       2
          b    0       1756    0
          c    0       0       1678

The previous sections show that although the prediction performance is good, it is not optimal. This motivated the use of the ensemble learning methods covered in section 4.3.3.

4.3.3 The Ensemble learning method experiments

In this section ensemble learning methods are used to construct prediction models from the vehicle traffic flow data. The ensemble learning methods include Bagging, Boosting, Stacking and Random Forest. All the attributes of the data were used in order to determine the best performing model. In section 4.3.1, models were designed using a combination of two and three attributes per model to evaluate which combination of attributes performs best together. The ensemble learning method enables the use of multiple models to obtain better predictive performance; it is capable of combining multiple weak models to produce strong predictive models (Brown, 2011).


4.3.3.1 Experiment 4: A model constructed using Bagging ensemble learning method.

The procedure for designing the Bagging ensemble learning model is as follows:

Step 1: Load data

The procedure is the same as in Experiment 1 (Step 1) covered in section 4.3.1.1.

Step 2: Using Bagging ensemble learning method

4. Click on the Classify tab after loading the dataset into WEKA, then click the “Choose” button and select classifier -> meta -> Bagging, as shown in Figure 4.7.
5. Click on the name of the method to the right of the “Choose” button. Click “Choose”, then select classifier -> trees -> C4.5.
6. Save the output of the model: right-click the last line of the “Result list” and select the option to save the model. Give it the name “C4.5Bagging10.out”.
7. New Bagging models can be produced with the MLP and SVM algorithms by repeating the previous steps, since a model has already been constructed using C4.5. Save the corresponding outputs in the files “SVMBagging10.out” and “MLPBagging10.out”, respectively, so that the performance of all the constructed models can be analysed.
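A minimal programmatic sketch of the same Bagging setup follows; it wraps J48 (WEKA's C4.5) in the Bagging meta-classifier and evaluates it with 10-fold cross-validation. The file name is an assumption.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BagC45 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("traffic.csv"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        Bagging bagger = new Bagging();
        bagger.setClassifier(new J48()); // C4.5 as the base learner

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(bagger, data, 10, new Random(1));
        System.out.println(eval.toSummaryString()); // accuracy, RMSE, etc.
        System.out.println(eval.toMatrixString());  // confusion matrix
    }
}
```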


Figure 4.7: The process for designing the Bagging model using the “meta” classifier with the C4.5 algorithm, together with the results buffer.

Figure 4.7 shows a model constructed using C4.5 with the Bagging ensemble learning method. The confusion matrices of the models constructed using SVM and MLP are shown in Annexure 4D.

Table 4.10 shows the results of the models constructed using the Bagging ensemble learning method. Bagging with C4.5 gave the best vehicle traffic flow prediction of 99.981% and an RMSE of 0.01.


Table 4.10: The results of the Bagging method constructed using the C4.5, SVM and MLP.

#   Algorithm   Attributes            Correct prediction instances, Training (%)   RMSE Training   Predicted instances, Testing (%)   RMSE Testing
1   C4.5        TT, TV, AS and DOW    99.981                                        0.010           100                                 0
2   MLP         TT, TV, AS and DOW    98.483                                        0.088           99.286                              0.056
3   SVM         TT, TV, AS and DOW    93.046                                        0.210           97.229                              0.136

Table 4.11 shows the confusion matrix when all the attributes are used with the Bagging ensemble learning method and the C4.5 algorithm. The values of the confusion matrix show that 7177 instances were correctly predicted as Freeflow, whereas 2 were incorrectly predicted as Congested. One thousand seven hundred and fifty-six (1756) instances were correctly predicted as FlowingCongestion and 1678 instances were correctly predicted as Congested.

Table 4.11: The Confusion matrix for C4.5 Bagging ensemble learning method.

               Predicted
               a       b       c
Actual    a    7177    0       2
          b    0       1756    0
          c    0       0       1678


4.3.3.2 Experiment 5: A model constructed using the Boosting ensemble learning method.

The procedure for designing the Boosting ensemble learning model is as follows:

Step 1: Load data

The procedure is the same as in Experiment 1 (Step 1) covered in section 4.3.1.1.

Step 2: Using Boosting to classify

1. Click on the Classify tab after loading the dataset into WEKA, then click the “Choose” button and select classifier -> meta -> AdaBoostM1, as shown in Figure 4.8.
2. Click on the name of the method to the right of the “Choose” button. Click “Choose”, then select classifier -> trees -> C4.5.
3. Save the output of the model: right-click the last line of the “Result list” and select the option to save the model. Give it the name “C4.5Boosting10.out”.
4. New Boosting models can be produced with the SVM and MLP algorithms by repeating steps 1 to 3. Save the outputs in the files “SVMBoost10.out” and “MLPBoost10.out”, respectively.
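The equivalent Boosting setup in WEKA's Java API is sketched below, with AdaBoostM1 wrapping J48 (C4.5); the file name is an assumption, and the base learner can be swapped for MLP or SVM as in steps 1 to 4.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostC45 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("traffic.csv"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new J48()); // boost C4.5; MLP or SVM can be substituted

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(booster, data, 10, new Random(1));
        System.out.println(eval.toSummaryString()); // accuracy, RMSE, etc.
    }
}
```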


Figure 4.8: Results of the Boosting ensemble learning method designed using the C4.5 algorithm. The results buffers are saved for analysis. The results output for SVM and MLP are shown in Annexure 4D.

Table 4.12 shows the results of the Boosting ensemble learning method with the C4.5, MLP and SVM algorithms. The results showed that Boosting, when used together with the MLP algorithm, gives the highest prediction of vehicle traffic flow instances of 99.991% and the lowest RMSE of 0.004.


Table 4.12: Results obtained during the construction of models using the Boosting learning method with the C4.5, MLP and SVM algorithms.

#   Algorithm   Attributes            Correct prediction instances, Training (%)   RMSE Training   Predicted instances, Testing (%)   RMSE Testing
1   C4.5        DOW, TT, TV and AS    99.981                                        0.011           100                                 0
2   MLP         DOW, TT, TV and AS    99.991                                        0.004           99.286                              0.066
3   SVM         DOW, TT, TV and AS    99.708                                        0.048           99.943                              0.052

Table 4.13 shows the confusion matrix for all attributes when using the C4.5 Boosting ensemble learning model. The values of the confusion matrix show that 7177 instances were correctly predicted as Freeflow, whereas 2 were incorrectly predicted as Congested. One thousand seven hundred and fifty-six (1756) instances were correctly predicted as FlowingCongestion and 1678 were correctly predicted as Congested. The confusion matrix table only displays results for Boosting with the C4.5 algorithm; the confusion matrices for SVM and MLP are shown in Annexure 4D. The results in Table 4.12 mean that Boosting performs best with the MLP algorithm in terms of the predicted instances and the RMSE value.

Table 4.13: The confusion Matrix for Boosting learning method when using the C4.5 algorithm.

               Predicted
               a       b       c
Actual    a    7177    0       2
          b    0       1756    0
          c    0       0       1678


4.3.3.3 Experiment 6: A model constructed using Stacking ensemble learning method.

The model was constructed using the vehicle traffic flow data and was then saved for later use in predicting new vehicle traffic flow instances. The procedure for constructing the Stacking ensemble learning model is as follows:

Step 1: Load data

The procedure is the same as in Experiment 1 (Step 1) covered in section 4.3.1.1.

Step 2: Using three classifiers in the ensemble

5. Click on the Classify tab after loading the dataset into WEKA, then click the “Choose” button and select classifier -> meta -> Stacking.
6. Click on the name of the method to the right of the “Choose” button. Click “Choose”, then select classifier -> trees -> C4.5, as shown in Figure 4.9. Click the “Start” button to run the classifier, then save it as “stackingc4.5.out”.
7. Click on the name of the method to the right of the “Choose” button. Click “Choose”, then select classifier -> functions -> LibSVM, as shown in Figure 4.10. Click the “Start” button to run the classifier, then save it as “stackingsvm.out”.
8. Click on the name of the method to the right of the “Choose” button. Click “Choose”, then select classifier -> functions -> MultilayerPerceptron. Click the “Start” button to run the classifier, then save it as “stackingmlp.out”.


Figure 4.9: Screen to add more than one algorithm for stacking ensemble prediction model.

Step 3: Using the Stacking ensemble for prediction

9. Click on the name of the method to the right of the “Choose” button. In the “weka.gui.GenericArrayEditor” window, choose classifier -> add; repeat the process until all the classifiers (C4.5, SVM and MLP) wanted in the ensemble learning method have been added, as shown in Figure 4.11. Click the “Start” button to run the model and save it as “stacking1.out”.
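A sketch of the same Stacking configuration using WEKA's Java API follows: the three base classifiers are supplied as an array and C4.5 is set as the meta classifier. The file name is an assumption.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LibSVM;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StackThree {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("traffic.csv"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        Stacking stacker = new Stacking();
        stacker.setClassifiers(new Classifier[]{
                new J48(), new LibSVM(), new MultilayerPerceptron()}); // base learners
        stacker.setMetaClassifier(new J48()); // C4.5 as the meta classifier

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(stacker, data, 10, new Random(1));
        System.out.println(eval.toSummaryString()); // accuracy, RMSE, etc.
    }
}
```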

Figure 4.10: Displaying algorithms added to perform stacking ensemble prediction model.


Figure 4.11: After adding C4.5 algorithm as a “metaClassifier” to construct stacking ensemble prediction model.

Table 4.14 shows the results of a Stacking model constructed using the C4.5, SVM and MLP algorithms. Stacking with MLP or SVM as the meta classifier showed a prediction rate of 99.981% with an RMSE of 0.01. A meta classifier is described as wrapping around base learning algorithms to widen the applicability or enhance the performance of the model (Hall et al., 2009).

Table 4.14: Results obtained when constructing a model using Stacking with C4.5, SVM and MLP.

#   Algorithm                           Attributes            Correct classification instances (%)   RMSE Training   Predicted instances, Testing (%)   RMSE Testing
1   C4.5 (Stacking C4.5, MLP and SVM)   DOW, TT, TV and AS    97.487                                  0.131           100                                 0
2   MLP (Stacking MLP, SVM and C4.5)    DOW, TT, TV and AS    99.981                                  0.010           100                                 0
3   SVM (Stacking SVM, MLP and C4.5)    DOW, TT, TV and AS    99.981                                  0.011           100                                 0

Table 4.15 shows the confusion matrix for all attributes when using the Stacking ensemble learning method combining the C4.5, SVM and MLP algorithms, with C4.5 as the meta classifier. The values of the confusion matrix showed that 7177 instances were correctly predicted as Freeflow, whereas 2 were incorrectly predicted as Congested. One thousand seven hundred and fifty-six (1756) instances were correctly predicted as FlowingCongestion. In addition, 1408 instances were correctly predicted as Congested and 270 instances were incorrectly predicted as Freeflow.

Table 4.15: Confusion matrix for the Stacking ensemble learning method combining the C4.5, SVM and MLP algorithms, where the C4.5 algorithm was used as the meta classifier.

               Predicted
               a       b       c
Actual    a    7177    0       2
          b    0       1756    0
          c    270     0       1408

4.3.3.4 Experiment 7: A model constructed using Random Forest ensemble learning method.

The model was constructed using the vehicle traffic flow data and its results were then evaluated. The procedure for constructing the Random Forest ensemble prediction model is as follows:

Step 1: Load data

The procedure is the same as in Experiment 1 (Step 1) covered in section 4.3.1.1.

Step 2: Using the Random Forest ensemble classifier

4. Click on the Classify tab after loading the dataset into WEKA, then click the “Choose” button and select classifier -> trees -> RandomForest.
5. Click on the name of the method to the right of the “Choose” button, then set numTrees and numFeatures as shown in Figure 4.12. Click the “Start” button to run the classifier, then save it as “Randomforest1.out”. Repeat this step with numTrees set to 3 and 5 to design “Randomforest3.out” and “Randomforest5.out”.
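A programmatic sketch of the Random Forest runs follows, sweeping numTrees over 1, 3 and 5 with numFeatures fixed at 5; the setNumTrees/setNumFeatures calls match the WEKA 3.6/3.7-era API shown in Figure 4.12, and the file name is an assumption.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ForestSweep {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("traffic.csv"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        for (int numTrees : new int[]{1, 3, 5}) {
            RandomForest rf = new RandomForest();
            rf.setNumTrees(numTrees); // numTrees parameter, as in Figure 4.12
            rf.setNumFeatures(5);     // numFeatures parameter, as in Figure 4.12

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(rf, data, 10, new Random(1));
            System.out.printf("numTrees=%d  acc=%.3f%%  RMSE=%.3f%n",
                    numTrees, eval.pctCorrect(), eval.rootMeanSquaredError());
        }
    }
}
```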


Figure 4.12: Parameters set when designing the Random Forest models: numTrees set to 1 and numFeatures set to 5.

Figure 4.13: The results buffer where numTrees was set to “1” and numFeatures to 5.


Table 4.16 shows the results of the models constructed using the Random Forest ensemble learning method, with numTrees set to 1, 3 and 5 and numFeatures set to 5. The RF models achieved a prediction performance of 99.991% and an RMSE as low as 0.005.

Table 4.16: Results obtained when constructing models using the Random Forest ensemble method, where the numTrees parameter was tuned to 1, 3 and 5.

#   Learning Algorithm   numTrees   Attributes            Correct classification instances (%)   RMSE Training   Predicted instances, Testing (%)   RMSE Testing
1   RF                   1          TT, TV, AS and DOW    99.991                                  0.008           100                                 0
2   RF                   3          TT, TV, AS and DOW    99.991                                  0.005           100                                 0
3   RF                   5          TT, TV, AS and DOW    99.991                                  0.005           100                                 0.004

The results in Table 4.17 show that 7178 instances were correctly predicted as Freeflow, whereas 1 was incorrectly predicted as Congested. One thousand seven hundred and fifty-six (1756) instances were correctly predicted as FlowingCongestion and then 1678 instances were correctly predicted as Congested.

Table 4.17: Confusion matrix for the Random Forest model when numTrees was set to 1.

               Predicted
               a       b       c
Actual    a    7178    0       1
          b    0       1756    0
          c    0       0       1678

The confusion matrices for all the models constructed from the rest of the algorithms are included in Annexure 4D.


4.3.4 Summary of the prediction performance and the RMSE results

In this section, the prediction performance and RMSE results from Section 4.3 are plotted.

Figures 4.14 and 4.15 show the results where attribute selection (a combination of 2 attributes per model) was used to determine the best performing attributes. The Decision tree (C4.5) algorithm and the vehicle traffic flow data were used to construct the models. The attributes were selected using the feature selection option in the WEKA software.

[Bar chart: Prediction performance (%) vs 2-attribute combinations; the values plotted are those in Table 4.3.]

Figure 4.14: The prediction performance of a combination of 2 attributes per model, using the Decision trees algorithm. The attributes are represented by the numbers on the x-axis of the graph, where 1-(TV and AS), 2-(DOW and TT), 3-(TT and AS), 4-(DOW and TV), 5-(DOW and AS) and 6- (TT and TV).

Figure 4.14 shows the attribute combinations that were used, namely (TV and AS), (DOW and TT), (TT and AS), (DOW and TV), (DOW and AS) and (TT and TV). The results showed that the average prediction of the (TT and AS) attribute model for the vehicle traffic flow data was 94.667%.


[Bar chart: RMSE vs 2-attribute combinations; the values plotted are those in Table 4.3.]

Figure 4.15: The RMSE for the selected attributes, represented by the numbers on the graph: 1-(TV and AS), 2-(DOW and TT), 3-(TT and AS), 4-(DOW and TV), 5-(DOW and AS) and 6-(TT and TV), with the C4.5 algorithm.

Figure 4.15 shows the average RMSE for the 6 combinations of attributes, namely 1-(TV and AS), 2-(DOW and TT), 3-(TT and AS), 4-(DOW and TV), 5-(DOW and AS) and 6-(TT and TV). The results showed that 3-(TT and AS) had the lowest RMSE of 0.153.


Figures 4.16 and 4.17 show the results where attribute selection (a combination of 3 attributes per model) was used to identify the best performing attributes. The C4.5 algorithm was used to perform these experiments.

[Bar chart: Prediction performance (%) vs 3-attribute combinations; the values plotted are those in Table 4.5.]

Figure 4.16: The prediction performance obtained when a combination of 3 attributes per model was used with the C4.5 algorithm. The numbers on the x-axis represent the selected attributes, where 1 represents (TT, TV and AS), 2 (DOW, TV and AS), 3 (DOW, TT and AS) and 4 (DOW, TT and TV).

Figure 4.16 shows the prediction performance of the attributes TravelTime, TrafficVolume, DayOfWeek and AverageSpeed for the vehicle traffic flow data. These attribute models were constructed using three attributes to determine the best attributes when they are combined. The C4.5 algorithm was used for all the attributes to construct the models. The results revealed that the prediction performance for the combination (TT, TV and AS) attributes was 99.981%.


[Bar chart: RMSE vs 3-attribute combinations; the values plotted are those in Table 4.5.]

Figure 4.17: The RMSE for the selected attributes, represented by the numbers on the x-axis of the graph, where 1-(TT, TV and AS), 2-(DOW, TV and AS), 3-(DOW, TT and AS) and 4-(DOW, TT and TV).

Figure 4.17 shows the average RMSE for the Decision tree models. The attributes include 1- (TT, TV and AS), 2-(DOW, TV and AS), 3- (DOW, TT and AS) and 4- (DOW, TT and TV). The results showed that the combination of the attributes that include (TT, TV and AS) had the lowest RMSE of 0.011.


Figures 4.18 and 4.19 show the results when the vehicle traffic flow data was used to construct models with the Decision trees (C4.5), Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP) algorithms, with their RMSE values recorded.

[Bar chart: Prediction performance (%) vs machine learning algorithms (C4.5, MLP and SVM(1) to SVM(5)); the values plotted are those in Tables 4.7 and 4.8.]

Figure 4.18: The prediction performance of C4.5, MLP and SVM for the vehicle traffic flow data. The numbers on the x-axis have no significance; they simply identify the algorithms used, where 1 is C4.5, 2 is MLP and 3 to 7 are SVM. All of the attributes were used during the design of the models.

Figure 4.18 shows the prediction performance of the three selected algorithms, namely the Decision Trees, the Multi-Layer Perceptron and the Support Vector Machine, for the vehicle traffic flow dataset. The results in Figure 4.18 showed that C4.5 was the best model, outperforming SVM and MLP with a prediction performance of 99.981%.


[Bar chart: RMSE vs machine learning algorithms (C4.5, MLP and SVM(1) to SVM(5)); the values plotted are those in Tables 4.7 and 4.8.]

Figure 4.19: The RMSE for the C4.5, MLP and SVM algorithms. The numbers on the x-axis have no significance; they simply identify the algorithms used. All the attributes were used to construct these models.

Figure 4.19 shows the average RMSE of the three selected algorithms, namely the Decision tree, the Support vector machine and the Multi-Layer perceptron, for the vehicle traffic flow data for the years 2013, 2014 and 2015. The results showed that C4.5 achieved the lowest RMSE of 0.011.


Figures 4.20 and 4.21 show the results for the ensemble learning methods. The models were constructed using the following machine learning algorithms: the Decision tree, the Support Vector Machine and the Multi-Layer Perceptron (MLP). The ensemble learning methods used to construct the models were Bagging, Boosting, Stacking and Random Forest.

[Bar chart: Prediction performance (%) vs ensemble learning methods (Bagging, Boosting, Stacking and Random Forest variants); the values plotted are those in Tables 4.10, 4.12, 4.14 and 4.16.]

Figure 4.20: The prediction performance of the ensemble learning methods (Bagging, Boosting, Random Forest and Stacking) using the three-year traffic data. The numbers on the x-axis represent the methods, where 1-Bagging(C4.5), 2-Bagging(MLP), 3-Bagging(SVM), 4-Boosting(C4.5), 5-Boosting(MLP), 6-Boosting(SVM), 7-Stacking(C4.5), 8-Stacking(MLP), 9-Stacking(SVM), 10-RF(1), 11-RF(3) and 12-RF(5).

Figure 4.20 shows the average prediction performance of Bagging, Boosting, Random Forest and Stacking. The results show 3 models each for Bagging (using the C4.5, MLP and SVM algorithms), Boosting (using the C4.5, MLP and SVM algorithms) and Random Forest (setting the numTrees parameter to 1, 3 and 5). Stacking combines the 3 chosen algorithms (C4.5, MLP and SVM) and sets a meta classifier per model, which is why it also has 3 models, as shown in Figure 4.20. The Boosting model combined with MLP and the Random Forest models achieved the highest prediction value of 99.991% among the models in Figure 4.20.


[Bar chart: RMSE vs ensemble learning methods (Bagging, Boosting, Stacking and Random Forest variants); the values plotted are those in Tables 4.10, 4.12, 4.14 and 4.16.]

Figure 4.21: The RMSE for the ensemble learning models, namely Bagging, Boosting, Stacking and Random Forest. These models are represented by the numbers on the x-axis, where (1-C4.5, 2-MLP and 3-SVM) is Bagging, (4-C4.5, 5-MLP and 6-SVM) is Boosting, (7-C4.5, 8-MLP and 9-SVM) is Stacking and (10, 11 and 12) is Random Forest.

Figure 4.21 shows the average RMSE for the ensemble learning methods, namely Bagging, Boosting, Stacking and Random Forest, for the vehicle traffic flow data. The results show that Boosting with MLP obtained the lowest RMSE of 0.004, close to Random Forest, which obtained RMSE values of 0.005 and 0.008.

4.4 DATA POST PROCESSING

This section interprets the confusion matrices obtained in section 4.3, and the cost of prediction is calculated. The data post-processing determines the model that would reduce the number of incorrect predictions and calculates the total cost of prediction. The goal was to minimise the probability of a wrong prediction for all the models, since costly predictions heavily penalise commuters. All the constructed models in section 4.3 are evaluated using the loss matrix in Table 4.18.

4.4.1 Cost of the prediction

The main objective was to select the model with the lowest cost of prediction: the model that reduces the number of wrong predictions and thereby minimises the “total loss incurred”. The loss matrix shown in Table 4.18 was designed such that the highest penalty (3) was assigned to a model that predicted Freeflow (a) when the actual traffic status was FlowingCongestion (b) or Congested (c). A low penalty (1) was assigned to a model that predicted Congested when the actual traffic status was FlowingCongestion, as shown in Table 4.18. The lowest penalty (0) was assigned to all the entries along the leading diagonal of the loss matrix, meaning that there is no penalty when the actual and the predicted values are the same.

Since there was a 3 x 3 confusion matrix from three classes, a 3 x 3 loss matrix had to be designed. This loss matrix in combination with a confusion matrix was used for computing the cost and hence as a consequence, for selecting the best vehicle traffic congestion prediction model.

Table 4.18: The loss matrix for the vehicle traffic prediction model, where a = Freeflow, b = FlowingCongestion and c = Congested.

               Predicted
               a       b       c
Actual    a    0       1       2
          b    3       0       1
          c    3       1       0

Table 4.18 shows the loss matrix that was used with each confusion matrix from the experiment section. “The input elements $L_{kj}$ of the loss matrix in Table 4.18 specify the penalty associated with the prediction of target concept $C_j$ when in fact it is target concept $C_k$” (Bishop, 1995). The values for the loss matrix were chosen by hand. The following process flow procedure was used to compute the cost for all the constructed models:


[Flow chart: obtain a loss matrix with the same dimensions as the confusion matrix -> compute the cost -> combine the results -> accept the prediction model if it has a low RMSE, high prediction performance and low cost; reject it if it has a high RMSE or a high cost.]

Figure 4.22: The process flow chart used to compute cost prediction for constructing the vehicle traffic prediction models.

Thus, for all instances $\mathbf{x}$ which belong to $C_k$, the expected loss for those instances is given by Equation 4.6, where $L_{kj}$ “is the penalty associated with incorrect predictions”.

$R_k = \sum_{j=1}^{c} L_{kj} \int_{R_j} P(\mathbf{x} \mid C_k)\,d\mathbf{x}$    (Equation 4.6)

The overall expected “loss or risk” for patterns from all classes is given by Equation 4.7:

$R = \sum_{k=1}^{c} L_{kj}\, P(\mathbf{x} \mid C_k)$    (Equation 4.7)


The risk is minimised if Equation 4.7 is “minimised at each point $\mathbf{x}$, i.e. if the regions $R_j$ are chosen such that $\mathbf{x} \in R_j$” (Bishop, 1995). A loss of 1 or more was assigned if an instance was placed in a wrong class, and a loss of 0 if the pattern was placed in the correct class. Following Bishop (1995), the values of the coefficients $L_{kj}$ (the cell values in Table 4.18) were chosen by hand, based on the views of experienced MTM personnel.

4.4.1.1 Computation of the cost prediction

In this section the cost is calculated using the loss matrix in Table 4.18 and the confusion matrices of the models constructed in section 4.3, following the step-by-step process for computing the cost shown in Figure 4.22.

$\text{Cost} = \sum_{k=1}^{c} \sum_{j=1}^{c} \mathit{Conf}_{kj} \times L_{kj}$    (Equation 4.8)

Table 4.19: The Confusion matrix for the C4.5 algorithm, for computing the cost of vehicle traffic congestion prediction.

               Predicted
               a       b       c
Actual    a    7177    0       2
          b    0       1756    0
          c    0       0       1678

The results are obtained by “multiplying the values of the Loss Matrix” in Table 4.18 with the corresponding cell values of the Confusion Matrix in Table 4.19.

Total Cost = (0 × 7177) + (1 × 0) + (2 × 2) + (3 × 0) + (0 × 1756) + (1 × 0) + (3 × 0) + (1 × 0) + (0 × 1678)

= 4

The cost for subsequent models is calculated in a similar fashion. Thus only the cost values are written in Tables 4.20, 4.21 and 4.22.
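The cost computation translates directly into code: the sketch below multiplies each cell of the loss matrix (Table 4.18) by the corresponding cell of a confusion matrix (here Table 4.19) and sums the products, reproducing the total cost of 4.

```java
// A sketch of Equation 4.8: element-wise multiply the loss matrix (Table 4.18)
// with a confusion matrix (here Table 4.19) and sum the products.
public class PredictionCost {
    public static void main(String[] args) {
        int[][] loss      = {{0, 1, 2}, {3, 0, 1}, {3, 1, 0}};          // Table 4.18
        int[][] confusion = {{7177, 0, 2}, {0, 1756, 0}, {0, 0, 1678}}; // Table 4.19

        int totalCost = 0;
        for (int k = 0; k < 3; k++)
            for (int j = 0; j < 3; j++)
                totalCost += loss[k][j] * confusion[k][j];

        System.out.println("Total cost = " + totalCost); // prints 4
    }
}
```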


Figures 4.23, 4.24, 4.25 and 4.26 show the graphical plots of the cost for all the designed models in this study. Figure 4.23 shows the cost for attribute selection using a combination of two attributes: (TV and AS), (DOW and TT), (TT and AS), (DOW and TV), (DOW and AS) and (TT and TV). Figure 4.24 shows the cost of the 3-attribute combinations which were selected to design different models, namely (TT, TV and AS), (DOW, TV and AS), (DOW, TT and AS) and (DOW, TT and TV). Figure 4.25 shows the computed cost for the C4.5, MLP and SVM algorithms. Figure 4.26 shows the computed cost for the ensemble learning methods used during the experiments, namely Bagging, Boosting, Stacking and Random Forest.

[Bar chart: Cost vs 2-attribute combinations; the values plotted are those in Table 4.20.]

Figure 4.23: The prediction cost where a combination of 2 selected attributes per model was used. The numbers on the x-axis represent the attributes, where 1 represents (TV and AS), 2 (DOW and TT), 3 (TT and AS), 4 (DOW and TV), 5 (DOW and AS) and 6 (TT and TV).

Figure 4.23 shows the cost estimates for all the models designed during the experiments. The model with the “lowest cost is the best model”. The best model according to Figure 4.23 is the combination of attributes 3-(TT and AS), as it has the “lowest prediction cost” of 1146. The Decision tree (C4.5) algorithm was used during the construction of the models.


[Bar chart: Cost vs 3-attribute combinations; the values plotted are those in Table 4.20.]

Figure 4.24: The prediction cost for the combination of 3 selected attributes per model, where the numbering on the x-axis represents the groups, namely 1-(TT, TV and AS), 2-(DOW, TV and AS), 3-(DOW, TT and AS) and 4-(DOW, TT and TV), and the C4.5 algorithm was used.

Figure 4.24 shows the cost estimates, where the model with the lowest cost is the “best performing” model. The first set of attributes, 1-(TT, TV and AS), has the “lowest prediction cost” of 4. Therefore, it can be concluded that this combination of attributes is the best performing combination.


[Bar chart: Cost vs machine learning algorithms (C4.5, MLP and SVM(1) to SVM(5)); the values plotted are those in Table 4.21.]

Figure 4.25: The prediction cost for the models constructed using the machine learning algorithms, namely C4.5, MLP and SVM. The numbers on the x-axis represent the models, where 1 is C4.5, 2 is MLP and 3 to 7 are SVM.

Figure 4.25 shows the cost associated with “incorrect prediction” of traffic flow status by using different models. The model with the lowest cost is the best performing model. The Decision tree (C4.5) model has the lowest cost of prediction of 4.


[Bar chart: Cost vs ensemble learning methods (Bagging, Boosting, Stacking and Random Forest variants); the values plotted are those in Table 4.22.]

Figure 4.26: The prediction cost of the ensemble learning methods Bagging, Boosting, Stacking and Random Forest. The numbers on the x-axis represent the methods, where (1-C4.5, 2-MLP and 3-SVM) is Bagging, (4-C4.5, 5-MLP and 6-SVM) is Boosting, (7-C4.5, 8-MLP and 9-SVM) is Stacking and (10-12) is the Random Forest method. Figure 4.26 shows 3 columns for Bagging, 3 columns for Boosting and 3 columns for Stacking, since C4.5, MLP and SVM were used. Random Forest is represented by 3 columns, where the numTrees values used are 1, 3 and 5.

Figure 4.26 shows the cost associated with the “incorrect prediction” of traffic status by different models constructed using an ensemble learning method. The Random Forest ensemble model has the lowest prediction cost of 2 when the value of the numTrees is set to (1, 3 and 5).


Tables 4.20, 4.21 and 4.22 show a summary of all the constructed models, where the Decision Trees, the Multi-Layer Perceptron and the Support Vector Machine algorithms were used. Attribute selection and the ensemble learning methods were used to design the models for the traffic flow data. The last models of the study were constructed using the ensemble learning methods.

Table 4.20: Summary of prediction performance, RMSE and the total cost for attribute selection, where a combination of two and three attributes per models were used.

Algorithm   Attributes (two and three per model)   Prediction Performance (%)   RMSE    Cost
C4.5        TV and AS                              83.256                        0.291   5317
C4.5        DOW and TT                             80.006                        0.293   2768
C4.5        TT and AS                              94.667                        0.153   1146
C4.5        DOW and TV                             67.644                        0.404   10702
C4.5        DOW and AS                             80.665                        0.316   5868
C4.5        TT and TV                              84.189                        0.261   2362
C4.5        TT, TV and AS                          99.981                        0.011   4
C4.5        DOW, TV and AS                         94.667                        0.153   1146
C4.5        DOW, TT and AS                         83.256                        0.291   5317
C4.5        DOW, TT and TV                         84.189                        0.261   2362


Table 4.21: Summary of prediction performance, RMSE and the total cost for C4.5, MLP and SVM algorithms.

Algorithm      Attributes            Prediction Performance (%)   RMSE    Cost
C4.5           DOW, TT, TV and AS    99.981                        0.011   4
MLP            DOW, TT, TV and AS    98.342                        0.094   447
SVM1(Libsvm)   DOW, TT, TV and AS    93.018                        0.216   1535
SVM2(Libsvm)   DOW, TT, TV and AS    95.082                        0.181   1208
SVM3(Libsvm)   DOW, TT, TV and AS    98.634                        0.095   361
SVM4(Libsvm)   DOW, TT, TV and AS    98.182                        0.110   422
SVM5(Libsvm)   DOW, TT, TV and AS    99.152                        0.075   107

Table 4.22: Summary of prediction performance, RMSE and the total cost for Bagging, Boosting, Stacking and Random Forest ensemble learning methods for the vehicle traffic flow data.

Ensemble Learning Method   Attributes            Algorithm                           Prediction Performance (%)   RMSE    Cost
Bagging                    DOW, TT, TV and AS    C4.5                                99.981                        0.010   4
Bagging                    DOW, TT, TV and AS    MLP                                 98.483                        0.088   138
Bagging                    DOW, TT, TV and AS    SVM                                 93.046                        0.210   1529
Boosting                   DOW, TT, TV and AS    C4.5                                99.981                        0.011   4
Boosting                   DOW, TT, TV and AS    MLP                                 99.991                        0.004   3
Boosting                   DOW, TT, TV and AS    SVM                                 99.708                        0.048   35
Stacking                   DOW, TT, TV and AS    C4.5 (Stacking C4.5, MLP and SVM)   97.437                        0.131   814
Stacking                   DOW, TT, TV and AS    MLP (Stacking C4.5, MLP and SVM)    99.981                        0.010   4
Stacking                   DOW, TT, TV and AS    SVM (Stacking C4.5, MLP and SVM)    99.981                        0.011   4
Random Forest              DOW, TT, TV and AS    RF (numTrees=1)                     99.991                        0.008   2
Random Forest              DOW, TT, TV and AS    RF (numTrees=3)                     99.991                        0.005   2
Random Forest              DOW, TT, TV and AS    RF (numTrees=5)                     99.991                        0.005   2

4.5 CHAPTER CONCLUSION

Pre-processing of the vehicle traffic data was conducted using a Microsoft spreadsheet. The machine learning algorithms (the Decision Trees, the Multi-Layer Perceptron and the Support Vector Machine), attribute selection and the ensemble learning methods (Bagging, Boosting, Stacking and Random Forest) were used to construct the vehicle traffic prediction models.

The prediction results from these models were compared to determine the best performing model. The total cost associated with an incorrect prediction of the traffic flow status by each trained model was calculated. The main idea of computing the cost of prediction was to minimise the number of incorrect predictions and reduce any loss associated with such predictions. The overall results were assessed in terms of the prediction performance, RMSE and the cost of prediction to identify the best performing model.


CHAPTER 5: DISCUSSION, CONCLUSION AND RECOMMENDATIONS

5.1 DISCUSSION

Vehicle traffic flow data collected from MTM was used to construct the prediction models. The data contained four attributes: travel time, average speed, traffic volume and the day of week. The data was split into training and testing datasets. The models were constructed using WEKA software. Tables 4.20, 4.21 and 4.22 summarise the results for all the constructed prediction models of the study. The results were generalized using the prediction performance, RMSE value and the cost of prediction.

Table 4.20 in chapter 4 showed the results for models constructed using a combination of 2 attributes per model, in terms of the predicted instances, RMSE and the cost of prediction, where the attribute sets (TV and AS), (DOW and TT), (TT and AS), (DOW and TV), (DOW and AS) and (TT and TV) were used. The results in Table 4.20 show that the TT and AS attribute combination achieved a prediction of 94.667%, an RMSE of 0.153 and a prediction cost of 1146, meaning that the TT and AS model was the best performing 2-attribute model. Table 4.20 also shows the results for the 3-attribute combinations, namely (TT, TV and AS), (DOW, TV and AS), (DOW, TT and AS) and (DOW, TT and TV). All the models in Table 4.20 were constructed using the C4.5 algorithm. In terms of the results in Table 4.20, the (TT, TV and AS) attributes achieved the best prediction of 99.981%, an RMSE of 0.011 and a prediction cost of 4, meaning that the TT, TV and AS attribute combination outperformed the rest of the attribute combination models.

Table 4.21 presents the results obtained during the construction of the prediction models using the machine learning algorithms, namely C4.5, MLP and SVM, including all the attributes (TT, TV, DOW and AS). The results in Table 4.21 show that the C4.5 algorithm obtained a prediction of 99.981%, an RMSE value of 0.011 and a prediction cost of 4. The results in Table 4.21 suggest that the C4.5 algorithm performed best when compared with the MLP and SVM algorithm models.

Table 4.22 shows the results obtained when constructing prediction models using the ensemble learning methods, namely Bagging, Boosting, Stacking and Random Forest (RF). The ensemble learning method models were constructed using the C4.5, MLP and SVM algorithms. The results in Table 4.22 revealed that the Random Forest ensemble prediction models outperformed all the other models, with a prediction performance of 99.991%, an RMSE of 0.005 and the lowest prediction cost of 2. The data used for this study did not include variables such as weather and road accidents, which may be the reason why other models performed poorly during the experiments in section 4.3. The RF ensemble model confusion matrix results obtained in Table 4.17 support the model: only 1 instance was incorrectly predicted as Congested while its actual class was Freeflow. The results in Table 4.17 mean that the model, when analysed with the cost of prediction value, was the least penalised. Therefore, it can be concluded that the RF ensemble prediction model performed best when compared with all the constructed models in Tables 4.20, 4.21 and 4.22.

Table 4.20 of the models built from attribute selection indicates that average speed, travel time and traffic volume are important parameters for constructing a vehicle traffic prediction model. These results are in agreement with those of Wu et al. (2004), who used traffic data collected only for peak hours. However, the RF ensemble model for this study is better than that of Wu et al. (2004) in that data for both peak and off-peak hours was used.

This study also compares well with the work done by Thianniwet and Phosaard (2009) and Gupta et al. (2013), who used the Decision tree (C4.5) algorithm. Hamner (2010) also used vehicle traffic data for weekdays but excluded holiday data, which is similar to this study, which focused on weekdays only. Hamner (2010) also considered the RMSE for evaluating the traffic prediction model, as was done in the current study.

The current study used a window of 1 hour for the collected data. In addition, the RF ensemble prediction model also considered the cost of prediction. He et al. (2010) and Duan et al. (2011) used cross-validation for evaluation and used data for peak hours but did not compute the cost of prediction which would have indicated which models needed to be penalised.

Decision trees (C4.5) had been expected to do well because of their ability to perform post-pruning of the tree in order to avoid overfitting and their ability to deal with missing data. In this study, however, the C4.5 algorithm produced a lower predictive performance, as shown in Table 4.20, compared to the Random Forest (RF) ensemble model.

The ensemble method Stacking was expected to perform well compared to the Bagging and Boosting learning methods, since it is the only learning method that allows more than one machine learning algorithm to be combined to improve the prediction model. Stacking improves the performance of weak algorithms (Syarif et al., 2012), but in this case the predictive performance was low. This may be due to one of the algorithms used during meta-data classification for the Stacking model construction. Bagging is known to improve unstable machine learning algorithms, whereas Boosting can be used to reduce the error of any weak learning algorithm. RF is known to handle data with an extremely large number of attributes and thus its performance had been expected to be good. From the results, it can be concluded that the RF ensemble vehicle prediction model is the best potential model to predict vehicle traffic flow on the Ben Schoeman freeway. Commuters wishing to travel on the Ben Schoeman freeway can predict traffic flow by using an App. The App can allow commuters to enter variables such as day of week, travel time and traffic volume. The entered variables will predict travel conditions by examining target concepts such as Freeflow, FlowingCongestion and Congested. This process will enable commuters to plan ahead and thus make informed travel arrangements. The commuters will only be able to predict traffic flow, although they will not have full knowledge of the actual vehicle traffic volume on the freeway; in that case they will depend on the media (e.g. radio) traffic reports. Therefore, it can be concluded that the RF ensemble prediction model will allow commuters to use alternative routes or travel earlier to avoid being held back in traffic. In addition, businesses that use the Ben Schoeman freeway will also see improvements due to the on-time delivery of goods and staff reporting to work on time.

5.2 CONCLUSION AND RECOMMENDATIONS

This work addressed the vehicle traffic congestion that affects commuters and businesses in Gauteng. The study focused on the freeway that links Johannesburg (M1 north) with Pretoria (N1). The data was collected from MTM and comprised the vehicle traffic flow over a period of three years. A quantitative method was used for the study, and site visits and interviews were conducted to collect the data.

Prediction models for vehicle traffic congestion on the Ben Schoeman freeway were constructed based on ensemble learning methods and machine learning algorithms. The results showed that the Random Forest (RF) ensemble models outperformed all the other constructed models (C4.5, MLP and SVM used individually, attribute selection, and the Bagging, Boosting and Stacking ensemble learning methods combined with the machine learning algorithms) on the prediction of vehicle traffic flow, in terms of both the cost of prediction and the prediction performance. This model will complement the interventions that the Gauteng Department of Transport has implemented in the province, which include electronic toll (e-toll) gates, bus rapid transit (BRT) and road expansion. The adoption of this vehicle traffic flow prediction model will potentially make Gauteng Province an attractive destination for investors. The commuters and businesses in Gauteng will benefit immensely as a result of free-flowing traffic.

The results of this study showed that ensemble learning methods and machine learning algorithms are good at predicting vehicle traffic flow. Using variables that include travel time, traffic volume, average vehicle speed and day of the week, it was possible to construct a Random Forest ensemble model that provides prediction of vehicle traffic flow for the Gauteng Ben Schoeman freeway. This tool is better than most of the models that were used for the prediction of traffic flow in Gauteng, such as MLP, SVM and the ensemble learning methods Bagging, Boosting and Stacking. Investment in these tools by the Department of Transport (DoT) in Gauteng Province and MTM will help the province to tackle traffic congestion on the freeways.

Based on the ongoing investigation of the vehicle traffic flow problem, the following are recommended for future improvements:

 It is recommended that further investigations into constructing models such as the RF ensemble prediction model be undertaken, as they will help in predicting vehicle traffic flow.
 Collect data that contains weather and vehicle accident variables.
 Include algorithms such as data fusion, which is capable of handling data from different sources.
 Use methods such as sensitivity analysis; analysis of variance (ANOVA), which focuses on two or more variables; multivariate analysis (MANOVA), which deals with variables independently, using SPSS (a software package for statistical analysis); and factor analysis, which deals with correlation and covariance.
 Consider adding other machine learning algorithms that were used recently by other researchers tackling the issue of vehicle traffic flow (such as randomisation, Neural Networks, Bayesian networks, the Kalman Filter and Fuzzy logic).
 Commuters can use media traffic reports to get traffic volume updates.
 Use primary data collected from different sources.


REFERENCES

Allen D.M., 1974, The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16(1):pp.125-127.

Ali, J., Khan, R., Ahmad, N., & Maqsood, I., 2012, Random forests and decision trees. International Journal of Computer Science Issues (IJCSI), September 2012, 9(5), pp.272-278.

Auria L. & Moro R.A., 2008, Support Vector Machines (SVM) as a Technique for Solvency Analysis, German Institute for Economic Research, DIW Berlin, pp. 1-88.

Bauman, LJ. & Greenberg, E., 1992, "The use of ethnographic interviewing to inform questionnaire construction." Health Education Quarterly. 19(1), pp. 9-23.

Bishop, C.M., 1995. Neural networks for pattern recognition. Oxford university press.

Bottou L., Cortes C., Denker J., Drucker H., Guyon I., Jackel L., LeCun Y., Muller U., Sackinger E., Simard P., & Vapnik V.,1994, “Comparison of classifier methods: A case study in handwriting digit recognition,” in Proc. Int. Conf. Pattern Recognition., pp. 77–87.

Boser E., Guyon I. & Vapnik V., 1992, A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM Press, pp.144-152.

Buede D.M. & Girardi P., 1997, “A target identification comparison of Bayesian and Dempster- Shafer multisensory fusion”, IEEE Trans. On System, Man and Cybernetics-Part A, 27(5), Sep, pp.569-577.

Bühlmann, P., 2012. Bagging, boosting and ensemble methods. In Handbook of Computational Statistics, Springer Berlin Heidelberg, pp. 985-1022.

Burns, N. & Grove, S.K., 1993, The practice of nursing research: conduct, critique and utilization, (2nd edition), Philadelphia: W.B. Saunders.

Blokker J., 2008, The Application of SVM to Algorithmic Trading, Stanford University, CS229 Term Project, pp.1-4.

Bray, O.H., 1997. Information integration for data fusion. Sandia National Laboratories, Report SAND97-0195.

Breiman, L., 1996, Bagging predictors, Machine Learning 24 (2), pp. 123-140.

Breiman, L., 2001. Random forests. Machine learning, 45(1), pp.5-32.

Brown, G., 2011, Ensemble learning. In Encyclopedia of Machine Learning, Springer US, pp. 312-320.

Callan R., 2003, Artificial Intelligence. New York: Macmillan.

Cunningham, P., Cord, M. & Delany, S.J., 2008. Supervised learning. In Machine learning techniques for multimedia, Springer Berlin Heidelberg, pp. 21-49.

Chang, C.C. & Lin, C.J., 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), p.27.

Chai T. & Draxler R.R., 2014, Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature, Geoscientific Model Development, 7(3), pp. 1247-1250.

Chen, L., & Chen, C.P., 2007, April. Ensemble learning approach for freeway short-term traffic flow prediction. In System of Systems Engineering, 2007. SoSE'07. IEEE International Conference, IEEE, pp. 1-6.

Chen P., Lu Z., & Gu J., 2009, Vehicle Travel Time Prediction Algorithm Based on Historical Data and Shared Location, IEEE 5th International Joint Conference on INC, IMS & IDC, 25-27 Aug., pp.1632-1637.

Chen, H.T., Tsai, L.W., Gu, H.Z., Lee, S.Y. & Lin, B.S.P., 2012, July. Traffic congestion classification for night-time surveillance videos. In Multimedia and Expo Workshops (ICMEW), 2012 IEEE International Conference, IEEE, pp. 169-174.

Cybenko, G., 1989, Approximation by super positions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), pp. 303-314.

Dash M., Choi K., Scheuermann P & Liu H., 2002, “Feature selection for clustering – a filter solution,” in Proc, Of the Second International Conference on Data Mining, pp. 115-122.

Deng C., Wang F., Shi H., & Tan G.,2009, Real-time Freeway Traffic State Estimation Based on Cluster Analysis and Multiclass Support Vector Machine(MSVM), In Intelligent Systems and Applications, International Workshop, IEEE, May, pp.1-4.

Dietterich, T.G., 2002, Ensemble learning. The handbook of brain theory and neural networks, 2, pp.110-125.

Dinakaran, S., & Thangaiah, P. R. J., 2013, Role of Attribute Selection in Classification Algorithms. The International Journal of Scientific & Engineering Research, 4(6), pp.67-71.

Duan, G., Liu, P., Chen, P., Jiang, Q. & Li, N., 2011, July. Short-term traffic flow prediction based on rough set and support vector machine. In Fuzzy Systems and Knowledge Discovery (FSKD), 2011 Eighth International Conference, IEEE, Vol. 3, pp. 1526-1530.

Economy of GP. 2014, Internet: www.gautengonline.gov.za/Business.

Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R. & Lin, C.J., 2008. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9, pp.1871-1874.

Freund, Y. & Schapire, R.E., 1996, July. Experiments with a new boosting algorithm. In ICML Vol. 96, pp. 148-156.

Freund, Y., Schapire, R. & Abe, N., 1999. A short introduction to boosting. Journal-Japanese Society for Artificial Intelligence, 14(771-780), pp.1612.

Gan, C. & Canghui, Z., 2009, January. Modeling and Simulation of Freeway Short-Term Traffic Flow Prediction. In Advanced Computer Control, 2009. ICACC'09.IEEE International Conference, pp. 46-50.

Goh, M., 2002, Congestion management and electronic road pricing in Singapore. Journal of Transport Geography, 10(1), pp.29-38.

Gupta A., Choudhary S. & Paul S., 2013, “DTC: A Framework to Detect Traffic Congestion by Mining versatile GPS data”, IEEE 1st International Conference, Emerging Trends and Application in Computer Science, pp.97-103.

GMA (Gautrain Management Agency), 2014, Internet: http://gma.gautrain.co.za/gautrain-sed.

Hall, D.L. & Llinas, J., 1997. An introduction to multisensor data fusion. Proceedings of the IEEE, 85(1), pp.6-23.

Hall M.A., 2000, “Correlation-based feature selection for discrete and numeric class machine learning,” in Proc. Of the 7th International Conference on Machine Learning, pp. 359-366.

Hall D.L & Steinberg A.N, 2001, Dirty Secrets in Multisensor Data Fusion, Pennsylvania State University Park Applied Research Lab, pp. 1-10.

Hall M.A., 1999, Correlation-based Feature Selection for Machine Learning, Waikato, Department of Computer Science, pp. 1-198.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H., 2009, The WEKA data mining software: an update. ACM SIGKDD explorations newsletter, 11(1), pp.10- 18.

He J., He Q., Swirszcz G., Kamarianakis Y., Lawrence R., Shen W., & Wynter L., 2010,” Ensemble-based Method for Task 2: Predicting Traffic Jam”, Int. IEEE., Conf. on Data mining workshop New York 10598, pp.1363-1365.

Hamner, B., 2010, December. Predicting Future Traffic Congestion from Automated Traffic Recorder Readings with an Ensemble of Random Forests. In Data Mining Workshops (ICDMW), 2010 IEEE International Conference, pp. 1360-1362.

Hsu C.W., Chang C.C.& Lin C.J., 2003, A practical guide to support vector classification, Department of Computer Science, National Taiwan University, pp.1-16.

Kantardzic M., 2011, Data mining: concepts, models, methods and algorithms, John Wiley & Sons, pp. 50-57.

Kohavi R., & John G.H., 1997, “Wrappers for feature subset selection.” Artificial Intelligence, vol. 97, pp. 273-324.

Kohavi, R., & Quinlan, J. R., 2002, Data mining tasks and methods: Classification: decision-tree discovery. In Handbook of data mining and knowledge discovery, Oxford University Press, Inc., January, pp. 267-276.

Kordon A.K., 2009, Applying Computational Intelligence: How to Create Value, November 28, Springer Science & Business Media.

Kumar V. & Poggio T., 2000, Learning-based approach to real time tracking and analysis of faces, In Proceedings. 4th IEEE Int. Conference on Automatic Face and Gesture Recognition.pp.96-101.

Khoonsari P.E. & Motie A., 2012, “A comparison of efficiency and robustness of ID3 and C4.5: Algorithms using dynamic test and training data sets.” Int. Journal of machine learning and computing, 2(5), pp. 540-543.

Klein, L.A., 1993, Traffic parameter measurement technology evaluation. In Proceedings of the IEEE-IEE Vehicle Navigation and Information Systems Conference, 1993, pp. 529-533.

Lantz, B., 2013, Machine Learning with R, Packt Publishing, October, pp. 344-350.

Le P. & Zuidema W., 2014, Perceptron and Multilayer Perceptron, International Conference on Measuring Technology and Mechatronics Automation., pp. 1-8.

Yu, L. & Liu, H., 2003, Feature selection for high-dimensional data: a fast correlation-based filter solution. In ICML, August 21, Vol. 3, pp. 856-863.

Leshem, G. & Ritov, Y., 2007, Traffic Flow Prediction using Adaboost Algorithm with Random Forests as a Weak Learner, Proc. of World Academy of Science, Engineering and Technology, Vol. 21, January, pp. 193-198.

Lindsey, C. R., & Verhoef, E. T., 2000, Traffic congestion and congestion pricing (No. 00- 101/3). Tinbergen Institute Discussion Paper.

Lethatsa, G.M., 2012, The exploitation of ICT technology in mitigating road traffic jam in Gauteng, Master's degree in Information Technology, Applied and Computer Systems, Vaal University of Technology (VUT), accessed from: VUT Goldfield library.

Loubser, M.R. & Bester, C.J., 2009, The feasibility of using mobility performance measures for congestion analysis in South Africa. In Proceedings of the 28th Southern African Transport Conference (SATC 2009), July 6-9, Vol. 6, pp. 513-521.

Madzarov G., Gjorgjevikj D. & Chorbev I., 2009, A Multi-class SVM Classifier Utilizing Binary Decision tree, Informatica, 33(2), pp. 233-241.

Makaba, T.E. & Gatsheni, B.N., 2016, The design of a Vehicle Traffic Flow Prediction Model for Gauteng Freeways Based on an Ensemble of Multi-Layer Perceptrons, World Academy of Science, Engineering and Technology, International Journal of Computer, Electrical, Automation, Control and Information Engineering, 10(1), pp. 106-117.

Maimon, O., & Rokach, L. (Eds.), 2005, Data mining and knowledge discovery handbook (Vol. 2). New York: Springer.

Marie, T.E., 2014, An intelligent automatic vehicle traffic flow monitoring and control system, Master's degree in Information Technology, Applied and Computer Systems, Vaal University of Technology (VUT), accessed from: VUT Goldfield library.

Mitchell, M., 1997, An Introduction to Genetic Algorithms. Singapore: McGraw-Hill.

Miao, D. & Hou, L., 2004, A comparison of rough set methods and representative inductive learning algorithms. Fundamenta Informaticae, January 01, 59(2-3), pp. 203-219.

Michie, D., Spiegelhalter, D.J. & Taylor, C.C., 1994, Machine learning, neural and statistical classification, URL: http://minds.jacobsuniversity.de/sites/default/files/uploads/teaching/share/MantasRecommendedBook.pdf.

Mishina, Y., Murata, R. & Yamauchi, Y., 2015, Boosted Random Forest, IEICE Transactions on Information and Systems, 98(9), pp. 1630-1636.

Molupe, C.B., 2014, The design of an intelligent vehicle traffic flow prediction model for the Gauteng freeways, MCOM (IT Management), University of Johannesburg, retrieved from: https://ujdigispace.uj.ac.za (Accessed: 04-May-2015).

Mbodila, M. & Ekabua, O., 2013, Novel model for vehicle's traffic monitoring using wireless sensor networks between major cities in South Africa. In Proceedings of the International Conference on Wireless Networks (ICWN), p. 1, The Steering Committee of the World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp).

MTM (Mikro's Traffic Monitoring agency), 1998, Internet: www.trafmon.co.za/mtm/.

Padiath A., Vanajakshi L., Subramanian C. & Manda H., 2009, “Prediction of Traffic Density for Congestion Analysis under Indian Traffic Conditions.” IEEE Conf. Intell. Trans. Sys., pp. 78-83.

Pattara-atikom W., Peachvanish R., & Luckana R., 2007, “Estimating Road Traffic Congestion using Cell Dwell Time with Simple Threshold and Fuzzy Logic Techniques, “IEEE Trans. Intell. Trans. Sys., Conf, pp.956-964.

Pescaru D., 2013, “Urban Traffic Congestion Prediction Based on Routes Information.” 8th IEEE Int. Sys. Comp. Intell. Info, May 23-25, Timisoara, Romania, pp. 121-126.

Polikar, R., 2009, Ensemble Learning, Scholarpedia, 4(1), 2776, http://www.scholarpedia.org/article/Ensemble_learning.

Purusothaman S.B., & Parasuraman K.P., 2013, Vehicle Traffic Density State Estimation using Support Vector Machine, IEEE Int Conference on Emerging Trends in Computing, Communication and Nanotechnology, pp. 782-785.

Platt, J., 1998, Sequential minimal optimization: A fast algorithm for training support vector machines, Technical Report MSR-TR-98-14, Microsoft Research.

Platt, J.C., Cristianini, N. & Shawe-Taylor, J., 2000, Large margin DAGs for multiclass classification, in Advances in Neural Information Processing Systems, Vol. 12, Cambridge, MA: MIT Press, pp. 547-553.

Quinlan, J.R., 1986, Induction of decision trees, Machine Learning, 1(1), pp. 81-106, Centre for Advanced Computing Science, New South Wales Institute of Technology.

Rasyidi, M.A. & Ryu, K.R., 2014, November. Comparison of traffic speed and travel time predictions on urban traffic network. In Computer Systems and Applications (AICCSA), 2014 IEEE/ACS 11th International Conference, pp. 373-380.

Rokach, L., 2010, Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2), pp.1-39.

Rokach L., 2005, Ensemble Methods for Classifier: Data mining and knowledge discovery handbook, Springer, pp. 957-980.

Roy, K., Chaudhuri, C., Kundu, M., Nasipuri, M. & Basu, D.K., 2005, Comparison of the Multilayer Perceptron and the nearest neighbour classifier for handwritten digit recognition. Journal of Information Science and Engineering, 21(6), pp. 1245-1257.

Ruck, D.W., Rogers, S.K., Kabrisky, M., Oxley, M.E. & Suter, B.W., 1990, The multilayer perceptron as an approximation to a Bayes optimal discriminant function. Neural Networks, IEEE Transactions on, 1(4), pp. 296-298.

Russell S. & Norvig P., 2003, Artificial Intelligence: A modern approach. 2nd ed, EUA: Prentice Hall.

Saeys, Y., Inza, I. & Larrañaga, P., 2007, A review of feature selection techniques in bioinformatics. Bioinformatics, vol. 23, no. 19, pp. 2507-2517.

Sathya, R. & Abraham, A., 2013, Comparison of Supervised and Unsupervised Learning Algorithms for Pattern Classification, International Journal of Advanced Research in Artificial Intelligence, 2(2), pp. 34-38.

Sohn S.Y. & Lee S.H., 2003, Data fusion, ensemble and clustering to improve the classification accuracy for the severity of road traffic accidents in Korea, Safety Science, 41(1), pp. 1-14.

Sewell, M., 2007, Ensemble methods, RN 11(02), pp. 1-13.

Su M. C., Jean W. F., & Chang H. T., 1996, " A Static Hand Gesture Recognition System Using a Composite Neural Network," in Fifth IEEE Int. Conf. on Fuzzy Systems, New Orleans, U.S.A. (NSC85-2213-E-032-009), pp. 786-792.

Sharma, P., 2008, Artificial Intelligence, 1st ed., 09 February, p. 64.

Shen X. & Chen J., 2009, Study on Prediction of Traffic Congestion Based on LVQ Neural Network, Intell Conf. on Measuring Tech & Mechatronics Aut., pp. 318-321.

Strydom, H. & Venter, L., 2002, Sampling and sampling methods. In: De VOS, A.S.; STRYDOM, H.; FOUCHE, C.B. & DELPORT, C.S.L. Research at Grass Roots: For the social sciences and human service professions. Pretoria: Van Schaik Publishers.

Sweety, R.P., 2013, Ensemble of classifiers using artificial neural networks as base classifier, International Journal of Computer Science and Mobile Applications, vol. 1(4), October, pp. 7-16.

Syarif, I., Zaluska, E., Prugel-Bennett, A. & Wills, G., 2012, Application of Bagging, Boosting and Stacking to Intrusion Detection. In Machine Learning and Data Mining in Pattern Recognition, Springer Berlin, pp. 593-602.

Talwar, A. & Kumar, Y., 2013, Machine Learning: An artificial intelligence methodology. International Journal of Engineering and Computer Science, 2, pp. 3400-3404.

Tan, P.N., Steinbach, M. & Kumar, V., 2006, Classification: basic concepts, decision trees and model evaluation. Introduction to Data Mining, vol. 1, pp. 145-205.

Ting, K.M. & Witten, I.H., 1999, Issues in stacked generalization. Journal of Artificial Intelligence Research (JAIR), vol. 10, pp. 271-289.

Tiwari R. & Singh M.P., 2010, Correlation-based Attribute Selection using Genetic Algorithm, International Journal of Computer Applications, August, vol. 4, no. 8, pp 28-34.

Turban E., & Frenzel L.E, 1992, Expert System and Applied Artificial Intelligence. Prentice Hall Professional Technical Reference February 1, pp. 53-54.

Thianniwet T. & Phosaard S., 2009, Classification of Road Traffic Congestion Levels from GPS Data using a decision Tree Algorithm and Sliding Windows, Proceedings of the World Congress on Engineering(WCE) Vol. 1, July 1-3, pp. 978-988.

Vafaie H., & De Jong K., 1992, Genetic algorithms as a tool for feature selection in machine learning. In Tools with Artificial Intelligence, 1992, TAI’92, Proceedings. 4th International Conference, IEEE, pp. 200-203.

Vapnik V., 1995, “The Nature of Statistical Learning Theory”, Springer Verlag.

Wan, S. & Yang, H., 2013, July. Comparison among methods of ensemble learning. In 2013 International Symposium on Biometrics and Security Technologies.

Vuchic, V.R. and Kikuchi, S., 1994, The bus transit system: its underutilized potential, Report DOT-T-94-29, Washington, D.C.: Federal Transit Administration.

Wang, Y. & Hu, J., 2002, May. A machine learning based approach for table detection on the web. In Proceedings of the 11th International Conference on World Wide Web, ACM, pp. 242-250.

Wang, Y., Chua, C.S. & Ho, Y.K., 2002, Facial feature detection and face recognition from 2D and 3D images, Pattern Recognition Letters, Feb., 23(10), pp. 1191-1202.

Wisitpongphan, N., Jitsakul, W. & Jieamumporn, D., 2012, Travel Time Prediction using Multi-Layer Feed Forward Artificial Neural Network, 4th International Conference on Computational Intelligence, Communication Systems and Networks, pp. 326-329.

Wolpert, D.H., 1992, Stacked Generalization. Neural Networks, 5(2), pp. 241-259.

Witten I.H., Frank E. & Hall M. A., 2011, Data Mining Practical Machine Learning Tools and Techniques, 3rd edition, Library of Congress Cataloging-in-Publication Data.

Wu, C.H., Ho, J.M. & Lee, D.T., 2004, Travel-Time Prediction with Support Vector Regression, IEEE Transactions on Intelligent Transportation Systems, Vol. 5(4), December, pp. 276-281.

ANNEXURE 3A

Questions asked during the interview with the MTM analyst regarding the raw vehicle traffic flow data:

1. How does MTM collect data from the freeway?

2. What kind of devices are installed on the freeway to collect the data?

3. What is the interval in which the data gets collected on the freeway?

4. Who maintains the devices installed on site?

5. What does the agency use for quality assurance?

6. What kind of system do the agency's data analysts and technical team use to monitor whether devices installed on site have malfunctioned?

7. Does the software interface have functionality to rewind or fast-forward in order to preview historical data?

8. Can errors be identified at the server level (e.g. where there are missing values, to determine what was at fault)?

9. Can the data in the storage database be manipulated to meet certain requirements?

10. Can the site devices track any violation made by route users (e.g. when a heavy vehicle uses the wrong lane instead of the recommended lane)?

ANNEXURE 3B

The Ben Schoeman freeway (M1 North extending to the N1 North), which links Johannesburg with Pretoria, from which the vehicle traffic flow data used in this study was collected by Mikro's Traffic Monitoring (MTM), covering the stretch from Allandale Rd (M1) to the N1 Ben Schoeman freeway, Pretoria.

ANNEXURE 4A

Sample data as received from MTM from their database server.

Site ID  Date        Time      Duration  Total   Light   Heavy   Total   Light   Heavy   Total
                               Hours     Lane 1  Lane 1  Lane 1  Lane 2  Lane 2  Lane 2  Lane 3
1863     2013-01-28  14:00:00  00:37     831     826     5       867     819     48      573
1863     2013-01-28  15:00:00  01:00     1303    1290    13      1329    1256    73      971
1863     2013-01-28  16:00:00  01:00     1483    1470    13      1452    1391    61      1069
1863     2013-01-28  17:00:00  01:00     1573    1556    17      1373    1319    54      1101
1863     2013-01-28  18:00:00  01:00     1452    1428    24      1263    1200    63      1006
1863     2013-01-28  19:00:00  01:00     811     804     7       897     869     28      670
1863     2013-01-28  20:00:00  01:00     354     346     8       637     620     17      480
1863     2013-01-28  21:00:00  01:00     183     183     0       456     439     17      357
1863     2013-01-28  22:00:00  01:00     126     125     1       325     314     11      269
1863     2013-01-28  23:00:00  01:00     77      77      0       232     224     8       190
1863     2013-01-28  24:00:00  01:00     9       9       0       110     108     2       123
1863     2013-01-29  01:00:00  01:00     3       3       0       43      43      0       73
1863     2013-01-29  02:00:00  01:00     1       1       0       37      36      1       44
1863     2013-01-29  03:00:00  01:00     4       4       0       33      32      1       46
1863     2013-01-29  04:00:00  01:00     3       3       0       46      42      4       61
1863     2013-01-29  05:00:00  01:00     39      39      0       156     154     2       141
1863     2013-01-29  06:00:00  01:00     796     791     5       880     859     21      622
1863     2013-01-29  07:00:00  01:00     2032    2015    17      1682    1643    39      1509

ANNEXURE 4B

Sample data as received from MTM after removing the unused data.

Instance  Date        Time Hour  Time      Duration Hours  Total Road Dir 1  Average Speed
1         2013-01-28  14         14:00:00  00:37           4169              100
2         2013-01-28  15         15:00:00  01:00           6719              100
3         2013-01-28  16         16:00:00  01:00           7660              100
4         2013-01-28  17         17:00:00  01:00           8682              93
5         2013-01-28  18         18:00:00  01:00           7793              97
6         2013-01-28  19         19:00:00  01:00           4853              99
7         2013-01-28  20         20:00:00  01:00           2855              90
8         2013-01-28  21         21:00:00  01:00           1871              94
9         2013-01-28  22         22:00:00  01:00           1340              99
10        2013-01-28  23         23:00:00  01:00           910               103
11        2013-01-28  24         24:00:00  01:00           462               102
12        2013-01-29  1          01:00:00  01:00           227               101
13        2013-01-29  2          02:00:00  01:00           194               96
14        2013-01-29  3          03:00:00  01:00           184               95
15        2013-01-29  4          04:00:00  01:00           214               102
16        2013-01-29  5          05:00:00  01:00           613               108
17        2013-01-29  6          06:00:00  01:00           4046              109
18        2013-01-29  7          07:00:00  01:00           11147             81
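For illustration, the reduction from the per-lane counts in Annexure 4A to the per-direction totals in Annexure 4B can be sketched as follows. This is a minimal sketch and not MTM's actual pipeline: the file name and column names (e.g. mtm_raw_export.csv, total_lane_1) are hypothetical, and the full raw export is assumed to carry further lane columns and a speed field that the Annexure 4A sample does not show.

```python
# Minimal sketch (not MTM's actual pipeline) of reducing the per-lane counts
# in Annexure 4A to the per-direction totals of Annexure 4B. All file and
# column names here are hypothetical placeholders.
import pandas as pd

raw = pd.read_csv("mtm_raw_export.csv")  # hypothetical file name

# Sum every per-lane "Total" column into one volume figure for the direction
# of travel. The Annexure 4A sample shows only the first lanes; the full
# export is assumed to contain the remaining lane columns and a speed field.
lane_totals = [c for c in raw.columns if c.startswith("total_lane_")]
clean = pd.DataFrame({
    "Date": raw["date"],
    # Split the string rather than parsing a timestamp, because the raw
    # export contains the non-standard value "24:00:00".
    "TimeHour": raw["time"].str.split(":").str[0].astype(int),
    "Time": raw["time"],
    "DurationHours": raw["duration_hours"],
    "TotalRoadDir1": raw[lane_totals].sum(axis=1),
    "AverageSpeed": raw["average_speed"],
})
clean.insert(0, "Instance", range(1, len(clean) + 1))
print(clean.head())
```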

ANNEXURE 4C

Sample data after the data has been converted to nominal values so that it can be uploaded to WEKA for experiments.

Instance  DayOfWeek  TravelTime  TrafficVolume    AverageSpeed   TargetConcept
1         DOW        Off-Peak    Average-Traffic  Average-Speed  Freeflow
2         DOW        Peak        Heavy-Traffic    Average-Speed  FlowingCongestion
3         DOW        Peak        Heavy-Traffic    Average-Speed  FlowingCongestion
4         DOW        Peak        Heavy-Traffic    Low-Speed      Congested
5         DOW        Peak        Heavy-Traffic    Low-Speed      Congested
6         DOW        Off-Peak    Average-Traffic  Average-Speed  Freeflow
7         DOW        Off-Peak    Low-Traffic      Low-Speed      Freeflow
8         DOW        Off-Peak    Low-Traffic      Low-Speed      Freeflow
9         DOW        Off-Peak    Low-Traffic      Average-Speed  Freeflow
10        DOW        Off-Peak    Low-Traffic      Average-Speed  Freeflow
11        DOW        Off-Peak    Low-Traffic      Average-Speed  Freeflow
12        DOW        Off-Peak    Low-Traffic      Average-Speed  Freeflow
13        DOW        Off-Peak    Low-Traffic      Low-Speed      Freeflow
14        DOW        Off-Peak    Low-Traffic      Low-Speed      Freeflow
15        DOW        Off-Peak    Low-Traffic      Average-Speed  Freeflow
16        DOW        Off-Peak    Low-Traffic      High-Speed     Freeflow
17        DOW        Peak        Average-Traffic  High-Speed     Freeflow
18        DOW        Peak        Heavy-Traffic    Low-Speed      Congested
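For illustration, the numeric-to-nominal conversion can be sketched as below. The peak windows follow the 06:00 to 09:00 and 15:00 to 18:00 periods named in this study, but the volume and speed bin boundaries are assumptions inferred only from the 18-row sample above; they reproduce this sample and are not the study's official definitions.

```python
# Minimal sketch of the numeric-to-nominal conversion applied before
# uploading the data to WEKA. The bin boundaries are assumptions inferred
# from the 18-row Annexure 4B/4C sample, not the study's definitions.
def to_nominal(hour, volume, speed):
    # Peak windows follow the 06:00-09:00 and 15:00-18:00 periods named in
    # the study (the 18:00 reading is labelled Peak in the sample).
    travel_time = "Peak" if 6 <= hour < 9 or 15 <= hour <= 18 else "Off-Peak"
    if volume < 3000:                      # assumed cut-off
        traffic_volume = "Low-Traffic"
    elif volume < 6000:                    # assumed cut-off
        traffic_volume = "Average-Traffic"
    else:
        traffic_volume = "Heavy-Traffic"
    if speed < 97:                         # assumed cut-off (km/h)
        average_speed = "Low-Speed"
    elif speed <= 105:                     # assumed cut-off (km/h)
        average_speed = "Average-Speed"
    else:
        average_speed = "High-Speed"
    return travel_time, traffic_volume, average_speed

# Instance 18 of Annexure 4B: hour 7, volume 11147, speed 81
print(to_nominal(7, 11147, 81))  # -> ('Peak', 'Heavy-Traffic', 'Low-Speed')
```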

ANNEXURE 4D

Confusion matrices for all the models constructed in Chapter 4 that were not shown during the experiments.

Confusion matrix of the model constructed from DOW, TV and AS.

            Predicted
            a       b       c
Actual  a   6755    128     296
        b   142     1614    0
        c   0       0       1678

Confusion matrix of the model constructed from DOW, TV and AS.

            Predicted
            a       b       c
Actual  a   7165    0       14
        b   1756    0       0
        c   7       0       1671

Confusion matrix of the model constructed from DOW, TV and TT.

            Predicted
            a       b       c
Actual  a   7179    0       0
        b   0       1756    0
        c   342     1336    0

Confusion matrix for a model constructed from DOW and TT attributes.

            Predicted
            a       b       c
Actual  a   6735    444     0
        b   0       1756    0
        c   323     1355    0

Confusion matrix for a model constructed from AS and TT attributes.

            Predicted
            a       b       c
Actual  a   6755    128     296
        b   142     1614    0
        c   0       0       1678

Confusion matrix for a model constructed from TV and DOW attributes.

            Predicted
            a       b       c
Actual  a   7179    0       0
        b   1756    0       0
        c   1678    0       0

Confusion matrix for a model constructed from AS and DOW attributes.

            Predicted
            a       b       c
Actual  a   6883    0       296
        b   1756    0       0
        c   0       0       1678

Confusion matrix for a model constructed from TT and TV attributes.

            Predicted
            a       b       c
Actual  a   7179    0       0
        b   0       1756    0
        c   342     1336    0

Confusion matrix for training data using the MLP algorithm.

            Predicted
            a       b       c
Actual  a   7161    1       17
        b   111     1614    31
        c   16      0       1662

Confusion matrix for training data using the SVM1 algorithm.

            Predicted
            a       b       c
Actual  a   6903    128     148
        b   0       1614    142
        c   1323    0       1355

Confusion matrix for training data using the SVM2 algorithm.

            Predicted
            a       b       c
Actual  a   7141    36      2
        b   0       1614    142
        c   342     0       1336

Confusion matrix for training data using the SVM3 algorithm.

            Predicted
            a       b       c
Actual  a   7141    36      2
        b   107     1649    0
        c   0       0       1678

Confusion matrix for training data using the SVM4 algorithm.

            Predicted
            a       b       c
Actual  a   7093    71      15
        b   107     1649    0
        c   0       0       1678

Confusion matrix for training data using the SVM5 algorithm.

            Predicted
            a       b       c
Actual  a   7089    73      17
        b   0       1756    0
        c   0       0       1678

Confusion matrix for the MLP Bagging ensemble learning method.

            Predicted
            a       b       c
Actual  a   7179    0       0
        b   61      1614    81
        c   19      0       1659

Confusion matrix for the SVM Bagging ensemble learning method.

            Predicted
            a       b       c
Actual  a   6906    128     145
        b   0       1614    142
        c   323     0       1355

Confusion matrix for the Boosting ensemble learning method when using the MLP algorithm.

            Predicted
            a       b       c
Actual  a   7179    0       0
        b   0       1756    0
        c   1       0       1677

Confusion matrix for the Boosting ensemble learning method when using the SVM algorithm.

            Predicted
            a       b       c
Actual  a   7148    27      4
        b   0       1756    0
        c   0       0       1678

Confusion matrix for the Stacking ensemble learning method combining the C4.5, SVM and MLP algorithms, where the MLP algorithm was used as the meta-classifier.

            Predicted
            a       b       c
Actual  a   7177    0       2
        b   0       1756    0
        c   0       0       1678

Confusion matrix for the Stacking ensemble learning method combining the C4.5, SVM and MLP algorithms, where the SVM algorithm was used as the meta-classifier.

            Predicted
            a       b       c
Actual  a   7177    0       2
        b   0       1756    0
        c   0       0       1678

Confusion matrix for Random Forest when numTrees (3) and numFeatures are set.

            Predicted
            a       b       c
Actual  a   7178    0       1
        b   0       1756    0
        c   0       0       1678

Confusion matrix for Random Forest when numTrees (5) and numFeatures are set.

            Predicted
            a       b       c
Actual  a   7178    0       1
        b   0       1756    0
        c   0       0       1678
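For illustration, the overall accuracy reported for the best model can be recovered from a matrix such as the Random Forest one above: 10612 of the 10613 instances lie on the diagonal, which corresponds roughly to the 99.991% figure reported in this study. A minimal sketch follows, assuming the class order a = Freeflow, b = FlowingCongestion, c = Congested from Annexure 4C.

```python
# Minimal sketch of recovering accuracy and per-class recall from the
# Random Forest confusion matrix above. The class-to-label mapping
# (a = Freeflow, b = FlowingCongestion, c = Congested) is assumed from
# the target concepts in Annexure 4C.
matrix = [
    [7178, 0,    1],     # actual a
    [0,    1756, 0],     # actual b
    [0,    0,    1678],  # actual c
]

total = sum(sum(row) for row in matrix)                  # 10613 instances
correct = sum(matrix[i][i] for i in range(len(matrix)))  # 10612 on the diagonal
print(f"accuracy = {correct / total:.5%}")               # -> 99.99058%

# Per-class recall: diagonal cell divided by the actual-class row total.
for i, label in enumerate("abc"):
    print(f"recall({label}) = {matrix[i][i] / sum(matrix[i]):.5f}")
```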