Combined Model Approach to the Problem of Ranking

Item Type text; Electronic Dissertation

Authors Lee, Alexander S.

Publisher The University of Arizona.

Rights Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.

Download date 28/09/2021 08:08:32

Link to Item http://hdl.handle.net/10150/631941

COMBINED MODEL APPROACH TO THE PROBLEM OF RANKING

by

Alexander S. Lee

______

Copyright © Alexander S. Lee 2019

A Dissertation Submitted to the Faculty of the

DEPARTMENT OF SYSTEMS AND INDUSTRIAL ENGINEERING

In Partial Fulfillment of the Requirements

For the Degree of

DOCTOR OF PHILOSOPHY

In the Graduate College

THE UNIVERSITY OF ARIZONA

2019


STATEMENT BY AUTHOR

This dissertation has been submitted in partial fulfillment of the requirements for an advanced degree at the University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library.

Brief quotations from this dissertation are allowable without special permission, provided that an accurate acknowledgement of the source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the copyright holder.

SIGNED: Alexander S. Lee


ACKNOWLEDGEMENTS

Over the years since my first year at the University of Arizona, I met many mentors, colleagues, and friends. I am extremely thankful for the experiences that I gained for both my professional and personal growth. I will cherish all the great memories and friendships formed during my time at the University of Arizona. I am very grateful to Dr. Wei-Hua Lin, Dr. Young-Jun Son, Dr. Ricardo Valerdi, and Dr. Joseph Valacich for serving on my doctoral committee. I especially want to thank my advisor, Dr. Wei-Hua Lin, for his academic guidance, encouragement, and patience. His research advice over the years helped shape me into a better researcher. In addition, I want to thank Dr. Larry Head for his advice on careers and networking in conferences. I also want to thank the University of Arizona Department of Systems & Industrial Engineering faculty and staff. I especially want to thank Linda Cramer and Mia Schnaible for their guidance with the paperwork and sending important emails and reminders on careers, seminars, and upcoming events. Their help definitely made my life a lot easier! I extend my thanks to all former and current University of Arizona Systems & Industrial Engineering students I have met over the years. Personally, I want to thank Dong Xu, Chao Meng, Sojung Kim, Byung-Ho Beak, Matthew Dabkowski, Tommy Ryan, and Danny Thebeau for their friendship and guidance. I feel very lucky to have met them and call them my friends. Most importantly, I want to thank my family, especially my parents and my brother, for their continuous love and support. None of the things that I have accomplished would have been possible without them.


TABLE OF CONTENTS

LIST OF FIGURES……………………………………………………………………………….9

LIST OF TABLES……………………………………………………………………………….10

ABSTRACT……………………………………………………………………………………...11

1. INTRODUCTION…………………………………………………………………………...13

1.1. Main Research Goal…………………………………………………………………….15

1.2. Objectives……………………………………………………………………………….16

1.2.1. Qualitative and Quantitative Elements of Combined Models…………………...16

1.2.2. Quantification of Similarity for Road Segment Clustering……………………...16

1.2.3. Road Hotspot Identification……………………………………………………...16

1.2.4. Unsupervised Learning Hybrid Model…………………………………………..17

1.2.5. Traffic Congestion Measurement………………………………………………..17

1.2.6. Combined Model based on Machine Learned Ranking Algorithms for Voting Bias Detection…………………………………………………………………………18

1.2.7. Human-Machine Symbiosis……………………………………………………...18

1.3. Organization of the Remainder of the Dissertation……………………………………..18

2. BACKGROUND AND LITERATURE REVIEW………………………………………….20

2.1. Road Safety and Hotspot Identification…………………………………………………20

2.2. Traffic Congestion Measurement……………………………………………………….22

2.2.1. Specific Aspects of Traffic Congestion………………………………………….23

2.2.2. Hybrid Model…………………………………………………………………….24

2.2.3. Principal Component Analysis and Ranking…………………………………….25

2.3. Machine Learning Algorithms…………………………………………………………..26

2.3.1. Voting Bias………………………………………………………………………26

2.3.2. Combined Model………………………………………………………………...27

2.4. Literature Summary……………………………………………………………………..29

3. PROPOSED METHODOLOGIES………………………………………………………...... 30

3.1. Enhanced Empirical Bayesian Method………………………………………………….30

3.1.1. Similarity Measure……………………………………………………………….30

3.1.1.1. Road Segment Similarity Measure………………………………………31

3.1.1.2. Proportion Discordance Ratio (PDR)……………………………………32

3.1.2. Empirical Bayesian Method…………………………………………………...... 34

3.1.3. Enhancement to the Empirical Bayesian Method………………………………..37

3.2. Unsupervised Learning Hybrid Model………………………………………………….39

3.2.1. Normalized Scoring Method (NSM)…………………………………………….40

3.2.2. Principal Component Analysis (PCA)…………………………………………...40

3.2.3. Proportion Discordance Ratio (PDR) Similarity Matrix………………………...42

3.3. Machine Learned Ranking Combined Model - Supervised Learning Approach……….43

3.3.1. Inverse Mean-Squared Error (MSE) Weighted Sum…………………………….43

3.3.1.1. Support Vector Machines (SVM)………………………………………..45

3.3.1.2. Neural Networks (NN)…………………………………………………...49

3.4. Methodology Summary…………………………………………………………………51

4. ENHANCED EMPIRICAL BAYESIAN METHOD – CASE STUDY IN PHOENIX, ARIZONA……………………………………………………………………………………52

4.1. Background……………………………………………………………………………...52

4.2. Description of the Sites and Data used in the Arizona Case Study……………………..53

4.3. Results and Discussion………………………………………………………………….57

4.3.1. Overall Road Hotspot Identification……………………………………………..57

4.3.2. Road Hotspot Identification for Different Timeframes………………………….62

4.3.2.1. Seasons…………………………………………………………………...63

4.3.2.2. Days of the Week………………………………………………………...64

4.3.2.3. Times of the Day…………………………………………………………65

4.4. Chapter Summary……………………………………………………………………….67

5. HYBRID MODEL RANKING FOR TRAFFIC CONGESTION ASSESSMENT OF METROPOLITAN AREAS…………………………………………………………………69

5.1. Background……………………………………………………………………………...70

5.2. Dataset Description……………………………………………………………………...71

5.3. Results and Discussion………………………………………………………………….72

5.3.1. Normalized Scoring Method (NSM)…………………………………………….72

5.3.2. Principal Component Analysis (PCA)…………………………………………...77

5.3.3. Proportion Discordance Ratio (PDR) Similarity Matrix………………………...82

5.4. Chapter Summary……………………………………………………………………….87

6. BASEBALL HALL OF FAME MACHINE LEARNING COMBINED MODEL…………89

6.1. Background……………………………………………………………………………...90

6.2. Dataset Description……………………………………………………………………...92

6.3. Results and Discussion………………………………………………………………….94

6.3.1. Outfielders………………………………………………………………………..95

6.3.2. Infielders…………………………………………………………………………98

6.3.3. Starting Pitchers………………………………………………………………...100

6.3.4. Overall Summary of Results……………………………………………………102

6.4. Chapter Summary……………………………………………………………………...108

7. SUMMARY AND CONCLUSIONS………………………………………………………110

7.1. Research Summary…………………………………………………………………….110

7.1.1. Contributions to Combined Model for Addressing a Class of Problems Pertaining to Extreme Values and Rare Events………………………………………………111

7.1.2. Contributions to Exploring and Clarifying Similarity for Road Segment Clustering…………………………………………………………………………112

7.1.3. Contributions to Unsupervised Learning Hybrid Model for Traffic Congestion Measurement and Ranking………………………………………………………..112

7.1.4. Contributions to Exploring the Weighted Sum of Classification Probabilities using Machine Learned Ranking Algorithms……………………………………..113

7.2. Firsts in the Research…………………………………………………………………..114

7.3. Future Research Directions…………………………………………………………….115

RESEARCH FUNDING SOURCES………………………………………………………...…118

REFERENCES…………………………………………………………………………………119


LIST OF FIGURES

Figure 4.1: Initial Cluster Centroids of 55 Road Segments in Arizona State Route 101………………………………………………………………………………………………..55

Figure 4.2: 2014 4th Quarter Predictions vs. Actual Number of Crashes in Arizona State Route 101………………………………………………………………………………………………..59

Figure 4.3: Arizona State Route 101 Ratios vs. Enhanced EB Predictions……………………...61

Figure 4.4: Arizona State Route 101 Seasonal Heat Map……………………………………….64

Figure 4.5: Arizona State Route 101 Daily Heat Map…………………………………………...65

Figure 4.6: Arizona State Route 101 Hourly Heat Map…………………………………………67

Figure 5.1: Metropolitan Area Normalized Scoring Method (NSM) Scores and Clusters………74

Figure 5.2: Metropolitan Area Normalized Scoring Method (NSM) Scores Ordered Based on TomTom’s Ranks………………………………………………………………………………...76

Figure 5.3: Principal Component Analysis Graph……………………………………………….78

Figure 5.4: First Principal Component Scores Ordered Based on TomTom’s Ranks…………...79

Figure 5.5: Proportion Discordance Ratio (PDR) Similarity Matrix…………………………….83

Figure 6.1: Top 30 Outfielders – Hall of Fame (HOF) Probability 95% Confidence Intervals…………………………………………………………………………………………..98

Figure 6.2: Top 30 Infielders – Hall of Fame (HOF) Probability 95% Confidence Intervals…………………………………………………………………………………………100

Figure 6.3: Top 30 Starting Pitchers – Hall of Fame (HOF) Probability 95% Confidence Intervals…………………………………………………………………………………………102

Figure 6.4: ROC Curve…………………………………………………………………………104

LIST OF TABLES

Table 3.1: Key Differences Between the Enhanced EB Method and the Standard EB Method…………39

Table 4.1: Descriptive Statistics of Arizona State Route 101 Road Segments…………………..56

Table 4.2: Arizona State Route 101 Average Error Percentages………………………………...58

Table 4.3: Arizona State Route Road Segments and Popular Destinations……………………...63

Table 5.1: TomTom Dataset Feature Definitions………………………………………………..72

Table 5.2: Beta Values for the 1st Principal Component………………………………………...80

Table 5.3: Rank Results and Comparisons………………………………………………………81

Table 5.4: Tucson Similarity Comparison with Other Metropolitan Areas……………………...85

Table 6.1: Top 30 Outfielders – Hall of Fame (HOF) Probability Scores (α=0.4609 and β=0.5391)………………………………………………………………………………………...96

Table 6.2: Top 30 Infielders – Hall of Fame (HOF) Probability Scores (α=0.4884 and β=0.5116)………………………………………………………………………………………...99

Table 6.3: Top 30 Starting Pitchers – Hall of Fame (HOF) Probability Scores (α=0.4662 and β=0.5338)……………………………………………………………………………………….101

Table 6.4: Confusion Matrix for All Groups of Players Combined……………………………103

Table 6.5: Future Hall of Fame (HOF) Candidate Predictions…………………………………106


ABSTRACT

Ranking can be defined in many ways and has been applied in many areas. Over the years, many research studies have examined ranking methods for decision-making. The issue is that rankings of entities are typically based on a single method, which can be subjective and prone to biases. As a result, results vary from method to method. To resolve this issue, a combined model approach is proposed in which multiple methods are taken into account. The goals of the combined model are to rank entities more objectively and to obtain more reliable results.

The combined model has both qualitative and quantitative elements: the qualitative element is the ranking and clustering, while the quantitative element consists of the scores.

Since ranking is based on the scores of the entities, the model takes the distribution of scores into account. In some scenarios, closely ranked entities have similar scores, while in other scenarios their scores can differ considerably even though they are ranked close to one another. The score distribution leads to clustering analysis, in which entities are divided into clusters based on the spread of the scores. Hence, the combined model takes into account not only the ranks but also the scores of the entities.

The proposed combined model is applied to three areas in this dissertation. The first is identifying and ranking road hotspots and predicting the number of traffic crashes in road segments using the Empirical Bayesian (EB) method enhanced by the Proportion Discordance Ratio (PDR) metric. The effectiveness of the Enhanced EB method is tested and demonstrated through a case study conducted on one of the major highways in Phoenix, Arizona. The second is ranking major US metropolitan areas by traffic congestion using unsupervised learning based on the Normalized Scoring Method (NSM), Principal Component Analysis (PCA), and the PDR similarity matrix. In 2015, TomTom ranked Tucson as the 21st most congested metropolitan area in the US, and the unsupervised learning combined model is applied to assess TomTom's traffic congestion ranking of metropolitan areas in order to determine whether Tucson is highly congested under the proposed model. The third is ranking and assessing the Hall of Fame (HOF) worthiness of retired Major League Baseball players based on their performance statistics through supervised learning with Support Vector Machines (SVM) and Neural Networks (NN). Players are considered for the HOF through a voting procedure in which the voters are members of the media, but there is a possibility of voting bias for or against certain players. Results from all three scenarios show that the proposed combined models are more reliable and can rank entities more objectively, correcting biases in previous methodologies.


CHAPTER 1

INTRODUCTION

Over the past decades, many methodologies have been proposed to rank entities in order to facilitate decision-making, and decisions made on the basis of such rankings have had a profound impact on society. One of the most well-known works on ranking is the PageRank algorithm created by Google co-founders Larry Page and Sergey Brin (Brin and Page, 1998). The PageRank algorithm conducts a pair-wise comparison among web pages from a network standpoint in terms of the likelihood that a certain web page is reached from a previous web page. As a result, web pages are ranked by their relevance to keywords typed into the Google search engine (Page et al., 1998). Subsequently, the pair-wise comparison algorithms RankSVM (Joachims, 2002) and RankNet (Richardson et al., 2005), as well as the list-wise comparison algorithms ListNet (Cao et al., 2007) and ListMLE (Xia et al., 2008), were proposed to improve upon the PageRank algorithm.
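As background for the pair-wise ranking idea discussed above, the core of PageRank can be sketched as a power iteration over a damped random-surfer transition matrix. This is a minimal sketch of the published algorithm, not any production system; the 0.85 damping factor follows the convention of the original paper.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-9):
    """Power iteration for PageRank on a link matrix where adj[i][j] = 1
    if page i links to page j. Damping 0.85 follows the original paper."""
    A = np.array(adj, dtype=float)
    n = len(A)
    out = A.sum(axis=1, keepdims=True)
    out[out == 0] = 1.0                    # guard against dangling pages
    M = A / out                            # row-stochastic transitions
    r = np.full(n, 1.0 / n)                # start from a uniform surfer
    while True:
        r_next = (1 - damping) / n + damping * (M.T @ r)
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

# Pages: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1; page 1 receives the most links
ranks = pagerank([[0, 1, 0], [0, 0, 1], [1, 1, 0]])
order = ranks.argsort()[::-1]              # pages from highest to lowest
```

The later pair-wise learners (RankSVM, RankNet) replace this fixed link-counting recursion with learned comparison functions, but the output in all cases is an ordering of the items.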

The issue with relying on only one method is that a single method has its own disadvantages and weaknesses, which limit the reliability and hence the accuracy of results. Results based on a single method are more likely to be prone to biases, which can lead to costly and ineffective decisions, and rankings are therefore likely to vary from method to method. To overcome this deficiency, a combined model approach that uses multiple methods is taken: the limitations of the individual methods offset one another, and the results obtained from combined models become more objective and reliable. In fact, some papers have taken a combined model approach to further improve the PageRank algorithm. Moon et al. (2010) took into account both pair-wise and list-wise comparisons among web pages using isotonic regression, and this combination improved ranking performance as measured by a loss function. Sculley (2010) proposed the Combined Regression and Ranking (CRR) method, which optimizes regression and ranking objectives simultaneously; using this combined optimization, it outperformed the RankSVM algorithm. From a regression standpoint, the CRR model was able to overcome the presence of outliers and skewed distributions by taking ranking constraints into account. In addition, the use of combined models has led to improved results in energy (Xiao et al., 2015; Zhang et al., 2017) and medical (Avendi et al., 2016; Wei et al., 2016) applications.

Overall, past literature has shown that combined models perform better than individual models based on a single method. In this dissertation, combined models comprise both qualitative and quantitative elements: the qualitative element is the ranking and clustering, and the quantitative element consists of the scores. Since ranking is based on the scores of the entities, it takes the score distribution into account. In some scenarios, closely ranked entities have similar scores, while in others their scores can be relatively spread apart even though the entities are ranked close to one another. The score distribution leads to clustering analysis, where entities are divided into clusters based on the spread of the scores. Hence, the combined model in this dissertation takes into account not only the ranks but also the scores of the entities.
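As a toy illustration of ranking entities by score and then clustering them by the spread of the scores, one could split the sorted scores wherever a large gap appears between neighbours. The gap threshold and the entity names below are hypothetical, not values from this dissertation.

```python
def rank_and_cluster(scores, gap=0.15):
    """Rank entities by score (descending), then split the sorted list
    into clusters wherever the drop between neighbours exceeds `gap`.
    The 0.15 threshold is an illustrative choice only."""
    order = sorted(scores, key=scores.get, reverse=True)
    ranks = {name: i + 1 for i, name in enumerate(order)}
    clusters, current = [], [order[0]]
    for prev, name in zip(order, order[1:]):
        if scores[prev] - scores[name] > gap:
            clusters.append(current)
            current = []
        current.append(name)
    clusters.append(current)
    return ranks, clusters

# A and B are closely scored; C and D trail by a wide margin,
# so the four entities fall into two clusters.
scores = {"A": 0.93, "B": 0.91, "C": 0.60, "D": 0.58}
ranks, clusters = rank_and_cluster(scores)
```

The point of the sketch is that the ranks alone ({A:1, B:2, C:3, D:4}) hide the large score gap between B and C, which the clustering step recovers.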

The goal of this dissertation is to improve ranking methodologies using combined models for road hotspot identification and for the assessment of traffic congestion levels in major metropolitan areas. For road hotspot identification, this dissertation addresses the issue of defining "similar" for road segments that are divided into clusters when assessing the risk of traffic crashes in those segments. To resolve this issue, similarity is quantified based on the crash patterns of road segments through a pair-wise comparison, and this metric is incorporated into the standard methods for predicting the number of traffic crashes. For the assessment of traffic congestion levels from a macroscopic standpoint, this dissertation conducts an independent investigation using a hybrid model, which is a different version of a combined model. While a combined model combines multiple methods into one unique method to obtain a single set of ranked results, a hybrid model applies multiple mutually independent methods to the same dataset and obtains a set of results from each, ensuring the consistency of results across methods. This dissertation is interested in assessing the validity of TomTom's rankings of metropolitan areas by traffic congestion level, which are determined from speed measurements of vehicles equipped with TomTom's GPS devices. Furthermore, the scope of this dissertation is not limited to transportation research: it also addresses the issue of voting bias in the Major League Baseball Hall of Fame voting procedure by scoring and ranking retired baseball players based on their performance statistics through machine learning algorithms.

1.1 Main Research Goal

The main goal of this dissertation is to rank entities more objectively and obtain more reliable results through combined models and a hybrid model. The methodologies proposed in this dissertation take into account additional features in the datasets to improve the accuracy of predictions over previously existing methods. Specifically, this dissertation applies these models in transportation research and sports analytics in order to show that they are less prone to biases than the previous and current methods in these research areas. For ranking, combined models take into account not only the ranks themselves but also the scores that are assigned to the entities. Based on the scores, the entities are ranked and clustered according to the score distribution.

1.2 Objectives

The objectives based on the main goal of this dissertation are explained below.

1.2.1 Qualitative and Quantitative Elements of Combined Models

Combined models comprise qualitative and quantitative elements. The qualitative element is the ranking and clustering, while the quantitative element is the scores. After scores are assigned to the ranked entities, the entities are divided into clusters based on the distribution of the scores.

1.2.2 Quantification of Similarity for Road Segment Clustering

Hauer et al. (2002) proposed methods for predicting the number of traffic crashes for similar road segments. However, their definition of "similar" is vague. To resolve this issue, similarity is defined based on the crash patterns of road segments through pair-wise comparison. This similarity metric is incorporated into the methods proposed by Hauer to improve traffic crash predictions. The performance of the combined model is demonstrated through a case study of one of the major highways in Phoenix, Arizona.
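The formal PDR definition is developed in Chapter 3. Purely as a hedged illustration of what a pair-wise crash-pattern similarity measure could look like, the sketch below computes the fraction of period pairs in which two segments' crash counts move in opposite directions; this is an assumption, not the dissertation's actual metric.

```python
def pairwise_discordance(x, y):
    """Fraction of period pairs in which two segments' crash-count
    histories x and y move in opposite directions. Illustrative only:
    the dissertation's formal PDR definition appears in Chapter 3."""
    n = len(x)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    discordant = sum(1 for i, j in pairs
                     if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return discordant / len(pairs)

# Two segments whose quarterly crash counts rise and fall together
# are maximally similar under this sketch (discordance 0).
d = pairwise_discordance([3, 5, 2, 8], [2, 6, 1, 9])
```

Under a measure of this kind, segments with low mutual discordance would be grouped into the same reference population for crash prediction.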

1.2.3 Road Hotspot Identification

Traffic crashes are considered extreme and rare events. Based on the combined model proposed above, road segments are ranked by the predicted number of traffic crashes to identify road hotspots. They are also ranked by the relative difference between the predicted and the actual number of traffic crashes in the test dataset. This metric indicates a surge in the number of traffic crashes for a certain road segment in a given timeframe. Moreover, the analysis is conducted by season of the year, day of the week, and hour of the day to pinpoint road hotspots during certain times.
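The surge metric described above can be sketched as follows, ranking segments by the relative difference between actual and predicted crash counts. The segment labels and counts are invented for illustration, not drawn from the case-study data.

```python
def surge_ranking(predicted, actual):
    """Rank road segments by the relative difference between actual and
    predicted crash counts; large positive values flag a crash surge."""
    rel = {seg: (actual[seg] - predicted[seg]) / predicted[seg]
           for seg in predicted}
    order = sorted(rel, key=rel.get, reverse=True)
    return order, rel

# Hypothetical segments: predicted vs. actual crashes in a test period.
predicted = {"MP-10": 4.0, "MP-11": 6.0, "MP-12": 2.0}
actual = {"MP-10": 5, "MP-11": 6, "MP-12": 5}
order, rel = surge_ranking(predicted, actual)
```

Here MP-12, though it has the fewest predicted crashes, tops the surge ranking because its actual count is 2.5 times its prediction.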

1.2.4 Unsupervised Learning Hybrid Model

This dissertation makes a distinction between a combined model and a hybrid model. Unlike the combined model, which merges multiple methods into one, the hybrid model evaluates each method separately and compares it with the other methods to ensure consistency of results. Such consistency lends the results higher validity. This approach is taken using unsupervised learning methods for measuring traffic congestion, which is the next objective of this dissertation.

1.2.5 Traffic Congestion Measurement

In 2015, TomTom ranked Tucson as the 21st most congested metropolitan area in the United States, while Phoenix was ranked 43rd. TomTom's ranks are based on speed measurements of vehicles equipped with TomTom's GPS devices. Past literature has also found that traffic congestion levels tend to grow with the size of an area's road network and population. These factors, along with the Pima Association of Governments' interest in the study, motivate an independent investigation into the assessment of traffic congestion levels.


1.2.6 Combined Model based on Machine Learned Ranking Algorithms for Voting Bias Detection

Currently, retired baseball players are voted on by members of the media for induction into the Major League Baseball Hall of Fame. The issue with the current voting procedure is that voters may be biased for or against certain players. To rank players and determine more objectively which players deserve to be in the Hall of Fame, machine learning algorithms are applied to a dataset of baseball performance statistics.
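Chapter 3 names an inverse mean-squared-error weighted sum for combining the two classifiers' probabilities. A plausible sketch of that idea, using hypothetical validation MSEs rather than the dissertation's actual values, is:

```python
def inverse_mse_weights(mse_a, mse_b):
    """Weights proportional to each model's inverse mean-squared error,
    so the more accurate model contributes more to the combined score.
    This mirrors the inverse-MSE weighted sum named in Chapter 3; the
    exact formulation there may differ."""
    inv_a, inv_b = 1.0 / mse_a, 1.0 / mse_b
    total = inv_a + inv_b
    return inv_a / total, inv_b / total

# Hypothetical validation MSEs for the SVM and NN models (not real values)
alpha, beta = inverse_mse_weights(0.090, 0.077)

# Combined Hall of Fame probability for one player (illustrative inputs)
p_combined = alpha * 0.81 + beta * 0.74
```

Because the weights are normalized to sum to one, the combined score always lies between the two individual classifier probabilities.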

1.2.7 Human-Machine Symbiosis

Both humans and machines have their own strengths and weaknesses. Human decision-making tends to be biased by personal experience, but humans tend to be more flexible in their decisions. Machines, on the other hand, can detect hidden patterns in large datasets very quickly, but they can be rigid. Hence, to offset these respective weaknesses, decision-making is optimal when both human and machine components are taken into account. It is important to emphasize that machines are not intended to replace humans.

1.3 Organization of the Remainder of the Dissertation

The remainder of the dissertation is organized as follows. Chapter 2 reviews the past literature on road safety, traffic congestion measurement, and the use of combined models and machine learning algorithms to improve model performance. Chapter 3 explains the methodologies used to accomplish the research objectives stated in Section 1.2. Chapter 4 tests the performance of the Enhanced Empirical Bayesian (EB) method through a case study of one of the major highways in Phoenix for road hotspot identification. Chapter 5 ranks major metropolitan areas in the United States by traffic congestion level using the hybrid model and compares the results with TomTom's ranks. Chapter 6 objectively assesses the performance of retired baseball players using machine learning algorithms to identify potential biases in the Major League Baseball Hall of Fame voting procedure. Chapter 7 provides a summary of the key findings of this dissertation as well as concluding remarks.


CHAPTER 2

BACKGROUND AND LITERATURE REVIEW

This chapter provides a comprehensive literature review for this dissertation. Section 2.1 reviews previous literature on road safety and on predicting the number of crashes using the standard Empirical Bayesian method. Section 2.2 discusses past work on traffic congestion measurement, hybrid models, and ranking entities using principal component analysis. Section 2.3 explains the potential bias in human decision-making and previous work's use of combined models and machine learning algorithms to improve model performance. Lastly, Section 2.4 summarizes the literature review.

2.1 Road Safety and Hotspot Identification

Traffic safety engineers and researchers are entrusted with the responsibility of continuously mitigating the crash risk of relatively unsafe sites, also referred to as hotspots, sites with promise, or black-spots (Hauer et al., 2004; Maher and Mountain, 1988; Cheng and Washington, 2005; Huang et al., 2009). The safety of road users is of paramount importance in the transportation industry due to the enormous monetary and emotional burden caused by vehicular crashes across the accident severity levels (Blincoe et al., 2002). To maintain a safe driving environment, the traffic safety management process begins with network screening, also called hotspot identification (HSID), followed by problem diagnosis, countermeasure identification, and project prioritization. This foremost step of HSID is particularly crucial for extracting the most benefit from the limited financial resources allocated towards crash remediation. Detecting truly hazardous sites, based on underlying safety, yields the maximum benefit because funds are not wasted rectifying sites that are already safe.

Many methods have been proposed for HSID in the past. The traditional methods rely on the observed crash count (Deacon et al., 1975) and crash rate (Norden et al., 1956). Empirical crash data were observed to be subject to the phenomenon of regression-to-the-mean (RTM) bias, which poses a challenge to the identification of hotspots (Hauer, 1986; Persaud, 1988; Wright et al., 1988; Hauer, 1996; Carlin and Louis, 2000; Meza, 2003; Carriquiry and Pawlovich, 2004). The RTM bias stems from the fact that collision frequency on a roadway is stochastic in nature: it may fluctuate at a particular site over a short period of time without reflecting any changes in the factors that affect the true underlying safety, as the long-term average collision frequency remains near its true mean (Lee et al., 2016). To eliminate the RTM bias associated with empirical crash data, some researchers proposed and implemented the sophisticated Empirical Bayesian (EB) framework (Hauer and Persaud, 1987; Hauer et al., 1991; Hauer et al., 2002; Cheng and Washington, 2005; Wu et al., 2014). This approach combines prior and current information to estimate the expected safety of the particular sites under consideration. The prior information is the expected crash count from a group of similar sites (the reference population), while the current information is the observed crash frequency at the site concerned. The point estimates from the two types of information are combined to obtain an improved estimate of the expected long-term safety of a site.
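In its standard textbook form (after Hauer et al., 2002), the EB estimate is a weighted blend of the reference-population expectation and the site's observed count, with the weight governed by the overdispersion parameter of the safety performance function. A sketch of this standard form, with illustrative numbers:

```python
def empirical_bayes_estimate(mu_ref, phi, observed):
    """Blend the reference-population expectation (prior) with the site's
    observed crash count (current information), in the standard EB form.

    mu_ref   -- expected crashes from similar sites (reference population)
    phi      -- overdispersion parameter of the safety performance function
    observed -- crash count actually observed at the site
    """
    w = 1.0 / (1.0 + mu_ref / phi)          # weight on the prior estimate
    return w * mu_ref + (1.0 - w) * observed

# A site with 9 observed crashes where similar sites average 4 per period:
# the EB estimate pulls the observation back toward the reference mean,
# mitigating the regression-to-the-mean bias of the raw count.
est = empirical_bayes_estimate(mu_ref=4.0, phi=2.0, observed=9)
```

Because the weight w shrinks as the reference expectation grows relative to the overdispersion, sites with strong prior information lean more on the reference population, and sites with noisy priors lean more on their own observed counts.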

The accuracy of EB crash estimates depends heavily on the selection of the reference population. In the case of intersections, incorrect selection of similar sites may have significant implications for long-term results, as intersections are among the most dangerous locations in a road network due to the greater tendency for various types of crashes to occur there (Hauer et al., 1989; Abdel-Aty et al., 2005; Kumara and Chin, 2005). However, the similarity of the intersections used to generate the safety performance function (SPF) is a subjective concept: it refers to intersections that share common traits, such as traffic flow, facility location, and local weather, with the given intersection. A formal and scientific definition of intersection similarity, which is the basis of the EB method procedure, has long been obscure. This motivates the author to fill in the blanks regarding segment similarity measurement in the field of traffic safety analysis.

2.2 Traffic Congestion Measurement

Many research studies pertaining to measuring traffic congestion have been conducted. Yi et al. (2013) quantified congestion by defining three congestion metrics, namely transportation environment satisfaction (TES), travel time satisfaction (TTS), and traffic congestion frequency and feeling (TCFF), based on survey data obtained from travelers in Shanghai, in order to help policy makers and traffic planners better understand traffic congestion from the travelers' point of view. Han et al. (2012) took a microscopic simulation approach, measuring the congestion intensity of each road segment and weighting the segments by their connectivity within the road network to obtain the overall road network congestion intensity. Some researchers have measured traffic congestion using travel times and speed measurements of probe vehicles through mobile devices (Pattara-atikom et al., 2007; Mandal et al., 2011; Al-Sobky and Mousa, 2016). Another approach is the use of GPS via probe vehicles (D'Este et al., 1999; Pattara-atikom et al., 2006; Kong et al., 2015). Currently, TomTom uses its GPS navigation devices to collect speed measurement data and compute an overall congestion index for all major metropolitan areas around the world.

2.2.1 Specific Aspects of Traffic Congestion

Currently, TomTom ranks metropolitan areas by traffic congestion level based on its overall traffic congestion index. The limitation of TomTom's ranking is that this index is based solely on speed measurements of vehicles equipped with TomTom's GPS devices, and relying on a single measurement does not provide the big picture needed for decision-making. Gleckler et al. (2008) emphasized the importance of taking a wide range of variables into account in climate models, since each variable provides different information and thus a single variable cannot accurately represent other variables or the climate of atmospheric fields as a whole. Similarly, Findlay and Reid (2002) found it preferable to use individual performance variables rather than a single performance index for predicting Major League Baseball Hall of Fame inductions. Hence, relying on specific attributes rather than a single overall metric is more reliable for decision-making.

More specifically, past research on traffic congestion has focused on congestion levels during peak travel hours. The Texas A&M Transportation Institute (TTI) defined traffic congestion through its Travel Time Index metric, calculated as peak travel time divided by travel time under free-flow conditions. Boarnet et al. (1998) measured traffic congestion for California highways based on highway capacity and traffic volume during peak travel hours. Zhang and Zhang (2008) studied the underlying reasons for high peak-hour traffic congestion in Shanghai. Another study formulated the transportation network design problem (NDP) model based on both morning and evening peak hours instead of a single peak hour, since the two peak periods exhibit different traffic characteristics (Wang et al., 2014). In addition, a congestion pricing model that charges drivers fees for entering downtown areas during peak hours was developed in order to improve traffic flow and reduce traffic emissions (Wang and Li, 2009).

More studies have shown that traffic congestion levels tend to be higher for larger cities (Chang et al., 2017; Louf and Barthelemy, 2014). Specifically, Louf and Barthelemy (2014) found that the total length of a road network scales sublinearly with the total population of a city, while the daily total delay due to congestion scales super-linearly with the total population. Hence, as the total population increases, so do the total road network length and the delay time caused by congestion. Also, Braess's paradox postulates that adding more roads to a road network does not alleviate congestion but, in fact, makes it worse (Braess, 1968; Braess et al., 2005). Therefore, more complex road networks with a higher total road length experience higher congestion levels. Based on these findings, this dissertation will later show, using the hybrid model, that Tucson, which is smaller than Phoenix, is less congested than Phoenix, a finding that contradicts TomTom's rankings.

2.2.2 Hybrid Model

Multiple research papers have implemented hybrid models in different areas. Shukur and Lee (2015) applied a hybrid model consisting of neural networks and the Kalman filter based on the linear autoregressive integrated moving average (ARIMA) model, instead of the standard ARIMA model, to improve wind speed forecasting. Liou et al. (2007) proposed a hybrid model that uses both the decision-making trial and evaluation laboratory (DEMATEL) method and the analytic network process (ANP) to quantitatively measure airline safety through an airline safety index. Also, Behera and Panigrahi (2015) implemented a hybrid model of fuzzy clustering and neural networks for credit card fraud detection and improved the effectiveness of fraud detection by reducing the false alarm rate. Cooper and Sommer (2016) applied Agile methods, which are primarily used in the software industry for project management, in conjunction with the traditional Stage-Gate model for the development of physical products. The upsides of Agile methods are quicker adaptation to changes, accounting for uncertainty, and faster review cycles to respond to unexpected events; the downsides are that teams may not understand how to apply such methods effectively and that too much flexibility can make teams counterproductive. Meanwhile, the advantages of the Stage-Gate model are bringing successful products to market faster and taking a simple, focused, and disciplined approach to the development process, but a major disadvantage is that it limits creativity. The proposed method of using both the Agile and Stage-Gate approaches by Cooper and Sommer (2016) was tested at the LEGO toy company, and initial results found that the method led to several benefits such as faster response to changing customer needs and improved team communication and productivity. In addition, hybrid models based on support vector machines and the ARIMA model (Pai and Lin, 2005), as well as the ARIMA model, exponential smoothing, and recurrent neural networks (Rather et al., 2015), led to improved predictions for stock prices and stock returns, respectively.

2.2.3 Principal Component Analysis and Ranking

Researchers have used principal component analysis (PCA) to rank entities in many applications. Fang et al. (2018) found that a combined ranking method based on PCA provided a more reliable ranking of the systemic risk of Chinese banks. Another application is ranking blueberry cultivars based on their resistance to a fruit infection using resampling and PCA (Ehlenfeldt et al., 2010). Also, PCA was applied to rank countries based on their readiness for internet-based retailing (Sharma, 2008). Additional areas of application include ranking top cricket players (Manage and Scariano, 2013) and breast cancer detection (Sharma and Saroha, 2015). Furthermore, PCA ranking has been applied to information retrieval and computer vision. Cai et al. (2015) took a combined active sampling and kernel principal component analysis approach to improve performance in retrieving and ranking relevant web documents. For computer vision, the principal components of the features of a person's face are ranked based on their alignment with the direction of the separating hyperplane (Thomaz and Giraldi, 2010). Such ranking of the principal components led to improved facial recognition rates.

2.3 Machine Learning Algorithms

The final portion of the literature review goes beyond transportation research and shifts its focus to applying combined models in sports analytics. In this case, this dissertation focuses on potentially resolving the issue of voting bias among the sportswriters who serve as voters in the Major League Baseball Hall of Fame (HOF) voting process.

2.3.1 Voting Bias

Research on baseball and voting bias began when studies assessed the existence of voters' racial discrimination against African American and Latin American players. Overall, they found that racial bias during HOF voting (Desser et al., 1999; Findlay and Reid, 1997; Jewell et al., 2002; Jewell, 2003; Mills and Salaga, 2011) and All-Star Game voting (Depken II and Ford, 2006; Hanssen and Andersen, 1999; Hanssen, 2002) was either very limited or not significant. Although racial bias in baseball voting did not have a significant effect on voters' choices, it still raises the possibility of bias in favor of or against certain players.

Since HOF voters are members of the media, this raises the concern that the way the media covers a wide variety of topics, including baseball, can affect public opinion. DellaVigna and Kaplan (2007) found that between 1996 and 2000, Fox News significantly affected the Presidential and Senate elections; based on their study, Fox News convinced voters to choose Republican candidates. Meanwhile, Gerber et al. (2009) found that media exposure from the Washington Post or the Washington Times inclined people to vote for the Democratic candidate during the 2005 Virginia gubernatorial election. Moreover, media firms present information that is most likely to boost their audience ratings instead of the objective information that people should know. Bernhardt et al. (2008) concluded that even though voters are aware of media biases, such biases in the information presented can lead voters to wrong decisions during elections. Chiang and Knight (2011) also determined that media bias influenced voters' decisions: voters are affected by the credibility of newspaper endorsements of the political candidates favored by the newspaper companies.

2.3.2 Combined Model

In the era of big data over recent years, machine learning algorithms have been applied to solve many kinds of problems. One of the most popular machine learning algorithms of the past decades is the Support Vector Machine (SVM), which has been considered a gold standard among existing algorithms for comparison purposes. It has been used in several areas of research such as facial recognition (Guo et al., 2001; Zhang and Qiao, 2003; Jia and Martinez, 2009) and cancer detection (Liu et al., 2003; Zhang and Liu, 2004; Shukla, 2016). Another machine learning algorithm that has been widely used over recent years is the Neural Network (NN). NNs have been applied in many areas (Paliwal and Kumar, 2009) ranging from fraud detection and marketing to wind power prediction and coronary artery disease detection.

To resolve the issue of voting bias, multiple research studies have attempted to determine HOF inductions based on player performance. Findlay and Reid (2002) found that it is preferable to use individual performance variables rather than a single performance index for HOF predictions. Machine learning algorithms such as Neural Networks (Young et al., 2008), Random Forests (Freiman, 2010), and the Radial Basis Function (RBF) network (Lloyd and Downey, 2009) have been applied to predict HOF inductions. However, these studies have some limitations. One limitation is that each relies on only one algorithm, which makes the results questionable from a reliability standpoint. Another is that they focus on HOF predictions for batters only. To resolve this issue, the analysis of this dissertation is extended to both batters and pitchers. Also, this dissertation divides players into multiple groups to more accurately assess the HOF-worthiness of each player relative to other players in the same group.

One resolution to the limitation of relying on only one algorithm is a combined model. In fact, the use of combined models has been shown to improve model performance in other research areas. Xiao et al. (2015) took into account both past data and optimized weight coefficients of neural networks to better forecast electrical power load in Australia. Zhang et al. (2017) proposed a combined model of Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) and the Flower Pollination Algorithm with Chaotic Local Search (CLSFPA) to predict short-term wind speed; the combined model improved wind speed predictions compared to the individual CEEMDAN and CLSFPA models. Avendi et al. (2016) took a combined approach of convolutional neural networks and deformable models for segmentation of the left ventricle of the heart in magnetic resonance imaging (MRI) and found that it performed better than the current baseline methods. Wei et al. (2016) effectively predicted the spread of hepatitis using a combined model of the Autoregressive Integrated Moving Average (ARIMA) and the Generalized Regression Neural Network (GRNN); the combined model predicted hepatitis incidence better than the individual ARIMA and GRNN models.

2.4 Literature Summary

The literature review has shown that combined models are more reliable across several research domains. In this dissertation, combined models are applied in transportation research and sports analytics. For transportation research, this dissertation attempts to fill the gap in defining similarity for clustering road segments in order to predict the number of traffic crashes and identify road hotspots. It also takes into account multiple aspects of traffic congestion instead of relying on a single metric to more accurately measure the congestion levels of metropolitan areas. For sports analytics, the use of a combined model based on machine learning algorithms is explored for Major League Baseball Hall of Fame classification.


CHAPTER 3

PROPOSED METHODOLOGIES

This section covers the methodologies proposed by this dissertation to fulfill the research objectives specified in Chapter 1. Section 3.1 describes the Empirical Bayesian method enhanced by the Proportion Discordance Ratio (PDR) similarity matrix for predicting the number of traffic crashes and ranking road segments for hotspot identification. It is followed by the hybrid model that consists of the Normalized Scoring Method (NSM), Principal Component Analysis (PCA), and the PDR similarity matrix to rank metropolitan areas based on their traffic congestion levels in Section 3.2. Then, Section 3.3 discusses the use of a combined model based on Support Vector Machines (SVM) and Neural Networks (NN) to rank retired baseball players in order to objectively determine which players deserve to be in the Hall of Fame based on their past performance statistics. The final section, Section 3.4, briefly summarizes the methodologies applied in this dissertation.

3.1 Enhanced Empirical Bayesian Method

This section provides a detailed description of the enhancement to the Standard EB method through the PDR measure, which is the key component of the Enhanced EB method developed in this dissertation for identifying a set of similar sites for a roadway segment.

3.1.1 Similarity Measure

In this section, a brief background on similarity is provided, followed by the description of the PDR. Note that similarity is inherently subjective and is assessed with respect to a certain set of features. To give a simple example, a spherical object can be described by its color, size, and texture. Let us define color by the grayscale value, size by the radius, and texture by the coefficient of friction. In this case, color, size, and texture are represented by a1, a2, and a3 respectively. Each spherical object is described by the three features, and the quantification of the similarity of objects based on these features is the point of interest. In this dissertation, crash patterns are the feature sets, and road segments are the objects. The objective is to assess the similarity of road segments based on the chosen features of the road.

Our discussion in this dissertation is based on the feature space of crash patterns. The similarity of two road segments is determined by a certain numerical measurement based on their corresponding crash patterns. Measuring road segment similarity through road segment crash patterns and comparing any two road segments through a matrix form is discussed in more detail in the respective subsections Segment Similarity Measurement and the PDR below.

3.1.1.1 Road Segment Similarity Measurement

As discussed in the previous section, similarity measurement depends on the feature space. In the study of segment similarity, the first task is to generate and describe the crash pattern space, since segment safety analysis is based on the set of crash patterns. To describe crash patterns for crashes that occur on a roadway segment, both qualitative and quantitative characteristics are taken into account. In this dissertation, the features for the crash patterns are weather conditions, crash location, and crash type. The crash location is based on crashes that occurred either in the vicinity of highway exits or in the middle of the highway for a particular road segment. For the crash type, it is based on crash severity levels.

Each crash feature can be classified into smaller subcategories. Ideally, for each crash feature, its subcategories should be independent of each other. New categories for the three general crash features are then generated. The crash pattern space consists of all possible combinations of the three selected features. Suppose there are n possible crash patterns included in the feature space F; then a certain segment x can be described by a vector I(x) with n elements, where each element represents the number of crashes corresponding to a certain crash pattern:

I(x) = (a1, a2, …, an) (Eq. 3.1)

Given such an expression of segment crashes, there may be a relatively large number of crash patterns with relatively small numbers of corresponding crashes. One way to resolve this problem is to omit trivial ai's by introducing the idea of a sufficient set (Nowakowska, 2002). The sufficient set of segment x is created by reordering the crash counts of the different crash patterns by magnitude and selecting only the patterns accounting for the first 80% of the crashes. This can be achieved using the PDR, which is discussed in more detail in the section below.

3.1.1.2 Proportion Discordance Ratio (PDR)

The idea of using the PDR to measure the proportion similarity and the structure similarity between any two segments was first presented by Nowakowska (2002). The essence of this method is to first construct the sufficient sets for x and y respectively by taking the 80% most frequent crash patterns. To obtain the 80% most frequent triplet values, crash patterns that contain very few or zero crashes are first eliminated; the number of crash patterns eliminated this way is represented as p. Then, the top 80% of the remaining crash patterns in terms of the number of crashes per pattern are taken into consideration to determine similarity among road segments; the number of crash patterns eliminated after applying the 80% rule is represented as q. As a result, the number of crash patterns in feature space F is m = n – p – q. Afterwards, the PDR value for any two road segments is calculated.

(Eq. 3.2)

where pi(x) and pi(y) are the percentages of crashes corresponding to Crash Pattern i after sufficient sets have been applied to Road Segments x and y respectively. Note that PDR values obtained from Eq. 3.2 are actually dissimilarity values. After obtaining these values, the corresponding similarity value is calculated by

s(x, y) = 1 – PDR(x, y) (Eq. 3.3)

The idea of similarity measurement for two road segments can be extended to a more general case. Suppose there are a total of m road segments, each described by its crash vector as represented by Eq. 3.1. All m road segments should have the same number of crash patterns, which is achieved by setting missing values to zero. The m×m symmetric similarity matrix SS is then generated by applying Eq. 3.2 and Eq. 3.3 a total of m² times to measure the similarities for these m segments.

SS = [sij], i, j = 1, …, m (Eq. 3.4)

where sij denotes the similarity between Road Segments i and j, and 0 ≤ sij ≤ 1. The similarity matrix SS should have two basic characteristics:


sij = sji for all i, j (symmetry)

sii = 1 for all i (self-similarity)

The objective is to find a set of road segments that consists of the “most similar” segments among the given m segments. In this dissertation, in order to determine a pair of road segments to be similar, it is assumed that sij ≥ 0.9.
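The construction of the similarity matrix SS and the 0.9 similarity threshold described above can be sketched in Python. Since Eq. 3.2 is not reproduced in this text, the PDR here is assumed to be half the total absolute difference between the two segments' crash-pattern proportions (one plausible form of a proportion-discordance measure); the segment data are hypothetical.

```python
import numpy as np

def pdr(px, py):
    """Assumed PDR between two segments' crash-pattern proportion
    vectors: a dissimilarity value in [0, 1]."""
    return 0.5 * np.abs(np.asarray(px) - np.asarray(py)).sum()

def similarity_matrix(P):
    """Build the symmetric m x m similarity matrix SS with
    s_ij = 1 - PDR (Eq. 3.3) and s_ii = 1 (self-similarity)."""
    m = len(P)
    SS = np.ones((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            s = 1.0 - pdr(P[i], P[j])
            SS[i, j] = SS[j, i] = s
    return SS

# Three road segments described by crash-pattern proportions (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.5, 0.3, 0.2],
              [0.1, 0.1, 0.8]])
SS = similarity_matrix(P)

# Pairs of segments considered similar under the s_ij >= 0.9 rule.
similar_pairs = [(i, j) for i in range(len(P)) for j in range(i + 1, len(P))
                 if SS[i, j] >= 0.9]
```

The threshold of 0.9 is the one assumed throughout this dissertation; with the data above, only the two identical segments qualify as similar.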

The next section covers the Empirical Bayesian (EB) method. An introduction to the EB method is first provided. To assess the similarity of clusters in the Standard EB method, the K-means algorithm is applied. This is followed by the enhancement to the EB method, in which the PDR similarity values obtained from Eq. 3.3 and Eq. 3.4 are used. This dissertation will later show that the Enhanced EB method is preferable to the Standard EB method.

3.1.2 Empirical Bayesian Method

After all observations from the dataset have been assigned to their respective clusters using the K-means algorithm based on weather conditions, whether crashes occurred at highway exits or in the middle of the highway, and crash severity levels, the EB method is applied to each cluster. The EB estimate obtained from the EB method is the weighted sum of the expected value and the actual value, where the weights are determined based on the number of actual measurements for a certain variable (Hauer, 2002). From a traffic safety standpoint, the EB method can be written as:

EB estimate = w × (expected number of crashes) + (1 – w) × (actual number of crashes) (Eq. 3.5)

In this case, the expected value is the expected number of crashes, and the actual value is the actual number of crashes. To obtain the expected number of crashes, the following formula is used:

(Eq. 3.6)

(Eq. 3.7)

where L is the road segment length in kilometers, Y is the time duration in years during which a certain number of crashes occurred in a particular road segment, AADT is the value of the annual average daily traffic that measures the average traffic volume in road segments, and a and b are the regression parameters from the regression line of the plot of the total number of traffic crashes vs. AADT.

Road segments that are considered to be similar should have the same AADT value in Eq. 3.6 and Eq. 3.7. To obtain AADT values for similar road segments, each road segment's AADT value is divided by its road segment length and number of lanes to obtain traffic density. Then, the K-means algorithm is applied to cluster road segments based on initial centroid values pertaining to the chosen features. After all road segments are assigned to their respective clusters, the average traffic density for each cluster is multiplied by the average road length and the average number of lanes for the respective cluster. This value is the AADT that is used in Eq. 3.6 and Eq. 3.7 to calculate the expected number of crashes.

Then, the weights for the expected number of crashes are determined based on the following formula:

(Eq. 3.8)

where μ is the expected number of crashes per kilometer per year, Y is the number of years, and k is the overdispersion parameter. According to the Highway Safety Manual, the overdispersion parameter is calculated as 0.236/L, where L is the length of the road segment in miles.

After obtaining the EB estimate for each time interval, the EB estimates are averaged over all time intervals; the estimates in that period are treated as training data. These averages are the predictions of the number of crashes for a particular road segment. The averaged EB estimate is then compared to the actual number of crashes in the "future" time interval, which is treated as validation data. This comparison is quantified using an error percentage, calculated as:

Error percentage = |averaged EB estimate – actual number of crashes| / (actual number of crashes) × 100% (Eq. 3.9)

It is important to emphasize that in order to obtain accurate results from the EB method, the assumption that each road segment is relatively short must hold. After the estimates based on the actual and expected numbers of crashes are obtained from the EB method, the road segments are ranked based on the weighted-sum estimates from the EB calculations. In this way, the safety levels of road segments are assessed to identify segments with a high risk of crashes so that safety features and policies can be implemented to reduce that risk.
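A minimal sketch of the EB estimation and validation steps above, in Python. Since the bodies of Eq. 3.5, Eq. 3.8, and Eq. 3.9 are not reproduced in this text, the weight form w = 1/(1 + k × expected) follows the common Highway Safety Manual convention, the error percentage is assumed to be the absolute relative error, and the numbers are hypothetical.

```python
def eb_estimate(expected, actual, k):
    """EB estimate as a weighted sum of the expected and actual crash
    counts (Eq. 3.5). Weight form assumed: w = 1 / (1 + k * expected)."""
    w = 1.0 / (1.0 + k * expected)
    return w * expected + (1.0 - w) * actual

def error_percentage(predicted, actual):
    """Assumed form of Eq. 3.9: absolute relative error in percent,
    comparing the averaged EB estimate to the 'future' actual count."""
    return abs(predicted - actual) / actual * 100.0

L_miles = 0.5                 # a short segment, as the EB method assumes
k = 0.236 / L_miles           # HSM overdispersion parameter, 0.236/L
est = eb_estimate(expected=4.0, actual=6.0, k=k)
err = error_percentage(est, 6.0)
```

Because the EB estimate is a convex combination, it always falls between the expected and the actual crash counts, pulling the observed count toward the long-run expectation.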

3.1.3 Enhancement to the Empirical Bayesian Method

This dissertation proposes to enhance the EB method by using the PDR to improve prediction on the number of crashes. The purpose of the PDR-based similarity matrix, which is an n-by-n square matrix where n is the total number of road segments for a chosen road, is to objectively quantify road segment similarity. When similarity in general is assessed for comparison purposes, results tend to be subjective and biased. When two things are compared, some people may say they are similar while others may disagree and conclude they are not similar. The way people define similarity varies based on their definition of similarity and their past experiences.

The major advantage of using the PDR-based similarity matrix is that results based on this matrix are more objective. In this dissertation, the similarity matrix takes into account crash patterns that are based on weather conditions, whether crashes occurred at highway exits or in the middle of the highway, and crash severity levels. Among similar road segments, the proportion of the number of crashes that are categorized in a particular crash severity level out of the total number of crashes for a particular road segment is calculated. The formula to calculate the expected number of crashes for each severity level is modified as follows:

(Eq. 3.10)

The procedure to calculate AADT values among similar road segments is the same as the one for the Standard EB method through Eq. 3.6 and Eq. 3.7. However, the key difference is that instead of the K-means algorithm and its clusters, the PDR-based similarity matrix is used to assess similarity among road segments. Through the similarity matrix, road segments are considered similar if the similarity values obtained from Eq. 3.4 are at least 0.9.

Also, the procedure for obtaining the value of the crash severity proportion per severity level for similar road segments is very similar to the one for obtaining AADT values for similar road segments. For each road segment, the sum of the number of crashes for each severity level over the past time intervals in the training data is obtained. Then, the percentage of crashes for each respective severity level out of the total number of crashes in the training data is computed.

If similarity values from the similarity matrix are at least 0.9, the percentages of crashes for each severity level in these similar road segments are averaged for the respective severity level. This average for each severity level is the crash severity proportion value for that severity level. After the averaged EB estimates for each severity level are computed through Eq. 3.5, Eq. 3.8, and Eq. 3.10, the EB estimates over all severity levels are summed together to get the overall EB estimate.

Table 3.1 summarizes the key differences between the proposed Enhanced EB method and the Standard EB method.


Enhanced EB Method:
- Uses similarity values obtained from the similarity matrix via the PDR to determine road segment similarity
- The similarity matrix takes into account crash patterns that consider all possible combinations of the selected features
- Objectively quantifies similarity through the similarity matrix
- For a particular road segment, AADT values and crash severity proportion values are averaged among similar road segments with similarity values of at least 0.9
- Applied multiple times for different levels of crash severity for each road segment

Standard EB Method:
- Uses the K-means algorithm to determine road segment similarity
- Combinations of the selected features are very limited in the K-means algorithm cluster centroids
- Similarity is more subjective; results are biased by the predefined initial cluster centroid values of the K-means algorithm
- Only AADT values are averaged among similar road segments
- Applied only once for each road segment

Table 3.1: Key Differences Between the Enhanced EB Method and the Standard EB Method

3.2 Unsupervised Learning Hybrid Model

This section provides a detailed description of the hybrid model that is comprised of three unsupervised learning methods: the Normalized Scoring Method (NSM), Principal Component Analysis (PCA), and the similarity measure based on the Proportion Discordance Ratio (PDR).

3.2.1 Normalized Scoring Method (NSM)

The values of each of the 7 features are first normalized, where each normalized value is defined to be:

ni(a) = (xa,i – xmin,i) / (xmax,i – xmin,i) (Eq. 3.11)

where xa,i is the value of Feature i for Metropolitan Area a, xmin,i is the minimum value of Feature i among all major metropolitan areas, xmax,i is the maximum value of Feature i among all major metropolitan areas, and ni(a) is the normalized value of Feature i for Metropolitan Area a, with 0 ≤ ni(a) ≤ 1.

Then, major metropolitan areas are ranked by summing all the normalized values from Eq. 3.11 for each metropolitan area, where each normalized value represents one selected feature. For Metropolitan Area a, the normalized scores obtained from all the features are summed together:

Sa = n1(a) + n2(a) + … + n7(a) (Eq. 3.12)

where Sa is the total score assigned to Metropolitan Area a. The higher the value of Sa is, the more congested Metropolitan Area a is based on the given features.
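The NSM described above amounts to min-max normalization of each feature column followed by a row sum. A brief sketch, using three hypothetical congestion features instead of the seven actual ones:

```python
import numpy as np

def nsm_scores(X):
    """Normalized Scoring Method: min-max normalize each feature
    column (Eq. 3.11) and sum across features for each metropolitan
    area (Eq. 3.12). A higher total score means more congested."""
    X = np.asarray(X, dtype=float)
    n = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    return n.sum(axis=1)

# Hypothetical feature values (rows: metro areas A, B, C).
X = [[60.0, 1.2, 30.0],   # area A: largest on every feature
     [40.0, 1.0, 10.0],   # area B: smallest on every feature
     [50.0, 1.1, 20.0]]   # area C: in the middle
S = nsm_scores(X)
ranking = np.argsort(-S)  # indices of areas, most congested first
```

Each normalized value lies in [0, 1], so with p features the total score Sa lies in [0, p], and the area with the maximum value on every feature receives the maximum possible score.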

3.2.2 Principal Component Analysis (PCA)

If the dataset contains n rows and p features, the maximum possible number of principal components is min(n – 1, p). For Metropolitan Area a, each principal component is defined as:

PCi(a) = βi,1X1(a) + βi,2X2(a) + … + βi,pXp(a) (Eq. 3.13)

where β represents the weight assigned to each normalized feature value, and X represents the normalized feature. The sum of the squared values of the β's is constrained to 1:

βi,1² + βi,2² + … + βi,p² = 1 (Eq. 3.14)

Each normalized feature has a mean of 0 and a standard deviation of 1:

mean(Xi) = 0, sd(Xi) = 1 for each normalized feature Xi (Eq. 3.15)

The covariance matrix of the normalized features is a p×p matrix. The diagonal elements are the variances of the respective normalized features, while the off-diagonal elements represent the covariances of the corresponding pairs of features.

C = [Cov(Xi, Xj)], i, j = 1, …, p (Eq. 3.16)

In contrast, the covariance matrix of the principal components is a z×z matrix, where z = min(n – 1, p). The diagonal elements are the variances of the respective principal components, with the values of the diagonal elements in descending order. Meanwhile, the off-diagonal elements are all 0.

Cov(PC) = diag(Var(PC1), Var(PC2), …, Var(PCz)) (Eq. 3.17)

The relationship between the two covariance matrices is that the sum of the diagonal elements in Eq. 3.16 is equal to the sum of the diagonal elements in Eq. 3.17.

Var(X1) + Var(X2) + … + Var(Xp) = Var(PC1) + Var(PC2) + … + Var(PCz) (Eq. 3.18)

Through Eq. 3.17, the proportion of variance for Principal Component i is defined as:

Proportion of variance for PCi = Var(PCi) / (Var(PC1) + Var(PC2) + … + Var(PCz)) (Eq. 3.19)

The higher the proportion of variance for a particular principal component, the more meaningful that principal component is to the analysis in this dissertation. In PCA, the correlation between any two principal components should be 0. Since the principal components are uncorrelated, their directions are orthogonal to each other, which is why the off-diagonal elements in Eq. 3.17 are all 0.
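The covariance relationships above can be checked numerically: after standardizing the features, the eigenvalues of the covariance matrix are the principal-component variances, their sum equals the total variance (the trace equality of Eq. 3.18), and normalizing them gives the proportions of variance (Eq. 3.19). The data here are synthetic.

```python
import numpy as np

# Synthetic dataset: 100 areas, 4 features, with one correlated pair
# so that the first principal component dominates.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=100)

# Standardize each feature to mean 0, sd 1 (Eq. 3.15).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigendecomposition of the feature covariance matrix (Eq. 3.16).
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals = eigvals[::-1]                # descending variance order (Eq. 3.17)
prop_var = eigvals / eigvals.sum()     # proportions of variance (Eq. 3.19)

# Trace equality (Eq. 3.18): total variance is preserved.
assert np.isclose(eigvals.sum(), np.trace(cov))
```

Because the eigenvectors of a symmetric covariance matrix are orthogonal, this construction also makes the principal-component directions orthogonal, matching the zero off-diagonal elements of Eq. 3.17.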

3.2.3 Proportion Discordance Ratio (PDR) Similarity Matrix

The use of the PDR to measure the proportion similarity between any two entities was first presented by Nowakowska (2002). The size of the PDR matrix depends on how many major metropolitan areas in the US are being compared. Let the set of features be represented as F. Each matrix element represents a dissimilarity value between two metropolitan areas. A matrix element for Metropolitan Areas a and b is calculated as

(Eq. 3.20)

where m is the number of features. Note that PDR matrix elements are actually dissimilarity values. After obtaining the PDR(a,b) value in Eq. 3.20, its similarity value is obtained by

s(a, b) = 1 – PDR(a, b) (Eq. 3.21)

Note that the PDR matrix is a square matrix, and all diagonal elements have the value of 1, since a metropolitan area must be 100% similar to itself.

3.3 Machine Learned Ranking Combined Model - Supervised Learning Approach

This section provides a detailed description of the two machine learning algorithms that will be used in this dissertation. More specifically, this dissertation takes a combined model approach using these two algorithms to provide a reliable score for each baseball player in terms of the player’s HOF worthiness. Afterwards, the scores are ranked and clustered to determine each player’s HOF worthiness.

3.3.1 Inverse Mean-Squared Error (MSE) Weighted Sum

Each of the two machine learning algorithms outputs a probability value for a certain player. Eq. 3.22 shows that the overall probability is the weighted sum of the two probability values from the respective algorithms, where the weights are based on the variance. The overall probability is treated as a score metric to rank all baseball players at a certain position in order to measure the overall performance of their playing careers.

P = α·PA + β·PB (Eq. 3.22)

where PA and PB are the probabilities that a certain baseball player should be in the HOF based on Machine Learning Algorithms A and B respectively. The weights α and β pertain to the probabilities obtained from the respective machine learning algorithms. They are calculated through Eq. 3.23 and Eq. 3.24, which are based on the respective variances; the higher the variance, the lower the weight.

α = (1/MSEA) / (1/MSEA + 1/MSEB) (Eq. 3.23)

β = (1/MSEB) / (1/MSEA + 1/MSEB) (Eq. 3.24)

The variances are based on the Mean-Squared Error (MSE), which is the average squared difference between the actual value and the predicted value. They are calculated through Eq. 3.25 and Eq. 3.26. The actual value is binary: 1 if the player was inducted into the HOF and 0 otherwise. The predicted value is the probability calculated by the machine learning algorithm.

MSEA = (1/n) Σi (yi – PA,i)² (Eq. 3.25)

MSEB = (1/n) Σi (yi – PB,i)² (Eq. 3.26)

Since this dissertation takes a simulation approach, both Machine Learning Algorithms A and B are run many times. For each algorithm, the average probability value and the sample standard deviation of the probability value are obtained from the simulation runs, from which 95% confidence intervals can be calculated for both algorithms. Based on the standard errors obtained from each algorithm, the standard error of the overall weighted estimate can be calculated using the formula

$$SE = \sqrt{\alpha^2 SE_A^2 + \beta^2 SE_B^2}$$

where α and β are the weight values for Machine Learning Algorithms A and B from Eq. 3.23 and Eq. 3.24, respectively.
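The combination of the two algorithms' standard errors can be sketched as below. This assumes the usual variance-of-a-weighted-sum formula for independent estimates, which matches the description above; the numeric inputs are illustrative.

```python
import math

def combined_standard_error(se_a, se_b, alpha, beta):
    """Standard error of the weighted sum alpha*P_A + beta*P_B,
    treating the two algorithms' estimates as independent."""
    return math.sqrt((alpha * se_a) ** 2 + (beta * se_b) ** 2)

se = combined_standard_error(se_a=0.02, se_b=0.03, alpha=0.6, beta=0.4)
half_width_95 = 1.96 * se   # half-width of the resulting 95% confidence interval
```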

The goal of this dissertation is to determine the probability, treated as a score metric, that a player should be in the HOF. The outcome of HOF voting for each player is already known: 1 means the player was inducted and 0 means he was not. Since the outcome is a binary variable, the probability of HOF induction is calculated using a supervised learning approach. Within each group, each player is compared to the other players in the same group based on their respective performance statistics using the Leave-One-Out Cross-Validation (LOOCV) approach. In other words, each player in turn is treated as the test dataset, while the rest of the players in the same group are cross-validated to optimize the parameters used to determine that player's HOF worthiness. This process is repeated until every player in the group has served as the test dataset and received a score.
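The LOOCV scoring loop can be sketched with scikit-learn as follows. This is a simplified illustration: an SVM is used as the base model (one of the two algorithms introduced below), and the parameter grid and synthetic data are invented for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

def loocv_scores(X, y, param_grid):
    """Leave-One-Out scoring: each player is held out once, and the
    model is tuned on the remaining players before scoring the hold-out."""
    scores = np.empty(len(y))
    for train_idx, test_idx in LeaveOneOut().split(X):
        search = GridSearchCV(SVC(probability=True), param_grid, cv=3)
        search.fit(X[train_idx], y[train_idx])
        # probability that the held-out player belongs to the HOF class (label 1)
        scores[test_idx] = search.predict_proba(X[test_idx])[0, 1]
    return scores
```

A tiny synthetic run shows the shape of the output: one probability per player, each obtained without that player ever appearing in his own training data.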

In this dissertation, the two machine learning algorithms applied are Support Vector Machines (SVM) and Neural Networks (NN). The methodology behind each is explained in more detail below.

3.3.1.1 Support Vector Machines (SVM)

SVM divides the data into classes using the hyperplane that best separates them. The optimal hyperplane is the one with the maximum margin between itself and the nearest datapoint of each class; these nearest datapoints are the support vectors of the algorithm. For a simple two-dimensional linear case with two classes, where the dependent variable is either 1 or 0, a single hyperplane splits the data as a straight line. If there are more than two classes, multiple hyperplanes are involved. When the dataset is high-dimensional and the classes are not linearly separable, a kernel is applied so that the decision boundary becomes nonlinear, and multiple support vectors determine its shape.

This dissertation takes the multidimensional nonlinear approach with two classes. Note that since the output is binary (i.e., whether a player is HOF worthy or not), there is only one hyperplane. The goal of SVM is to maximize the margin discussed above, defined as $2/\lVert\theta\rVert$, where θ is a vector of weights corresponding to the features of the dataset. From a graphical standpoint, θ is orthogonal to the hyperplane. Since maximizing $2/\lVert\theta\rVert$ is equivalent to minimizing $\lVert\theta\rVert$, from an optimization standpoint $\tfrac{1}{2}\lVert\theta\rVert^2$ is minimized.

To prevent the hyperplane from overfitting to the positions of the datapoints, the soft-margin SVM is applied in this dissertation. The soft margin allows the hyperplane to misclassify outliers in order to generalize better for classification purposes, so that the hyperplane is not overly sensitive to outliers. To achieve the appropriate shape and position of the hyperplane, misclassified outliers are penalized. Taking such penalties into account, Eq. 3.27 is the optimization problem for SVM margin maximization.

$$\min_{\theta,\,b,\,\xi}\;\; \frac{1}{2}\lVert\theta\rVert^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.} \quad t_i\left(\theta^{\top} x_i + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \qquad (\text{Eq. 3.27})$$

where $t_i$ is 1 if the binary variable is 1 and −1 if the binary variable is 0, $x_i$ is the vector representing row i of the dataset, θ is the vector of weights corresponding to the data features, and b is the intercept term, where $|b|/\lVert\theta\rVert$ represents the orthogonal distance between the hyperplane and the origin. Also, $\xi_i$ is the slack variable added to the optimization problem to penalize misclassified outliers, and C is the regularization parameter that prevents overfitting.

The value of C is inversely related to the margin size of the SVM hyperplane. In other words, if C is small, the margin is large, and more support vectors are taken into account in determining the position of the hyperplane. In Eq. 3.27, the sum of the slack variables multiplied by the regularization constant C is the penalty term for misclassified outliers.

Since Eq. 3.27 is a primal optimization problem, its dual form is:

$$\max_{\alpha}\;\; \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i \alpha_j t_i t_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{m}\alpha_i t_i = 0 \qquad (\text{Eq. 3.28})$$

where the $\alpha_i$'s are the Lagrangian multipliers and $K(x_i, x_j)$ is the SVM kernel. If the linear kernel is applied, $K(x_i, x_j) = x_i^{\top} x_j$. On the other hand, if the Gaussian kernel is applied, $K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$, where γ is a parameter that determines how much influence the support vectors have on the classification of other datapoints. Since the number of rows in the dataset is approximately equal to the number of variables, the Gaussian kernel is chosen for this dissertation's analysis. The parameters optimized to determine the HOF worthiness of each player are C and γ.

In order to apply SVM, feature scaling is first performed on the dataset, where each feature is normalized to the range $[0, 1]$. The dependent variable is binary, where 0 means the player was not inducted into the HOF and 1 means he was; in other words, $y \in \{0, 1\}$. To obtain the probability that a player should be in the HOF from SVM, a second optimization function is defined from a cost-minimization standpoint. Eq. 3.29 is the optimization function for SVM cost minimization.

$$\min_{\theta}\;\; C\sum_{i=1}^{m}\Big[-y_i \log h_\theta(x_i) - (1 - y_i)\log\big(1 - h_\theta(x_i)\big)\Big] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 \qquad (\text{Eq. 3.29})$$

where C is the regularization constant from Eq. 3.27 that resolves the issue of overfitting, the $\theta_j$'s are the weights from Eq. 3.27, m is the number of players in the dataset, n is the number of features, and $h_\theta(x)$ is the HOF induction probability obtained through Platt scaling (Platt, 2000). This probability value is the point of interest for the analysis. For $h_\theta(x) = \dfrac{1}{1 + \exp\left(A f(x) + B\right)}$, A and B are parameters that are optimized, and $f(x) = \theta^{\top} K$ is an estimate of the binary variable y, where K is a vector based on the SVM kernel that takes into account all support vectors. If $h_\theta(x) \ge 0.5$, y is predicted to be 1, while if $h_\theta(x) < 0.5$, y is predicted to be 0.
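The SVM pipeline described in this subsection (min-max scaling to [0, 1], an RBF/Gaussian kernel, and Platt-scaled probabilities) can be sketched with scikit-learn. The synthetic data and the fixed C and gamma values are illustrative; in the dissertation these hyperparameters are tuned via cross-validation.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Min-max scaling to [0, 1] followed by an RBF (Gaussian-kernel) SVM.
# probability=True enables Platt scaling, so predict_proba returns a
# HOF-induction probability rather than only a class label.
model = make_pipeline(
    MinMaxScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale", probability=True, random_state=0),
)

# two well-separated synthetic classes standing in for non-HOF / HOF players
X = np.vstack([np.random.default_rng(1).normal(0, 1, (15, 3)),
               np.random.default_rng(2).normal(4, 1, (15, 3))])
y = np.array([0] * 15 + [1] * 15)
model.fit(X, y)
probs = model.predict_proba(X)[:, 1]   # P(player is HOF worthy)
```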

3.3.1.2 Neural Networks (NN)

For NN, the parameters optimized are the number of nodes in the hidden layer and λ, a regularization parameter that prevents overfitting. For this dissertation, the NN has one hidden layer. Just as in SVM, feature scaling is performed on the dataset for NN, where each feature is normalized to $[0, 1]$. The probability that a player should be in the HOF is based on minimizing the cost function defined in Eq. 3.30.

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y_i \log h_\Theta(x_i) + (1 - y_i)\log\big(1 - h_\Theta(x_i)\big)\Big] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{ji}^{(l)}\big)^2 \qquad (\text{Eq. 3.30})$$

where λ is the regularization parameter, $s_l$ is the number of nodes in Layer l, $\Theta_{ji}^{(l)}$ is the weight corresponding to Node i in Layer l for Node j in Layer l+1, and $h_\Theta(x)$ is the probability value calculated using forward propagation, which is performed through Eq. 3.31.

$$a_i^{(l)} = g\Big(\sum_{j=1}^{s_{l-1}} \Theta_{ij}^{(l-1)} a_j^{(l-1)}\Big) \qquad (\text{Eq. 3.31})$$

where $a_j^{(l-1)}$ represents Node j in Layer l−1, $\Theta_{ij}^{(l-1)}$ represents the weight corresponding to Node j in Layer l−1 for Node i in Layer l, $s_{l-1}$ is the total number of nodes in Layer l−1, and g is the sigmoid activation function, defined as $g(z) = 1/(1 + e^{-z})$. In the beginning, the values of $a_j^{(1)}$ are the features from the training dataset. Note that in the first iteration of forward propagation, the weights are initialized randomly from the standard normal distribution.

Hence, the value of each node in a subsequent layer is based on the weighted sum of the nodes in the previous layer. Forward propagation thus starts from the input layer and proceeds layer by layer until it reaches the output layer. When $l = L$ (the output layer), $a^{(L)} = h_\Theta(x)$, which is the probability value.

Afterwards, backpropagation is performed to calculate the error terms δ. Based on these error values, the weights in the NN are updated, which is where the network's learning occurs. Backpropagation travels in the opposite direction: it starts from the output layer and moves backward until it reaches the input layer. At the output layer, the error terms $\delta^{(L)}$ are the differences between the predicted probability values and the respective actual values, which are the binary outputs 1 and 0. They are calculated through Eq. 3.32.

$$\delta^{(L)} = a^{(L)} - y \qquad (\text{Eq. 3.32})$$

Then, moving backward from the output layer through the hidden layers toward the input layer, the error terms are calculated using Eq. 3.33.

$$\delta^{(l)} = \big(\Theta^{(l)}\big)^{\!\top} \delta^{(l+1)} \circ a^{(l)} \circ \big(1 - a^{(l)}\big) \qquad (\text{Eq. 3.33})$$

where $\circ$ denotes element-wise multiplication and $a^{(l)} \circ (1 - a^{(l)})$ is the derivative of the sigmoid activation.

Based on δ, backpropagation uses gradient descent to minimize the cost function in Eq. 3.30. The gradients are partial derivatives of the cost function with respect to the weights, computed through Eq. 3.34.

$$\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} = a_j^{(l)}\,\delta_i^{(l+1)} + \frac{\lambda}{m}\,\Theta_{ij}^{(l)} \qquad (\text{Eq. 3.34})$$

After the gradients are computed, the weights are updated through Eq. 3.35 for the next NN iteration.

$$\Theta_{ij}^{(l)} := \Theta_{ij}^{(l)} - \eta\,\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} \qquad (\text{Eq. 3.35})$$

where η is the learning rate. After the weights are updated through Eq. 3.35, the NN goes through another cycle of forward propagation and backpropagation in the next iteration. The cycle repeats like a feedback loop until the change in the cost from Eq. 3.30 between iterations becomes trivial or a certain number of iterations is reached.
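The forward-propagation/backpropagation cycle of Eqs. 3.31-3.35 can be sketched in NumPy for a one-hidden-layer network. This is a minimal illustration of the technique, not the dissertation's implementation: bias units are omitted for brevity, and the function names, learning rate, and iteration count are invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_nn(X, y, hidden=5, lam=0.01, eta=0.5, iters=2000, seed=0):
    """One-hidden-layer network trained with the forward/backward cycle:
    sigmoid activations, gradient descent, L2 penalty lam (bias units omitted)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W1 = rng.standard_normal((hidden, n)) * 0.1   # input -> hidden weights
    W2 = rng.standard_normal((1, hidden)) * 0.1   # hidden -> output weights
    for _ in range(iters):
        # forward propagation (Eq. 3.31)
        a1 = sigmoid(X @ W1.T)                     # hidden activations, (m, hidden)
        a2 = sigmoid(a1 @ W2.T)                    # output probability, (m, 1)
        # backpropagation (Eqs. 3.32-3.33)
        d2 = a2 - y.reshape(-1, 1)                 # output-layer error
        d1 = (d2 @ W2) * a1 * (1 - a1)             # hidden-layer error
        # gradients with L2 regularization (Eq. 3.34) and update (Eq. 3.35)
        W2 -= eta * ((d2.T @ a1) / m + lam * W2 / m)
        W1 -= eta * ((d1.T @ X) / m + lam * W1 / m)
    return W1, W2

def predict(X, W1, W2):
    """Forward pass only: returns the HOF probability for each row of X."""
    return sigmoid(sigmoid(X @ W1.T) @ W2.T).ravel()
```

On a small linearly separable toy problem, a few thousand iterations of this loop are enough for the predicted probabilities to separate the two classes.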

3.4 Methodology Summary

This chapter provided the methodologies used for this dissertation's research, which are applied in Chapters 4, 5, and 6. Chapter 4 applies the methodologies in Section 3.1, addressing the issue of clarifying the meaning of "similar" for road segment clustering in order to identify road hotspots. Chapter 5 applies the methodologies in Section 3.2, where the hybrid model is used to rank metropolitan areas by congestion level based on specific congestion metrics. Finally, Chapter 6 applies the methodologies in Section 3.3, where the combined model of machine learning algorithms objectively ranks retired baseball players based on their performance statistics to determine which players deserve to be in the Major League Baseball Hall of Fame.


CHAPTER 4

ENHANCED EMPIRICAL BAYESIAN METHOD – CASE STUDY IN PHOENIX, ARIZONA

The goal of this chapter is to enhance the Empirical Bayesian (EB) method by incorporating one of the similarity measures into it. Specifically, the similarity measure based on the Proportion Discordance Ratio (PDR) is incorporated into the EB method as a combined model, referred to in this dissertation as the Enhanced EB method. Section 4.1 discusses the motivation behind using the Enhanced EB method for predicting the number of traffic crashes on road segments. Section 4.2 describes the dataset used in the case study. Section 4.3 discusses the results obtained from the case study: it compares the performance of the Enhanced EB method against the Standard EB method, identifies road hotspots overall, and then identifies hotspots by hour of the day, day of the week, and season of the year. Section 4.4 provides the summary and conclusions of this chapter.

4.1 Background

Hauer et al. (2002) proposed methods of predicting the number of traffic crashes for similar road segments. However, the paper did not explicitly define "similar". To remove this ambiguity, similarity is defined here based on the crash patterns of road segments through pair-wise comparison. This approach of defining similarity via crash patterns using the Proportion Discordance Ratio (PDR) was first proposed by Nowakowska (2002). This similarity metric is incorporated into the methods proposed by Hauer to improve traffic crash predictions.

For the case study in this chapter, each road segment has an Average Annual Daily Traffic (AADT) value, which indicates the traffic volume it experiences. After all road segments are clustered into groups based on the crash patterns that define the PDR values, the AADT values are averaged within each cluster. These averages are used by the Enhanced EB method in this chapter's analysis. Afterwards, the performance of the Enhanced EB method is tested against the Standard EB method on one of the major highways in Phoenix, Arizona as a case study. Based on the Enhanced EB method, road segments are ranked by their risk of traffic crash occurrence, and potential road hotspots are identified. The next section describes the datasets used for the analysis in more detail.

4.2 Description of the Sites and Data used in the Arizona Case Study

For the Arizona case study, the analysis is based on two datasets. The first is the compilation of all crash records that occurred in Arizona from 2011 to 2014, maintained by the Arizona Department of Transportation (ADOT). It contains parameters such as the hour of the day, day of the week, month of the year, number of vehicles involved, number of motorists involved, number of non-motorists involved, number of injuries, number of fatalities, weather conditions, light conditions, severity of the crash, collision manner, road where the crash took place, latitude and longitude coordinates of the crash, city code, and county ID.

The second dataset contains the AADT values collected by ADOT for each road segment of every highway in Arizona from 2011 to 2014. For a particular road segment, the AADT value differs from year to year. Along with the AADT values, the dataset's parameters include road segment lengths and the specific highway exits that mark the starting and ending points of the road segments.

The EB method enhanced by the PDR is applied to one of the major highways in Phoenix, AZ: Arizona State Route 101 (AZ SR 101) is chosen for the dissertation's case study. AZ SR 101 loops around the greater Phoenix metropolitan area. A high-occupancy-vehicle (HOV) lane was recently added to reduce traffic congestion levels. The highway starts in Tolleson, AZ, west of Phoenix, and goes north through Glendale, AZ, where State Farm Stadium is located. Next, it travels east through the north of Phoenix and passes through Scottsdale, AZ, which has two casinos and a major luxury shopping mall. Then, AZ SR 101 goes south through Tempe, AZ, where Arizona State University (ASU) is located, and ends in Chandler, AZ.

The methodology described above is applied to assess the safety levels of AZ SR 101 road segments based on the number of vehicular crashes, road segment length, number of lanes, traffic volume, crash severity levels, weather conditions, and whether crashes occurred at highway exits or in the middle of the highway. To assign traffic crashes to road segments, the K-means algorithm is applied to divide AZ SR 101 into 55 clusters, since ADOT collects AADT data at 55 different locations along AZ SR 101. In other words, ADOT records how many vehicles drive through the 55 road segments of AZ SR 101 on a daily basis in real time. The length of each road segment is determined by the distance between highway exits, and each cluster from the K-means algorithm represents one road segment of the highway.

The key parameters used for the K-means algorithm are the latitude and longitude coordinates where the crashes took place. The coordinates of the midpoint of each of the 55 road segments in AZ SR 101 are assigned as the respective initial cluster centroids. Eventually, the crash locations are assigned to the road segments where the crashes took place, and the number of crashes that occurred in each road segment is obtained. As a result, the number of crashes for each road segment obtained from the K-means algorithm represents the actual number of crashes for that road segment. Figure 4.1 shows the initial centroid locations of all 55 road segments in AZ SR 101, and Table 4.1 shows the descriptive statistics of the road segments.
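The crash-to-segment assignment can be sketched with scikit-learn's K-means, seeding the algorithm with the segment midpoints as initial centroids. This is a simplified illustration with three toy segments instead of 55; the function name and coordinates are invented.

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_crashes_to_segments(crash_coords, segment_midpoints):
    """Assign each crash (lat, lon) to a road segment by seeding K-means
    with the segment midpoints as initial centroids (n_init=1 keeps the
    provided seeds instead of re-randomizing them)."""
    km = KMeans(n_clusters=len(segment_midpoints),
                init=segment_midpoints, n_init=1)
    labels = km.fit_predict(crash_coords)
    # per-segment crash counts: the "actual" counts used later in the chapter
    counts = np.bincount(labels, minlength=len(segment_midpoints))
    return labels, counts
```

Because each cluster is seeded at a segment midpoint, cluster index i corresponds directly to road segment i in the output counts.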

Figure 4.1: Initial Cluster Centroids of 55 Road Segments in AZ SR 101


Mean Road Segment Length (km): 1.75
Mean Number of Lanes: 8.29
Mean Traffic Volume (AADT): 140,303
Mean Number of Crashes per Road Segment per Year: 95

Table 4.1: Descriptive Statistics of AZ SR 101 Road Segments

The main question is why the EB method enhanced by the PDR-based similarity matrix is preferable to the Standard EB method and the Baseline method. For comparison purposes, the Baseline method uses only the actual number of crashes to predict the number of crashes in the 16th quarter; its predictions are the average of the crash counts from the first 15 quarters of 2011-2014. Since the Baseline method is the simplest possible method and requires the minimum amount of data, it serves as the basis for comparison. The performance of the Enhanced EB method is examined on the 55 road segments of AZ SR 101. The performance measure considered for comparison is the error percentage, defined as:

$$\text{Error \%} = \frac{\left|N_{\text{predicted}} - N_{\text{actual}}\right|}{N_{\text{actual}}} \times 100\% \qquad (\text{Eq. 4.1})$$

Eq. 4.1 is a specific version of Eq. 3.9. The 16th-quarter prediction for a particular road segment is computed as the average of the EB estimates over the first 15 quarters for that segment. Applying Eq. 4.1 assesses how far the predictions based on the training data are from the actual number of crashes in the 16th quarter. The 16th-quarter data are treated as a validation dataset, while the data from the first 15 quarters are treated as a training dataset.
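The error-percentage comparison can be sketched as follows, assuming Eq. 4.1 is the usual absolute percentage error between predicted and actual counts; the EB estimates listed below are illustrative numbers, not values from the case study.

```python
def error_percentage(predicted, actual):
    """Percent deviation of the 16th-quarter prediction from the actual count."""
    return abs(predicted - actual) / actual * 100.0

# 16th-quarter prediction as the average of the first 15 quarters' EB estimates
eb_estimates = [22, 25, 19, 24, 21, 23, 26, 20, 22, 24, 25, 21, 23, 22, 24]
prediction = sum(eb_estimates) / len(eb_estimates)
err = error_percentage(prediction, actual=27)
```

Segments with fewer than 20 actual crashes would be dropped before averaging these percentages, since the denominator in Eq. 4.1 makes small counts inflate the error.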


4.3 Results and Discussion

4.3.1 Overall Road Hotspot Identification

It is important to verify that the group of road segments considered similar to a particular road segment differs between the Enhanced EB method and the Standard EB method. The Standard EB method determines similar road segments with the K-means algorithm, using as initial cluster centroids the percentages of crashes in each road segment that were categorized as most severe, occurred during rainy conditions, and occurred at highway exits; this yields a total of 8 clusters. Meanwhile, the Enhanced EB method uses the PDR-based similarity matrix through Eq. 3.2, Eq. 3.3, and Eq. 3.4, with a required similarity value of at least 0.9 on the chosen crash patterns. Crash patterns are all possible combinations of crash severity level, weather condition, and whether the crash occurred at a highway exit or in the middle of the highway. For the case study, a total of 28 crash patterns were chosen after applying the 80% rule for the similarity matrix.

Also, the regression parameters a and b from Eq. 3.6, Eq. 3.7, and Eq. 3.10 must be calculated. To find them, the total number of crashes is plotted against the initial AADT values for the first 15 quarters, and a best-fit regression line is drawn on the plot. Based on the given data for the Arizona case study, the parameter values are a = 10⁻⁵ and b = 1.4436.

To obtain the error percentages, Eq. 4.1 is applied to each road segment for the Enhanced EB method, the Standard EB method, and the Baseline method. Road segments that reported fewer than 20 crashes in the 16th quarter are omitted from the error percentage calculations: for such segments, a small deviation between the predicted and actual number of crashes would produce an inflated error percentage. After removing these segments, error percentages are computed for the remaining 46 of the initial 55 road segments of AZ SR 101. Table 4.2 shows the overall error percentages for the three methods.

                            Enhanced EB Method   Standard EB Method   Baseline Method
Mean                        21.43%               28.97%               32.63%
Sample Standard Deviation   13.91%               16.35%               16.58%
Sample Variance             1.94 (%²)            2.67 (%²)            2.75 (%²)

Table 4.2: AZ SR 101 Average Error Percentages

It is important to consider not only the mean of the error percentage but also its sample variance. Table 4.2 shows that the Enhanced EB method's average error percentage is substantially lower than those of the Standard EB method and the Baseline method, and its variance is lower as well.

From a visualization standpoint, predictions are plotted against the actual number of crashes for both the Enhanced EB method and the Standard EB method. Using Eq. 3.5, Eq. 3.8, and Eq. 3.10, Figure 4.2 compares how far the predicted number of crashes, based on the average of the EB estimates over the first 15 quarters, is from the actual number of crashes in the 16th quarter (i.e., Quarter 4 of 2014).

Figure 4.2: 2014 Q4 Predictions vs. Actual Number of Crashes in AZ SR 101

The left subplot of Figure 4.2 shows that the predictions from the Enhanced EB method are closer to the actual number of crashes for the respective road segments than those from the Standard EB method shown in the right subplot. A line with a slope of 1 is drawn to indicate that more predictions from the Enhanced EB method are aligned with their respective actual crash counts. Predictions from the Enhanced EB method are also more pessimistic: the Enhanced EB method predicts a higher number of crashes than the Standard EB method. The cause of the higher and more accurate predictions is that the Enhanced EB method takes into account the accident severity proportions of similar road segments, clustering a given road segment with others that meet the similarity threshold of 0.9 under the PDR-based similarity matrix from Eq. 3.4. The Standard EB method, on the contrary, does not take these proportions into account.

Note that if a road segment reported a much higher number of crashes in the 16th quarter than in previous quarters, it is harder for the Enhanced EB method to predict its crash count accurately. This issue leads to the next part of the analysis: identifying road hotspots, i.e., road segments with a high risk of traffic crashes. To identify them, a quantitative measure is defined as the ratio of the actual number of crashes reported in the 16th quarter to the number predicted by the Enhanced EB method. Since the Enhanced EB method has already been shown to be preferable, only its predictions are used. A high ratio for a particular road segment indicates a sudden increase in the number of crashes, and that segment is flagged as a potential hotspot.

On the contrary, a low ratio indicates a decrease in the number of crashes, which is desirable from a traffic safety standpoint. Figure 4.3 plots the ratios against the predictions from the Enhanced EB method. In the plot, a 99% confidence band is applied to distinguish hotspots from other road segments.

Figure 4.3: AZ SR 101 Ratios vs. Enhanced EB Predictions

In Figure 4.3, road segments below the 99% confidence band are colored green and are of no concern. Their ratios are relatively low, meaning the actual number of crashes reported in the 16th quarter is lower than the number predicted by the Enhanced EB method. The road segments colored red, on the other hand, are identified as road hotspots: since they lie above the 99% confidence band, their ratios are relatively high given their respective predictions. These hotspots should alert ADOT to the increase in crashes and motivate additional traffic safety measures and/or policies for these road segments to ensure improved safety over the long run.

Based solely on the predictions obtained from the Enhanced EB method for all 55 road segments, regardless of the corresponding actual crash counts in the 16th quarter, the road segments are ranked by their predicted risk of crashes. The analysis shows that several consecutive road segments of AZ SR 101 close to ASU rank among the most dangerous. This is plausible, since students are more likely to drive recklessly than drivers in other age groups (Sharma et al., 2008). From ADOT's point of view, these road segments would be identified as hotspots, prompting measures and/or policies to reduce the number of crashes.

4.3.2 Road Hotspot Identification for Different Timeframes

This section identifies road hotspots by quarter of the year, day of the week, and time of day for AZ SR 101. This time, data from 2011-2013 are used as the training dataset, and the entire 2014 dataset is used for validation. With hotspots identified on a quarterly, daily, and hourly basis, ADOT could target its efforts to reduce traffic crashes at specific times on specific road segments.

There are two ways to rank road segments by the risk of traffic crashes. The first is to rank them by the EB estimates for a given timeframe. The second is to rank them by the ratio of the actual number of crashes in 2014 to the predicted number, i.e., the EB estimate. The ratios inform ADOT of surges in crashes on certain road segments at certain times. Note that the ratios provide valuable information only when the corresponding EB estimates are relatively high. In other words, road segments with both high EB estimates and high ratios are the areas of interest and are considered potential road hotspots.

For reference, Table 4.3 lists the AZ SR 101 road segments corresponding to popular destinations in their vicinity.

Road Segments   Popular Destinations
4–5             State Farm Stadium
6–7             Desert Diamond Casino – West Valley
19–22           Interchange with
24–27           Interchange with State Route 51
35–38
44–48           Arizona State University

Table 4.3: AZ SR 101 Road Segments and Popular Destinations

4.3.2.1 Seasons

Figure 4.4 is the heat map of the EB estimates and the ratios for the 55 road segments of AZ SR 101 across the four quarters of the year. The red areas indicate potential hotspots in terms of both the EB estimates and the ratios. Focusing on the EB estimates for hotspot identification, the risk of traffic crashes is highest for road segments in the vicinity of ASU during the 1st quarter and especially the 4th quarter of the year (January-March and October-December, respectively). Another observation is that the risk is also highest for the AZ SR 101 road segment at the interchange with Interstate 17, except that there the risk is greater in the 1st quarter than in the 4th.

Figure 4.4: AZ SR 101 Seasonal Heat Map

4.3.2.2 Days of the Week

Figure 4.5 is the heat map of the EB estimates and the ratios for the days of the week. Based on the overall trend of the EB estimates in the heat map, the risk of traffic crashes is higher on weekdays for the majority of road segments. This trend makes sense, since people typically commute to work on weekdays. In particular, the risk is highest on Fridays for the same road segments mentioned in Section 4.3.1.

Figure 4.5: AZ SR 101 Daily Heat Map

4.3.2.3 Time of the Day

Figure 4.6 is the heat map of the EB estimates and the ratios for the times of the day. A 24-hour day is divided into four 6-hour segments: 12AM-6AM, 6AM-12PM, 12PM-6PM, and 6PM-12AM. These segments are chosen because roads are mostly empty during the first, drivers commute to work during the second, drivers travel home from work during the third, and people tend to travel for personal reasons during the fourth. Based on the overall trend of the EB estimates in the heat map, the risk of traffic crashes is highest during 12PM-6PM for the majority of road segments. The risk is also relatively high during 6AM-12PM, and based on the ratios, the risk is highest during that timeframe: 2014 saw a relatively large increase in traffic crashes during 6AM-12PM on multiple road segments.


Figure 4.6: AZ SR 101 Hourly Heat Map

4.4 Chapter Summary

This chapter filled in the research gap of clarifying the definition of similarity for road segment clustering in order to better predict the number of traffic crashes. Similarity in this case is defined based on how similar road segments are through their crash patterns using pair-wise 68 comparison. This aspect is incorporated into the Enhanced EB method, which was able to predict the number of traffic crashes more accurately compared to the Standard EB method through a case study. Using the Enhanced EB method, potential road hotspots are identified based on seasons, days of the week, and times of the day. This information can be used by

Departments of Transportation (DOTs) to effectively reduce the number of traffic crashes at the right places and at the right times.


CHAPTER 5

HYBRID MODEL RANKING FOR TRAFFIC CONGESTION ASSESSMENT OF METROPOLITAN AREAS

The goal of this chapter is to take a hybrid model approach to ranking major metropolitan areas in the United States in order to assess their congestion levels. The hybrid model consists of the Normalized Scoring Method (NSM), Principal Component Analysis (PCA), and the similarity measure based on the Proportion Discordance Ratio (PDR). Addressing the accuracy of reported congestion levels matters because inaccurate findings can lead to ineffective decision making and unnecessary expenditure by local and state governments. The Pima Association of Governments wants to investigate whether traffic congestion in the Tucson metropolitan area is truly as high as TomTom's ranking suggests.

In terms of organization, Section 5.1 discusses the motivation behind using the hybrid model of NSM, PCA, and the PDR similarity matrix to rank metropolitan areas by traffic congestion. Section 5.2 describes the dataset used in the analysis. Section 5.3 discusses the results obtained from each of the three methods and compares them to ensure they are consistent with each other. Section 5.4 provides the summary and conclusions of this chapter.


5.1 Background

In 2015, a study conducted by TomTom found Tucson, AZ to be the 21st most congested metropolitan area in the United States, while ranking Phoenix, AZ 43rd. TomTom defines its congestion index as the percentage increase in overall vehicle travel times compared to free-flow traffic conditions. The metric is based on TomTom's historical database of speed measurements obtained from its Global Positioning System (GPS) navigation devices. From a road network standpoint, it is weighted by the number of GPS-equipped vehicles present throughout the roads of the network.

The ranks obtained from the hybrid model are compared to TomTom's ranks in order to identify any discrepancies. Some smaller metropolitan areas, such as Tucson, were ranked by TomTom as more congested than much larger metropolitan areas, even though larger metropolitan areas would be expected to experience higher congestion. In fact, past literature found that higher populations and more complex road networks are associated with higher traffic congestion levels. It also found that using multiple variables, rather than a single variable or a single overall index, provides a more accurate representation of the problem. This chapter therefore conducts an independent study using a hybrid model that accounts for specific aspects of congestion measurement instead of the overall congestion index.


5.2 Dataset Description

TomTom ranked major metropolitan areas in the United States in terms of congestion based on the overall congestion index, which is defined as the percentage increase in overall travel time compared to free-flow conditions. This is measured for all road types at any time of the day. Based on TomTom's 2016 ranking, Tucson is the 26th most congested metropolitan area in the US, while Phoenix is ranked 47th.

For the analysis, this dissertation proposes its own methodology to rank all 71 metropolitan areas, each of which has a population of at least 800,000, in terms of congestion based on seven features: the Morning Peak Congestion Index, Evening Peak Congestion Index, Highway Congestion Index, Non-Highway Congestion Index, Total Length of Highways, Total Length of Non-Highways, and the Average Extra Travel Time. Table 5.1 shows the definitions of these seven features based on TomTom's website.


Morning Peak Congestion Index – Percentage increase in morning peak travel times compared to a free-flow condition. This is measured only during weekdays (Mondays – Fridays). The morning peak time is either 7:00 AM – 8:00 AM or 8:00 AM – 9:00 AM.

Evening Peak Congestion Index – Percentage increase in evening peak travel times compared to a free-flow condition. This is measured only during weekdays (Mondays – Fridays). The evening peak time is either 4:00 PM – 5:00 PM or 5:00 PM – 6:00 PM.

Highway Congestion Index – Percentage increase in highway travel times compared to a free-flow condition.

Non-Highway Congestion Index – Percentage increase in non-highway travel times compared to a free-flow condition.

Total Length of Highways – Total length of the evaluated highway network.

Total Length of Non-Highways – Total length of the evaluated non-highway network.

Average Extra Travel Time – Extra travel time in minutes during peak hours vs. 1 hour of driving during free-flow conditions. This value is measured over a period of 230 days per year.

Table 5.1: TomTom Dataset Feature Definitions

5.3 Results and Discussion

Applying the methodology of the hybrid model described in Section 3.2, three separate rankings of the major US metropolitan areas in terms of traffic congestion are obtained, based on the NSM, the PCA, and the PDR similarity matrix.

5.3.1 Normalized Scoring Method (NSM)

The NSM has both quantitative and qualitative aspects. The quantitative aspect consists of the scores of metropolitan areas in terms of their traffic congestion levels. After applying Eq. 3.11 and Eq. 3.12 to the given dataset, scores are obtained for the metropolitan areas, which are then ranked based on these scores. The qualitative aspect is the classification of major US metropolitan areas into clusters in terms of the severity of their traffic congestion levels. The assignment of metropolitan areas to clusters is based on the relative distances among the scores using the K-means clustering algorithm. After the initial cluster centroid locations are assigned, the K-means algorithm divides the 71 metropolitan areas into 6 clusters.
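The scoring and clustering steps above can be sketched as follows. Since Eq. 3.11 and Eq. 3.12 are not reproduced in this chapter, the sketch assumes an equally weighted sum of min-max normalized features as the score, and uses a plain one-dimensional K-means for the cluster assignment; the function names (`nsm_scores`, `kmeans_1d`) are illustrative.

```python
import numpy as np

def nsm_scores(X, weights=None):
    """Min-max normalize each feature column, then combine into one score.
    NOTE: Eq. 3.11/3.12 are not reproduced here; an equally weighted sum
    of min-max normalized features is assumed as a stand-in."""
    X = np.asarray(X, dtype=float)
    Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    w = np.full(X.shape[1], 1.0 / X.shape[1]) if weights is None else np.asarray(weights)
    return Xn @ w

def kmeans_1d(scores, k, iters=100, seed=0):
    """Plain 1-D K-means: assign each score to its nearest centroid,
    then recompute centroids, repeating until stable."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(scores, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(np.abs(scores[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = scores[labels == j].mean()
    return labels, centroids
```

In the chapter's setting, `nsm_scores` would be applied to the 71x7 feature matrix and `kmeans_1d` called with k = 6.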

Figure 5.1 shows the scores and the cluster assignments for all 71 metropolitan areas using the K-means algorithm. The metropolitan areas on the X-axis of Figure 5.1 are ordered by score in descending order. A higher score represents higher traffic congestion levels for a particular metropolitan area. Each circle in Figure 5.1 represents a cluster of metropolitan areas with similar traffic congestion levels; the leftmost circle represents the metropolitan areas with the highest congestion levels, and the rightmost circle represents those with the lowest.

Figure 5.1: Metropolitan Area NSM Scores and Clusters

Previously, TomTom ranked Tucson and Phoenix as the 26th and 47th most congested metropolitan areas, respectively. However, Figure 5.1 shows that through the NSM, Phoenix is more congested than Tucson: based on the scores, Tucson is ranked 51st, and Phoenix is ranked 27th. In Figure 5.1, Phoenix is in Cluster 5, while Tucson is in Cluster 6, which is comprised of the metropolitan areas with the lowest traffic congestion levels.

To see if the differences among the clusters in terms of their respective average scores are significant, ANOVA is conducted, where the statistical hypothesis test is:

H0: μ1 = μ2 = μ3 = μ4 = μ5 = μ6
Ha: at least one pair of cluster means differs (Eq. 5.1)

Assuming α = 0.05, the p-value is very close to 0, which means that the null hypothesis is rejected at α = 0.05. This indicates that at least one pair of clusters is significantly different at α = 0.05.

Furthermore, pair-wise multiple comparisons for all possible combinations of clusters are performed using Tukey's statistical test. Since there are 6 clusters, a total of 15 statistical tests are performed, one for each possible pair of clusters. After conducting Tukey's test, the p-values of all 15 tests are very close to 0. Hence, all clusters are significantly different from each other at α = 0.05.
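The significance checks above can be sketched with SciPy. One caveat: the chapter uses Tukey's HSD for the pairwise step, while this sketch substitutes pairwise one-way ANOVA (equivalent to two-sample t-tests) without the Tukey multiplicity adjustment, so it illustrates only the overall structure of the procedure.

```python
import itertools
import numpy as np
from scipy import stats

def cluster_mean_tests(groups):
    """One-way ANOVA across cluster score groups, then all pairwise
    comparisons. NOTE: the chapter uses Tukey's HSD for the pairwise step;
    pairwise one-way ANOVA is used here as a simple stand-in without the
    Tukey multiplicity adjustment."""
    _, p_overall = stats.f_oneway(*groups)
    pairwise = {}
    # with 6 clusters this loop runs C(6, 2) = 15 times
    for i, j in itertools.combinations(range(len(groups)), 2):
        _, p = stats.f_oneway(groups[i], groups[j])
        pairwise[(i, j)] = p
    return p_overall, pairwise
```

Here `groups` would hold the NSM scores of the metropolitan areas in each of the 6 clusters.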

Figure 5.2 shows the scores of the metropolitan areas. However, unlike Figure 5.1, the X-axis in Figure 5.2 orders the metropolitan areas based on TomTom's ranking in terms of the overall congestion index.

Figure 5.2: Metropolitan Area NSM Scores Ordered Based on TomTom’s Ranks

The plot in Figure 5.2 shows a general descending trend of the scores over the metropolitan areas ordered by TomTom's ranks. However, the plots in Figure 5.1 and Figure 5.2 are very different, and such differences indicate several discrepancies between the scores and TomTom's ranks. For instance, Tucson has a much lower NSM score than other metropolitan areas that TomTom ranks close to it. Just like in Figure 5.1, Figure 5.2 clearly shows that Phoenix has a higher score than Tucson, which indicates that the Phoenix metropolitan area is more congested. In fact, based on the NSM scores, Tucson is ranked as the 51st most congested metropolitan area, while Phoenix is ranked 27th. It is interesting to note that the Dallas/Fort Worth metropolitan area received a much higher score than other metropolitan areas that are similarly ranked by TomTom. While TomTom ranked Dallas/Fort Worth as the 34th most congested metropolitan area, it is ranked 10th based on the NSM scores.

5.3.2 Principal Component Analysis (PCA)

Next, the given dataset is normalized using Eq. 3.15, and PCA is applied to the normalized dataset. Through Eq. 3.13 and Eq. 3.14, the principal components are obtained, and they are ordered based on the diagonal elements of the covariance matrix in Eq. 3.17. For the analysis, the first two principal components are the points of interest since they convey the vast majority of the dataset's information based on their proportion-of-variance values, which are obtained using Eq. 3.17 and Eq. 3.19. It is important to note that principal components are linear weighted combinations of the normalized features, and they are mutually uncorrelated.

Figure 5.3 shows the plot of the first two principal components. The data points, each of which represents a metropolitan area, are color-coded in Figure 5.3 so that they can be traced to the clusters they were assigned to through the K-means algorithm for the NSM.

Figure 5.3: Principal Component Analysis Graph

In Figure 5.3, each color represents the cluster assignments obtained by the K-means algorithm for the NSM: black represents Cluster 1, red Cluster 2, green Cluster 3, blue Cluster 4, turquoise Cluster 5, and pink Cluster 6. Cluster 1 represents the metropolitan areas with the highest congestion levels, and Cluster 6 represents those with the lowest. Figure 5.3 shows that as the values of the first principal component increase along the X-axis, the traffic congestion levels of metropolitan areas tend to decrease.

Figure 5.3 also shows that the proportion of variance of the first principal component is much higher than that of the second. Hence, the relationship between the first principal component values and congestion levels is evident in Figure 5.3. For this reason, the analysis based solely on the first principal component is a point of interest for this dissertation. Figure 5.4 shows the plot of the first principal component values over the metropolitan areas ordered based on TomTom's ranks.

Figure 5.4: First Principal Component Scores Ordered Based on TomTom’s Ranks


Table 5.2 shows the beta values in Eq. 3.13 for the first principal component.

Feature                          Beta   Value
Morning Peak Congestion Index    β11    -0.4210051
Evening Peak Congestion Index    β12    -0.4147815
Highway Congestion Index         β13    -0.4120204
Non-Highway Congestion Index     β14    -0.3868197
Total Length of Highways         β15    -0.2729542
Total Length of Non-Highways     β16    -0.2760796
Average Extra Travel Time        β17    -0.4249668
Table 5.2: Beta Values for the 1st Principal Component

Unlike the NSM, since the beta values in Table 5.2 are all negative for the first principal component, a lower first principal component score means higher traffic congestion levels for a particular metropolitan area. However, just like Figure 5.2 for the NSM, there are many instances in which the PCA scores in Figure 5.4 are not consistent with TomTom's ranking. Figure 5.4 shows that Tucson has a higher first principal component score than Phoenix, which indicates that Tucson has lower traffic congestion levels. In fact, Tucson is ranked 51st, while Phoenix is ranked 28th. Also, Dallas/Fort Worth has a much lower first principal component score than the other metropolitan areas that are similarly ranked by TomTom. This suggests that the congestion level of Dallas/Fort Worth is much worse than what TomTom indicated, since it is ranked 15th as opposed to 34th by TomTom.

Overall, the findings based on the NSM and PCA are consistent with each other. Table 5.3 summarizes the ranks provided by the NSM and the PCA, which are compared with the ranks provided by TomTom for a sample of metropolitan areas.


Metro Area          TomTom Rank  NSM Rank  Rank Diff  PCA Rank  Rank Diff
Tucson                   26         51        25         51        25
Phoenix                  47         27        20         28        19
Los Angeles               1          1         0          1         0
San Francisco             2          3         1          3         1
New York City             3          2         1          2         1
Seattle                   4          4         0          4         0
San Jose                  5          6         1          5         0
Boston                   10          9         1          9         1
Las Vegas                18         36        18         37        19
McAllen                  24         47        23         45        21
Fresno                   32         58        26         58        26
Dallas/Fort Worth        34         10        24         15        19
Minneapolis              48         30        18         29        19
Detroit                  51         29        22         32        19
Albany                   60         63         3         62         2
Omaha                    67         70         3         70         3
Kansas City              68         61         7         63         5
Indianapolis             69         64         5         64         5
Knoxville                70         69         1         69         1
Dayton                   71         71         0         71         0
Table 5.3: Rank Results and Comparisons

Table 5.3 shows that the ranks of the metropolitan areas based on the NSM and PCA scores are very consistent with each other. Both methods show that Tucson is not as congested as TomTom found; in fact, both report that Phoenix is much more congested than Tucson. Another observation in Table 5.3 is that McAllen and Fresno, like Tucson, are also not as congested as TomTom's ranks suggest. On the other hand, Dallas/Fort Worth and Detroit, like Phoenix, are much more congested than what TomTom found. Overall, the ranks based on this dissertation's analysis agree with TomTom's ranks only for the most and least congested metropolitan areas; the rank differences show that discrepancies occur mostly for moderately congested metropolitan areas.

5.3.3 Proportion Discordance Ratio (PDR) Similarity Matrix

The final part of the analysis calculates the pairwise similarity values among the 71 major metropolitan areas in the US based on the normalized dataset using the PDR similarity matrix. In other words, for a particular metropolitan area, this method calculates how similar it is to each of the other metropolitan areas. The PDR similarity matrix, which in this case is a 71x71 square matrix, is obtained by applying Eq. 3.20 and Eq. 3.21 a total of 5,041 times. Figure 5.5 shows the PDR similarity matrix for the 71 metropolitan areas.

Figure 5.5: PDR Similarity Matrix
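Since Eq. 3.20 and Eq. 3.21 are defined in Chapter 3 and not reproduced here, the following is only an illustrative stand-in for the PDR computation: discordance between two areas is taken as the fraction of features on which they fall on opposite sides of that feature's median, and similarity is one minus that ratio. The sketch reproduces the structural properties of the matrix described here (a symmetric 71x71 matrix with ones on the diagonal, built from n² pairwise evaluations).

```python
import numpy as np

def pdr_similarity(X):
    """Illustrative stand-in for the PDR similarity matrix (Eq. 3.20/3.21
    are NOT reproduced here). Discordance between two areas is taken as
    the fraction of features on which they fall on opposite sides of the
    feature's median; similarity = 1 - discordance."""
    X = np.asarray(X, dtype=float)
    above = X > np.median(X, axis=0)        # boolean n x p indicator matrix
    n = X.shape[0]
    S = np.empty((n, n))
    for i in range(n):
        for j in range(n):                  # n*n evaluations (71*71 = 5041)
            discordance = np.mean(above[i] != above[j])
            S[i, j] = 1.0 - discordance
    return S
```

By construction the diagonal is 1 (every area is identical to itself) and the matrix is symmetric, matching the properties noted below for Figure 5.5.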

The light-colored cells in the PDR similarity matrix indicate that the given two metropolitan areas are very similar to each other, while the dark cells indicate that they are not. Note that the similarity values in the diagonal elements of the matrix must be 1, since every metropolitan area is exactly the same as itself. Figure 5.5 shows that Tucson, marked with a red oval, is very similar to other major metropolitan areas that TomTom ranks low congestion-wise. Even though TomTom ranked Tucson as the 26th most congested major metropolitan area in the US, based on the 7 features in the dataset used for this dissertation's analysis, Tucson is more similar to metropolitan areas with much lower congestion levels. In addition, the three purple ovals in Figure 5.5 represent Las Vegas, McAllen, and Fresno, respectively. Just like Tucson, they are also not as congested as the ranks provided by TomTom suggest.

Using Tucson as a reference point, Table 5.4 shows Tucson's similarity values, based on the PDR similarity matrix, to some of the other metropolitan areas.


Metropolitan Area   TomTom Rank  Tucson Similarity      NSM Rank  PCA Rank
Los Angeles              1        0.272                     1         1
San Jose                 5        0.575                     6         5
Portland                 7        0.668                    11        10
Boston                  10        0.639                     9         9
Chicago                 14        0.688                    14        14
Austin                  15        0.750                    18        17
Las Vegas               18        0.912                    36        37
Philadelphia            21        0.768                    17        19
McAllen                 24        0.935                    47        45
Denver                  25        0.807                    24        24
Tucson                  26        1 (Reference Point)      51        51
San Antonio             28        0.822                    23        23
Fresno                  32        0.948                    58        58
Dallas/Fort Worth       34        0.628                    10        15
Bakersfield             37        0.946                    52        52
Jacksonville            38        0.876                    28        27
Memphis                 42        0.916                    48        48
Allentown               45        0.943                    55        55
Phoenix                 47        0.800                    27        28
Salt Lake City          50        0.903                    44        44
Albuquerque             52        0.957                    59        60
Birmingham              55        0.935                    46        47
Albany                  60        0.941                    63        62
St. Louis               62        0.913                    49        50
Richmond                66        0.891                    66        66
Dayton                  71        0.839                    71        71
Table 5.4: Tucson Similarity Comparison with Other Metropolitan Areas

TomTom ranked Tucson as the 26th most congested metropolitan area in the US, but based on the similarity values in Table 5.4, Tucson is more similar to metropolitan areas that TomTom ranks much lower. Results in Figure 5.5 and Table 5.4 both show that Tucson is not as congested as what TomTom found. The only metropolitan areas that are similar to Tucson and are also ranked close to Tucson by TomTom are Las Vegas, McAllen, and Fresno. The similarity values of McAllen and Fresno support the NSM and PCA results indicating that they, like Tucson, should be ranked much lower congestion-wise. Both Table 5.3 and Table 5.4 agree that Tucson, along with McAllen and Fresno, is not as congested as what TomTom found.

The results from the NSM, PCA, and the PDR similarity matrix are very consistent with one another. Applying the hybrid model based on these three approaches to assess the traffic congestion levels of metropolitan areas makes more sense because it takes into account 7 metrics based on 7 specific aspects of congestion as the dataset's features, instead of only the single overall congestion index used by TomTom. Also, based on the dataset, with the exception of congestion levels on local roads, Phoenix reported a higher extra travel time during peak hours vs. one hour of driving in free-flow conditions, higher traffic congestion levels on highways, and higher congestion levels during morning and evening peak travel times than Tucson. In addition, Phoenix has a much larger road network in terms of total length for both highways and local roads than Tucson. Louf and Barthelemy (2014) found that there is a connection between the total length of a road network and delays caused by traffic congestion. Hence, it makes more sense for Phoenix to be ranked higher in terms of traffic congestion than Tucson.

Also, if Tucson, McAllen, and Fresno are compared with the other metropolitan areas that TomTom ranks in their vicinity, they have the lowest values on the seven features in the dataset. The features that stand out the most for these three metropolitan areas are the size of the highway network and the highway congestion index. Compared to the other metropolitan areas ranked close to them by TomTom, Tucson, McAllen, and Fresno have much smaller highway networks. They also reported less extra travel time during peak hours than the other metropolitan areas. Based on the PDR similarity matrix, the three metropolitan areas are very similar to one another. Based on these observations, Tucson, McAllen, and Fresno should be ranked lower than what TomTom reported.

Furthermore, there is a potential bias in TomTom's data collection that serves as the basis for the calculation of the overall congestion index. TomTom's database of speed measurements is collected from TomTom's GPS navigation devices used by drivers, most of whom are not familiar with the road networks they are in. These drivers are likely to drive slowly, especially on local roads, since they are more cautious and more likely to get lost there. Hence, the bias is more pronounced for metropolitan areas whose road networks are comprised mostly of local roads. It is not as pronounced on highways, since drivers simply drive with the traffic flow and there are fewer turns on highways than on local roads. As a result, the driving patterns of drivers who are unfamiliar with the road network do not accurately represent the overall driving patterns of all drivers, and hence the congestion levels of metropolitan areas. Since the road networks of metropolitan areas like Tucson, McAllen, and Fresno consist mostly of local roads, such bias is more pronounced for these metropolitan areas.

5.4 Chapter Summary

This chapter used the hybrid model to objectively and reliably assess the traffic congestion levels of metropolitan areas. The hybrid model found that Tucson is not as congested as what TomTom found, and the results from the three methods that comprise the hybrid model are consistent with one another. The cause behind the discrepancy between the hybrid model's ranking and TomTom's ranking is that the hybrid model took into account multiple specific aspects of congestion rather than a single overall index. Also, TomTom's rankings are based on speed measurements of vehicles, which potentially led to biased results against smaller metropolitan areas like Tucson.


CHAPTER 6

BASEBALL HALL OF FAME MACHINE LEARNING COMBINED MODEL

Currently, retired baseball players are voted into the Major League Baseball Hall of Fame (HOF) by sportswriters who are members of the media. The issue with the current voting procedure is that voters are likely to be biased for or against certain players. For instance, if certain players are negatively portrayed by the media during their playing careers, their chances of being inducted into the HOF are lower due to the influence of the media on the voters, despite the players' accomplishments on the field. Also, there have been cases in which some players were not as widely covered by the media as others even though the performances of both groups of players were similar. As a result, the careers of these players could be overshadowed by other, more well-known players.

The goal of this dissertation is to more objectively and reliably determine which retired baseball players are worthy of the HOF using a machine learning combined model based on the players' performance statistics. Section 6.1 discusses the motivation behind the use of machine learning algorithms to help voters choose players to be inducted into the HOF. Specifically, the combined model ranks players based on their performance statistics and applies clustering to determine their HOF-worthiness. Section 6.2 provides the description of the performance statistics dataset used in the analysis. Section 6.3 discusses the results obtained from the machine learning combined model. Section 6.4 provides the summary and conclusions of this chapter.


6.1 Background

Currently, retired players are inducted into the HOF based on the number of votes they receive from voters who are members of the Baseball Writers' Association of America (BBWAA). In order to be considered for the HOF, a player's career must have lasted at least 10 years. Also, there is a 5-year waiting period between the time players retire from baseball and when they are first considered for the HOF. Players need at least 75% of the votes in order to be inducted. If they do not meet the 75% requirement within 10 ballot years, or if they receive less than 5% of the votes during any ballot year, they are automatically removed from consideration in future HOF ballots. On the voters' side, each voter cannot vote for more than 10 players per ballot year.
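The ballot rules above can be encoded as a small sketch; the function name and return strings are illustrative, not part of any official system.

```python
def ballot_outcome(vote_share, years_on_ballot):
    """Encode the BBWAA ballot rules described above:
    at least 75% of votes -> inducted; below 5% -> dropped from future
    ballots; otherwise a player stays on the ballot for up to 10 years."""
    if vote_share >= 0.75:
        return "inducted"
    if vote_share < 0.05 or years_on_ballot >= 10:
        return "removed"
    return "remains on ballot"
```

For example, a player receiving 3.2% of the votes on a first ballot is removed from future consideration, while one receiving 40% simply remains on the ballot.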

However, the issue is the existence of voting bias in the current baseball HOF voting procedure, which can cause some players worthy of HOF induction to be overlooked. To resolve this issue, this dissertation proposes a combined model approach using machine learning algorithms to more objectively determine which players are HOF-worthy based solely on their performance statistics. Since different algorithms provide different results, relying on more than one algorithm makes the results more reliable by allowing their consistency to be checked. The ultimate goal is to help improve decision-making for HOF voters, not to replace them. In fact, Jarrahi (2018) found that it is optimal from a decision-making standpoint for humans and machines to work together instead of relying on one or the other separately.

Another benefit of using machine learning algorithms as a combined model is that it may expedite the voting process compared to the current process, in which some players may need to wait for years before they are eventually inducted into the HOF. It is possible that, due to media bias, some players were not inducted into the HOF even when their performance statistics show otherwise. In this case, players who are no longer eligible for HOF consideration through the BBWAA voting procedure can be considered again through the Eras Committees. There are multiple committees, where each committee represents the time period when players' careers took place. Each committee is comprised of 16 members, who are HOF players, baseball executives, and members of the media, and meets once every 5 years to vote for these players. Players under consideration through their respective committees need to receive at least 12 votes out of 16 to get into the HOF. The model proposed by this dissertation can also be used by members of the Eras Committees to help them choose players for induction into the HOF.

For machine learning algorithms, conventional methods such as regression and decision trees have some downsides. The issue with regression is that it is very sensitive to outliers, and as a result, overfitting can occur. For decision trees, as more data are added to the dataset, the tree structure tends to change radically: the branches are completely rearranged and the conditional threshold for each branch changes. Hence, the issues with decision trees are instability and a proneness to overfitting. To address the overfitting issue, this dissertation takes the combined model approach using Support Vector Machines (SVM) and Neural Networks (NN). SVM is a widely accepted algorithm that has been used in research for many years. With the rise of big data over recent years, deep learning has become popular, and neural networks have been used to solve big data problems.


6.2 Dataset Description

The dataset, obtained from www.baseball-reference.com, is comprised of all players who were considered for the HOF during the 1997 to 2017 ballot years. In 1969, the pitcher's mound was lowered by 5 inches and the strike zone was made smaller due to pitchers' dominance over batters. Then, to further boost hitting power and thus revenue from American League games, the designated hitter position was created in 1973. Previously, it was typical for pitchers to throw complete games, but in recent years their pitch counts have usually been limited to 100 to 125 pitches to give their arms more rest for subsequent games. Because of these rule changes, along with changes in playing style and improvements in workouts, training routines, and diets, it is difficult to compare a player from one era with a player from a different era. Also, most players who were on the HOF ballots in the late 1990's played during the 1970's and 1980's, after the rule changes. Thus, this dissertation took into account the more recent data to assess each player's HOF-worthiness in these ballot years.

The dataset is split into smaller subsets based on the primary position of the players. The positions are starting pitcher, relief pitcher, catcher, first baseman, second baseman, shortstop, third baseman, left fielder, center fielder, and right fielder. Infielders are comprised of first basemen, second basemen, shortstops, and third basemen, and outfielders are comprised of left fielders, center fielders, and right fielders. Altogether, there are 102 infielders, 84 outfielders, 21 catchers, 74 starting pitchers, and 39 relief pitchers from the 1997-2017 ballot years combined that are taken into account in this dissertation's analysis.

In terms of variables used for the analysis, both cumulative and percentage statistics are used so that they take into account both the longevity and the effectiveness of players. For all batters, performance is based on offensive and defensive statistics. Offensive statistics are wins above replacement (WAR), runs, hits, home runs, runs batted in (RBIs), stolen bases, walks, batting average, on-base percentage (OBP), slugging percentage (SLG), and on-base plus slugging (OPS). Defensive statistics are putouts, assists, errors, double plays, and fielding percentage. For catchers, additional defensive statistics are passed balls, wild pitches, stolen bases allowed, number of runners caught stealing, and caught stealing percentage. For all pitchers, performance is based on wins above replacement (WAR), wins, losses, win percentage, earned run average (ERA), walks and hits per inning pitched (WHIP), games, innings pitched, hits allowed, home runs allowed, walks allowed, strikeouts, hits allowed per 9 innings, home runs allowed per 9 innings, walks per 9 innings, strikeouts per 9 innings, and strikeout-to-walk ratio. The only difference between starting and relief pitchers in terms of their statistics is that games started is an additional feature for starting pitchers, and saves is an additional feature for relief pitchers.

Based on the given data, a correlation matrix is used to eliminate highly correlated features: if the correlation value of any given pair of features is at least 0.9, one of the two features is eliminated from the dataset. After the highly correlated features are eliminated, the remaining features are used for the analysis. In this dissertation, the correlation matrix is applied separately to infielders and outfielders combined, catchers, starting pitchers, and relief pitchers, for a total of 4 correlation matrices. Note that there is a separate correlation matrix for catchers, since they have additional defensive statistics beyond those of the other batters.
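The correlation-based screening can be sketched as follows. The chapter does not specify which member of a correlated pair is dropped; this sketch keeps the first-listed feature of each pair, which is an assumption.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Drop one feature from every pair whose absolute correlation is at
    least `threshold`. ASSUMPTION: the first-listed feature of each highly
    correlated pair is kept; the chapter does not state the tie-break rule."""
    corr = np.abs(np.corrcoef(np.asarray(X, dtype=float), rowvar=False))
    p = corr.shape[0]
    drop = set()
    for i in range(p):
        for j in range(i + 1, p):
            if i not in drop and j not in drop and corr[i, j] >= threshold:
                drop.add(j)
    keep = [k for k in range(p) if k not in drop]
    return [names[k] for k in keep], keep
```

In the chapter's setting this routine would be run four times, once per player group (infielders and outfielders combined, catchers, starting pitchers, relief pitchers).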


6.3 Results and Discussion

After the highly correlated features are eliminated, the remaining features in the dataset are normalized as mentioned in Section II for SVM and NN. Since infield positions are more defensively demanding than outfield positions, outfielders tend to have more offensive power than infielders. Also, infielders tend to switch positions within the infield and outfielders within the outfield, but switching from infield to outfield and vice versa is rare. Hence, from the analysis standpoint, SVM and NN are applied to infielders and outfielders separately. Altogether, there are a total of 5 SVM and NN models, applied to starting pitchers, relief pitchers, catchers, infielders, and outfielders. For each of the 5 groups of players, SVM and NN are applied to determine the probability that a certain baseball player should be in the HOF through Leave-One-Out Cross-Validation (LOOCV). The probability value is determined for each player based on all other players in the same group. That way, the quality of each player's performance over his entire career can be measured relative to the other players in the same group.

This dissertation makes the following assumptions. From the analysis standpoint, players are labeled HOF for the dependent variable only if they were inducted through the BBWAA voting procedure. Since All-Star Game appearances, MVP awards, and Cy Young awards are likely to be affected by media bias, the number of awards players received is excluded from the analysis. Players' performance in playoff games and the World Series is also excluded, since reaching the playoffs is based mostly on team performance rather than on an individual player's performance alone. In addition, players who admitted to steroid use, or who face significant evidence of using steroids to enhance their performance, are removed from this dissertation's analysis.

The analysis takes a simulation approach by running both the SVM and NN models 100 times for each player through LOOCV. For SVM, since the number of rows in the dataset for all groups is relatively small, the Gaussian kernel is used. The NN has only 1 hidden layer, since the size of the dataset is relatively small and this dissertation found that multiple hidden layers did not have a significant impact on the analysis. The number of nodes in the hidden layer is optimized based on the cost function defined in Eq. 3.30.
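The LOOCV probability estimation can be sketched with scikit-learn, an assumed tooling choice since the chapter does not name a library. The sketch fits an RBF-kernel (Gaussian) SVM and a one-hidden-layer NN on all other players and records each held-out player's HOF probability; the 100-run simulation, the hidden-layer size search of Eq. 3.30, and the fitted weights of Eq. 3.23/3.24 are omitted, with equal weights assumed instead.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def loocv_hof_probs(X, y, seed=0):
    """For each player, fit an RBF-kernel SVM and a one-hidden-layer NN on
    all *other* players and record the held-out HOF probability.
    ASSUMPTION: equal SVM/NN weights stand in for Eq. 3.23/3.24."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)          # labels assumed to be 0/1
    svm_probs = np.zeros(len(y))
    nn_probs = np.zeros(len(y))
    for train, test in LeaveOneOut().split(X):
        svm = SVC(kernel="rbf", probability=True, random_state=seed)
        svm.fit(X[train], y[train])
        nn = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000,
                           random_state=seed)
        nn.fit(X[train], y[train])
        i = test[0]
        svm_probs[i] = svm.predict_proba(X[test])[0, 1]   # column 1 = P(HOF)
        nn_probs[i] = nn.predict_proba(X[test])[0, 1]
    return 0.5 * svm_probs + 0.5 * nn_probs
```

The hidden-layer width of 8 here is a placeholder; the chapter instead optimizes the number of hidden nodes via its cost function.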

6.3.1 Outfielders

Table 6.1 shows the probability values of the top 30 outfielders in terms of their HOF-worthiness. The probability values are treated as scores for these players, who are ranked based on them. The table includes the probability values obtained from both SVM and NN, as well as the overall probability based on the two algorithms. The outfielders are ordered by their overall probability values in descending order. For the group of outfielders, the weights used to calculate the overall probability, obtained through Eq. 3.23 and Eq. 3.24, are 0.4609 for SVM and 0.5391 for NN. Based on the visual distribution of the probability scores in Figure 6.1, cluster predictions are determined by dividing the scores into two clusters through the K-means algorithm. The first cluster is comprised of players who deserve to be in the HOF, while the other cluster indicates otherwise. The actual outcome of whether each player was inducted into the HOF is also included in Table 6.1.
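As a quick check on how the overall probability column is formed, the weighted combination can be recomputed for the top-ranked outfielder; only the combination formula (0.4609 x SVM probability + 0.5391 x NN probability) is assumed here, using values reported in Table 6.1.

```python
# Recomputing Table 6.1's "Overall Prob" for its top row (Rickey Henderson)
# from the reported SVM/NN probabilities and the outfielder weights.
svm_w, nn_w = 0.4609, 0.5391             # SVM and NN weights (sum to 1)
svm_p, nn_p = 0.518955359, 0.781086092   # SVM Prob and NN Prob columns
overall = svm_w * svm_p + nn_w * nn_p    # weighted combination
# overall agrees with the table's 0.660272487 up to rounding of the weights
```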


Player | SVM Prob | NN Prob | Overall Prob | HOF | Prediction
Rickey Henderson | 0.518955359 | 0.781086092 | 0.660272487 | 1 | 1
Larry Walker | 0.633182592 | 0.619085168 | 0.625582539 | 0 | 1
Dave Winfield | 0.683917219 | 0.543026159 | 0.607961532 | 1 | 1
Kenny Lofton | 0.735021680 | 0.486702119 | 0.601150284 | 0 | 1
Ken Griffey | 0.615092746 | 0.521890062 | 0.564846308 | 1 | 1
Andre Dawson | 0.596767463 | 0.472253012 | 0.529640559 | 1 | 1
Vladimir Guerrero | 0.483976334 | 0.522435689 | 0.504710132 | 0 | 1
Tony Gwynn | 0.289416592 | 0.583401994 | 0.447906869 | 1 | 1
Harold Baines | 0.180312488 | 0.433169056 | 0.316629827 | 0 | 1
Garret Anderson | 0.331091531 | 0.295343679 | 0.311819530 | 0 | 1
Dave Parker | 0.192854415 | 0.215248687 | 0.204927376 | 0 | 0
Tim Raines | 0.054561204 | 0.324584828 | 0.200133463 | 1 | 0
Dwight Evans | 0.139220670 | 0.228067919 | 0.187119052 | 0 | 0
Luis Gonzalez | 0.113708302 | 0.242711574 | 0.183255171 | 0 | 0
B.J. Surhoff | 0.149981346 | 0.199125350 | 0.176475338 | 0 | 0
Steve Finley | 0.076561601 | 0.226922806 | 0.157622732 | 0 | 0
Brett Butler | 0.185131652 | 0.132566396 | 0.156793231 | 0 | 0
Jim Rice | 0.103176141 | 0.187752751 | 0.148772182 | 1 | 0
Juan Gonzalez | 0.135024123 | 0.145784052 | 0.140824901 | 0 | 0
Willie Wilson | 0.038398500 | 0.226302223 | 0.139699153 | 0 | 0
Kirby Puckett | 0.097865978 | 0.167367784 | 0.135335051 | 1 | 0
Magglio Ordonez | 0.090100648 | 0.171184973 | 0.133813965 | 0 | 0
Bernie Williams | 0.055902953 | 0.199856859 | 0.133509849 | 0 | 0
Rusty Staub | 0.079826748 | 0.115452139 | 0.099032729 | 0 | 0
Ellis Burks | 0.051595549 | 0.135192618 | 0.096663510 | 0 | 0
Jim Edmonds | 0.030228931 | 0.124731459 | 0.081176127 | 0 | 0
Moises Alou | 0.049271239 | 0.098813636 | 0.075980008 | 0 | 0
Marquis Grissom | 0.024683148 | 0.107883370 | 0.069537165 | 0 | 0
Joe Carter | 0.075962305 | 0.052574609 | 0.063353780 | 0 | 0
Paul O'Neill | 0.026576892 | 0.090303925 | 0.060932731 | 0 | 0

Table 6.1: Top 30 Outfielders – HOF Probability Scores (α=0.4609 and β=0.5391)

It is worth pointing out in Table 6.1 that even though Larry Walker ranks 2nd among all outfielders who have been considered for the HOF, he has not yet been inducted. He played for the Colorado Rockies in Denver, CO, for the majority of his career; the high altitude in Denver makes the air thin, which causes batted baseballs to travel farther than in other parks. Hence, Coors Field in Denver is considered a hitter-friendly park, and HOF voters believe such an advantage inflated Larry Walker’s offensive numbers. Another player worth noting is Kenny Lofton, who ranks 4th among all outfielders but is likewise not in the HOF. In fact, on his first ballot in 2013, Kenny Lofton received only 3.2% of the votes, not enough to remain on subsequent ballots. That year, several other more well-known players were considered for the HOF and received the majority of the votes. Also, in 2017, Vladimir Guerrero, who ranks 7th in Table 6.1, received 71.7% of the votes on his first ballot, short of the 75% required for HOF induction; in 2018, he met the requirement on his second ballot. Note that the results in Table 6.1 are based on the actual HOF inductions as of 2017.

The 95% confidence intervals for the overall probability values in Table 6.1 are calculated and shown for the same players in Figure 6.1. The line represents the prediction boundary between players who are predicted to be in the HOF and the rest of the players.
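One standard way to form such intervals from the 100 simulation runs is the normal approximation, mean ± 1.96·s/√n; the sketch below assumes this construction, as the dissertation’s exact formula is not reproduced in this chapter.

```python
import numpy as np

def ci95(run_scores):
    """95% confidence interval of a player's overall score across
    simulation runs (normal approximation: mean +/- 1.96 * s / sqrt(n))."""
    runs = np.asarray(run_scores, dtype=float)
    m = runs.mean()
    half = 1.96 * runs.std(ddof=1) / np.sqrt(len(runs))  # sample std dev
    return m - half, m + half
```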

Figure 6.1: Top 30 Outfielders – HOF Probability 95% Confidence Intervals

6.3.2 Infielders

Just like the outfielders in Table 6.1, Table 6.2 shows the probability scores of the top 30 infielders. The SVM and NN weights for the infielders are 0.4884 and 0.5116 respectively. The cluster predictions and the actual outcomes are also included in Table 6.2. Figure 6.2 shows the 95% confidence intervals of the infielders from Table 6.2.


Player | SVM Prob | NN Prob | Overall Prob | HOF | Prediction
Paul Molitor | 0.992884200 | 0.933204971 | 0.962351052 | 1 | 1
George Brett | 0.940330800 | 0.884721882 | 0.911880109 | 1 | 1
Eddie Murray | 0.915523300 | 0.842239225 | 0.878029627 | 1 | 1
Cal Ripken | 0.933739600 | 0.821762951 | 0.876449992 | 1 | 1
Craig Biggio | 0.907083700 | 0.840730133 | 0.873135820 | 1 | 1
Roberto Alomar | 0.901585200 | 0.840901429 | 0.870538107 | 1 | 1
Robin Yount | 0.773443300 | 0.823784522 | 0.799198927 | 1 | 1
Jeff Bagwell | 0.770229300 | 0.705284383 | 0.737002115 | 1 | 1
Ozzie Smith | 0.692045000 | 0.651562718 | 0.671333413 | 1 | 1
Edgar Martinez | 0.809859500 | 0.405515519 | 0.602988619 | 0 | 1
Frank Thomas | 0.587920100 | 0.559421938 | 0.573339841 | 1 | 1
Fred McGriff | 0.550046400 | 0.517362369 | 0.533324563 | 0 | 1
Barry Larkin | 0.483132700 | 0.537623100 | 0.511011134 | 1 | 1
Jeff Kent | 0.422355100 | 0.459424764 | 0.441320719 | 0 | 1
Lou Whitaker | 0.367102400 | 0.468374947 | 0.418915564 | 0 | 1
Julio Franco | 0.362788800 | 0.466503302 | 0.415851320 | 0 | 1
Alan Trammell | 0.410620200 | 0.418285479 | 0.414541918 | 0 | 1
Ryne Sandberg | 0.328393100 | 0.478496895 | 0.405189357 | 1 | 1
Wade Boggs | 0.272413000 | 0.238543192 | 0.255084494 | 1 | 0
Willie Randolph | 0.238365700 | 0.190506623 | 0.213879990 | 0 | 0
Ron Santo | 0.139712900 | 0.175257060 | 0.157898039 | 0 | 0
Will Clark | 0.152877600 | 0.156694694 | 0.154830505 | 0 | 0
Dick Allen | 0.102675900 | 0.143651359 | 0.123639806 | 0 | 0
Dave Concepcion | 0.136108400 | 0.103917787 | 0.119639006 | 0 | 0
Keith Hernandez | 0.100629200 | 0.131951519 | 0.116654357 | 0 | 0
John Olerud | 0.090616840 | 0.122564563 | 0.106961967 | 0 | 0
Steve Garvey | 0.062224950 | 0.148422343 | 0.106325348 | 0 | 0
Andres Galarraga | 0.085042830 | 0.116312951 | 0.101041281 | 0 | 0
Mark Grace | 0.083662310 | 0.116196109 | 0.100307286 | 0 | 0
Joe Torre | 0.060053500 | 0.132093957 | 0.096910912 | 0 | 0

Table 6.2: Top 30 Infielders – HOF Probability Scores (α=0.4884 and β=0.5116)


Figure 6.2: Top 30 Infielders – HOF Probability 95% Confidence Intervals

6.3.3 Starting Pitchers

Similarly, Table 6.3 shows the probability scores of the top 30 starting pitchers. The SVM and NN weights are 0.4662 and 0.5338 respectively. Figure 6.3 shows the 95% confidence intervals of these starting pitchers.


Player | SVM Prob | NN Prob | Overall Prob | HOF | Prediction
Nolan Ryan | 0.992452188 | 0.989401300 | 0.990823675 | 1 | 1
Bert Blyleven | 0.929598332 | 0.992461000 | 0.963153377 | 1 | 1
Don Sutton | 0.928690295 | 0.946938400 | 0.938430830 | 1 | 1
Greg Maddux | 0.934410402 | 0.833346300 | 0.880464068 | 1 | 1
Phil Niekro | 0.754284743 | 0.982748600 | 0.876234945 | 1 | 1
Randy Johnson | 0.930115498 | 0.811821100 | 0.866971918 | 1 | 1
Curt Schilling | 0.631917503 | 0.412658000 | 0.514880432 | 0 | 1
Pedro Martinez | 0.345966838 | 0.662068600 | 0.514696694 | 1 | 1
David Cone | 0.380588417 | 0.553983500 | 0.473143824 | 0 | 1
Mike Mussina | 0.473033395 | 0.418881100 | 0.444127802 | 0 | 1
Tommy John | 0.321580203 | 0.387495100 | 0.356764477 | 0 | 0
John Smoltz | 0.190210603 | 0.347105500 | 0.273958486 | 1 | 0
Kevin Brown | 0.168323539 | 0.324457300 | 0.251665140 | 0 | 0
Luis Tiant | 0.180007438 | 0.222851100 | 0.202876671 | 0 | 0
Tom Glavine | 0.193108107 | 0.044478420 | 0.113772056 | 1 | 0
Jim Kaat | 0.184425153 | 0.031108310 | 0.102587176 | 0 | 0
Frank Tanana | 0.168342018 | 0.019882610 | 0.089096859 | 0 | 0
Jack Morris | 0.078044695 | 0.075099440 | 0.076472567 | 0 | 0
Mickey Lolich | 0.098929682 | 0.046091960 | 0.070725786 | 0 | 0
Dwight Gooden | 0.060330869 | 0.052180860 | 0.055980530 | 0 | 0
Chuck Finley | 0.049145739 | 0.058144240 | 0.053948989 | 0 | 0
Charlie Hough | 0.101274570 | 0.003656255 | 0.049167539 | 0 | 0
Dennis Martinez | 0.044255408 | 0.002372944 | 0.021899246 | 0 | 0
Rick Reuschel | 0.039922051 | 0.003580135 | 0.020523342 | 0 | 0
Mark Langston | 0.039479072 | 0.002759774 | 0.019878922 | 0 | 0
Sid Fernandez | 0.038685510 | 0.000278075 | 0.018184261 | 0 | 0
Orel Hershiser | 0.032300492 | 0.001022332 | 0.015604731 | 0 | 0
Bret Saberhagen | 0.028643389 | 0.001512871 | 0.014161570 | 0 | 0
Al Leiter | 0.022624927 | 0.004660147 | 0.013035627 | 0 | 0
Ron Guidry | 0.025662430 | 0.000605147 | 0.012287270 | 0 | 0

Table 6.3: Top 30 Starting Pitchers – HOF Probability Scores (α=0.4662 and β=0.5338)


Figure 6.3: Top 30 Starting Pitchers – HOF Probability 95% Confidence Intervals

6.3.4 Overall Summary of Results

Overall, Table 6.1, Table 6.2, and Table 6.3 all show that the scores obtained from SVM and NN are very close to each other for most of the players. Discrepancies between the two algorithms tend to occur for players who are close to the prediction boundary determined by the K-means clustering algorithm. Also, based on the confidence intervals in Figure 6.1, Figure 6.2, and Figure 6.3, the margins of error tend to be narrow for players who received either very high or very low scores. In other words, the model is very confident about its output for players who are either clearly HOF-worthy or at the opposite extreme of the score distribution. On the other hand, confidence intervals are wider for players who are close to the prediction boundary. Even though the model is run the same number of times for every player, the probability scores of these players are less precise; from a simulation standpoint, there is more uncertainty regarding their scores. This finding indicates that these players are borderline candidates for the HOF, and this information would allow HOF voters to take a second look at these players’ accomplishments.

The overall probability values and their 95% confidence intervals can be obtained through a very similar process for catchers and relief pitchers. Note that the probability scores are relative among players within each group. In other words, players from different groups cannot be compared by their probability scores, since the models applied to different groups are independent of each other. After the total numbers of true positives, true negatives, false positives (Type I errors), and false negatives (Type II errors) for all 5 groups are combined, the confusion matrix is obtained. Table 6.4 is the confusion matrix based on these numbers for all groups of players combined.

 | Predict: Not HOF | Predict: HOF
Actual: Not HOF | 265 | 17
Actual: HOF | 8 | 30

Table 6.4: Confusion Matrix for All Groups of Players Combined

There are a total of 265 true negatives, 30 true positives, 17 false positives, and 8 false negatives. Based on these numbers, the values of accuracy, sensitivity, and specificity are 0.922, 0.789, and 0.940 respectively. The accuracy value indicates the degree to which the HOF predictions produced by the analysis agree with the HOF decisions made by voters. It is important to note that the accuracy value does not necessarily indicate how truly correct the HOF predictions are, since they are compared with the voters’ HOF decisions, which are prone to biases.
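The three metrics can be reproduced directly from the Table 6.4 counts with the standard definitions:

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, sensitivity (recall on HOF), and specificity
    from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,        # (30 + 265) / 320
        "sensitivity": tp / (tp + fn),        # 30 / 38
        "specificity": tn / (tn + fp),        # 265 / 282
    }
```

Plugging in the counts from Table 6.4 (tn=265, fp=17, fn=8, tp=30) yields 0.922, 0.789, and 0.940 after rounding.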

Moreover, the sensitivity value reflects media bias among voters who vote some players into the HOF even though those players’ scores obtained from this dissertation’s model indicate otherwise. Conversely, the specificity value reflects a different kind of media bias, in which voters do not vote for some players even though those players deserve to be in the HOF based on the model. Based on the values from Table 6.4 for all groups of players, Figure 6.4 shows the ROC curve of the true positive rate over the false positive rate.

Figure 6.4: ROC Curve


The area under the ROC curve is 0.8875. This high AUC value indicates that the combined model does a very good job of distinguishing HOF players from non-HOF players. Still, there are some false positives and false negatives; in fact, there are more false positives than false negatives. Hence, this dissertation is mostly interested in the false positives, i.e., players who should be in the HOF based on this dissertation’s model but are actually not in the HOF. This area of concern can potentially be addressed through the Eras Committees, two of which are the Modern Baseball Committee and the Today’s Game Committee. The Modern Baseball Committee focuses on players who mostly played in the 1970-1987 era, while the Today’s Game Committee focuses on the 1988-present era. This dissertation’s model can be used as a tool to help members of these committees vote for players who truly deserve to be in the HOF in future elections. In December 2017, Jack Morris and Alan Trammell were elected into the HOF through the Modern Baseball Committee. According to the analysis, Jack Morris ranks 18th out of 74 starting pitchers, and Alan Trammell ranks 17th out of 102 infielders.

The clustering algorithm determined the top 10 starting pitchers and the top 18 infielders to be HOF-worthy. Hence, the analysis predicted that Alan Trammell would be inducted into the HOF, which turned out to be the case.
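The AUC reported above can also be computed directly from scores and labels without tracing the curve, via the Mann-Whitney interpretation: the probability that a randomly chosen HOF player outscores a randomly chosen non-HOF player. A minimal sketch:

```python
import numpy as np

def auc_from_scores(scores, labels):
    """AUC as the probability that a randomly chosen positive (HOF)
    outscores a randomly chosen negative (Mann-Whitney U statistic)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]        # all positive/negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()  # ties count half
    return wins / (len(pos) * len(neg))
```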

Jack Morris and Alan Trammell were both close to the prediction boundary set by the K-means clustering algorithm based on the probability scores. They retired in 1994 and 1996 respectively, so it took over 20 years for them to be inducted. If voters based their decisions on this dissertation’s model in future ballots, players in a similar situation to Jack Morris and Alan Trammell could be inducted much more quickly.

Moreover, the probability scores obtained from the analysis can also be used to make predictions for recently retired players who have not yet appeared on the BBWAA ballots due to the 5-year waiting period. Table 6.5 shows the predictions and scores of a sample of future HOF candidates. These players are treated as a test dataset, since the outcomes of their HOF candidacies are not yet known. Scores are based on the weights calculated from Eq. 3.23 and Eq. 3.24 and the training data of all players in the respective groups who have already been considered for the HOF.

Player | 1st Ballot Yr | Group | Prob | Prob Cutoff | Rank | Rank Cutoff
Derek Jeter | 2020 | Infielder | 0.9495 | 0.4052 | 2 | 18
Roy Halladay | 2020 | Starting Pitcher | 0.0326 | 0.4441 | 23 | 10
Bobby Abreu | 2020 | Outfielder | 0.1842 | 0.3118 | 14 | 10
Paul Konerko | 2020 | Infielder | 0.0389 | 0.4052 | 40 | 18
Torii Hunter | 2021 | Outfielder | 0.2910 | 0.3118 | 11 | 10
Tim Hudson | 2021 | Starting Pitcher | 0.0239 | 0.4441 | 24 | 10

Table 6.5: Future HOF Candidate Predictions

For instance, Derek Jeter’s first year of eligibility for HOF consideration is 2020. When his performance statistics are tested against the statistics of all infielders who have already appeared on the HOF ballots as a training dataset, his overall probability score is 0.9495, which would rank him 2nd among all infielders who have been considered. Since his score and rank are both well above the prediction boundary, the model concludes that he is a very strong candidate for the HOF. Also, Torii Hunter, whose first year of eligibility is 2021, achieved a probability score of 0.2910, and in terms of rank he is very close to the prediction boundary. Hence, based on the model, his HOF candidacy is likely to be open to debate among voters.
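A hedged sketch of how the Table 6.5 cutoffs might be read programmatically; the `margin` tolerances used to flag borderline cases are illustrative assumptions, not part of the dissertation’s method.

```python
def future_candidate_assessment(prob, prob_cutoff, rank, rank_cutoff,
                                prob_margin=0.05, rank_margin=2):
    """Rough reading of a future candidate against the training-set
    prediction boundary (score and rank cutoffs as in Table 6.5)."""
    if prob >= prob_cutoff and rank <= rank_cutoff:
        return "likely HOF"
    # Near either boundary: candidacy open to debate.
    if abs(prob - prob_cutoff) <= prob_margin or abs(rank - rank_cutoff) <= rank_margin:
        return "borderline"
    return "unlikely HOF"
```

Under these illustrative tolerances, Derek Jeter reads as a clear HOF candidate and Torii Hunter as borderline, consistent with the discussion above.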

Overall, basing the combined model on two machine learning algorithms makes the results more reliable, and the results from both algorithms are very consistent for most of the players. Moreover, as more players are considered for the HOF over time and more players are inducted in the future, more data are taken into account, which makes the results even more valid. As the list of players considered for the HOF grows and the HOF status of players is updated every year, the scores obtained from this dissertation’s model also change every year. As a result, the model is dynamic: the scores and rankings of retired players are updated annually.

While relying on more than one algorithm makes the results more reliable, there are some limitations to this dissertation’s proposed model. First, the model applies only to players who have retired from the sport or are close to finishing their careers; it does not work for players who are early in their careers, since the model takes into account cumulative statistics. Another limitation is the relatively small number of players inducted into the HOF for certain groups. For instance, from 1997-2017, out of 39 relief pitchers, only 3 were inducted into the HOF. Such a scenario is treated more like an anomaly detection problem. The low number of inducted players contributed to skewed results, in which the scores for relief pitchers are much lower than those for the other groups of players. However, as more relief pitchers are inducted into the HOF in the future, this issue will eventually be resolved.

In addition, the analysis took into account both offensive and defensive statistics. As a result, it is likely to be biased against designated hitters, since their role is purely offensive. For instance, Frank Thomas was a designated hitter and first baseman who was inducted into the HOF in 2014. Since he was a designated hitter for the majority of his career, his cumulative defensive statistics as a first baseman are limited. Based on this dissertation’s analysis, his score is 0.5733, which ranks 11th among infielders. Even though the model correctly concluded that Frank Thomas is HOF-worthy, his score would have been higher if the model had taken into account only offensive statistics. Still, such bias is not a major issue, since only a few designated hitters have been considered for the HOF over the past decades.

Furthermore, while performance improvement has been achieved through improved workouts, training routines, and diets in recent years, one downside is the controversy regarding the use of performance-enhancing drugs, which has called the integrity of the sport into question. Some players admitted to using steroids to improve their performance, while other players admitted to using human growth hormone (HGH) solely to expedite their recovery from injuries. There are also several others who have been accused of steroid use, but insufficient evidence has made it difficult to prove that they cheated. These players occupy murky territory over whether their performance statistics are truly legitimate. From an analysis standpoint, players who played clean may be at a disadvantage relative to players who gained an unfair advantage through steroid use.

6.4 Chapter Summary

This chapter took the combined model approach based on machine learning algorithms to objectively rank retired baseball players based on their performance statistics. By assigning scores to each player, the players can be clustered into two groups, where one group represents players who are HOF-worthy. The purpose of this chapter is to provide insight for Major League Baseball HOF voters as well as Eras Committee voters in order to aid their decision-making, especially for borderline HOF candidates; it is not intended to replace the voters. By taking both human and machine components into account, decisions for inducting players into the HOF are less prone to human biases and would allow players who are truly HOF-worthy to be inducted more accurately and quickly.


CHAPTER 7

SUMMARY AND CONCLUSIONS

7.1 Research Summary

In this dissertation, the Enhanced EB method is preferred over the Standard EB method from an optimization standpoint for predicting the number of traffic crashes and ranking road segments in terms of crash risk. It is the combined model of the Standard EB method and the PDR similarity measure. In terms of similarity, the Enhanced EB method is a more objective approach than the Standard EB method. The information presented through the analysis and graphs of a case study using the Enhanced EB method will be very useful for state DOTs in making key decisions regarding the improvement of traffic safety through road hotspot identification.

The analysis using the hybrid model found that Tucson is actually not as congested as TomTom found. Results from all three methods of the hybrid model are consistent with one another, which confirms the validity of the results. Based on these findings, the Pima Association of Governments does not need to be overly concerned about TomTom’s assessment of Tucson’s traffic congestion based on its overall congestion index. Because TomTom’s congestion index relies on speed measurements of vehicles, it was found to contribute a possible bias against smaller metropolitan areas like Tucson when ranking metropolitan areas by traffic congestion.

The combined model based on machine learning algorithms is able to objectively assess and rank players who have been considered for the HOF through the BBWAA voting procedure. It comprises both qualitative and quantitative elements: the quantitative elements are the scores, while the qualitative element is the clustering. Based on the score distribution, clustering analysis is performed to determine which players are HOF-worthy. Since the combined model is comprised of two machine learning algorithms instead of only one, the probability scores of the players are more reliable. It is important to note that the model is not static. Instead, the model goes through a dynamic, iterative process, since the scores of all players change every year as more players are considered for the HOF on future ballots. As more data are taken into account, the model will be able to assess players’ HOF-worthiness more accurately. Moreover, the model can be used to assess the HOF induction likelihood of recently retired players who have not yet been considered.

Based on Chapter 4, Chapter 5, and Chapter 6, combined models have been shown to improve model performance in this dissertation in terms of the objectivity and reliability of results. The main goal is to improve decision-making by improving methods for ranking entities. The following highlights the contributions of this dissertation.

7.1.1 Contributions to Combined Model for Addressing a Class of Problems Pertaining to Extreme Values and Rare Events

In Chapter 4, the combined model incorporating the PDR similarity measure into the Standard EB method as the Enhanced EB method was proposed. The class of problems on extreme values and rare events that it addressed was traffic crashes, since their occurrences are rare but at the same time detrimental to traffic safety and traffic flow. Improving traffic crash predictions allows state DOTs to more effectively reduce the number of traffic crashes through policies and/or safety implementations. In this way, costs pertaining to traffic crashes and traffic congestion are reduced for state governments, and at the same time traffic safety for both drivers and pedestrians is improved.

7.1.2 Contributions to Exploring and Clarifying Similarity for Road Segment Clustering

Methods from Hauer (2002) have been proposed for predicting the number of traffic crashes for similar road segments. However, the definition of “similar” is not explicitly defined for road segment clustering. This dissertation attempted to clarify and fill this research gap in Chapter 4 using the PDR similarity measure, which is incorporated into the Standard EB method as the Enhanced EB method. The performance of the Enhanced EB method was tested and compared with that of the Standard EB method through a case study in Phoenix, Arizona. Through the error percentages of traffic crash predictions, this dissertation found that the Enhanced EB method performed significantly better at predicting the number of traffic crashes. These predictions are used by state DOTs to identify potential road hotspots through EB posterior estimates and ratios of the actual number of traffic crashes to the predictions. Since time stamps of traffic crashes are recorded, road hotspots can be identified during certain timeframes such as seasons, days of the week, and times of the day.

7.1.3 Contributions to Unsupervised Learning Hybrid Model for Traffic Congestion Measurement and Ranking

In Chapter 5, the hybrid model comprising NSM, PCA, and the PDR similarity matrix was applied to measure traffic congestion. It takes into account specific aspects of congestion rather than a single overall index. Such aspects include traffic congestion during peak hours, highway congestion levels, and road network length. By taking these factors into account, it provides a more accurate representation of traffic congestion as a whole.

Each of the three methods in the proposed hybrid model was tested, and the results from all three methods are consistent with one another, confirming the validity of the results. The hybrid model found that Tucson is not as congested as TomTom found. Initially, TomTom’s ranking of metropolitan areas based on its overall congestion index raised concerns for the Pima Association of Governments, and the proposed hybrid model was applied to conduct an independent investigation of traffic congestion levels. It was found that there is a potential bias in TomTom’s rankings due to the fact that its overall congestion index is based on speed measurements of vehicles equipped with TomTom’s GPS devices.

7.1.4 Contributions to Exploring the Weighted Sum of Classification Probabilities using Machine Learned Ranking Algorithms

In Chapter 6, the combined model based on machine learning algorithms was proposed to objectively and reliably determine which retired baseball players deserve to be inducted into the Major League Baseball Hall of Fame based on their performance statistics. Specifically, it takes a simulation approach, where the model is the weighted sum of classification probabilities, with the weight values determined from the variance of the simulation results. Confidence intervals are obtained, and the overall probability values are treated as scores that are ranked and clustered to determine which players are Hall of Fame-worthy. Through the confusion matrix, the model confirms the existence of voting bias among the voters, who are members of the media. The goal of the proposed combined model is to help, not replace, voters in their decision-making. However, if HOF predictions based on the analysis come to be in nearly complete agreement with decisions made by voters over time, whether to completely replace human voters with machine learning algorithms becomes open to debate.

7.2 Firsts in the Research

To the best of my knowledge, the following firsts have been achieved in this dissertation:

• Taking into account both quantitative and qualitative aspects in combined models. The quantitative aspect is the scores, while the qualitative aspect is the ranking and clustering based on the score distribution.

• Addressing a class of problems pertaining to extreme values and rare events using the combined model approach. In this dissertation, the Enhanced EB method is applied to predict traffic crashes.

• Defining and clarifying “similar” for road segment clustering. The Enhanced EB method, which is a combined model, incorporates the PDR similarity metric into the Standard EB method for road hotspot identification.

• Defining traffic congestion based on specific aspects of congestion using the hybrid model approach. The hybrid model is comprised of NSM, PCA, and the PDR similarity matrix, where results are obtained from each method to ensure consistency of results across all three methods of the hybrid model.

• Introducing and applying the combined model approach to optimize decision-making through human-machine symbiosis in the field of sports analytics. The model is comprised of multiple machine learning algorithms to assist human decision-making for MLB HOF voters.


7.3 Future Research Directions

While this dissertation provided several key contributions, several extensions can be made in future research on road hotspot identification. First, there are still several open topics related to segment similarity. Given the segment similarity based on crash patterns, crash prediction models could be specially designed for predicting the expected crash frequency for a certain crash pattern. Also, more features can be added to the crash patterns in order to test the Enhanced EB method and observe further improvements in traffic crash predictions. Second, the relationship between the physical traits of segments and the results obtained by our similarity-measuring methods can be explored further. For instance, if two segments are similar with respect to their crash patterns, does it imply that they were constructed similarly in geographic and geometric terms, or that they used the same technique to improve safety? Thus, the analysis of identifying hotspots can be further improved by combining statistical information about crashes with qualitative information. Third, the precision of the timeframes for identifying road hotspots can be improved. For example, instead of using 6-hour time segments, the analysis can focus on separate 1-hour time segments. It can also combine a 1-hour time segment with a particular day of the week to pinpoint road hotspots for that particular timeframe. These potential research directions can be tested in different case studies to show that the Enhanced EB method works across multiple highways.

For road congestion measurement, this dissertation is limited to metropolitan areas in the United States with populations of at least 800,000. The hybrid model approach can be extended to metropolitan areas around the world. It can also be applied to smaller metropolitan areas, which is an interesting area of future research since it was found that TomTom’s rankings are biased against smaller metropolitan areas, like Tucson, whose road networks are mostly comprised of local roads. Another area this dissertation can potentially address is TomTom’s data coverage, which can vary from one area to another. For instance, it is possible that a higher percentage of people use TomTom’s GPS devices in a particular area compared to other areas. Such varying coverage can be another cause of the bias in TomTom’s rankings. To overcome this issue, another data source such as Google can potentially be used to ensure consistency of data coverage from area to area. Also, additional features can be incorporated into the hybrid model for ranking metropolitan areas. For instance, the Texas A&M Transportation Institute (TTI) has defined many congestion metrics, some of which can be added to the hybrid model. Another area of future research is to combine the results obtained from each of the three methods of the hybrid model into a single set of results. Such a combination takes the combined model approach, outputting a single set of results with ranks based on the three methods. That way, transportation officials can rely on this single ranking system to make effective decisions pertaining to the reduction of traffic congestion.

Areas of future research for the combined model approach to HOF inductions include applying additional machine learning algorithms along with SVM and NN to make the results even more reliable. Some of the algorithms that could be incorporated into the combined model are gradient-boosted decision trees, random forests, and logistic regression. Since logistic regression is equivalent to a neural network with no hidden layers, results obtained from logistic regression and neural networks can be compared to see how far off their respective probability outputs are from the corresponding binary dependent variables. Another area of research that this dissertation can potentially address is taking into account the overall team performance and team reputation for the respective players. If a certain player helped his team win the World Series, this accomplishment is likely to boost his chances of getting into the HOF. In addition, this dissertation’s combined model approach can also be applied to the selection of players for All-Star Game appearances, MVP awards, and Cy Young awards. Since these awards are affected by media bias, the model can be used to objectively determine which players deserve such accolades, just like the one for HOF inductions. Furthermore, it can be applied to other sports such as basketball, ice hockey, American football, and soccer. In a very similar way, retired players in other sports can be divided into groups based on their positions; then they can be assessed and ranked based on their performance statistics relative to other retired players in their respective sports.


RESEARCH FUNDING SOURCES

Partial financial support by grants from UCCONNECT (No. 00008606) and the National Institute for Transportation and Communities (NITC) Fellowship (No. 1180) is gratefully acknowledged. We want to thank the Arizona Department of Transportation (ADOT) and TomTom N.V. for providing the datasets that made the analyses in Chapter 4 and Chapter 5 of this dissertation possible, respectively. Furthermore, we also want to thank the University of Arizona Department of Systems & Industrial Engineering, the University of Arizona Graduate and Professional Student Council (GPSC), and the NITC Student Travel Award for providing the travel grants that made the presentation of Chapter 5 of this dissertation possible at the 2018 IEEE-ITSC conference in Maui, Hawaii.


REFERENCES

AASHTO (2010). Highway Safety Manual, 1st Edition. American Association of State Highway and Transportation Officials, Washington, DC.

Abdel-Aty, M., Keller, J., & Brady, P. A. (2005). Analysis of the Types of Crashes at Signalized Intersections Using Complete Crash Data and Tree-Based Regression. Proceedings of TRB 2005 Annual Meeting , Washington, D.C.

ADOT AADT Datasets (2011-2014), https://www.azdot.gov/planning/DataandAnalysis . Arizona Department of Transportation, Phoenix, AZ.

ADOT Crash Database (2011-2014). Arizona Department of Transportation, Phoenix, AZ.

Al-Sobky, A.S. & Mousa, R.M. (2016). Traffic Density Determination and its Applications Using Smartphone. Alexandria Engineering Journal 55 (1), 513-523.

Avendi, M.R., Kheradvar, A., & Jafarkhani, H. (2016). A Combined Deep-Learning and Deformable-Model Approach to Fully Automatic Segmentation of the Left Ventricle in Cardiac MRI. Medical Image Analysis 30 , 108-119.

Baseball Eras Committees, https://baseballhall.org/hall-of-famers/rules/eras-committees.

Baseball Rule Change Timeline, http://www.baseball-almanac.com/rulechng.shtml.

Baseball Statistics Dataset, https://www.baseball-reference.com.

BBWAA Hall of Fame Rules for Election, https://baseballhall.org/hall-of-famers/rules/bbwaa- rules-for-election.

Behera, T.K. & Panigrahi, S. (2015). Credit Card Fraud Detection: A Hybrid Approach Using Fuzzy Clustering & Neural Network. 2015 Second International Conference on Advances in Computing and Communication Engineering . doi: 10.1109/ICACCE.2015.33.

Bernhardt, D., Krasa, S., & Polborn, M. (2008). Political Polarization and the Electoral Effects of Media Bias. Journal of Public Economics 92 (5-6), 1092-1104.

Blincoe, L., Seay, A., Zaloshnja, E., Miller, T., Romano, E., Luchter, S., & Spicer, R. (2002). The Economic Impact of Motor Vehicle Crashes. Washington, DC: U.S. DOT.

Boarnet, M.G., Kim, E., & Parkany, E. (1998). Measuring Traffic Congestion. Transportation Research Record: Journal of the Transportation Research Board 1634 (1), 93-99.

Braess, D. (1968). Über ein Paradoxon der Verkehrsplanung. Unternehmensforschung 12, 258-268.

Braess, D., Nagurney, A., & Wakolbinger, T. (2005). On a Paradox of Traffic Planning. Transportation Science 39 (4), 446-450.

Brin, S. & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems , 30, 107-117.

Cai, F., Chen, H., & Shu, Z. (2015). Web Document Ranking via Active Learning and Kernel Principal Component Analysis. International Journal of Modern Physics C 26 (4).

Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., & Li, H. (2007). Learning to Rank: From Pairwise Approach to Listwise Approach. Proceedings of the 24 th International Conference on Machine Learning , 129-136.

Carlin, B. & Louis, T. (2000). Bayes and Empirical Bayes Methods for Data Analysis. Boca Raton, Florida: CRC Press.

Carriquiry, A.L. & Pawlovich, M. (2004). From Empirical Bayes to Full Bayes: Methods for Analyzing Traffic Safety Data, White Paper. Iowa Department of Transportation, Ames, Iowa.

Chang, Y.S., Lee, Y.J., & Choi, S.S.B. (2017). Is There More Traffic Congestion in Larger Cities? - Scaling Analysis of the 101 Largest U.S. Urban Centers. Transport Policy 59, 54-63.

Cheng, W. & Washington, S.P. (2005). Experimental Evaluation of Hotspot Identification Methods. Accident Analysis and Prevention, 37 (5), 870–881. doi:10.1016/j.aap.2005.04.015.

Chiang, C.F. & Knight, B.G. (2011). Media Bias and Influence: Evidence from Newspaper Endorsements. Review of Economic Studies 78 (3), 795-820.

Cooper, R.G. & Sommer, A.F. (2016). The Agile–Stage‐Gate Hybrid Model: A Promising New Approach and a New Research Opportunity. The Journal of Product Innovation Management , 33 (5), 513-526.

D’Este, G.M., Zito, R., & Taylor, M.A.P. (1999). Using GPS to Measure Traffic System Performance. Computer-Aided Civil and Infrastructure Engineering 14, 255-265.

Davis, G.A., & Yang, S. (2001). Bayesian Identification of High-Risk Intersections for Older Drivers via Gibbs Sampling. Transportation Research Record , 1746, 84–89. doi:10.3141/1746- 11.

Deacon, J.A., Zegeer, C.V., & Deen, R.C. (1975). Identification of Hazardous Rural Highway Locations. Transportation Research Record 543, 16–33. Washington, D.C.: TRB, National Research Council.

DellaVigna, S. & Kaplan, E. (2007). The Fox News Effect: Media Bias and Voting. The Quarterly Journal of Economics 122 (3), 1187-1234.

Depken II, C.A. & Ford, J.M. (2006). Customer-based Discrimination against Major League Baseball Players: Additional Evidence from All-Star Ballots. The Journal of Socio-Economics 35 (6), 1061-1077.

Desser, A., Monks, J., & Robinson, M. (1999). Baseball Hall of Fame Voting: A Test of the Customer Discrimination Hypothesis. Social Sciences Quarterly 80 (3), 591-603.

Ehlenfeldt, M.K., Polashock, J.J., & Stretch, A.W. (2010). Ranking Cultivated Blueberry for Mummy Berry Blight and Fruit Infection Incidence Using Resampling and Principal Components Analysis. HortScience 45 (8), 1205-1210.

Fang, L., Xiao, B., Yu, H., & You, Q. (2018). A Stable Systemic Risk Ranking in China’s Banking Sector: Based on Principal Component Analysis. Physica A: Statistical Mechanics and its Applications 492, 1997-2009.

Findlay, D.W. & Reid, C.E. (1997). Voting Behavior, Discrimination and the National Baseball Hall of Fame. Economic Inquiry 35 (3), 562-578.

Findlay, D. W. & Reid, C. E. (2002). A Comparison of Two Voting Models to Forecast Election into the National Baseball Hall of Fame. Managerial and Decision Economics 23, 99-113.

Freiman, M.H. (2010). Using Random Forests and Simulated Annealing to Predict Probabilities of Election to the Baseball Hall of Fame. Journal of Quantitative Analysis in Sports 6 (2), 1-35.

Gerber, A.S., Karlan, D., & Bergan, D. (2009). Does the Media Matter? A Field Experiment Measuring the Effect of Newspapers on Voting Behavior and Political Opinions. American Economic Journal: Applied Economics 1 (2), 35-52.

Gleckler, P.J., Taylor, K.E., & Doutriaux, C. (2008). Performance Metrics for Climate Models. Journal of Geophysical Research – Atmospheres 113 (D6). doi:10.1029/2007JD008972.

Guo, G., Li, S.Z., & Chan, K.L. (2001). Support Vector Machines for Face Recognition. Image and Vision Computing 19 (9-10), 631-638.

Han, Y., Zhang, X., & Yu, L. (2012). Traffic Congestion Measurement Method of Road Network in Large Passenger Hub Station Area. LTLGB 2012: Proceedings of International Conference on Low-Carbon Transportation and Logistics, and Green Buildings , Volume 1, Chapter 31, 189-195.

Hanssen, F.A. & Andersen, T. (1999). Has Discrimination Lessened Over Time? A Test Using Baseball’s All-Star Vote. Economic Inquiry 37 (2), 326-352.

Hanssen, F.A. (2002). A Test of the Racial Contact Hypothesis from a Natural Experiment: Baseball’s All‐Star Voting as a Case. Social Science Quarterly 82 (1), 51-6.

Hauer, E. (1986). On the Estimation of the Expected Number of Accidents. Accident Analysis and Prevention , 18(1), 1–12. doi:10.1016/0001-4575(86)90031-X.

Hauer, E. (1996). Identification of Sites with Promise. Transportation Research Record, 1542, 54–60. doi:10.3141/1542-09.

Hauer, E., Council, F. M., & Mohammedshah, Y. (2004). Safety Models for Urban Four-Lane Undivided Road Segments. Proceedings of the Transportation Research Board of the National Academies , Washington, D.C.

Hauer, E., Harwood, D.W., Council, F.M., & Griffith, M.S. (2002). Estimating Safety by the Empirical Bayes Method: A Tutorial. Transportation Research Record, 1784, 126–131. doi:10.3141/1784-16.

Hauer, E., Ng, J. C. N., & Lovell, J. (1989). Estimation of Safety at Signalized Intersections. Transportation Research Record , 1185, 48–61.

Hauer, E. & Persaud, B. N. (1987). How to Estimate the Safety of Rail-Highway Grade Crossings and the Safety Effects of Warning Devices (No. 1114).

Hauer, E., Persaud, B. N., Smiley, A., & Duncan, D. (1991). Estimating the Accident Potential of an Ontario Driver. Accident Analysis and Prevention , 23(2/3), 133–152. doi:10.1016/0001- 4575(91)90044-6.

Huang, H., Chin, H., & Haque, M. (2009). Empirical Evaluation of Alternative Approaches in Identifying Crash Hotspots: Naive Ranking, Empirical Bayes, and Full Bayes Methods. Transportation Research Record: Journal of the Transportation Research Board, 2103, 32–41. doi:10.3141/2103-05.

ICF Consulting, Ltd. (2003). Cost-Benefit Analysis of Road Safety Improvements. London, UK.

JMW Engineering, Inc. AIMS Capabilities. http://www.jmwengineering.com/aims_capability.htm, cited June 7, 2003.

Jarrahi, M.H. (2018). Artificial Intelligence and the Future of Work: Human-AI Symbiosis in Organizational Decision Making. Business Horizons 61 (4), 577-586.

Jewell, R.T., Brown, R.W., & Miles, S.E. (2002). Measuring Discrimination in Major League Baseball: Evidence from the Baseball Hall of Fame. Applied Economics 34 (2), 167-177.

Jewell, R.T. (2003). Voting for the Baseball Hall of Fame: The Effect of Race on Election Date. Journal of Industrial Relations 42 (1), 87-100.

Jia, H. & Martinez, A.M. (2009). Support Vector Machines in Face Recognition with Occlusions. 2009 IEEE Conference on Computer Vision and Pattern Recognition .

Joachims, T. (2002), Optimizing Search Engines using Clickthrough Data. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘02 , 133-142.

Kong, X., Yang, J., & Yang, Z. (2015). Measuring Traffic Congestion with Taxi GPS Data and Travel Time Index. 15th COTA International Conference of Transportation Professionals.

Kumara, S. S. P., & Chin, H. C. (2005). Application of Poisson Underreporting Model to Examine Crash Frequencies at Signalized Three-Legged Intersections. Proceedings of the TRB 2005 Annual Meeting , Washington, D.C.

Lee, A.S., Lin, W.H., Gill, G.S., & Cheng, W. (2018). An Enhanced Empirical Bayesian Method for Identifying Road Hotspots and Predicting Number of Crashes. Journal of Transportation Safety & Security , DOI: 10.1080/19439962.2018.1450314.

Lee, A.S. & Lin, W.H. (2018). Traffic Congestion Assessment of Metropolitan Areas Through Hybrid Model Ranking. The 21st IEEE International Conference on Intelligent Transportation Systems (Presented).

Lee, J., Chung, K., & Kang, S. (2016). Evaluating and Addressing the Effects of Regression to the Mean Phenomenon in Estimating Collision Frequencies on Urban High Collision Concentration Locations. Accident Analysis and Prevention , 97, 49–56. doi:10.1016/j.aap.2016.08.019.

Liou, J.J.H., Tzeng, G.H., & Chang, H.C. (2007). Airline Safety Measurement Using a Hybrid Model. Journal of Air Transport Management , 13 (4), 243-249.

Liu, H.X., Zhang, R.S., Luan, F., Yao, X.J., Liu, M.C., Hu, Z.D., & Fan, B.T. (2003). Diagnosing Breast Cancer Based on Support Vector Machines. Journal of Chemical Information and Modeling 43 (3), 900-907.

Lloyd, S. & Downey, J. (2009). Predicting Baseball Hall of Fame Membership using a Radial Basis Function Network. Journal of Quantitative Analysis in Sports 5 (1), 1-21.

Louf, R. & Barthelemy, M. (2014). How Congestion Shapes Cities: From Mobility Patterns to Scaling. Scientific Reports 4:5561.

Maher, M. J. & Mountain, L. J. (1988). The Identification of Accident Blackspots: A Comparison of Current Methods. Accident Analysis and Prevention, 20 (2), 143–151. doi:10.1016/0001-4575(88)90031-0.

Manage, A.B.W. & Scariano, S.M. (2013). An Introductory Application of Principal Components to Cricket Data. Journal of Statistics Education 21 (3). doi: 10.1080/10691898.2013.11889689.

Mandal, K., Sen, A., Chakraborty, A., Roy, S., Batabyal, S., & Bandyopadhyay, S. (2011). Road Traffic Congestion Monitoring and Measurement Using Active RFID and GSM Technology. 14th International IEEE Conference on Intelligent Transportation Systems (ITSC) , 1375-1379.

McNamara, P., “Drivers, Beware: Tucson 21st in Worst Congestion,” Arizona Daily Star, http://tucson.com/news/local/drivers-beware-tucson-st-in-worst-congestion/article_b9c8ae80- 503e-5154-b9c0-3d20bd8bb176.html, 2015, accessed on April 2015.

Meza, J. (2003). Empirical Bayes Estimation Smoothing of Relative Risks in Disease Mapping. Journal of Statistical Planning and Inference , 112 (1), 43–62. doi:10.1016/S0378-3758 (02)00322-1.

Mills, B.M. & Salaga, S. (2011). Using Tree Ensembles to Analyze National Baseball Hall of Fame Voting Patterns: An Application to Discrimination in BBWAA Voting. Journal of Quantitative Analysis in Sports 7 (4), 1-32.

Mitchell, G.J., 2007. Report to the Commissioner of Baseball of an Independent Investigation into the Illegal Use of Steroids and Other Performance Enhancing Substances by Players in Major League Baseball, http://mlb.mlb.com/mlb/news/mitchell/index.jsp.

Moon, T., Smola, A., Chang, Y., & Zheng, Z. (2010). IntervalRank: Isotonic Regression with Listwise and Pairwise Constraints. Proceedings of the 3 rd ACM International Conference on Web Search and Data Mining , 151-160.

Ng, Andrew, 2011. Machine Learning, Coursera.

Ng, Andrew, 2017. Deep Learning, deeplearning.ai.

Norden, M., Orlansky, J., & Jacobs, H. (1956). Application of Statistical Quality-Control Techniques to Analysis of Highway-Accident Data. Bulletin 117, HRB (pp. 17–31). Washington, D.C: National Research Council.

Nowakowska, M. (2002). Identifying Similarities and Dissimilarities Among Road Accident Patterns. Transportation and Traffic Theory in the 21st Century, Proceedings of the 15th International Symposium on Transportation and Traffic Theory .

Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies , Working Paper 1999-0120.

Pai, P.F. & Lin, C.S. (2005). A Hybrid ARIMA and Support Vector Machines Model in Stock Price Forecasting. Omega 33 (6), 497-505.

Paliwal, M. & Kumar, U.A. (2009). Neural Networks and Statistical Techniques: A Review of Applications. Expert Systems with Applications 36 (1), 2-17.

Pattara-atikom, W., Pongpaibool, P., & Thajchayapong, S. (2006). Estimating Road Traffic Congestion using Vehicle Velocity. 6th International Conference on ITS Telecommunications , 1001-1004.

Pattara-atikom, W., Peachavanish, R., & Luckana, R. (2007). Estimating Road Traffic Congestion using Cell Dwell Time with Simple Threshold and Fuzzy Logic Techniques. Proceedings of the 2007 IEEE Intelligent Transportation Systems Conference , 956-961.

Persaud, B. N. (1988). Do Traffic Signals Affect Safety? Some Methodological Issues. Transportation Research Record, 1185, 37–46.

Platt, J. (2000). Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers. Cambridge, MA: MIT Press.

Rather, A.H., Agarwal, A., & Sastry, V.N. (2015). Recurrent Neural Network and a Hybrid Model for Prediction of Stock Returns. Expert Systems with Applications 42 (6), 3234-3241.

Richardson, M., Prakash, A., & Brill, E. (2006). Beyond PageRank: Machine Learning for Static Ranking. Proceedings of the 15th International World Wide Web Conference , 707–715.

Schrank, D., Eisele, B., Lomax, T., & Bak, J. (2015). 2015 Urban Mobility Scorecard. https://mobility.tamu.edu/ums/report/ , cited August 2015.

Sculley, D. (2010). Combined Regression and Ranking. Proceedings of the 16 th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘10 , 979-988.

Sharma, N. & Saroha, K. (2015). A Novel Dimensionality Reduction Method for Cancer Dataset using PCA and Feature Ranking. 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI) . doi: 10.1109/ICACCI.2015.7275954.

Sharma, R., Grover, V. L., & Chaturvedi, S. (2008). Health-Risk Behaviors Related to Road Safety Among Adolescent Students.

Sharma, S. (2008). Principal Component Analysis (PCA) to Rank Countries on their Readiness for E-tail. Journal of Retail & Leisure Property 7 (2), 87-94.

Shukla, N. (2016). Lung Cancer Detection and Classification using Support Vector Machine. International Journal of Engineering and Computer Science . doi:10.18535/Ijecs/v4i11.20.

Shukur, O.B. & Lee, M.H. (2015). Daily Wind Speed Forecasting through Hybrid KF-ANN Model based on ARIMA. Renewable Energy , Elsevier, vol. 76(C), 637-647.

Thomaz, C.E. & Giraldi, G.A. (2010). A New Ranking Method for Principal Components Analysis and its Application to Face Image Analysis. Image and Vision Computing 28 (6), 902- 913.

TomTom International B.V. TomTom Traffic Index – Measuring Congestion Worldwide. https://www.tomtom.com/en_gb/trafficindex/, cited March 2017.

Wang, D.G. & Li, S.M. (2009). Congestion Pricing to Reduce Traffic Emission. Transportation and Geography 1, 289-295.

Wang, H., Xiao, G.Y., Zhang, L.Y., & Ji, Y.B.B. (2014). Transportation Network Design Considering Morning and Evening Peak-Hour Demands. Mathematical Problems in Engineering .

Wei, W., Jiang, J., Liang, H., Gao, L., Liang, B., & Huang, J. (2016). Application of a Combined Model with Autoregressive Integrated Moving Average (ARIMA) and Generalized Regression Neural Network (GRNN) in Forecasting Hepatitis Incidence in Heng County, China. PLoS ONE 11 (6): e0156768.

Wright, C., Abbess, C., & Jarrett, D. (1988). Estimating the Regression-to-Mean Effect Associated with Road Accident Black Spot Treatment: Towards a More Realistic Approach. Accident Analysis and Prevention , 20(3), 199–214. doi:10.1016/0001-4575(88)90004-8.

Wu, L., Zou, Y., & Lord, D. (2014). Comparison of Sichel and Negative Binomial Models in Hotspot Identification. Transportation Research Record: Journal of the Transportation Research Board , 2460, 107–116. doi:10.3141/2460-12.

Xia, F., Liu, T.Y., Wang, J., Zhang, W., & Li, H. (2008). Listwise Approach to Learning to Rank – Theory and Algorithm. Proceedings of the 25th International Conference on Machine Learning , 1192-1199.

Xiao, L., Wang, J., Hou, R., & Wu, J. (2015). A Combined Model Based on Data Pre-Analysis and Weight Coefficients Optimization for Electrical Load Forecasting. Energy 82 , 524-549.

Yi, L., Hui, Y., & Yang, D. (2013). Road Traffic Congestion Measurement Considering Impacts on Travelers. Journal of Modern Transportation 21 (1), 28-39.

Young, W.A., Holland, W.S., & Weckman, G.R. (2008). Determining Hall of Fame Status for Major League Baseball Using an Artificial Neural Network. Journal of Quantitative Analysis in Sports 4 (4), 1-46.

Zhang, J. & Liu, Y. (2004). Cervical Cancer Detection Using SVM Based Feature Screening. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2004) , 873-880.

Zhang, S. & Qiao, H. (2003). Face Recognition with Support Vector Machine. IEEE International Conference on Robotics, Intelligent Systems and Signal Processing, 2003 Proceedings. doi:10.1109/RISSP.2003.1285674.

Zhang, S. & Zhang, H. (2008). Analysis of and Suggestion for the Congested Traffic in Shanghai during Peak Hours. Advances in Management of Technology , Part 2.

Zhang, W., Qu, Z., Zhang, K., Mao, W., Ma, Y., & Fan, X. (2017). A Combined Model Based on CEEMDAN and Modified Flower Pollination Algorithm for Wind Speed Forecasting. Energy Conversion and Management 136 , 439-451.