Florida State University Libraries

2016 Aging Driver Focused Traffic Crash Frequency and Severity Analyses Aschkan Omidvar

FLORIDA STATE UNIVERSITY

FAMU-FSU COLLEGE OF ENGINEERING

AGING DRIVER FOCUSED TRAFFIC CRASH FREQUENCY AND SEVERITY ANALYSES

By ASCHKAN OMIDVAR

A Thesis submitted to the Department of Industrial & Manufacturing Engineering in partial fulfillment of the requirements for the degree of Master of Science

2016

© 2016 Aschkan Omidvar

Aschkan Omidvar defended this thesis on July 12, 2016. The members of the supervisory committee were:

O. Arda Vanli Professor Co-Directing Thesis

Eren Erman Ozguven Professor Co-Directing Thesis

Abhishek K. Shrivastava Committee Member

Chiwoo Park Committee Member

The Graduate School has verified and approved the above-named committee members, and certifies that the thesis has been approved in accordance with university requirements.


Dedicated to those who made me a better man…


ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisors Dr. Arda Vanli and Dr. Eren Erman Ozguven for providing me the opportunity to come to Florida State University and be part of their research group. I will forever be indebted to them for the guidance, mentorship and support throughout my studies. The past two years of research experience under their guidance is a tremendous asset to my career and life.

I am also thankful to my MS committee members, Dr. Chiwoo Park and Dr. Shrivastava, whose insightful advice and comments have served as valuable inputs to the engineering and scientific significance of this research.

This thesis was supported by United States Department of Transportation grant DTRT13-G-UTC42, administered by the Center for Accessibility and Safety for an Aging Population (ASAP) at the Florida State University (FSU), Florida A&M University (FAMU), and University of North Florida (UNF). We also thank the Florida Department of Transportation and the National Oceanic and Atmospheric Administration for providing the data. The opinions, results, and findings expressed in this manuscript are those of the authors and do not necessarily represent the views of the United States Department of Transportation, the Florida Department of Transportation, the National Oceanic and Atmospheric Administration, the Center for Accessibility and Safety for an Aging Population, the Florida State University, the Florida A&M University, or the University of North Florida.


TABLE OF CONTENTS

List of Figures
List of Tables
List of Abbreviations
Abstract

1. INTRODUCTION

2. LITERATURE REVIEW

2.1 Traffic Crash Analysis: Statistical Tools
2.1.1 Logistic Regression
2.1.2 Poisson and Negative Binomial
2.1.3 Statistical Learning Methods
2.2 Traffic Crash Analysis: Computational Intelligence
2.2.1 Traffic Crash Analysis: Neural Networks
2.2.2 Traffic Crash Analysis: Other Approaches

3. RESEARCH METHODOLOGY AND RESULTS

3.1 Data Collection and Pre-Processing
3.1.1 Statewide FDOT Crash Dataset
3.1.2 Meteorological Data
3.1.3 Hourly Traffic Flow
3.2 Exploratory Descriptive Analysis
3.3 Correlation Analysis for the Entire Corridor
3.4 Logistic Regression Analysis
3.4.1 Logistic Regression Analysis of Crash Frequency for Roadway Segments
3.4.2 Exploratory Analysis for Crash Severity
3.4.3 Logistic Regression Analysis of Crash Severity for Roadway Segments
3.5 Prediction Capabilities using ROC Curves
3.5.1 Crash Frequency Prediction Analysis
3.6 Research Limitations

4. CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS

APPENDICES

A. KERNEL DENSITY FUNCTION FOR CRASH SEVERITY (JAX)
B. KERNEL DENSITY FUNCTION FOR CRASH SEVERITY (

References

Biographical Sketch

LIST OF FIGURES

1 Research Methodology
2 I-95 Corridor and TTMS Stations in Florida
3 A Sample of FTI User Interface
4 Output Database Processed for the Research
5 Time Series Plots for Traffic Flow: North and South Directions and April and Summer Months Hourly Averages
6 Time Series Plot of Hourly Crash Counts for All and Aging Drivers
7 Histograms of Crash Flows for All and Aging Drivers in Miami
8 Flow Dot-plots of Flow Data for Different Light Conditions (a) Miami, (b) Jacksonville
9 Traffic Characteristics for the I-95 Corridor in Florida, (a) Correlation Matrix for the AADT vs. Roadway Width for All Segments, (b) AADT, and Crashes for All Age Groups and Aging-Involved Crash Frequencies Versus Distance (miles)
10 Logit Curves for Crash Probability vs. Flow, (a) Miami Station 2 All Drivers, (b) Miami Station 2 Aging Drivers
11 (a) Crash Severity Kernel Density Maps for the Miami 1 Location in (b) 2010, (c) 2011, (d) 2012
12 Kernel Density for Miami 1 for Year 2010 and Comparison to the Scaled Density Values on the Highway for the Years 2010, 2011 and 2012
13 Kernel Density Function on a 2-D Plane for 3 Years. X: Segment Length. Y: Normalized Kernel Raster Value
14 Delta-Beta Residual Analysis for Outlier Detection and Removal (Node Miami 1). The Right-Hand Side Figure Is the Expanded Version of the Box on the Left-Hand Side Figure Shown After Outlier Handling
15 A Schematic Illustration of Aggregation Approach
16 (left) A Sample Frequency Table for Observation vs. Prediction Generated by Algorithm, (right) ROC Curve for Crash Frequency Model for Node Miami 1-All Age Groups
17 Receiver Operating Characteristic Plots for Crash Severity Prediction by Logit for Miami-All Ages (a), Miami-Aging Drivers (b), Jacksonville-All Ages (c) and Jacksonville-Aging Drivers (d). Values at the Lower Right Corner of the Plots Show Area Under Curves (AUC)
18 Comparison of ROC Curves for Spatially Aggregated Model of Miami with Data for 2010-2012 vs. Spatially Aggregated Model for 2011

LIST OF TABLES

1 FDOT Crash Data Attributes
2 Summary of Crashes According to Age Groups, Location, Time of Day and Weather Conditions
3 Deviance Goodness of Fit Test Results for Different Models (LoF: Lack of Fit)
4 Fitted Regression Models. The Values on Top Are P-Values and the Underlined and Bolded Italic Numbers Are Coefficient Estimates
5 Goodness of Fit P-Values between Observed Data and Predicted Values
6 Fitted Regression Models (Crash Severity). The Values Underlined in Italic Are P-Values and Other Numbers Are Coefficients. Results Are Significant Only for Miami 1 and Miami 2 Locations
7 Crash Attributes for Crash Severity
8 Candidate Factor for Regression Analysis of Crash Severity
9 Number of Observations in Severity Data and Test Data Sets
10 Crash Severity Involvement by Age (all numbers are in percent). The number of crashes experienced by an age group is shown as a percentage of the total number of crashes in a location
11 Fitted Regression Models for Crash Severity Analysis: Locations in Miami and Jacksonville Are Temporally Aggregated for 3 Years (2010-2012)
12 Logistic Regression Analyses for Aggregated Models of Three Years vs. Aggregated Model of 2011
13 Schematic Observation Frequency Table for Observation vs. Prediction
14 Performance Comparison of Temporally Aggregated Model for Miami vs. Miami-2011

LIST OF ABBREVIATIONS

AADT: Annual Average Daily Traffic
ANN: Artificial Neural Network
ART: Adaptive Resonance Theory
AVI: Automatic Vehicle Identification
BNN: Bayesian Neural Networks
BPNN: Back Propagation Neural Networks
CART: Classification and Regression Tree
CC: Convex Combination
CFS: Correlation Based Feature Selector
CI: Computational Intelligence
DHSMV: Department of Highway Safety and Motor Vehicles
EC: Evolutionary Computation
FAWN: Florida Automated Weather Network
FB: Full Bayes
FDOT: Florida Department of Transportation
FL: Fuzzy Logic
GIS: Geographical Information System
GP: Goal Programming
HL: Hosmer Lemeshow
ML: Machine Learning
MLP: Multi-Layer Perceptron
N2PFA: NN Pruning for Function Approximation
NB: Negative Binomial
NC: Neural Computing
NLCCA: Nonlinear (Nonparametric) Canonical Correlation Analysis
NOAA: National Oceanic and Atmospheric Administration
OP: Ordered Probit
PNN: Probabilistic Neural Network
PTMS: Portable Traffic Monitoring Sites
RBF: Radial Basis Function
RF: Random Forest
ROC: Receiver Operating Characteristic
SVC: Single Vehicle Crash
SVM: Support Vector Machines
TAZ: Traffic Analysis Zone
TTMS: Telemetered Traffic Monitoring Sites
USNO: United States Naval Observatory

ABSTRACT

The aim of this thesis is to investigate the effect of environmental and traffic-related factors on the frequency and severity of traffic crashes across different age groups, with special attention to the aging population. Existing studies in traffic safety have not specifically focused on aging driver-involved crashes. It is well known that aging drivers are more vulnerable to roadway crashes than other adult age groups because of their cognitive, physical and health limitations. The problem becomes more challenging because of the drastic variation in traffic patterns, especially on major highways. In this thesis, several datasets from different sources, namely the National Oceanic and Atmospheric Administration (NOAA), the Florida Automated Weather Network (FAWN), the Florida Department of Transportation (FDOT) and the United States Naval Observatory (USNO), are collected, refined and combined. With the aid of statistical correlation analysis and logistic regression, a top-down analysis is performed to analyze the occurrence of crashes via a case study on the I-95 highway corridor in the State of Florida. Using logit curves, a sensitivity analysis is carried out to quantify the effect of traffic volume on crash frequency. In addition to the crash frequency analysis, factors influencing crash severity are analyzed in an aggregated manner for two metropolitan areas, Jacksonville and Miami, Florida. Both the frequency- and severity-focused analyses lead to several important conclusions. Results suggest that the variation in hourly traffic volume significantly affects crash occurrence for both aging drivers and drivers of all ages, depending on the geographical location; however, in congested locations the crash occurrence for aging drivers is less sensitive to flow than that for all age groups. Results indicate that crash severities for all other age groups decrease on roadways with narrower shoulders and at night, unlike those of aging drivers. Furthermore, driving at night on I-95 in Jacksonville seems to be problematic for both age classes, whereas that risk is lower at the Miami locations. A greater roadway surface width also appears to increase the chance of a severe crash for aging drivers. DUI-related crashes were also found to be considerably frequent on the I-95 highway corridor in Miami, Florida. This problem seems critical in terms of both crash frequency and severity. The proposed methodology can help transportation officials understand the nature of aging driver-involved crashes and formulate more effective safety-oriented decisions.


CHAPTER 1

INTRODUCTION

In 2013, 32,719 people died and approximately 2.3 million people were injured in motor vehicle crashes in the U.S. Traffic incidents also imposed an economic cost of $242.0 billion on the U.S. economy in 2010. In 2012, the number of police-reported vehicle crashes was over 5.5 million, in which 33,561 people were killed and over 2.3 million people were injured or paralyzed [1-3]. These figures show the importance of thoroughly studying the factors that affect the occurrence of crashes. In the literature, researchers classify the crash-influencing factors into three main categories: (1) behavioral and human-related factors, (2) environmental factors and (3) vehicle-related factors. Recent literature indicates that approximately 80 to 95% of roadway crashes are driver-related [1, 4]. This shows the significant impact of cognitive limitations and driver errors, including reckless driving, fatigue and driving under the influence. In addition, these main factors may interact and cause crashes. For instance, harsh weather on a steep upgrade or on poor-quality pavement can make a driver prone to error, which may lead to a crash. Therefore, a robust crash prediction methodology should study the effect of these factors, as well as their interactions, on the likelihood of crash occurrence and on crash severity.

The objective of this thesis is to investigate the impact of environmental and traffic-related factors on the frequency and severity of highway crashes, with a focus on different age groups including the aging population. To the author's knowledge, existing studies in traffic safety have not specifically focused on aging driver-involved crashes. Aging drivers are more vulnerable to roadway crashes than other adult age groups because of their cognitive, physical and health limitations. Therefore, we pay special attention to the frequency and severity of aging driver-involved crashes.

In this thesis, several datasets from different sources, namely the National Oceanic and Atmospheric Administration (NOAA), the Florida Automated Weather Network (FAWN), the Florida Department of Transportation (FDOT) and the United States Naval Observatory (USNO), are collected, refined and combined. With the aid of statistical correlation analysis and logistic regression, a top-down analysis is performed to analyze the occurrence of crashes via a case study on the I-95 highway corridor in the State of Florida. Using logit curves, a sensitivity analysis is carried out to quantify the effect of traffic volume on crash frequency.

In Chapter 2, we present a comprehensive review of the existing literature, covering several topics that are central or related to our research. In detail, we discuss studies of the various factors affecting crash frequency and severity, followed by a review of studies that employed GIS tools to identify influencing factors and their correlations through different spatial and temporal analyses. We also cover studies that used support vector machines to analyze and predict the significant parameters in crash involvement. Finally, we survey the studies that employed artificial intelligence for crash factor analysis and prediction.

In Chapter 3, we discuss the contribution of this thesis. We first present the overall research methodology, followed by the data collection, synchronization and processing approach. Next, we perform an exploratory descriptive analysis of the selected I-95 segments to get a better sense of the crash characteristics as well as the traffic flow behavior in those segments. Moreover, we investigate the trends and patterns of crash occurrence, and the frequency of crashes with respect to different flows. This evaluation is followed by an analysis of the highway crashes on the I-95 corridor in the State of Florida. Here, we perform correlation and factor analyses for each Telemetered Traffic Monitoring Sites (TTMS) station to understand the crash patterns of the entire corridor with respect to several roadway-related factors. Finally, we introduce the overall modelling approach. To do this, we first survey the most common and powerful statistical modelling approaches, and then select the most suitable one for this research, which is the logistic regression model. We use this model to find the significant factors affecting crash frequency on all selected I-95 segments in the metropolitan areas of Jacksonville and Miami. We measure the validity and goodness of fit of all the models through several statistical tests. We also propose logit curves to help policy makers and practitioners in the traffic safety decision-making process. In addition to crash frequency, this thesis also focuses on crash severity. Therefore, we analyze the factors influencing crash severity in an aggregated manner for the two metropolitan areas of Jacksonville and Miami, and we conclude the chapter with some managerial insights and suggestions.

Chapter 4 presents the conclusions and future work directions.


CHAPTER 2

LITERATURE REVIEW

2.1 Traffic Crash Analysis: Statistical Tools

For decades, researchers in the field of crash analysis and traffic safety have tried to develop statistical and computational intelligence tools to identify the major factors in crashes and to predict the frequency and severity of future crashes. Statistical tools, such as generalized linear models and regression analyses, have been extensively studied in different branches of traffic safety [5]. Most studies in the field of traffic safety focus on two main problems: (1) crash hotspot analysis, which detects the locations with a high risk of crash occurrence and the corresponding significant factors affecting crash frequency and severity, and (2) crash prediction, trends and forecasting. Various studies have taken different approaches using statistical tools to identify, predict and analyze crashes spatially and temporally, and to investigate the risk and severity associated with each crash location. We present an extensive review of these studies in this section.

In this research, we review the different modelling approaches available in the literature. Almost all of them can be classified under generalized linear models (GLM). In a GLM, the response variable may follow an error distribution other than the normal distribution, drawn from the exponential family, such as the Poisson, gamma and binomial distributions. In fact, the GLM generalizes linear regression by relating the linear predictor to the response variable through a link function. SAS Institute separates GLM from a similar approach, named Generalized Regression models [6]. These techniques try to fit better models by shrinking the model coefficients toward zero. Therefore, the resulting estimates are biased. This increase in bias may result in a reduction of prediction variance, and consequently a decrease in overall prediction error. Some of the most frequently used estimation methods are Maximum Likelihood, Forward Selection, Elastic Net, Lasso and Ridge Regression, where the Elastic Net and the Lasso involve variable selection in the modeling procedure. For more traditional modeling techniques, variable selection can be used when p > n, that is, when the number of predictors p exceeds the number of observations n. In addition, these techniques, specifically the Elastic Net and the Lasso, produce promising results for large-scale datasets and for those in which collinearity is a significant issue. Several studies have utilized these techniques to analyze the factors affecting crash frequency or severity; however, to the best of the author's knowledge, they mostly employed Maximum Likelihood for this purpose. The Maximum Likelihood method is a rather classical approach, and it is commonly used as a baseline for comparison with other techniques.
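The bias-variance trade-off of coefficient shrinkage described above can be illustrated with a minimal sketch. It uses ridge regression only because it has a simple closed form; the data are synthetic, and the function name `ridge_coefficients` and all parameter values are assumptions made for illustration, not part of the thesis analysis.

```python
import numpy as np

def ridge_coefficients(X, y, penalty):
    """Closed-form ridge estimate (X'X + penalty * I)^-1 X'y, without intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

# Synthetic data (assumption): two nearly collinear predictors.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# penalty = 0 gives least squares (the Gaussian maximum likelihood estimate);
# a positive penalty shrinks the coefficient vector toward zero.
beta_ml = ridge_coefficients(X, y, penalty=0.0)
beta_ridge = ridge_coefficients(X, y, penalty=10.0)

print("unpenalized:", beta_ml)
print("ridge:      ", beta_ridge)
```

With collinear predictors the unpenalized estimates tend to be unstable, while the penalized coefficient vector always has a smaller norm; the Lasso and the Elastic Net follow the same idea with penalties that can set coefficients exactly to zero.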

2.1.1 Logistic Regression

Generalized Regression enables one to select from a variety of distributions based on the purpose of the research and the nature of the response variable in the analysis (e.g., categorical, continuous, etc.): exponential, gamma, Cauchy, Gaussian and negative binomial distributions, as well as the zero-inflated versions, namely zero-inflated negative binomial, zero-inflated binomial, zero-inflated Poisson, zero-inflated beta binomial and zero-inflated gamma. Although the negative binomial distribution is one of the most frequently implemented modelling approaches in the traffic safety field, it is not feasible to implement this approach in this research since the response variable in our data is binary rather than a count. In either case, Generalized Regression or GLM, the nature of the data determines the distribution of the model and the link function.

Numerous studies have applied logistic regression to crash analysis. Al-Ghamdi [7] applied logistic regression to accident-related data to examine the contribution of several variables to accident severity. He treated accident severity as a dichotomous variable with two outcomes, fatal and non-fatal. He tested nine independent variables, and found that two were most significantly associated with accident severity: the location and the cause of the crash. Similar research was carried out by Dissanayake and Lu [8], with a focus on aging drivers, to identify the factors influencing injury severity in fixed object-passenger car crashes in Florida. They developed two sets of models, for driver injury severity and crash severity, where severity ranges from no injury to fatality. The fitted models were then used to identify the influence of factors such as roadway, environmental, vehicle, and driver-related attributes on severity. They concluded that travel speed, restraint device usage, point of impact, use of alcohol and drugs, personal condition, gender, whether the driver is at fault, urban/rural nature and grade/curve existence at the crash location are the important factors that make a difference in severity for aging drivers involved in single-vehicle crashes.

Furthermore, Tay et al. [9] employed logistic regression to study the factors affecting hit-and-run crashes (leaving the scene of a crash without reporting it). They considered driver characteristics, vehicle types, crash characteristics, roadway features and environmental characteristics to distinguish the potential factors that contribute to hit-and-run crashes from those of non-hit-and-run crashes. They suggested that drivers were more likely to flee when crashes occurred at night, on a bridge or flyover, on a bend or straight road, or near shop houses, and when crashes involved two-wheel vehicles or vehicles from neighboring countries. Male drivers, minority drivers, and drivers aged between 45 and 69 were also found to be more likely to commit hit-and-run crashes. On the other hand, collisions involving right-turn and U-turn maneuvers, and those occurring on undivided roads, were mostly non-hit-and-run crashes.

Sze et al. [10] focused on pedestrian-involved crashes. They evaluated the injury severity of pedestrian casualties and identified the factors contributing to mortality and severe injury in Hong Kong. They considered demographic, crash, environmental, geometric, and traffic characteristics of crashes and pedestrians, and used binary logistic regression to estimate the probability of fatality and severe injury. They revealed a downward trend in pedestrian injury risk, controlling for the influences of mostly demographic and road environment factors. Moreover, they found that the effects of pedestrian behavior, traffic congestion, and junction type on pedestrian injury risk are subject to temporal variation.

Studies in this field are not limited to binary logistic regression. Some researchers used different variations of logistic regression depending on the purpose of the research and the format of the data. For instance, Shankar et al. [11] employed a nested logit formulation to determine crash severity in Washington State. The estimation results demonstrated the influence of environmental conditions, highway design, accident type, driver characteristics and vehicle attributes on crash severity, and they also concluded that nested logit is a suitable approach for this analysis. Shankar and Mannering [12] used multivariate analyses to eliminate the potential ambiguity and bias caused by univariate analyses when identifying the causes of motorcycle rider crash severity in single-vehicle collisions. Their findings suggested that the multinomial logit formulation is a promising approach for assessing the determinants of motorcycle accident severity.


Mixed logit is another variation, employed by Milton et al. [13] in this context. Their approach allowed the estimated model parameters to vary randomly across roadway segments to account for unobserved effects potentially related to factors such as roadway characteristics, environmental factors, and driver behavior. Their findings suggested that traffic volume-related variables, namely average daily traffic per lane, average daily truck traffic, truck percentage and interchanges per mile, as well as weather effects, are best modeled as random parameters. However, other parameters, such as roadway characteristics (the number of horizontal curves, the number of grade breaks per mile and pavement friction), were best modeled as fixed parameters. They finally suggested that the mixed logit model has considerable promise in highway safety.
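For readers less familiar with the binary logit model that underlies the studies reviewed in this subsection, the following minimal sketch fits a logistic regression by maximum likelihood using Newton-Raphson iterations. The data are simulated; the predictor `flow`, the coefficient values and the function name `fit_logit` are hypothetical and are not taken from the thesis data or from any cited study.

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson iterations."""
    X = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # predicted probabilities
        gradient = X.T @ (y - p)                   # score vector
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])  # observed information
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

# Simulated data (assumption): crash indicator driven by a scaled hourly flow.
rng = np.random.default_rng(1)
flow = rng.uniform(0.0, 4.0, size=2000)
true_beta = np.array([-3.0, 1.0])                  # hypothetical intercept and slope
p_crash = 1.0 / (1.0 + np.exp(-(true_beta[0] + true_beta[1] * flow)))
crash = (rng.uniform(size=flow.size) < p_crash).astype(float)

beta_hat = fit_logit(flow, crash)
print("estimated coefficients:", beta_hat)
print("odds ratio per unit of flow:", np.exp(beta_hat[1]))
```

The exponentiated slope is the odds ratio, which is how coefficients of such models are usually interpreted: the multiplicative change in the odds of the outcome for a one-unit increase in the predictor.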

2.1.2 Poisson and Negative Binomial

Different approaches have been used in the literature to identify, predict and analyze crash locations. For example, Poisson and Negative Binomial distributions have been used in accident frequency-focused approaches. One important assumption in a Poisson regression model is that the mean and the variance of the number of crashes at a given road segment are equal. This assumption is found to be restrictive in many applications, in which the variance is not necessarily equal to the mean. The Negative Binomial distribution employs a dispersion parameter that relaxes this assumption, so that in negative binomial regression the variance does not have to equal the mean. Poch and Mannering [14] analyzed seven-year accident histories from 63 intersections in Bellevue, Washington, which were targeted for operational improvements. Their objective was to estimate the frequency of accidents at intersection approaches using a negative binomial regression, and they uncovered important interactions between traffic-related elements, accident frequencies and geometry. Shankar, Mannering and Barfield [15] studied the frequency of crash occurrence on highways based on a multivariate analysis of roadway geometrics (namely horizontal and vertical alignments), weather, and other seasonal effects. They estimated the overall crash frequencies using a negative binomial model, along with models of the frequency of specific crash types. They also studied the interactions between weather and geometric variables, and uncovered several important determinants of accident frequency.
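The difference between the Poisson and Negative Binomial variance assumptions can be checked numerically. The sketch below assumes the common parameterization in which the negative binomial variance is the mean plus the dispersion parameter times the squared mean, and simulates it as a gamma-Poisson mixture; the mean, the dispersion value and the number of segments are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
mean_crashes = 3.0      # hypothetical mean crash count per segment
dispersion = 0.5        # hypothetical dispersion parameter (alpha)
n_segments = 100_000

# Poisson counts: the variance equals the mean.
poisson_counts = rng.poisson(mean_crashes, size=n_segments)

# Negative binomial counts as a gamma-Poisson mixture: each segment has its own
# rate, gamma distributed with mean `mean_crashes` and variance alpha * mean^2.
rates = rng.gamma(shape=1.0 / dispersion,
                  scale=dispersion * mean_crashes,
                  size=n_segments)
nb_counts = rng.poisson(rates)

poisson_var = poisson_counts.var()
nb_var = nb_counts.var()
expected_nb_var = mean_crashes + dispersion * mean_crashes**2

print("Poisson mean, variance:", poisson_counts.mean(), poisson_var)
print("NB mean, variance:     ", nb_counts.mean(), nb_var)
print("NB variance implied by mean + alpha * mean^2:", expected_nb_var)
```

Both samples share the same mean, but only the mixture shows a variance well above it, which is the overdispersion that motivates negative binomial regression for crash counts.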


Abdel-Aty and Radwan [16] used the Negative Binomial distribution to model the frequency of accident occurrence using a 3-year data set, accounting for 1606 accidents on a principal arterial in . In this research, Annual Average Daily Traffic (AADT), degree of horizontal curvature, lane, and median widths, urban/rural, and the section’s length were shown to be significant factors affecting the crash frequency. They also studied the demographic characteristics of the driver (age and gender), and concluded that heavy traffic volume, speeding, narrow lane width, larger number of lanes, urban roadway sections, narrow shoulder width and reduced median width increase the likelihood for crash involvement. In addition, they found out that male drivers experience more traffic crashes while speeding whereas female drivers have a greater tendency to be involved in crashes with heavy traffic volume, narrow lane width, reduced median width, and larger number of lanes. As for the age of the drivers, they indicated that young and older drivers experienced more crashes than middle aged drivers with heavy traffic volume, and reduced shoulder and median widths. Younger drivers were found to be more prone in crash involvement while turning and speeding. There is also a growing interest in the use of Geographical Information Systems (GIS)- based spatial analysis in order to understand spatial patterns in crash data. Most of the studies in the literature have selected one portion of a roadway network (e.g., road segment, intersection, corridor), and studied the spatial patterns of crashes with different geographic levels, from Traffic Analysis Zones (TAZ) to census tracts as well as at the scale of county, state or national level [17]. 
For example, Valverde and Jovanis [18] conducted a spatial analysis for the state of Pennsylvania by comparing Negative Binomial (NB) and Full Bayes (FB) methods based on the different factors such as weather, transportation infrastructure, traffic volume and socio-demographical characteristics. They concluded that the crash rate was higher in counties with higher poverty rate and traffic volume, and the crash rate increased for the following age groups: 10-14, 15-24, 65+. Several researches also focused on the motor vehicle crashes with the pedestrians and cyclists [19, 20]. The development of GIS also allowed the researchers to investigate spatial and temporal solutions to the problem both simultaneously and hierarchically. Although spatial-temporal (or spatio-temporal) analysis has been widely used in different sectors such as fire locations, bio surveillance, disease outbreak surveillance, such an application for roadway crash pattern analysis is relatively limited. An interesting study was conducted by Plug et al. [21], who studied spatial,

7 temporal and spatio-temporal patterns of single vehicle crashes (SVC) based on a 10-year data (1998-2008) in Western Australia. They generated several spider graphs to demonstrate the temporal patterns of crashes with a focus on the hours of a day and days of a week. For a spatial pattern analysis, they implemented the Kernel approach, and they used the Comap method in order to conduct a spatio-temporal study. Using these methods, they generated maps and graphs to investigate the occurrence of crashes. Another interesting research in this area was conducted by Li et al. [22]. They analyzed the spatio-temporal crash risk patterns in Texas by creating a series of posterior risk maps indicating a relative risk degree to each intersection and segment. They used a Bayesian approach in order to rank and identify segments according to the assigned risk values, having the ability to forecast risks with a high certainty even with insufficient data. Although this research successfully analyzed the crash pattern of different segments, it did not consider the severity of the crashes. Several researchers have argued on the need to include the environmental and/or traffic- related factors in order to investigate the severity of crash occurrences [15, 23-26]. Golob and Recker [27] conducted linear and non-linear canonical correlation analyses to investigate the relationship between the traffic flow, weather and light conditions, and concluded that crash rates increased as the median speed increased. Ahmed, Abdel-Aty and Yu [28] considered the roadway geometry, real-time weather and traffic data in order to predict the crash occurrences. They concluded that the roadway geometry, real time weather and automatic vehicle identification (AVI) system data were influential on the occurrence of crashes, especially during winter and snowy weather conditions. They also stated that the traffic density fluctuation and the road geometry impacts on the crashes were more obvious. 
Similarly, Abdel-Aty and Pammanaboina developed a crash likelihood prediction model using a real-time traffic flow variable in addition to the rainfall data [29]. For more information on the methods of crash analyses, please refer to [30]. Among the literature that studied the effect of driver age on the crashes, Alam and Spainhour [31] showed that older drivers were more prone to intersection-related crashes than those that happen at the roadway segments. They also showed that older people were more at fault compared to younger drivers in fatal crashes. The driver age was also identified as a critical contributing factor for fatal traffic crashes on state highways in Florida [32]. Staplin et al. [33] prepared taxonomy tables that identified the risk factors in crashes involving aging drivers, such

as risky behavior, driving habits and exposure patterns. For further information on related highway traffic safety studies, please refer to [34, 35].

2.1.3 Statistical Learning Methods

Support Vector Machine (SVM) is probably the most widely used statistical learning technique and refers to supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis [36]. Given a set of training instances, each marked with the category it belongs to, an SVM training algorithm creates a model that assigns new data instances to one of the available categories (deterministic binary linear classifier). In detail, a support vector machine creates a set of hyperplanes in a high-dimensional space. The hyperplane with the largest distance to the nearest training data point has the highest classification power and lowest generalization error. Recently, the application of SVM in crash data-mining and analysis has gained extra attention. Researchers have targeted different deficiencies in classical methods, and tried to overcome them using SVM. For example, Li et al. [37] proposed Negative Binomial (NB) regression and SVM models, and compared these models using the data collected in Texas. The study showed that SVM models predicted crash data more effectively and accurately, without over-fitting the data, than the traditional NB models or a Back-Propagation Neural Network (BPNN). They suggested the use of SVM if the sole purpose of the study was to predict motor vehicle crashes, since it was easier to implement than the BPNN method. Yu and Abdel-Aty [38] developed an SVM model to eliminate the linear functional form and over-fitting drawbacks of logistic regression and neural network models, respectively. In this research, they proposed a classification and regression tree (CART) model to select the most important explanatory variables, then estimated three candidate Bayesian logistic regression models to capture the unobserved heterogeneity. Finally, they developed SVM models with different kernel methods, and compared them with the Bayesian logistic regression model.
They found that the Receiver Operating Characteristic (ROC) curves showed the SVM model with the radial-basis kernel function performing better than the others. Similar work was carried out by Chen et al. [39] with a focus on driver injury severities in rollover crashes in New Mexico. They employed a classification and regression tree (CART) model to identify the

significant factors and used SVM models with polynomial and Gaussian radial basis function (RBF) kernels to investigate driver injury severity patterns in rollover crashes. Results showed that seatbelt use, number of lanes, comfortable driving environment conditions, alcohol or drug involvement, driver demographic features, maximum vehicle damage in crashes, and crash time and location were the significant factors associated with serious and fatal injuries. Li et al. [40] developed an SVM with a focus on crash injury severity analyses, and compared it with the Ordered Probit (OP) model. The results showed that the SVM model provided better prediction outputs for small-proportion injury severities, even though it might suffer from the multi-class classification problem (48.8% correct prediction for the SVM model compared to 44.0% for the OP model). They also conducted a sensitivity analysis to investigate the potential of using the SVM model for evaluating the impacts of external factors on crash injury, and the results demonstrated that the SVM model produced results comparable to those of the OP model. Wang et al. [41] created a multi-layer perceptron artificial neural network model as a benchmark in order to evaluate the performance of the proposed SVM model, and to investigate the relationship between driver injury severity level and driver, roadway, environmental and vehicle-related factors. Based on several case studies in Wisconsin, their results from SVM suggested an overall classification accuracy of 63.4% and 58.6% for the training and testing group datasets, respectively. For Traffic Analysis Zone (TAZ) level crash prediction, SVM has also shown satisfactory capability when spatial correlations are considered [42].
The authors applied a Correlation-based Feature Selector (CFS) to evaluate candidate factors related to zonal crash frequency in order to handle high-dimensional spatial data, and showed that SVM models involving the spatial proximity outperformed the non-spatial models in terms of both fitting and predictive performance.
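For reference, the maximum-margin idea described at the start of this subsection is commonly written as the following optimization problem. This is the standard textbook hard-margin formulation, not a formula taken from the cited studies; the symbols are introduced here for illustration: training pairs $(\mathbf{x}_i, y_i)$ with labels $y_i \in \{-1, +1\}$, weight vector $\mathbf{w}$ and offset $b$.

```latex
\min_{\mathbf{w},\, b} \ \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1,
\qquad i = 1, \dots, n
```

Under this formulation the distance between the two margin boundaries is $2 / \lVert \mathbf{w} \rVert$, so minimizing $\lVert \mathbf{w} \rVert$ corresponds to choosing the separating hyperplane that lies farthest from the nearest training points.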

2.2 Traffic Crash Analysis: Computational Intelligence

Most of the existing literature focused on predicting crash frequency and severity using statistical tools and generalized linear models, such as Negative Binomial (NB), Poisson regression [43] or Empirical Bayes methods [44], since the occurrence of accidents on a highway section can be assumed to be a random event. However, soft computing techniques are capable of capturing highly nonlinear relationships between the independent and the dependent variables. Therefore, they are able to discover the hidden and complex correlations and impacts of input factors in complex datasets.

In computer science, soft computing, as a branch of Computational Intelligence (CI), mainly refers to the use of sub-optimal methods for computationally hard tasks, namely NP-complete problems where there is no known algorithm to compute an exact solution in polynomial time [45]. The role models for soft computing are the human mind and nature [46]. Soft computing techniques can be modified and applied to a wide variety of complex problems, such as OR modeling [47, 48]. In fact, soft computing techniques are capable of resembling biological processes more closely and effectively than traditional and classical techniques. The principal constituents of soft computing are Fuzzy Logic (FL) [49], Machine Learning (ML), Neural Computing (NC) and Evolutionary Computation (EC); evolutionary algorithms, swarm intelligence and metaheuristic algorithms are categorized under this class. For an in-depth understanding of computationally intelligent methods, readers may refer to [50].

2.2.1 Traffic Crash Analysis: Neural Networks

Most researchers in the field of traffic safety and crash analysis have employed statistical methods such as regression analysis; however, the use of neural networks has recently gained extra attention. Despite the benefits and the prevalence of use, the application of computational intelligence in traffic crash prediction and safety studies has mostly been limited to the use of Artificial Neural Network (ANN) packages and decision tree-based approaches. For example, Mussone, Rinelli and Reitani [51] used artificial neural networks in order to analyze vehicular accidents in Milan, and quantified the degree of danger on urban intersections using different scenarios by the proposed ANN model. Several researchers have also attempted to modify the structure of a neural network in order to obtain results more accurately and efficiently. Most researchers studied the application of neural networks on the severity of the crash injury, and little attention has been given to the crash frequency. Chang [52] compared the efficiency and the performance of ANN with a negative binomial regression model using the 1997–1998 freeway accident data in Taiwan, and concluded that ANN was a more consistent alternative to analyzing freeway accident frequency.


Zeng and Huang [53] proposed a convex combination (CC) algorithm to train a neural network for crash severity prediction and a modified NN Pruning for Function Approximation (N2PFA) algorithm to optimize the overall network structure. Using a case study in Florida, they compared the proposed algorithm with an NN trained by the traditional back-propagation (BP) algorithm and with an ordered logit model. They found that the CC algorithm outperformed the BP algorithm in training speed, classification power and convergence ability with a less complex network. Comparing the results of the NN to the ordered logit model demonstrated the NN’s superiority over statistical models in predicting crash injury severity. Abdel-Aty and Pande [54] classified traffic speed patterns emerging from loop detector data. In order to classify these data into either crashes or non-crashes, they proposed a Bayesian classifier based methodology, a probabilistic neural network (PNN), and a neural network implementation of the Bayesian-Parzen classifier. Their results showed that the PNN trained much faster than multilayer feed-forward networks, and was able to classify 70% of the crashes. Abdelwahab and Abdel-Aty [24, 55] also examined the relationship between driver injury severity and driver, vehicle, roadway, and environment characteristics using multilayer perceptron (MLP) and fuzzy adaptive resonance theory (ART) neural networks. They applied their methodology to a case study of two-vehicle accidents that occurred at signalized intersections in Central Florida. They found that the MLP neural network performed well, with over 70% and 60% correct classification for the training and testing phases, respectively. They also compared the performance of the NN with that of an ordered logit model. The ordered logit model’s performance was only 58.9% and 57.1% for the training and testing phases, respectively. Several important conclusions were also drawn from a simulation case study conducted using the neural network.
For instance, rural intersections were found to be more dangerous in terms of driver injury severity, female drivers were more likely to experience a severe injury, speed ratio increased the likelihood of injury severity, drivers at fault were less likely to experience severe injury, and wearing a seat belt decreased the chance of sustaining severe injuries. In these studies, the classification power of neural networks is relatively higher than that of statistical models. However, compared to the number of studies on statistical models, neural network models have not been studied extensively, possibly due to the complexity of estimating this type of model and the problem of “over-fitting” the data. In fact, when the complexity and heterogeneity of a dataset increase, a neural network may fit the training data to an optimal level; however, it lacks the capability of predicting new data, such as the test dataset. In other words, the algorithm memorizes the data instead of learning it. To circumvent the latter problem, several researchers have proposed the use of Bayesian neural network (BNN) models [56]. They claim that these models perform better than back-propagation NN models while reducing the difficulty associated with over-fitting. For further information on the application of neural networks in crash severity detection and how they differ from statistical methods, please refer to [57-60].

2.2.2 Traffic Crash Analysis: Other Approaches

A few recent studies have focused on hybrid or soft computing techniques to overcome the conventional difficulties related to classical methods, such as overfitting in neural networks. For instance, Sohn and Lee [61] considered clustering techniques, analyzed the relationship between the driving environmental factors and the severity of road traffic accidents, and performed several analyses to improve the accuracy of individual classifiers for two severity categories with an application in South Korea. They considered three methodologies for the classifier fusion: (a) Dempster–Shafer algorithm, the Bayesian procedure and logistic model, (b) data ensemble fusion based on arcing and bagging, and (c) clustering based on the k-means algorithm. They found that the clustering-based classification algorithm worked best for classifying the road accidents in their area of interest. A classification and regression tree (CART) method, one of the most widely applied data mining techniques, was developed to discover the relationship between injury severity and influencing factors such as driver and vehicle characteristics, and highway, environmental and accident variables using the 2001 accident data for Taipei, Taiwan [62]. Unlike statistical tools, CART does not require any pre-defined underlying relationship between the target (dependent) variable and predictors (independent variables). The results of this research indicated that the most important variable associated with crash severity was the vehicle type. In addition, pedestrians, motorcycle riders and bicycle riders were found to have higher risks of getting injured. Similarly, in order to overcome the problems associated with BP in ANN, Chang and Chen [63] proposed a CART model and a negative binomial regression model to establish the empirical relationship between traffic accidents and highway geometric variables, traffic characteristics, and environmental factors with a case study in Taiwan.


Gang and Zhuping [64], on the other hand, proposed a PSO-SVM hybrid algorithm in order to address the drawbacks associated with BP neural networks. They analyzed several significant factors in terms of traffic safety, established a traffic safety forecasting model by PSO-SVM based on these significant factors, and evaluated the forecasting ability of the proposed method. They suggested that the proposed model outperformed BPNN in terms of efficiency and accuracy. Another study, performed by Xu, Wang and Liu [65], employed Genetic Programming (GP) for real-time crash prediction on a freeway in California considering traffic, weather, and crash data. They used the random forest (RF) technique to select the significant variables under both uncongested and congested traffic conditions, and ROC curves to evaluate the prediction performance of the developed GP model for each traffic state. The validation results showed that the prediction performance of the GP models was satisfactory, and improved the classification accuracy by 8.2% under congested and 4.9% under uncongested conditions. A PSO-ANN hybrid was employed by Srinivasan, Loo and Cheu [66] for incident detection systems in order to solve the ANN-related problems of slow convergence, heuristic determination of parameters and the possibility of getting stuck at a local minimum.

In Chapter 3, we present the proposed research methodology including the data collection, modeling approach and comparison to existing methods.


CHAPTER 3

RESEARCH METHODOLOGY AND RESULTS

This research consists of two main steps. We first perform a comprehensive data analysis on several data sets obtained from a variety of traffic and weather-related sources. Next, we apply statistical and soft computing techniques to model the crash frequency and severity behavior on highway segments with different characteristics. Figure 1 shows an overview of the research methodology.

Figure 1 Research Methodology


3.1 Data Collection and Pre-Processing

The proposed approach accommodates the effects of different factors, such as environmental, traffic, and human-related factors, on the sensitivity and the probability of a crash at a given geographical location, both for aging populations and for other adult age groups. Unlike previous approaches found in the literature, which are mostly based on the AADT, the proposed methodology focuses on determining the effects of the hourly traffic flow on the frequency and severity of crashes. The methodology is applied to the Interstate 95 (I-95) highway corridor in Florida via a comparison of the two metropolitan areas along this corridor, namely the Miami and Jacksonville metropolitan areas. Note that the Miami and Jacksonville counties are among the high-priority counties identified by the Safe Mobility for Life Coalition of Florida based on their high aging-involved crash rates relative to the aging population of the county [67]. The upper block in Figure 1 considers the data collected across the whole I-95 corridor, which is used for an aggregate analysis. The lower block shows the details of an analysis that focuses on the individual sites in the Miami and Jacksonville areas. We build a separate model for each site. Statewide crash data sets obtained from the Florida Department of Transportation (FDOT) and hourly traffic volume data obtained from the Telemetered Traffic Monitoring Sites (TTMS) of FDOT for the year 2011 are used to fit the models (2011 was selected because it showed higher variability in meteorological and natural phenomena than other years). Figure 2 depicts the locations of the TTMS stations, which provide hourly flow data on I-95. For each TTMS station, we identify the associated roadway segments and create a set of maps that classify I-95 into homogeneous segments according to the roadway characteristics obtained from the FDOT. After this segmentation, the crashes that happened on those segments are extracted (lower block in Figure 1). The precipitation and light condition data used in the models are obtained from the National Oceanic and Atmospheric Administration (NOAA) and the Florida Automated Weather Network (FAWN).

3.1.1 Statewide FDOT Crash Dataset

Statewide crash data sets (including other attributes and crash characteristics) have been obtained from the Traffic Safety Office of the Florida Department of Transportation (FDOT) for the year 2011. Required fields, such as spatial and temporal characteristics, crash severity and driver characteristics, were extracted, refined and classified. This was followed by a careful GIS-based examination in order to identify the most significant crash attributes for the further steps of this study.

Table 1 FDOT Crash Data Attributes

No. | Attribute | Source
1 | Temporal attributes | Department of Highway Safety and Motor Vehicles (DHSMV)
2 | Spatial attributes | FDOT Safety Office
3 | Injury and severity index (minor rear-end, property damage, severe injury, fatality) | FDOT Safety Office
4 | Environmental condition: climate (humidity, light, visibility conditions, rain and snow, fog, etc.); roadway (pavement type, sign, lanes, etc.) | FDOT Safety Office and DHSMV
5 | Vehicle characteristic (type, tunes, lifts, functionality, etc.) | Department of Highway Safety and Motor Vehicles (DHSMV)
6 | Person characteristic (age, DUI, driver or involved, pedestrian, etc.) | FDOT Safety Office and DHSMV
7 | Flags: age bins (-15, 50:5:80, +80) | FDOT Safety Office and DHSMV
8 | Bicycle and pedestrian characteristics (for vehicle-pedestrian or bicycle crashes) | FDOT Safety Office and DHSMV


This crash data includes three categories: (a) point crash data, (b) occupants involved in the crash, and (c) vehicles involved in the crash. In this thesis, two categories were linked in ArcGIS in order to enable us to access this extended dataset with combined attributes. Table 1 briefly summarizes the main attributes of the FDOT statewide crash data that we extracted for this research, including the attributes obtained from the Department of Highway Safety and Motor Vehicles (DHSMV). The details of the GIS-based examination of this crash data are discussed in the next section.
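The linking step above was carried out in ArcGIS. Purely as an illustration of the idea, the following Python sketch attaches occupant records to their crash record through a shared key. The field names (`crash_id`, `severity`, `age`) and the sample values are hypothetical and are not taken from the FDOT data.

```python
from collections import defaultdict


def link_by_crash(crashes, occupants, key="crash_id"):
    """Attach occupant records to their crash record via a shared key.

    The key name and record layout are assumptions made for this sketch.
    """
    by_crash = defaultdict(list)
    for person in occupants:
        by_crash[person[key]].append(person)
    # Copy each crash record and add the list of linked occupants.
    return [dict(crash, occupants=by_crash.get(crash[key], []))
            for crash in crashes]


# Made-up records for illustration only.
crashes = [
    {"crash_id": 1, "severity": "property damage"},
    {"crash_id": 2, "severity": "fatality"},
]
occupants = [
    {"crash_id": 1, "age": 72},
    {"crash_id": 1, "age": 34},
    {"crash_id": 2, "age": 68},
]
linked = link_by_crash(crashes, occupants)
```

Each linked record then carries both the crash-level attributes and the person-level attributes, which is the kind of combined view that the age-based analyses rely on.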

3.1.2 Meteorological Data

The climatological data has been obtained from the National Oceanic and Atmospheric Administration (NOAA). This extended data set includes a variety of attributes; nevertheless, we utilize only the following data in this thesis: the mean precipitation rate and the departure from the average precipitation rates. For this purpose, we extracted the 15-minute precipitation values and summed them to obtain hourly precipitation values for the weather stations in the vicinity of the roadways studied. Similar weather data was also received from the Florida Automated Weather Network (FAWN) for comparison purposes. We also obtained exact sunrise and sunset times from the United States Naval Observatory (USNO) in order to study the effect of light.
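The aggregation of 15-minute precipitation values into hourly totals can be sketched as follows. This is a minimal illustration rather than the processing code used in this thesis; the `(timestamp, amount)` record layout and the sample values are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def hourly_precipitation(readings):
    """Sum 15-minute precipitation readings into hourly totals.

    `readings` is an iterable of (timestamp, amount) pairs; this layout
    is an assumption made for the sketch.
    """
    totals = defaultdict(float)
    for timestamp, amount in readings:
        # Truncate each timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        totals[hour] += amount
    return dict(totals)


# Made-up sample values for one station (not NOAA data).
sample = [
    (datetime(2011, 4, 1, 14, 0), 0.25),
    (datetime(2011, 4, 1, 14, 15), 0.5),
    (datetime(2011, 4, 1, 14, 30), 0.0),
    (datetime(2011, 4, 1, 14, 45), 0.25),
    (datetime(2011, 4, 1, 15, 0), 0.5),
]
hourly = hourly_precipitation(sample)
```

Keying the totals by the start of each hour makes it straightforward to match them later with the hourly traffic flow records.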

3.1.3 Hourly Traffic Flow

As mentioned earlier, we incorporated flow data at different times of the day instead of the Annual Average Daily Traffic (AADT). For this purpose, the hourly traffic data has been obtained for 2011 from the Statistics Office of the FDOT, collected using Telemetered Traffic Monitoring Sites (TTMS) and Portable Traffic Monitoring Sites (PTMS). The values represent the flow in each hour of the day throughout the year. This data set was processed for the selected locations in the Miami and Jacksonville metropolitan areas for further use in the model calculations. The locations selected for the analysis are shown in Figure 2.


Figure 2 I-95 Corridor and TTMS Stations in Florida

Red points in Figure 3 show the locations for which hourly flow data is available in the FTI.

Figure 3 A Sample of FTI User Interface


The data retrieval process in ArcGIS consisted of selecting uniform roadway segments on I-95 based on the topographic maps and raster layers as well as the available roadway network and TTMS locations. These segments are approximately 3.5 miles (for five lanes) and 5 miles (for three lanes) long, with minimum grades in both the northbound and southbound directions. These segments are shown in Figure 2, where each node represents a TTMS station. We processed the available flow data for both the north and south directions at the sites. This data set included the number of vehicles passing a TTMS station each hour. Next, we classified the flow data using MATLAB in order to obtain a nested 12-month data structure, which showed the flow per month in each direction. Following this nested structure, we created another data set showing the amount of precipitation as well as the surface condition (dry/wet) corresponding to each traffic flow value. Finally, light condition (day/night) data obtained from the USNO was added to the nested structure and encoded as a binary code: 1 for daylight hours and 0 for night time. This data structure was again compiled, where each cell represented one hour for each traffic direction throughout 2011.

After the data retrieval step to create the training data, we fit the binary logistic regression and the proposed clustering models for each site. This approach allowed us to investigate the effects of weather condition, light condition, and traffic flow on the probability of crash occurrences, and to use the models to predict crash frequency and severity in the future. This methodology was applied both to crashes involving the aging population and to crashes involving all age groups. The validity and prediction capability of the models are studied by predicting crash frequency and severity for unobserved data in a cross-validation study.
The next section describes the details of the proposed methodology. There are a few limitations in the available data, as well as in the access to other datasets, that the authors had to deal with. First, the reliability of the FDOT statewide crash data set is open to question, since the data is mainly based on reports prepared by police officers. However, several researchers studied the accuracy of police reports and concluded that their accuracy, validity and reliability are quite acceptable [68, 69]. Furthermore, we compared the light, surface and weather conditions to the data that we collected from other sources, such as NOAA, FAWN and USNO, and found the similarity between the FDOT data and the other datasets to be acceptable. Another limitation of the dataset was the lack of hourly flow data specifically for the aging population. The lack of access to the driving schedules and preferences of the aging population in Florida is another limitation of the collected datasets. In addition, the available crash data only considers the involved vehicles and persons; it does not give any information regarding exposure. Figure 4 shows a schematic view of both the crash frequency and severity data sets.

Figure 4 Output Database Processed for the Research
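The hourly record structure described in Section 3.1.3 can be sketched in Python as follows. The thesis built this structure in MATLAB; the code below is only an illustrative stand-in. The light code follows the text (1 for daylight, 0 for night). Treating an hour as daylight when its start falls between sunrise and sunset, and deriving the dry/wet surface flag from the precipitation amount, are assumptions made for this sketch, as are all field names and sample values.

```python
from datetime import datetime


def light_code(hour_start, sunrise, sunset):
    """Return 1 for a daylight hour and 0 for a night hour."""
    return 1 if sunrise <= hour_start < sunset else 0


def surface_code(precipitation):
    """Assumed rule: the surface is 'wet' when any precipitation is recorded."""
    return "wet" if precipitation > 0 else "dry"


def build_hourly_record(hour_start, direction, flow, precipitation,
                        sunrise, sunset):
    """Combine flow, weather and light data for one hour and one direction."""
    return {
        "hour": hour_start,
        "direction": direction,
        "flow": flow,
        "precipitation": precipitation,
        "surface": surface_code(precipitation),
        "light": light_code(hour_start, sunrise, sunset),
    }


# Made-up values for illustration only.
sunrise = datetime(2011, 4, 1, 7, 10)
sunset = datetime(2011, 4, 1, 19, 35)
noon = build_hourly_record(datetime(2011, 4, 1, 12), "N", 5200, 0.0,
                           sunrise, sunset)
late = build_hourly_record(datetime(2011, 4, 1, 23), "S", 1400, 0.3,
                           sunrise, sunset)
```

One such record per hour and direction gives the cell-per-hour layout described above, to which the crash indicator for that hour can then be attached as the response variable.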


3.2 Exploratory Descriptive Analysis

In this section, we present an exploratory analysis of the traffic flow and crash data collected from the six locations. We select three stations from the Miami area (Miami 1, Miami 2 and Miami 3 in Figure 2), and three stations from the Jacksonville region (JAX 1, JAX 2 and JAX 3 in Figure 2). Table 2 summarizes the features of data from the individual stations based on different age groups, time of day, precipitation and flow levels. We observe that traffic flows are higher for the Miami locations than those in Jacksonville. In addition, during the day, the average flow during the crash hours is higher than the average flow for all other times in Miami. In Jacksonville, on the other hand, we do not see such a clear difference. This difference in the flow is even more distinct for Miami at night. Therefore, we can argue that crashes in Miami mostly occur when the flow is higher for both age groups for both day and night conditions. For Jacksonville, the behavior is drastically different at night. Locations closer to the City of Jacksonville (JAX 2 and JAX 3 in Figure 2) behave differently. Here, many crashes occur after midnight with low traffic volume.


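Table 2 Summary of Crashes According to Age Groups, Location, Time of Day and Weather Conditions

Effect | Condition | Measure | Miami 1 All | Miami 1 Aging | Miami 2 All | Miami 2 Aging | Miami 3 All | Miami 3 Aging | JAX 1 All | JAX 1 Aging | JAX 2 All | JAX 2 Aging | JAX 3 All | JAX 3 Aging
Temporal | Day | Crashes | 398 | 24 | 282 | 33 | 277 | 30 | 105 | 15 | 131 | 8 | 73 | 10
Temporal | Day | Total Hours | 8520 | 8520 | 8520 | 8520 | 8520 | 8520 | 8520 | 8520 | 8520 | 8520 | 8520 | 8520
Temporal | Day | Average Flow | 6892 | 6892 | 5800 | 5800 | 5095 | 5095 | 2728 | 2728 | 3810 | 3810 | 1858 | 1858
Temporal | Day | Mean Flow During Crash | 7117 | 6845 | 6150 | 6363 | 5386 | 5930 | 2779 | 2656 | 4277 | 3972 | 1885 | 2118
Temporal | Night | Crashes | 219 | 41 | 165 | 10 | 161 | 13 | 61 | 1 | 47 | 4 | 24 | 3
Temporal | Night | Total Hours | 9000 | 9000 | 9000 | 9000 | 9000 | 9000 | 9000 | 9000 | 9000 | 9000 | 9000 | 9000
Temporal | Night | Average Flow | 3136 | 3136 | 2353 | 2353 | 1848 | 1848 | 936 | 936 | 1340 | 1340 | 649 | 649
Temporal | Night | Mean Flow During Crash | 4230 | 6153 | 2977 | 3381 | 2508 | 3123 | 1338 | 812 | 2130 | 2609 | 716 | 347
Precipitation | Clear | Crashes | 615 | 65 | 440 | 43 | 433 | 43 | 161 | 16 | 175 | 12 | 90 | 13
Precipitation | Clear | Total Hours | 17388 | 17388 | 17388 | 17388 | 17345 | 17345 | 17411 | 17411 | 17402 | 17402 | 17402 | 17402
Precipitation | Clear | Average Flow | 4980 | 4980 | 4026 | 4026 | 3423 | 3730 | 1805 | 1805 | 2539 | 2539 | 1236 | 1236
Precipitation | Clear | Mean Flow During Crash | 4956 | 6499 | 4972 | 5670 | 4333 | 5193 | 2285 | 2533 | 3721 | 3529 | 1600 | 1709
Precipitation | Rainy | Crashes | 2 | 0 | 7 | 0 | 5 | 0 | 5 | 0 | 3 | 0 | 7 | 0
Precipitation | Rainy | Total Hours | 132 | 132 | 132 | 132 | 175 | 175 | 109 | 109 | 109 | 109 | 109 | 109
Precipitation | Rainy | Average Flow | 5126 | 5126 | 4399 | 4399 | 3423 | 3730 | 2145 | 2145 | 3036 | 3036 | 1490 | 1490
Precipitation | Rainy | Mean Flow During Crash | 5435 | --- | 5062 | --- | 4138 | --- | 1531 | --- | 3063 | --- | 1539 | ---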

Based on Table 2, the effect of precipitation appears to be minimal in both age groups since we do not observe a considerable number of crashes during the rainy weather. The average number of vehicles when crashes occur is higher than the average number of vehicles for all time periods (with and without crashes) for all weather conditions and age groups. Moreover, we observe more traffic on the roadways during rain.
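The comparison between the flow during crash hours and the average flow can be made concrete with a small calculation. The sketch below uses the day and night values for all drivers from Table 2; the ratio itself is an illustrative summary introduced here, not a measure defined in this thesis.

```python
# (average flow, mean flow during crash hours) for all drivers, from Table 2.
day_all = {
    "Miami 1": (6892, 7117),
    "Miami 2": (5800, 6150),
    "Miami 3": (5095, 5386),
    "JAX 1": (2728, 2779),
    "JAX 2": (3810, 4277),
    "JAX 3": (1858, 1885),
}
night_all = {
    "Miami 1": (3136, 4230),
    "Miami 2": (2353, 2977),
    "Miami 3": (1848, 2508),
    "JAX 1": (936, 1338),
    "JAX 2": (1340, 2130),
    "JAX 3": (649, 716),
}


def crash_flow_ratios(table):
    """Ratio of mean flow during crash hours to average flow, per station."""
    return {site: crash / average for site, (average, crash) in table.items()}


day_ratios = crash_flow_ratios(day_all)
night_ratios = crash_flow_ratios(night_all)

for site in day_all:
    print(f"{site}: day {day_ratios[site]:.2f}, night {night_ratios[site]:.2f}")
```

A ratio above 1 means that crash hours carried more traffic than a typical hour in the same period. For all drivers, every station has a ratio above 1 in both periods, and the night ratios are larger than the day ratios.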

Figure 5 Time Series Plots for Traffic Flow: North and South Directions and April and Summer Months Hourly Averages. Panels: (a)-(c) Miami 1 to Miami 3, (d)-(f) Jax 1 to Jax 3. Each panel plots Flow against Hour of Day for the series Apr N, Apr S, Sep N and Sep S.

Figure 5 shows sample time series plots of traffic volumes. Average hourly flows are shown for the two months with the highest (April) and lowest (September) traffic. For the Miami area (Figure 5 a-c), we see that the traffic flow increases suddenly in the morning when people drive to work. After a subtle decrease, the flows increase with a smaller slope until the afternoon peak, and they drop back at night. In Miami 1 and 2, close to the downtown, no significant AM and evening PM peaks are observed. However, in Miami 3, away from downtown Miami, the AM and PM peaks are more apparent (Figure 5-c). The flow pattern is similar for both the northbound and southbound directions, which is probably due to the usage of this roadway by residents as well as tourists, and which indicates the effect of the seasonal traffic. For the Jacksonville area (Figure 5 d-f), we observe higher traffic in the morning towards the downtown (northbound), whereas the higher traffic occurs away from the downtown (southbound) in the evening. The AM and PM peaks are more apparent at JAX 1 and JAX 2, indicating the presence of home-based work trips (Figure 5-d and 5-e). The only abnormal behavior is observed at JAX 3 (Figure 5-f), which can be due to the proximity of the Jacksonville International Airport. Figure 6 shows the hourly crash counts for the selected locations in 2011 for both the aging and overall population groups. All locations in the Miami region (Figure 6 a-c) appear to have two distinct peaks corresponding to the AM and PM rush hours for both age groups. Similarly, in Jacksonville, the downtown area (JAX 2) appears to have a two-peak pattern (Figure 6-e). However, for the other two Jacksonville locations (Figure 6 d and f), the crash counts are relatively constant throughout the day.

Figure 6 Time Series Plot of Hourly Crash Counts for All and Aging Drivers. Panels: (a)-(c) Miami locations, (d)-(f) Jacksonville locations. Each panel plots No of Crashes against Hour of Day, with one series for all drivers and one for aging drivers at the station.

Figure 7 shows histograms of the crash flow (flow during crash hours) for both the overall and aging populations. Flow distributions in the Miami region are relatively left skewed (Figure 7), implying that the chance of having a crash is higher with an increase in the traffic volume. In Jacksonville 1 and 3, in contrast, most crashes occur during average flow times. For Jacksonville 2, which is close to downtown, the behavior is relatively similar to that in Miami.

Figure 7 Histograms of Crash Flows for All and Aging Drivers in Miami. Panels (a)-(f). Each panel plots Frequency of Observations against Flow Values for the series Flow During Crash and Flow During Crash (Aging).

Figure 8 shows dot-plots of flow data for one location in Miami and one in Jacksonville during daytime and nighttime. Considering Table 2, in Jacksonville, unlike Miami, the PM rush period after sunset is very short, and therefore we have longer hours with lower traffic volume at night (Figure 8). This indicates that crashes at night occur when higher traffic volumes are observed. For aging drivers, this behavior only occurs in Miami. In Jacksonville, the JAX 1 and JAX 3 locations have more aging driver-involved crashes at lower traffic volumes (Figure 7-b).


(a) (b) Figure 8 Flow Dot-plots of Flow Data for Different Light Conditions (a) Miami, (b) Jacksonville [70, 71].

3.3 Correlation Analysis for the Entire Corridor

In this section, we aim to capture how different traffic and roadway-related factors influence the frequency of crashes on the I-95 corridor in Florida, from north to south. These steps were shown in the upper block of Figure 1. As shown in Figure 9, there is a significant, strong positive correlation (r = 0.86) between the AADT and surface width (with P-Value ≈ 0). This indicates that if the analysis includes one of these variables, adding the other one does not result in significant additional predictive power. Therefore, in our analysis we consider traffic (flow) information as a predictor and not the surface width. The relationship between the number of crashes and AADT for both age groups is shown in Figure 9-b. This plot shows that the frequency of crashes is high in those segments of the roadway with high AADT values. In fact, the correlation between AADT and crash frequency for aging drivers is highly positive (r = 0.921, P-Value ≈ 0), indicating that wider roadways with a high volume of traffic may be overwhelming for aging drivers. For the entire corridor, we use the AADT as a measure of the average traffic volume to observe the relationship between traffic volume and crash frequency. Our previous studies have shown that the hourly traffic flow (instead of AADT) explains a larger proportion of the variability in crash frequency and severity. Therefore, in the next section, when we focus on the individual segments on I-95 in the Miami and Jacksonville metropolitan areas, we use the hourly traffic volumes to provide a more accurate representation of the traffic.
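The correlation screening described above can be illustrated with a short sketch. It computes a Pearson correlation coefficient and the associated t statistic (with n - 2 degrees of freedom) for two segment-level variables. The data values, variable names and function names are hypothetical and are not taken from the thesis data; only the Python standard library is used.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: rho = 0 (n - 2 degrees of freedom); requires |r| < 1."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical segment data (NOT from the thesis):
# AADT in vehicles/day and surface width in feet.
aadt = [60000, 85000, 120000, 150000, 210000, 260000]
width = [36, 36, 48, 60, 72, 72]

r = pearson_r(aadt, width)
t = t_statistic(r, len(aadt))
print(f"r = {r:.3f}, t = {t:.2f} on {len(aadt) - 2} degrees of freedom")
```

When two candidate predictors are this strongly correlated, keeping only one of them, as done here with traffic flow, avoids near-redundant terms in the regression.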


(a) (b) Figure 9 Traffic Characteristics for the I-95 Corridor in Florida, (a) Correlation Matrix for the AADT vs. Roadway Width for All Segments, (b) AADT, and Crashes for All Age Groups and Aging-Involved Crash Frequencies Versus Distance (miles)

3.4 Logistic Regression Analysis

We extensively studied the nature of the available data sets, and logistic regression with a binomial distribution (Logit) appears to be the best fit for our analysis since the response variable is binary. However, the following link functions can also be employed in addition to the logit: complementary log-log (Clog-log) and Probit. Therefore, in this research, we consider all three link functions, namely Logit, Probit and Clog-log, and select one of them for the rest of our analysis based on the best fit. In addition, there is no need to utilize zero-inflated models, and the conventional GLM performs sufficiently well, since we do not have a substantial number of zeros in the data set. Tobit regression is another common approach. This model, introduced by James Tobin [72], describes the relationship between a non-negative dependent variable and an independent variable. The Tobit model, also referred to as the censored regression model, is formulated to estimate linear relationships between variables when there is censoring (either left- or right-censoring) in the output variable. Censoring from below (left) takes place when all cases with a value at or below some threshold take the value of that threshold, so that the true value might be equal to or lower than the threshold. Because our response is binary rather than censored, the nature of our data sets does not allow us to use such a model. Therefore, we employ the Logit, Clog-log and Probit models in this thesis.


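Binomial link functions for these three models are defined as:

Logit: $\eta = \ln\left(\frac{\pi}{1-\pi}\right)$ (1)

Probit: $\eta = \Phi^{-1}(\pi)$ (2)

Cloglog: $\eta = \ln\left(-\ln(1-\pi)\right)$ (3)

These three models returned extremely similar factors as significant attributes in crash severity and frequency. To test the performance of each model, we used the deviance test defined as:

$D = 2\sum_{i=1}^{n}\left[y_i \ln\left(\frac{y_i}{\hat{y}_i}\right) + (n_i - y_i)\ln\left(\frac{n_i - y_i}{n_i - \hat{y}_i}\right)\right]; \quad \hat{y}_i = n_i\hat{\pi}_i$ (4)

and

Logit: $\hat{\pi}_i = \frac{\exp(x_i'\hat{\beta})}{1+\exp(x_i'\hat{\beta})}$ (5)

Probit: $\hat{\pi}_i = \Phi(x_i'\hat{\beta})$ (6)

Cloglog: $\hat{\pi}_i = 1-\exp\left(-\exp(x_i'\hat{\beta})\right)$ (7)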

Table 3 summarizes the deviance test results for the Logit, Probit and Clog-log models, for both age groups and for all locations. The null hypothesis of this test is that there is no meaningful difference between the observed values and the values predicted by the model. Hence, a model is considered adequate only if the p-value exceeds 0.05 (5% significance level). For all frequency models, we can see that the models are adequately fitted. The difference between the modelling capabilities of the three link functions is small; however, the logit seems to be slightly better than the other two. For the severity models, we observe lack of fit and inadequacy. This issue will be addressed in further sections. Table 3 also presents goodness of fit results for the severity models, the details of which will be discussed in Section 3.4.3. Although the differences between the three models are not statistically significant, we choose one of them, namely the Logit model, for further analysis.
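As an illustration of the three link functions and the deviance statistic defined above, the following sketch evaluates the inverse links and the binomial deviance for grouped data. The function names and the sample counts are hypothetical; the deviance formula follows the standard binomial form, with terms for y = 0 or y = n taken as zero by convention.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution, used for the probit link

def inv_logit(eta):
    """Inverse logit link: pi = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    """Inverse probit link: pi = Phi(eta)."""
    return _STD_NORMAL.cdf(eta)

def inv_cloglog(eta):
    """Inverse complementary log-log link: pi = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))

def binomial_deviance(y, n, pi_hat):
    """D = 2 * sum[y ln(y/yhat) + (n - y) ln((n - y)/(n - yhat))], yhat = n * pi_hat."""
    total = 0.0
    for yi, ni, pi in zip(y, n, pi_hat):
        y_fit = ni * pi
        if yi > 0:       # the y * ln(y / yhat) term vanishes when y = 0
            total += yi * math.log(yi / y_fit)
        if yi < ni:      # the (n - y) term vanishes when y = n
            total += (ni - yi) * math.log((ni - yi) / (ni - y_fit))
    return 2.0 * total

# Hypothetical grouped data: crash hours y out of n observed hours per group,
# and hypothetical linear predictors eta (NOT fitted values from the thesis).
y = [2, 5, 9]
n = [50, 50, 50]
eta = [-3.0, -2.2, -1.5]

for name, inverse_link in (("logit", inv_logit), ("probit", inv_probit), ("cloglog", inv_cloglog)):
    fitted = [inverse_link(e) for e in eta]
    print(name, round(binomial_deviance(y, n, fitted), 3))
```

In practice each link would have its own fitted coefficients; the same linear predictor is reused here only to keep the sketch short.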

Table 3 Deviance Goodness of Fit Test Results for Different Models (LoF: Lack of Fit)

                 Miami                              JAX
             Frequency       Severity        Frequency       Severity
Model        All    Aging    All    Aging    All    Aging    All     Aging

Miami 1 / JAX 1
Probit       0.40   0.10     0.06   0.08     0.10   0.10     <0.01   <0.01
Logit        0.41   0.10     0.12   0.08     0.09   0.10     <0.01   <0.01
Clog-log     0.37   0.06     0.06   0.09     0.10   0.09     <0.01   <0.01

Miami 2 / JAX 2
Probit       0.38   0.10     0.09   0.07     0.10   0.10     <0.01   <0.01
Logit        0.38   0.09     0.09   0.07     0.10   0.10     <0.01   <0.01
Clog-log     0.37   0.10     0.09   0.04     0.10   0.10     <0.01   <0.01

Miami 3 / JAX 3
Probit       0.38   0.10     0.07   0.08     0.08   0.08     <0.01   <0.01
Logit        0.40   0.10     0.09   0.09     0.10   0.10     <0.01   <0.01
Clog-log     0.37   0.10     0.07   0.06     0.04   0.06     <0.01   <0.01

3.4.1 Logistic Regression Analysis of Crash Frequency for Roadway Segments

In this section, we consider the hourly flow data obtained from six TTMS stations in the Miami and Jacksonville metropolitan areas and build individual binary logistic regression models for every location. The response variable in our model has two outcomes: $y_i = 1$ represents a crash occurrence and $y_i = 0$ represents no crash occurrence in a given one-hour period $i = 1, \ldots, n$ (the TTMS data is recorded hourly). The probability of having a crash is represented using a Bernoulli distribution as follows [68]:

$P(y_i = 1) = \pi_i$ and $P(y_i = 0) = 1 - \pi_i$ (8)

where $\pi_i$ is the crash probability for the $i$-th record.

In a binary logistic regression model, the probability of a crash is modeled as a function of a set of regressors $x_i$ through the logit transformation $\log\left(\pi_i/(1-\pi_i)\right) = x_i'\beta$, where $\beta$ is the vector of regression coefficients that quantifies the effect of the regressor variables on the crash probability, estimated from observed data using Maximum Likelihood Estimation. In addition to the main effects of the variables, two-factor and three-factor interactions of these variables are also considered in the model. Once the parameters are estimated, the probability of a crash is predicted for a given set of regressor variables using the following logistic response function:

$\pi_i = \frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}$ (9)

In the literature, two other regression approaches, namely Negative Binomial (NB) and Poisson regressions, have been commonly used. A comparison between some of the most commonly used techniques, including these two approaches, was offered in [5, 73]. These papers suggest that the negative binomial model works properly for crash data since a crash is a random, sparse observation, and that models usually should be zero-inflated. However, Poisson and NB regression approaches can only be used when the response variable represents a count of some relatively rare events. In our study, the response is the hourly crash count, which takes values of 0 or 1 and is therefore a binary nominal variable. Therefore, we selected binary logistic regression for our analysis, which has proven to be a suitable choice in the field of traffic safety and crash analysis. Using this logistic regression approach, we examined the factors listed in the FDOT crash database to observe the significance of each factor on the response (number of crashes in the first analysis and crash severity in the second). We also compared the impact of each factor on the aging-involved crashes and the crashes involving the whole population. Interactions of these factors were also studied in order to see if there was a combined effect of these factors on the response variable. The following factors were used as the regressors:
1. Precipitation (with range 0 to 2.48 inches) obtained from FAWN
2. Surface Condition (wet/dry) obtained from NOAA
3. Light condition (day/night) obtained from NOAA
4. Real-time hourly flow (Vehicle/Hour) obtained from FDOT
5. Direction of a crash (North/South) obtained from FDOT


6. Indicator of peak hours: AM Peak (6-9am), PM peak (5-8pm), off-peak (other hours).

Here, we incorporated the effect of peak hours and the direction of traffic in order to capture the effect of flow in each direction. A separate model was fitted for each geographical location to observe the significance of each factor and to quantify the impact of each factor on the aging driver-involved crashes versus those for the whole population. To build the model, we started from the full model, including all factors and their two- and three-factor interactions. In each iteration, the regression model was fitted, and the least significant factors or interactions (indicated by large p-values) were removed until the model consisted only of factors significant at a level of 0.15. When using stepwise variable selection on a data set with a large number of predictors, it is common to use relatively large significance levels, ranging from 0.1 to 0.25. This is because the predictors may be correlated, and some of the remaining ones may appear insignificant when other parameters are added to the model [74].

Table 4 shows the coefficient estimates of the models obtained for the selected locations along with their p-values. The first discernible result in Table 4 is that traffic flow has the most significant effect on the number of crashes for the overall population, consistently for both locations. For the aging population, the impact of flow is highly significant. As discussed in the corridor-focused correlation analysis, the locations have distinctive characteristics, such as the different number of lanes (three lanes in the Jacksonville locations, and five to six lanes in Miami). In Miami, since flow appears to be more significant, we can argue that the aging population may get overwhelmed by the high traffic volume due to cognitive and other limitations.
This reveals an interesting aging-specific behavior, since aging people may prefer to visit places such as grocery stores and pharmacies using the least congested roadways available [33, 74]. This pattern differs from that of other adult age groups, who may prefer to drive on the shortest route possible, regardless of the traffic volume. We can conclude that the traffic flow is an important factor in crash occurrence for all drivers, and should be given extra attention when studying the factors that affect crashes. This result has been discussed by many researchers in the literature. However, the unique contribution of our approach is that it enables predicting the crash probability from the hourly traffic flow through the use of Equation (9) and the coefficients given in Table 4 for different geographical sites. For the aging population, from a traffic engineering perspective, the relationship between the flow and crash frequency is highly site-specific. That is, if we observe serious congestion on the roadway, the impact on aging drivers is very significant. Their cognitive limitations as well as the hour of travel could be potential reasons for this behavior, which therefore needs further evaluation.
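To illustrate how a crash probability can be predicted from the hourly flow with the logistic response function in Equation (9), the following minimal sketch uses a single regressor, Flow/1000. The intercept and slope are hypothetical values chosen for illustration only; they are not the estimates reported in Table 4, and the presence of an intercept term is an assumption.

```python
import math

def crash_probability(flow_veh_per_hour, intercept, beta_flow_per_1000):
    """Logistic response function with one regressor (Flow/1000)."""
    eta = intercept + beta_flow_per_1000 * (flow_veh_per_hour / 1000.0)
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical coefficients for illustration only; NOT taken from Table 4.
INTERCEPT = -5.0
BETA_FLOW = 0.30

for flow in (1000, 3000, 5000, 7000):
    p = crash_probability(flow, INTERCEPT, BETA_FLOW)
    print(f"flow = {flow:5d} veh/h -> predicted crash probability = {p:.4f}")
```

Under this form, each additional 1,000 vehicles per hour multiplies the odds of a crash by exp(beta), which is one way to read the Flow/1000 coefficients of a fitted model.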

Table 4 Fitted Regression Models. The Values on Top Are P-Values and The Underlined and Bolded Italic Numbers Are Coefficient Estimates. Miami Jacksonville 1 2 3 1 2 3 All Aging All Aging All Aging All Aging All Aging All Aging

~0.0 0.002 0.001 0.002 ~0.0 0.003 0.002 0.109 0.055 Flow/1000 0.081 0.303 0.098 0.449 0.121 0.336 0.273 0.144 0.403 0.049 0.05 0.027 Light Condition (Flag) -0.17 0.506 0.718 0.052 0.011 Precipitation 0.019 0.025 0.06 0.054 0.053 0.037 0.008 0.88 0.942 0.023 0.037 2.057 AM/PM Peaks (Flag) 0.45 0.637 0.068 0.033 0.168 0.34 0.261 0.01 1.464 1.744 0.728 0.228 Direction × AM/PM Peaks 0.078 1.49 0.108 Direction × Flow/1000 -0.33 0.003 0.11 0.052 0.35 0.057 0.215 0.115 0.47 0.152 -1.12 AM/PM Peaks × Flow/1000 0.458 0.201 0.049 0.096 0.187 0.085 0.115 0.753 -0.39 -0.87 Tests for Terms with More Than One Degree of Freedom AM/PM Peaks 0.13 0.14 0.03 0.1 0.015 Direction × AM/PM Peaks 0.17 AM/PM Peaks × Flow/1000 0.01 0.17 0.023 0.105 0.08

The impact of light condition and precipitation is also site-specific; however, their interaction with the flow does not show any significance. In addition, the results indicate that precipitation has a negligible effect on the crash frequency, whereas the effect of light varies from site to site.

In order to compare the effect of traffic flow on the crash probability of aging drivers and others, we plot the probability curves for the fitted logistic regression models at different Miami locations in Figure 10. For the overall population (Figure 10-a), flow has a different impact on the crash probability depending on the AM/PM peak: for the Miami-2 location it is higher in the morning and in the Miami-3 location it is higher in the evening. For the aging population, the effect of flow on the crash probability seems to be more complex, and the direction of travel becomes significant along with the peak hour effect. For example, in the Miami-2 location, the southbound direction has a significantly higher crash occurrence in the evening than in the morning, while in the northbound direction the crash rates are more similar (Figure 10-b). In the Miami-3 location, the probability of a crash shows a sharp increase to about 1.0 around 7,000 vehicles per hour, which is not typical of the other locations or the overall age group. For the Miami-2 location (which is close to Miami downtown), the sensitivity of the curves of the overall population is higher than that of the aging population. In this location, the chance of aging driver-involved crashes is much lower than that of the overall population [66], but it increases rapidly after the traffic volume exceeds 7,000 vehicles per hour (Figure 10-b). This indicates that aging populations are more vulnerable to crashes at this volume, which does not appear to be as significant for the overall population. Therefore, this location requires extra attention, with a specific focus on the aging populations, when the roadway is congested.

Figure 10 Logit Curves for Crash Probability vs. Flow, (a) Miami Station 2 All Drivers, (b) Miami Station 2 Aging Drivers [panel titles: Logit Curves for Crash Frequency of All Drivers, Miami 2; Logit Curve for Crash Frequency of Aging Drivers, Miami 2; x-axis: Flow; y-axis: Phi; series in (a): Phi (AM) * Flow, Phi (PM) * Flow; series in (b): Phi (AM-N) * Flow, Phi (AM-S) * Flow, Phi (PM-N) * Flow, Phi (PM-S) * Flow]

We observe that the change in the traffic flow has a considerable impact on the crash frequency for both age groups in the Miami locations. The analysis indicates that aging people may prefer to drive during the less congested time periods of the day in order to complete their daily activities [75]. The proposed practical and easily applicable methodology, supported by the easy-to-interpret crash probability curves presented in Figure 10, can be incorporated into transportation safety plans, so that transportation officials become aware of the aging-related crash problems that can occur on the roadways, as well as their probabilities. For instance, practitioners and engineers can use the proposed methodology to create logit graphs for different segments of the same highway in order to determine: (a) which roadway segments are more dangerous (classified as black-spots and high crash probability locations), and (b) which roadways are more sensitive to changes in the traffic under different light or weather conditions. By concentrating on these high crash probability locations for both aging and overall populations, transportation officials can take mitigation and recovery actions to ensure the safety of the public. These actions can include, but are not limited to, providing better engineering countermeasures, such as traffic signs/signals, intelligent transportation systems, and other design and construction-related actions. To sum up, we conclude that aging population crashes are influenced by the number of lanes and traffic volume more than those of other age groups. Especially at congested roadway segments, aging populations tend to have more crashes, a finding that can be used to develop better mitigation plans with a specific focus on aging populations.

3.4.2 Exploratory Analysis for Crash Severity

A similar approach is utilized while studying the severity of crashes on the same road segments. Severity analysis has been performed for all nodes. The analysis for only two nodes in Miami, namely those with the highest number of crash observations, converged to a finite model. For the rest of the roadway segments, regression analysis was not able to find a valid model. Therefore, the analysis was only conducted for the aforementioned two nodes. For the severity analysis, we looked at the crash occurrence data and subdivided them into two categories: Severe and Not Severe crashes, following the KABCO severity classification used by FDOT and AASHTO:


K – Fatal
A – Incapacitating injury
B – Non-incapacitating injury
C – Possible injury
O – Property damage only

We categorized the severity degrees “O” and “C” as non-severe and the degrees “B”, “A” and “K” as severe, and used this flag as the response variable in the severity models. For the severity analysis, we employed the FDOT crash data sets only, and included the environmental and weather-related data given in those data sets. Before the regression analysis, we visually investigated the distribution of severe crashes on the roadway segments. For this purpose, we created a set of kernel density estimation (KDE) raster maps that illustrate the crash severity for different I-95 segments and for different years, as shown in Figure 11 for the Miami 1 location (please refer to Appendix A and B for the kernel density plots for the other locations). Initial observations imply that more severe crashes occur near intersections as well as exits, ramps and merges in general.
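The recoding of KABCO degrees into the binary severity response described above can be sketched as follows. The function name and the sample records are hypothetical, and coding severe as 1 and non-severe as 0 is an assumption made for this illustration.

```python
SEVERE_CODES = {"K", "A", "B"}      # fatal, incapacitating, non-incapacitating injury
NON_SEVERE_CODES = {"C", "O"}       # possible injury, property damage only

def severity_flag(kabco_code):
    """Return 1 for a severe crash (K, A, B) and 0 for a non-severe crash (C, O)."""
    code = kabco_code.strip().upper()
    if code in SEVERE_CODES:
        return 1
    if code in NON_SEVERE_CODES:
        return 0
    raise ValueError(f"unknown KABCO code: {kabco_code!r}")

# Hypothetical crash records (NOT from the FDOT data).
records = ["O", "C", "B", "O", "K", "A", "O"]
flags = [severity_flag(code) for code in records]
print(flags)
```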

Figure 11 (a) Crash Severity Kernel Density Maps for the Miami 1 location in (b) 2010, (c) 2011, (d) 2012

In Figure 11, the thickness of the density raster can be misleading. In order to be able to compare the kernel density plots, we conducted a set of GIS-based analyses. For this purpose, we normalized the kernel values between zero and one after creating the kernel density functions. Next, we created approximately 150 dummy points on each segment, and extracted the kernel value underlying each point. The points are spaced closely enough to represent the actual behavior of the kernel density raster on a two-dimensional plane.
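The normalization and point-sampling steps above can be sketched in a few lines. This is a simplified, non-GIS illustration: the raster is represented as a hypothetical one-dimensional list of kernel values along a segment, and the function names are assumptions rather than part of any GIS package.

```python
import math

def min_max_normalize(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def extract_at_points(cells, n_points=150):
    """Read the cell value under n_points evenly spaced positions along the segment."""
    last = len(cells) - 1
    return [cells[round(i * last / (n_points - 1))] for i in range(n_points)]

# Hypothetical raw kernel values along a segment (two crash clusters).
raw_cells = [
    12.0 * math.exp(-((i - 300) / 80.0) ** 2) + 7.0 * math.exp(-((i - 700) / 50.0) ** 2)
    for i in range(1000)
]

normalized = min_max_normalize(raw_cells)       # step 1: scale to [0, 1]
profile = extract_at_points(normalized, 150)    # step 2: sample about 150 points
print(len(profile), round(max(profile), 3))
```

Because every segment is scaled to the same range, profiles from different years and locations can be overlaid on a common axis.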

Figure 12 Kernel Density for Miami 1 for Year 2010 and Comparison to The Scaled Density Values on The Highway for The Years 2010, 2011 and 2012.

Figure 12 shows the results of this analysis for Miami 1. A similar approach is taken for the other locations, and the results are collected in Figure 13. These figures simply represent the 2-D versions of Figure 11 for all years and all locations. In Figure 13, note that the x-axis shows the location on each segment, ranging from the beginning to the end of that segment. This spatial analysis shows some similarities between the three years considered in terms of severe crash occurrences, and the patterns follow similar trends in most cases. In fact, if one location has a high concentration of severe crashes in a certain year, we observe high densities of severe crashes in other years as well, with different intensity values. When there are more merge/exit ramps and intersections, and when the roadway passes through highly populated areas (Miami 1), the plots usually fluctuate more frequently than those for segments with fewer merges/exits and intersections and a lower population (JAX 1). In addition, we mostly observe the peaks at those areas where intersections and ramps are located. This is especially visible for the Miami 1, Miami 2 and JAX 1 locations.

Figure 13 Kernel Density Function on a 2-D Plane for 3 Years, panels (a)-(f). X: Segment Length. Y: Normalized Kernel Raster Value

3.4.3 Logistic Regression Analysis of Crash Severity for Roadway Segments

Severity Analysis: Individual Locations – 2011

Regression analysis for crash severity is a conditional probability analysis, where the severity probability is the probability of having a severe crash given that a crash is observed. Therefore, the new data set is a subset of the original one, from which the hours with no crash observations were removed. A binary response variable is used in which class 1 indicates a non-severe crash and class 2 indicates a severe crash. Factors related to the weather and roadway surface condition were removed from the model since there was no aging-involved crash observation during rain, and therefore none under slippery roadway conditions. We evaluated the Logit, Probit and complementary log-log functions as possible link functions. Model goodness of fit indicated a better fit for the logit link function for both sites compared to the alternatives; hence, we proceeded with this function. Modelling assumptions are checked by standardized Delta-Beta residual plots (Figure 14). For the Miami 1 location, the Pearson goodness of fit (GOF) test p-values for the aging driver-involved crashes were close to 1, indicating that the model fit was quite adequate. However, for the crash data of all age groups, the goodness-of-fit p-value was 0.025. This indicates that the logistic regression provides a less adequate fit to the observed crash data.

Figure 14 Delta-Beta Residual Analysis for Outlier Detection and Removal (Node Miami 1), (a) Delta Beta graph of residuals for all data, (b) Delta Beta graph of residuals for revised data. The Right-Hand Side Figure Is the Expanded Version of the Box on the Left-Hand Side Figure, Shown After Outlier Handling.

As shown in Figure 14-a (the residuals of the crash frequency model for the overall population in Miami 1), we observe two outliers, which may be the reason for the poor goodness of fit. We refit the model after identifying assignable causes and removing these two outliers. Figure 14-b shows the residuals of the model fitted after deleting the outliers. For Miami 1, the goodness of fit p-values for the new models are 0.43, 0.09 and 0.12 for the Pearson, Deviance and Hosmer-Lemeshow tests, respectively, indicating an improved fit. The same approach was followed to identify and remove similar outliers for the remaining crash models. The goodness-of-fit results of our final models are summarized in Table 5, indicating sufficient adequacy for all of the models.


The Pearson Chi-square goodness-of-fit test and the Deviance goodness-of-fit test both try to assess the discrepancies between the full model and the reduced model. In fact, they test whether the predicted probabilities deviate from the observed probabilities in a way that the binomial distribution does not predict. If the p-value for the goodness-of-fit test is lower than a pre-defined significance level, the predicted probabilities are significantly different from the observed ones. For the binary logistic regression analysis, the format of the data affects the significance test, since the number of trials per row changes for the Pearson test. Therefore, the chi-square approximation is not accurate when the expected number of events per row in the data is small. For the deviance test, the p-value decreases as the number of trials per row decreases (in most cases). The Pearson GOF statistic is calculated as follows:

$\chi^2 = \sum_{j} \frac{(O_j - E_j)^2}{E_j}$ (10)

On the other hand, the statistic for the deviance test is formulated as:

$D = 2\sum_{j} O_j \ln\left(\frac{O_j}{E_j}\right)$ (11)

where each $j$ is a cell in the 2-way table in which each row is a profile and each column is one of the two response categories. $O_j$ denotes the observed frequency, whereas $E_j$ is the expected frequency obtained from the fitted model. If $O_j = 0$, then the summation term is set to 0. Moreover, the degrees of freedom in this test equal the number of profiles minus the number of estimated parameters. If the fitted model is correct, both statistics have approximately a chi-square distribution. In the traffic safety field, most applications of logistic regression use data in a continuous format that does not allow for aggregation into profiles. With one observation per profile, both statistics have distributions that deviate considerably from a true chi-square distribution, and the deviance does not depend on the observed values. Therefore, the p-values can be inaccurate, and using these tests for GOF might be of little use. For this reason, in our analysis, we converted the attributes in the data into several categories in order to enable multiple cases per profile.
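A small worked sketch may help to see how the Pearson and deviance statistics above are evaluated on profile data. The profiles, counts and fitted probabilities below are hypothetical, and the degrees of freedom assume a model with two estimated parameters (an intercept and one slope).

```python
import math

def profile_cells(trials, events, fitted_prob):
    """Build observed and expected cell counts (event, non-event) for each profile."""
    observed, expected = [], []
    for n_i, y_i, p_i in zip(trials, events, fitted_prob):
        observed += [y_i, n_i - y_i]
        expected += [n_i * p_i, n_i * (1.0 - p_i)]
    return observed, expected

def gof_statistics(observed, expected):
    """Return (Pearson chi-square, deviance) summed over all cells."""
    pearson = 0.0
    log_sum = 0.0
    for o, e in zip(observed, expected):
        pearson += (o - e) ** 2 / e
        if o > 0:  # cells with O_j = 0 contribute nothing to the deviance sum
            log_sum += o * math.log(o / e)
    return pearson, 2.0 * log_sum

# Hypothetical profiles: hours observed, hours with a crash, fitted crash probability.
trials = [40, 60, 50]
events = [4, 9, 12]
fitted = [0.09, 0.16, 0.24]

obs, exp_counts = profile_cells(trials, events, fitted)
pearson, deviance = gof_statistics(obs, exp_counts)
dof = len(trials) - 2  # number of profiles minus number of estimated parameters (assumed 2)
print(f"Pearson = {pearson:.4f}, deviance = {deviance:.4f}, df = {dof}")
```

Small values of both statistics relative to the chi-square distribution with the stated degrees of freedom correspond to large p-values, that is, no evidence of lack of fit.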


The Hosmer-Lemeshow test evaluates the level of agreement between the model predictions and observations by dividing the observations into a number of bins, using the model to predict the dependent variable, and comparing the predicted values to the observed values with a chi-square hypothesis test. Table 5 also shows the percentage of concordant and discordant pairs from the Hosmer-Lemeshow (HL) test (from the number of correct and incorrect classifications) of the fitted models. The proportion of concordant pairs ranges between 60 and 75% in all models, a reasonably high proportion indicating a high goodness of fit.
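The binning idea behind the Hosmer-Lemeshow test can be sketched as follows. This is the standard textbook form of the statistic with equal-sized groups ordered by predicted probability; whether the software used in this thesis bins in exactly this way is an assumption, and the data below are synthetic.

```python
def hosmer_lemeshow(y, p, n_bins=10):
    """Return (HL statistic, degrees of freedom) for binary outcomes y and predictions p."""
    pairs = sorted(zip(p, y))          # order observations by predicted probability
    n = len(pairs)
    statistic = 0.0
    for g in range(n_bins):
        group = pairs[g * n // n_bins:(g + 1) * n // n_bins]
        if not group:
            continue
        n_g = len(group)
        expected = sum(p_i for p_i, _ in group)   # expected number of events in the bin
        observed = sum(y_i for _, y_i in group)   # observed number of events in the bin
        mean_p = expected / n_g
        statistic += (observed - expected) ** 2 / (n_g * mean_p * (1.0 - mean_p))
    return statistic, n_bins - 2

# Synthetic predictions and outcomes (NOT thesis data), generated deterministically.
p = [0.02 + 0.004 * i for i in range(200)]
y = [1 if ((i * 37) % 100) / 100.0 < p[i] else 0 for i in range(200)]

hl_stat, hl_dof = hosmer_lemeshow(y, p)
print(f"HL statistic = {hl_stat:.3f} on {hl_dof} degrees of freedom")
```

The statistic is compared against a chi-square distribution with the returned degrees of freedom; a large p-value indicates agreement between predicted and observed event counts across the bins.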

Table 5 Goodness of Fit p-values between Observed Data and Predicted Values. (Data for 2011)

                 Miami                              JAX
             Frequency       Severity        Frequency       Severity
Test         All    Aging    All    Aging    All    Aging    All     Aging

Miami 1 / JAX 1
Pearson      0.43   0.10     0.19   0.28     0.12   0.10     <0.01   <0.01
Deviance     0.41   0.10     0.08   0.12     0.11   0.10     <0.01   <0.01
HL           0.12   0.41     0.10   0.74     0.72   0.49     <0.01   <0.01

Miami 2 / JAX 2
Pearson      0.40   0.10     0.13   0.22     0.10   0.10     <0.01   <0.01
Deviance     0.38   0.09     0.09   0.07     0.10   0.10     <0.01   <0.01
HL           0.14   0.49     0.10   0.55     0.88   0.70     <0.01   <0.01

Miami 3 / JAX 3
Pearson      0.20   0.24     0.13   0.11     0.10   0.10     <0.01   <0.01
Deviance     0.40   0.10     0.09   0.09     0.10   0.10     <0.01   <0.01
HL           0.12   0.12     0.32   0.68     0.57   0.53     <0.01   <0.01

The regression models demonstrate no or little effect of these factors on the severity of crashes (see Table 6).


Table 6 Fitted Regression Models (Crash Severity). P-Values and Coefficients Are Listed in Separate Rows. Results Are Significant Only for Miami 1 and Miami 2 Locations.

                     Miami 1         Miami 2         Miami 3         JAX 1           JAX 2           JAX 3
                     All    Aging    All    Aging    All    Aging    All    Aging    All    Aging    All    Aging
Flow   P-value       0.04   0.02     0.08   0.02     --     --       --     --       --     --       --     --
       Coefficient   -2     -2.26    -1.06  -2.08    --     --       --     --       --     --       --     --
Light  P-value       --     0.16     --     --       --     --       --     --       --     --       --     --
       Coefficient   --     0.9      --     --       --     --       --     --       --     --       --     --

Table 5 shows the poor performance of the models for severity analysis. However, since the database prepared for the severity analysis is different from the frequency analysis (i.e. each row in severity data corresponds to one crash), we can aggregate the available data to capture more potential attributes that can affect the crash severity. In addition, we can also analyze more data by incorporating the crash data for other years.

Severity Analysis: Spatially and Temporally Aggregated, 2010-2012

As discussed earlier, each segment shares similar roadway characteristics; however, combining the selected segments in each metropolitan region adds the attributes that show the differences between segments. Hence, the data sets of 2010, 2011 and 2012 for all segments in Miami and Jacksonville were combined and named the ‘Miami’ and ‘Jacksonville’ data sets for further severity analysis. A total of 44 attributes were extracted from the Florida statewide on-system traffic crash records. They were categorized into 6 general classes, as shown in Table 7.

Table 7 Crash Attributes for Crash Severity

Category                      Attributes
1. Temporal                   Day of week and Time of Crash
2. Environmental              Light Condition, Weather Condition, Surface condition and Visibility
3. Traffic related            AADT, Traffic Flow, Direction, Average Truck Traffic and Maximum Speed
4. Driver related             Crash Cause, DUI and Lane Departure
5. Occupant                   Minor, Young and Aging
6. Roadway design related     Shoulder and median width, Shoulder Type, Skid Tests and Crash Lane


After a correlation analysis and eliminating relatively constant factors, we came up with 25 final candidate factors for further use in the regression analysis (Table 8). The new regression analyses were conducted on four sets of data: (1) Jacksonville for drivers of all ages, (2) Jacksonville for aging drivers only, (3) Miami for drivers of all ages and (4) Miami for aging drivers only. Table 9 lists the number of observations in severity analysis.

Table 8 Candidate Factors for Regression Analysis of Crash Severity

No.  Factor                      Format       Range
1    Day of Week                 Binary       0: Week day, 1: Weekend
2    Direction                   Binary       0: North, 1: South
3    Side of Road                Binary       0: Intersection, 1: Segment
4    Crash Lane                  Binary       0: Median, left lanes, 1: Right lanes
5    Shoulder Type               Nominal      8 Levels: Curb, Paved, Lawn, etc.
6    Shoulder width              Continuous   > 0
7    Median width                Continuous   > 0
8    AADT                        Continuous   > 0
9    % of Trucks in traffic      Continuous   > 0
10   Light Condition             Binary       0: Day, 1: Night
11   Weather Condition           Binary       0: Clear, 1: Rainy, Cloudy
12   Surface Condition           Binary       0: Dry, 1: Wet
13   Rear End Crashes            Binary       0: Others, 1: Rear End
14   Collision with barrier      Binary       0: Others, 1: Barrier
15   Crash Cause                 Binary       0: Driving Related, 1: Others
16   Commercial Veh. Involve.    Binary       0: No, 1: Yes
17   Lane Departure              Binary       0: No, 1: Yes
18   15-19 involvement           Binary       0: No, 1: Yes
19   +65 involvement             Binary       0: No, 1: Yes
20   50-64 involvement           Binary       0: No, 1: Yes
21   DUI                         Binary       0: No, 1: Yes
22   1-17 involvement            Binary       0: No, 1: Yes
23   Seatbelt restrained?        Binary       0: No, 1: Yes
24   Surface width               Continuous   > 0
25   Crash Severity              Binary       0: Non-Severe, 1: Severe

Table 9 Number of Observations in Severity Data and Test Data Sets

               Total Crash Obs.   Severe Obs.   Test Dataset Obs.
Miami All      5117               394           236
Miami Aging    502                33            26
JAX All        1543               94            56
JAX Aging      166                13            10
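As a quick arithmetic check, the share of severe crashes in each data set follows directly from the counts in Table 9. The short sketch below uses those counts as given; only the dictionary and variable names are introduced here.

```python
# Counts copied from Table 9: (total crash observations, severe observations).
table9 = {
    "Miami All": (5117, 394),
    "Miami Aging": (502, 33),
    "JAX All": (1543, 94),
    "JAX Aging": (166, 13),
}

# Percentage of crashes that are severe in each data set.
severe_share = {name: 100.0 * severe / total for name, (total, severe) in table9.items()}

for name, share in severe_share.items():
    print(f"{name:12s} {share:5.2f}% severe")
# Miami All 7.70%, Miami Aging 6.57%, JAX All 6.09%, JAX Aging 7.83%
```

These percentages agree with the "All Ages" severe shares reported in Table 10.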


Here, we would like to observe how different age groups were involved in crashes at the two severity levels: severe and non-severe. Table 10 summarizes the results. On average, 6-8% of all crashes are found to be severe at all locations for both driver age groups. At all locations, 1-17 year old passengers are involved less in a crash when an aging person is the driver. This trend is similar for the other adult age groups. These observations could be due either to the aging population driving more carefully or to fewer passengers younger than 65 being present in a crash when the driver is an aging one. Needless to say, if the driver is +65, the rate of +65 involvement is 100%. This is shown in the table through cells with no value.

Table 10 Crash Severity Involvement by Age (All numbers are in percent). The Number of Crashes Experienced by an Age Group Is Shown as a Percentage of the Total Number of Crashes in a Location.

Driver                            Miami (All Ages)       Miami (Aging)
Occupant                          Severe    Not Sev.     Severe    Not Sev.
All Ages                          7.70      92.30        6.57      93.43
Aging (+65)      Not Present      91.62     90.07        --        --
                 Present          8.38      9.93         --        --
Mid-Age (50-64)  Not Present      67.26     68.75        75.76     75.69
                 Present          32.74     31.25        24.24     24.31
Minor (1-17)     Not Present      84.26     86.64        90.91     88.06
                 Present          15.74     13.36        9.09      11.94

Driver                            Jax (All Ages)         Jax (Aging)
Occupant                          Severe    Not Sev.     Severe    Not Sev.
All Ages                          6.09      93.91        7.83      92.17
Aging (+65)      Not Present      86.44     89.44        --        --
                 Present          13.83     10.56        --        --
Mid-Age (50-64)  Not Present      73.40     69.29        76.92     76.86
                 Present          26.6      30.71        23.08     26.14
Minor (1-17)     Not Present      77.66     82.95        84.62     86.27
                 Present          22.34     17.05        15.38     13.73

The 50-64 age group appears to have the highest involvement in both severe and non-severe crashes. Here, involvement covers both drivers and passengers. In Miami, the rate of involvement of this group is similar for both crash severity levels (severe and non-severe). In Jacksonville, on the other hand, their rate of involvement in severe crashes is lower than in non-severe crashes (by approximately 3%). Minors also appear to be involved in severe crashes. Among all age groups, this group shows the largest excess of involvement in severe crashes over non-severe ones (by 2% on average) when an aging driver is present. Following this descriptive analysis, the logistic regression analysis was carried out. Results can be seen in Table 11. This table shows the p-values and the direction in which each factor affects crash severity. In order to see whether the factors listed in Table 8 have a meaningful impact on crash severity, we also included age groups as factors in the regression analysis.

Table 11 Fitted Regression Models for Crash Severity Analysis: Locations in Miami and Jacksonville Are Temporally Aggregated for 3 Years (2010-2012)

| # | Factor Attribute | JAX (All) Sig. Dir. | JAX (All) Prob>ChiSq | Miami (All) Sig. Dir. | Miami (All) Prob>ChiSq | JAX (Aging) Sig. Dir. | JAX (Aging) Prob>ChiSq | Miami (Aging) Sig. Dir. | Miami (Aging) Prob>ChiSq | Ranking |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Day of Week | | 0.9584 | + | 0.0991 | | 0.4609 | | 0.333 | 21 |
| 2 | Direction | | 0.3627 | | 0.2482 | - | 0.0194 | | 0.2629 | 10 |
| 3 | Side of Road | | 0.1955 | | 0.8589 | | 0.3636 | - | 0.0855 | 19 |
| 4 | Crash Lane | | 0.6601 | - | 0.0044 | | 0.3241 | | 0.6105 | 13 |
| 5 | Shoulder Type | | 0.7445 | - | 0.1294 | + | 0.0007 | | 0.5225 | 12 |
| 6 | Shoulder width | - | 0.1494 | - | 0.0769 | | 0.5447 | | 0.322 | 15 |
| 7 | Median width | - | 0.0601 | + | 0.0236 | | 0.423 | | 0.2054 | 4 |
| 8 | AADT | + | 0.0548 | | 0.3685 | - | 0.0263 | | 0.8632 | 7 |
| 9 | % of Trucks in traffic | | 0.6217 | + | 0.1266 | | 0.2007 | | 0.4864 | 17 |
| 10 | Light Condition | + | 0.0065 | + | 0.0071 | + | 0.0533 | | 0.5964 | 2 |
| 11 | Weather Condition | | 0.1629 | | 0.2387 | - | 0.0747 | | 0.1956 | 11 |
| 12 | Surface Condition | + | 0.1249 | | 0.3333 | + | 0.0189 | - | 0.1323 | 6 |
| 13 | Rear End Crashes | | 0.7005 | - | 0.0088 | | 0.2464 | | 0.1568 | 14 |
| 14 | Collision with barrier | | 0.2217 | + | 0.0327 | + | 0.0532 | - | 0.0158 | 3 |
| 15 | Crash Cause | | 0.9513 | | 0.202 | | 0.1865 | | 0.582 | 18 |
| 16 | Commercial Veh. Involve. | | 0.5827 | | 0.2154 | | 0.9501 | | 0.6042 | 22 |
| 17 | Lane Departure | + | 0.0304 | | 0.6255 | + | 0.0584 | - | 0.0739 | 8 |
| 18 | 15-19 involvement | | 0.5707 | | 0.6304 | | 0.8138 | | 0.5528 | 23 |
| 19 | +65 involvement | | 0.2232 | | 0.9325 | | -- | | -- | -- |
| 20 | 50-64 involvement | | 0.9918 | + | 0.028 | | 0.7258 | | 0.5897 | 20 |
| 21 | DUI | | 0.6562 | + | 0.0031 | | 0.2545 | + | 0.1181 | 9 |
| 22 | 1-17 involvement | + | 0.1167 | + | 0.0922 | | 0.7588 | | 0.3651 | 16 |
| 23 | Seatbelt restrained? | + | <.0001 | + | <.0001 | + | 0.0727 | | 0.4502 | 1 |
| 24 | Surface width | | 0.8576 | + | 0.0252 | + | <.0001 | + | 0.0997 | 5 |
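To illustrate how a sign ("Sig. Dir.") and a chi-square p-value can arise for a single binary factor, the following minimal sketch uses the fact that a logit model with one 0/1 predictor has a slope equal to the log odds ratio of the 2x2 table. It applies a Wald test, which is an assumption made for simplicity; the Prob>ChiSq values in Table 11 come from the full multi-factor models and may be based on a different test. The counts are hypothetical and are not taken from the thesis data.

```python
import math


def binary_factor_effect(severe_with, nonsevere_with, severe_without, nonsevere_without):
    """Direction, slope and Wald p-value for one binary factor in a logit model.

    With a single 0/1 predictor, the fitted slope equals the log odds ratio of
    the 2x2 table, and its Wald standard error is the square root of the sum
    of the reciprocal cell counts.
    """
    log_or = math.log((severe_with * nonsevere_without) /
                      (nonsevere_with * severe_without))
    se = math.sqrt(1 / severe_with + 1 / nonsevere_with +
                   1 / severe_without + 1 / nonsevere_without)
    chi_sq = (log_or / se) ** 2
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_sq / 2))
    direction = "+" if log_or > 0 else "-"
    return direction, log_or, p_value


# Hypothetical counts: 30 severe / 70 non-severe crashes with the factor present,
# 10 severe / 90 non-severe crashes with the factor absent.
direction, slope, p = binary_factor_effect(30, 70, 10, 90)
print(direction, round(slope, 3), round(p, 4))
```

A "+" direction here means the odds of a severe crash are higher when the factor equals 1, which is how the signs in Table 11 are read in the discussion that follows.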

In terms of goodness of fit, the p-values of the Pearson test are 0.44, 0.1, 0.56, and 0.08, and those of the deviance test are 0.49, 0.1, 0.57 and 0.1, for Miami (All Ages), Miami (Aging only), Jacksonville (All Ages) and Jacksonville (Aging only), respectively. Since all of these values exceed 0.05, lack of fit is not rejected for any model at the 95% confidence level. However, in a chi-square goodness-of-fit test, failing to reject the null hypothesis of adequate fit is not the same as demonstrating a good fit, so larger p-values are needed to make valid statements about a model. The p-values for the aging models are relatively low and close to the 0.05 threshold, so statements about these models should be made with caution.

The objective of the crash severity analysis is twofold. First, we would like to compare the behavior of aging drivers and other age groups in two different environments. Second, we want to observe how crash severity for aging drivers differs across locations (touristic locations, urban areas with high traffic volume, etc.). Some of the observations from Table 11 have already been discussed in many studies and reports [75-77]. However, according to Table 11, the probability of a severe crash increases for all drivers as the shoulder width decreases, whereas the shoulder width does not play a significant role in aging driver-involved crashes. Drivers of all age groups are prone to higher crash risk while driving at night; however, this effect is smaller for aging drivers. This observation could be due to several reasons: aging people may tend to drive less at night, or they may drive with more caution than younger drivers. Furthermore, comparing the Jacksonville locations to those of Miami reveals some interesting results. For instance, driving at night seems to be problematic in Jacksonville, whereas the crash risk at night is lower in Miami for both age classes.

Seatbelt use has been found to be of utmost importance for crash severity for all age groups. The effect of this factor on aging driver-involved crashes is found to be smaller than for other age groups. This may indicate that aging drivers are more cautious and therefore do not usually forget to buckle up while driving. On the other hand, roadway surface width is more influential for aging drivers than for other ages.
One surprising finding is the role of DUI in crash severity, which is distinctly higher in Miami. In other words, DUI is a significant factor in causing severe crashes for both age groups there. Considering that alcohol impairs the judgement and ability of drivers, both age groups in Miami are involved in DUI-related crashes considerably more than in Jacksonville. DUI-involved crashes by aging drivers are a serious issue, and policy makers should therefore take serious measures, such as education or enforcement, to reduce the rate of drunk driving amongst aging drivers. Wet and slippery roadway surfaces also influence crash severity for both age groups in Jacksonville. This is supported by the plots in Figure 5. Considering the similar pavement conditions and texture at both locations, reckless driving and speeding appear to happen more in Jacksonville although the traffic volume there is lower. Therefore, more severe crashes involving both age groups are observed. In addition, in Jacksonville, most of the severe crashes for both age groups occurred during a lane departure.

Severity Analysis: Spatially Aggregated, 2011

The objective of this section is to show how aggregating the three locations may improve decision making in the area of traffic safety. Here, we aggregate the data corresponding to the three locations in Miami and draw inferences based on only one year (in contrast to the three-year analysis in the previous section). Figure 15 shows a schematic view of this approach.

Figure 15 A Schematic Illustration of Aggregation Approach

Logistic regression analysis results for all age groups in Miami are shown in Table 12. The goodness-of-fit p-value for the second model (2011) is 0.1 according to both the Pearson and deviance tests. This indicates a less adequate model fit than that of the original three-year model. Moreover, some factors seem to be significant regardless of the data size. For example, day of the week, shoulder width, median width, and seatbelt usage are found to be significant in both models (three-year aggregated and one-year). However, other factors, such as weather condition or lane departure, seem to differ temporally. This comparison enables one to detect the significant factors that do not change over the years.

Table 12 Logistic Regression Analyses for Aggregated Models of Three Years vs. Aggregated Model of 2011

| Attribute | Miami (All Years) Signif. Dir. | Miami (All Years) Prob>ChiSq | Miami (2011) Signif. Dir. | Miami (2011) Prob>ChiSq |
|---|---|---|---|---|
| Day of Week | + | 0.0991 | + | 0.0387 |
| Direction | | 0.2482 | | 0.4695 |
| Side of Road | | 0.8589 | | 0.6015 |
| Crash Lane | - | 0.0044 | | 0.4524 |
| Shoulder Type | - | 0.1294 | | 0.2024 |
| Shoulder width | - | 0.0769 | - | 0.0013 |
| Median width | + | 0.0236 | + | 0.0894 |
| AADT | | 0.3685 | | 0.8311 |
| % of Trucks in traffic | + | 0.1266 | | 0.3346 |
| Light Condition | + | 0.0071 | + | 0.0722 |
| Weather Condition | | 0.2387 | - | 0.0361 |
| Surface Condition | | 0.3333 | | 0.3367 |
| Rear End Crashes | - | 0.0088 | | 0.5462 |
| Collision with barrier | + | 0.0327 | | 0.5964 |
| Crash Cause | | 0.202 | - | 0.0139 |
| Truck Involvement | | 0.2154 | | 0.8201 |
| Lane Departure | | 0.6255 | + | 0.0756 |
| 15-19 involvement | | 0.6304 | | 0.6973 |
| +65 involvement | | 0.9325 | | 0.6371 |
| 50-64 involvement | + | 0.028 | + | 0.0549 |
| DUI | + | 0.0031 | | 0.1932 |
| 1-17 involvement | + | 0.0922 | | 0.2301 |
| Seatbelt restrained? | + | <.0001 | + | <.0001 |
| Surface width | + | 0.0252 | | 0.4202 |
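The comparison between the two models in Table 12 amounts to intersecting the sets of significant factors. A minimal sketch is given below; the 0.10 screening level is an assumption chosen for illustration (the thesis does not state the level used to mark significance), and "<.0001" is stored as 0.0001, an upper bound.

```python
# (p-value in the 2010-2012 model, p-value in the 2011 model), transcribed from Table 12.
table12 = {
    "Day of Week": (0.0991, 0.0387),
    "Direction": (0.2482, 0.4695),
    "Side of Road": (0.8589, 0.6015),
    "Crash Lane": (0.0044, 0.4524),
    "Shoulder Type": (0.1294, 0.2024),
    "Shoulder width": (0.0769, 0.0013),
    "Median width": (0.0236, 0.0894),
    "AADT": (0.3685, 0.8311),
    "% of Trucks in traffic": (0.1266, 0.3346),
    "Light Condition": (0.0071, 0.0722),
    "Weather Condition": (0.2387, 0.0361),
    "Surface Condition": (0.3333, 0.3367),
    "Rear End Crashes": (0.0088, 0.5462),
    "Collision with barrier": (0.0327, 0.5964),
    "Crash Cause": (0.202, 0.0139),
    "Truck Involvement": (0.2154, 0.8201),
    "Lane Departure": (0.6255, 0.0756),
    "15-19 involvement": (0.6304, 0.6973),
    "+65 involvement": (0.9325, 0.6371),
    "50-64 involvement": (0.028, 0.0549),
    "DUI": (0.0031, 0.1932),
    "1-17 involvement": (0.0922, 0.2301),
    "Seatbelt restrained?": (0.0001, 0.0001),  # reported as <.0001
    "Surface width": (0.0252, 0.4202),
}

ALPHA = 0.10  # assumed screening level, not stated in the thesis

sig_all_years = {f for f, (p_all, _) in table12.items() if p_all < ALPHA}
sig_2011 = {f for f, (_, p_2011) in table12.items() if p_2011 < ALPHA}

stable = sorted(sig_all_years & sig_2011)          # significant in both models
only_all_years = sorted(sig_all_years - sig_2011)  # lost when only 2011 is used
only_2011 = sorted(sig_2011 - sig_all_years)       # appear only in the 2011 model

print("Significant in both:", stable)
print("Only in 2010-2012:", only_all_years)
print("Only in 2011:", only_2011)
```

Under this assumed level, the factors named in the text (day of week, shoulder width, median width, seatbelt usage) fall in the first group, while weather condition and lane departure fall in the last.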

3.5 Prediction Capabilities using ROC Curves

The model adequacy discussed above considers the dataset as a whole. However, it does not reflect how the models will perform on new data. In order to test the performance of the fitted models with new data, we conducted a prediction study in which we split the entire dataset into two subsets: a training dataset and a test dataset. For the test datasets, we randomly selected 15% of the hourly data from both crash and non-crash records to obtain a representative sample for crash frequency; for crash severity, we took the same approach for severe and non-severe crashes. We fit a new regression model on the training dataset and test its prediction performance on the test dataset. In the training phase, the dataset is divided into 10 sections, and in each of 10 iterations, 9 sections are used for training the model and 1 section is used to test the trained model. In other words, we employed 10-fold cross-validation to guard against common training issues such as overfitting. We measure both the true positive rate (sensitivity) and the true negative rate (specificity) of our models as the probability cut-off increases gradually from 0 to 1, and draw the Receiver Operating Characteristic (ROC) curve. For crash frequency, we discuss the procedure for the all-age-group model in Miami 1. The procedure is similar for all other logistic regression models.
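The splitting scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the record identifiers and class sizes are hypothetical, the helper names `holdout_split` and `k_fold` are not from the thesis (whose implementation was written in Matlab), and the model-fitting step itself is omitted.

```python
import random


def holdout_split(indices, test_fraction, rng):
    """Randomly move a fraction of the indices into a test set."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(indices, k):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


rng = random.Random(0)  # fixed seed so the split is repeatable

event_ids = list(range(0, 400))         # hypothetical crash records
non_event_ids = list(range(400, 1400))  # hypothetical non-crash records

# Draw 15% from each class separately, as described in the text.
train_events, test_events = holdout_split(event_ids, 0.15, rng)
train_non_events, test_non_events = holdout_split(non_event_ids, 0.15, rng)

train_ids = train_events + train_non_events
rng.shuffle(train_ids)

# 10-fold cross-validation on the training portion only.
folds = list(k_fold(train_ids, 10))
for train, validation in folds:
    pass  # fit the model on `train`, evaluate it on `validation`
```

Sampling the two classes separately keeps both events and non-events in the test set, which matters when events are as rare as they are in crash data.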

The true positive rate (sensitivity) is the proportion of true events that are correctly identified. Therefore, if sensitivity = 1, every true event is predicted as an event (there are no false negatives). Specificity, on the other hand, is the proportion of true non-events that are correctly identified; if specificity = 1, every true non-event is predicted as a non-event (there are no false positives). These measures can be calculated from a 2x2 table known as a confusion table or contingency matrix.

Table 13 Schematic Observation Frequency Table for Observation Vs. Prediction

Therefore, sensitivity and specificity could be calculated as:

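\[ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{4} \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \tag{5} \]

where TP, FN, TN and FP denote the numbers of true positives, false negatives, true negatives and false positives in the confusion table.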

For each cut-off value on the ROC curve there is a confusion matrix. Every predicted probability above the cut-off value is classified as an event (crash or severe crash), and every value below the cut-off as a non-event (no crash or non-severe crash).
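The construction of an ROC curve from a cut-off sweep can be sketched as below. The labels and predicted probabilities are hypothetical, the function names are illustrative, and the area is computed with the trapezoidal rule, which is one common choice and not necessarily the one used in the thesis.

```python
def roc_points(labels, scores, n_steps=100):
    """Sweep the cut-off from 0 to 1 and return sorted (FPR, TPR) pairs."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = {(0.0, 0.0), (1.0, 1.0)}  # end points of every ROC curve
    for step in range(n_steps + 1):
        cutoff = step / n_steps
        # Probabilities above the cut-off are classified as events.
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s > cutoff)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s > cutoff)
        points.add((fp / negatives, tp / positives))
    return sorted(points)


def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical observations (1 = event) and predicted probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.3, 0.8, 0.5, 0.2, 0.1]

points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points)
print(round(auc, 4))
```

Each point in `points` corresponds to one confusion matrix; an area of 0.5 corresponds to the diagonal line of random classification.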

3.5.1 Crash Frequency Prediction Analysis

Figure 16 (left) shows an example of Table 13. The corresponding point on the curve (Kappa) is the closest point to the upper left corner of the plot and is generally taken as the optimal cut-off point. However, this value mainly depends on the decision maker. For example, for those interested in identifying accident hotspots, high sensitivity is important, and therefore a lower cut-off value should be considered. From 17270 data points, we select 61 crashes and 61 hours with no crash. At this specific cut-off point, 32 out of 61 crashes and 35 out of 61 no-crash hours were predicted accurately.

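| Classification | Observed 1 | Observed 0 | Total |
|---|---|---|---|
| 1 | 32 (52.46) | 26 (42.62) | 58 |
| 0 | 29 (47.54) | 35 (57.38) | 64 |
| Total | 61 | 61 | 122 |

(Values in parentheses are column percentages.)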

Figure 16 (left) A Sample Frequency Table for Observation vs. Prediction Generated by Algorithm, (right) ROC Curve for Crash Frequency Model for Node Miami1-All Age Groups
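As a worked instance of Equations (4) and (5), the counts in the frequency table of Figure 16 (left) give the sensitivity and specificity at this cut-off. The snippet below is a minimal sketch; the distance to the upper left corner is included only to illustrate the "closest point" criterion mentioned in the text, and the variable names are illustrative.

```python
import math

# Counts from the frequency table in Figure 16 (left).
tp, fp = 32, 26  # classified as crash: observed crash / observed no crash
fn, tn = 29, 35  # classified as no crash: observed crash / observed no crash

sensitivity = tp / (tp + fn)                # Equation (4)
specificity = tn / (tn + fp)                # Equation (5)
accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all hours classified correctly

# Distance from the ROC point (1 - specificity, sensitivity) to the ideal corner (0, 1).
distance_to_corner = math.hypot(1 - specificity, 1 - sensitivity)

print(f"sensitivity = {100 * sensitivity:.2f}%")
print(f"specificity = {100 * specificity:.2f}%")
print(f"accuracy    = {100 * accuracy:.2f}%")
print(f"distance to (0, 1) = {distance_to_corner:.3f}")
```

The sensitivity and specificity reproduce the percentages 52.46 and 57.38 shown in parentheses in the table.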

An ROC curve that is perfectly linear from the point (0,0) to (1,1), i.e. the diagonal line, represents completely random classification (the indifference line). We show this diagonal for comparison purposes. In an ROC curve, the area under the curve (AUC) is reported as a measure of predictive accuracy (or discrimination). Here this area is close to 0.55 (out of one), which indicates that the models predict somewhat better than random classification. However, the prediction is not very accurate, and other techniques such as support vector machines, soft computing and more complex statistical learning techniques can be employed to increase the prediction capability. ROC analysis for crash severity is carried out for all four datasets following the same approach as for crash frequency. Here, a balanced test dataset is selected according to Table 9 and ROC curves are generated as shown in Figure 17. The approaches are coded in MATLAB 2015 on an Intel Core i7-5500U 2.4 GHz processor with 8 GB RAM, running Windows 10 Pro.

Figure 17 Receiver Operating Characteristic Plots for Crash Severity Prediction by logit for Miami-All Ages (a), Miami-Aging Drivers (b), Jacksonville-All Ages (c) and Jacksonville-Aging Drivers (d). Values at The Lower Right Corner of The Plots Show Area Under Curves (AUC): (a) 0.54, (b) 0.52, (c) 0.53, (d) 0.51.

These plots show that as the number of observations decreases, the accuracy of the methodology decreases as well. Similar to crash frequency, although all AUCs are above 0.50, the prediction capability of the logit approach is not quite satisfactory, and more advanced approaches, such as artificial intelligence methods, meta-heuristics, neural networks, fuzzy logic or fuzzy cognitive maps [78, 79], can be considered for factor analysis or for better prediction. As far as crash analysis and justifiable prediction are concerned, however, the logit approach works adequately. Comparison of the ROC curves for the temporally aggregated model (2010-2012) versus the one-year model (2011) shows that the prediction capability of the model decreases and is very close to random guessing when only one year is available (Figure 18). Hence, we can conclude that the analysis with one year of data is useful for observing differences in the significance level of the factors; however, it is not very useful for prediction purposes.

Figure 18 Comparison of ROC Curves for Spatially Aggregated Model of Miami with Data for 2010-2012 Vs. Spatially Aggregated Model for 2011. AUC values shown in the plots: 0.54 (2010-2012) and 0.515 (2011).

Table 14 shows the performance measures for the Miami models with the temporally aggregated data (2010-2012) and with one year of data only (2011). In terms of goodness of fit, the aggregated model fits the data much better, since it accounts for temporal variations. One year of data also does not seem to be enough to properly fit a model for aging drivers. Potential reasons are the shortage of observations and the large variability in the independent factors. Aggregating the data for three years, however, provides enough observations to fit justifiable models, and the aggregated model can also be used for prediction purposes. In terms of crash severity prediction, aggregated models promise better prediction according to the area under the ROC curve (Table 14). Analysis for one year (2011) and one location (Miami 1) returns unreliable results, with some overfitting observed. The models in this condition are completely unreliable and should not be used in analyses. The reason for such weak performance is the shortage of input factors and of event observations. Overall, we demonstrated how aggregating datasets for this analysis can increase the accuracy and precision of factor analysis and prediction.

Table 14 Performance Comparison of Temporally Aggregated Model for Miami vs. Miami-2011

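| Data | Drivers | AUC | p-Pearson | p-Deviance |
|---|---|---|---|---|
| Miami: 3 Years, 3 Locations | All | 0.54 | 0.44 | 0.49 |
| Miami: 3 Years, 3 Locations | Aging | 0.52 | 0.1 | 0.1 |
| Miami: 1 Year, 3 Locations | All | 0.515 | 0.1 | 0.1 |
| Miami: 1 Year, 3 Locations | Aging | 0.5 | <0.01 | <0.01 |
| Miami 1: 1 Year, 1 Location | All | 0.5 | 0.19 | 0.08 |
| Miami 1: 1 Year, 1 Location | Aging | 0.5 | 0.28 | 0.12 |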
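The reading of Table 14 can be made explicit with a short sketch that checks, for each model, whether lack of fit is rejected at the 0.05 level and whether the AUC exceeds the 0.5 of the diagonal. The labels returned by `assess` are illustrative wording, not terminology from the thesis, and "<0.01" is stored as 0.01, an upper bound.

```python
# Rows transcribed from Table 14: (data scope, drivers, AUC, Pearson p, deviance p).
table14 = [
    ("3 years, 3 locations", "All", 0.54, 0.44, 0.49),
    ("3 years, 3 locations", "Aging", 0.52, 0.10, 0.10),
    ("1 year, 3 locations", "All", 0.515, 0.10, 0.10),
    ("1 year, 3 locations", "Aging", 0.50, 0.01, 0.01),  # reported as <0.01
    ("1 year, 1 location", "All", 0.50, 0.19, 0.08),
    ("1 year, 1 location", "Aging", 0.50, 0.28, 0.12),
]

ALPHA = 0.05  # significance level used in the text


def assess(auc, p_pearson, p_deviance):
    """Summarise one model: is lack of fit rejected, and does it beat the diagonal?"""
    if min(p_pearson, p_deviance) <= ALPHA:
        fit = "lack of fit rejected"
    else:
        fit = "fit not rejected"
    skill = "above random" if auc > 0.5 else "no better than random"
    return fit, skill


summary = {
    (scope, drivers): assess(auc, p_pearson, p_deviance)
    for scope, drivers, auc, p_pearson, p_deviance in table14
}
for key, verdict in summary.items():
    print(key, verdict)
```

Note that "fit not rejected" is a weak statement: as discussed in the text, a p-value only slightly above 0.05 does not demonstrate a good fit.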

3.6 Research Limitations

In this section, we explain the limitations and difficulties we encountered in this research. The first constraint arose from the TTMS locations. We focused on the data obtained through the TTMS devices in this research, and the number of TTMS sites is quite limited; however, this research can be extended by obtaining other traffic flow data from units such as portable traffic monitoring sites (PTMS), detectors and video processing units. In some locations, the collected data are not reliable, complete or accurate, so we had to make do with the available, albeit limited, data. Another issue with the TTMS data was missing values, which we interpolated based on the records before and after each missing one.
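One simple way to fill a missing hourly count from its neighbouring records is linear interpolation. The sketch below assumes this scheme for illustration only; the thesis states that neighbouring records were used but does not specify the exact interpolation rule, and the hourly volumes shown are hypothetical.

```python
def fill_missing(counts):
    """Linearly interpolate None entries between the nearest known neighbours."""
    filled = list(counts)
    known = [i for i, value in enumerate(counts) if value is not None]
    for i, value in enumerate(counts):
        if value is not None:
            continue
        before = max((k for k in known if k < i), default=None)
        after = min((k for k in known if k > i), default=None)
        if before is None or after is None:
            continue  # no record on one side: leave the gap untouched
        weight = (i - before) / (after - before)
        filled[i] = counts[before] + weight * (counts[after] - counts[before])
    return filled


# Hypothetical hourly volumes with missing records marked as None.
hourly_volume = [120, None, 180, None, None, 90]
print(fill_missing(hourly_volume))
```

Gaps at the very start or end of a series have a record on only one side and are left as they are in this sketch.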

Traffic crashes are rare events. Severe crashes are even rarer and, on top of that, a severe crash that involves an aging driver is extremely scarce. On average, 0.01% of the crashes are aging-driver involved severe crashes. Therefore, performing the analyses on this amount of data can return unreliable results. In Section 3.4.3, we have tried to address some difficulties in fitting the models. The chi-square goodness-of-fit tests do not reject the fitted models at α = 0.05; however, some p-values are only slightly higher than 0.05. This issue is mostly observed in the severity analyses of aging driver-involved crashes. In a goodness-of-fit test, failing to reject the null hypothesis does not necessarily imply that the fit of the model is good. In other words, failing to reject the hypothesis means that we did not find evidence to conclude that the observed proportions are significantly different from the specified proportions, which is not a strong finding in support of the hypothesized model. Therefore, in order to have a good fit we usually seek relatively high p-values. Some of the p-values we obtained with the crash models are only slightly larger than the significance level, α = 0.05, and interpreting those models can lead to wrong conclusions. For this reason we mostly focus on interpreting the well-fitted models (with p-values around 0.5), with some general comparisons to the marginally fitted models. In order to address this issue, we concatenated the crash severity datasets from all three locations in each area and also used the data for three years. We found that at least three years of crash severity data are needed for an acceptable model fit. In the frequency models, acceptable models for the all-age group were obtained even using a single segment and a single year. As shown in Table 5, Pearson p-values for the all-age-group models for Miami segments range between 0.20 and 0.43.
For aging driver models, the Pearson p-values range between 0.13 and 0.19. Therefore, for aging-driver models, we recommend spatial or temporal aggregation to improve the fit. This issue was more pronounced in the severity models, for which the datasets were much smaller than for the frequency models. As shown in Table 14, for the all-age group, both spatial and temporal aggregation were required to provide an acceptable fit, with a Pearson p-value of 0.44. For the aging-driver model, however, the spatial and temporal aggregation (3 segments and 3 years) provided a p-value of only 0.1. Therefore, for aging severity models, we recommend using at least 5 years of data and larger geographical areas. There are many factors that can influence crash frequency or severity. We considered as many variables as we had at hand. However, many other factors, such as vehicle technical aspects and cognitive and behavioral factors, have also been shown to be significant for crash frequency and severity. Considering all these factors could yield models with a better fit to the data. Although some factors may alter the probability of crash occurrence or severity, the randomness of traffic crashes is high. In other words, if two crash observations share a certain configuration of input values, the models will predict a crash for that same configuration in the prediction phase. However, due to the erratic behavior of crashes, a crash may not actually occur in the real world. So even if we consider the maximum number of input variables, fitting a perfect model seems impossible. Along with the erratic nature of crash data, the latter two issues lead to another limitation in prediction analysis: the lack of sufficient data to test the predictive capability of the model. This results in poor crash prediction, which, in the case of aging drivers, is very close to random guessing. The reason can be either the different behavior of severe crashes involving aging drivers or the lack of sufficient data. Comparing the severity results of Miami to Jacksonville, and of 2011 to three years, we may argue that the latter reason seems more realistic. Therefore, we suggest considering more observations both spatially and temporally.

CHAPTER 4

CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS

In this thesis, we present a logistic regression-based methodology to study the effect of traffic-related and environmental factors, such as light and weather conditions, on the occurrence and severity of aging driver-involved traffic crashes, with a case study in the State of Florida. This methodology is supported by GIS-based visual illustrations. We also show how to quantify the effect of flow on crash frequency and severity with respect to various factors with the help of logit curves. With respect to the crash frequency-focused results, several important conclusions are drawn. First of all, we incorporate the hourly real-time traffic flow instead of AADT, and demonstrate the significance of using the actual traffic flow during both peak and off-peak hours. Traffic volume during hours with a crash is also found to be higher than during hours without a crash, regardless of the light conditions. The crash frequency is also higher during the daytime. Furthermore, the effect of precipitation and surface condition is not significant for either of the sites studied in the Miami metropolitan area. On the other hand, results indicate that light condition and peak hours have a significant effect on aging drivers, increasing the chance of having a crash. When the traffic volume increases, the crash frequency for aging-involved crashes increases at a slower pace than that for all age groups. Severity-focused analyses have also led to several important conclusions. For example, according to Table 11, crash severity for all-age drivers increases on roadways with narrower shoulders and at night, whereas these effects are weaker or not significant for aging drivers. Furthermore, driving at night in Jacksonville seems to be problematic for both age classes, whereas that risk is lower for the Miami locations. Greater roadway surface width also appears to increase the chance of having a severe crash for aging drivers. DUI-influenced crashes have also been found to be considerably more frequent in Miami.
This problem actually seems critical for both the crash frequency and severity models. These conclusions are drawn based on spatially and temporally aggregated data for the Miami and Jacksonville areas (between the years of 2010 and 2012) to capture the temporal differences (annual variations) and spatial differences (roadway design attributes). We have also analyzed the sensitivity of model goodness of fit and prediction capability for the Miami locations, and observed that the aggregated model provided more reasonable results than the single-year models. As mentioned earlier, the fitted models for aging drivers show only a marginal goodness of fit. A deeper study incorporating more factors and observations is therefore required to support the discussed arguments with more certainty. We have also employed receiver operating characteristic (ROC) curves to demonstrate the prediction capability of the models. For both crash frequency and crash severity models, a positive relationship between the number of observations and the performance of the ROC curves (in terms of the area under the curve) is observed. Although all AUCs are above 0.50, the prediction capability of the logit approach is not quite satisfactory and more advanced approaches are needed for a better prediction analysis.

This research also presents methods not only for fitting logistic regression models but also for model interpretation and validation, in order to confirm the applicability of the fitted models in prediction. It is shown how logistic regression plots can be used to interpret the crash frequency and severity of the different age groups as a function of traffic flow, and how users can make sure that the models obtained provide meaningful predictions. The proposed practical and easily applicable methodology, supported by the easy-to-interpret crash probability curves and sound validation results, can be incorporated into transportation safety plans. Based on the predicted probability of crash occurrence from the fitted models, transportation officials can become aware of the various aging-related crash problems that can occur on the roadways. Practitioners and engineers can also use the proposed methodology to create logit graphs for different segments of the same highway in order to determine (a) which roadway segments are more dangerous (classified as hotspots and high crash probability locations), and (b) which roadways are more sensitive to changes in traffic under different light or weather conditions. By concentrating on these high crash probability locations for both aging and overall populations, transportation officials can take mitigation and recovery actions to ensure the safety of the public. These actions can include, but are not limited to, providing better engineering countermeasures, such as traffic signs/signals, diversion, intelligent transportation systems, and other design and construction-related actions, as well as routing and trip scheduling. Studying the effect of average hourly speed on the frequency and severity of aging-involved crashes is an interesting area of future work. Adding cognitive and mental factors to the models may reveal interesting outcomes. The presented methods can also be implemented with a focus on aging occupants in the vehicles during the crash, in addition to the drivers. In terms of modeling and methodology, artificial intelligence methods, such as meta-heuristics, neural networks, fuzzy logic or rough sets, can be considered for factor analysis or prediction. In addition, statistical classification methods, clustering, Bayesian approaches and learning algorithms can be implemented as well. This study can also be expanded to other roadways in Florida.

APPENDIX A

KERNEL DENSITY FUNCTION FOR CRASH SEVERITY (JAX)

[Figure panels: JAX1, JAX2 and JAX3, each shown for 2010, 2011 and 2012]

APPENDIX B

KERNEL DENSITY FUNCTION FOR CRASH SEVERITY (MIAMI)

[Figure panels: MIAMI1, MIAMI2 and MIAMI3, each shown for 2010, 2011 and 2012]

REFERENCES

1. ECMT, Managing urban traffic congestion. 2007: European conference of ministers of transport.

2. National Highway Traffic Safety Administration, Traffic Safety Facts 2010: A Compilation of Motor Vehicle Crash Data from the Fatality Analysis Reporting System and the General Estimates System. 2012. DOT HS 811 659.

3. National Highway Traffic Safety Administration, Traffic Safety Facts 2012: A Compilation of Motor Vehicle Crash Data from the Fatality Analysis Reporting System and the General Estimates System. 2014. DOT HS 812 032.

4. Lum, H. and J.A. Reagan, Interactive Highway Safety Design Model: Accident Predictive Module. Public Roads, 1995. 59(2).

5. Lord, D. and F. Mannering, The statistical analysis of crash-frequency data: a review and assessment of methodological alternatives. Transportation Research Part A: Policy and Practice, 2010. 44(5): p. 291-305.

6. JMP, A. and M. Proust, Modeling and Multivariate Methods. 2012.

7. Al-Ghamdi, A.S., Using logistic regression to estimate the influence of accident factors on accident severity. Accident Analysis & Prevention, 2002. 34(6): p. 729-741.

8. Dissanayake, S. and J.J. Lu, Factors influential in making an injury severity difference to older drivers involved in fixed object–passenger car crashes. Accident Analysis & Prevention, 2002. 34(5): p. 609-618.

9. Tay, R., S.M. Rifaat, and H.C. Chin, A logistic model of the effects of roadway, environmental, vehicle, crash and driver characteristics on hit-and-run crashes. Accident Analysis & Prevention, 2008. 40(4): p. 1330-1336.

10. Sze, N.-N. and S. Wong, Diagnostic analysis of the logistic model for pedestrian injury severity in traffic crashes. Accident Analysis & Prevention, 2007. 39(6): p. 1267-1278.

11. Shankar, V., F. Mannering, and W. Barfield, Statistical analysis of accident severity on rural freeways. Accident Analysis & Prevention, 1996. 28(3): p. 391-401.

12. Shankar, V. and F. Mannering, An exploratory multinomial logit analysis of single-vehicle motorcycle accident severity. Journal of Safety Research, 1996. 27(3): p. 183-194.

13. Milton, J.C., V.N. Shankar, and F.L. Mannering, Highway accident severities and the mixed logit model: an exploratory empirical analysis. Accident Analysis & Prevention, 2008. 40(1): p. 260-266.

14. Poch, M. and F. Mannering, Negative binomial analysis of intersection-accident frequencies. Journal of Transportation Engineering, 1996. 122(2): p. 105-113.

15. Shankar, V., F. Mannering, and W. Barfield, Effect of roadway geometrics and environmental factors on rural freeway accident frequencies. Accident Analysis & Prevention, 1995. 27(3): p. 371-389.

16. Abdel-Aty, M.A. and A.E. Radwan, Modeling traffic accident occurrence and involvement. Accident Analysis & Prevention, 2000. 32(5): p. 633-642.

17. Skabardonis, A., T. Chira-Chavala, and D. Rydzewski, The I-880 field experiment: effectiveness of incident detection using cellular phones. California PATH Research Report, 1998. UCB-ITS-PRR-98-1.

18. Aguero-Valverde, J. and P.P. Jovanis, Spatial analysis of fatal and injury crashes in Pennsylvania. Accident Analysis & Prevention, 2006. 38(3): p. 618-625.

19. Chowdhury, S., M. Abdel-Aty, and K. Choi, Macroscopic spatial analysis of pedestrian and bicycle crashes. Accident Analysis & Prevention, 2012. 45: p. 382-391.

20. Lord, D., A. Manar, and A. Vizioli, Modeling crash-flow-density and crash-flow-V/C ratio relationships for rural and urban freeway segments. Accident Analysis & Prevention, 2005. 37(1): p. 185-199.

21. Plug, C., J.C. Xia, and C. Caulfield, Spatial and temporal visualisation techniques for crash analysis. Accident Analysis & Prevention, 2011. 43(6): p. 1937-1946.

22. Li, L., L. Zhu, and D.Z. Sui, A GIS-based Bayesian approach for analyzing spatial–temporal patterns of intra-city motor vehicle crashes. Journal of Transport Geography, 2007. 15: p. 274-285.

23. Abdel-Aty, M., Analysis of driver injury severity levels at multiple locations using ordered probit models. Journal of Safety Research, 2003. 34(5): p. 597-603.

24. Abdel-Aty, M.A. and H.T. Abdelwahab, Predicting injury severity levels in traffic crashes: a modeling comparison. Journal of Transportation Engineering, 2004. 130(2): p. 204-210.

25. Lee, J., B. Nam, and M. Abdel-Aty, Effects of pavement surface conditions on traffic crash severity. Journal of Transportation Engineering, 2015. 141(10): p. 04015020.

26. Karlaftis, M.G. and I. Golias, Effects of road geometry and traffic volumes on rural roadway accident rates. Accident Analysis & Prevention, 2002. 34(3): p. 357-365.

27. Golob, T.F. and W.W. Recker, Relationships among urban freeway accidents, traffic flow, weather, and lighting conditions. Journal of Transportation Engineering, 2003. 129(4): p. 342-353.

28. Ahmed, M., M. Abdel-Aty, and R. Yu, Assessment of interaction of crash occurrence, mountainous freeway geometry, real-time weather, and traffic data. Transportation Research Record: Journal of the Transportation Research Board, 2012(2280): p. 51-59.

29. Abdel-Aty, M.A. and R. Pemmanaboina, Calibrating a real-time traffic crash-prediction model using archived weather and ITS traffic data. IEEE Transactions on Intelligent Transportation Systems, 2006. 7(2): p. 167-174.

30. Abdel-Aty, M., et al., Predicting freeway crashes from loop detector data by matched case- control logistic regression. Transportation Research Record: Journal of the Transportation Research Board, 2004(1897): p. 88-95.

31. Alam, B. and L. Spainhour, Contribution of behavioral aspects of older drivers to fatal traffic crashes in Florida. Transportation Research Record: Journal of the Transportation Research Board, 2008(2078): p. 49-56.

32. Alam, B.M. and L. Spainhour. Logit and Case-Based Analysis of Drivers' Age as a Contributing Factor for Fatal Traffic Crashes on Highways and State Roads in Florida. in Transportation Research Board 93rd Annual Meeting. 2014.

33. Staplin, L., et al., Taxonomy of older driver behaviors and crash risk. 2012.

34. AASHTO, AASHTO Strategic Highway Safety Plan. American Association of State Highway and Transportation Officials, Washington, DC, 1998.

35. Mannering, F.L. and C.R. Bhat, Analytic methods in accident research: Methodological frontier and future directions. Analytic Methods in Accident Research, 2014. 1: p. 1-22.

36. Cortes, C. and V. Vapnik, Support-vector networks. Machine Learning, 1995. 20(3): p. 273-297.

37. Li, X., et al., Predicting motor vehicle crashes using support vector machine models. Accident Analysis & Prevention, 2008. 40(4): p. 1611-1618.

38. Yu, R. and M. Abdel-Aty, Utilizing support vector machine in real-time crash risk evaluation. Accident Analysis & Prevention, 2013. 51: p. 252-259.

39. Chen, C., et al., Investigating driver injury severity patterns in rollover crashes using support vector machine models. Accident Analysis & Prevention, 2016. 90: p. 128-139.

40. Li, Z., et al., Using support vector machine models for crash injury severity analysis. Accident Analysis & Prevention, 2012. 45: p. 478-486.

41. Wang, W., C. Liu, and D. Chen. Predicting driver injury severity in freeway rear-end crashes by support vector machine. in Transportation, Mechanical, and Electrical Engineering (TMEE), 2011 International Conference on. 2011. IEEE.

42. Dong, N., H. Huang, and L. Zheng, Support vector machine in crash prediction at the level of traffic analysis zones: Assessing the spatial proximity effects. Accident Analysis & Prevention, 2015. 82: p. 192-198.

43. Washington, S.P., M.G. Karlaftis, and F.L. Mannering, Statistical and econometric methods for transportation data analysis. 2010: CRC Press.

44. Mahalel, D., A. Hakkert, and J.N. Prashker, A system for the allocation of safety resources on a road network. Accident Analysis & Prevention, 1982. 14(1): p. 45-56.

45. Zadeh, L.A., Fuzzy logic, neural networks, and soft computing. Communications of the ACM, 1994. 37(3): p. 77-85.

46. Yang, X.-S., Nature-inspired metaheuristic algorithms. 2010: Luniver Press.

47. Omidvar, A. and R. Tavakkoli-Moghaddam. Sustainable vehicle routing: Strategies for congestion management and refueling scheduling. in Energy Conference and Exhibition (ENERGYCON), 2012 IEEE International. 2012. IEEE.

48. Amini, A., R. Tavakkoli-Moghaddam, and A. Omidvar, Cross-docking truck scheduling with the arrival times for inbound trucks and the learning effect for unloading/loading processes. Production & Manufacturing Research, 2014. 2(1): p. 784-804.

49. Zadeh, L.A., Fuzzy sets. Information and Control, 1965. 8(3): p. 338-353.

50. Engelbrecht, A.P., Computational intelligence: an introduction. 2007: John Wiley & Sons.

51. Mussone, L., A. Ferrari, and M. Oneta, An analysis of urban collisions using an artificial intelligence model. Accident Analysis & Prevention, 1999. 31(6): p. 705-718.

52. Chang, L.-Y., Analysis of freeway accident frequencies: negative binomial regression versus artificial neural network. Safety Science, 2005. 43(8): p. 541-557.

53. Zeng, Q. and H. Huang, A stable and optimized neural network model for crash injury severity prediction. Accident Analysis & Prevention, 2014. 73: p. 351-358.

54. Abdel-Aty, M. and A. Pande, Identifying crash propensity using specific traffic speed conditions. Journal of Safety Research, 2005. 36(1): p. 97-108.

55. Abdelwahab, H. and M. Abdel-Aty, Development of artificial neural network models to predict driver injury severity in traffic accidents at signalized intersections. Transportation Research Record: Journal of the Transportation Research Board, 2001(1746): p. 6-13.

56. Xie, Y., D. Lord, and Y. Zhang, Predicting motor vehicle collisions using Bayesian neural network models: An empirical analysis. Accident Analysis & Prevention, 2007. 39(5): p. 922-933.

57. Huang, H., et al., Predicting crash frequency using an optimized radial basis function neural network model. Transportmetrica A: Transport Science, 2016(just-accepted): p. 1-24.

58. Chiou, Y.-C., An artificial neural network-based expert system for the appraisal of two-car crash accidents. Accident Analysis & Prevention, 2006. 38(4): p. 777-785.

59. Delen, D., R. Sharda, and M. Bessonov, Identifying significant predictors of injury severity in traffic accidents using a series of artificial neural networks. Accident Analysis & Prevention, 2006. 38(3): p. 434-444.

60. Karlaftis, M. and E. Vlahogianni, Statistical methods versus neural networks in transportation research: differences, similarities and some insights. Transportation Research Part C: Emerging Technologies, 2011. 19(3): p. 387-399.

61. Sohn, S.Y. and S.H. Lee, Data fusion, ensemble and clustering to improve the classification accuracy for the severity of road traffic accidents in Korea. Safety Science, 2003. 41(1): p. 1-14.

62. Chang, L.-Y. and H.-W. Wang, Analysis of traffic injury severity: An application of non-parametric classification tree techniques. Accident Analysis & Prevention, 2006. 38(5): p. 1019-1027.

63. Chang, L.-Y. and W.-C. Chen, Data mining of tree-based models to analyze freeway accident frequency. Journal of Safety Research, 2005. 36(4): p. 365-375.

64. Gang, R. and Z. Zhuping, Traffic safety forecasting method by particle swarm optimization and support vector machine. Expert Systems with Applications, 2011. 38(8): p. 10420-10424.

65. Xu, C., W. Wang, and P. Liu, A genetic programming model for real-time crash prediction on freeways. IEEE Transactions on Intelligent Transportation Systems, 2013. 14(2): p. 574-586.

66. Srinivasan, D., W.H. Loo, and R.L. Cheu. Traffic incident detection using particle swarm optimization. in Swarm Intelligence Symposium, 2003. SIS'03. Proceedings of the 2003 IEEE. 2003. IEEE.

67. FDOT, Available: http://www.safeandmobileseniors.org/floridacoalition.htm (Accessed: Jan 2015). 2015.

68. Grant, R.J., et al., A comparison of data sources for motor vehicle crash characteristic accuracy. 2000.

69. Farmer, C.M., Reliability of police-reported information for determining crash and injury severity. 2003.

70. Omidvar, A., et al., Understanding the factors affecting the frequency and severity of aging population-involved crashes in Florida. Advances in Transportation Studies, 2016(2).

71. Omidvar, A., et al., Effect of Traffic Patterns on the Frequency of Aging-Driver-Involved Highway Crashes: A Case Study on the Interstate-95 in Florida, in Transportation Research Board 95th Annual Meeting. 2016.

72. Tobin, J., Estimation of relationships for limited dependent variables. Econometrica: Journal of the Econometric Society, 1958. 26(1): p. 24-36.

73. Lord, D., S.P. Washington, and J.N. Ivan, Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory. Accident Analysis & Prevention, 2005. 37(1): p. 35-46.

74. Montgomery, D.C., E.A. Peck, and G.G. Vining, Introduction to linear regression analysis. 2015: John Wiley & Sons.

75. Charness, N., Impact of technology on successful aging. 2003: Springer Publishing Company.

76. Chen, C., et al., A multinomial logit model-Bayesian network hybrid approach for driver injury severity analyses in rear-end crashes. Accident Analysis & Prevention, 2015. 80: p. 76-88.

77. de Oña, J., et al., Analysis of traffic accidents on rural highways using Latent Class Clustering and Bayesian Networks. Accident Analysis & Prevention, 2013. 51: p. 1-10.

78. Arvan, M., A. Omidvar, and R. Ghodsi, Intellectual capital evaluation using fuzzy cognitive maps: A scenario-based development planning. Expert Systems with Applications, 2016. 55: p. 21-36.

79. Razmi, J., M. Arvan, and A. Omidvar, A hybrid AHP-FCM model for backup supplier selection in presence of disruption risk. International Journal of Decision Sciences, Risk and Management, 2014. 5(3): p. 213-233.

BIOGRAPHICAL SKETCH

Aschkan Omidvar is a Master of Science student in the Department of Industrial and Manufacturing Engineering at Florida State University. He holds an M.Sc. degree in industrial and systems engineering from the University of Tehran. He has worked on various projects and has published 10 papers in prestigious international journals and conferences, with a focus on operations research and optimization modeling as well as traffic safety and prediction analysis. He will join the Department of Civil and Coastal Engineering at the University of Florida to pursue his PhD.

Education

Current 2nd M.Sc.: Industrial Engineering. Full-Time Research Assistant at the Center for Accessibility and Safety for an Aging Population (ASAP), Dept. of Civil Eng., Florida State University, Tallahassee, FL, USA. GPA: 4.0/4.0

Oct. 2012 M.Sc.: Industrial and Systems Engineering. Department of Industrial and Systems Engineering, College of Engineering, University of Tehran, Tehran, Iran. GPA: 3.93/4.0; Major GPA: 4.0/4.0 (12 credit hours)

Related Publications to This Thesis

Omidvar, Aschkan, Eren Erman Ozguven, O. Arda Vanli and Ren Moses. "Frequency and Severity of Aging-involved Accidents in Florida." Road Safety & Simulation (RSS, 2015) Int. Conference, Orlando, Florida, 2015.

Omidvar, Aschkan, Eren Erman Ozguven and O. Arda Vanli. "Effect of Traffic Patterns on the Frequency of Aging Driver-involved Highway Crashes: A Case Study on the Interstate-95 in Florida." Transportation Research Board (TRB), Washington D.C., 2016.

Omidvar, Aschkan, Eren Erman Ozguven, O. Arda Vanli and Ren Moses. "Understanding the Factors Affecting Frequency and Severity of Aging-involved Accidents in Florida." Advances in Transportation Studies, an International Journal, 2016.
