Florida State University Libraries
2016 Aging Driver Focused Traffic Crash Frequency and Severity Analyses Aschkan Omidvar
FLORIDA STATE UNIVERSITY
FAMU-FSU COLLEGE OF ENGINEERING
AGING DRIVER FOCUSED TRAFFIC CRASH FREQUENCY AND SEVERITY ANALYSES
By ASCHKAN OMIDVAR
A Thesis submitted to the Department of Industrial & Manufacturing Engineering in partial fulfillment of the requirements for the degree of Master of Science
2016
© 2016 Aschkan Omidvar
Aschkan Omidvar defended this thesis on July 12, 2016.
The members of the supervisory committee were:
O. Arda Vanli Professor Co-Directing Thesis
Eren Erman Ozguven Professor Co-Directing Thesis
Abhishek K. Shrivastava Committee Member
Chiwoo Park Committee Member
The Graduate School has verified and approved the above-named committee members, and certifies that the thesis has been approved in accordance with university requirements.
Dedicated to those who made me a better man…
ACKNOWLEDGMENTS
I would like to express my sincere gratitude to my advisors Dr. Arda Vanli and Dr. Eren Erman Ozguven for providing me the opportunity to come to Florida State University and be part of their research group. I will forever be indebted to them for the guidance, mentorship and support throughout my studies. The past two years of research experience under their guidance is a tremendous asset to my career and life.
I am also thankful to my MS committee members, Dr. Chiwoo Park and Dr. Shrivastava, whose insightful advice and comments have served as valuable input for the engineering and scientific significance of this research.
This thesis was supported by United States Department of Transportation grant DTRT13-G-UTC42, and administered by the Center for Accessibility and Safety for an Aging Population (ASAP) at the Florida State University (FSU), Florida A&M University (FAMU), and University of North Florida (UNF). We also thank the Florida Department of Transportation and National Oceanic and Atmospheric Administration for providing the data. The opinions, results, and findings expressed in this manuscript are those of the authors and do not necessarily represent the views of the United States Department of Transportation, the Florida Department of Transportation, the National Oceanic and Atmospheric Administration, the Center for Accessibility and Safety for an Aging Population, the Florida State University, the Florida A&M University, or the University of North Florida.
TABLE OF CONTENTS
List of Figures
List of Tables
List of Abbreviations
Abstract
1. INTRODUCTION
2. LITERATURE REVIEW
  2.1 Traffic Crash Analysis: Statistical Tools
    2.1.1 Logistic Regression
    2.1.2 Poisson and Negative Binomial
    2.1.3 Statistical Learning Methods
  2.2 Traffic Crash Analysis: Computational Intelligence
    2.2.1 Traffic Crash Analysis: Neural Networks
    2.2.2 Traffic Crash Analysis: Other Approaches
3. RESEARCH METHODOLOGY AND RESULTS
  3.1 Data Collection and Pre-Processing
    3.1.1 Statewide FDOT Crash Dataset
    3.1.2 Meteorological Data
    3.1.3 Hourly Traffic Flow
  3.2 Exploratory Descriptive Analysis
  3.3 Correlation Analysis for the Entire Corridor
  3.4 Logistic Regression Analysis
    3.4.1 Logistic Regression Analysis of Crash Frequency for Roadway Segments
    3.4.2 Exploratory Analysis for Crash Severity
    3.4.3 Logistic Regression Analysis of Crash Severity for Roadway Segments
  3.5 Prediction Capabilities using ROC Curves
    3.5.1 Crash Frequency Prediction Analysis
  3.6 Research Limitations
4. CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS
APPENDICES
  A. KERNEL DENSITY FUNCTION FOR CRASH SEVERITY (JAX)
  B. KERNEL DENSITY FUNCTION FOR CRASH SEVERITY (MIAMI)
References
Biographical Sketch
LIST OF FIGURES
1 Research Methodology
2 I-95 Corridor and TTMS Stations in Florida
3 A Sample of FTI User Interface
4 Output Database Processed for the Research
5 Time Series Plots for Traffic Flow: North and South Directions and April and Summer Months Hourly Averages
6 Time Series Plot of Hourly Crash Counts for All and Aging Drivers
7 Histograms of Crash Flows for All and Aging Drivers in Miami
8 Flow Dot-plots of Flow Data for Different Light Conditions (a) Miami, (b) Jacksonville
9 Traffic Characteristics for the I-95 Corridor in Florida, (a) Correlation Matrix for The AADT vs. Roadway Width for All Segments, (b) AADT, and Crashes for All Age Groups and Aging-Involved Crash Frequencies Versus Distance (miles)
10 Logit Curves for Crash Probability vs. Flow, (a) Miami Station 2 All Drivers, (b) Miami Station 2 Aging Drivers
11 (a) Crash Severity Kernel Density Maps for the Miami 1-location in (b) 2010, (c) 2011, (d) 2012
12 Kernel Density for Miami 1 for Year 2010 and Comparison to The Scaled Density Values on The Highway for The Years 2010, 2011 and 2012
13 Kernel Density Function on a 2-D Plane for 3 years. X: Segment Length. Y: Normalized Kernel Raster Value
14 Delta-Beta Residual Analysis for Outlier Detection and Removal (Node Miami 1). The Right-Hand Side Figure Is The Expanded Version of The Box on Left-Hand Side Figure Shown After Outlier Handling
15 A Schematic Illustration of Aggregation Approach
16 (left) A Sample Frequency Table for Observation vs. Prediction Generated by Algorithm, (right) ROC Curve for Crash Frequency Model for Node Miami1-All Age Groups
17 Receiver Operating Characteristic Plots for Crash Severity Prediction by Logit for Miami-All Ages (a), Miami-Aging Drivers (b), Jacksonville-All Ages (c) and Jacksonville-Aging Drivers (d). Values at The Lower Right Corner of The Plots Show Area Under Curves (AUC)
18 Comparison of ROC Curves for Spatially Aggregated Model of Miami with Data for 2010-2012 Vs. Spatially Aggregated Model for 2011
LIST OF TABLES
1 FDOT Crash Data Attributes
2 Summary of Crashes According to Age Groups, Location, Time of Day and Weather Conditions
3 Deviance Goodness of Fit Test Results for Different Models (LoF: Lack of Fit)
4 Fitted Regression Models. The Values on Top Are P-Values and The Underlined and Bolded Italic Numbers Are Coefficient Estimates
5 Goodness of Fit p-values between Observed Data and Predicted Values
6 Fitted Regression Models (Crash Severity). The Values Underlined in Italic Are P-Values and Other Numbers Are Coefficients. Results Are Significant Only for Miami 1 And Miami 2 Locations
7 Crash Attributes for Crash Severity
8 Candidate Factor for Regression Analysis of Crash Severity
9 Number of Observations in Severity Data and Test Data sets
10 Crash Severity Involvement by Age (All numbers are in percent). The number of crashes experienced by an age group is shown as a percentage of total number of crashes in a location
11 Fitted Regression Models for Crash Severity Analysis: Locations in Miami and Jacksonville Are Temporally Aggregated for 3 Years (2010-2012)
12 Logistic Regression Analyses for Aggregated Models of Three Years vs. Aggregated Model of 2011
13 Schematic Observation Frequency Table for Observation Vs. Prediction
14 Performance Comparison of Temporally Aggregated Model for Miami vs. Miami-2011
LIST OF ABBREVIATIONS
AADT: Annual Average Daily Traffic
ANN: Artificial Neural Network
ART: Adaptive Resonance Theory
AVI: Automatic Vehicle Identification
BNN: Bayesian Neural Networks
BPNN: Back Propagation Neural Networks
CART: Classification and Regression Tree
CC: Convex Combination
CFS: Correlation Based Feature Selector
CI: Computational Intelligence
DHSMV: Department of Highway Safety and Motor Vehicles
EC: Evolutionary Computation
FAWN: Florida Automated Weather Network
FB: Full Bayes
FDOT: Florida Department of Transportation
FL: Fuzzy Logic
GIS: Geographical Information System
GP: Goal Programming
HL: Hosmer-Lemeshow
ML: Machine Learning
MLP: Multi-Layer Perceptron
N2PFA: NN Pruning for Function Approximation
NB: Negative Binomial
NC: Neural Computing
NLCCA: Nonlinear (Nonparametric) Canonical Correlation Analysis
NOAA: National Oceanic and Atmospheric Administration
OP: Ordered Probit
PNN: Probabilistic Neural Network
PTMS: Portable Traffic Monitoring Sites
RBF: Radial Basis Function
RF: Random Forest
ROC: Receiver Operating Characteristic
SVC: Single Vehicle Crash
SVM: Support Vector Machines
TAZ: Traffic Analysis Zone
TTMS: Telemetered Traffic Monitoring Sites
USNO: United States Naval Observatory
ABSTRACT
The aim of this thesis is to investigate the effect of environmental and traffic-related factors on the frequency and severity of traffic crashes across different age groups, with special attention to the aging population. Existing studies in traffic safety have not specifically focused on aging driver-involved crashes. It is well known that aging drivers are more vulnerable to roadway crashes than other adult age groups due to their cognitive, physical and health limitations. This problem becomes more challenging due to the drastic variation in traffic patterns, especially on major highways. In this thesis, several datasets from different sources, such as the National Oceanic and Atmospheric Administration (NOAA), the Florida Automated Weather Network (FAWN), the Florida Department of Transportation (FDOT) and the United States Naval Observatory (USNO), are collected, refined and combined. With the aid of statistical correlation analysis and logistic regression, a top-down analysis is performed to analyze the occurrence of crashes via a case study application on the I-95 highway corridor in the State of Florida. Using logit curves, a sensitivity analysis is carried out to quantify the effect of traffic volume on crash frequency. In addition to the crash frequency analysis, factors influencing crash severity are also analyzed in an aggregated manner for two metropolitan areas, Jacksonville and Miami, Florida. Both the frequency- and severity-focused analyses have led to several important conclusions. Results suggest that the variation in hourly traffic volume significantly affects crash occurrence for both aging drivers and drivers of all ages, depending on the geographical location; however, in congested locations, crash occurrence for aging drivers is less sensitive to flow than that for all age groups.
Results indicate that, unlike those of aging drivers, crash severities for all other age groups decrease on roadways with narrower shoulders and at night. Furthermore, driving at night on I-95 in Jacksonville seems to be problematic for both age classes, whereas that risk is lower for Miami locations. Higher roadway surface width also appears to increase the chance of a severe crash for aging drivers. The number of DUI-influenced crashes was also found to be considerably high on the I-95 highway corridor in the City of Miami, Florida. This problem seems critical in terms of both crash frequency and severity. The proposed methodology can help transportation officials understand the nature of aging driver-involved crashes and formulate more effective safety-oriented decisions.
CHAPTER 1
INTRODUCTION
In 2013, 32,719 people died and approximately 2.3 million people were injured in motor vehicle crashes in the U.S. Traffic incidents also imposed an economic cost of $242.0 billion on the U.S. economy in 2010. In 2012, the number of police-reported vehicle crashes was over 5.5 million, in which 33,561 people were killed and over 2.3 million people were injured or paralyzed [1-3]. These figures show the importance of thoroughly studying the factors that affect the occurrence of crashes. In the literature, researchers classify crash-influencing factors into three main categories: (1) behavioral and human-related factors, (2) environmental factors and (3) vehicle-related factors. Recent literature indicates that approximately 80 to 95% of roadway crashes are driver-related [1, 4]. This shows the significant impact of cognitive limitations and driver errors, including reckless driving, fatigue and driving under the influence. In addition, these main factors may interact and cause crashes. For instance, harsh weather on a steep upgrade or on poor-quality pavement can make a driver prone to error, which may lead to a crash. Therefore, a robust crash prediction methodology should study the effect of these factors, as well as their interactions, on the likelihood of crash occurrence and on crash severity.
The objective of this thesis is to investigate the impact of environmental and traffic-related factors on the frequency and severity of highway crashes with a focus on different age groups, including the aging population. To the author’s knowledge, existing studies in traffic safety have not specifically focused on aging driver-involved crashes. Aging drivers are more vulnerable to roadway crashes than other adult age groups due to their cognitive, physical and health limitations. Therefore, we pay special attention to studying the frequency and severity of aging driver-involved crashes.
In this thesis, several datasets from different sources, such as the National Oceanic and Atmospheric Administration (NOAA), the Florida Automated Weather Network (FAWN), the Florida Department of Transportation (FDOT) and the United States Naval Observatory (USNO), are collected, refined and combined. With the aid of statistical correlation analysis and logistic regression, a top-down analysis is performed to analyze the occurrence of crashes via a case study application on the I-95 highway corridor in the State of Florida. Using logit curves, a sensitivity analysis is carried out to quantify the effect of traffic volume on crash frequency.
In Chapter 2, we perform a comprehensive review of the existing literature. Several core and related topics are covered in that chapter. In detail, we discuss studies that have examined the various factors affecting crash frequency and severity, followed by a review of studies that employed GIS tools to identify influencing factors and their correlations using different spatial and temporal analyses. We also focus on studies that have used support vector machines to analyze and predict the significant parameters in crash involvement. Finally, we survey the studies that have employed artificial intelligence for crash factor analysis and prediction.
In Chapter 3, we discuss the contribution of this thesis. We first present the overall research methodology, followed by the data collection, synchronization and processing approach. Next, we perform an exploratory descriptive analysis on the selected I-95 segments to get a better sense of the crash characteristics as well as the traffic flow behavior in those segments. Moreover, we investigate the trends and patterns of crash occurrence, and the frequency of crashes with respect to different flows. This extensive evaluation is followed by an analysis of the highway crashes on the I-95 corridor in the State of Florida. Here, we perform correlation and factor analyses on each Telemetered Traffic Monitoring Site (TTMS) station to understand the crash patterns of the entire corridor with respect to several roadway-related factors. Finally, we introduce the overall modelling approach.
To do so, we first conduct a survey of the most common and powerful statistical modelling approaches, and then select the one most suitable for this research, which is the logistic regression model. We use this model to find the significant factors affecting crash frequency on all selected I-95 segments in the metropolitan areas of Jacksonville and Miami. We measure the validity and goodness of fit of all the models through several statistical tests. We also propose logit curves to help policy makers and practitioners in the traffic safety decision-making process. In this thesis, we also focus on crash severity in addition to crash frequency. Therefore, we analyze the factors influencing crash severity in an aggregated manner for the two metropolitan areas of Jacksonville and Miami, and we conclude the chapter with some managerial insights and suggestions.
Chapter 4 presents the conclusions and future work directions.
CHAPTER 2
LITERATURE REVIEW
2.1 Traffic Crash Analysis: Statistical Tools
For decades, researchers in the field of crash analysis and traffic safety have tried to develop statistical and computational intelligence tools to identify the major factors in crashes and to predict the frequency and severity of future crashes. Statistical tools, such as generalized linear models and regression analyses, have been extensively studied in different branches of traffic safety [5]. Most studies in the field of traffic safety focus on two main problems: (1) crash hotspot analysis, which detects the locations with a high risk of crash occurrence and the corresponding significant factors affecting crash frequency and severity, and (2) crash prediction, trends and forecasting. Various studies have taken different approaches using statistical tools to identify, predict and analyze crashes spatially and temporally, and to investigate the risk and severity associated with each crash location. We present an extensive review of these studies in this section.
In this research, we review different modelling approaches available in the literature. Almost all modelling approaches can be classified under generalized linear models (GLM). In GLM, the response variable can have an error distribution other than the normal distribution, provided it belongs to the exponential family, such as the Poisson, gamma and binomial distributions. In fact, GLM generalizes linear regression by relating the linear predictor to the mean of the response variable through a link function. SAS Institute separates GLM from a similar approach, named Generalized Regression models [6]. These techniques try to fit better models by shrinking the model coefficients toward zero. Therefore, the resulting estimates are biased. This increase in bias may result in a reduction of prediction variance, and consequently a decrease in overall prediction error.
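As a compact reference for the two ideas in the paragraph above, the GLM structure and a penalized (shrinkage) fit can be written as follows. This is a generic textbook formulation, not notation taken from the thesis: the symbols y_i, mu_i, x_ij, beta_j, g, ell, lambda and P are introduced here only for illustration.

```latex
\[
\begin{aligned}
  y_i &\sim \text{a distribution in the exponential family with mean } \mu_i, \\
  g(\mu_i) &= \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
      && \text{(link function applied to the mean)}, \\
  g(\mu) &= \log\frac{\mu}{1-\mu}
      && \text{(logit link, binary response)}, \\
  g(\mu) &= \log \mu
      && \text{(log link, count response)}, \\
  \hat{\beta}_{\lambda} &= \arg\min_{\beta}\,\bigl\{ -\ell(\beta) + \lambda\, P(\beta) \bigr\}
      && \text{(penalized fit, } \lambda \ge 0\text{)}.
\end{aligned}
\]
```

Here ell is the log-likelihood and P is a penalty on the size of the coefficients. With lambda = 0 the last line reduces to ordinary maximum likelihood; larger values of lambda pull the coefficients toward zero, which is the source of the bias described above.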
Some of the most frequently used estimation methods are Maximum Likelihood, Forward Selection, Elastic Net, Lasso and Ridge Regression, where the Elastic Net and the Lasso methods perform variable selection as part of the modeling procedure. For more traditional modeling techniques, variable selection can be used when p > n, that is, when the number of predictors p exceeds the number of observations n. In addition, these techniques, specifically the Elastic Net and the Lasso, produce promising results for large-scale datasets and for those where collinearity is a significant issue. Several studies have utilized these techniques to analyze factors affecting crash frequency or severity, as discussed; however, to the best of the author’s knowledge, they mostly employed Maximum Likelihood for this purpose. The Maximum Likelihood method is a rather classical approach, and it is commonly used as a baseline for comparison with other techniques.
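To make the shrinkage behaviour of these estimators concrete, the sketch below uses the textbook special case of an orthonormal design matrix, where the Ridge, Lasso and Elastic Net estimates have closed forms in terms of the least-squares coefficient. The coefficient values and the penalty `lam` are hypothetical and chosen only for illustration; nothing here is fitted to the thesis data, and the function names are introduced for this example.

```python
def ridge_shrink(b_ols, lam):
    """Ridge estimate for 0.5*||y - Xb||^2 + 0.5*lam*||b||^2 with orthonormal X."""
    return b_ols / (1.0 + lam)


def lasso_shrink(b_ols, lam):
    """Lasso estimate (soft-thresholding) for 0.5*||y - Xb||^2 + lam*||b||_1
    with orthonormal X."""
    magnitude = max(abs(b_ols) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0  # coefficient is removed from the model
    return magnitude if b_ols > 0 else -magnitude


def elastic_net_shrink(b_ols, lam1, lam2):
    """Elastic Net estimate: soft-threshold by lam1, then scale by 1/(1 + lam2)."""
    return lasso_shrink(b_ols, lam1) / (1.0 + lam2)


# Hypothetical least-squares coefficients for four predictors.
ols_coefs = [2.5, -0.3, 0.8, 0.1]
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols_coefs]
lasso = [lasso_shrink(b, lam) for b in ols_coefs]
enet = [elastic_net_shrink(b, lam, lam) for b in ols_coefs]

for name, coefs in (("OLS", ols_coefs), ("Ridge", ridge),
                    ("Lasso", lasso), ("Elastic Net", enet)):
    print(f"{name:12s}", [round(c, 3) for c in coefs])
```

In this special case Ridge scales every coefficient down but keeps all of them, while the Lasso and the Elastic Net set the small coefficients exactly to zero, which is why they act as variable selection methods.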
2.1.1 Logistic Regression
Generalized Regression enables one to select from a variety of distributions based on the purpose of the research and the nature of the response variable in the analysis (e.g., categorical, continuous, etc.): exponential, gamma, Cauchy, Gaussian and negative binomial distributions, as well as zero-inflated versions, namely zero-inflated negative binomial, zero-inflated binomial, zero-inflated Poisson, zero-inflated beta binomial, and zero-inflated gamma. Although the negative binomial distribution is one of the most frequently implemented modelling approaches in the traffic safety field, it is not suitable for this research since the response variable in our data is binary rather than a count. In either case, Generalized Regression or GLM, the nature of the data determines the distribution of the model and the link function.
Numerous studies have applied logistic regression to crash analysis. Al-Ghamdi [7] applied logistic regression to accident-related data to examine the contribution of several variables to accident severity. He studied accident severity as a dichotomous variable with two outcomes, fatal and non-fatal. He tested nine independent variables, and found that two were most significantly associated with accident severity: location and cause of crashes. A similar study was carried out by Dissanayake and Lu [8] with a focus on aging drivers to identify factors influencing the severity of injury in fixed object–passenger car crashes in Florida. They developed two sets of models for driver injury severity and crash severity, where severity ranges from no injury to fatality. The fitted model was then used to identify the influence of factors such as roadway, environmental, vehicle, and driver-related
attributes on severity. They concluded that travel speed, restraint device usage, point of impact, use of alcohol and drugs, personal condition, gender, whether the driver is at fault, urban/rural nature and grade/curve existence at the crash location are the important factors for making a severity difference to aging drivers involved in single-vehicle crashes.
Furthermore, Tay et al. [9] employed logistic regression to focus on factors affecting hit-and-run crashes (leaving the scene of a crash without reporting it). They considered driver characteristics, vehicle types, crash characteristics, roadway features and environmental characteristics to distinguish the potential factors that contribute to hit-and-run crashes from those that contribute to non-hit-and-run ones. They suggested that drivers are more likely to run when crashes occurred at night, on a bridge or flyover, bend, straight road or near shop houses, or involved two-wheel vehicles and vehicles from neighboring countries. Male drivers, minority drivers, and drivers aged between 45 and 69 were also found to be more prone to commit hit-and-run crashes. On the other hand, collisions involving right-turn and U-turn maneuvers, and those occurring on undivided roads, were mostly non-hit-and-run crashes.
Sze et al. [10] focused on pedestrian-involved crashes. They evaluated the injury severity of pedestrian casualties and identified the factors contributing to mortality and severe injury in Hong Kong. They considered demographic, crash, environmental, geometric, and traffic characteristics of crashes and pedestrians, and used binary logistic regression to estimate the probability of fatality and severe injury. They revealed that there is a downward trend in pedestrian injury risk, controlling for the influences of mostly demographic and road environment factors. Moreover, they discussed that the effects of pedestrian behavior, traffic congestion, and junction type on pedestrian injury risk are subject to temporal variation.
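The binary logistic model used in the studies above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the procedure of any cited study: it fits a single-predictor model by plain gradient ascent on the log-likelihood, and the data (`predictor`, `fatal`) are hypothetical values invented for the example.

```python
import math


def sigmoid(z):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, n_iter=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient ascent on the
    mean log-likelihood. Returns the estimates (b0, b1)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        # Residuals y - p are the building block of the log-likelihood gradient.
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(residuals) / n
        b1 += learning_rate * sum(r * x for r, x in zip(residuals, xs)) / n
    return b0, b1


# Hypothetical data: a generic risk factor and a dichotomous outcome
# (1 = fatal, 0 = non-fatal). The classes overlap, so the estimates are finite.
predictor = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
fatal = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(predictor, fatal)
odds_ratio = math.exp(b1)  # multiplicative change in odds per unit of the predictor
p_low = sigmoid(b0 + b1 * 0.5)
p_high = sigmoid(b0 + b1 * 4.0)

print(f"intercept = {b0:.3f}, slope = {b1:.3f}, odds ratio = {odds_ratio:.3f}")
print(f"P(fatal | x = 0.5) = {p_low:.3f}, P(fatal | x = 4.0) = {p_high:.3f}")
```

The exponentiated slope is the odds ratio that such studies typically report: a value above 1 means that the odds of the severe outcome increase with the predictor. In practice a statistical package would be used, since it also provides standard errors and goodness-of-fit tests.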
Studies in this field are not limited to binary logistic regression. Some researchers used different variations of logistic regression depending on the purpose of the research and the format of the data. For instance, Shankar et al. [11] employed a nested logit formulation to determine crash severity in Washington State. The estimation results demonstrated evidence of the influence of environmental conditions, highway design, accident type, driver characteristics and vehicle attributes on crash severity, and they also concluded that nested logit is a suitable approach for this analysis. Shankar and Mannering [12] used multivariate analyses to eliminate the potential ambiguity and bias caused by univariate analyses in identifying the causes of motorcycle rider crash severity in single-vehicle collisions. Their findings suggested that the multinomial logit formulation is a promising approach to assess the determinants of motorcycle accident severity.
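For readers less familiar with the multinomial logit formulation mentioned above, the following sketch shows how it turns one linear predictor per severity level into outcome probabilities. The severity labels and the numeric values of the linear predictors are hypothetical and serve only to illustrate the formula; they are not estimates from any cited study.

```python
import math


def multinomial_logit_probs(linear_predictors):
    """Return P(k) = exp(V_k) / sum_j exp(V_j) for each outcome k."""
    largest = max(linear_predictors)
    # Subtracting the maximum leaves the probabilities unchanged
    # and avoids overflow in exp().
    exps = [math.exp(v - largest) for v in linear_predictors]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical linear predictors for one crash; "no injury" is the base outcome.
severity_levels = ["no injury", "injury", "fatality"]
linear_predictors = [0.0, -0.7, -2.5]

probs = dict(zip(severity_levels, multinomial_logit_probs(linear_predictors)))
for level, p in probs.items():
    print(f"{level:10s} {p:.3f}")
```

With only two outcomes this expression reduces to the binary logistic model, which is why the multinomial logit is usually presented as its direct extension.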
Mixed logit is another variation employed by Milton et al. [13] in this context. The approach they took allowed the estimated model parameters to vary randomly across roadway segments to account for unobserved effects potentially relating to factors such as roadway characteristics, environmental factors, and driver behavior. Their findings suggested that traffic volume-related variables, namely average daily traffic per lane, average daily truck traffic, truck percentage, interchanges per mile and weather effects, are best modeled as random parameters. However, some other parameters, such as roadway characteristics (the number of horizontal curves, the number of grade breaks per mile and pavement friction), were best modeled as fixed parameters. They finally suggested that the mixed logit model has considerable promise in highway safety.
2.1.2 Poisson and Negative Binomial
Different approaches have been used in the literature to identify, predict and analyze crash locations. For example, Poisson or Negative Binomial distributions were used to present an accident frequency-focused approach. One important assumption in a Poisson regression model is that the mean and variance of the number of crashes at a given road segment are equal. This assumption is found to be restrictive in many applications in which the variance is not necessarily equal to the mean. The Negative Binomial distribution employs a dispersion parameter that relaxes this assumption, so that in negative binomial regression the variance does not have to equal the mean.
Poch and Mannering [14] analyzed seven-year accident histories from 63 intersections in Bellevue, Washington, that were targeted for operational improvements. Their objective was to estimate the frequency of accidents at intersection approaches using a negative binomial regression, and they uncovered important interactions between traffic-related elements, accident frequencies and geometry. Shankar, Mannering and Barfield [15] studied the frequency of crash occurrence on highways based on a multivariate analysis of roadway geometrics (namely horizontal and vertical alignments), weather, and other seasonal effects. They estimated the overall crash frequencies using a negative binomial model along with models of the frequency of specific crash types. They also studied the interactions between weather and geometric variables, and uncovered several important determinants of accident frequency.
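The equal mean and variance assumption discussed at the start of this subsection can be checked with a simple dispersion ratio. The sketch below is illustrative only: the crash counts are hypothetical, and the negative binomial variance uses the common NB2 parameterization, Var = mu + alpha * mu^2, which is an assumption of this example rather than a form stated in the thesis.

```python
from statistics import mean, variance


def poisson_variance(mu):
    """Under a Poisson model the variance equals the mean."""
    return mu


def negative_binomial_variance(mu, alpha):
    """NB2 parameterization: the dispersion parameter alpha >= 0 inflates
    the variance above the mean; alpha = 0 recovers the Poisson case."""
    return mu + alpha * mu ** 2


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; values well above 1
    indicate overdispersion relative to the Poisson assumption."""
    return variance(counts) / mean(counts)


def moment_alpha(counts):
    """Method-of-moments estimate of alpha from Var = mu + alpha * mu^2."""
    mu = mean(counts)
    return max((variance(counts) - mu) / mu ** 2, 0.0)


# Hypothetical yearly crash counts on twelve road segments.
segment_crash_counts = [0, 0, 1, 0, 3, 0, 7, 1, 0, 2, 0, 10]

ratio = dispersion_ratio(segment_crash_counts)
alpha_hat = moment_alpha(segment_crash_counts)

print(f"mean = {mean(segment_crash_counts):.2f}, dispersion ratio = {ratio:.2f}")
print(f"moment estimate of alpha = {alpha_hat:.2f}")
```

For these made-up counts the sample variance is several times the mean, the kind of pattern for which a negative binomial model is usually preferred over a Poisson model.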
Abdel-Aty and Radwan [16] used the Negative Binomial distribution to model the frequency of accident occurrence using a 3-year data set, accounting for 1606 accidents on a principal arterial in Central Florida. In this research, Annual Average Daily Traffic (AADT), degree of horizontal curvature, lane, shoulder and median widths, urban/rural designation, and section length were shown to be significant factors affecting crash frequency. They also studied the demographic characteristics of the driver (age and gender), and concluded that heavy traffic volume, speeding, narrow lane width, a larger number of lanes, urban roadway sections, narrow shoulder width and reduced median width increase the likelihood of crash involvement. In addition, they found that male drivers experience more traffic crashes while speeding, whereas female drivers have a greater tendency to be involved in crashes under heavy traffic volume, narrow lane width, reduced median width, and a larger number of lanes. As for the age of the drivers, they indicated that young and older drivers experienced more crashes than middle-aged drivers under heavy traffic volume and reduced shoulder and median widths. Younger drivers were found to be more prone to crash involvement while turning and speeding.
There is also a growing interest in the use of Geographical Information Systems (GIS)-based spatial analysis to understand spatial patterns in crash data. Most of the studies in the literature have selected one portion of a roadway network (e.g., road segment, intersection, corridor), and studied the spatial patterns of crashes at different geographic levels, from Traffic Analysis Zones (TAZ) to census tracts, as well as at the county, state or national level [17].
For example, Valverde and Jovanis [18] conducted a spatial analysis for the state of Pennsylvania by comparing Negative Binomial (NB) and Full Bayes (FB) methods based on different factors such as weather, transportation infrastructure, traffic volume and socio-demographic characteristics. They concluded that the crash rate was higher in counties with higher poverty rates and traffic volumes, and that the crash rate increased for the following age groups: 10-14, 15-24, 65+. Several studies also focused on motor vehicle crashes involving pedestrians and cyclists [19, 20]. The development of GIS also allowed researchers to investigate spatial and temporal solutions to the problem both simultaneously and hierarchically. Although spatial-temporal (or spatio-temporal) analysis has been widely used in different sectors such as fire locations, biosurveillance and disease outbreak surveillance, its application to roadway crash pattern analysis is relatively limited. An interesting study was conducted by Plug et al. [21], who studied spatial,
temporal and spatio-temporal patterns of single vehicle crashes (SVC) based on 10 years of data (1998-2008) in Western Australia. They generated several spider graphs to demonstrate the temporal patterns of crashes with a focus on the hours of the day and days of the week. For the spatial pattern analysis, they implemented the Kernel approach, and they used the Comap method to conduct a spatio-temporal study. Using these methods, they generated maps and graphs to investigate the occurrence of crashes. Another interesting study in this area was conducted by Li et al. [22]. They analyzed the spatio-temporal crash risk patterns in Texas by creating a series of posterior risk maps indicating a relative risk degree for each intersection and segment. They used a Bayesian approach to rank and identify segments according to the assigned risk values, with the ability to forecast risks with high certainty even with insufficient data. Although this research successfully analyzed the crash patterns of different segments, it did not consider the severity of the crashes.
Several researchers have argued for the need to include environmental and/or traffic-related factors in order to investigate the severity of crash occurrences [15, 23-26]. Golob and Recker [27] conducted linear and non-linear canonical correlation analyses to investigate the relationship between traffic flow, weather and light conditions, and concluded that crash rates increased as the median speed increased. Ahmed, Abdel-Aty and Yu [28] considered roadway geometry, real-time weather and traffic data to predict crash occurrences. They concluded that roadway geometry, real-time weather and automatic vehicle identification (AVI) system data were influential on the occurrence of crashes, especially during winter and snowy weather conditions. They also stated that the impacts of traffic density fluctuation and road geometry on crashes were more obvious.
Similarly, Abdel-Aty and Pammanaboina developed a crash likelihood prediction model using a real-time traffic flow variable in addition to rainfall data [29]. For more information on the methods of crash analysis, please refer to [30].
Among the literature that studied the effect of driver age on crashes, Alam and Spainhour [31] showed that older drivers were more prone to intersection-related crashes than to crashes on roadway segments. They also showed that older people were more often at fault than younger drivers in fatal crashes. Driver age was also identified as a critical contributing factor for fatal traffic crashes on state highways in Florida [32]. Staplin et al. [33] prepared taxonomy tables that identified the risk factors in crashes involving aging drivers, such
as risky behavior, driving habits and exposure patterns. For further information on related highway traffic safety studies, please refer to [34, 35].
2.1.3 Statistical Learning Methods
Support Vector Machine (SVM) is probably the most widely used statistical learning technique. It refers to supervised learning models with associated learning algorithms that analyze data for classification and regression analysis [36]. Given a set of training instances, each marked as belonging to a category, an SVM training algorithm creates a model that assigns new data instances to one of the available categories (a deterministic binary linear classifier). In detail, a support vector machine constructs a set of hyperplanes in a high-dimensional space. The hyperplane with the largest distance to the nearest training data point has the highest classification power and the lowest generalization error. Recently, the application of SVM in crash data mining and analysis has gained extra attention. Researchers have targeted different deficiencies in classical methods and tried to overcome them using SVM. For example, Li et al. [37] proposed Negative Binomial (NB) regression and SVM models, and compared these models using data collected in Texas. The study showed that SVM models predicted crash data more effectively and accurately, without over-fitting the data, than the traditional NB models or a Back-Propagation Neural Network (BPNN). They suggested the use of SVM if the sole purpose of the study was to predict motor vehicle crashes, since it was easier to implement than the BPNN method. Yu and Abdel-Aty [38] developed an SVM model to eliminate the linear functional form and over-fitting drawbacks of logistic regression and neural network models, respectively. In this research, they proposed a classification and regression tree (CART) model to select the most important explanatory variables, then estimated three candidate Bayesian logistic regression models to capture the unobserved heterogeneity. Finally, they developed SVM models with different kernel methods, and compared them with the Bayesian logistic regression model.
They found that the Receiver Operating Characteristic (ROC) curve demonstrated that the SVM model with the radial-basis kernel function performed more efficiently than the others. Similar work was carried out by Chen et al. [39] with a focus on driver injury severities in rollover crashes in New Mexico. They employed a classification and regression tree (CART) model to identify the significant factors and used SVM models with polynomial and Gaussian radial basis function (RBF) kernels to investigate driver injury severity patterns in rollover crashes. Results showed that seatbelt use, number of lanes, comfortable driving environment conditions, alcohol or drug involvement, driver demographic features, maximum vehicle damage in crashes, and crash time and location were the significant factors associated with serious and fatal injuries. Li et al. [40] developed an SVM with a focus on crash injury severity analyses, and compared it with the Ordered Probit (OP) model. The results showed that the SVM provided better prediction outputs for small-proportion injury severities even though the SVM model might suffer from the multi-class classification problem (48.8% correct prediction for the SVM model compared to 44.0% for the OP model). They also conducted a sensitivity analysis to investigate the potential of using the SVM model for evaluating the impacts of external factors on crash injury, and the results demonstrated that the SVM model produced results comparable to those of the OP model. Wang et al. [41] created a multi-layer perceptron artificial neural network model as a benchmark in order to evaluate the performance of the proposed SVM model, and to investigate the relationship between driver injury severity level and driver, roadway, environmental and vehicle-related factors. Based on several case studies in Wisconsin, their results from SVM suggested an overall classification accuracy of 63.4% and 58.6% for the training and testing datasets, respectively. For Traffic Analysis Zone (TAZ) level crash prediction, SVM has also shown satisfactory capability considering spatial correlations [42]. The authors applied a Correlation-based Feature Selector (CFS) to evaluate candidate factors related to zonal crash frequency in order to handle high-dimensional spatial data, and showed that SVM models involving the spatial proximity outperformed the non-spatial models in terms of both fitting and predictive performance.
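The maximum-margin idea mentioned at the start of this subsection (preferring the separating hyperplane with the largest distance to the nearest training point) can be illustrated with a minimal Python sketch. This is not the training procedure used in any of the cited studies: it only compares the geometric margins of two hand-picked candidate hyperplanes, and the toy points, labels, candidate names and the function `geometric_margin` are all hypothetical.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labelled point to the hyperplane w.x + b = 0.

    A positive value means every point lies on the correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical, linearly separable toy data (labels are +1 / -1).
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

# Two hypothetical separating hyperplanes, each given as (w, b).
candidates = {
    "through_midpoint": ((1.0, 1.0), -6.0),
    "off_centre": ((1.0, 1.0), -5.0),
}

margins = {
    name: geometric_margin(w, b, points, labels)
    for name, (w, b) in candidates.items()
}
# The candidate with the largest margin is the one an SVM would prefer.
best = max(margins, key=margins.get)
print(best, margins)
```

An actual SVM searches over all hyperplanes (and, with kernels, over nonlinear boundaries) rather than a fixed list of candidates; the sketch only shows the quantity being maximized.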
2.2 Traffic Crash Analysis: Computational Intelligence
Most of the existing literature has focused on predicting crash frequency and severity using statistical tools and generalized linear models, such as Negative Binomial (NB), Poisson Regression [43] or Bayesian Empirical methods [44], since the occurrence of accidents on a highway section can be assumed to be a random event. However, soft computing techniques are capable of capturing highly nonlinear relationships between the independent and the dependent variables. Therefore, they are able to discover the hidden and complex correlations and impacts of input factors in complex datasets. In computer science, soft computing, as a branch of Computational Intelligence (CI), mainly refers to the use of sub-optimal methods for computationally hard tasks, namely NP-complete problems where there is no known algorithm to compute an exact solution in polynomial time [45]. The role models for soft computing are the human mind and nature [46]. Soft computing techniques can be modified and applied to a wide variety of complex problems, such as OR modeling [47, 48]. In fact, soft computing techniques are capable of resembling biological processes more closely and effectively than traditional and classical techniques. The principal constituents of soft computing are Fuzzy Logic (FL) [49], Machine Learning (ML), Neural Computing (NC) and Evolutionary Computation (EC); evolutionary algorithms, swarm intelligence and metaheuristic algorithms are categorized under this class. For an in-depth understanding of computationally intelligent methods, readers may refer to [50].
2.2.1 Traffic Crash Analysis: Neural Networks
Most researchers in the field of traffic safety and crash analysis have employed statistical methods such as regression analysis; however, the use of neural networks has recently gained extra attention. Despite the benefits and the prevalence of use, the application of computational intelligence in traffic crash prediction and safety studies has mostly been limited to the use of Artificial Neural Network (ANN) packages and decision tree-based approaches. For example, Mussone, Rinelli and Reitani [51] used artificial neural networks in order to analyze vehicular accidents in Milan, and quantified the degree of danger on urban intersections using different scenarios by the proposed ANN model. Several researchers have also attempted to modify the structure of a neural network in order to obtain results more accurately and efficiently. Most researchers studied the application of neural networks on the severity of the crash injury, and little attention has been given to the crash frequency. Chang [52] compared the efficiency and the performance of ANN with a negative binomial regression model using the 1997–1998 freeway accident data in Taiwan, and concluded that ANN was a more consistent alternative to analyzing freeway accident frequency.
Zeng and Huang [53] proposed a convex combination (CC) algorithm to train a neural network for crash severity prediction and a modified NN Pruning for Function Approximation (N2PFA) algorithm to optimize the overall network structure. Using a case study in Florida, they compared the proposed algorithm with an NN trained by the traditional back-propagation (BP) algorithm and with an ordered logit model. They found that the CC algorithm outperformed the BP algorithm in training speed, classification power and convergence ability with a less complex network. Comparing the results of the NN to the ordered logit model demonstrated the NN’s superiority over statistical models in predicting the crash injury severity. Abdel-Aty and Pande [54] classified traffic speed patterns emerging from loop detector data. In order to classify these data into either crashes or non-crashes, they proposed a Bayesian classifier based methodology, a probabilistic neural network (PNN), and a neural network implementation of the Bayesian-Parzen classifier. Their results showed that the PNN trained much faster than multilayer feed-forward networks, and was able to classify 70% of the crashes. Abdelwahab and Abdel-Aty [24, 55] also examined the relationship between driver injury severity and driver, vehicle, roadway, and environment characteristics using multilayer perceptron (MLP) and fuzzy adaptive resonance theory (ART) neural networks. They applied their methodology to a case study of two-vehicle accidents that occurred at signalized intersections in Central Florida. They found that the MLP neural network performed well, with over 70% and 60% for the training and testing phases, respectively. They also compared the performance of the NN with that of an ordered logit model. The ordered logit model’s performance was only 58.9% and 57.1% for the training and testing phases, respectively. Several important conclusions were also drawn from a simulation case study conducted using the neural network.
For instance, rural intersections were found to be more dangerous in terms of driver injury severity, female drivers were more likely to experience a severe injury, speed ratio increased the likelihood of injury severity, drivers at fault were less likely to experience severe injury, and wearing a seat belt decreased the chance of sustaining severe injuries. Clearly, the classification power of neural networks is relatively higher than that of statistical models. However, compared to the number of studies on statistical models, neural network models have not been studied extensively, possibly due to the complexity of estimating this type of model and the problem of “over-fitting” the data. In fact, when the complexity and heterogeneity of a dataset increase, a neural network may fit the training data to an optimal level; however, it lacks the capability of predicting new data, such as the test dataset. In other words, the algorithm memorizes the data instead of learning from it. To circumvent the latter problem, several researchers have proposed the use of Bayesian neural network (BNN) models [56]. They claim that these models perform better than back-propagation NN models while reducing the difficulty associated with over-fitting. For further information on the application of neural networks in crash severity detection and how they differ from statistical methods, please refer to [57-60].
2.2.2 Traffic Crash Analysis: Other Approaches
A few recent studies have focused on hybrid or soft computing techniques to overcome the conventional difficulties related to classical methods, such as overfitting in neural networks. For instance, Sohn and Lee [61] considered clustering techniques, analyzed the relationship between the driving environmental factors and the severity of road traffic accidents, and performed several analyses to improve the accuracy of individual classifiers for two severity categories with an application in South Korea. They considered three methodologies for the classifier fusion: (a) the Dempster–Shafer algorithm, the Bayesian procedure and the logistic model, (b) data ensemble fusion based on arcing and bagging, and (c) clustering based on the k-means algorithm. They found that the clustering-based classification algorithm worked best for classifying the road accidents in their area of interest. A classification and regression tree (CART) method, one of the most widely applied data mining techniques, was developed to discover the relationship between injury severity and influencing factors such as driver and vehicle characteristics, and highway, environmental and accident variables using the 2001 accident data for Taipei, Taiwan [62]. Unlike statistical tools, CART does not require any pre-defined underlying relationship between the target (dependent) variable and predictors (independent variables). The results of this research indicated that the most important variable associated with crash severity was the vehicle type. In addition, pedestrians, motorcycle riders and bicycle riders were found to have higher risks of getting injured. Similarly, in order to overcome the problems associated with BP in ANN, Chang and Chen [63] proposed a CART model and a negative binomial regression model to establish the empirical relationship between traffic accidents and highway geometric variables, traffic characteristics, and environmental factors with a case study in Taiwan.
Gang and Zhuping [64], on the other hand, proposed a PSO-SVM hybrid algorithm in order to address the drawbacks associated with BP neural networks. They analyzed several significant factors in terms of traffic safety, established a traffic safety forecasting model by PSO-SVM based on these significant factors, and evaluated the forecasting ability of the proposed method. They suggested that the proposed model outperformed BPNN in terms of efficiency and accuracy. Another study, performed by Xu, Wang and Liu [65], employed Genetic Programming (GP) for real-time crash prediction on a freeway in California considering traffic, weather, and crash data. They used the random forest (RF) technique to select the significant variables under both uncongested and congested traffic conditions. They used ROC curves in order to evaluate the prediction performance of the developed GP model for each traffic state. The validation results showed that the prediction performance of the GP models was satisfactory, and that they improved the classification accuracy by 8.2% under congested and 4.9% under uncongested conditions. A PSO-ANN hybrid was employed by Srinivasan, Loo and Cheu [66] for incident detection systems in order to solve the ANN-related problems of slow convergence, heuristic determination of parameters and the possibility of getting stuck at a local minimum. In Chapter 3, we present the proposed research methodology including the data collection, modeling approach and comparison to existing methods.
CHAPTER 3
RESEARCH METHODOLOGY AND RESULTS
This research consists of two main steps. We first perform a comprehensive data analysis on several data sets obtained from a variety of traffic and weather-related sources. Next, we apply statistical and soft computing techniques to model the crash frequency and severity behavior on highway segments with different characteristics. Figure 1 shows an overview of the research methodology.
Figure 1 Research Methodology
3.1 Data Collection and Pre-Processing
The proposed approach allows one to accommodate the effects of different factors such as environmental, traffic, and human-related factors on the sensitivity and the probability of a crash at a given geographical location, both for aging populations and other adult age groups. Unlike previous approaches found in the literature, which are mostly based on the AADT, the proposed methodology focuses on determining the effects of the hourly traffic flow on the frequency and severity of crashes. The methodology is applied to the Interstate 95 (I-95) highway corridor in Florida via a comparison of the two metropolitan areas along this corridor, namely the Miami and Jacksonville metropolitan areas. Note that Miami and Jacksonville counties are among the high priority counties identified by the Safe Mobility for Life Coalition of Florida, based on the high aging-involved crash rates per the aging population of the county [67]. The upper block in Figure 1 considers the data collected across the I-95 corridor together, which is used for an aggregate analysis. The lower block shows details of an analysis that focuses on the individual sites in the Miami and Jacksonville areas. We build a separate model for each site. Statewide crash data sets obtained from the Florida Department of Transportation (FDOT) and hourly traffic volume data obtained from the Telemetered Traffic Monitoring Sites (TTMS) of FDOT for the year 2011 are used to fit the models (2011 was selected because it showed higher variability in meteorology and natural phenomena than other years). Figure 2 depicts the locations of the TTMS stations, which provide hourly flow data on I-95. For each TTMS station, we identify the associated roadway segments and create a set of maps which classifies I-95 into homogeneous segments according to the roadway characteristics obtained from the FDOT. After this segmentation, crashes that happened on those segments are extracted (lower block in Figure 1). The precipitation and light condition data used in the models are obtained from the National Oceanic and Atmospheric Administration (NOAA) and Florida Automated Weather Network (FAWN).
3.1.1 Statewide FDOT Crash Dataset
Statewide crash data sets (including other attributes and crash characteristics) have been obtained from the Traffic Safety Office of the Florida Department of Transportation (FDOT) for the year 2011. Required fields, such as spatial and temporal characteristics, crash severity, driver characteristics, were extracted, refined and classified followed by a careful GIS-based examination in order to identify the most significant attributes in crash for further steps of this study.
Table 1 FDOT Crash Data Attributes

No.  Attribute                                          Source
1    Temporal attributes                                Department of Highway Safety and Motor Vehicles (DHSMV)
2    Spatial attributes                                 FDOT Safety Office
3    Injury and severity index (minor rear-end,         FDOT Safety Office
     property damage, severe injury, fatality)
4    Environmental condition                            FDOT Safety Office and DHSMV
     - Climate (humidity, light, visibility
       conditions, rain and snow, fog, etc.)
     - Roadway (pavement type, sign, lanes, etc.)
5    Vehicle characteristic (type, tunes, lifts,        Department of Highway Safety and Motor Vehicles (DHSMV)
     functionality, etc.)
6    Person characteristic (age, DUI, driver or         FDOT Safety Office and DHSMV
     involved, pedestrian, etc.)
7    Flags: age bins (-15, 50:5:80, +80)                FDOT Safety Office and DHSMV
8    Bicycle and pedestrian characteristics (for        FDOT Safety Office and DHSMV
     vehicle-pedestrian or bicycle crashes)
This crash data includes three categories: (a) point crash data, (b) occupants involved in the crash, and (c) vehicles involved in the crash. In this thesis, two categories were linked in ArcGIS in order to enable us to access this extended dataset with combined attributes. Table 1 briefly summarizes the main attributes of the FDOT statewide crash data that we extracted for this research, including the attributes obtained from the Department of Highway Safety and Motor Vehicles (DHSMV). The details of the GIS-based examination of this crash data are discussed in the next section.
3.1.2 Meteorological Data
The climatological data were obtained from the National Oceanic and Atmospheric Administration (NOAA). This extended data set includes a variety of attributes; however, we utilize only the following in this thesis: the mean precipitation rate and the departure from the average precipitation rate. For this purpose, we extracted the 15-minute precipitation values and summed them to obtain hourly precipitation values for the weather stations in the vicinity of the roadways studied. Similar weather data were also received from the Florida Automated Weather Network (FAWN) for comparison purposes. We also obtained exact sunrise and sunset times from the United States Naval Observatory (USNO) in order to study the effect of light.
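As an illustration only, the aggregation of 15-minute precipitation readings into hourly totals can be sketched in Python as follows. The record layout (timestamp and amount pairs), the function name `hourly_precipitation` and the sample values are hypothetical and are not taken from the NOAA data used in this thesis.

```python
from collections import defaultdict
from datetime import datetime

def hourly_precipitation(records):
    """Sum 15-minute precipitation readings into hourly totals.

    records: iterable of (timestamp, amount) pairs, where timestamp is a
    datetime marking the start of a 15-minute interval.
    Returns a dict mapping the start of each hour to the summed amount.
    """
    totals = defaultdict(float)
    for timestamp, amount in records:
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        totals[hour_start] += amount
    return dict(totals)

# Hypothetical readings for one weather station (units assumed to be inches).
sample = [
    (datetime(2011, 4, 1, 14, 0), 0.00),
    (datetime(2011, 4, 1, 14, 15), 0.05),
    (datetime(2011, 4, 1, 14, 30), 0.10),
    (datetime(2011, 4, 1, 14, 45), 0.02),
    (datetime(2011, 4, 1, 15, 0), 0.01),
]
hourly = hourly_precipitation(sample)
for hour_start in sorted(hourly):
    print(hour_start, round(hourly[hour_start], 2))
```

Keying the totals by the start of the hour makes it straightforward to join the result with hourly traffic flow records that use the same time stamp.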
3.1.3 Hourly Traffic Flow
As mentioned earlier, we incorporated flow data at different times of the day instead of the Average Annual Daily Traffic (AADT). For this purpose, hourly traffic data for 2011 were obtained from the Statistics Office of the FDOT, collected using Telemetered Traffic Monitoring Sites (TTMS) and Portable Traffic Monitoring Sites (PTMS). The values represent the flow in each hour of the day throughout the year. This data set was processed for the selected locations in the Miami and Jacksonville metropolitan areas for further use in the model calculations. The locations selected for the analysis are shown in Figure 2.
Figure 2 I-95 Corridor and TTMS Stations in Florida
Red points in Figure 3 show the locations for which hourly flow data is available in the Miami Metropolitan Area.
Figure 3 A Sample of FTI User Interface
The data retrieval process in ArcGIS consisted of selecting uniform roadway segments on I-95 based on the topographic maps and raster layers as well as the available roadway network and TTMS locations. These segments are approximately 3.5 miles (for five lanes) and 5 miles (for three lanes) long, respectively, with minimum grades in each direction (both northbound and southbound). These segments are shown in Figure 2, where each node represents a TTMS station. We processed the available flow data for both the north and south directions at the sites. This data set included the number of vehicles passing a TTMS station each hour. Next, we classified the flow data in order to obtain a nested 12-month data structure using MATLAB, which showed the flow per month in each direction. Following the nested structure created before, we created another data set showing the amount of precipitation as well as the surface condition (dry/wet) corresponding to each traffic flow. Finally, light condition (day/night) data obtained from the USNO were added to the nested structure and encoded as a binary variable: 1 for daylight hours and 0 for night time. This data structure was again compiled, where each cell represented one hour for each traffic direction throughout 2011. After the data retrieval step to create the training data, we fit the binary logistic regression and the proposed clustering models for each site. This approach allowed us to investigate the effects of weather condition, light condition, and traffic flow on the probability of crash occurrences and to use the models to predict crash frequency and severity in the future. This methodology was applied to both crashes involving the aging population and those involving the overall age group. The validity and prediction capability of the models are studied by predicting crash frequency and severity for unobserved data in a cross-validation study.
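The hourly record described above (flow, precipitation, surface condition and the binary light code for each hour and direction) can be sketched in Python as follows. The thesis built this structure in MATLAB; the Python version below is only an illustration, and the field names, the function names, the rule that any precipitation implies a wet surface, and the rule that an hour is classified by its start time are all assumptions.

```python
from datetime import datetime

def light_code(hour_start, sunrise, sunset):
    """Return 1 if the hour starts in daylight and 0 otherwise (assumed rule)."""
    return 1 if sunrise <= hour_start < sunset else 0

def build_hourly_record(hour_start, direction, flow, precipitation, sunrise, sunset):
    """Combine flow, weather and light information for one hour and direction."""
    return {
        "hour": hour_start,
        "direction": direction,          # "N" or "S"
        "flow": flow,                    # vehicles passing the station in the hour
        "precipitation": precipitation,  # hourly total
        # Assumption: any recorded precipitation marks the surface as wet.
        "surface_wet": 1 if precipitation > 0 else 0,
        "daylight": light_code(hour_start, sunrise, sunset),
    }

# Hypothetical values for a single day; not taken from the thesis data.
sunrise = datetime(2011, 4, 1, 7, 10)
sunset = datetime(2011, 4, 1, 19, 38)
records = [
    build_hourly_record(datetime(2011, 4, 1, 6), "N", 2100, 0.0, sunrise, sunset),
    build_hourly_record(datetime(2011, 4, 1, 8), "N", 6900, 0.12, sunrise, sunset),
    build_hourly_record(datetime(2011, 4, 1, 21), "S", 3100, 0.0, sunrise, sunset),
]
for record in records:
    print(record["hour"], record["direction"], record["daylight"], record["surface_wet"])
```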
The next section describes the details of the proposed methodology. There are a few limitations in the available data, as well as in access to other datasets, that the authors had to deal with. First, the reliability of the FDOT statewide crash data set may be questioned, since the data is mainly based on the reports prepared by police officers. However, several researchers have studied the accuracy of police reports and concluded that the accuracy, validity and reliability of the reports are quite acceptable [68, 69]. Furthermore, we compared the light, surface and weather conditions to the datasets that we collected from other sources, such as NOAA, FAWN and USNO, and found that the agreement between the FDOT and other datasets is acceptable. Another limitation of the dataset was the lack of hourly flow data specifically for the aging population. The lack of access to the driving schedules and preferences of the aging population in Florida is another limitation of the collected datasets. In addition, the available crash data only considers the involved vehicles and persons. It does not give any information regarding the exposure. Figure 4 shows a schematic view of both crash frequency and severity data sets.
Figure 4 Output Database Processed for the Research
3.2 Exploratory Descriptive Analysis
In this section, we present an exploratory analysis of the traffic flow and crash data collected from the six locations. We select three stations from the Miami area (Miami 1, Miami 2 and Miami 3 in Figure 2), and three stations from the Jacksonville region (JAX 1, JAX 2 and JAX 3 in Figure 2). Table 2 summarizes the features of data from the individual stations based on different age groups, time of day, precipitation and flow levels. We observe that traffic flows are higher for the Miami locations than those in Jacksonville. In addition, during the day, the average flow during the crash hours is higher than the average flow for all other times in Miami. In Jacksonville, on the other hand, we do not see such a clear difference. This difference in the flow is even more distinct for Miami at night. Therefore, we can argue that crashes in Miami mostly occur when the flow is higher for both age groups for both day and night conditions. For Jacksonville, the behavior is drastically different at night. Locations closer to the City of Jacksonville (JAX 2 and JAX 3 in Figure 2) behave differently. Here, many crashes occur after midnight with low traffic volume.
Table 2 Summary of Crashes According to Age Groups, Location, Time of Day and Weather Conditions

                               Miami 1       Miami 2       Miami 3         JAX 1         JAX 2         JAX 3
Measure                     All  Aging    All  Aging    All  Aging    All  Aging    All  Aging    All  Aging

Temporal Effect: Day
Crashes                     398     24    282     33    277     30    105     15    131      8     73     10
Total Hours                8520   8520   8520   8520   8520   8520   8520   8520   8520   8520   8520   8520
Average Flow               6892   6892   5800   5800   5095   5095   2728   2728   3810   3810   1858   1858
Mean Flow During Crash     7117   6845   6150   6363   5386   5930   2779   2656   4277   3972   1885   2118

Temporal Effect: Night
Crashes                     219     41    165     10    161     13     61      1     47      4     24      3
Total Hours                9000   9000   9000   9000   9000   9000   9000   9000   9000   9000   9000   9000
Average Flow               3136   3136   2353   2353   1848   1848    936    936   1340   1340    649    649
Mean Flow During Crash     4230   6153   2977   3381   2508   3123   1338    812   2130   2609    716    347

Precipitation Effect: Clear
Crashes                     615     65    440     43    433     43    161     16    175     12     90     13
Total Hours               17388  17388  17388  17388  17345  17345  17411  17411  17402  17402  17402  17402
Average Flow               4980   4980   4026   4026   3423   3730   1805   1805   2539   2539   1236   1236
Mean Flow During Crash     4956   6499   4972   5670   4333   5193   2285   2533   3721   3529   1600   1709

Precipitation Effect: Rainy
Crashes                       2      0      7      0      5      0      5      0      3      0      7      0
Total Hours                 132    132    132    132    175    175    109    109    109    109    109    109
Average Flow               5126   5126   4399   4399   3423   3730   2145   2145   3036   3036   1490   1490
Mean Flow During Crash     5435    ---   5062    ---   4138    ---   1531    ---   3063    ---   1539    ---
Based on Table 2, the effect of precipitation appears to be minimal in both age groups since we do not observe a considerable number of crashes during rainy weather. In most cases, the average flow during crash hours is higher than the average flow over all time periods (with and without crashes) across weather conditions and age groups. Moreover, we observe more traffic on the roadways during rain.
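The comparison between the flow during crash hours and the average flow can be made explicit with a short Python sketch. The numbers below are the night-time values for all drivers copied from Table 2; the dictionary layout and the function name `crash_flow_ratio` are hypothetical, and the ratio is only a descriptive summary, not a statistic reported in this thesis.

```python
# Night-time values for all drivers, copied from Table 2.
night_all = {
    "Miami 1": {"average_flow": 3136, "crash_flow": 4230},
    "Miami 2": {"average_flow": 2353, "crash_flow": 2977},
    "Miami 3": {"average_flow": 1848, "crash_flow": 2508},
    "JAX 1": {"average_flow": 936, "crash_flow": 1338},
    "JAX 2": {"average_flow": 1340, "crash_flow": 2130},
    "JAX 3": {"average_flow": 649, "crash_flow": 716},
}

def crash_flow_ratio(row):
    """Ratio of the mean flow during crash hours to the average flow."""
    return row["crash_flow"] / row["average_flow"]

ratios = {site: round(crash_flow_ratio(row), 2) for site, row in night_all.items()}
for site, ratio in ratios.items():
    # A ratio above 1 means crashes occurred at higher-than-average flow.
    print(site, ratio)
```

A ratio above 1 indicates that crashes at that site tended to occur during hours with higher-than-average flow; the same calculation can be repeated for the other blocks of Table 2.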
Figure 5 Time Series Plots for Traffic Flow: North and South Directions and April and Summer Months Hourly Averages. Panels: (a) Miami 1, (b) Miami 2, (c) Miami 3, (d) Jax 1, (e) Jax 2, (f) Jax 3. Horizontal axis: hour of day; vertical axis: flow; series: April northbound, April southbound, September northbound, September southbound.
Figure 5 shows sample time series plots of traffic volumes. Average hourly flows are shown for the two months with the highest (April) and lowest (September) traffic. For Miami area (Figure 5 a-c), we see that the traffic flow increases suddenly in the morning when people drive to work. After a subtle decrease, the flows increase with a smaller slope until the afternoon peak, and it drops back at night. In Miami 1 and 2, close to the downtown, no significant AM and evening PM peaks are observed. However, in Miami 3, away from the downtown Miami, AM and PM peaks are more apparent (Figure 5-c). The flow pattern is similar for both northbound and southbound directions, which is probably due to the usage of this roadway by residents as well as the tourists, which indicates the effect of the seasonal traffic. For the Jacksonville area (Figure 5 d-f), we observe higher traffic in the morning towards the downtown Jacksonville (northbound) whereas the higher traffic occurs away from the downtown (southbound) in the evening. The AM and PM peaks are more apparent at JAX 1 and
JAX 2, indicating the presence of home-based work trips (Figure 5-d and 5-e). The only abnormal behavior is observed at JAX 3 (Figure 5-f), which may be due to the proximity of the Jacksonville International Airport. Figure 6 shows the hourly crash counts for the selected locations in 2011 for both aging and overall population groups. All locations in the Miami region (Figure 6 a-c) appear to have two distinct peaks corresponding to the AM and PM rush hours for both age groups. Similarly, in Jacksonville, the downtown area (JAX 2) appears to have a two-peak pattern (Figure 6-e). However, for the other two Jacksonville locations (Figure 6 d and f), the crash counts are relatively constant throughout the day.
Figure 6 Time Series Plot of Hourly Crash Counts for All and Aging Drivers. Panels: (a)-(c) Miami 1 to Miami 3, (d)-(f) Jax 1 to Jax 3. Horizontal axis: hour of day; vertical axis: number of crashes; series: all drivers and aging drivers.
Figure 7 shows the crash flow (flow during crash hours) histograms for both overall and aging populations. Flow distributions in the Miami region are relatively left skewed (Figure 7), implying that the chance of having a crash is higher with an increase in the traffic volume, whereas in Jacksonville 1 and 3 most crashes occur during average flow times. For Jacksonville 2, which is close to downtown, the behavior is relatively similar to that in Miami.
Figure 7 Histograms of Crash Flows for All and Aging Drivers, panels (a)-(f). Recoverable panel title: Histogram of Flow During Crash Hours, Jax 3 (panel f). Horizontal axis: flow values; vertical axis: frequency of observations; series: flow during crash (all drivers) and flow during crash (aging drivers).
Figure 8 shows dot-plots of flow data for one location in Miami and one in Jacksonville during daytime and nighttime. Considering Table 2, in Jacksonville, unlike Miami, the PM rush period after sunset is very short, and therefore there are more hours with lower traffic volume at night (Figure 8). This indicates that crashes at night occur when higher traffic volumes are observed. For aging drivers, this behavior only occurs in Miami. In Jacksonville, the JAX 1 and JAX 3 locations have more aging driver-involved crashes at lower traffic volumes (Figure 7-b).
Figure 8 Dot-plots of Flow Data for Different Light Conditions (a) Miami, (b) Jacksonville [70, 71].
3.3 Correlation Analysis for the Entire Corridor
In this section, we aim to capture a picture of how different traffic and roadway-related factors influence the frequency of crashes on the I-95 corridor in Florida, from north to south. These steps are shown in the upper block of Figure 1. As shown in Figure 9, there is a significant strong positive correlation (r = 0.86) between the AADT and surface width (with P-Value ≈ 0). This indicates that if the analysis includes one of these variables, adding the other one does not result in significant additional predictive power. Therefore, in our analysis we consider traffic (flow) information as a predictor and not the surface width. The relationship between the number of crashes and AADT for both age groups is shown in Figure 9-b. This plot shows that the frequency of crashes is high in those segments of the roadway with high AADT values. In fact, the correlation between AADT and crash frequency for aging drivers is highly positive (r = 0.921, P-Value ≈ 0), indicating that wider roadways with high volumes of traffic may be overwhelming for aging drivers. For the entire corridor, we use the AADT as a measure to represent the average traffic volume and to observe the relationship between the traffic volume and crash frequency. Our previous studies have shown that the hourly traffic flow (instead of AADT) explains a larger proportion of the variability in crash frequency and severity. Therefore, in the next section, when we focus on the individual segments on I-95 in the Miami and Jacksonville metropolitan areas, we use the hourly traffic volumes to provide a more accurate representation of the traffic.
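For readers who wish to reproduce this kind of check, a minimal Python sketch of the Pearson correlation coefficient is given below. The segment-level AADT and surface width values are hypothetical and are not the I-95 data analyzed in this thesis, so the resulting coefficient is not expected to match the reported r = 0.86; the function name `pearson_r` is also an illustrative choice.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical segment-level values (AADT in vehicles/day, width in feet).
aadt = [60000, 85000, 120000, 150000, 210000]
surface_width = [36, 48, 48, 60, 72]

r = pearson_r(aadt, surface_width)
print(round(r, 3))
```

When two candidate predictors are this strongly correlated, keeping only one of them, as done above with the traffic variable, avoids feeding nearly redundant information into the model.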
Figure 9 Traffic Characteristics for the I-95 Corridor in Florida, (a) Correlation Matrix for the AADT vs. Roadway Width for All Segments, (b) AADT, and Crashes for All Age Groups and Aging-Involved Crash Frequencies Versus Distance (miles)
3.4 Logistic Regression Analysis
We extensively studied and discussed the nature of the available data sets, and logistic regression with a binomial distribution (Logit) seems to be the best fit for our analysis since the outcome of the response variable is binary. However, the following link functions can also be employed for this analysis in addition to the Logit: complementary log-log (Clog-log) and Probit. Therefore, in this research, we consider all three link functions, namely Logit, Probit and Clog-log, and select one of them for the rest of our analysis based on the best fit. In addition, there is no need to utilize zero-inflated models, and the conventional GLM can perform sufficiently well, since we do not have a substantial amount of zeros in the data set. Tobit regression is another common approach that can be employed. This model, introduced by James Tobin [72], can describe the relationship between a non-negative dependent variable and an independent variable. The Tobit model, also referred to as the censored regression model, is formulated in order to estimate linear relationships between variables when there is censoring (either left- or right-censoring) in the output variable. Censoring from below (left) takes place when all cases with a value at or below some threshold take the value of that threshold. In fact, the true value might be equal to or lower than that threshold. Likewise, the nature of our data sets does not allow us to use such a model for our analysis. Therefore, we employ the Logit, Clog-log and Probit models in this thesis.
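To make the difference between the three link functions concrete, the following Python sketch implements their standard textbook definitions together with the corresponding inverse links. The symbol p (the probability of the binary outcome) and all function names are notation chosen for this illustration; the sketch does not fit a model and is not the software used in this thesis.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)

def logit(p):
    """Logit link: log-odds of p."""
    return math.log(p / (1.0 - p))

def probit(p):
    """Probit link: inverse standard normal CDF of p."""
    return _STD_NORMAL.inv_cdf(p)

def cloglog(p):
    """Complementary log-log link."""
    return math.log(-math.log(1.0 - p))

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    return _STD_NORMAL.cdf(eta)

def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

# Compare the linear predictor implied by each link for a few probabilities.
for p in (0.05, 0.5, 0.95):
    print(p, round(logit(p), 3), round(probit(p), 3), round(cloglog(p), 3))
```

The logit and probit links are symmetric around p = 0.5, whereas the complementary log-log link is asymmetric, which is one reason the three links can fit the same binary data differently.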
Binomial link functions for these three models are defined as:
(1) Logit: \eta = \ln\left(\frac{p}{1 - p}\right)
(2) Probit: \eta = \Phi^{-1}(p)
(3)