
The Pennsylvania State University

The Graduate School

Department of Civil and Environmental Engineering

FUNCTIONAL FORM AND HETEROGENEITY EFFECTS ON SAFETY PERFORMANCE FUNCTION ESTIMATION

A Dissertation in

Civil Engineering

by

Baradhwaj Hariharan

© 2015 Baradhwaj Hariharan

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

August 2015

The dissertation of Baradhwaj Hariharan was reviewed and approved* by the following:

Venky N. Shankar
Professor of Civil Engineering
Dissertation Adviser
Chair of Committee

Swagata Banerjee Basu
Assistant Professor of Civil Engineering

Jeremy Blum
Associate Professor of Computer Science

Evelyn Thomchick
Associate Professor of Supply Chain Management

Peggy A. Johnson
Professor of Civil and Environmental Engineering
Head of the Department of Civil and Environmental Engineering

*Signatures are on file in the Graduate School


ABSTRACT

The American Association of State Highway and Transportation Officials (AASHTO) defines safety performance functions (SPFs) as statistical models used to estimate the average crash frequency for a specific site type with specific base conditions, based on traffic volume, roadway segment length, and other site characteristics such as lane width, shoulder width, and radius and degree of horizontal curvature. In essence, SPFs are mathematical equations developed through statistical regression modeling of historical data, and are used to predict crash occurrence at sites comparable to those where the historical data was obtained.

The aim of this research is to use statistically derived functional form transformations, coupled with the parameterization of the negative binomial coefficient of over-dispersion, to minimize the effect of heterogeneity on the predictions of crash frequencies. The functional form transformations were targeted at the parameter heterogeneity that could arise from the use of improper functional form, while the over-dispersion parameterization aimed at accounting for any heterogeneity resulting from unobserved effects empirical to the dataset. The scope of this study is approximately 5,443 centerline miles of 142 2-lane State highways in Washington State, with crash data for nine years from 2002 to 2010.

This dissertation investigates the utility of the multivariable fractional polynomial (MFP) search algorithm for deriving functional forms for SPFs, while accounting for heterogeneous effects. The original contribution of this dissertation lies in the development of novel nonlinear functional form SPFs, and their comparative evaluation against a baseline negative binomial (NB) specification, a baseline heterogeneous negative binomial, and a random parameter negative binomial specification. A 10-fold cut method was used for the MFP search due to the size of the modeling dataset; this enabled the teasing out of minute variations in the dataset. Cumulative residual (CURE) plots were also employed to visually gauge the effect of the MFP functional form transformations on the model cumulative residuals. Following this, the resulting functional forms were incorporated with a heterogeneous negative binomial specification, as a means of obtaining segment specific parameterizations of the negative binomial over-dispersion term alpha, while obtaining some insight into the factors that contribute to the observed over-dispersion. It was found that Average Annual Daily Traffic (AADT), homogeneous segment length, the degree of curvature of horizontal curves, the length of a roadside culvert, the presence of a tree group on the roadside, and the presence of a retaining wall, contributed significantly to the over-dispersion of the observed crash data.

The model specifications were also compared to a random parameter negative binomial specification estimated for the same dataset, and it was found that the heterogeneous negative binomial specification with MFP derived functional form transformations provided a better convergent log-likelihood, as well as lower prediction validation estimates for mean absolute error (MAE), mean absolute percentage error (MAPE), mean square error (MSE), and root mean square error (RMSE).

The gaps in the extant literature that this dissertation was hoping to fill were the lack of a statistical procedure to empirically determine variable functional forms without any assumptions on the distribution of the variable’s effect on crash counts, and a modeling approach that could capture segment specific heterogeneity effectively. Towards this goal, the proposed algorithm proves to be an acceptable method towards modeling SPFs, while providing good insight into segment level parameterization of over-dispersion, thereby enabling effective safety rating and decision making at specific regions of a roadway network.


TABLE OF CONTENTS

List of Figures ...... vii

List of Tables ...... viii

Acknowledgements ...... x

Chapter 1 Introduction ...... 1

Chapter 2 Literature review ...... 7

2.1 Statistical analysis of crash counts ...... 7
2.2 Functional form in count regressions ...... 11

Chapter 3 Overview of study area and data descriptions ...... 16

3.1 Database evolution ...... 18
3.1.1 Crash data ...... 18
3.1.2 Average annual daily traffic data ...... 28
3.1.3 Roadway geometrics data ...... 31
3.1.4 Roadside data ...... 36
3.2 Combined homogeneous segments dataset ...... 39

Chapter 4 Modeling methodology ...... 44

4.1 Poisson distribution, over-dispersion and the negative binomial regression ...... 46
4.2 Model structure for functional form treatments ...... 48
4.3 Estimation algorithm and ...... 49
4.4 Modeling a large dataset with FP(m,p) ...... 51
4.5 Incorporation of FP(m) structure into heterogeneous negative binomial estimation ...... 55
4.6 Prediction validation measures ...... 58
4.7 CURE plots to test functional forms resulting from the MFP algorithm ...... 60
4.8 Modeling methodology summary ...... 62

Chapter 5 Model estimation framework and results ...... 66

5.1 Baseline negative binomial specifications ...... 67
5.2 Random parameter negative binomial specification ...... 72
5.3 Multinomial fractional polynomial negative binomial specification ...... 76
5.4 CURE plots ...... 82
5.4.1 Untransformed variables ...... 83
5.4.2 MFP algorithm derived variable functional form transformations ...... 86
5.5 Heterogeneous negative binomial specification with MFP and non-MFP predictors ...... 89
5.6 Variable elasticities and model prediction measures ...... 96
5.6.1 Elasticities ...... 96
5.6.2 Model prediction evaluations ...... 97
5.6.3 AIC and BIC comparisons ...... 99
5.6.4 Model prediction measure insights ...... 102

Chapter 6 Summary ...... 103

Bibliography ...... 109

Appendix A Description of homogeneous roadway segment dataset parameters ...... 117
Appendix B MFP search results for 10 subsets ...... 122
Appendix C Recoded MFP variables ...... 128


LIST OF FIGURES

Figure 1-1 Conceptual flow of modeling process ...... 6

Figure 3-1 Centerline mileage for 142 State Routes in Washington State ...... 16

Figure 3-2 Sample annual traffic report from the WSDOT TRIPS system ...... 28

Figure 3-3 AADT segment Interpolation ...... 29

Figure 3-4 Sample highway log report from TRIPS ...... 32

Figure 4-1 Example CURE plot ...... 61

Figure 4-2 CURE plot with bounds ...... 62

Figure 4-3 Structure of modeling process ...... 63

Figure 4-4 Proposed algorithm in the context of this dissertation ...... 64

Figure 5-1 FP() forms for LNADT and Lnlength ...... 80

Figure 5-2 CURE plot for untransformed AADT ...... 84

Figure 5-3 CURE plot for untransformed segment length ...... 84

Figure 5-4 CURE plot for untransformed lane width ...... 85

Figure 5-5 CURE plot for untransformed degree of horizontal curvature ...... 85

Figure 5-6 CURE plot for polynomial transformation of AADT ...... 87

Figure 5-7 CURE plot for polynomial transformation of segment length ...... 87


LIST OF TABLES

Table 3-1 Study scope- centerline miles by State Route...... 17

Table 3-2 Crash identification information from WSDOT raw data ...... 19

Table 3-3 Crash location information from WSDOT raw data...... 20

Table 3-4 WSDOT raw data depicting information on the crash ...... 21

Table 3-5 WSDOT raw data depicting information on the crash (continued) ...... 22

Table 3-6 WSDOT raw data describing the circumstances of the driver/individual involved in the crash ...... 24

Table 3-7 WSDOT crash records depicting physical environment conditions at the time of the incident ...... 25

Table 3-8 Raw crash data vehicle information ...... 26

Table 3-9 State highway log related roadway type identifiers ...... 31

Table 3-10 Raw geometric data horizontal alignment information ...... 33

Table 3-11 Raw geometric data vertical alignment information ...... 34

Table 3-12 Raw geometric data number of lanes and roadway width information ...... 35

Table 3-13 Raw geometric data shoulder width information ...... 35

Table 3-14 Roadside feature metadata files from RFIP ...... 37

Table 3-15 Table depicting the related variables in the raw data that were aggregated ...... 38

Table 3-16 Processed Data Roadside Features ...... 38

Table 3-17 Summary of roadway characteristics ...... 41

Table 3-18 Summary statistics of roadside features ...... 41

Table 3-19 Summary statistics of roadside features (continued) ...... 42

Table 3-20 Summary statistics of crash categories ...... 43

Table 4-1 Segment and total crash counts in the 10 randomly selected subsets ...... 53

Table 5-1 Negative binomial specification with AADT and segment length ...... 68


Table 5-2 Baseline NB with AADT, segment length, State Route, roadway and roadside effects ...... 70

Table 5-3 Baseline heterogeneous negative binomial specification ...... 71

Table 5-4 Random parameter NB specification for comparison ...... 75

Table 5-5 MFP incorporated NB specification ...... 77

Table 5-6 MFP incorporated NB specification (continued) ...... 78

Table 5-7 MFP incorporated NB specification (continued) ...... 79

Table 5-8 Heterogeneous NB specification with MFP functional forms ...... 91

Table 5-9 Heterogeneous NB specification with MFP functional forms (continued) ...... 92

Table 5-10 Heterogeneous NB specification with MFP functional forms (continued) ...... 93

Table 5-11 Comparison of AIC and BIC across models ...... 100

Table 5-12 Comparison of continuous variable elasticities across all models ...... 101

Table 5-13 Model prediction error summary ...... 101


ACKNOWLEDGMENTS

I would like to express my gratitude towards my parents, my sister and brother-in-law, who patiently supported me, and dreamed my dreams all through my journey to completing my dissertation. The sacrifices they have made to see me through to this day can never be repaid, but hopefully I can take life forward well enough to make them proud.

Following my parents are the two people who have influenced my days in State College the most, by taking me into their lives and treating me like a part of their family. I am very grateful to my advisor Professor Shankar Venkataraman for all the support and guidance he has provided me, not only academically, but also through stimulating conversations on philosophy and life. I would also like to thank (soon to be Dr.) Jungyeol Hong, for always being there for me as a mentor and as a friend, helping me look ahead when times got tough, and for always putting up with my often poor/irritating sense of humor with a smile. Words cannot express how thankful I am for the motivation I have derived from them.

I also thank both Jungyeol Hong and Dr. Minho Park for providing me the data that would eventually become my dissertation modeling dataset.

My special acknowledgment goes out to my committee members Dr. Swagata Banerjee Basu, Dr. Jeremy Blum, and Dr. Evelyn Thomchick for their time invested in me, and their invaluable suggestions and contributions towards my dissertation.

I would also like to thank all my friends through the many years of my graduate education, without whom, I would have probably ended up finishing much sooner, but I definitely would not have felt as satisfied with life.


Chapter 1

Introduction

The American Association of State Highway and Transportation Officials (AASHTO) defines safety performance functions (SPFs) as statistical models used to estimate the average crash frequency for a specific site type with specific base conditions, based on traffic volume, roadway segment length, and other site characteristics such as lane width, shoulder width, and radius and degree of horizontal curvature. In essence, SPFs are mathematical equations developed through statistical regression modeling of historical data, and are used to predict crash occurrence at sites comparable to those where the historical data was obtained.

The Highway Safety Manual (HSM) published by AASHTO provides guidance on how to utilize SPFs to make better safety evaluations of sites in three ways. The first application is the screening of a network to identify sections that may have the best potential for improvements. The second application is to use SPFs to determine safety impacts of design changes at a project level, while the third is to use SPFs to evaluate the safety effects of engineering treatments in the form of before-after studies. Given the nature of these applications, a significant level of accuracy in predictions is desirable. The predominant SPFs in use utilize count regressions with parameters fixed across all observations. These are not always the best models to use for predictions, as they ignore the effects of heterogeneity across the datasets.

As a solution to this issue, research has been done on using random effect and random parameter (RPM) regression models, with random intercept terms and varying parameter estimates for each observation respectively, but these methods are not without their own assumptions and prediction issues. Advanced safety performance functions in the form of random parameter models have tended to treat parameter heterogeneity as a representation of unobserved effects not accounted for in fixed parameter models (for example, Anastasopoulos 2009; Venkataraman et al. 2011, 2013). Even after accounting for over-dispersion of data, interactions due to segment-to-segment variations, driver behavior, environmental factors and roadway geometry can have a substantial effect on heterogeneity. Random parameter models are structured on the assumption that unobserved effects contributing to parameter heterogeneity are in the form of stochastic continuous representations across segments that can be drawn from known distributions.

An alternate view might be that a significant portion of the heterogeneity and its variation across segments may be due to the lack of proper functional form. Therefore, an investigation of functional form classes is proposed to determine which roadway segments, or combinations have similar functional forms, and how many classes of these similar functional forms are required to capture heterogeneous effects arising from improper functional form. The first goal of this study is to develop a method that can estimate nonlinear functional forms for SPFs for large datasets.

The extant literature provides no material relating to the impact of functional form on heterogeneity, and therefore, no guidance is available for broad application of refined functional forms for a state network.

The second goal of this dissertation is investigating the impact of functional form on heterogeneity. The use of fractional polynomials in multivariable regression modeling is an empirical approach towards functional form specification. Since heterogeneity is also an empirical artifact specific to a dataset, it is desirable to know if an empirical functional form basis such as multivariable fractional polynomial (MFP) regression provides a reasonable approximation of heterogeneity effects. Heterogeneity that is of unknown origin, not associated with exogenous variables, would manifest itself as an unobserved effect on crash propensities.

Since it is empirically plausible that some heterogeneity can occur even after fully treated functional forms, this study aims at using the functional form approach to determine the representation of heterogeneity via a generalized dispersion parameter approach where the over-dispersion parameter itself is a functional form of multivariable polynomials.
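One common way of writing a generalized dispersion specification of this kind is sketched below. It is offered as an illustration under stated assumptions and is not necessarily the exact parameterization estimated later in this dissertation. The symbols $\lambda_i$ (expected crash frequency on segment $i$), $x_i$ (covariates in the mean function), $z_i$ (covariates in the dispersion function), and the coefficient vectors $\beta$ and $\gamma$ are notation introduced here for illustration only.

```latex
\begin{align}
E[y_i \mid x_i] &= \lambda_i = \exp(\beta' x_i), \\
\operatorname{Var}[y_i \mid x_i, z_i] &= \lambda_i + \alpha_i \lambda_i^{2}, \\
\alpha_i &= \exp(\gamma' z_i).
\end{align}
```

Under this sketch, a fixed-dispersion negative binomial is the special case in which $z_i$ contains only a constant, so that $\alpha_i = \alpha$ for every segment. The elements of $x_i$ and $z_i$ may be untransformed variables or fractional polynomial transformations of them.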

The combination of these two research questions is fundamental to the addressing of heterogeneity treatment in the methodological evaluation of geometric effects on crash propensities. In the absence of proper functional form, fixed parameter models now in widespread use in the safety literature can misinform safety policy. In contrast, RPMs can be computationally burdensome for large datasets, notwithstanding the continuous distributional base assumptions they require for estimation. Sustainable safety policy can be developed only if models can capture differences between roadway groups accurately, while accommodating nonlinear parameterizations that have been ignored in the current literature. The functional form limitations in current literature have limited the sustainable application of safety policy in addressing issues of substantive importance, such as, treatment effects of safety interventions, energy versus safety implications of interventions such as roadway lighting, targeted multifunctional evaluation of roadside interventions such as slope stabilization, etc.

Decomposing the contribution of functional form to heterogeneity can provide valuable insights into the consideration of roadway geometric factors that can provide the basis for adequate functional forms for SPFs, in the event that environmental and driver behavior data is not easily available. A crash on a roadway is akin to a failure in a dynamic system. Given a contextual basis for a complex system, failure could result from one or more critical components; or from related components; or the near failure of all components, resulting in the system performing at below threshold levels. In the case of roadways, the critical components could be the road/roadside design parameters and materials, while related components could be bridge support structures or environmental factors. Similar complex systems could be computer networks, manufacturing systems, or human physiology. As such the methodology proposed here can be applied to the modeling of system level failures in other areas of study, assuming comprehensive system parameter data is available.

This study would also enable the rating of the safety implications of a roadway at a segment level, which could serve as a trigger for technology interventions in locations where roadway geometrics are maxed out. For example, this can be seen in icy weather runoff crashes, widely documented on bridge overpasses constructed with maximum shoulder widths, and wide turning radii. This trigger could lead to the bridge itself becoming a candidate for other sensing needs such as movement under freeze-thaw, seismic effect studies, and bridge scour analysis.

The parent dataset used for this study consists of a 9-year history of total crash frequencies on 117 state routes in Washington State, from 2002 to 2010. The data for each year consists of 62,598 homogeneous segments, yielding an aggregate of 563,382 rows, and a total of 124,883 documented crashes over the 9-year period. The study utilizes 40 different crash types, categorized based on location type, crash severity, crash type, and number of vehicles, along with 36 variables describing the roadway features for each segment, and 55 variables describing roadside features. The roadside variables consist of 37 dummies indicating the presence or absence of the respective roadside features, along with 18 length percentage variables to show the extent to which the respective features are present along the length of each segment.

Negative binomial (NB) count regression models were constructed using a data subset for the year 2010, with total crashes as the dependent variable, and average daily traffic (ADT), segment length, roadway variables and roadside variables as predictors. Baseline NB models with fixed dispersion parameter (alpha) were estimated, and the significant variables were used to develop baseline NB models with varying dispersion parameters (heterogeneous NB). Then an MFP NB with fixed alpha was developed, resulting in the nonlinear transformations. The resulting transformations to the parent variables were then used to develop an MFP NB with dispersion heterogeneity due to the roadside variables (heterogeneous NB), with the transformed variables included in the parameter search set for predicting the dispersion parameter alpha. The predictions from the resulting SPFs were then compared using regression model predictive measures across a range of model types including the baseline NB, the baseline heterogeneous NB, the MFP NB, and the random parameter NB. The validity of the functional form transformations obtained through the MFP search algorithm was also tested against the baseline NB model using CURE plots for model predictions versus the transformed and untransformed continuous covariates. Figure 1-1 shows the basic structure of the methodological flow through this dissertation, starting from the development of the modeling dataset from the raw data obtained from the Washington State Department of Transportation (WSDOT), leading into the methodology used to meet the aims of this study.

Going forward, this dissertation is organized as follows. The subsequent chapter provides a comprehensive summary of the extant literature related to the methodologies proposed in this dissertation, along with studies that brought to light the lack of available techniques to account for the effect of improper functional form on data heterogeneity in count model estimation and predictions. Chapter 3 is a summary of the empirical context of this study, along with summary statistics of the key parameters that were available for the analysis, while chapter 4 provides some insight into the theory and methodological sequence behind the research process proposed in this dissertation. Chapter 5 contains the resulting model outputs and their interpretations, followed by the insights obtained from the CURE plots, elasticity calculations, and model prediction measures (mean square error, mean prediction error, root mean square error, etc.). Following this, chapter 6 provides a summary of the results obtained through this line of thought, with discussions on the advantages and shortcomings of the work, and the future direction of the proposed process.

Figure 1-1 Conceptual flow of modeling process


Chapter 2

Literature review

This chapter serves as a summary of work that forms the basis of safety performance function estimations in the field of traffic safety, while providing some insight into the factors that brought about the need for the objectives of this dissertation. The summary will transition into related methods utilized in fields outside of transportation engineering that could have possible applications towards addressing the gaps in the extant literature that this dissertation hopes to fill.

2.1 Statistical analysis of crash counts

Given the enormous impact of crashes both socially and economically, a great deal of research has been performed with the aim of understanding the factors that could affect the probability of crashes, so as to provide insights into policy decisions and targeted countermeasures that could reduce both crash occurrences and the resulting impacts. In general, crashes could result from complex interactions of driver, vehicle, roadway, and environmental characteristics. Data relating to individual drivers, vehicle metrics at the time of a crash, or environmental interactions could provide good situational insight into crash propensities at a cause-and-effect level, but are not always readily available. Crash prediction models are therefore estimated on the factors that could affect crash occurrence over a geographical space such as a segment of roadway or an intersection, over some specified time period. Using the spatial characteristics of the roadway itself to predict crashes ensures that adequately measurable explanatory variables are available (Lord and Mannering, 2010).


While crash records are typically in the form of non-negative integer counts, initial forays into model estimation were done utilizing multiple linear regression techniques. This was not found to be the best technique to employ in the prediction of crash counts, because a linear regression could result in negative predictions, which is clearly impossible in the case of crash predictions. Due to the independent nature of crash observations, Jovanis and Chang (1986) proposed using a Poisson distribution to model crash counts. In a later study by Miaou and Lum (1993), two conventional linear regression models were compared to Poisson estimations, and it was shown that as long as the observed variance was equal to the mean counts, the Poisson estimations performed more efficiently. The authors also attempted to account for cases where the observed variance was greater than the expected mean, by using Wedderburn’s over-dispersion parameter, and concluded by suggesting that the negative binomial model would probably provide more stable estimations of the likelihood of crash occurrence.

An extension of the Poisson distribution, the negative binomial model assumes that the Poisson parameter follows a gamma distribution, resulting in a closed-form equation that allows the mean to differ from the variance. Being simple to interpret and estimate, the negative binomial model is the most frequently used model in crash analysis (Miaou, 1994; Shankar et al., 1995; Poch and Mannering, 1996; El-Basyouny and Sayed, 2006; Lord, 2006; Kim and Washington, 2006; Malyshkina and Mannering, 2010), but as shown by Lord (2006), if the data available is under-dispersed, the model will result in inefficient estimates. Also, the base form of the negative binomial assumes a fixed mean estimated parameter for all observations in the dataset, assuming the effect of the independent variable on crash frequency is the same for every segment of roadway, something that can lead to significant prediction errors in the model specification.
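The Poisson-gamma construction described above can be illustrated with a short simulation. The following is a minimal sketch using only the Python standard library. The mean `mu`, the over-dispersion value `alpha`, the sample size, and all function names are hypothetical choices made for illustration and are not taken from the dissertation dataset. If the Poisson rate is drawn from a gamma distribution with mean `mu` and variance `alpha * mu**2`, the resulting counts should show a variance close to `mu + alpha * mu**2` rather than `mu`.

```python
import math
import random
import statistics


def draw_poisson(rng, rate):
    """Draw one Poisson count using Knuth's multiplication method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def simulate_poisson_gamma(mu, alpha, size, seed=0):
    """Simulate counts whose Poisson rate is gamma distributed with mean mu and variance alpha*mu^2."""
    rng = random.Random(seed)
    # A gamma with shape 1/alpha and scale alpha*mu has mean mu and variance alpha*mu**2.
    shape, scale = 1.0 / alpha, alpha * mu
    return [draw_poisson(rng, rng.gammavariate(shape, scale)) for _ in range(size)]


# Hypothetical values chosen only for illustration.
mu, alpha = 2.0, 0.5
counts = simulate_poisson_gamma(mu, alpha, size=50_000)

sample_mean = statistics.fmean(counts)
sample_var = statistics.pvariance(counts)
theoretical_var = mu + alpha * mu ** 2  # negative binomial (NB2) variance function

print(f"sample mean = {sample_mean:.3f}, sample variance = {sample_var:.3f}")
print(f"theoretical variance = {theoretical_var:.3f}")
```

With `alpha` close to zero the gamma distribution collapses towards `mu`, and the simulated variance approaches the Poisson case in which the variance equals the mean.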


Other techniques to model crash frequencies that have been tried in the past include Poisson-lognormal models (Miaou et al., 2005; Lord and Miranda-Moreno, 2008); zero-inflated count models (Miaou, 1994; Shankar et al., 1997, 2003; Lee and Mannering, 2002; Lord et al., 2005, 2007; Malyshkina and Mannering, 2010); Conway-Maxwell-Poisson models (Lord et al., 2008; Sellers and Shmueli, 2010); Gamma models (Oh et al., 2006; Daniels et al., 2010); generalized estimating equation models (Wang and Abdel-Aty, 2006; Lord and Mahlawat, 2009); generalized additive models (Xie and Zhang, 2008; Li et al., 2011); negative multinomial models (Ulfarsson and Shankar, 2003; Caliendo et al., 2007); and finite mixture and Markov switching models (Malyshkina et al., 2009; Park and Lord, 2009; Malyshkina and Mannering, 2010; Park et al., 2010). The motivation behind all the different distributions that have been tried to date is to find a specification that provides good predictions, while accounting for the various data related issues that hamper statistical analysis. Lord and Mannering (2010) point out that if an incorrect functional form is used, the result will be biased parameter estimates and possibly erroneous inferences with regard to the influence of explanatory variables, but they do not discuss the extent or quantitative impact of improper functional form usage in model building.

As mentioned previously, one major issue with the negative binomial model is that it results in an average estimate for the independent parameters in the regression. In doing so, it assumes that the effects of the variables are consistent across all observations or segments in the dataset. As a means to account for this issue, Hausman et al. (1984), in their study of research and development patents, introduced what is known as a random-effect model, wherein the intercept term was allowed to vary across observations. In the context of crash predictions, this would mean that each segment of roadway in the dataset could have a different intercept term, and a different base number of predicted crashes, depending on the respective segment characteristics. Random effect models are effective in handling spatial and temporal correlation within panel datasets, and have been studied in the context of crash predictions by Johansson (1996), Shankar et al. (1998), Miaou et al. (2003), Quddos (2008), and Sittikariya and Shankar (2009).

As an extension of random effects models, random-parameter models allow for unobserved heterogeneity from one roadway site to another by allowing the model parameters to vary across each individual observation in the dataset, as opposed to only the intercept (Milton et al., 2008). Examples of research that has attempted to use random-parameter models to build safety performance function estimations include Anastasopoulos and Mannering (2009), El-Basyouny and Sayed (2009), and Venkataraman et al. (2014). It would be expected that since the resulting specification from a random-parameter model is essentially a range of combinations of observation specific estimations, the model would provide statistically more accurate predictions than traditional fixed parameter models. While that could be the case, random parameter models are built on the assumption that the covariates that are found to vary across observations follow a set assumed distribution. This is something that might not always be the case, and could be the source of significant heterogeneity in predictions. Some researchers, such as Shugan (2006) and Washington et al. (2010), found that the complexity of estimating a model with random parameters or random effects might not necessarily improve predictive capability, as the ranges of parameter estimates are observation specific, and the resulting specifications might not be transferable to other datasets and sites. That being said, no model in theory is completely transferable to other sites, something that is more data specific, so this cannot be attributed to only model structure. In the most exhaustive summary to date, Lord and Mannering (2010) survey a variety of model specifications employed in the field of transportation safety, but they do not include any information on the usage of MFPs for SPF estimations.
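The distributional assumption behind random-parameter models can be made concrete with a generic specification. The form below is a textbook-style sketch and is not drawn from any of the studies cited above. The symbols $\lambda_i$, $x_i$, $\beta$, $\varphi_i$ and $\sigma^{2}$ are notation introduced here for illustration, and the normal distribution is only one commonly assumed choice.

```latex
\begin{align}
\lambda_i &= \exp(\beta_i' x_i), \\
\beta_i &= \beta + \varphi_i, \qquad \varphi_i \sim N(0, \sigma^{2}).
\end{align}
```

In this notation, a random-effects model corresponds to the case in which only the intercept element of $\beta_i$ carries a random term, while a fixed parameter model sets $\varphi_i = 0$ for every observation. The concern raised above is that the assumed distribution of $\varphi_i$ may not match the data.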


As a summary of relevant concepts thus far, fixed parameter negative binomial models cannot handle under-dispersion of data, and assume an average effect of each independent variable across the dataset. On the other hand, random parameter negative binomial models can be used to obtain observation specific parameter estimates, but these require strong assumptions on the distribution of the random parameters, and while they might provide good predictions on the modeling dataset, they might be prone to transferability issues that could limit their applications to other datasets for other time periods or sites. Extending this chain of thought, it would be desirable to develop a model that would provide good model predictions without the need for distributional assumptions, and that might be more transferable than random parameter models.

2.2 Functional form in count regressions

The need for distributional assumptions in random parameter models, and the lack of efficient predictions from fixed parameter models, arise from the fact that most models assume that the explanatory variables in the model affect the dependent variable in some linear manner. Since the functional form of a variable is critical to the development of a relationship between the dependent and independent variables, incorporating non-linear functional forms into estimation procedures might go a long way towards characterizing the relationships between crash propensities and the independent variables. There are not many studies available in transportation safety analysis that detail methods to find independent variable functional form transformations through some form of statistical process.
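As an illustration of what a statistically searchable family of non-linear functional forms can look like, the following minimal sketch builds fractional polynomial terms for a single positive covariate. It assumes the conventional fractional polynomial power set (-2, -1, -0.5, 0, 0.5, 1, 2, 3), where a power of 0 denotes the natural logarithm and a repeated power multiplies the second term by log(x). Whether the search in this dissertation uses exactly this power set is an assumption, and the function names and the example AADT value are hypothetical.

```python
import math
from itertools import combinations_with_replacement

# Conventional fractional polynomial power set; a power of 0 stands for log(x).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x, power):
    """Return x**power, with power == 0 interpreted as log(x). Requires x > 0."""
    if x <= 0:
        raise ValueError("fractional polynomial terms require a positive covariate")
    return math.log(x) if power == 0 else x ** power


def fp2_terms(x, p1, p2):
    """Return the two basis terms of a degree-2 fractional polynomial.

    When the two powers are equal, the second term is the first term times log(x).
    """
    first = fp_term(x, p1)
    second = first * math.log(x) if p1 == p2 else fp_term(x, p2)
    return first, second


def fp2_candidates():
    """List every unordered pair of powers a degree-2 search would consider."""
    return list(combinations_with_replacement(FP_POWERS, 2))


# Hypothetical covariate value (for example, AADT in thousands of vehicles per day).
aadt_thousands = 4.0
print(fp2_terms(aadt_thousands, 0.5, 0.5))
print(len(fp2_candidates()), "candidate power pairs for one covariate")
```

A search procedure would fit the regression once per candidate and keep the powers that give the best fit, which is why the number of candidates matters when the modeling dataset is large.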

In their 2013 report to the Federal Highway Administration Office of Safety, Srinivasan et al. (2013) discuss the need for using functional form transformations of independent variables to obtain better fitting SPFs. The authors quote excerpts from Hauer (2004) and Kononov et al. (2011), leading to the conclusion that different types of exploratory analysis need to be conducted to determine the appropriate functional form of the relationship between crash counts and the independent variables, along with the possible need for interaction terms between independent variables. The authors suggest using plots to illustrate the functional form of the relationship between crash counts and the different independent variables, and using classification and regression trees (CART) to identify which independent variables and interactions are relevant in the model.

In terms of building a model equation, choosing appropriate functional forms for variables that are to be included in the model, and examining whether the chosen functional form fits the data well, Hauer (2004) argues that statistical safety model building is essentially the curve-fitting of various parameters to observed crash data, without any foundation for understanding the true nature of the effects of the independent variables, and with little theoretical basis for the functional forms available to the modeler. In the author’s words, the sought after cause-effect relationships, while present in the data, are hidden by the randomness in crash counts, obscured by imprecise trait data, covered by layers of interdependencies and missing information.

The author suggests the use of CURE plots ordered by a variable of interest, to visually interpret the nature of the relationship between crash counts and the independent variable, for different ranges of the independent variable, and choose a functional form transformation for the variable that would help reduce the deviation of the CURE line from the “zero” line.
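The quantity plotted in a CURE (cumulative residuals) plot can be sketched as follows. This is a minimal illustration rather than Hauer's own procedure: the function name and the sample AADT, observed, and predicted values are hypothetical, and the confidence bands usually drawn around the CURE line are omitted.

```python
from itertools import accumulate


def cure_points(covariate, observed, predicted):
    """Return (covariate value, cumulative residual) pairs ordered by the covariate."""
    # Order the sites by the variable of interest (for example, AADT).
    rows = sorted(zip(covariate, observed, predicted), key=lambda r: r[0])
    # Residual = observed crash count minus model prediction.
    residuals = [obs - pred for _, obs, pred in rows]
    # Running sum of the residuals in covariate order.
    cumulative = list(accumulate(residuals))
    return [(row[0], c) for row, c in zip(rows, cumulative)]


# Hypothetical data for four segments.
aadt = [5000, 1200, 3000, 800]
observed = [4, 1, 2, 0]
predicted = [3.5, 1.2, 2.4, 0.9]
points = cure_points(aadt, observed, predicted)
```

A long run of the cumulative residual drifting away from zero over some range of the covariate is the kind of pattern that would suggest trying a different functional form for that variable.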

As a means of providing some quantifiable statistical basis for the search for independent variable functional form relationships, Kononov et al. (2008, 2011) used a general class of non-linear models, neural networks, to explore the underlying relationships between variables without being limited to pre-selected mathematical functions, and without being constrained by underlying distributional assumptions. The authors relate their method to the dose-response relationship often found in medicine and pharmacology. They found that the relationship between crashes and Average Annual Daily Traffic was well described by a neural network sigmoid curve, and that crash data for urban freeways exhibited extra variation, or over-dispersion, relative to a Poisson model.

Many methods exist for finding variable functional forms that improve model fit, but very few of them involve educated guesses for transformations based on past knowledge. Royston and Altman (1994) proposed a method for fractional polynomial (FP) function finding, primarily for the case of a single predictor, but also laid out an algorithm for fitting fractional polynomials in multivariable models. In 1999, Sauerbrei and Royston took the FP search forward and, by combining a backward elimination method with the search for the most suitable FP transformation for continuous predictors, proposed the multivariable fractional polynomial (MFP) search algorithm. In simple terms, the method tests each continuous covariate in the model for different combinations of powers taken from the search set {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, and compares these against either including the variable in the model in a linear form, or removing the variable from the specification. The option resulting in the best -2LogL deviance is fixed as the optimum functional form for that variable, and the next significant continuous variable is then tested, until all the variables have fixed functional forms.
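The candidate transformations examined for a single covariate can be enumerated as in the sketch below. It assumes the standard fractional polynomial conventions (a power of 0 denotes the natural logarithm, and a repeated power pairs x^p with x^p ln x); the function names are illustrative. The sketch only generates the candidates: the deviance comparison itself would require fitting the count model for each candidate, which is not shown.

```python
import math
from itertools import combinations_with_replacement

# Power search set of the fractional polynomial family.
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x, p):
    """Single FP term; by convention, power 0 denotes ln(x)."""
    if x <= 0:
        raise ValueError("fractional polynomials require a positive covariate")
    return math.log(x) if p == 0 else x ** p


def fp2_terms(x, p1, p2):
    """Two-term FP transformation; a repeated power uses x^p and x^p * ln(x)."""
    first = fp_term(x, p1)
    second = first * math.log(x) if p1 == p2 else fp_term(x, p2)
    return first, second


# One-term candidates (8) and two-term candidates (36, including repeated powers).
fp1_candidates = [(p,) for p in FP_POWERS]
fp2_candidates = list(combinations_with_replacement(FP_POWERS, 2))
```

Under these conventions each continuous covariate has 8 one-term and 36 two-term candidates, which is why the search stays tractable even when several covariates are examined in turn.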

As such, this search method depends entirely on the observed distribution of the data, and does not require any distributional assumptions. The algorithm limits the number of functional forms that are tested for each continuous covariate, but even then the functions provide a rich class of possible functional forms leading to a satisfactory fit to the data in many situations (Sauerbrei et al., 2005). Examples of research incorporating MFP searches include Valecky (2012), who used negative binomial models to model individual insured crash risk, using MFPs to correct for a badly specified link function; and Silk et al. (2009), who investigated improving a medical admissions risk system using MFP based models. As such, the MFP search, while not yet extended to many types of modeling techniques, is not model specific. Any regression involving a maximum log-likelihood optimization procedure can be employed in the search algorithm. Also, stability investigations with the bootstrap have shown that MFPs can find stable models, despite the considerable flexibility of the family of FPs and the consequent risk of over-fitting when several variables are considered (Royston and Sauerbrei, 2003).

To summarize, there are many types of models that have been employed in the extant literature to predict crash frequencies on roadway segments, but these models either involve fixed parameters with largely linear effects on the dependent variables, or the models generate parameters that vary across observations, but require distributional assumptions to draw the parameter distributions. The need that this dissertation is aiming to fill is to generate specifications that do not make any assumptions on parameter distributions, while accounting for possible nonlinear relationships in the effect of the independent variables on crash predictions, based solely on the observed data. The MFP search provides a viable option coupled with some form of a negative binomial regression.

Another concern with the NB-1 and NB-2 models is that they provide a fixed estimate of the over-dispersion term alpha, which provides an understanding of, and therefore a correction for, the heterogeneity in the data being modeled. Just as with the parameter estimates, this assumes that the nature of the heterogeneity is consistent across all observations in the dataset. To get around this issue, the heterogeneous negative binomial extends the negative binomial model by allowing observation specific parameterization of the ancillary parameter alpha. In other words, the value of alpha is partitioned by user specified predictors, providing information regarding which predictors influence the over-dispersion in the data, while allowing one to determine whether over-dispersion varies over the significant predictors of alpha (Hilbe 2007).
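For reference, the distinction can be written out under the common NB-2 parameterization. The covariate vector $z_i$ and coefficient vector $\gamma$ are notation assumed here for illustration, not symbols taken from this dissertation:

\[
\mathrm{E}[y_i] = \mu_i = \exp(x_i'\beta), \qquad \mathrm{Var}[y_i] = \mu_i + \alpha\,\mu_i^{2} \quad \text{(NB-2, fixed } \alpha\text{)}
\]

\[
\mathrm{Var}[y_i] = \mu_i + \alpha_i\,\mu_i^{2}, \qquad \alpha_i = \exp(z_i'\gamma) \quad \text{(heterogeneous negative binomial)}
\]

Under this form, the exponential keeps each $\alpha_i$ positive, and a significant element of $\gamma$ would indicate a predictor over which the over-dispersion varies.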

The ultimate aim of this dissertation is therefore to employ MFP transformations to negative binomial regressions as a means of finding empirically based functional form transformations to target heterogeneity in model predictions; followed by a combination of the MFP algorithm and a heterogeneous negative binomial distribution to counter the heterogeneity that exists in the crash observations.


Chapter 3

Overview of study area and data descriptions

The study area for this dissertation consisted of 142 State Highways in Washington State, as shown in the following map obtained through the WSDOT GeoPortal application. Table 3-1 lists the centerline mileage applicable to each of the routes as they pertain to the raw datasets provided by WSDOT, resulting in a total of approximately 5,443 centerline miles.

Figure 3-1 Centerline mileage for 142 State Routes in Washington State


Table 3-1 Study scope: centerline miles by State Route

State Route # (mileage in miles)
2 (325.36) | 104 (23.97) | 165 (14.66) | 270 (0.65) | 508 (32.41)
3 (60.02) | 105 (44.5) | 169 (17.67) | 271 (8.38) | 510 (7.83)
4 (55.24) | 106 (20.09) | 170 (3.68) | 272 (19.2) | 525 (20.53)
6 (50.95) | 107 (7.91) | 172 (32.86) | 274 (1.4) | 530 (50.71)
7 (39.09) | 108 (11.92) | 173 (11.84) | 278 (5.44) | 531 (5.48)
9 (68.37) | 109 (35.83) | 174 (40.56) | 281 (9.68) | 532 (8.99)
10 (15.86) | 110 (11.02) | 182 (0.31) | 282 (4.43) | 534 (5.04)
11 (17.96) | 112 (61.07) | 193 (2.4) | 283 (14.86) | 536 (1.29)
12 (408.96) | 113 (9.59) | 194 (20.08) | 290 (1.37) | 539 (3.57)
14 (161.77) | 115 (2.28) | 195 (80.38) | 291 (24.66) | 542 (52.39)
17 (136.53) | 116 (7.2) | 197 (1.65) | 292 (5.91) | 543 (0.86)
18 (6.76) | 119 (10.92) | 202 (17.12) | 300 (3.35) | 544 (9.01)
19 (14.09) | 121 (5.01) | 203 (23.19) | 302 (9.31) | 546 (8.02)
20 (436.91) | 122 (6.41) | 206 (14.2) | 305 (2.71) | 547 (10.59)
21 (189.51) | 123 (1.57) | 207 (4.26) | 307 (5.13) | 548 (12.82)
22 (35.69) | 124 (43.3) | 211 (14.55) | 395 (95.05) | 702 (9.32)
23 (65.79) | 125 (15.5) | 215 (6.24) | 401 (12) | 706 (13.58)
24 (72.73) | 127 (27.02) | 221 (26.07) | 409 (3.64) | 730 (6.07)
25 (121.09) | 128 (1.74) | 223 (3.74) | 410 (90.21) | 821 (24.6)
26 (133.49) | 129 (35.01) | 224 (4.68) | 411 (7.44) | 900 (2.22)
27 (81.99) | 131 (1.95) | 225 (11.32) | 432 (0.84) | 902 (12.08)
28 (122.01) | 141 (29.06) | 231 (75.16) | 500 (5.83) | 903 (10)
31 (26.03) | 142 (34.39) | 240 (21.38) | 501 (11.12) | 904 (16.77)
41 (0.33) | 150 (11.7) | 241 (25.21) | 502 (2.92) | 906 (2.48)
97 (335.99) | 153 (30.7) | 243 (28.26) | 503 (44.81) | 970 (10)
100 (4.61) | 155 (80.4) | 260 (39.34) | 504 (50.05) | 971 (14.48)
101 (350.12) | 160 (3.8) | 261 (62.47) | 505 (19.29) |
102 (2.8) | 161 (15.59) | 262 (22.45) | 506 (11.11) |
103 (19.36) | 164 (7.07) | 263 (0.81) | 507 (38.97) |


3.1 Database evolution

The database used for this study consisted of crash records for nine years from 2002 to 2010 for the aforementioned 142 State Routes in Washington State. The raw data obtained from WSDOT included crash information, Annual Average Daily Traffic (AADT) information, roadway geometrics, and roadside information in separate files that were aggregated into a complete dataset to be used for modeling purposes. Roadway geometrics and roadside feature data were obtained through various WSDOT sources including the Washington State Highway Log, Roadside Feature Inventory Program (RFIP), WSDOT GeoPortal Map Application, and the Washington State Route Web Tool (SRweb). In order to combine all the available sources of information in a meaningful manner, certain calculations and modifications needed to be incorporated into the process. Going forward, this section will describe the raw data, along with the procedure for processing the raw data into the final integrated homogeneous database.

3.1.1 Crash data

The most comprehensive of the four data sources, the raw crash information was obtained from the WSDOT Statewide Travel and Collision Data Office (STCDO) in two files, one covering 2002-2005, and the other 2006-2010. Each file contained 120 variables, with 528,385 observations in the former, and 627,311 observations in the latter. The 120 variables are described in related groups to facilitate more effective understanding of the available data.

Table (3-2) shows the identifying information used by WSDOT for each observed crash in each of the datasets. The History Indicator, as the name suggests, determines whether the vehicle or individuals involved in the current crash had any prior crash history. The responses for the History Indicator were labeled as Yes, No, or Suspected. The Collision Report Number serves as an identification number to distinguish between each crash, and repeated Collision Report Numbers correspond to multiple persons or vehicles involved in a single incident. The Collision Report Type refers to the jurisdiction or facility associated with the crash. The date and time of each incident were also provided, with additional information for the quarter of the year in which the incident occurred.

Table 3-2 Crash identification information from WSDOT raw data

Identification Information | Description
History Indicator | yes, no, suspected
Collision Report Number | report identifier
Collision Report Type | state route
Full Date | month/day/year
Year | 4-digit year
Yearmo | year (4-digit), month (2-digit)
Month Name | full name of month
Month Number | numerical month (1 - 12)
Day Of Week | numerical day (1 - 7)
Quarternumber | Q1, Q2, Q3, Q4
Full Time | XX:XX AM/PM
Full Time 24 | XX:XX
Hour 24 | XX00

Location related information contained in the raw crash dataset is captured by the variables listed in Table (3-3). The location information includes State Route identification, and the mile post markers associated with each observation, with the nearest mile post recorded as the nearest known location of the crash, along with an indicator as to whether the mile post marker was ahead of or behind the crash location. The accumulated route mileage (ARM) variable shows the accumulated roadway centerline miles from the begin station to the current milepost. While milepost marker numbers need not be updated to completely reflect changes made to the length of the roadway alignments, ARM information is continuously updated by WSDOT, and is used to obtain an accurate estimate for the centerline mileage of each route.

Table 3-3 Crash location information from WSDOT raw data

Location | Description
SR-ID | state route number; 7-digit character indicates non-mainline facility
Mile Post | nearest mile post marker
Mile Post Ahead Back Indicator | type location of mile post marker to crash location
ARM | accumulated route mileage
State Route Number (A, B) | state route number
Related Roadway Type | facility type based on RRT acronym
Related Roadway Qualifier | supplemental facility information for RRT (if required)
Region Name | geographic region of Washington
City Name | reported if occurred within city boundaries
County Name | name of county

The Facility Information from the raw crash datasets contains descriptors on the roadway facility where the crash occurred. The facilities descriptors are based on standards established by WSDOT and the Federal Highway Administration (FHWA), and are listed in Table (3-4).

Table 3-4 Raw Crash Data Facility Information

Facility Information | Description
State Functional Class Code | 2-character code
Urban Rural | urban or rural classification
Federal Functional Class Name | federal class name of facility
Traffic Control Type Description | traffic control device (if applicable)
Posted Speed Limit | speed limit
Roadway Type Description | description of arterial

The State Functional Class Code consists of a two character identification code with the prefix of R or U signifying rural or urban arterial classification. The Urban Rural column lists whether the facility is considered as an urban or rural arterial, while the Federal Functional Class Name uses the FHWA standards for naming the facility. The Traffic Control Device description indicator provides information on the type of control at the crash site if it happens to be in the vicinity of, or related to an intersection, i.e. the intersection type would determine whether a stop-control or signal served as the traffic control device. The Roadway Type Description depicts features of the roadway that may not be captured by the state or federal classification standards, such as the presence of a roadway barrier, an undivided arterial, or a bridge section.

Following this, the next group of variables pertains directly to the collision information. Table (3-5) displays the 43 descriptors that provide collision related information in the dataset.

Table 3-5 WSDOT raw data depicting information on the crash

Collision Information | Description
Number Of Fatalities | count of fatal (0-5)
Number Of Injuries | count of all injury types (max = 47)
Number Of Pedal Cyclists Involved | count of pedal cyclists (max = 4)
Number Of Pedestrians Involved | count of pedestrians (max = 7)
Number Of Motor Vehicles Involved | count of motor vehicles (max = 33)
Vehicle 1 Compass Direction Description | compass direction (NW, N, NE, etc.)
Vehicle 1 Movement Description | vehicle movement: straight, turning movement, etc.
Vehicle 1 Milepost Direction Description | direction of travel: increasing, decreasing, entering, etc.
Diagram Collision Type Description | description of collision
Vehicle 2 Compass Direction Description | compass direction (NW, N, NE, etc.) (if applicable)
Vehicle 2 Movement Description | vehicle movement: straight, turning movement, etc. (if applicable)
Vehicle 2 Milepost Direction Description | direction of travel: increasing, decreasing, entering, etc. (if applicable)
Impact Location Description | facility type, lane, direction of travel
Second Impact Position Description | facility type, lane, direction of travel (if applicable)
Unit Number | number assigned to person/vehicle involved in crash


Table 3-5 WSDOT raw data depicting information on the crash (continued)

Collision Information | Description
Unit Type Description | of transportation
Most Severe Injury Type | most severe reported: dead at scene
Collision Severity | fatal, PDO, injury
First Collision Type | initial crash type
First Object Struck | applicable to fixed object collision
Second Collision Type | subsequent crash that occurred (if applicable)
Second Object Struck | applicable to subsequent fixed object collision
Junction Relationship | junction related (intersection, driveway, roundabout, etc.)?
Hazardous Material | hazmat transport released or not released (if applicable)
Fire | yes, no
Stolen | yes, no
Hit And Run | yes, no
Contributing Circumstance 1 | event that led to crash
Contributing Circumstance 2 | subsequent secondary event that led to crash (if applicable)
Contributing Circumstance 3 | subsequent tertiary event that led to crash (if applicable)
DRE Assessment Code 1 | drug Recognition Expert Code 1 (0-9)
DRE Assessment Code 2 | drug Recognition Expert Code 2 (0-9)
Involved Person Action 1 | action that person 1 did related to the crash
Involved Person Action 2 | action that person 2 did related to the crash (if applicable)
Involved Person Action 3 | action that person 3 did related to the crash (if applicable)
Pedacyclist Actions | action that pedal cyclist did related to the crash (if applicable)
Pedestrian Actions | action that pedestrian did related to the crash (if applicable)
Sequence Of Event 1 | description of crash event
Sequence Of Event 2 | description of secondary crash event (if applicable)
Sequence Of Event 3 | description of tertiary crash event (if applicable)
Sequence Of Event 4 | description of quaternary crash event (if applicable)
Compass Direction From | direction of travel departure
Compass Direction To | direction of travel arrival


The Number of Fatalities variable was found to fall within the range of zero to five, while the Number of Injuries in the raw data was found to have a maximum recorded count of 47 and a minimum of zero. The non-motorized collision information reports a maximum of four for Cyclists and seven for Pedestrians, while Motor Vehicles report a maximum of 33. Data pertaining to vehicle involvement in the collision is captured by the vehicle prefix descriptors. The Diagram of Collision Type briefly describes the nature of the crash by providing more detail than just identifying the crash type. Impact Location Description identifies the type of arterial on which the crash took place, as well as where it occurred in terms of the location within the arterial. The variables Most Severe Injury Type, Collision Severity, First Collision Type, First Object Struck, Second Collision Type, and Second Object Struck describe crash outcomes. The variables relating to Contributing Circumstance, DRE Assessment Code, Involved Person Action, Pedacyclist Action, and Pedestrian Actions provide circumstantial information that describes the various factors leading into the incident.

Table (3-6) shows driver information related variables, providing insight into the demographics of the individuals involved in the crash, and whether they were passengers or drivers. This section also includes information on crash related factors, described by Sobriety Level, Alcohol Test Result, Clothing Visibility Type, Pedestrian Pedacyclist Was Using (pedestrian facility), and Pedestrian Pedacyclist Type (nonmotorized transportation), while Injury related factors are explained by Helmet Use, Seat Position, and Restraining System Type. Outcomes relating to deployment of airbag, ejection status of occupant, and most importantly, the resulting Injury Type to the individual involved in the crash are also provided. All of the possible recorded Injury Types are listed in the respective description column.


Table 3-6 WSDOT raw data describing the circumstances of the driver/individual involved in the crash

Driver Information | Description
Involved Person Type | vehicle (passenger, driver), pedal cyclist (passenger, driver), pedestrian
Age | age of individual
Gender | male, female
Air Bag Type | type of air bag, deployed (yes, no), equipped (yes, no)
Ejection Status | not ejected, partially ejected, totally ejected, unknown
Helmet Use | used, not used, unknown (if applicable)
Injury Type | dead at scene, dead on arrival, died at hospital, evident injury, no injury, non-traffic fatality, non-traffic injury, possible injury, serious injury, unknown
Seat Position | location individual was seated in at time of crash
Sobriety Level | had been drinking or not, level of impairment
Liability Insurance | yes, no
On Duty Indicator | yes, no
Restraining System Type | type of seatbelt used
Alcohol Test Result | blood alcohol content (if applicable)
Clothing Visibility Type | light, dark, mixed, reflective (if applicable)
Pedestrian Pedacyclist Was Using | pedestrian facility used (i.e. crosswalk, sidewalk)
Pedestrian Pedacyclist Type | nonmotorized mode of transportation used

The seven variables that provide insight into the physical environmental conditions reported at the time of each incident are shown in Table (3-7). The information available includes detail on the roadway environment, weather conditions, and special circumstances involved in each crash. The Weather classifications are limited to visibility-related designations, namely, clear, fog, rain, snow, etc. Similarly, the Roadway Surface Condition category identifies the elements on the roadway at the time of the reported crash that could affect traction, such as dry, ice, oil, other, sand/mud/dirt, snow/slush, standing water, unknown, or wet. Lighting Conditions identifies the source of illumination while loosely implying the time of day by indicating daylight or dark with or without street lights. Location Characteristics highlight unique features such as bridge, parking lot, shopping mall, tunnel, etc., along the arterial segment that may have some involvement with those respective crashes, and the Roadway Characteristic column provides a general description of the geometrics present on the crash segment. In the event that a crash occurred within a work zone, there are three possible outcomes listed in the Work Zone parameter: In External Traffic Backup, Within Work Zone, and Workers Present. The Investigative Agency descriptor is listed to show the party responsible to gather additional information relating to the crash if an investigation was required.

Table 3-7 WSDOT crash records depicting physical environment conditions at the time of the incident

Environmental Conditions | Description
Weather | clear, fog, rain, snow, etc.
Roadway Surface Condition | wet, dry, ice, etc.
Lighting Condition | daylight, streetlights, etc.
Location Characteristics | unique roadway elements (bridge, parking lot, school zone, etc.)
Roadway Characteristic | straight, curve, grade, etc.
Workzone | within work zone boundaries (if applicable)
Investigative Agency | agency responsible for investigating crash (if applicable)

The last category of Vehicle Information shown in Table (3-8) lists the descriptors that define the type of vehicles involved in each incident; the commercial carrier and commercial vehicle information only applies if those types of vehicles were involved in the reported crash. The vehicle involved in the crash, regardless of personal or commercial transport classification, is described by Vehicle Type, Vehicle Classification, Vehicle Use, VIN, and Registered State. Vehicle Action explains what activity the vehicle was engaged in at the time of the crash, while Vehicle Condition pertains to the operating condition of the vehicle prior to involvement in the crash. For instance, a vehicle’s headlights not being in operating condition prior to the crash may be a contributing factor to the crash. Commercial Carrier and Commercial Vehicle descriptors are also recorded in the dataset.


Table 3-8 Raw crash data vehicle information

Vehicle Information | Description
Vehicle Type | type of vehicle involved in crash
Vehicle Classification | applicable to non-passenger cars
Vehicle Use | special vehicle purpose (law enforcement, tow truck, vanpool, etc.)
VIN | vehicle identification number
Registered State | state vehicle is registered in
Vehicle Action 1 | action vehicle 1 was engaged in at the time of the crash (turning, parked, etc.)
Vehicle Condition 1 | operating condition of vehicle 1 before crash
Vehicle Condition 2 | operating condition of vehicle 2 before crash (if applicable)
Vehicle Condition 3 | operating condition of vehicle 3 before crash (if applicable)
Commercial Carrier Address | applies to crash involving commercial transport vehicle
Commercial Carrier City Name | applies to crash involving commercial transport vehicle
Commercial Carrier Name | applies to crash involving commercial transport vehicle
Commercial Carrier State Code | applies to crash involving commercial transport vehicle
Commercial Carrier Zip Code | applies to crash involving commercial transport vehicle
Commercial Vehicle Cargo Body Type | transport type/purpose (bus, dump, flatbed, garbage/refuse, etc.)
Commercial Vehicle Class | class: bus, tractor/doubles/semi-trailer/triples, truck tractor, etc.
Commercial Vehicle Name Source | driver, log book, shipping papers, side of vehicle
GVWR | gross Vehicle Weight Rating
Hazardous Material Name | applies if hazardous materials transported
Interstate Intrastate | intrastate or interstate travel
Number Of Axles | axle count (max = 60)
Placard Number | 1-to-4-digit identification code
Placard Suffix Type Code | combustible, corrosive, explosive, gas, radioactive, infectious, oxidizer, other
USDOT Number | 1-to-10-digit identification code


Since crash prediction SPFs are the primary aim of this research, the raw crash data forms the foundation of the modeling dataset, around which the raw AADT, roadway geometric, and roadside datasets revolve. In making the combined dataset, the crash information from 2002 to 2010 was formatted to sort all crashes by their associated State Route, mile post, and year. The crash data was then arranged by the “Collision Severity” category, which lists the crashes as being either Fatal, Injury only, or Property Damage only. The other categories from the raw data that were sorted and parsed include “Number of Injuries”, “Number of Motor Vehicles Involved”, “Impact Location Description”, “Collision Severity”, “First Collision Type”, and “Injury Type”. The most critical information obtained through the crash database concerns the mile post markers that define the locations of the crashes. These mile post markers play an integral part in organizing the crash data in the final crash databases. Additionally, having the known locations of crash occurrence along each State Route allows for analysis of the non-crash areas along the routes. Therefore, in regard to segmentation, the crashes were associated to roadway segments based on the starting and ending mile markers that they were contained within.

The manner in which the crash information was accounted for was correspondingly transposed from mile marker to roadway segment. The procedure for assigning crashes to roadway segments was maintained consistent for all of the segment databases since the crashes were eventually tabulated from counts within each respective segment. Once the crash counts were consolidated, additional crash related attributes were also included while maintaining consistency with the count format. The integration of this processed data will be explained in further detail in the forthcoming sections.
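The assignment of crashes to roadway segments described above can be sketched as follows. This is an illustrative sketch only: the segment boundaries, crash mileposts, function names, and the convention that a crash on a shared boundary belongs to the downstream segment are all assumptions, not the procedure actually used to build the study databases.

```python
from bisect import bisect_right
from collections import Counter


def assign_segment(milepost, boundaries):
    """Return the index of the segment containing milepost, or None if off-route.

    Segment i spans boundaries[i] to boundaries[i + 1]. Assumption: a crash on
    a shared boundary is assigned to the downstream segment.
    """
    if milepost < boundaries[0] or milepost > boundaries[-1]:
        return None
    i = bisect_right(boundaries, milepost) - 1
    # A crash exactly at the final boundary joins the last segment.
    return min(i, len(boundaries) - 2)


def crash_counts(crash_mileposts, boundaries):
    """Tabulate crash counts per segment, keeping zero-crash segments."""
    counts = Counter()
    for mp in crash_mileposts:
        seg = assign_segment(mp, boundaries)
        if seg is not None:
            counts[seg] += 1
    return [counts.get(i, 0) for i in range(len(boundaries) - 1)]


# Hypothetical route with three segments and six reported crashes.
boundaries = [0.0, 1.45, 1.87, 3.10]
crashes = [0.20, 1.45, 1.76, 1.80, 3.10, 4.00]
counts = crash_counts(crashes, boundaries)
```

Returning a count for every segment, including those with no crashes, keeps the non-crash portions of each route in the dataset, which matters for count models.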


3.1.2 Average annual daily traffic data

WSDOT regularly releases an Annual Traffic Report (ATR) that presents AADT data collected at various station locations in Washington State. Specific to this study, the AADT data obtained from WSDOT lists the values for each State Route at specific mile post markers. These traffic counts are presented for each State Route disaggregated by year. A sample of the AADT results, captured by the Transportation Information and Planning Support (TRIPS) system, from the ATR is displayed in Figure 3-2.

Figure 3-2 Sample annual traffic report from the WSDOT TRIPS system

While readily available, the AADT count information obtained through the WSDOT TRIPS system was found to be recorded not for every milepost along a specific route, but dispersed along the length of the entire state route, with about 10 miles between successive data collectors. This complicated matters, as net AADT after intersection and ramp traffic flows was not completely accounted for in the raw data, and not all of the given mile posts matched the raw crash data mileposts. To bridge the gap between the AADT data and the crash data, traffic flows at the absent mileposts were calculated via an area-weighted AADT interpolation. This process will be described in further detail.

The need for interpolating the traffic flow counts arises from the fact that modeling datasets require high resolution AADT counts at closely bound mile post sections or narrower mile post intervals. While the roadway segment lengths differ among the datasets, the procedure used to interpolate the AADT counts applies to all of the modeling datasets. The manner in which the weighted-area AADT calculations were executed will be explained using the example shown in Figure 3-3.

Figure 3-3 AADT segment Interpolation

The first step in the process is to identify the ARM of the mileposts where WSDOT ADT counts were observed. From this, the Beginning Mile Post (BMP) and the Ending Mile Post (EMP) were established as the bounds for the analysis segment for which the ADT was to be obtained. From the example in Figure 3-3, the BMP ADT is 7300 at mile post 1.45 while the EMP ADT is 7100 at mile post 1.87. Next, the location of the EMP of the crash analysis segment is to be identified, which in this example is mile post 1.76. Subsequently, the following linear interpolation equation is applied to obtain the ADT at the end of the crash analysis segment.

\[
\text{BMP ADT} + \frac{(\text{EMP ADT} - \text{BMP ADT}) \times (\text{CSEMP} - \text{BMP})}{(\text{EMP} - \text{BMP})} = \text{CSEMP ADT} \qquad \text{Eq. (1)}
\]

Where: BMP ADT is the count at the Beginning Mile Post

EMP ADT is the count at the Ending Mile Post

BMP is the location of the Beginning Mile Post marker

EMP is the location of the Ending Mile Post marker

CSEMP is the location of the Crash Segment Ending Mile Post marker

CSEMP ADT is the count at the Crash Segment Ending Mile Post

For this example, the length-based weighted ADT is obtained as:

\[
7300 + \frac{(7100 - 7300) \times (1.76 - 1.45)}{(1.87 - 1.45)} = 7152.4 \ \text{ADT at mile post } 1.76
\]

The aforementioned interpolation procedure was applied to all of the final databases at the mile marker locations and distances corresponding to each respective crash segment. Because the interpolation procedure depends only on the locations of known mile post markers, the calculations could be implemented for all five databases.
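The interpolation in Eq. (1) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `interpolate_adt` and its argument names are hypothetical, and the range check is an added assumption rather than a documented part of the study's procedure.

```python
def interpolate_adt(bmp, bmp_adt, emp, emp_adt, csemp):
    """Linearly interpolate the ADT at a crash-segment ending mile post (Eq. 1)."""
    if emp == bmp:
        raise ValueError("BMP and EMP must be distinct mile posts")
    if not (min(bmp, emp) <= csemp <= max(bmp, emp)):
        # Assumption: only interpolate between the two count locations.
        raise ValueError("CSEMP must lie between BMP and EMP")
    slope = (emp_adt - bmp_adt) / (emp - bmp)  # change in ADT per mile
    return bmp_adt + slope * (csemp - bmp)


# Worked example from the text: counts of 7300 at MP 1.45 and 7100 at MP 1.87.
adt_176 = interpolate_adt(bmp=1.45, bmp_adt=7300, emp=1.87, emp_adt=7100, csemp=1.76)
print(round(adt_176, 1))  # 7152.4
```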


3.1.3 Roadway geometrics data

During the process, two different sources from WSDOT were employed to obtain roadway geometric information for the State Routes in Washington State. The first source is the Washington State Highway Log, which is released by WSDOT on a yearly basis and accounts for all road mileage in Washington State. The second is the WSDOT GIS and Roadway Data Office (GRDO) Linear Referencing System (LRS), which lists horizontal and vertical alignment information, number of lanes, and roadway width and shoulder width information for all State Routes in the state.

The Washington State Highway Log is divided into seven separate data files that include one comprehensive log and six regional logs. The information contained in the report lists each State Route and the associated highway features, width and surface information, and classifications by mile post. The State Highway Log differentiates this information through three unique descriptors: 1) State Route Number, 2) Related Roadway Type (RRT), and 3) Related Roadway Qualifier (RRQ). The State Route (SR) is identified by a three-digit number for the specific route, ranging from SR 002 to SR 971. The RRT identifies the type of the roadway as listed in Table (3-9).

Table 3-9 State highway log related roadway type identifiers

| RRT | Definition | RRT | Definition |
| --- | --- | --- | --- |
| AR | Alternate Route | TR | Temporary Route |
| CO | Couplet | CD | Collector Distributor Dec |
| FD | Frontage Road Dec | CI | Collector Distributor Inc |
| FI | Frontage Road Inc | LX | Crossroad within Interchange |
| FS | Ferry Ship (Boat) | P1 – P9 | Off Ramp, Inc |
| FT | Ferry Terminal | Q1 – Q9 | On Ramp, Inc |
| PR | Proposed Route | R1 – R9 | Off Ramp, Dec |
| RL | Reversible Lane | S1 – S9 | On Ramp, Dec |
| SP | Spur | HD | Grade-Separated HOV-Dec |
| TB | Transitional Turnback | | |


Within the Washington State Highway Logs, the RRTs used in the report include Alternate Route, Couplet, Reversible Lane, Spur, Grade-Separated HOV-Dec, Grade-Separated HOV-Inc, and Mainline. The RRQ further characterizes the SR and RRT information by listing location specific information such as street names and mile posts. The RRTs that are assigned RRQ information include: Alternate Route (AR), Couplet (CO), Proposed Route (PR), Reversible Lane (RL), Spur (SP), Transitional Turnback (TB), and Temporary Route (TR). The Alternate Routes do not contain any RRW information.

The resulting road log released by the Washington State TRIPS system as part of the State Highway Log Report forms the base for the geometric database. A sample portion of the TRIPS report is shown in Figure 3-4.

Figure 3-4 Sample highway log report from TRIPS


Additionally, the horizontal and vertical alignment information, number of lanes, and roadway width and shoulder width data obtained through the GRDO Linear Referencing System were compiled in Excel spreadsheets corresponding to each individual geometric roadway feature class. The horizontal alignment, number of lanes and roadway width, and shoulder width information cover the period from 2004 to 2011, while the vertical alignment data were obtained for the period from 2006 to 2011.

As shown in Table (3-10), the raw horizontal alignment data lists the main components of each horizontal curve. The horizontal curves listed progress in the increasing mile post direction.

Table 3-10 Raw geometric data horizontal alignment information

| Horizontal Alignment Information | Definition |
| --- | --- |
| LRS_Date | date input into Linear Referencing System |
| SRID | state route ID |
| SR | state route |
| RRT | related route type |
| RRQ | related route qualifier |
| BegARM | beginning accumulated route mileage |
| EndARM | ending accumulated route mileage |
| BegMP | beginning mile post |
| BegAB | beginning mile post ahead/back |
| EndMP | ending mile post |
| EndAB | ending mile post ahead/back |
| HorizontalCurvePointOfTangencyArm | horizontal curve PT accumulated route mileage |
| HorizontalCurvePointOfCurvatureArm | horizontal curve PC accumulated route mileage |
| HorizontalCurveType | horizontal curve or angle |
| HorizontalCurveRadius | radius of curve (R) |
| HorizontalCurveMaximum(Super)Elevation | max super elevation (e) |
| HorizontalCurveLength | length of curve (L) in feet |
| HorizontalCurveDirection | curve left or curve right |
| HorizontalCurveCentralAngle | angle of deflection (∆) in degrees |

The vertical alignment data includes all of the pertinent vertical curvature information for all State Routes, as shown in Table (3-11). The raw data was found to use nonstandard nomenclature to reference vertical curve attributes to mile post markers. For example, instead of using the definition of Vertical Point of Curvature (VPC), the raw data references the Beginning Vertical Curve Accumulated Route Mileage.

Table 3-11 Raw geometric data vertical alignment information

| Vertical Alignment Information | Definition |
| --- | --- |
| LRS_Date | date input into linear referencing system |
| SRID | state route ID |
| State Route Number | state route |
| Related Route Type | related route type |
| Related Route Qualifier | related route qualifier |
| Begin ARM | beginning accumulated route mileage |
| End ARM | ending accumulated route mileage |
| Begin SRMP | beginning state route mile post |
| Begin AB | beginning mile post ahead/back |
| End SRMP | ending state route mile post |
| End AB | ending mile post ahead/back |
| Begin SRMP2 | beginning state route mile post (ahead/back) |
| End SRMP2 | ending state route mile post (ahead/back) |
| Related Roadway Type Description | RRT description |
| State Route Description | state route and cross street |
| RRT_RRQ | RRQ description |
| Vertical Curve Bvc Arm | beginning vertical curve accumulated route mileage |
| Vertical Curve Vpi Arm | vertical point of intersection accumulated route mileage |
| Vertical Curve Evc Arm | ending vertical curve accumulated route mileage |
| Vertical Curve Type | crest or sag curve |
| Vertical Curve Length | length of curve (ft) |
| Vertical Curve Percent Grade Ahead | grade (%) ahead of curve |
| Vertical Curve Percent Grade Back | grade (%) back of curve |

The raw data for the number of lanes and roadway width information differentiates between the increasing and decreasing mile post directions for the State Routes. The raw data captured in the number of lanes and roadway information is listed in Table (3-12).


Table 3-12 Raw geometric data number of lanes and roadway width information

| Number of Lanes and Roadway Width | Definition |
| --- | --- |
| LRS_Date | date input into linear referencing system |
| SRID | state route ID |
| SR | state route |
| RRT | related route type |
| RRQ | related route qualifier |
| BegARM | beginning accumulated route mileage |
| EndARM | ending accumulated route mileage |
| BegMP | beginning mile post |
| BegAB | beginning mile post ahead/back |
| EndMP | ending mile post |
| EndAB | ending mile post ahead/back |
| RoadwayDirection | increasing or decreasing or bothways |
| NumberOfLanesIncreasing | number of lanes in increasing direction |
| NumberOfLanesDecreasing | number of lanes in decreasing direction |
| RoadwayWidthInc | roadway width (ft) in increasing direction |
| RoadwayWidthDec | roadway width (ft) in decreasing direction |

Similar to the number of lanes data, the raw shoulder width data also accounts for increasing and decreasing mile post directions for the State Routes. The shoulder locations are referenced as Left, Left Center, Right Center, and Right. The shoulder width descriptors and their associated definitions are listed in Table (3-13).

Table 3-13 Raw geometric data shoulder width information

| Shoulder Widths | Definition |
| --- | --- |
| LRS_Date | date input into linear referencing system |
| SRID | state route ID |
| SR | state route |
| RRT | related route type |
| RRQ | related route qualifier |
| BegARM | beginning accumulated route mileage |
| EndARM | ending accumulated route mileage |
| BegMP | beginning mile post |
| BegAB | beginning mile post ahead/back |
| EndMP | ending mile post |
| EndAB | ending mile post ahead/back |
| RoadwayDirection | increasing or decreasing or bothways |
| ShoulderWidthLeft | shoulder width (ft) of outer portion of decreasing direction |
| ShoulderWidthLeftCenter | shoulder width (ft) of median side of decreasing direction |
| ShoulderWidthRightCenter | shoulder width (ft) of median side of increasing direction |
| ShoulderWidthRight | shoulder width (ft) of outer portion of increasing direction |


In addition to the two main sources of roadway data, manually recorded geometric observations from WSDOT Geoportal map and SRweb were used to substantiate the information available in the final homogeneous segment database. In the context of integrating the information from the various sources, the raw geometric data was not altered in any way with regard to assembling a stand-alone base processed geometric dataset. Instead, the raw geometric data was pulled from the various data sources and incorporated into the final databases using certain set rules. The nature of the geometric data in the final homogeneous segment database will be explained in more detail in the forthcoming discussions.

3.1.4 Roadside data

The roadside data was provided by the WSDOT GRDO through the Roadside Features Inventory Program (RFIP), in which GIS information was obtained through various metadata files. The metadata files, which were separated according to specific roadside feature (58 files total), were extracted from GIS and exported to Excel in the raw data format. This roadside data uses the same route segmentation scale as the raw roadway geometric data. The roadside features obtained from the GIS metadata are listed in Table (3-14).

For each roadside feature database, extensive data inventory information was captured in 61 descriptors. For the purposes of this study, the information of particular interest was the location information of the roadside feature that concerns State Route, Beginning Mile Post, and Ending Mile Post. Supplemental geometric and roadside information was recorded and obtained through the use of WSDOT Geoportal map and SRweb programs.

Assembling the roadside data for modeling purposes required the raw data to be structured and integrated into the existing data structure according to roadway segment attributes. The raw data was not altered in any way, but rather presented in a different manner based on the clustered road segments. The roadside data was processed to be presented in two different forms: one as a dummy indicator showing the presence of a corresponding roadside feature within a homogeneous segment, and the second as a percent length variable showing the percentage of the roadside feature in the total length of the homogeneous segment. In addition to this, of the 58 roadside features, 21 variables were found to be closely related, with marginal differences that would not warrant individual categories. These related variables were combined, resulting in 37 unique roadside feature variables in the final homogeneous segment model. Table (3-15) shows the 21 roadside features that were aggregated from the raw dataset, while Table (3-16) lists the 37 roadside characteristics that were eventually conditioned for inclusion into the final database.
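The two roadside representations (presence dummy and percent length) can be illustrated with a minimal Python sketch. The function name, the tuple layout of the feature extents, and the inclusive treatment of segment boundaries are assumptions made for illustration; the study's actual processing rules are not reproduced here. The percent length is returned as a fraction between 0 and 1.

```python
def roadside_indicators(seg_bmp, seg_emp, feature_extents):
    """Return (dummy, fraction_of_length) for one feature type on one segment.

    feature_extents is a list of (begin_mp, end_mp) tuples for the feature;
    point features can be given with begin_mp == end_mp.
    """
    seg_len = seg_emp - seg_bmp
    if seg_len <= 0:
        raise ValueError("segment must have positive length")
    # Clip every extent to the segment and keep only those that touch it.
    clipped = sorted(
        (max(b, seg_bmp), min(e, seg_emp))
        for b, e in feature_extents
        if min(e, seg_emp) >= max(b, seg_bmp)
    )
    dummy = 1 if clipped else 0
    # Merge overlapping extents so shared length is not counted twice.
    covered = 0.0
    cur_b = cur_e = None
    for b, e in clipped:
        if cur_e is None or b > cur_e:
            if cur_e is not None:
                covered += cur_e - cur_b
            cur_b, cur_e = b, e
        else:
            cur_e = max(cur_e, e)
    if cur_e is not None:
        covered += cur_e - cur_b
    return dummy, covered / seg_len


# Hypothetical guardrail extents on a segment running from MP 10.0 to MP 10.5.
print(roadside_indicators(10.0, 10.5, [(9.9, 10.1), (10.05, 10.2), (10.4, 10.4)]))
```

Point features contribute to the dummy but add no length, which is consistent with features that appear in the summary tables only as dummy variables.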

Table 3-14 Roadside feature metadata files from RFIP

| Roadside Feature/Metadata Files | Roadside Feature/Metadata Files | Roadside Feature/Metadata Files | Roadside Feature/Metadata Files |
| --- | --- | --- | --- |
| bridgerail | ditchwidthevent | guardrail | rockoutcropping |
| bridgestructure | downguy | guywire | specialusebarrier |
| cabinet | downguyanchor | hydrant | specialusebarrierhtevent |
| cable_barrier | drainageinlet | impactattenuator | stormwaterpond |
| concbaeesctnlngthevent | drywell | intersection_point | stormwatervault |
| concbarrtypeevent | fence | mailbox | support |
| concrete_barrier | fenceheightevent | miscellaneousfixedobject | table |
| concretebarrfacetrtmntevent | fencetypeevent | pedestal | tree |
| culvert | glarescreen | pipe_end | tree_group |
| culvertend | glarescreenheightevent | redirectionallandform | wall |
| curb | grdrldoublesidedevent | regulatoryoutfall | wallheightevent |
| ditch | grdrlpostmatltypeevent | roadapproach | wallmaterialtypeevent |
| ditchbackslopeevent | grdrlpostspacingevent | roadsideslope | walltypeevent |
| ditchdepthevent | grdrtypeevent | roadsideslopehtevent | waterhazard |
| ditchforeslopeevent | | roadsideslopeslopevent | |


Table 3-15 Related variables in the raw data that were aggregated

| Final Aggregated Variable | Related Raw Dataset Variables |
| --- | --- |
| Cable Barrier | cable_barrier, concbaeesctnlngthevent, concbarrtypeevent |
| Concrete Barrier | concrete_barrier, concretebarrfacetrtmntevent |
| Ditch | ditch, ditchbackslopeevent, ditchdepthevent, ditchforeslopeevent, ditchwidthevent |
| Fence | fence, fenceheightevent, fencetypeevent |
| Glare Screen | glarescreen, glarescreenheightevent |
| Guardrail | guardrail, grdrldoublesidedevent, grdrlpostmatltypeevent, grdrlpostspacingevent, grdrtypeevent |
| Roadside Slope | roadsideslope, roadsideslopehtevent, roadsideslopeslopevent |
| Support | support, table |
| Wall | wall, wallheightevent, wallmaterialtypeevent, walltypeevent |
| Special Use Barrier | specialusebarrier, specialusebarrierhtevent |

Table 3-16 Processed Data Roadside Features

| Roadside Feature | Roadside Feature |
| --- | --- |
| bridgerail | intersection_point |
| bridge structure | mailbox |
| cabinet | miscellaneous fixed object |
| cable Barrier | pedestal |
| concrete Barrier | pipe_end |
| culvert | redirectionallandform |
| culvertend | regulatoryoutfall |
| curb | roadapproach |
| ditch | roadsideslope |
| downguy | rockoutcropping |
| downguyanchor | specialusebarrier |
| drainageinlet | stormwaterpond |
| drywell | stormwatervault |
| fence | support |
| glarescreen | tree |
| guardrail | tree_group |
| guywire | wall |
| hydrant | waterhazard |
| impact attenuator | |


3.2 Combined homogeneous segments dataset

In combining the datasets, a decision was made to use a homogeneous segment method. Key reasons for this decision were that fixed or random segment lengths would lead to the division of design elements such as vertical/horizontal curves, or bridge structures across segments, resulting in heteroskedastic variations of roadway geometry within segments.

Additionally, fixed length segments could have crashes recorded within a sub-segment of homogeneous geometry, while including other possibly unrelated characteristics. Homogeneous road segmentation would enable the isolation of observed crashes within the bounds of a consistent roadway geometry mix, also allowing for the modeling of segments with consistent geometric features, reflecting underlying patterns in observed crash counts.

This section briefly describes the development of the homogeneous segments crash database that was used for the modeling process in this study. Homogeneous roadway segments are defined as segments that maintain consistency in characteristics over their entire length, with any change in roadway characteristics resulting in the beginning of the next homogeneous segment. For the purposes of this study, segment homogeneity was based on segment uniformity over 20 roadway geometry characteristics, corresponding to horizontal alignment (horizontal curve point of tangency arm, horizontal curve point of curvature arm, horizontal curve radius, horizontal curve maximum (super) elevation, horizontal curve length, horizontal curve central angle), vertical alignment (vertical curve BVC arm, vertical curve VPI arm, vertical curve EVC arm, vertical curve length, vertical curve percent grade ahead, vertical curve percent grade back), number of lanes and their width (number of lanes increasing, number of lanes decreasing, roadway width increasing, roadway width decreasing), and shoulder widths (shoulder width left, shoulder width left center, shoulder width right center, shoulder width right). The shortest segment length that maintained consistent roadway geometrics was found to measure 0.01 miles. This fine level of detail resulted in an increase in the total segment count within the database, with 62,598 homogeneous segments per year, and a total number of 563,382 segments for the nine years of crash data combined.
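The idea of starting a new segment wherever any characteristic changes can be sketched as follows. This is a simplified illustration, not the procedure used to build the study database: the function name, the layer names, and the record layout `(beg_arm, end_arm, value)` are hypothetical, and only two attribute layers are shown instead of the 20 characteristics listed above. The sentinel -99 mirrors the value the text uses for omitted geometric information.

```python
MISSING = -99  # sentinel the text uses for omitted geometric information


def homogeneous_segments(layers):
    """Split a route wherever any attribute layer changes value.

    layers maps an attribute name to a list of (beg_arm, end_arm, value)
    records that do not overlap within the same layer.
    """
    # Every record boundary is a candidate segment break.
    breaks = sorted({p for recs in layers.values() for b, e, _ in recs for p in (b, e)})
    segments = []
    for beg, end in zip(breaks, breaks[1:]):
        attrs = {
            name: next((v for b, e, v in recs if b <= beg and end <= e), MISSING)
            for name, recs in layers.items()
        }
        if segments and segments[-1][2] == attrs:
            # Nothing changed at this break, so extend the previous segment.
            segments[-1] = (segments[-1][0], end, attrs)
        else:
            segments.append((beg, end, attrs))
    return segments


# Hypothetical two-layer example: lane count changes at ARM 1.00,
# right shoulder width changes at ARM 0.50.
layers = {
    "lanes_inc": [(0.00, 1.00, 1), (1.00, 2.00, 2)],
    "shoulder_right": [(0.00, 0.50, 4), (0.50, 2.00, 8)],
}
for seg in homogeneous_segments(layers):
    print(seg)
```

In this example the route splits into three segments, (0.00, 0.50), (0.50, 1.00) and (1.00, 2.00), each with constant values on both layers.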

A total of 138 parameters were retained in the database, covering crash descriptors, roadway geometrics, roadside information, and AADT values. The information from the source data was input into the homogeneous roadway segment format based on each segment's milepost markers. The WSDOT source crash data was input as counts of occurrences on each homogeneous roadway segment for each year. The crash counts for any particular roadway segment were determined by the recorded milepost location from the crash observations. The reported crashes were then assigned to the corresponding homogeneous segments based on the milepost limits. These counts were accumulated to obtain total crash counts, impact location, collision severity, number of vehicles involved, and collision type on a segment-by-segment basis.
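The assignment of crashes to segments by milepost limits can be sketched with a binary search. The function name and the boundary convention (a crash located exactly on a shared boundary is counted in the segment that begins there) are assumptions for illustration; the text does not state how boundary cases were resolved.

```python
from bisect import bisect_right


def count_crashes(segment_bounds, crash_mileposts):
    """Count crashes per segment.

    segment_bounds is a list of (bmp, emp) pairs sorted by bmp and
    non-overlapping; crash_mileposts is a list of crash locations.
    """
    starts = [bmp for bmp, _ in segment_bounds]
    counts = [0] * len(segment_bounds)
    for mp in crash_mileposts:
        i = bisect_right(starts, mp) - 1  # last segment starting at or before mp
        if i >= 0 and mp <= segment_bounds[i][1]:
            counts[i] += 1  # crashes outside every segment are ignored
    return counts


# Hypothetical segments and crash locations on one route for one year.
segments = [(0.0, 0.5), (0.5, 1.2), (1.2, 1.3)]
crashes = [0.1, 0.5, 0.7, 1.3, 2.0, 0.7]
print(count_crashes(segments, crashes))  # [1, 3, 1]
```

The same routine could be run once per crash category (severity, collision type, and so on) to accumulate the category-specific counts described above.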

Not all segments contained complete roadway geometric information, and cells with omitted geometric information were populated with the value -99 to signify missing values. Of the 142 routes in the raw dataset, only 117 routes were included in the final database, as 25 state routes did not have any roadway geometric information available. The omitted State Routes were: 2, 19, 41, 100, 102, 103, 110, 113, 116, 119, 121, 122, 131, 170, 194, 197, 225, 262, 263, 278, 300, 508, 546, 547, and 971. The parameters in the homogeneous roadway segments database along with a brief description can be found in Appendix A. The subsequent tables provide summary statistics of the variables used for building model specifications, sub-grouped into crash categories, roadway segment characteristics, and roadside features, for the year 2010.


Table 3-17 Summary statistics of roadway characteristics

| Variable | Mean | Std. Dev. | Min | Max |
| --- | --- | --- | --- | --- |
| Ln(Average Annual Daily Traffic) | 7.905926 | 1.225559 | 3.78419 | 11.47508 |
| Ln(Homogeneous segment length) | -3.02697 | 1.007003 | -4.60517 | 2.988708 |
| Number of lanes- increasing direction | 1.10593 | 0.324822 | 0 | 4 |
| Number of lanes- decreasing direction | 1.103534 | 0.33244 | 0 | 4 |
| Horizontal curve angle | 29.25855 | 73.3353 | 0 | 436.07 |
| Difference in gradients for vertical curve | 1.093967 | 2.016285 | 0 | 24.36 |
| K-value for vertical curve | 182.5474 | 1468.381 | 0 | 150000 |
| Degree of horizontal curvature | 2.108558 | 5.823135 | 0 | 163.7022 |
| Average lane width in feet | 11.84304 | 2.096876 | 0 | 43 |
| Dummy- right shoulder 2-3 feet increasing, 6-7 feet decreasing | 0.002316 | 0.048073 | 0 | 1 |
| Dummy- right shoulder 4-5 feet increasing, < 1 foot decreasing | 0.002812 | 0.05295 | 0 | 1 |
| Dummy- right shoulder 4-5 feet increasing, 2-3 feet decreasing | 0.008754 | 0.093155 | 0 | 1 |
| Dummy- right shoulder 8-9 feet increasing, < 1 foot decreasing | 0.003163 | 0.056152 | 0 | 1 |
| Dummy- right shoulder 8-9 feet increasing, > 10 feet decreasing | 0.003483 | 0.058911 | 0 | 1 |
| Dummy- right shoulder > 10 feet increasing, < 1 foot decreasing | 0.001486 | 0.038516 | 0 | 1 |
| Dummy- right shoulder > 10 feet increasing, 2-3 feet decreasing | 0.001102 | 0.033182 | 0 | 1 |
| Dummy- right shoulder > 10 feet increasing, 4-5 feet decreasing | 0.00262 | 0.051118 | 0 | 1 |
| Dummy- Center shoulder 1-9 feet | 0.027669 | 0.164023 | 0 | 1 |

Table 3-18 Summary statistics of roadside features

| Variable | Mean | Std. Dev. | Min | Max |
| --- | --- | --- | --- | --- |
| Dummy- bridge rail | 0.01286 | 0.112671 | 0 | 1 |
| Percent length of bridge rail on segment | 0.009415 | 0.091382 | 0 | 1 |
| Dummy- bridge structure | 0.002892 | 0.053695 | 0 | 1 |
| Dummy- cabinet | 0.008882 | 0.093826 | 0 | 1 |
| Dummy- cable barrier | 0.00147 | 0.038309 | 0 | 1 |
| Percent length of cable barrier on segment | 0.001146 | 0.032248 | 0 | 1 |
| Dummy- concrete barrier | 0.004697 | 0.068371 | 0 | 1 |
| Percent length of concrete barrier on segment | 0.003404 | 0.054755 | 0 | 1 |
| Dummy- culvert | 0.035864 | 0.185952 | 0 | 1 |
| Percent length of culvert on segment | 0.007574 | 0.060154 | 0 | 1 |
| Dummy- culvert end | 0.053708 | 0.225442 | 0 | 1 |
| Dummy- curb | 0.043979 | 0.20505 | 0 | 1 |
| Percent length of curb on segment | 0.028752 | 0.151803 | 0 | 1 |
| Dummy- ditch | 0.207882 | 0.405795 | 0 | 1 |
| Percent length of ditch on segment | 0.143783 | 0.316186 | 0 | 1 |
| Dummy- down guy | 0.010448 | 0.101679 | 0 | 1 |


Table 3-19 Summary statistics of roadside features (continued)

| Variable | Mean | Std. Dev. | Min | Max |
| --- | --- | --- | --- | --- |
| Percent length of down guy on segment | 0.001886 | 0.028482 | 0 | 1 |
| Dummy- down guy anchor | 0.042094 | 0.200805 | 0 | 1 |
| Dummy- drainage inlet | 0.03417 | 0.181668 | 0 | 1 |
| Dummy- drywell | 0.000447 | 0.021145 | 0 | 1 |
| Dummy- fence | 0.095898 | 0.294453 | 0 | 1 |
| Percent length of fence on segment | 0.066975 | 0.231353 | 0 | 1 |
| Dummy- glare screen | 0.000128 | 0.011304 | 0 | 1 |
| Percent length of glare screen on segment | 8.75E-05 | 0.008576 | 0 | 1 |
| Dummy- guard rail | 0.100355 | 0.300475 | 0 | 1 |
| Percent length of guard rail on segment | 0.069006 | 0.233669 | 0 | 1 |
| Dummy- guywire | 0.001294 | 0.035949 | 0 | 1 |
| Percent length of guywire on segment | 0.000261 | 0.010927 | 0 | 1 |
| Dummy- hydrant | 0.00417 | 0.064437 | 0 | 1 |
| Dummy- impact attenuator | 0.000655 | 0.025584 | 0 | 1 |
| Dummy- intersection point | 0.000176 | 0.013255 | 0 | 1 |
| Dummy- mailbox | 0.033963 | 0.181135 | 0 | 1 |
| Dummy- miscellaneous fixed object | 0.069539 | 0.254371 | 0 | 1 |
| Dummy- pedestal | 0.083022 | 0.275917 | 0 | 1 |
| Dummy- pipe end | 0.003019 | 0.054865 | 0 | 1 |
| Dummy- redirectional land form | 0.000655 | 0.025584 | 0 | 1 |
| Percent length of redirectional land form | 0.000295 | 0.014347 | 0 | 1 |
| Dummy- regulatory outfall | 0.00262 | 0.051118 | 0 | 1 |
| Dummy- road approach | 0.099843 | 0.299794 | 0 | 1 |
| Dummy- roadside slope | 0.31239 | 0.463472 | 0 | 1 |
| Percent length of roadside slope on segment | 0.233412 | 0.381188 | 0 | 1 |
| Dummy- rock outcrop | 0.009505 | 0.09703 | 0 | 1 |
| Percent length of rock outcrop on segment | 0.006852 | 0.077123 | 0 | 1 |
| Dummy- special use barrier | 0.000543 | 0.023299 | 0 | 1 |
| Percent length of special use barrier | 0.000272 | 0.014283 | 0 | 1 |
| Dummy- storm water pond | 0.000719 | 0.026802 | 0 | 1 |
| Dummy- storm water vault | 0.001166 | 0.03413 | 0 | 1 |
| Dummy- support | 0.180038 | 0.384222 | 0 | 1 |
| Dummy- tree | 0.047094 | 0.211842 | 0 | 1 |
| Dummy- tree group | 0.128103 | 0.334207 | 0 | 1 |
| Percent length of tree group on segment | 0.07593 | 0.233084 | 0 | 1 |
| Dummy- retaining wall | 0.009218 | 0.095565 | 0 | 1 |
| Percent length of retaining wall on segment | 0.003466 | 0.048645 | 0 | 1 |
| Dummy- water hazard | 0.017892 | 0.13256 | 0 | 1 |
| Percent length of water hazard on segment | 0.011906 | 0.100345 | 0 | 1 |


Table 3-20 Summary statistics of crash categories

| Variable | Mean | Std. Dev. | Min | Max |
| --- | --- | --- | --- | --- |
| Total count of crashes on segment | 0.193585 | 1.144581 | 0 | 55 |
| Location based | | | | |
| Count of roadside crashes on segment | 0.053772 | 0.368041 | 0 | 16 |
| Count of roadway crashes on segment | 0.139701 | 1.03842 | 0 | 55 |
| Count of other location crashes on segment | 0.000112 | 0.01199 | 0 | 2 |
| Severity based | | | | |
| Property damage only crash | 0.10617 | 0.705296 | 0 | 35 |
| Possible injury from crash | 0.043404 | 0.500827 | 0 | 31 |
| Evident injury from crash | 0.029873 | 0.34872 | 0 | 16 |
| Serious injury from crash | 0.009617 | 0.188246 | 0 | 10 |
| Fatality from crash | 0.003243 | 0.104406 | 0 | 6 |
| Unknown | 0.001278 | 0.03873 | 0 | 2 |
| Number of injuries based | | | | |
| Crashes reported with more than 1 injury | 0.084571 | 0.810832 | 0 | 45 |
| Crashes reported with only 1 injury | 0.006757 | 0.193185 | 0 | 18 |
| Crashes reported with 0 injury | 0.102256 | 0.796943 | 0 | 55 |
| Number of vehicles involved based | | | | |
| Crashes reported involving 1 vehicle | 0.074763 | 0.434581 | 0 | 17 |
| Crashes reported involving 2 vehicle | 0.099668 | 0.837054 | 0 | 38 |
| Crashes reported involving 3 vehicle | 0.016247 | 0.34174 | 0 | 25 |
| Crashes reported involving 4 vehicle | 0.002476 | 0.130045 | 0 | 13 |
| Crashes reported involving 5 vehicle | 0.000399 | 0.061009 | 0 | 12 |
| Crashes reported involving > 5 vehicles | 3.19E-05 | 0.005652 | 0 | 1 |
| Crash type based | | | | |
| Rear end type crashes | 0.05251 | 0.686262 | 0 | 50 |
| Turning rear end type crashes | 0.000415 | 0.059282 | 0 | 11 |
| Same direction turning sideswipe crashes | 0.00016 | 0.02038 | 0 | 3 |
| Same direction sideswipe crashes | 0.005112 | 0.149035 | 0 | 11 |
| Same direction turning crashes | 0.007285 | 0.16733 | 0 | 9 |
| Same direction others | 0.004185 | 0.143605 | 0 | 17 |
| Head-on collision crashes | 0.002892 | 0.104263 | 0 | 6 |
| Opposite direction sideswipe crashes | 0.005288 | 0.150999 | 0 | 14 |
| Opposite direction turning crashes | 0.005959 | 0.152343 | 0 | 7 |
| Opposite direction others | 0.005687 | 0.169479 | 0 | 14 |
| Fixed object type crashes | 0.042382 | 0.313204 | 0 | 15 |
| Entering-at-angle type crashes | 0.024442 | 0.387444 | 0 | 23 |
| Overturn type crashes | 0.009697 | 0.138982 | 0 | 7 |
| Animal hit type crashes | 0.018898 | 0.220252 | 0 | 10 |
| Bicycle hit type crashes | 0.000687 | 0.041723 | 0 | 4 |
| Pedestrian hit type crashes | 0.001102 | 0.051329 | 0 | 5 |
| One parked vehicle type crashes | 0.001949 | 0.080908 | 0 | 8 |
| Entering/leaving driveway type crashes | 0.002604 | 0.098439 | 0 | 10 |
| Crashes classified as “other” | 0.004265 | 0.098055 | 0 | 6 |
| Crashes with no stated information | 0.000016 | 0.003997 | 0 | 1 |
| Crashes involving trucks | 0.099971 | 0.662864 | 0 | 24 |


Chapter 4

Modeling methodology

This section will provide a summary of the overall structure and modeling basis for this dissertation, leading into the modeling techniques utilized to achieve the proposed goals.

The primary research question to be addressed is the extent to which unobserved effects and the variations across segments may be affected by the usage of predictors in improper functional form. If the modeling technique employed to predict a certain outcome assumes that an independent variable enters the model linearly, while in reality its effect is nonlinear, it could result in improper or inaccurate predictions. Insights from this part of the study can provide guidance on accommodation of heterogeneity through functional form treatments for a broad functional class of highways on a regional or statewide network.

Secondly, this dissertation aims at investigating the impact of functional form on heterogeneity. This type of heterogeneity manifests itself as an unobserved effect of unknown origin, unassociated with any exogenous variables. As mentioned in the previous chapters, this form of heterogeneity is empirically specific to the dataset that is being used, as are the resulting fractional polynomials through multivariable regression modeling. The usage of fractional polynomials might provide some insight into parameter heterogeneity, but could still leave some heterogeneity unexplained. This study also aims to couple the functional form approach with a generalized dispersion parameter approach, where the over-dispersion parameter itself is parameterized against multivariable polynomials.

In the absence of proper functional form, fixed parameter models can misinform safety policy, and sustainable safety policy can be developed only if models can capture differences between roadway groups accurately, while accommodating nonlinear parameterizations. The combination of the proposed research questions is fundamental to addressing the treatment of heterogeneity in the methodological evaluation of geometric effects on crash propensities.

To study the effect of functional form and over-dispersion parameterization on count model heterogeneity, an iterative procedure is proposed involving three types of modeling techniques, namely, count models using negative binomial regression; the Royston and Altman algorithm for sequential multivariable fractional polynomial selection; and the parameterization of the over-dispersion term alpha using generalized negative binomial regressions. This procedure is the first technique proposed to address two modeling domains: a) heterogeneity effects due to improper functional form, and b) incorporating a multivariable fractional polynomial specification search in large datasets.

Over-dispersion is a phenomenon wherein the observed variance in a dependent variable is higher than the expected variance. In a negative binomial/ Poisson context, the extent to which over-dispersion exists is captured by the negative binomial heterogeneity parameter alpha. A positive alpha value indicates the presence of over-dispersion in the data, while as alpha approaches zero, the model becomes Poisson. A convergent Log-likelihood close to zero, and a small alpha value with narrow standard deviation from one modeling cycle to the next, are desirable as indicators of good fit, and of reduced dispersion heterogeneity.

Generalized dispersion parameter models (Heterogeneous Negative Binomial Models), as the name suggests, not only enable the estimation of independent variable parameters and their standard errors, but also allow for the parameterization of the ancillary or over-dispersion parameter alpha. Significant variables contributing to the parameter alpha imply significant contribution to the over-dispersion of the data.
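A minimal sketch of what parameterizing alpha means in practice is given below. It assumes a log link for the dispersion parameter, so that alpha for observation i is the exponential of a linear combination of covariates; this is a common convention in heterogeneous negative binomial software, but it is an assumption here, as are the function names, the covariates, and the coefficient values. The variance expression is the usual negative binomial form, mean plus alpha times the squared mean.

```python
import math


def hnb_alpha(z, gamma):
    """Observation-specific over-dispersion: alpha_i = exp(z_i . gamma)."""
    return math.exp(sum(zk * gk for zk, gk in zip(z, gamma)))


def nb_variance(mean, alpha):
    """Negative binomial variance: mean + alpha * mean**2."""
    return mean + alpha * mean ** 2


# Hypothetical dispersion covariates: a constant and Ln(AADT).
gamma = [1.0, -0.2]          # illustrative coefficients, not estimates
for ln_aadt in (6.0, 8.0, 10.0):
    alpha_i = hnb_alpha([1.0, ln_aadt], gamma)
    print(ln_aadt, round(alpha_i, 4), round(nb_variance(0.2, alpha_i), 4))
```

With a negative coefficient on Ln(AADT), the implied over-dispersion shrinks as traffic volume grows; with all dispersion coefficients other than the constant set to zero, the model collapses to the standard negative binomial with a single alpha.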

Multivariable fractional polynomials, on the other hand, target the influence of continuous covariates on the outcome of regression models. Introduced by Royston and Altman in 1994 and improved by Sauerbrei and Royston between 1999 and 2008, these models may, in addition to the search set of covariates, include binary, categorical or other continuous covariates, which are introduced into the variable selection process without the need for nonlinear transformations.

4.1 Poisson distribution, over-dispersion and the negative binomial regression

The foundation of count regression modeling in transportation engineering, the Poisson regression, is based on the Poisson probability distribution, and assumes that the dependent variables in the regression analysis are counts that follow a Poisson distribution, and that the observations are independent of each other. The general form of the Poisson regression involves a series of Bernoulli trials, without knowledge of the probability of success on a particular trial ‘p’ or the number of trials ‘n’ in the series. The mean number of successes in the series is taken as λ, the mean rate of the event, and the Poisson distribution results in the probability of ‘y’ successes given a large series and a small ‘p’.

The Poisson model for event counts is given by:

$$P(y_i) = \frac{\exp(-\lambda_i)\,\lambda_i^{y_i}}{y_i!}$$ Eq. (2)

Where yi is the count of observed events for observation ‘i’ in time period ‘t’, and as before, λi is the mean rate of the event. The goal of the regression is to estimate the density of yi through the estimation of λi.

$$\lambda_i = \exp(\beta X_i)$$ Eq. (3)

As mentioned before, modeling counts using a Poisson regression is limited by a phenomenon known as over-dispersion. A major assumption of the Poisson distribution is that the mean should equal the variance. Over-dispersion occurs when the observed variance is greater than the expected mean, something that is found in crash count data, frequently caused by under-reporting errors and the uneven distribution of event probabilities in small time intervals. A Poisson regression could be employed if over-dispersion is found to be very small in effect, with the unaccounted over-dispersion leading to inefficient estimates, but if unobserved effects significantly affect the distribution of observed counts, there will be significant bias in the estimated parameters.

To account for this shortcoming of the Poisson distribution, a negative binomial distribution is commonly employed, allowing the variance to differ from the mean through accounting for unobserved heterogeneity by assuming a Gamma distributed error term. Alternative forms for the distribution of the random error term exist, such as a log-normal, but in this case, the likelihood is in non-closed form, and the Gamma distributed error term is preferred. The negative binomial model is derived by way of an error structure such that,

$$\lambda_{ij} = \exp(\beta X_{ij} + \varepsilon_{ij})$$ Eq. (4)

Where $\exp(\varepsilon_{ij})$ is the Gamma distributed error term allowing the presence of unobserved heterogeneity. Integration with respect to the error term results in the marginal distribution form of the negative binomial distribution.

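$$P(n_i \mid \lambda_i, \alpha) = \frac{\Gamma\left(\frac{1}{\alpha} + n_i\right)}{\Gamma\left(\frac{1}{\alpha}\right) n_i!} \left(\frac{\frac{1}{\alpha}}{\frac{1}{\alpha} + \lambda_i}\right)^{\frac{1}{\alpha}} \left(\frac{\lambda_i}{\frac{1}{\alpha} + \lambda_i}\right)^{n_i}$$ Eq. (5)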

The variance is allowed to differ from the mean by:

$$Var[n_i] = E[n_i]\left[1 + \alpha E[n_i]\right] = E[n_i] + \alpha E[n_i]^2$$ Eq. (6)
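The relationship between the negative binomial probability function and its variance can be checked numerically. The sketch below evaluates the standard Gamma-mixture form of the negative binomial probabilities on the log scale and compares the resulting mean and variance with the expression mean plus alpha times the squared mean. The function names, the truncation point `n_max`, and the example values of the mean rate and alpha are illustrative assumptions.

```python
import math


def nb_pmf(n, lam, alpha):
    """Negative binomial probability of n events with mean lam and dispersion alpha."""
    r = 1.0 / alpha
    log_p = (
        math.lgamma(r + n) - math.lgamma(r) - math.lgamma(n + 1)
        + r * math.log(r / (r + lam))
        + n * math.log(lam / (r + lam))
    )
    return math.exp(log_p)


def nb_moments(lam, alpha, n_max=500):
    """Total probability, mean and variance from the truncated probability sum."""
    probs = [nb_pmf(n, lam, alpha) for n in range(n_max + 1)]
    total = sum(probs)
    mean = sum(n * p for n, p in enumerate(probs))
    var = sum((n - mean) ** 2 * p for n, p in enumerate(probs))
    return total, mean, var


lam, alpha = 0.2, 1.5                 # illustrative values only
total, mean, var = nb_moments(lam, alpha)
print(round(total, 6), round(mean, 6), round(var, 6))
print(round(lam + alpha * lam ** 2, 6))  # variance implied by the closed form
```

As alpha shrinks toward zero the computed variance approaches the mean, which is the Poisson limiting case described in the text.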

It should be noted that in the standard negative binomial regression model, the over-dispersion term α is fixed across all observations. The presence of a statistically significant α shows the existence of over-dispersion in the observed counts, but does little to improve one's understanding of the source of the over-dispersion. One of the key objectives of this study is to break down the unobserved effects through the parameterization of the over-dispersion term, by providing some insight into the factors that contribute to the occurrence of the over-dispersion.

The procedure employed to achieve this goal will be discussed going forward through this chapter. As such, the specifications obtained through the negative binomial regressions were employed as a baseline to compare predictive insights and improvements for subsequent model specifications. The next modeling component to be discussed will be the MFP procedure for finding workable functional forms for the continuous covariates in the specification search set.

4.2 Model structure for functional form treatments

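A regression function in general polynomial form with powers $p_j$ is given by:

$$y = \beta_0 + \sum_{j=1}^{m} \beta_j H_j(x) \qquad \text{Eq. (7)}$$

Where, $H_1(x) = x^{(p_1)}$, and for j = 2,…,m,

$$H_j(x) = \begin{cases} x^{(p_j)} & \text{if } p_j \neq p_{j-1} \\ H_{j-1}(x)\ln(x) & \text{if } p_j = p_{j-1} \end{cases} \qquad \text{Eq. (8)}$$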

As an example, a fractional polynomial (FP) of degree three with powers (1,3,3) is given as: $\beta_0 + \beta_1 x + \beta_2 x^3 + \beta_3 x^3 \ln(x)$, since $H_1(x) = x$, $H_2(x) = x^3$, and $H_3(x) = x^3 \ln(x)$. In vector notation, the polynomial can be represented as $x^{(1,3,3)\prime}\beta$. In empirical applications to large datasets, the degree is typically limited to 2 (also known as FP2). In spite of the set of powers being relatively small, {−2, −1, −0.5, 0, 0.5, 1, 2, 3} (a power of 0 is taken as Ln(X)), the number of models to be evaluated under an FP2 specification can be substantial. In empirical applications this seemingly small sampling set of powers appears to be sufficient for extraction of rich FP2 specifications (Royston and Altman 1994; Sauerbrei et al. 2006; Royston and Sauerbrei 2008). To understand why the FP specifications to be tested within the small sampling set of powers can be substantial, consider for example, an FP2 consideration set with powers p1 and p2 being sampled from {−2, −1, −0.5, 0, 0.5, 1, 2, 3}. If one allows for repeated powers, that is p1 = p2, then the number of FP2 specifications is 36 (28 pairs of distinct powers plus 8 repeated-power pairs). In a large dataset, this can be a significant computational challenge, since the FP estimation algorithm described below takes multiple cycles to converge at an optimum functional form within the search set.
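The basis construction and the count of FP2 candidates can be illustrated with a short sketch. The function name `fp_basis` is hypothetical; it follows the rule that a repeated power multiplies the previous term by ln(x), and it assumes x > 0.

```python
import math
from itertools import combinations_with_replacement

POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]   # the predefined power set

def fp_basis(x, powers):
    """Return [H_1(x), ..., H_m(x)] for x > 0; a power of 0 is taken as ln(x)."""
    basis = []
    for j, p in enumerate(powers):
        if j > 0 and p == powers[j - 1]:
            basis.append(basis[-1] * math.log(x))      # repeated power: H_{j-1}(x) * ln(x)
        else:
            basis.append(math.log(x) if p == 0 else x ** p)
    return basis

# All unordered pairs (p1, p2), repeated powers allowed
fp2_pairs = list(combinations_with_replacement(POWERS, 2))

print(len(fp2_pairs))               # number of FP2 candidates per covariate
print(fp_basis(2.0, [1, 3, 3]))     # terms x, x^3 and x^3 * ln(x) evaluated at x = 2
```

Since each of the 36 candidates requires its own model fit for every covariate in every cycle, the cost grows quickly with the number of continuous covariates.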

4.3 Estimation algorithm and model selection

Useful when one wishes to preserve the continuous nature of the covariates in a model, with the need to test the nonlinearity of their relationships, model selection in the MFP algorithm combines a backward elimination procedure with a systematic search for a suitable variable transformation. At each step of the backfitting algorithm, a fractional transformation is constructed for each of the continuous covariates beginning with the most significant, while fixing the current functional forms of the other covariates. The algorithm terminates when no more covariates are excluded, and the functional forms of the existing covariates do not change.

The linear predictor for a fractional polynomial of order M for covariate X is given by the form $\beta_0 + \sum_{m=1}^{M} \beta_m X^{p_m}$, where each power $p_m$ is chosen from the predefined set S = {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, and the multivariable model will be of the form: $\beta_0 + \left(\sum_{k=1}^{K} \sum_{m=1}^{M} \beta_m X_k^{p_m}\right) + X_{K+1} + \dots + X_L$, where $X_k$ for k = 1 to K represents the set of covariates included in the fractional polynomial search, while the range $X_{K+1}$ to $X_L$ comprises the dummy variables and other continuous covariates not included in the search.


All combinations of these powers are fitted to find the best fitting model. An FP1 transformation applies only one power transformation to an X variable, while higher order transformations such as FP2, FP3 and so on, involve combinations of power transformations of X. In STATA, using the "sequential" option for the mfp nbreg command ensures that model selection is performed using the algorithm suggested by Royston and Altman in 1994. This algorithm follows a three step process for each transformation made to a covariate. The FP2 is first compared to the FP1 on 2 degrees of freedom at the alpha() significance level. If significant, the final model is FP2. Otherwise, FP1 is tested against a straight line on 1 degree of freedom at the alpha() significance level. If significant, the final model is FP1. Otherwise, the straight line is tested against omitting the x variable on 1 degree of freedom at the select() level. If significant, the final model is a straight line; otherwise, the x variable is dropped. When a continuous covariate is found to be in an optimal functional form, the algorithm fixes it and moves on to the next most significant variable.
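The three-step comparison can be sketched as a small decision function. This is an illustration under stated assumptions, not STATA's implementation: the inputs are the deviances (−2 times the log-likelihood) of the best-fitting model of each type, the function names are hypothetical, and the default levels of 0.05 are chosen only for the example.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-squared probability, using closed forms for df = 1 or 2."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch supports only df = 1 or 2")

def select_form(dev_null, dev_linear, dev_fp1, dev_fp2, alpha=0.05, select=0.05):
    """Choose FP2, FP1, a straight line, or dropping x from deviance differences."""
    if chi2_sf(dev_fp1 - dev_fp2, 2) < alpha:       # FP2 against FP1 on 2 d.f.
        return "FP2"
    if chi2_sf(dev_linear - dev_fp1, 1) < alpha:    # FP1 against a straight line on 1 d.f.
        return "FP1"
    if chi2_sf(dev_null - dev_linear, 1) < select:  # straight line against omitting x on 1 d.f.
        return "linear"
    return "drop"

# Hypothetical deviances: FP2 improves on FP1 by 9 units
print(select_form(1000.0, 990.0, 989.0, 980.0))
```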

Model selection is based on the deviance criterion given by -2LogL, where LogL is the log-likelihood at convergence. Since the FP algorithm compares multiple models, it is useful to consider a baseline model against which the deviance of the preferred models can be evaluated. This approach measures the deviance gain, which is given by D(1,1) − D(m,p); that is, the deviance gain is measured relative to a first-degree FP with power 1, i.e., a straight line. This gain is referred to a χ2 distribution with degrees of freedom equal to 2m-1, when a constant is included in the specification. In FP(m) specifications, for a single variable model, the degrees of freedom are approximately 2m+1, whereas for a straight line model, the degrees of freedom would be two. For a multivariable specification with a large design matrix, such as the one used in this dissertation, the degrees of freedom can be substantial. Searching for the model with the lowest deviance can be a computational challenge. In order to address this computational challenge, a k-cut approach to model building and selection is discussed in the next section.

4.4 Modeling a large dataset with FP(m,p)

The 2010 subset of the homogeneous crash database consisted of 12,118 observed crashes over 62,598 homogeneous segments. The size of the dataset brought about two issues.

The first was computational. For the purposes of this study, the search vector for the FP(m,p) included roadway characteristic and roadside feature variables, along with the state route dummies. Each iteration cycle of the ‘Royston and Altman MFP search algorithm’ tests deviations achieved by transforming one continuous variable while keeping the remaining untransformed variables linear. With 46 continuous covariates to search among, the model would be computationally burdensome, and would require a long time to achieve convergence. The second was that, since the nonlinear parameterizations achieved through the search are empirically dependent on the data in hand, there is always the risk of obtaining too general a fit for the dependent variable with such a large number of observations; this too is a potential source of heterogeneity. To address these issues, it was decided that the 2010 homogeneous crash dataset would have to be further divided into smaller subsets to enable both efficient computability, and the teasing out of nonlinear trends that might be overlooked when regarding the dataset as a whole.

The next step in the process was to determine an optimum number of subsets required to obtain both feasible convergence, and stable FP specifications. One interesting aspect of this determination was that every data subset k would be used independently for an FP search, resulting in K sets of FP specifications. Depending on the distribution of the variables found to be significant in these specifications, there would be some common variables with similar or the same transformations, and there could be unique variables either transformed or untransformed. It would be expected that as the number of cuts changes, both the total number of significant variables, and the number of common and unique variables would vary. The nature of this variation was not immediately apparent, but the understanding was that testing different cut sizes could provide insight into the distribution of the number of parameters at each level, ultimately resulting in an optimal number of cuts that would, along with being computationally efficient, also provide the best fitting combination of unique and common parameters. Given the size of the dataset, it was concluded that four or fewer cuts would not yield any computational benefits. Having between four and seven divisions would result in between 9,000 and 15,000 observations per subset, which might be viable, but in studying the distribution of the data, it was decided that a 10 cut approach might provide a better fit.

With regard to existing research on an optimum number of cuts, there is little to no work detailing applications to model building. The basis for this part of the dissertation was taken from studies of K-fold cuts employed in cross-validation and bootstrapping for estimating model prediction error. Numerous studies address finding an optimum number of cuts for cross-validation of models built on large datasets. One significant source of guidance was obtained from Kohavi (1995), who found that decreasing K below five resulted in increased variance due to instability of the training set. Similarly, Rodriguez et al. (2010), and Tibshirani et al. (2013) showed empirically that K = 5 or 10 resulted in error estimates that had neither excessively high bias, nor very high variance, while having a lower associated computational cost than using a cut of K = n-1. Most of the extant literature uses a 10-cut approach to cross-validation of model specifications. The efficacy of using K = 10 was also tested using model performance measures on the 2010 homogeneous dataset, and validated using the 2009 dataset.


In order to divide the 2010 homogeneous segment dataset into 10 consistent subsets based on observation numbers and crash counts, a random number generator was utilized to assign each segment in the dataset a number between 0 and 1. Following this, rows corresponding to the 10 ranges of this random number were copied out and saved as separate files for separate MFP specification searches. Table (4-1) shows the number of observations and total crashes within each of the 10 subsets.

Table 4-1 Segment and total crash counts in the 10 randomly selected subsets

Subset Number    Number of Segments    Total Number of Crashes
1                6,333                 1,140
2                6,299                 1,115
3                6,316                 1,266
4                6,236                 1,309
5                6,150                 1,158
6                6,245                 1,251
7                6,217                 1,154
8                6,257                 1,266
9                6,285                 1,205
10               6,260                 1,254
Sum:             62,598                12,118
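The random assignment described above can be sketched as follows. This is a minimal illustration, assuming the 10 ranges are equal-width intervals of the uniform draw; the seed and function name are hypothetical, so the resulting subset sizes will not reproduce Table 4-1 exactly.

```python
import random

def assign_subsets(n_segments, k=10, seed=2010):
    """Assign each segment a subset label 1..k from a uniform draw on [0, 1)."""
    rng = random.Random(seed)                  # hypothetical seed, for reproducibility
    labels = []
    for _ in range(n_segments):
        u = rng.random()                       # random number between 0 and 1
        labels.append(min(int(u * k), k - 1) + 1)
    return labels

labels = assign_subsets(62598)
sizes = [labels.count(j) for j in range(1, 11)]
print(sizes, sum(sizes))                       # roughly 6,260 segments per subset
```

Because assignment is by independent draws rather than by shuffling and slicing, subset sizes vary slightly around n/K, which is the pattern seen in the table.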

Once modeled, in order to be able to compare the resulting estimations to the other models in the process, the 10 subset specifications would have to be aggregated. To ensure no loss of information occurred, a set of rules was used such that each observation in the combined 2010 dataset would access its respective sample cut’s parameters, thereby ensuring that the segment's net effect on the aggregate model prediction would always be the same as its effect in the subset model. Recombining the datasets and fractional polynomials in this manner would also negate the possibility of correlations occurring between multiple functional form transformations of the same covariate across the 10 data subsets. To begin with, each of the 10 cut models was simplified in terms of its parent variables. Next, the common and unique reduced form variables were segregated, with the unique variables being interacted with their respective subset dummies in the aggregate model. Common variables were sorted according to coefficient magnitude, and recoded such that the coefficient for the interacted variables would be the highest coefficient in the vector plus the variable's original coefficient in the subset model, so that the difference between the recoded coefficient and the highest coefficient returns the original sample cut model coefficient. When being included in the aggregate model, these variables would be specified such that the highest coefficient would be assigned to the parent variable, and each of the subset dummy interaction variables would be assigned the recoded coefficients. This would result in retaining a consistent effect on the segment level predictions.

Once recombined, these recoded variables were used as the search set for a fixed parameter negative binomial model. The resulting count model was compared for prediction accuracy against a random parameter negative binomial regression performed on the original untransformed variables. As previously mentioned the negative binomial by construction picks up some heterogeneity that manifests itself in the over-dispersion. By allowing model estimates to vary across individual observations, random parameter models enable the explanation of the heterogeneity across individuals as well as the heterogeneity across groups. The reason for choosing a random parameter negative binomial model for the comparison was to study the extent to which the fractional form transformations capture time-invariant heterogeneity across individuals, without the distributional assumptions required by the random coefficient estimations.

The next phase of the modeling process was to integrate the transformed variables with a generalized negative binomial regression model, as a means of parameterization of the over-dispersion term α, to study the parameters that contribute to the over-dispersion in the data, and the functional forms that they take on. The following section provides some insight into this integration.


4.5 Incorporation of FP(m) structure into heterogeneous negative binomial estimation

The heterogeneous negative binomial (HNB) model is the target model for functional form evaluation in terms of heterogeneity. As before, negative binomial distributions based on a Gamma distributed error term are commonly used when estimating overdispersed data. The negative binomial model is derived by way of an error structure such that,

$$\lambda_{ij} = \exp(\beta X_{ij} + \varepsilon_{ij}) \qquad \text{(from Eq. (4))}$$

Where, $\exp(\varepsilon_{ij})$ is the Gamma-distributed error term allowing for the presence of unobserved heterogeneity.

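The marginal distribution in the form of a negative binomial distribution is given by:

$$P(n_{ij} \mid \lambda_{ij}, \alpha) = \frac{\Gamma\left(\frac{1}{\alpha} + n_{ij}\right)}{\Gamma\left(\frac{1}{\alpha}\right) n_{ij}!} \left(\frac{\frac{1}{\alpha}}{\frac{1}{\alpha} + \lambda_{ij}}\right)^{\frac{1}{\alpha}} \left(\frac{\lambda_{ij}}{\frac{1}{\alpha} + \lambda_{ij}}\right)^{n_{ij}} \qquad \text{(from Eq. (5))}$$

Under this form, the variance is allowed to differ from the mean as below,

$$Var[n_{ij}] = E[n_{ij}]\left[1 + \alpha E[n_{ij}]\right] = E[n_{ij}] + \alpha E[n_{ij}]^2 \qquad \text{(from Eq. (6))}$$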

The over-dispersion parameter alpha is not subscripted in the negative binomial regression equations, implying it is the same for all segments. Given some correlation between segment specific geometry and the over-dispersion effect, some heterogeneity would be expected in the over-dispersion parameter. To accommodate this, alpha is modeled as follows:

$$\alpha_{ij} = \exp(z_{ij}\gamma) \;\Rightarrow\; \ln \alpha_{ij} = z_{ij}\gamma \qquad \text{Eq. (9)}$$


This model, where the over-dispersion term is parameterized in addition to the negative binomial specification, is called the heterogeneous negative binomial regression, or the generalized negative binomial regression. In order to obtain estimates (β̂) of the unknown parameters β, the likelihood function, in its equivalent logarithmic form, is maximized, making the HNB a maximum likelihood method. The HNB algorithm in LIMDEP uses a Berndt-Hall-Hall-Hausman (BHHH) estimator to optimize the likelihood function.
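To make the objective being maximized concrete, the sketch below evaluates a heterogeneous negative binomial log-likelihood in which each observation has its own over-dispersion value, computed as the exponential of a linear index. The names `gamma`, `z_rows` and the numeric values are hypothetical; this is not LIMDEP's routine, only the function such a routine would optimize.

```python
import math

def alpha_from_z(z_row, gamma):
    """Observation-specific over-dispersion: alpha = exp(z . gamma)."""
    return math.exp(sum(z * g for z, g in zip(z_row, gamma)))

def hnb_log_likelihood(counts, lambdas, z_rows, gamma):
    """Sum of negative binomial log-probabilities with per-observation alpha."""
    total = 0.0
    for n, lam, z_row in zip(counts, lambdas, z_rows):
        r = 1.0 / alpha_from_z(z_row, gamma)
        total += (math.lgamma(r + n) - math.lgamma(r) - math.lgamma(n + 1)
                  + r * math.log(r / (r + lam))
                  + n * math.log(lam / (r + lam)))
    return total

# Hypothetical data: z holds a constant and one covariate (e.g. log segment length)
counts = [0, 2, 1]
lambdas = [0.3, 1.1, 0.6]
z_rows = [[1.0, -1.2], [1.0, 0.4], [1.0, -0.3]]
gamma = [0.5, -0.8]
print(hnb_log_likelihood(counts, lambdas, z_rows, gamma))
```

Setting every coefficient in `gamma` except the constant to zero recovers the standard negative binomial model with a single alpha.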

The Hessian can be obtained from the derivative of

Because computing the Hessian is computationally intensive, the variance and information matrices for the MLE are obtained using the BHHH approximation technique, which uses the moments of the log-likelihood derivatives and the outer product of the derivatives.

This is given by

$$Var[g_i(\theta_0)] = -E[H_i(\theta_0)]$$

Where,

$$g_i = \frac{\partial \ln f(y_i \mid \theta)}{\partial \theta}$$
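The outer-product idea can be illustrated on a deliberately simple case. The sketch below swaps in a two-parameter Poisson regression (not the HNB model) so that the score vector of each observation has a short closed form; it then accumulates the outer products of these scores and inverts the resulting 2 x 2 matrix to obtain a covariance estimate. The data, the parameter values and the function name are hypothetical, and the parameters are not claimed to be maximum likelihood estimates.

```python
import math

def bhhh_covariance(y, x, beta):
    """Outer-product-of-gradients covariance for a Poisson model with mean exp(b0 + b1*x)."""
    s00 = s01 = s11 = 0.0
    for yi, xi in zip(y, x):
        resid = yi - math.exp(beta[0] + beta[1] * xi)
        g0, g1 = resid, resid * xi          # score vector g_i of observation i
        s00 += g0 * g0                      # accumulate the outer product g_i g_i'
        s01 += g0 * g1
        s11 += g1 * g1
    det = s00 * s11 - s01 * s01
    # Inverse of the 2 x 2 outer-product matrix
    return [[s11 / det, -s01 / det], [-s01 / det, s00 / det]]

y = [0, 1, 3, 2, 5]                         # hypothetical counts
x = [0.0, 1.0, 2.0, 3.0, 4.0]               # hypothetical covariate
cov = bhhh_covariance(y, x, [0.0, 0.4])
print(cov)
```

Only first derivatives are needed, which is the computational appeal of the approach when the Hessian is expensive to evaluate.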

The n x K matrix with i-th row equal to the transpose of the i-th vector of derivatives is given by,

$$\hat{G} = [\hat{g}_1, \hat{g}_2, \ldots, \hat{g}_n]'$$

The HNB model for the whole dataset was developed by incorporating the FP(m,p) specifications previously described. Due to the nature of the variable transformations that could result from the MFP search algorithm, and their incorporation with the HNB, interpretation of the resulting specifications can at times be complicated, as can the calculation of the elasticities of the variables in the models. To obtain the elasticity measures for the continuous variables in the final model specification, the following procedure can be followed.

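Given a negative binomial specification of the form $\lambda_{ij} = \exp(\beta X_{ij} + \varepsilon_{ij})$,

The over-dispersion term $\alpha = \exp(\varepsilon) = \exp(\theta + F_p(X))$

$\Rightarrow \ln(\alpha) = \theta + F_p(X)$, and

$\varepsilon = \ln[\exp(\theta + F_p(X))]$

Therefore, the effect of change in any given X variable can be obtained by:

$$\frac{\partial \lambda}{\lambda\,\partial x} = \frac{\partial}{\partial x}\left[\beta x + \ln\left[\exp(\theta + F_p(x))\right]\right]; \quad x \in F_p(X) \qquad \text{Eq. (10)}$$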


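In order to better understand the above formulation, consider for example the MFP variable:

Variable1 = Ln(AADT)^3 × Ln(Ln(AADT))

The resulting HNB specification for Variable1 would be given by (for simplicity, Ln(AADT) is rewritten as Lnadt):

$$\ln \lambda_i = \beta_0 + \beta_1 \times Lnadt + \beta_2 \times F_p[Lnadt]$$

$$\Rightarrow \frac{\partial \lambda_i}{\lambda_i} = \beta_1 \times \partial(Lnadt) + \beta_2 \times \partial\left[F_p[Lnadt]\right]$$

Differentiating with respect to AADT and applying the chain rule to both terms of the product,

$$\frac{1}{\lambda_i}\frac{\partial \lambda_i}{\partial AADT} = \frac{\beta_1}{AADT} + \beta_2 \times \frac{\partial}{\partial AADT}\left[Lnadt^3 \times \ln(Lnadt)\right]$$

$$= \frac{\beta_1}{AADT} + \beta_2 \times \left[3 \times Lnadt^2 \times \ln(Lnadt) \times \frac{1}{AADT} + Lnadt^3 \times \frac{1}{Lnadt} \times \frac{1}{AADT}\right]$$

$$\Rightarrow \frac{1}{\lambda_i}\frac{\partial \lambda_i}{\partial AADT} = \frac{\beta_1}{AADT} + \beta_2 \times Lnadt^2 \times \left[3 \times \ln(Lnadt) + 1\right] \times \frac{1}{AADT}$$

The elasticity of AADT is given by $E^{\lambda_i}_{AADT} = \frac{\partial \lambda_i}{\partial AADT} \times \frac{AADT}{\lambda_i}$

Therefore, $E^{\lambda_i}_{AADT} = AADT \times \left[\frac{\beta_1}{AADT} + \beta_2 \times Lnadt^2 \times \left[3 \times \ln(Lnadt) + 1\right] \times \frac{1}{AADT}\right] = \beta_1 + \beta_2 \times Lnadt^2 \times \left[3 \times \ln(Lnadt) + 1\right]$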

This procedure can be followed for all transformed variables in the final specification to achieve an accurate measure of variable elasticities. For the purposes of this dissertation, a 1% change method was employed to estimate variable elasticities, wherein the effect on the dependent crash counts of increasing the base independent variables by 1% was computed using the parameter estimates obtained through the model specification.
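The relationship between an analytic elasticity and the 1% change method can be checked numerically. The sketch below assumes a log-linear mean with one linear term in Ln(AADT) and one transformed term Ln(AADT)^3 × Ln(Ln(AADT)); the coefficients and the AADT value are hypothetical, and the closed form used in `elasticity_analytic` follows from applying the chain rule to that transformed term.

```python
import math

def log_lambda(aadt, b0, b1, b2):
    """Log of expected crashes for the assumed specification."""
    lnadt = math.log(aadt)
    return b0 + b1 * lnadt + b2 * lnadt ** 3 * math.log(lnadt)

def elasticity_analytic(aadt, b1, b2):
    """d ln(lambda) / d ln(AADT) obtained by the chain rule."""
    lnadt = math.log(aadt)
    return b1 + b2 * lnadt ** 2 * (3.0 * math.log(lnadt) + 1.0)

def elasticity_one_percent(aadt, b0, b1, b2):
    """Percent change in expected crashes per 1% increase in AADT."""
    lam0 = math.exp(log_lambda(aadt, b0, b1, b2))
    lam1 = math.exp(log_lambda(1.01 * aadt, b0, b1, b2))
    return (lam1 - lam0) / lam0 / 0.01

b0, b1, b2 = -5.0, 0.6, 0.0005              # hypothetical coefficients
aadt = 5000.0                               # hypothetical traffic volume
print(elasticity_analytic(aadt, b1, b2), elasticity_one_percent(aadt, b0, b1, b2))
```

The two values differ slightly because the 1% change method measures a finite rather than an infinitesimal change, but it has the advantage of working unchanged for any transformation the MFP search returns.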

4.6 Prediction validation measures

The methodological flow detailed in this study is aimed at approaching data heterogeneity through a combination of functional form treatments, and the parameterization of the negative binomial over-dispersion parameter α. While the Poisson and negative binomial specifications serve as baselines, providing a starting point for the subsequent model specifications, the MFP search negative binomial, and the resulting heterogeneous negative binomial specifications form an iterative loop of sorts, with each stage looking to tease out some of the observed and unobserved heterogeneity effects, while providing stable and improved predictions. During the modeling stages, improvements in log-likelihood, strong goodness-of-fit measures, and reductions in the over-dispersion term alpha, were used to take the process forward. But given that the aim of this study is to obtain effective and stable crash predictions while accounting for heterogeneity, it is desirable to evaluate the specifications further.

The ‘out of sample’ validation runs for this study were performed using STATA on the 2009 subset of the homogeneous segment crash database. When a model is executed in STATA, the resulting parameters are stored in the system until the next model run. To obtain the model predictions for the 2009 dataset, the ‘predict varname’ syntax was used in combination with reading-in both the 2010 and 2009 datasets, and the resulting prediction vectors were exported in comma-separated-value format into MS Excel. The predicted crashes were used to compute mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean squared error (RMSE) measures.

The MAE is given by the mean of the absolute value of the difference between the observed number of crashes and the predicted crashes at each segment. The MAPE, on the other hand, uses the absolute value of the error term as a percentage of the observed crashes; segments with zero observed crashes cannot be included. The RMSE is the square root of the Mean Square Error (MSE) term, which as the name suggests is the average of the square of the error at each segment. Since the RMSE involves squaring the error term, segments with large differences between observed and predicted crashes will be weighted disproportionately as compared to the others, leading to a higher overall RMSE value. As with any error term, it is desirable to obtain as low a value as possible for each of these predictive measures.
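The three measures can be computed as in the following sketch. The function name and the example vectors are hypothetical; the MAPE is taken over segments with non-zero observed crashes only, as described above.

```python
import math

def prediction_errors(observed, predicted):
    """Return (MAE, MAPE in percent, RMSE) for paired observed and predicted counts."""
    errors = [o - p for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    # Segments with zero observed crashes are excluded from the MAPE
    pct = [abs(o - p) / o for o, p in zip(observed, predicted) if o > 0]
    mape = 100.0 * sum(pct) / len(pct) if pct else float("nan")
    return mae, mape, rmse

observed = [0, 1, 2, 4]                    # hypothetical observed crashes
predicted = [0.5, 1.0, 1.0, 2.0]           # hypothetical model predictions
mae, mape, rmse = prediction_errors(observed, predicted)
print(mae, mape, rmse)
```

In this small example the RMSE exceeds the MAE because the single large error on the last segment is squared before averaging.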

4.7 CURE plots to test functional forms resulting from the MFP algorithm

As mentioned in the previous sections, the MFP search algorithm is empirically keyed to the dataset being used for the specification search. It is desirable to test the functional form transformations dictated by the search. Additionally, the aforementioned model prediction validation techniques along with the model goodness-of-fit measures regularly used in model building provide insight only into overall model fit. It is also desirable to obtain some level of understanding as to how well a model specification behaves for all values of every variable.

Towards this goal, Hauer (2015) employs a graphical method known as a CURE plot to visually gauge the effectiveness of a model by plotting cumulative residuals against variables of interest. Obtaining a CURE plot requires sorting model residuals in the ascending order of the variable of interest, cumulating the resulting values, and then plotting these values against the variable of interest. Figure 4-1 shows an example of a CURE plot for cumulative residuals versus segment length, as provided by Hauer. The rising regions of the curve correspond to segment lengths for which the observed crash counts tend to be larger than what the model predicts, as seen from the origin to point A; point B to point C; and point E to point F. Regions where the curve drops show segment lengths where the model predicts more crashes than were observed. Also, areas with steadily increasing or decreasing runs correspond to regions of consistent underestimation and overestimation, respectively.


Figure 4-1 Example CURE plot

Given the need to distinguish between stretches of over- or under-fitting that are consistent with random variation in an unbiased model, and those that indicate the presence of bias in fit, Hauer suggests the use of a computational limit beyond which a random walk of an unbiased model would rarely stray. For any observation ‘i’ in a dataset with ‘n’ observations, this limit can be computed by:

Eq. (11)

Figure 4-2 shows an example provided by Hauer of the cumulative residuals of a regression model versus segment length, with the ±2σ’ limits included. The author indicates that regions of the CURE plot that exceed these standard deviation bounds are possible sources of biased model fit.


Figure 4-2 CURE plot with standard deviation bounds
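The construction of a CURE plot with its bounds can be sketched as below. The limit formula used here is an assumption: it is the form commonly associated with Hauer's CURE method, in which the variance of the random walk at observation i is the cumulative sum of squared residuals up to i, scaled by one minus its share of the total. The function name and the example data are hypothetical.

```python
import math

def cure_with_limits(x_values, observed, predicted):
    """Return (x, cumulative residual, 2*sigma limit) rows sorted by x."""
    order = sorted(range(len(x_values)), key=lambda i: x_values[i])
    residuals = [observed[i] - predicted[i] for i in order]
    total_sq = sum(r * r for r in residuals)
    rows, cum, cum_sq = [], 0.0, 0.0
    for i, r in zip(order, residuals):
        cum += r                            # cumulative residual up to this x value
        cum_sq += r * r                     # cumulative squared residuals
        # Assumed limit: sigma^2(i) * (1 - sigma^2(i) / sigma^2(n)), floored at zero
        share = max(0.0, 1.0 - cum_sq / total_sq) if total_sq > 0 else 0.0
        rows.append((x_values[i], cum, 2.0 * math.sqrt(cum_sq * share)))
    return rows

lengths = [3.0, 1.0, 2.0]                   # hypothetical segment lengths
observed = [2, 0, 1]
predicted = [1.0, 0.5, 1.5]
for row in cure_with_limits(lengths, observed, predicted):
    print(row)
```

A cumulative residual whose absolute value exceeds the third element of its row would fall outside the plotted bounds.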

For the purpose of this dissertation, the CURE plot was employed as a means of visually testing the model fit for each step of the modeling process, along with studying the effectiveness of the functional form transformation insight obtained through the MFP search algorithm- comparisons were made for the fixed parameter negative binomial model, for both the untransformed independent variables, and the MFP dictated functional form treatments.

4.8 Modeling methodology summary

The modeling sequence presented in this dissertation begins with a negative binomial specification forming the baseline model. Subsequently, the variables obtained from this base specification were input into a multivariable fractional polynomial search, to test whether the continuous variables in the model would provide a better fit through nonlinear transformations.

Owing to the size of the dataset and the complexity in the distributions of the variables to be included in the specification, the 2010 homogeneous segment dataset was subdivided into 10 sections, each section being subjected to an MFP search. Subtle changes in variable distributions are likely to get averaged out when a specification search is conducted on a large dataset, leading to an increased chance of unaccounted heterogeneity. Breaking down this step into 10 sub-specifications enabled the MFP search to account for these changes in distributions to a certain extent. The resulting model specifications were aggregated, producing a single model that contained, in addition to the roadside and State Route dummies, transformed continuous variables.

Figure 4-3 Structure of modeling process

Taking the concept further, the specifications obtained from this MFP search were incorporated into a heterogeneous negative binomial model, with alpha being parameterized against the natural log of AADT, natural log of segment length, the 22 roadway geometry variables, the 54 roadside variables, and all the resultant fractional polynomial transformations, culminating in a model wherein continuous variables that were found to significantly influence over-dispersion were tested for nonlinear functional forms, and reintroduced into the prediction of alpha with the aim of addressing the heterogeneous effects in the dataset.

Figure 4-4 Proposed algorithm in the context of this dissertation


In conclusion, this dissertation puts forth an algorithm to employ the F(m,p) specification search for a large dataset with apparent over-dispersion, so that the results of that search can be embedded in a single heterogeneous negative binomial estimation step. The algorithm is briefly described below:

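1. Identify an initial design matrix X using a traditional negative binomial (NB) model such that $\beta^{X}_{MLE} = \arg\max_{\beta} L(\beta \mid X)$, where

$$L(\beta \mid X) = \prod_{i=1}^{n} \frac{\Gamma\left(\frac{1}{\alpha} + n_i\right)}{\Gamma\left(\frac{1}{\alpha}\right) n_i!} \left(\frac{\frac{1}{\alpha}}{\frac{1}{\alpha} + \lambda_i}\right)^{\frac{1}{\alpha}} \left(\frac{\lambda_i}{\frac{1}{\alpha} + \lambda_i}\right)^{n_i}$$

2. Partition the dataset into randomly and uniformly sized sub-datasets, with K cuts (K = 10)

3. Apply the F(m,p) search individually on each subset j (j = 1,…,10) and identify $g(X_j)$, where g(.) is a fractional polynomial specification for the multivariable vector $X_j$

4. Create an updated design matrix $X^U$ such that $X^U = g(X_1) * D_1 + \dots + g(X_j) * D_j = \sum_{1}^{j} g(X_j) * D_j$, where $g(X_j)$ is an F(m,p) transformation of the original X vector, and $D_j$ is a subset dummy indicator

5. Re-estimate $\beta^{X^U}_{MLE} = \arg\max_{\beta} L(\beta \mid X^U)$, where

$$L(\beta \mid X^U) = \prod_{i=1}^{n} \frac{\Gamma\left(\frac{1}{\alpha_i} + n_i\right)}{\Gamma\left(\frac{1}{\alpha_i}\right) n_i!} \left(\frac{\frac{1}{\alpha_i}}{\frac{1}{\alpha_i} + \lambda_i}\right)^{\frac{1}{\alpha_i}} \left(\frac{\lambda_i}{\frac{1}{\alpha_i} + \lambda_i}\right)^{n_i}$$

$$= \prod_{i=1}^{n} \frac{\Gamma\left(\exp(-Z_i\gamma) + n_i\right)}{\Gamma\left(\exp(-Z_i\gamma)\right) n_i!} \left(\frac{\exp(-Z_i\gamma)}{\exp(-Z_i\gamma) + \lambda_i}\right)^{\exp(-Z_i\gamma)} \left(\frac{\lambda_i}{\exp(-Z_i\gamma) + \lambda_i}\right)^{n_i}$$

In this step, it is possible that $Z_i$ may include elements of $X^U$.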
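Step 4 of the algorithm above can be illustrated with a small sketch that builds one row of the updated design matrix. It assumes that each subset has its own list-valued transformation and that the subset dummy simply switches that block on or off; the function name, the two-subset setup and the transformations are hypothetical and far simpler than the specifications found in this study.

```python
import math

def updated_design_row(x, subset, transforms, k):
    """One row of X^U: block j holds g_j(x) if the segment belongs to subset j, zeros otherwise."""
    row = []
    for j in range(1, k + 1):
        d_j = 1.0 if subset == j else 0.0          # subset dummy D_j
        row.extend(d_j * value for value in transforms[j](x))
    return row

# Hypothetical transformations found by separate searches on two subsets
transforms = {
    1: lambda x: [x],                              # subset 1 keeps x linear
    2: lambda x: [x ** 0.5, math.log(x)],          # subset 2 uses powers (0.5, 0)
}

print(updated_design_row(4.0, 1, transforms, k=2))
print(updated_design_row(4.0, 2, transforms, k=2))
```

With this layout each segment contributes to the pooled likelihood only through the functional form found for its own subset.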


Chapter 5

Model estimation framework and results

The model specification goal for this study is to predict crash counts with statistically significant over-dispersion parameterization, using variants on negative binomial regressions. In line with the current SPFs in use, the base models were made using only AADT and segment length information. In the next step, parameters were included to reflect the effect of roadway characteristics, roadside features, and state route specific effects. This model forms the base for comparisons with subsequent model building iterations.

One major challenge with any modeling undertaking with moderately large datasets, is the issue of variable selection. It is not ideal to include more variables than necessary in a model where a simpler specification would provide a comparable fit. At the same time, valuable interactions and predictive insight might be lost by ignoring a variable during the modeling process. For the purposes of this study, all available variables were utilized during the model building process, and elimination was done based on variable significance. At each stage, the resulting significant variables were used as inputs for the next stage, the understanding being that the effect of a significant variable on heterogeneity would carry forward from one step to the next, and any improvements observed in subsequent steps would be as a result of the corresponding specification changes.

The set of models presented in this study was executed using the 2010 subset of the homogeneous segment crash database. As before, the dataset contains 12,118 documented crashes over 62,598 homogeneous segments. The original homogeneous segment dataset had 138 variables. After including dummies for the State Routes, and separating out other continuous variables by range, the number of columns in the dataset was increased to 202. These 202 variables, in addition to the total crashes, AADT and segment length, also included 22 roadway geometry descriptors, 37 roadside feature dummies, 17 roadside feature length percentage variables, and 56 State Route dummy indicators. The significant specifications at each stage were then validated against the 2009 subset of the homogeneous segment database. Going forward, this section will present the model outputs at each step of the process, along with a detailed explanation of the transition between the steps in regard to model progression, variable selection, and changes in model performance measures.

5.1 Baseline negative binomial specifications

The modeling process for this study began with setting the comparative baseline for forthcoming models. The restricted log-likelihood for the model sets was obtained from what is known as a constant-only model. This basically amounts to understanding how much of the dependent variable variation can be explained in the absence of any other variables. The intercept term from a constant-only regression will, in addition to a baseline number of predicted crashes, also take on some of the effects of the heterogeneity in the dependent variable. As the specification of the model is defined and improved upon, the intercept value will change, and iterate towards eliminating the heterogeneity effect on its true value. The log-likelihood value for the constant-only model was found to be -22,668.192 for the 62,598 observations, and this value is expected to increase, that is, become less negative and tend towards zero, as the model predictions improve.

Among the various data descriptors available to transportation engineers, daily traffic flow along the length of a segment is among the most commonly obtained, and hence is among the most used predictors in crash model development. Table (5-1) shows the model results from regressing the total crash count variable against the natural log of the segment AADT and the natural log of homogeneous segment length. Both variables were found to be strongly significant at a 99% confidence level, with the dispersion parameter alpha having a value of 16.817.

A positive alpha value shows that over-dispersion is indeed present in the data, and that choosing to model the crash counts using a negative binomial regression is justified. Positive coefficients for both variables indicate that as they increase, so do the expected crash counts, which follows the logic that as the traffic volume increases along a segment, opportunities for conflict between vehicles would also be expected to increase. With a strong Chi-squared value of 2521.13, and a pseudo R2 value of 5.56%, this model results in an increase of 1260.564 in the convergent log likelihood, from -22,668.192 to -21,407.628.

Table 5-1 Negative binomial specification with AADT and segment length

Variable                                      Coeff.    S.E.     t-stat
Constant                                      -4.907    0.172    -28.470
Natural log of AADT                            0.635    0.020     31.040
Natural log of homogeneous segment length      0.775    0.020     38.900
Dispersion parameter for count data model
Alpha                                         16.817    0.397

Number of obs = 62,598         Chi squared = 2521.13          Prob>chi2 = 0.000
Restricted LL = -22,668.192    Log likelihood = -21,407.628   Pseudo R2 = 0.0556
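The summary statistics in Table 5-1 can be reproduced from the two log-likelihood values. The sketch below assumes the reported Chi-squared is the likelihood ratio statistic and the pseudo R2 is one minus the ratio of the model to the restricted log-likelihood, which is consistent with the reported figures; the function names are hypothetical.

```python
def lr_chi_squared(ll_restricted, ll_model):
    """Likelihood ratio statistic: twice the gain in log-likelihood."""
    return 2.0 * (ll_model - ll_restricted)

def pseudo_r_squared(ll_restricted, ll_model):
    """One minus the ratio of model to restricted log-likelihood."""
    return 1.0 - ll_model / ll_restricted

ll_restricted = -22668.192      # constant-only model
ll_model = -21407.628           # model with AADT and segment length

print(lr_chi_squared(ll_restricted, ll_model))
print(pseudo_r_squared(ll_restricted, ll_model))
```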

The next step in the modeling procedure was to introduce the roadway geometric, roadside features, and route dummy variables into the model specification. The idea behind including geometry and route dummy data in the model was that, AADT and segment length being significant predictors of crashes is understandable, but some underlying effect due to the nature of the roadway, and the specific location of the segments could also be influencing the propensity for crashes.


The result of this model is depicted in Table (5-2). In addition to AADT and segment length, seven roadway and roadside characteristic variables were found to be significant at a 99% level of confidence. As with the previous SPF, AADT and segment length were found to have the highest t-statistics, with positive coefficients that follow the same interpretation as before. The variable for the number of lanes in the increasing direction was also found to have very strong statistical evidence of being different from zero, with a negative coefficient. This would suggest that as the number of lanes on a segment increases, the number of crashes would be predicted to decrease. This observation is again something that is expected based on the reduction in possible conflict points between vehicles, even though increasing the number of lanes could also increase the propensity for lane changing. One interesting observation from the dataset was that increasing the average segment lane width and the degree of horizontal curvature led to an increase in predicted crash counts, albeit by a very small factor, as dictated by the small coefficients.

Similarly, having a vertical curve length between 300 and 450 feet had a decreasing effect on crash counts. State route dummies relating to three routes were also found to be significant: segments along State Route 21 were predicted to have fewer crashes than segments on the other State Routes (coefficient of -0.678), while the interaction dummy for State Routes 03 and 26 indicated more predicted crashes on segments along these routes than on segments on the other routes (coefficient of 0.564). Also, three roadside feature variables were found to be significant with positive coefficients. For the culvert variable, the interpretation is that as the length of culvert along the side of a road segment increases, the number of crashes on the segment would be expected to increase.

Similarly, the presence of a cabinet or a miscellaneous fixed object alongside the road segment would be expected to increase crash propensities, with coefficients of 0.576 and 0.283 respectively.
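As a worked illustration, and assuming the standard log-link form of the negative binomial mean, the coefficient of a dummy variable acts multiplicatively on the expected crash count, so the effect of switching a dummy from 0 to 1 can be read as the factor exp(β). Using the coefficients reported in Table 5-2:

\[
\begin{aligned}
\exp(-0.678) &\approx 0.508 && \text{(State Route 21)} \\
\exp(0.564) &\approx 1.758 && \text{(State Routes 03 and 26)} \\
\exp(0.576) &\approx 1.779 && \text{(cabinet)} \\
\exp(0.283) &\approx 1.327 && \text{(miscellaneous fixed object)}
\end{aligned}
\]

Under this reading, and holding all other variables fixed, a segment on State Route 21 would be predicted to have roughly half the expected crashes of an otherwise identical segment on another route.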

The convergent log likelihood for this model was found to be an improvement over the previous SPF, at -20,573.57, with a pseudo R2 of 9.24%. The dispersion parameter alpha was also found to have decreased significantly from the baseline model to 12.190, with its standard error also reducing from 0.397 in the baseline model to 0.304. These measures, along with a strongly significant chi-squared statistic, would indicate a definite improvement in model fit and a reduction in heterogeneity effects.
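The fit statistics quoted for the two baseline models can be reproduced from the reported log likelihoods. The sketch below assumes that the chi-squared value is the likelihood ratio statistic against the constant-only model and that the pseudo R2 is McFadden's measure; the function names are illustrative.

```python
def lr_chi2(ll_restricted: float, ll_model: float) -> float:
    """Likelihood ratio statistic: 2 * (LL_model - LL_restricted)."""
    return 2.0 * (ll_model - ll_restricted)


def mcfadden_r2(ll_restricted: float, ll_model: float) -> float:
    """McFadden pseudo R2: 1 - LL_model / LL_restricted."""
    return 1.0 - ll_model / ll_restricted


LL_RESTRICTED = -22668.192   # constant-only log likelihood (Tables 5-1 and 5-2)
LL_TABLE_5_1 = -21407.628    # AADT and segment length only
LL_TABLE_5_2 = -20573.570    # with roadway, roadside and route effects

chi2_5_1 = lr_chi2(LL_RESTRICTED, LL_TABLE_5_1)
r2_5_1 = mcfadden_r2(LL_RESTRICTED, LL_TABLE_5_1)
chi2_5_2 = lr_chi2(LL_RESTRICTED, LL_TABLE_5_2)
r2_5_2 = mcfadden_r2(LL_RESTRICTED, LL_TABLE_5_2)
```

Under these assumptions the computed values agree with the chi-squared and pseudo R2 figures reported in Tables 5-1 and 5-2.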

Table 5-2 Baseline NB with AADT, segment length, State Route, roadway and roadside effects

Variable                                                               Coeff.     S.E.     t-stat
Constant                                                               -2.610    0.217    -11.980
Natural log of AADT                                                     0.832    0.022     36.920
Natural log of homogeneous segment length                               0.779    0.020     38.570
Number of lanes in increasing direction                                -4.471    0.154    -28.880
Average lane width (ft)                                                 0.054    0.009      5.740
Degree of curvature for horizontal curve                                0.024    0.004      6.530
State Route Dummy (If srn = 21, 1 else 0)                              -0.678    0.199     -3.410
State Route Interaction Dummy (If srn = 003 or srn = 026, 1 else 0)     0.564    0.108      5.190
Length of vertical curve between 300 and 450, 0 otherwise              -0.001    0.001     -3.480
Percent length of culvert on segment                                    1.203    0.278      4.330
Dummy- miscellaneous fixed object                                       0.283    0.069      4.100
Dummy- Cabinet                                                          0.576    0.182      3.160
Dispersion parameter for count data model
Alpha                                                                  12.190    0.304

Number of obs = 62,598    Chi squared = 4189.24    Prob>chi2 = 0.000
Restricted LL = -22,668.192    Log likelihood = -20,573.570    Pseudo R2 = 0.0924

One of the major goals of this dissertation was to study the possible improvement in model predictions through the use of a heterogeneous negative binomial (HNB) estimation. To provide some insight into the interpretation of an HNB, Table 5-3 shows the output for a simple specification involving the significant variables from the baseline NB SPF. Because alpha is the dispersion parameter, parameterizing it could provide insight into the factors that influence the over-dispersion of the data more than others. The core logic behind this statement is the nonzero probability that segments of equal length with the same geometric features and similar levels of daily traffic volume could show significantly different observed crash counts.

This over-dispersion heterogeneity could result from certain segments being inherently safe, from the length of time over which data collection was performed and included in the study, or from variations in the segment level roadside features. Therefore, the roadside variables were introduced into the model specification as predictors for the dispersion parameter alpha.

Table 5-3 Baseline heterogeneous negative binomial specification

Variable                                                               Coeff.     S.E.     t-stat
Constant                                                               -2.616    0.218    -11.970
Natural log of AADT                                                     0.853    0.023     37.630
Natural log of homogeneous segment length                               0.797    0.019     40.290
Number of lanes increasing                                             -4.533    0.156    -29.140
Average lane width (ft)                                                 0.054    0.009      5.800
Degree of curvature for horizontal curve                                0.024    0.004      6.230
State Route Dummy (If srn = 21, 1 else 0)                              -0.703    0.202     -3.480
State Route Interaction Dummy (If srn = 003 or srn = 026, 1 else 0)     0.609    0.107      5.680
Length of vertical curve between 300 and 450, 0 otherwise              -0.001    0.001     -3.550
Dummy- Cabinet                                                          0.597    0.172      3.470

Heterogeneity in dispersion parameter
Constant                                                                2.592    0.027     94.690
Dummy for miscellaneous fixed object                                   -0.594    0.068     -8.760
Percent length of culvert on segment                                    0.608    0.289      2.100

Number of obs = 62,598    Chi squared = 4195.81    Prob>chi2 = 0.000
Restricted LL = -22,668.19    Log likelihood = -20,553.376    Pseudo R2 = 0.0926

It can be seen that the variables carried forward from the baseline negative binomial specification remained strongly significant with no change in the sign of their coefficients. Therefore, the direction of their effects and the interpretation of their predictions remain the same. Of the roadside variables, the dummy indicating the presence of a fixed object alongside a segment and the variable representing the percentage of segment length occupied by a roadside culvert were found to be significant at over 95% levels of confidence. In other words, having a fixed object present on the roadside, or having a certain length of culvert, statistically contributes to the over-dispersion of the data. The nature of alpha for the HNB model requires a little more analysis. The alpha value obtained from the baseline negative binomial model, which did not take the roadside effects on over-dispersion into account, was 12.190. From the HNB output for the fixed object dummy, it can be determined that:

\[
\alpha_{\mathrm{fxobj}=1} = \exp\{2.592 + 1.0(-0.594)\} = \exp(1.998) = 7.3743
\]

\[
\alpha_{\mathrm{fxobj}=0} = \exp\{2.592 + 0.0(-0.594)\} = \exp(2.592) = 13.3564
\]

From this it can be seen that as compared to the baseline negative binomial specification, there is clearly much higher over-dispersion in the subset of segments that do not have any fixed objects present alongside them. In accounting for this fact, this model would be expected to provide a significant improvement in predictions as compared to the baseline model.
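The calculation above can be sketched as a small function that evaluates the parameterized dispersion term from the coefficients in the lower panel of Table 5-3. The function name and the argument `culvert_share` are illustrative, and the culvert variable is held at zero here to match the worked values; its scale in the data is not assumed.

```python
import math

# Dispersion-equation coefficients from Table 5-3.
GAMMA_CONST = 2.592
GAMMA_FXOBJ = -0.594
GAMMA_CULVERT = 0.608


def alpha(fxobj: int, culvert_share: float = 0.0) -> float:
    """Segment-specific over-dispersion: exp(constant + sum of gamma * z)."""
    return math.exp(GAMMA_CONST + GAMMA_FXOBJ * fxobj + GAMMA_CULVERT * culvert_share)


alpha_with_object = alpha(fxobj=1)
alpha_without_object = alpha(fxobj=0)
```

Writing alpha as an exponential of a linear index keeps the dispersion term positive for any combination of roadside variables.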

5.2 Random parameter negative binomial specification

The baseline negative binomial models shown in the previous section assumed a fixed parameter for every observation in the dataset. In doing so, they assume that the effect of the estimated parameter does not vary across the segments in the dataset. Random parameter negative binomial models assume that some parameters can vary across observations, and that they follow a known distribution. Since one of the objectives of this dissertation was to study the effect of nonlinear functional form transformations on the crash predictions of fixed parameter regressions, a random parameter negative binomial specification was developed as a means of comparison.

The random parameter negative binomial model shown in Table 5-4 was estimated by specifying a functional form for the parameter density function and using simulation-based maximum likelihood with 200 Halton draws. Halton draws were chosen for the estimation owing to their general , and the low number of draws required for attaining convergence (Train, 2003). Limiting the number of draws to 200 is believed to provide accurate parameter estimates, as shown by Bhat (2003), Milton et al. (2008), and Anastasopoulos et al. (2009).
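For readers unfamiliar with Halton draws, the following minimal sketch generates the first 200 points of a one-dimensional Halton sequence using the standard radical-inverse construction and maps them to standard normal draws. It illustrates the idea only; the base of 2, the function name, and the mapping step are assumptions and do not reproduce the estimation software used for Table 5-4.

```python
from statistics import NormalDist


def halton(index: int, base: int) -> float:
    """Radical inverse of `index` in the given prime base (index >= 1)."""
    result, fraction = 0.0, 1.0
    i = index
    while i > 0:
        fraction /= base
        result += fraction * (i % base)
        i //= base
    return result


# 200 quasi-random uniforms in (0, 1), one sequence per random parameter.
uniform_draws = [halton(i, 2) for i in range(1, 201)]

# Map to standard normal draws for a normally distributed parameter.
std_normal = NormalDist()
normal_draws = [std_normal.inv_cdf(u) for u in uniform_draws]
```

Unlike pseudo-random numbers, successive Halton points fill the unit interval evenly, which is why fewer draws are typically needed for the simulated likelihood to stabilize.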

In choosing a functional form for the random parameter density function, normal, lognormal, uniform and triangular distributions were considered. In the case of this model, all random parameters were found to be significant under a normal distribution. If the estimated standard deviation of a parameter was not statistically different from zero, the parameter was treated as fixed across the dataset; otherwise, it was treated as a random parameter.

For the specification in Table 5-4, the variables found to be statistically significant random parameters were the natural log of AADT, the natural log of segment length, the number of lanes in an increasing direction, average lane width, and the length of vertical curve between 300 and 450 feet.

The natural log of AADT variable was found to be random with a positively signed, normally distributed parameter having a mean of 0.9373 and a standard deviation of 0.0129. This would indicate that increasing AADT would have an increasing impact on crash frequency, but with varying magnitude across the roadway segments, as dictated by the standard deviation. Similarly, the natural log of segment length variable was found to be normally distributed with a mean of 1.5463 and a standard deviation of 0.5754, with an increasing impact on crash frequencies, while the number of lanes in an increasing direction variable was found to be normally distributed with a negative mean, indicating a reducing impact on crash counts, with magnitudes varying according to a standard deviation of 0.3683.

On the other hand, the parameter for vertical curves with length between 300 and 450 feet was normally distributed with a mean of -0.00054 and a standard deviation of 0.0007. This would indicate that its range of estimated parameters includes positive and negative values, suggesting that the effect of a vertical curve between 300 and 450 feet in length is to increase the number of predicted crashes on some segments (in this case, 22.02% of the segments) while decreasing crash propensities on the other 77.98%.
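The split between segments with positive and negative effects follows directly from the normal distribution of the parameter. A minimal sketch of that calculation, with an illustrative function name, is:

```python
from statistics import NormalDist


def share_positive(mean: float, sd: float) -> float:
    """Share of a normally distributed parameter that lies above zero."""
    return 1.0 - NormalDist(mean, sd).cdf(0.0)


# Mean and standard deviation of the vertical curve parameter from Table 5-4.
positive_share = share_positive(-0.00054, 0.0007)
negative_share = 1.0 - positive_share
```

With the reported mean and standard deviation, the positive share evaluates to approximately 0.22, in line with the percentages quoted in the text.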


Table 5-4 Random parameter NB specification for comparison

Variable                                                                 Coefficient   Standard Error   t-statistic
Constant                                                                    -1.6761        0.1747          -9.5910
Natural log of AADT                                                          0.9373        0.0230          40.715
  Standard deviation of parameter distribution (normally distributed)       0.0129
Natural log of segment length                                                1.5463        0.0292          53.007
  Standard deviation of parameter distribution (normally distributed)       0.5745
Number of lanes in increasing direction                                     -5.4352        0.1173         -46.349
  Standard deviation of parameter distribution (normally distributed)       0.3683
Average lane width (feet)                                                    0.0516        0.0083           6.2110
  Standard deviation of parameter distribution (normally distributed)       0.0187
Length of vertical curve between 300 and 450 feet                           -0.00054       0.0002          -2.6130
  Standard deviation of parameter distribution (normally distributed)       0.0007
Degree of horizontal curvature                                               0.0402        0.0042           9.4980
Percent of segment length containing a culvert                               1.2742        0.3311           3.8480
Dummy for miscellaneous fixed object along segment                           0.2939        0.0623           4.7200
Dummy for cabinet along segment                                              0.5640        0.1599           3.5280
Dummy: 1 if State route number = 21                                         -0.6795        0.2025          -3.3560
Dummy: 1 if State route number = 129                                        -1.8843        0.4386          -4.2960
Dummy: 1 if State route number = 03 or 26                                    0.7122        0.0975           7.3050
ScalParam                                                                    0.2002        0.0054          36.856

Number of Observations = 62,598    Pseudo R2 = 0.5419
Constant only Log-likelihood = -22,668.192    Convergent Log-likelihood = -20,090.30


5.3 Multinomial fractional polynomial negative binomial specification

As stated in the modeling methodology section, nonlinear functional forms for the continuous covariates in the model specifications were developed using the Royston and Altman MFP search algorithm. Owing to the large size of the dataset used for model building in this dissertation, the 2010 homogeneous segment dataset was divided into 10 random subsets, with the aim of reducing computational complexity and teasing out minute variations that would not be apparent if the dataset were regarded as a whole. The MFP algorithm was executed for each of these 10 cuts, and the resulting models were later combined using a set of rules to get the final model for this step. The resulting functional form specifications were recoded using the rules detailed in the methodology section of this dissertation. Once recoded, these functional form transformed variables were used to build a fixed parameter negative binomial regression. The results of the MFP search on the 10 subsets and a comprehensive list of the 216 recoded variables are provided in Appendices B and C.

Table 5-5 shows the negative binomial model that resulted from the estimation performed using the functional form transformed variables. All variables in the baseline negative binomial model were found to have significant transformations with at least a 95% level of confidence, the interpretation of which will be discussed going forward. As such, the model showed a significant improvement over the baseline negative binomial specification with a convergent log-likelihood at -20,410.319, and an alpha value of 11.799 as compared to 12.19.


Table 5-5 MFP incorporated NB specification

Variable    Coeff.     S.E.     t-stat   Description
Constant     4.488    1.056      4.250
dgc1        -0.204    0.058     -3.520   Degree of curvature for horizontal curve
dmfxobj      0.337    0.073      4.640   Dummy for miscellaneous fixed object, 1 if exists, 0 otherwise
lclvert      1.300    0.288      4.520   Percent length of segment for culvert
nlanei      -5.983    0.438    -13.660   Number of lanes in increasing direction
rlnw         0.048    0.009      5.140   Average lane width (ft)
s10u2       -1.246    0.080    -15.540   If sd10=1, Lnlength / (3.01385 + Lnlength), else 0
s10u5        1.969    0.097     20.370   If sd10=1, Lnlength × √(4.60951 + Lnlength) / (3.01385 + Lnlength)
s10u6        0.326    0.034      9.640   If sd10=1, Lnlength^2 × √(4.60951 + Lnlength) / (3.01385 + Lnlength)
s1u3        -1.418    0.085    -16.630   If sd1=1, Lnlength / (3.01693 + Lnlength)
s1u6         0.330    0.035      9.370   If sd1=1, Lnlength^2 × √(4.60951 + Lnlength) / (3.01693 + Lnlength)
s1u9         2.121    0.099     21.440   If sd1=1, Lnlength × √(4.60951 + Lnlength) / (3.01693 + Lnlength)
s2nlan       2.074    0.685      3.030   Interaction term, if sd2=1, value = number of lanes in increasing direction
s2u3         2.365    0.188     12.560   If sd2=1, Lnlength / (3.04161 + Lnlength)
s2u4         0.776    0.062     12.560   If sd2=1, Lnlength^2 / (3.04161 + Lnlength)
s2vcl       -0.002    0.001     -2.440   Interaction term, if sd2=1, value = vcl4530
s3u2         2.669    0.110     24.300   If sd3=1, Lnlength / (3.00523 + Lnlength)
s3u3         0.888    0.037     24.270   If sd3=1, Lnlength^2 / (3.00523 + Lnlength)


Table 5-6 MFP incorporated NB specification (continued)

Variable    Coeff.     S.E.     t-stat   Description
s4dmfx      -0.550    0.244     -2.250   If sd4=1 and miscellaneous fixed object dummy=1
s4u2         2.363    0.104     22.650   If sd4=1, Lnlength / (3.03934 + Lnlength)
s4u3         0.777    0.034     22.610   If sd4=1, Lnlength^2 / (3.03934 + Lnlength)
s5nlan       4.409    0.489      9.020   Interaction term, if sd5=1, value = number of lanes in increasing direction
s5u2        -1.351    0.111    -12.200   If sd5=1, Lnlength / (3.03676 + Lnlength)
s5u5         2.342    0.265      8.850   If sd5=1, Lnlength × √(4.60951 + Lnlength) / (3.03676 + Lnlength)
s5u6         0.417    0.068      6.090   If sd5=1, Lnlength^2 × √(4.60951 + Lnlength) / (3.03676 + Lnlength)
s6dcab       1.028    0.525      1.960   Interaction dummy if sd6=1 and dcabnet=1
s6u2         2.554    0.107     23.900   If sd6=1, Lnlength / (3.04674 + Lnlength)
s6u3         0.839    0.035     23.880   If sd6=1, Lnlength^2 / (3.04674 + Lnlength)
s7rlnw      -0.070    0.033     -2.130   If sd7=1, value = average lane width
s7t1        -0.176    0.031     -5.580   If sd7=1, Log(9.7006 × 10^-8 + dgc1/100)
s7t2        -0.373    0.069     -5.410   If sd7=1, Log(9.7006 × 10^-8 + dgc1/100) × √(9.70066 × 10^-6 + dgc1)
s7u1        -6.879    0.564    -12.200   If sd7=1, 1 / (3.01729 + Lnlength)
s7u3         0.755    0.062     12.190   If sd7=1, Lnlength^2 / (3.01729 + Lnlength)
s7u8         0.0001   0.000     -2.010   If sd7=1, vcl4530 / (−11.843 + rlnw)
s8u2         2.599    0.107     24.270   If sd8=1, Lnlength / (3.04562 + Lnlength)
s8u3         0.853    0.035     24.230   If sd8=1, Lnlength^2 / (3.04562 + Lnlength)

Table 5-7 MFP incorporated NB specification (continued)

Variable    Coeff.     S.E.     t-stat   Description
s9t1        -0.256    0.089     -2.880   If sd9=1, Log(9.7006 × 10^-8 + dgc1/100)
s9t2        -0.535    0.185     -2.900   If sd9=1, Log(9.7006 × 10^-8 + dgc1/100) × √(9.70066 × 10^-6 + dgc1)
s9u1       -11.381    4.368     -2.610   If sd9=1, 1 / (3.0068 + Lnlength)
s9u2        -5.249    1.451     -3.620   If sd9=1, Lnlength / (3.0068 + Lnlength)
s9u5         2.278    0.283      8.060   If sd9=1, Lnlength × √(4.60951 + Lnlength) / (3.0068 + Lnlength)
s9u6         0.373    0.074      5.050   If sd9=1, Lnlength^2 × √(4.60951 + Lnlength) / (3.0068 + Lnlength)
s9vcl       -0.002    0.001     -2.280   Interaction term, if sd9=1, value = vcl4530
sd2         -2.327    0.721     -3.230   K-cut dummy variable, if data subset k=2
sd5         -4.186    0.568     -7.370   K-cut dummy variable, if data subset k=5
srn0326      0.712    0.111      6.430   State Route Interaction Dummy (If srn = 003 or srn = 026)
t0full       1.876    0.441      4.260   √(9.70066 × 10^-6 + dgc1)
t1full       0.140    0.058      2.420   Log(9.7006 × 10^-8 + dgc1/100)
t2full       0.766    0.216      3.540   Log(9.7006 × 10^-8 + dgc1/100) × √(9.70066 × 10^-6 + dgc1)
t3full       0.005    0.000     21.980   LNADT^3 × Log(LNADT)
t4full       0.0001   0.000    -14.880   LNADT^6 × Log(LNADT)
LnAlpha      2.468    0.025     99.090

Number of obs = 62,598    Pseudo R2 = 0.0996
Constant only Log-likelihood = -22,668.192    Convergent Log likelihood = -20,410.319


Figure 5-1 FP() forms for LNADT and Lnlength

Figure 5-1 shows the resulting fractional polynomial forms for AADT and segment length.


The continuous covariates found to be significant in the baseline negative binomial model were the natural log of AADT (Lnadt), the natural log of segment length (Lnlength), the number of lanes in an increasing direction (nlanei), average lane width (rlnw), degree of horizontal curvature (dgc1), vertical curve length between 300 and 450 feet (vcl4530), and percent length of culvert alongside segment (lclvert), with the respective variable names in brackets. In the baseline negative binomial model, these variables were included with a linear functional form, while the MFP negative binomial model uses the same variables in alternate functional forms. Variables sd1 to sd10 are dummy variables representing the 10 data subsets that were used for the multinomial fractional polynomial search, and any interactions with these dummies indicate transformations that were unique to the respective subset.

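For the given specification, the functional form that was found to be significant for the natural log of AADT was

\[
\begin{aligned}
Fp(\mathrm{LnAADT}) &= \{0.005 \times \mathrm{LNADT}^{3} \times \operatorname{Log}(\mathrm{LNADT})\} + \{0.0001 \times \mathrm{LNADT}^{6} \times \operatorname{Log}(\mathrm{LNADT})\} \\
&= \mathrm{LNADT}^{3} \times \operatorname{Log}(\mathrm{LNADT}) \times \{0.005 + 0.0001 \times \mathrm{LNADT}^{3}\}
\end{aligned}
\]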
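The combined AADT term can be evaluated numerically as in the following minimal sketch. The coefficients are those quoted in the expression above; treating Log as the natural logarithm is an assumption, and the function names are illustrative.

```python
import math


def fp_lnaadt(lnadt: float) -> float:
    """Sum of the two fractional polynomial terms for the natural log of AADT."""
    # Assumption: "Log" in the text denotes the natural logarithm.
    term_cubic = 0.005 * lnadt ** 3 * math.log(lnadt)
    term_sixth = 0.0001 * lnadt ** 6 * math.log(lnadt)
    return term_cubic + term_sixth


def fp_lnaadt_factored(lnadt: float) -> float:
    """Same expression written in its factored form."""
    return lnadt ** 3 * math.log(lnadt) * (0.005 + 0.0001 * lnadt ** 3)


# Hypothetical value: ln(5000) is about 8.52.
value = fp_lnaadt(math.log(5000))
```

The two functions return the same value for any positive argument, which is a quick check that the factoring in the text is algebraically consistent.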

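Similarly, the functional form for degree of horizontal curvature was found to be significant at

\[
\begin{aligned}
Fp(\text{degree of horizontal curvature}) ={}& (-0.204 \times \mathrm{dgc1}) \\
&+ \left\{-0.176 \times \mathrm{sd7} \times \operatorname{Log}\!\left(9.7006 \times 10^{-8} + \tfrac{\mathrm{dgc1}}{100}\right)\right\} \\
&+ \left\{-0.373 \times \mathrm{sd7} \times \operatorname{Log}\!\left(9.7006 \times 10^{-8} + \tfrac{\mathrm{dgc1}}{100}\right) \times \sqrt{9.70066 \times 10^{-6} + \mathrm{dgc1}}\right\} \\
&+ \left\{-6.879 \times \frac{\mathrm{sd7}}{3.01729 + \mathrm{Lnlength}}\right\} \\
&+ \left\{-0.256 \times \mathrm{sd9} \times \operatorname{Log}\!\left(9.7006 \times 10^{-8} + \tfrac{\mathrm{dgc1}}{100}\right)\right\} \\
&+ \left\{-0.535 \times \mathrm{sd9} \times \operatorname{Log}\!\left(9.7006 \times 10^{-8} + \tfrac{\mathrm{dgc1}}{100}\right) \times \sqrt{9.70066 \times 10^{-6} + \mathrm{dgc1}}\right\} \\
&+ \left\{1.876 \times \sqrt{9.70066 \times 10^{-6} + \mathrm{dgc1}}\right\} \\
&+ \left\{0.140 \times \operatorname{Log}\!\left(9.7006 \times 10^{-8} + \tfrac{\mathrm{dgc1}}{100}\right)\right\} \\
&+ \left\{0.766 \times \operatorname{Log}\!\left(9.7006 \times 10^{-8} + \tfrac{\mathrm{dgc1}}{100}\right) \times \sqrt{9.70066 \times 10^{-6} + \mathrm{dgc1}}\right\}
\end{aligned}
\]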

The other continuous covariate transformations were combined in a similar manner, and were taken forward for the model prediction analysis phase. While the statistical significance and validity of inclusion for the transformed variables follows the same argument as with the baseline negative binomial model, interpreting the transformed functional forms is not as straightforward.

The effect of the transformations will be interpreted going forward using the elasticity effects of the variables, as will the visual validation of the functional form treatments using CURE plots.

5.4 CURE plots

The method proposed in this dissertation to find optimum functional forms for continuous covariates involves using the Royston-Altman multinomial fractional polynomial search algorithm. The powers tested by the algorithm for best fit are contained within the set {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, with a power of zero representing a natural-log transformation.
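The candidate transformations in this search set can be sketched as follows. This is only an illustration of the power convention described above (a power of zero is read as the natural log); it is not the MFP search itself, and the function and variable names are hypothetical.

```python
import math

# Power set searched by the fractional polynomial algorithm, as listed in the text.
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x: float, power: float) -> float:
    """Single fractional polynomial term: x**power, with power 0 meaning ln(x)."""
    if x <= 0:
        raise ValueError("fractional polynomial terms require a positive covariate")
    return math.log(x) if power == 0 else x ** power


# Candidate transformations of an example covariate value of 4.0.
candidates = {p: fp_term(4.0, p) for p in FP_POWERS}
```

The positivity check reflects why covariates that can be zero or negative are shifted by a small constant before the transformations are applied.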

Given a universe of possibilities for functional form transformations, limiting the search set could have obvious impacts on the optimality of the transformations decided upon. It is therefore desirable to obtain some level of confidence in the improvements in model specifications obtained through using the MFP search. This section presents a visual comparison of model predictions, by using the CURE plot method to study the nature of the resulting cumulative residuals plotted against both, the transformations obtained through the MFP search, and the untransformed linear forms of the predictors.


To set up the CURE plots, the predicted values resulting from the respective models were input into a spreadsheet, along with the continuous variables in their respective functional forms. For one covariate at a time, the spreadsheet was sorted in ascending order of the covariate, and model residuals were computed. Following this, the residuals were cumulated in the same order as the covariate being studied. Standard deviation bounds were also computed using the method prescribed by Hauer (2015).
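The spreadsheet steps described above can be sketched in code as follows. The bound formula used here, two times sqrt(s2(i) * (1 - s2(i) / s2(N))) with s2(i) the running sum of squared residuals, is the form commonly attributed to Hauer and is stated as an assumption rather than as the exact computation used in this dissertation; the function name and the toy data are hypothetical.

```python
import math


def cure(covariate, observed, predicted):
    """Return (sorted covariate, cumulative residuals, +/- two-sigma bound)."""
    # Sort observations in ascending order of the covariate being studied.
    order = sorted(range(len(covariate)), key=lambda i: covariate[i])
    residuals = [observed[i] - predicted[i] for i in order]

    cumulative, cumulative_sq = [], []
    running, running_sq = 0.0, 0.0
    for r in residuals:
        running += r
        running_sq += r * r
        cumulative.append(running)
        cumulative_sq.append(running_sq)

    total_sq = cumulative_sq[-1]
    bounds = [
        2.0 * math.sqrt(max(0.0, v * (1.0 - v / total_sq))) if total_sq > 0 else 0.0
        for v in cumulative_sq
    ]
    return [covariate[i] for i in order], cumulative, bounds


# Toy example with three segments.
xs, cum_resid, bound = cure([3, 1, 2], [2, 0, 1], [1.5, 0.5, 0.5])
```

Plotting `cum_resid` against `xs` together with `bound` and its negative gives the CURE plot; the bound collapses to zero at the last observation by construction.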

CURE plots provide insight into how well a variable specification fits, through the behavior of the covariate-sorted cumulative residuals. Regions of the CURE plot showing steadily increasing trends represent segments of the data that are being consistently under-predicted by the model. Conversely, regions of steady decline depict segments that are being over-predicted, and regions with sudden sharp drops or rises show segments with inordinately large residuals. In setting the standard deviation bounds, Hauer (2015) makes the assumption that, as the CURE plot is a sum of many independent random variables, it would be approximately normally distributed through the central limit theorem, and that 95% of the probability distribution would be within two standard deviations of the mean. An unbiased SPF would rarely go beyond the bounds set through the CURE method, and any model changes, either through the addition of variables or through the use of alternate functional forms, that ensure the cumulative residual random walks do not breach the standard deviation bounds could be considered improvements to the model specification.

5.4.1 Untransformed variables

The continuous variables included in the baseline model were the natural log of AADT, the natural log of segment length, the average lane width, and the degree of horizontal curvature. CURE plots were set up for all these variables from the base model, but it should be noted that horizontal curvature would only be present in homogeneous segments containing a horizontal curve, and that even though average lane width could in reality have a continuous distribution, it is recorded in the form of standard values and can only be taken as pseudo-continuous. Because of this, the transformations to average lane width were largely linear, and would therefore only serve to ‘scale’ the cumulative residual plot. Also, the log transformations of AADT and segment length were performed to ensure non-negativity, something that was kept consistent throughout this dissertation, and these variables were therefore treated as untransformed.

Figure 5-2 CURE plot for untransformed AADT

Figure 5-3 CURE plot for untransformed segment length

Figure 5-4 CURE plot for untransformed lane width

Figure 5-5 CURE plot for untransformed degree of horizontal curvature

Figures 5-2, 5-3, 5-4, and 5-5 depict the CURE plots and standard deviation bounds for the untransformed covariates in the baseline negative binomial model. As can be seen, all four plots show regions that consistently breach the standard deviation bounds, with regions of sudden drops and rises. The plot for AADT shows consistently growing over-prediction for traffic volumes greater than the average traffic volume, while the segment length plot shows consistently growing under-prediction, with many segments breaching the standard deviation bounds. Overall, even if there were no basis for saying that the untransformed variables led to a bias-in-fit issue, all four CURE plots show evidence of the model fitting the observed crash data poorly. This indicates that improvements to the baseline negative binomial model would be required, either through the addition of other significant variables or through experimenting with alternate functional form transformations.

5.4.2 MFP algorithm derived variable functional form transformations

Having established that the baseline negative binomial model with untransformed continuous covariates did indeed require improvements, the next step was to produce CURE plots for the functional form transformations obtained through employing the MFP search algorithm.

As with the baseline specification, the predictions obtained through the multinomial fractional polynomial negative binomial model were used to generate the CURE plots for the transformations effected to the continuous variables. Figures 5-6 and 5-7 show the resulting plots and their standard deviation bounds for the fractional polynomial forms for AADT and segment length respectively.


Figure 5-6 CURE plot for polynomial transformation of AADT

Figure 5-7 CURE plot for polynomial transformation of segment length

From Figure 5-6, it can be seen that the functional form transformation of the natural log of AADT resulted in a much smoother cumulative residual plot. The resulting cumulative residuals have far fewer regions that are steadily increasing or decreasing, and for the most part, the function appears to have shorter random runs of over-prediction or under-prediction. The cumulative residual does exceed the standard deviation bounds at the tail end of the plot, but far fewer segments do so than for the untransformed natural log of AADT plot in Figure 5-2.

The most significantly transformed variable from the MFP search algorithm was the natural log of segment length. The CURE plot for the transformed ln(segment length) variable shows considerable improvement over the untransformed plot in Figure 5-3. While the untransformed variable’s CURE plot had shown significant areas outside the bounds of the plot, the transformed cumulative residuals were found to be completely within the standard deviation limits. Also, regions that showed consistent under-prediction over long runs in the untransformed plot show some degree of variation, indicating that some of the segments in those areas were being predicted better through the functional form transformation.

Overall, the plots obtained for the MFP transformed variables appear to portray an improvement in model predictions over the untransformed baseline negative binomial model. It is plausible that a more optimum set of variable functional form transformations might be found outside the range of the multinomial fractional polynomial search set defined in the Royston-Altman algorithm, but as it is, the CURE plots for the resulting functional forms, combined with the improvements in log-likelihood going from the baseline negative binomial model to the MFP negative binomial model, would suggest that the functional form transformations are viable for further analysis. Model prediction measures were also computed to corroborate the improvement in model estimations through the functional form transformations, and will be presented at a later stage of this chapter.

Having found evidence that the functional form transformations effected to the continuous covariates could be plausible indicators of improvements to parametric heterogeneity in the model specifications, the next step in the process was to combine the transformations with the heterogeneous negative binomial model, so as to target the heterogeneity in the error term of the NB. Going forward, the results of this combination will be discussed.

5.5 Heterogeneous negative binomial specification with MFP and non-MFP predictors

This step in the modeling process details the integration of the functional form transformations with a generalized negative binomial model, aimed at targeting the remaining unobserved effects in the count model predictions. The heterogeneous negative binomial model differs from the fixed parameter negative binomial through the parameterization of alpha. Essentially, the fixed parameter negative binomial model assumes that the over-dispersion parameter alpha is constant across all the segments in the dataset. Since the over-dispersion parameter captures the heterogeneity in the dataset, parameterizing alpha against roadway and roadside characteristics could provide some insight into the factors that affect the heterogeneity, and allow for better predictions through the model estimations by providing a segment-specific ‘scaling factor’ for the over-dispersion. Consistent with the iterative nature of the process, the functional form transformations were taken forward as the search set for this model, with both transformed and untransformed variables being tested for significance in predicting alpha. Variable statistical significance was determined at a 95% level of confidence.

Table 5-8 shows the output for the resulting heterogeneous negative binomial model. The model shows a significantly improved log-likelihood of -19,701.045, with a strongly significant chi-square value of 3575.16, and a pseudo R2 value of 8.32%. The interpretation of the main section of the model is consistent with the previous multinomial fractional polynomial negative binomial model. In predicting the over-dispersion term alpha, a fractional polynomial form of the natural log of AADT was found to be significant, consisting of a linear and a cubic form of the variable. Of the continuous variables, the natural log of AADT, the natural log of segment length, degree of curvature for horizontal curve, and percent length of segment for culvert were found to have negative coefficients, indicating that as their values increase, the over-dispersion of the crash data reduces according to the magnitude of their respective coefficients. On the other hand, even though its coefficient is very small, the cubic form of the natural log of AADT was found to have an increasing effect on over-dispersion. Taken in isolation, this difference in signs might appear discrepant, but it should be noted that the overall effect of the variable on the over-dispersion in the dataset would depend on the fractional polynomial taken as a whole.
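As a worked illustration of how the linear and cubic AADT terms combine, and assuming that alpha is parameterized as the exponential of a linear index (as in the fixed object example for Table 5-3), the AADT portion of the dispersion equation in Table 5-10 and its derivative are

\[
\ln \alpha = \dots - 2.440\,\mathrm{Lnadt} + 0.011\,\mathrm{Lnadt}^{3},
\qquad
\frac{\partial \ln \alpha}{\partial\,\mathrm{Lnadt}} = -2.440 + 0.033\,\mathrm{Lnadt}^{2}.
\]

Setting the derivative to zero gives

\[
\mathrm{Lnadt} = \sqrt{\frac{2.440}{0.033}} \approx 8.60,
\]

which corresponds to an AADT of roughly 5,400 vehicles per day. Under this reading, and holding the other variables fixed, the net effect of AADT is to reduce over-dispersion below that volume and to increase it above, which is how the opposing signs of the two terms can coexist.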


Table 5-8 Heterogeneous NB specification with MFP functional forms

Variable    Coeff.     S.E.     t-stat   Description
Constant     2.728    0.500      5.450
dgc1        -0.073    0.020     -3.660   Degree of curvature for horizontal curve
dmfxobj      0.292    0.063      4.630   Dummy for miscellaneous fixed object, 1 if exists, 0 otherwise
lclvert      1.313    0.350      3.750   Percent length of segment for culvert
nlanei      -6.121    0.444    -13.800   Number of lanes in increasing direction
rlnw         0.041    0.011      3.790   Average lane width (ft)
s10u2       -1.348    0.121    -11.100   If sd10=1, Lnlength / (3.01385 + Lnlength), else 0
s10u5        2.087    0.088     23.590   If sd10=1, Lnlength × √(4.60951 + Lnlength) / (3.01385 + Lnlength)
s10u6        0.338    0.038      8.870   If sd10=1, Lnlength^2 × √(4.60951 + Lnlength) / (3.01385 + Lnlength)
s1u3        -1.474    0.126    -11.720   If sd1=1, Lnlength / (3.01693 + Lnlength)
s1u6         0.341    0.039      8.690   If sd1=1, Lnlength^2 × √(4.60951 + Lnlength) / (3.01693 + Lnlength)
s1u9         2.199    0.091     24.260   If sd1=1, Lnlength × √(4.60951 + Lnlength) / (3.01693 + Lnlength)
s2nlan       1.942    0.688      2.820   Interaction term, if sd2=1, value = number of lanes in increasing direction
s2u3         2.577    0.181     14.220   If sd2=1, Lnlength / (3.04161 + Lnlength)
s2u4         0.845    0.060     14.160   If sd2=1, Lnlength^2 / (3.04161 + Lnlength)
s3u2         2.796    0.109     25.540   If sd3=1, Lnlength / (3.00523 + Lnlength)
s3u3         0.930    0.036     25.490   If sd3=1, Lnlength^2 / (3.00523 + Lnlength)


Table 5-9 Heterogeneous NB specification with MFP functional forms (continued)

Variable    Coeff.     S.E.     t-stat   Description
s4dmfx      -0.672    0.213     -3.150   If sd4=1 and miscellaneous fixed object dummy=1
s4u2         2.513    0.107     23.450   If sd4=1, Lnlength / (3.03934 + Lnlength)
s4u3         0.826    0.036     23.270   If sd4=1, Lnlength^2 / (3.03934 + Lnlength)
s5nlan       4.096    0.508      8.070   Interaction term, if sd5=1, value = number of lanes in increasing direction
s5u2        -1.330    0.119    -11.150   If sd5=1, Lnlength / (3.03676 + Lnlength)
s5u5         2.210    0.169     13.110   If sd5=1, Lnlength × √(4.60951 + Lnlength) / (3.03676 + Lnlength)
s5u6         0.379    0.051      7.380   If sd5=1, Lnlength^2 × √(4.60951 + Lnlength) / (3.03676 + Lnlength)
s6u2         2.703    0.105     25.810   If sd6=1, Lnlength / (3.04674 + Lnlength)
s6u3         0.888    0.035     25.650   If sd6=1, Lnlength^2 / (3.04674 + Lnlength)
s7t1        -0.156    0.018     -8.880   If sd7=1, Log(9.7006 × 10^-8 + dgc1/100)
s7t2        -0.344    0.048     -7.200   If sd7=1, Log(9.7006 × 10^-8 + dgc1/100) × √(9.70066 × 10^-6 + dgc1)
s7u1        -7.962    0.503    -15.820   If sd7=1, 1 / (3.01729 + Lnlength)
s7u3         0.874    0.055     15.790   If sd7=1, Lnlength^2 / (3.01729 + Lnlength)
s8u2         2.719    0.105     26.000   If sd8=1, Lnlength / (3.04562 + Lnlength)
s8u3         0.893    0.035     25.830   If sd8=1, Lnlength^2 / (3.04562 + Lnlength)
s9u2        -1.438    0.120    -12.010   If sd9=1, Lnlength / (3.0068 + Lnlength)
s9u5         2.125    0.089     23.940   If sd9=1, Lnlength × √(4.60951 + Lnlength) / (3.0068 + Lnlength)


Table 5-10 Heterogeneous NB specification with MFP functional forms (continued)

Variable | Description | Coeff. | S.E. | t-stat
s9u6 | If sd9=1, Lnlength² × √(4.60951 + Lnlength)/(3.0068 + Lnlength) | 0.329 | 0.038 | 8.760
sd2 | K-cut dummy variable, if data subset k=2 | -2.121 | 0.706 | -3.000
sd5 | K-cut dummy variable, if data subset k=5 | -4.095 | 0.537 | -7.630
srn0326 | State Route Interaction Dummy (If srn = 003 or srn = 026) | 0.548 | 0.090 | 6.080
t0full | √(9.70066 × 10⁻⁶ + dgc1) | 0.902 | 0.171 | 5.280
t2full | Log(9.7006 × 10⁻⁸ + dgc1/100) × √(9.70066 × 10⁻⁶ + dgc1) | 0.245 | 0.043 | 5.680
t3full | LNADT³ × Log(LNADT) | 0.005 | 0.001 | 16.800
t4full | LNADT⁶ × Log(LNADT) | 0.0001 | 0.001 | -10.860

Heterogeneity in dispersion parameter
Constant | | 14.839 | 1.110 | 13.360
Lnadt | Natural log of AADT | -2.440 | 0.202 | -12.100
Lnlength | Natural log of homogeneous segment length | -0.783 | 0.023 | -34.760
Dgc1 | Degree of horizontal curvature | -0.044 | 0.006 | -7.920
Lclvert | Percent length of segment with culvert | -0.839 | 0.344 | -2.440
Dwall | Dummy, roadside retaining wall | 0.577 | 0.170 | 3.400
Dtregrp | Dummy, roadside tree group | -0.262 | 0.061 | -4.330
Srn0326 | State Route Interaction Dummy (If srn = 003 or srn = 026, 1 else 0) | -0.421 | 0.100 | -4.220
Lnad3 | (Natural log of AADT)³ | 0.011 | 0.001 | 10.900

Number of obs = 62,598
Pseudo R2 = 0.0832
Constant only Log-likelihood = -22,668.192
Convergent Log likelihood = -19,701.045


Additionally, three dummy variables were found to be statistically significant in predicting the over-dispersion term alpha. Of these, the interaction dummy for State Routes 03 and 26 and the dummy representing the presence of a tree group along a roadway segment were found to have negative coefficients, indicating that their absence along a segment leads to an increase in the observed heterogeneity. On the other hand, the dummy variable signifying the presence of a wall along a roadway segment was found to have a positive coefficient, indicating that the presence of a wall increases the over-dispersion in the data.

The interpretive form of the heterogeneous negative binomial count model is as follows:

(The model equation is spelt out in terms of the coded variables, the definitions of which are provided in table 5-8)

Ln(totalacc) = 2.728 - 0.073×dgc1 + 0.292×Dmfxobj + 1.313×Lclvert - 6.121×Nlanei + 0.041×Rlnw - 1.348×s10u2 + 2.087×s10u5 + 0.338×s10u6 - 1.474×s1u3 + 0.341×s1u6 + 2.199×s1u9 + 1.942×s2nlan + 2.577×s2u3 + 0.845×s2u4 + 2.796×s3u2 + 0.93×s3u3 - 0.672×s4dmfx + 2.513×s4u2 + 0.826×s4u3 + 4.096×s5nlan - 1.33×s5u2 + 2.21×s5u5 + 0.379×s5u6 + 2.703×s6u2 + 0.888×s6u3 - 0.156×s7t1 - 0.344×s7t2 - 7.962×s7u1 + 0.874×s7u3 + 2.719×s8u2 + 0.893×s8u3 - 1.438×s9u2 + 2.125×s9u5 + 0.329×s9u6 - 2.121×sd2 - 4.095×sd5 + 0.548×srn0326 + 0.902×t0full + 0.245×t2full + 0.005×t3full + 0.0001×t4full

Ln(α) = 14.839 - 2.44×Lnadt - 0.783×Lnlength - 0.044×dgc1 - 0.839×Lclvert + 0.577×Dwall - 0.421×srn0326 - 0.262×Dtregrp + 0.011×lnad3
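As a rough illustration of how the Ln(α) equation produces a segment specific over-dispersion term, the sketch below evaluates it for a made-up segment. The coefficients are copied from the equation above. The helper name `segment_alpha`, the example AADT and length values, and the assumption that length and culvert share are entered in the same units as in the modeling dataset are all illustrative and are not taken from the dissertation.

```python
import math

# Coefficients copied from the Ln(alpha) equation above.
COEF = {
    "constant": 14.839,
    "lnadt": -2.440,
    "lnlength": -0.783,
    "dgc1": -0.044,
    "lclvert": -0.839,
    "dwall": 0.577,
    "srn0326": -0.421,
    "dtregrp": -0.262,
    "lnad3": 0.011,
}


def segment_alpha(aadt, length, dgc1=0.0, lclvert=0.0,
                  dwall=0, srn0326=0, dtregrp=0):
    """Return (ln_alpha, alpha) for one segment (hypothetical helper)."""
    lnadt = math.log(aadt)
    ln_alpha = (
        COEF["constant"]
        + COEF["lnadt"] * lnadt
        + COEF["lnlength"] * math.log(length)
        + COEF["dgc1"] * dgc1
        + COEF["lclvert"] * lclvert
        + COEF["dwall"] * dwall
        + COEF["srn0326"] * srn0326
        + COEF["dtregrp"] * dtregrp
        + COEF["lnad3"] * lnadt ** 3
    )
    return ln_alpha, math.exp(ln_alpha)


# Made-up segment: AADT of 5,000 and a length of 0.1 (units assumed).
base_ln, base_alpha = segment_alpha(aadt=5000, length=0.1)
wall_ln, wall_alpha = segment_alpha(aadt=5000, length=0.1, dwall=1)
tree_ln, tree_alpha = segment_alpha(aadt=5000, length=0.1, dtregrp=1)
print(f"alpha: base {base_alpha:.3f}, wall {wall_alpha:.3f}, tree group {tree_alpha:.3f}")
```

Holding everything else fixed, the wall dummy raises Ln(α) by its coefficient and the tree group dummy lowers it, which mirrors the sign interpretation given in the text.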

Here, Ln(α) takes up some of the residual heterogeneity from the negative binomial regression, by predicting some of the residual error. The effect of alpha on the predictions is dictated by the heterogeneous negative binomial formulation:

$$\lambda_{ij} = \exp(\beta X_{ij} + \varepsilon_{ij})$$

$$\alpha = \exp(\varepsilon) = \exp(\theta + F_p(X) + \varepsilon')$$

$$\Rightarrow \ln(\lambda_{ij}) = \beta X + \ln[\exp(\theta + F_p(X))] + \varepsilon'$$

Given the general understanding of the theory behind the model building, the next stage in the process was to test the efficacy of the proposed modeling process in terms of model fit and prediction error. The subsequent section will deal with an analysis of the model variable elasticities, followed by a discussion on model prediction measures at each step of the process.


5.6 Variable elasticities and model prediction measures

5.6.1 Elasticities

Following the model development process, variable elasticities were computed to better understand the relative effects of the explanatory variables on crash propensities within each specification. Elasticities represent the percentage change in the dependent variable for a 1% change in the respective independent variable. A mean elasticity approaching or exceeding 1 in magnitude would imply that the dependent variable is elastic to changes in the independent variable. Breaking the elasticity down to the observation level could also provide some insight into roadway segments that show various levels of elasticity to changes in the independent variables representing segment characteristics, enabling the targeting of specific segments for safety analysis. For variables such as the natural log of AADT and the natural log of homogeneous segment length, the model takes a log-log form, and the expected elasticity would be approximately equal to the estimated parameter itself.

The elasticities for these variables were computed, and the expected elasticities were corroborated with the calculated values. The functional forms obtained from the MFP search are combinations of various transformations, the elasticities of which can be computed using the derivative method shown in the modeling methodology section. In practice, elasticity measures for all models were evaluated using a 1% change method, wherein the independent variables were individually increased by 1%, and their estimated parameters were used in combination with the model intercept to compute the percent change in the crash counts from those predicted by the original independent variable values. Table 5-12 shows the result of these computations.
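The 1% change method described above can be sketched as follows. This is a minimal illustration, not the code used in the dissertation: the function names, the two-term log-log specification, and every coefficient value are made up. Each model term is passed as a (coefficient, transform) pair so that transformed covariates, such as the MFP forms, could be handled in the same way.

```python
import math


def predicted_crashes(segment, intercept, terms):
    """Expected crashes for a log-link count model.

    terms is a list of (coefficient, transform) pairs, where each transform
    maps the raw covariate values of a segment to one model term.
    """
    linear = intercept + sum(b * f(segment) for b, f in terms)
    return math.exp(linear)


def one_percent_elasticity(segment, name, intercept, terms):
    """Percent change in predicted crashes when one covariate rises by 1%."""
    base = predicted_crashes(segment, intercept, terms)
    bumped = dict(segment)
    bumped[name] = segment[name] * 1.01
    return 100.0 * (predicted_crashes(bumped, intercept, terms) - base) / base


# Hypothetical log-log specification; the coefficients are made up.
terms = [
    (0.80, lambda s: math.log(s["aadt"])),
    (0.70, lambda s: math.log(s["length"])),
]
segments = [
    {"aadt": 2500.0, "length": 0.2},
    {"aadt": 8000.0, "length": 0.5},
]
elasticities = [one_percent_elasticity(s, "aadt", -6.0, terms) for s in segments]
mean_elasticity = sum(elasticities) / len(elasticities)
print(f"mean AADT elasticity: {mean_elasticity:.4f}")
```

In this log-log example the computed elasticity is close to the AADT coefficient for every segment, which is the corroboration step mentioned in the text. For nonlinear transformed terms the value would generally differ from segment to segment.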

For the most part, the variables in all the models were found to be inelastic, with the exceptions of AADT in the random parameter negative binomial model and the number of lanes in an increasing direction in the MFP negative binomial model. The random parameter model shows that a 1% change in AADT would result in a mean expected increase of 1.3% in crashes, while the MFP negative binomial shows that a 1% change in the number of lanes in an increasing direction would increase crash expectancies by 2.37%. Table 5-12 also reports an AADT elasticity above 1 (1.2461) for the baseline heterogeneous negative binomial model.

The MFP derived models showed a change in elasticity signs for some variables, an observation that could be due to the nature of the functional form transformations, but further investigation would be required to pinpoint the reason. The functional form transformations also resulted in a reduction in variable elasticities, which could be due to the increased number of covariates of varying nonlinear functional forms in the MFP negative binomial and the heterogeneous negative binomial models as compared to the other linear functional form estimations.

5.6.2 Model prediction evaluations

The final stage of the model analysis was to compute model prediction evaluation measures as a means to understand how well the models fit the observed data. While these measures only provide some insight into the aggregate performance of the models in predicting the dependent variable, they do allow for the understanding of viable specification improvements. As mentioned previously, the model building exercise for this dissertation was performed on the 2010 subset of the homogeneous segment crash database, and the model prediction measures were evaluated for both the 2010 subset (to test the predictions on the modeling dataset) and the 2009 subset (as a form of out-of-sample evaluation). The variables that differed between the two years were the crash counts and observed AADT volumes, with the roadside features and roadway characteristic descriptors for both years being the same.

The measures of prediction error that were evaluated were the mean absolute error (MAE), mean absolute percentage error (MAPE), mean squared error (MSE), and root mean square error (RMSE). The MAE is the mean of the absolute value of the difference between the observed and predicted number of crashes at each segment. The MAPE uses the absolute value of the error term as a percentage of the observed crashes; segments with zero observed crashes cannot be included. The RMSE is the square root of the MSE, which, as the name suggests, is the average of the squared error at each segment.

Since the RMSE involves squaring the error term, segments with large differences between observed and predicted crashes will be weighted disproportionately as compared to the others, leading to a higher overall RMSE value. As with any error term, it is desirable to obtain as low a value as possible for each of these predictive measures, and any reduction in prediction error from model to model could be considered an improvement in model fit.
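The four measures described above can be computed as in the following sketch. The function name and the toy observed and predicted values are illustrative only. MAPE is returned as a proportion, which is an assumption based on the magnitude of the values reported in Table 5-13, and segments with zero observed crashes are dropped from the MAPE as the text describes.

```python
import math


def prediction_errors(observed, predicted):
    """Return MAE, MAPE, MSE and RMSE for paired observed/predicted counts."""
    errors = [o - p for o, p in zip(observed, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    # MAPE is undefined where the observed count is zero, so skip those.
    ratios = [abs(e) / o for o, e in zip(observed, errors) if o != 0]
    mape = sum(ratios) / len(ratios) if ratios else float("nan")
    return {"MAE": mae, "MAPE": mape, "MSE": mse, "RMSE": math.sqrt(mse)}


# Toy data: four segments, one of them with zero observed crashes.
observed = [0, 1, 2, 4]
predicted = [0.5, 1.0, 1.0, 2.0]
metrics = prediction_errors(observed, predicted)
print(metrics)
```

Because the MSE and RMSE square each error, the single segment with an error of 2 contributes most of the MSE in this toy example, which is the disproportionate weighting noted above.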

Table 5-13 shows the consolidated evaluations for both the 2010 dataset and the 2009 out-of-sample dataset. The MAE for both datasets shows the least error for the random parameter model, with the MFP heterogeneous negative binomial model being the next smallest for the 2010 dataset. On the other hand, the random parameter negative binomial model has the worst MSE and RMSE measures of all the models in the comparison. The heterogeneous negative binomial specifications show the least error for the remaining measures in both datasets: the baseline heterogeneous negative binomial has the lowest MAPE, and the MFP heterogeneous negative binomial has the lowest MSE and RMSE. It should be noted that the base variables in all five models are the same. The random parameter model assumes a distribution for the parameter estimates and estimates a mean and standard deviation for the variable coefficients, while the MFP NB and MFP heterogeneous negative binomial models target the functional forms and error term in the respective model estimations. Given the lower MSE and RMSE for both the MFP NB and the MFP heterogeneous negative binomial models, it would be plausible to suggest that the functional form treatments and over-dispersion parameterization improved the model predictions by accounting for some heterogeneity in the dataset.

5.6.3 AIC and BIC comparisons

During the model building process in this dissertation, improvements in log-likelihood values were taken as improvements in model specifications. While the MFP based models appear to show good prediction estimates, they do significantly change the interpretation of the baseline model variables. While choosing a model with too few parameters could lead to a selectivity bias issue and poor predictions, models with too many parameters could end up overfitting the modeling dataset. Two commonly used measures that weigh model fit against the number of parameters and number of observations, known as penalized-likelihood information criteria, are Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC).

“AIC is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, and a lower AIC means a model is considered to be closer to the truth. BIC is an estimate of a function of the posterior probability of a model being true under a certain Bayesian setup, with the BIC penalizing model complexity more heavily than the AIC, and a lower BIC indicating that a model could be considered more likely to be the true model.” (Dziak et al. 2012)
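For reference, the standard definitions are AIC = 2k - 2LL and BIC = k ln(n) - 2LL, where LL is the convergent log-likelihood, k the number of estimated parameters, and n the number of observations. The sketch below applies them to the MFP heterogeneous NB model. The log-likelihood and n come from the reported model results; k = 51 is an assumption obtained by counting the 42 terms in the Ln(totalacc) equation and the 9 terms in the Ln(α) equation.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2LL."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2LL."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


log_likelihood = -19701.045  # convergent log-likelihood, MFP heterogeneous NB
n_params = 51                # assumed: 42 mean terms + 9 dispersion terms
n_obs = 62598                # number of observations reported for the model

mfp_hnb_aic = aic(log_likelihood, n_params)
mfp_hnb_bic = bic(log_likelihood, n_params, n_obs)
print(f"AIC = {mfp_hnb_aic:.2f}, BIC = {mfp_hnb_bic:.2f}")
```

Under these assumptions the two values agree, to rounding, with the AIC and BIC reported for the MFP heterogeneous NB model in Table 5-11. Since ln(62,598) is roughly 11, each additional parameter costs about 11 BIC points against 2 AIC points, which is why parameter-heavy specifications fare worse on BIC.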

Table 5-11 shows a comparison of the AIC and BIC measures obtained from the estimation of the five models in this study’s flow. As expected, the BIC values for all models were found to be higher than the AICs, because the BIC penalizes model complexity more heavily. Also, the baseline NB specification was found to have the poorest AIC value, and a BIC poorer than all but the multinomial fractional polynomial NB. The baseline heterogeneous NB with fixed alpha shows a marginal improvement over the baseline NB. The multinomial fractional polynomial NB model shows a better AIC value than the baseline models but, owing possibly to the complexity of the functional form transformations and the larger number of estimated parameters, shows the worst BIC value of all five models. The random parameter NB accounts for heterogeneity in predictions by estimating parameters that vary across observations according to a set distribution, and overall shows AIC and BIC values roughly 900 points better than the baseline models. In spite of the nonlinear functional form transformations and a high number of estimated parameters, the final model of this study, the multinomial fractional polynomial heterogeneous NB specification, showed the best AIC and BIC values. This, in addition to the model prediction measures, would lead one to believe in the plausibility of the MFP heterogeneous NB accounting for more heterogeneity in predictions than the baseline and random parameter models.

Table 5-11 Comparison of AIC and BIC across models

Model | Alpha (S.E.) | Convergent log-likelihood | AIC | BIC
Baseline NB | 12.190 (0.304) | -20,573.57 | 41,173.14 | 41,290.72
Baseline heterogeneous NB | 12.875 (0.153) | -20,553.38 | 41,132.75 | 41,250.33
Random parameter NB | 11.988 (0.005) | -20,090.30 | 40,218.59 | 40,390.73
Multinomial fractional polynomial NB | 11.799 (0.025) | -20,410.32 | 40,924.64 | 41,394.95
MFP heterogeneous NB | 11.524 (0.099) | -19,701.05 | 39,504.09 | 39,965.36


Table 5-12 Comparison of continuous variable elasticities across all models

Elasticities (%)

Variable | Baseline NB | Baseline heterogeneous NB | Random parameter NB | Multinomial fractional polynomial NB | MFP heterogeneous NB
AADT | 0.8323 | 1.2461 | 1.3047 | 0.7023 | 0.6577
Homogeneous segment length | 0.7751 | 0.8501 | 0.7076 | 0.5955 | 0.6068
Number of lanes in an increasing direction | 0.6460 | 0.6486 | 0.7747 | 2.3776 | -0.6097
Average lane width (feet) | -0.3313 | -0.3302 | -0.6177 | 0.1068 | 0.4468
Degree of horizontal curvature | -0.0181 | -0.0173 | -0.0765 | 0.0400 | -0.4447
Length percent of culvert on segment | -0.0047 | -0.0020 | -0.0121 | -0.0019 | 0.0002
Vertical curve length between 300-450 feet | 0.0013 | 0.0114 | -0.0098 | -0.0004 | -

Table 5-13 Model prediction error summary

2010: model set evaluation. 2009: out-of-sample evaluation.

Model | 2010 MAE | 2010 MAPE | 2010 MSE | 2010 RMSE | 2009 MAE | 2009 MAPE | 2009 MSE | 2009 RMSE
Baseline NB | 0.31631 | 0.7737 | 1.47884 | 1.21607 | 0.32188 | 0.77675 | 1.54596 | 1.24336
Baseline heterogeneous NB | 0.31957 | 0.7694 | 1.56718 | 1.25187 | 0.32524 | 0.77470 | 1.63463 | 1.27853
Random parameter NB | 0.25013 | 0.8079 | 15.1124 | 3.88746 | 0.25561 | 0.81093 | 2.78405 | 1.66855
Multinomial fractional polynomial NB | 0.30923 | 0.7725 | 1.17315 | 1.08312 | 0.32824 | 0.80133 | 1.34262 | 1.15872
Multinomial fractional polynomial heterogeneous NB | 0.30777 | 0.7702 | 1.13134 | 1.06365 | 0.32398 | 0.79710 | 1.18771 | 1.08982


5.6.4 Model prediction measure insights

Table 5-13 also provides some interesting points of discussion pertaining to the predictive capabilities of the various models in this dissertation. The mean absolute error is a measure of the absolute deviation of the model predictions from the observed values, something that is useful for strategic costing for accident severity types, with the cost of investment being directly correlated to the deviation. From the table it can be seen that the random parameter NB shows the lowest values for MAE, suggesting that parameter heterogeneity has a stronger impact on the MAE than the functional form treatments.

The mean absolute percentage error is the mean absolute deviation from the observed values expressed as a percentage of the observed values. Locations with low observed counts would therefore be expected to show high percentage errors, providing insight into possible crash reduction strategies for locations with low crash counts. The baseline HNB model specification was found to have the lowest MAPE, suggesting that the parameterization of the over-dispersion heteroskedasticity might be leading to better predictions for segments with low observed crash counts.

The mean square error provides a risk function corresponding to the expected value of the quadratic loss, incorporating both estimator bias and variance. Similarly, the root mean square error represents the sample standard deviation of the differences between the observed and predicted values. Both these measures provide insight into how close predictions are to observed values, with segments of high residuals being penalized more heavily than those with low residual values. Both the MFP NB and MFP HNB were found to have lower MSE and RMSE values than the other models considered in this study, indicating that functional form strongly influences the variance and standard deviation of the model residual errors.


Chapter 6

Summary

The research detailed in this dissertation was aimed at using statistically derived functional form transformations coupled with the parameterization of the negative binomial coefficient of over-dispersion, to minimize the effect of heterogeneity on the prediction of crash frequencies, while providing some insight into the factors that contribute to the heterogeneity in the observed crash counts. The extant literature shows many different model types applied towards the various data related issues that can arise, with the zero-inflated, random effects and random parameter model specifications providing the best insight into data heterogeneity. With regard to the functional forms of the effect of the independent variables on crash counts, the extant literature shows tests that could be implemented towards analyzing the effectiveness of various transformations to variables, but little work exists relating to a statistical or theoretical basis for choosing a specific functional form for a specific variable. This dissertation hopes to fill these gaps in the extant literature through applying the Royston-Altman multinomial fractional polynomial search algorithm to obtain empirically optimum functional forms for the various continuous variables used to model crash propensities. Also, by incorporating these functional form transformations with a heterogeneous negative binomial model, this dissertation aims to obtain segment specific parameterizations of the over-dispersion term, along with an understanding of the parameters that possibly influence the over-dispersion in the first place.

The data used for this dissertation consisted of close to 5,443 centerline miles of 142 two-lane State Highways in Washington State, for a period of nine years from 2002 to 2010. The modeling dataset consisted of 62,598 homogeneous segments based on changes in roadway geometry, with the models being estimated on the 2010 subset, and model prediction error validations being performed on the 2009 subset of the data. The modeling procedure began with setting the baseline model specifications that would be used both to develop subsequent models and to gauge improvements in model fit and predictions. The first baseline model for this study was the fixed parameter negative binomial specification, which showed that the natural log of AADT, the natural log of segment length, the number of lanes in an increasing direction, average segment lane width, degree of horizontal curvature, vertical curve lengths between 300 and 450 feet, the percent length of a culvert alongside a segment, and the dummies showing the presence of fixed objects and cabinets on the roadside had coefficients that were significantly different from zero in the prediction of crash counts. The model also showed a statistically significant coefficient of dispersion, showing that the negative binomial model was favored over the Poisson. Overall, with a constant-only log-likelihood of -22,668.192, the fixed parameter negative binomial model was found to converge at a log-likelihood of -20,573.570.

Since the ultimate goal of this study was to develop a modeling procedure that could use functional form transformations to counter the effects of some heterogeneity in the data, it was decided that model specifications would be compared to random parameter estimations on the same dataset. The random parameter specification was built considering normal, log-normal, uniform, and triangular distributions as possible parameter density functions, using a simulation based maximum likelihood estimation with 200 Halton draws. Variables that were found to be random were the natural log of AADT, the natural log of homogeneous segment length, the number of lanes in an increasing direction, average lane width, and the length of a vertical curve between 300 and 450 feet, with all variables being found to follow a normal distribution.

Increasing AADT was found to have an increasing impact on crash frequency, but with magnitudes that vary across the segments, as dictated by the mean and standard deviation of the estimations. Similarly, the parameter for the natural log of segment length was also found to be normally distributed, with an increasing effect on crash frequencies. Conversely, the number of lanes in an increasing direction was found to have a reducing impact on crash counts, with a normally distributed parameter estimate set. Vertical curve lengths between 300 and 450 feet were found to have a mean and standard deviation that would result in both an increase and a decrease in crash propensities, depending on the segment. The model also showed an improved convergent log-likelihood of -20,090.30.
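To give a sense of what the Halton draws used in the simulation based estimation look like, the sketch below generates a Halton sequence and maps it to normally distributed parameter draws. It is illustrative only: the prime base, the inverse-CDF mapping, and the mean and standard deviation values are assumptions, and the estimation software used for the dissertation may construct its draws differently.

```python
from statistics import NormalDist


def halton(index, base):
    """Return element `index` (1-based) of the Halton sequence in `base`."""
    value, fraction = 0.0, 1.0
    while index > 0:
        fraction /= base
        value += fraction * (index % base)
        index //= base
    return value


def normal_halton_draws(n_draws, base, mean=0.0, std_dev=1.0):
    """Map Halton points in (0, 1) to normal draws via the inverse CDF."""
    dist = NormalDist(mean, std_dev)
    return [dist.inv_cdf(halton(i, base)) for i in range(1, n_draws + 1)]


# 200 draws for one random parameter; mean and std_dev are made-up values.
draws = normal_halton_draws(200, base=2, mean=0.8, std_dev=0.3)
print(f"first uniform points: {[halton(i, 2) for i in range(1, 5)]}")
print(f"mean of normal draws: {sum(draws) / len(draws):.4f}")
```

Halton points cover the unit interval more evenly than pseudo-random numbers, which is why a few hundred draws are commonly considered adequate for simulated likelihoods. With several random parameters, a different prime base would typically be used for each one.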

The next phase was to divide the 2010 homogeneous segment dataset into 10 k-cuts, to enable the teasing out of minute variations when doing the MFP based functional form search. The resulting specifications were recombined, and the functional forms that resulted were used to estimate a fixed parameter negative binomial model specification. The resulting model specification was found to have an improved value of alpha as compared to the baseline fixed parameter negative binomial model, with an improved convergent log-likelihood of -20,410.319. The functional form transformations obtained through the MFP search algorithm were found to be too complex to interpret directly, but since achieving the best model predictions and fit was the goal of the study, this was not seen as a detracting issue. CURE plots were used to visually gauge the effect of the functional form transformation on the cumulative residuals of the baseline model specification, and it was found that the MFP functional forms resulted in smoother CURE plots, with very small areas of consistent over-fitting or under-fitting.
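The cumulative residuals behind a CURE plot can be computed as in the sketch below: segments are sorted by the covariate of interest and the residuals (observed minus predicted) are accumulated. The ±2σ* limits follow the construction commonly attributed to Hauer, which is assumed here rather than taken from the dissertation. The function name and the toy data are illustrative.

```python
import math


def cure_curve(covariate, observed, predicted):
    """Return sorted covariate values, cumulative residuals and 2-sigma limits."""
    rows = sorted(zip(covariate, observed, predicted))
    xs, cumulative, cumulative_sq = [], [], []
    running, running_sq = 0.0, 0.0
    for x, obs, pred in rows:
        residual = obs - pred
        running += residual
        running_sq += residual ** 2
        xs.append(x)
        cumulative.append(running)
        cumulative_sq.append(running_sq)
    total_sq = running_sq
    limits = []
    for s in cumulative_sq:
        # Assumed limit: sigma*^2 = s * (1 - s / total), plotted at +/- 2 sigma*.
        variance = s * (1.0 - s / total_sq) if total_sq > 0 else 0.0
        limits.append(2.0 * math.sqrt(max(variance, 0.0)))
    return xs, cumulative, limits


# Toy data: three segments given out of covariate order.
xs, cumulative, limits = cure_curve([3, 1, 2], [2, 0, 1], [1.5, 0.5, 1.0])
print(xs, cumulative, limits)
```

A curve that drifts steadily upward or downward over a range of the covariate indicates consistent under-prediction or over-prediction in that range, while a curve that oscillates around zero within the limits suggests an adequate functional form.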

The functional form transformations were aimed at parameter heterogeneity that could occur from choosing the wrong functional form for a covariate in a regression analysis. The true functional form of the variable of interest is impossible to know, and the MFP limits the search set of transformations to within eight choices that would result in specifications that are empirically driven by the nature of the modeling dataset. Therefore, even after accounting for heterogeneity in functional forms, the effects of some heterogeneity could still remain in the model estimations. To understand the causative factors, and the segment level extent of this heterogeneity, a heterogeneous negative binomial model was developed where the MFP variables would form the search vector for the NB part of the specification, and the vector of variables to parameterize the over-dispersion term alpha could include both untransformed and transformed forms of the roadway characteristic and roadside feature variables.
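The eight-choice search set mentioned above corresponds to the conventional fractional polynomial power set {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, where power 0 denotes the natural log and a repeated power in a second-degree form multiplies the repeated term by the log. The sketch below shows how candidate terms are generated for a positive covariate. It is a minimal illustration with made-up function names, and it omits the shifting and scaling of covariates as well as the model-selection tests that a full MFP search performs.

```python
import math
from itertools import combinations_with_replacement

# Conventional fractional polynomial power set (eight choices).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x, power):
    """Single fractional polynomial term; power 0 is defined as ln(x)."""
    if x <= 0:
        raise ValueError("fractional polynomial terms need a positive covariate")
    return math.log(x) if power == 0 else x ** power


def fp2_terms(x, p1, p2):
    """Second-degree terms; a repeated power gives x^p and x^p * ln(x)."""
    first = fp_term(x, p1)
    second = first * math.log(x) if p1 == p2 else fp_term(x, p2)
    return first, second


# Every second-degree candidate the search would compare for one covariate.
fp2_candidates = list(combinations_with_replacement(FP_POWERS, 2))
print(f"{len(FP_POWERS)} first-degree and {len(fp2_candidates)} second-degree candidates")
print(fp2_terms(2.0, 3, 3))
```

Terms with a power multiplied by a log, such as the LNADT-based entries in the model tables, are of the repeated-power kind produced by `fp2_terms`.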

The resulting model specification showed a much improved convergent log-likelihood of -19,701.045, an improvement of almost 390 over the random parameter negative binomial model. This would show a definite improvement over the distribution assumed random parameter model, through the use of a fixed parameter model with MFP functional forms and segment specific over-dispersion terms. A fractional polynomial form of AADT involving a cubic term and a log term was found to significantly show the effect of AADT on the segment specific over-dispersion. The natural log of homogeneous segment length, the degree of curvature for horizontal curves, and percent length of culvert along a segment were all found to have negative coefficients, indicating that as their values increase, the over-dispersion of the crash data reduces by the magnitude of their respective coefficients. Dummy variables that were found to contribute significantly to the data over-dispersion were the dummy interaction of State Routes 03 and 26 and the dummy representing the presence of a tree group alongside a roadway segment, both of which showed negative coefficients, indicating that their absence along a segment leads to an increase in observed heterogeneity. Also, the dummy variable signifying the presence of a wall along a roadway segment was found to have a positive effect on over-dispersion, indicating that the presence of a wall would increase the over-dispersion in the observed crash propensities.

Following the model development process, the specifications were used to estimate model performance error measures, using the 2009 data subsample as a form of out-of-sample evaluation. The results of these showed a definite improvement going from the random parameter negative binomial model to the multinomial fractional polynomial specified heterogeneous negative binomial model. The random parameter specification showed a very large root mean square error as compared to the heterogeneous negative binomial, indicating that the latter was better at predicting crashes closer to the observed crashes.

That being said, there are some very obvious shortcomings of the algorithm proposed. While the MFP algorithm does not require any assumptions on variable/parameter distributions, it does perform an empirical search on the dataset for a limited set of functional forms. It was not found in the case of this dissertation, but it is possible that for other datasets, none of the powers of transformations in the search set would provide any significant improvement in fit. Also, due to the empirical nature of the functional form search algorithm, the functional forms could result in an over-fitting issue, leading to the non-transferability of the model to other datasets. The model specifications for this dissertation were tested against a different year’s observations and were found to be transferable in terms of model prediction errors, but it would be ideal to test the model specifications against data obtained from a different spatial region. Problems of over-fitting would be expected to occur in small datasets with low variability, which is not the case with the modeling dataset used in this dissertation.

While fixed parameter models provide parameter estimates averaged over the entire dataset, and random parameter models provide a distribution of estimates across segments for the variables that contribute to data heterogeneity, the MFP based heterogeneous negative binomial regression model proposed in this dissertation would produce fixed parameters in empirically derived functional forms, with segment specific over-dispersion effects. In essence, it would be akin to grouping segments based on similar functional forms, and scaling predictions by some measure of the predicted heterogeneity. Taking the concept forward could involve incorporating the functional form transformations in some form of latent class negative binomial schema, or performing a principal component analysis to model the homogeneous segments independently. Such methods would require some level of theoretical knowledge on the nature and number of classes that could exist among the roadway segments. Such applications are largely absent in the extant literature, but it could be a viable option to consider as many classes as the number of roadway geometry characteristics that were used to determine segment heterogeneity.

In conclusion, the gap in the existing work on modeling crash frequencies that this dissertation aimed to fill was the application of a statistical procedure to determine independent variable functional forms without any assumptions on the distribution of the variable’s effect on crash counts, and the combination of this approach with a model that could capture segment specific heterogeneity: both the parameter heterogeneity that arises from improper functional form usage, and the heterogeneity that could result from unobserved effects that fixed parameter models cannot account for. Towards this goal, the Royston-Altman MFP search algorithm provides stable variable functional forms, and in combination with the heterogeneous negative binomial regression model, provides a fit that is at least as good as a random parameter negative binomial count regression. While no one model stood out as having the lowest model prediction errors across all measures, the MFP HNB stands out cumulatively, and more so when considering the lower prediction residuals, as shown through the MSE and RMSE calculations.

Overall, the findings of this research would indicate that the employment of MFP searches to functional form finding, and coupling the resulting functional form specifications with a heterogeneous negative binomial approach, led to significant improvements in model fit and predictions, leading to the conclusion that the proposed methodology to deal with heterogeneity in large datasets is a plausible technique for SPF generation. The proposed algorithm for implementing this procedure could be a significant method towards modeling SPFs, while providing good insights into segment level parameterization of over-dispersion, thereby enabling effective safety decisions at specific regions of a roadway network.


Bibliography

1. Anastasopoulos, P.Ch., and Mannering, F.L. A note on modeling vehicle crash frequencies with random-parameters count models. Accident Analysis and Prevention, 41 (2009), pp. 153-159.

2. Anastasopoulos, P.Ch., and Mannering, F.L. An empirical assessment of fixed and random parameter models using crash-and non-crash-specific injury data. Accident Analysis and Prevention, 43 (2011), pp. 1140-1147.

3. Anastasopoulos, P.Ch., Mannering, F.L., Shankar, V.N., and Haddock, J.E. A study of factors affecting highway crash rates using the random-parameters tobit model. Accident Analysis and Prevention, 45 (2012), pp. 628-633.

4. Anastasopoulos, P.Ch., Tarko, A.P., and Mannering, F.L. Tobit analysis of vehicle crash rates on interstate highways. Accident Analysis and Prevention, 40 (2008), pp. 768-775.

5. Bhat, C. Simulation estimation of mixed discrete choice models using randomized and scrambled Halton sequences. Transportation Research Part B, 37 (1) (2003), pp. 837-855.

6. Caliendo, C., Guida, M., and Parisi, A. A crash-prediction model for multilane roads. Accident Analysis and Prevention, 39 (4) (2007), pp. 657-670.

7. Castro, M., Paleti, R., and Bhat, C.R. A latent variable representation of count data models to accommodate spatial and temporal dependence: application to predicting crash frequencies at intersections. Transportation Research Part B, 46 (1) (2012), pp. 253-272.

8. Castro, M., Paleti, R., and Bhat, C.R. A spatial generalized ordered response model to examine highway crash injury severity. Accident Analysis and Prevention, 52 (2013), pp. 188-203.

9. Daniels, S., Brijs, T., Nuyts, E., and Wets, G. Explaining variation in safety performance of roundabouts. Accident Analysis and Prevention, 42 (2) (2010), pp. 393-402.


10. Dziak, J.J., Coffman, D.L., Lanza, S.T., and Li, R. Sensitivity and specificity of information

criteria. The Pennsylvania State University, Technical Report Series #12-119 (2012).

11. El-Basyouny, K., and Sayed, T. Comparison of two negative binomial regression techniques

in developing crash prediction models. Transportation Research Record, 1950 (2006), pp. 9-

16.

12. El-Basyouny, K., and Sayed, T. Crash prediction models with random corridor parameters.

Crash Analysis and Prevention, 41 (5) (2009), pp. 1118-1123.

13. Hardin, J.W., and Hilbe, J.M. Generalized Linear Models and Extensions. Stata Press, 2012.

14. Hauer, E. Statistical road safety modeling. Transportation Research Record, 1897 (2004),

pp. 81-87.

15. Hauer, E. The art of regression modeling in road safety, Springer, 2015

16. Hausman, J.A., Hall, B.H., and Griliches, Z. Econometric models for count data with an

application to the patents-R&D relationship. Econometrica, 52 (4) (1984), pp. 909-938.

17. Hilbe, J.M. Negative Binomial Regression. Cambridge University Press, 2011.

18. Johansson, P. Speed limitation and motorway casualties: a count data regression

approach. Crash Analysis and Prevention, 28 (1) (1996), pp. 73-87.

19. Jovanis, P.P., and Chang, H.L. Modeling the relationships of crashes to miles traveled.

Transportation Research Record, 1068 (1986), pp. 42-51.

20. Kim, D., and Washington, S.P. The significance of endogeneity problems in crash models:

an examination of left-turn lanes in intersection crash models. Crash Analysis and

Prevention, 38 (6) (2006), pp. 1094-1100.

21. Kohavi, R. A study of cross validation and bootstrap for accuracy estimation and model

selection. International Joint Conference on Artificial Intelligence, 1995.

22. Kononov, J., Lyon, C., and Allery, B.K. Relationship of flow, speed, and density of urban

freeways to functional form of a safety performance function. Transportation Research

Record, 2236 (2011), pp. 11-19.

23. Lee, J., and Mannering, F.L. Impact of roadside features on the frequency and severity of

run-off roadway crashes: an empirical analysis. Crash Analysis and Prevention, 34 (2)

(2002), pp. 149-161.

24. Li, X., Lord, D., and Zhang, Y. Development of crash modification factors for rural frontage

road segments in Texas using results from generalized additive models. Journal of

Transportation Engineering, 137 (1) (2011), pp. 74-83.

25. Lord, D. Modeling motor vehicle crashes using Poisson-gamma models: examining the

effects of low sample mean values and small sample size on the estimation of the fixed

dispersion parameter. Crash Analysis and Prevention, 38 (4) (2006), pp. 751-766.

26. Lord, D., Guikema, S., and Geedipally, S.R. Application of the Conway-Maxwell-Poisson

generalized linear model for analyzing motor vehicle crashes. Crash Analysis and

Prevention, 40 (3) (2008), pp. 1123-1134.

27. Lord, D., and Mahlawat, M. Examining the application of aggregated and disaggregated

Poisson-gamma models subjected to low sample mean bias. Transportation Research

Record, 2136 (2009), pp. 1-10.

28. Lord, D., and Miranda-Moreno, L.F. Effects of low sample mean values and small sample

size on the estimation of the fixed dispersion parameter of Poisson-gamma models for

modeling motor vehicle crashes: a Bayesian perspective. Safety Science, 46 (5) (2008), pp.

751-770.

29. Lord, D., and Mannering, F.L. The statistical analysis of crash-frequency data: A review and

assessment of methodological alternatives. Transportation Research Part A, 44 (2010), pp.

291-305.

30. Lord, D., Washington, S.P., and Ivan, J.N. Poisson, Poisson-gamma and zero-inflated

regression models of motor vehicle crashes: balancing statistical fit and theory. Crash

Analysis and Prevention, 37 (1) (2005), pp. 35-46.

31. Lord, D., Washington, S.P., and Ivan, J.N. Further notes on the application of zero inflated

models in highway safety. Crash Analysis and Prevention, 39 (1) (2007), pp. 53-57.

32. Malyshkina, N., and Mannering, F.L. Empirical assessment of the impact of highway

design exceptions on the frequency and severity of vehicle crashes. Crash Analysis and

Prevention, 42 (1) (2010), pp. 131-139.

33. Malyshkina, N., and Mannering, F.L. Zero-state Markov switching count-data models: an

empirical assessment. Crash Analysis and Prevention, 42 (1) (2010), pp. 122-130.

34. Malyshkina, N., Mannering, F.L., and Tarko, A.P. Markov switching negative binomial

models: an application to vehicle crash frequencies. Crash Analysis and Prevention, 41 (2)

(2009), pp. 217-226.

35. Miaou, S.-P. The relationship between truck crashes and geometric design of road sections:

Poisson versus negative binomial regressions. Crash Analysis and Prevention, 26 (4) (1994),

pp. 471-482.

36. Miaou, S.-P., Bligh, R.P., and Lord, D. Developing median barrier installation guidelines: a

benefit/cost analysis using Texas data. Transportation Research Record, 1904 (2005), pp. 3-

19.

37. Miaou, S.-P., and Lum, H. Modeling vehicle crashes and highway geometric design

relationships. Crash Analysis and Prevention, Vol. 25, No. 6 (1993), pp. 689-709.

38. Miaou, S.-P., Song, J.J., and Mallick, B.K. Roadway traffic crash mapping: a space-time

modeling approach. Journal of Transportation and Statistics, 6 (1) (2003), pp. 3-19.

39. Milton, J., Shankar, V.N., and Mannering, F.L. Highway crash severities and the mixed logit

model: an exploratory empirical analysis. Crash Analysis and Prevention, 40 (1) (2008), pp.

260-266.

40. Mitra, S., and Washington, S. On the nature of over-dispersion in motor vehicle crash

prediction models. Crash Analysis and Prevention, 39 (2007), pp. 459-468.

41. Nylund, K.L., Asparouhov, T., and Muthen, B.O. Deciding on the number of classes in

latent class analysis and growth mixture modeling: A Monte-Carlo simulation study.

Structural Equation Modeling, 14 (4) (2007), pp. 535-569.

42. Oh, J., Washington, S.P., and Nam, D. Crash prediction model for railway-highway

interfaces. Crash Analysis and Prevention, 38 (2) (2006), pp. 346-356.

43. Park, B.-J., and Lord, D. Application of finite mixture models for vehicle crash data

analysis. Crash Analysis and Prevention, 41 (4) (2009), pp. 683-691.

44. Park, B.-J., Lord, D., and Hart, J.D. Bias properties of Bayesian statistics in finite mixture of

negative binomial regression models for crash data analysis. Crash Analysis and Prevention, 42 (2)

(2010), pp. 741-749.

45. Park, M., 2013. Longitudinal Methods for Prioritizing Highway Safety Investments. PhD

dissertation, Civil Engineering, The Pennsylvania State University.

46. Poch, M., and Mannering, F.L. Negative binomial analysis of intersection-crash frequencies.

Journal of Transportation Engineering, 122 (2) (1996), pp. 105-113.

47. Quddus, M.A. Time series count data models: an empirical application to traffic crashes.

Crash Analysis and Prevention, 40 (5) (2008), pp. 1732-1741.

48. Rodriguez, J.D., Perez, A., and Lozano, J.A. Sensitivity analysis of k-fold cross validation in

prediction error estimation. IEEE Transactions on Pattern Analysis and Machine

Intelligence, 32 (3) (2010), pp. 569-575.

49. Royston, P., and Altman, D.G. Regression using fractional polynomials of continuous

covariates: Parsimonious parametric modelling. Journal of the Royal Statistical Society,

Series C, Vol. 43, No. 3 (1994), pp. 429-467.

50. Royston, P., Ambler, G., and Sauerbrei, W. The use of fractional polynomials to model

continuous risk variables in epidemiology. International Journal of Epidemiology, 28

(1999), pp. 964-974.

51. Royston, P., and Sauerbrei, W. Stability of multivariable fractional polynomial models with

selection of variables and transformations: a bootstrap investigation. Statistics in Medicine,

22 (2003), pp. 639-659.

52. Royston, P., and Sauerbrei, W. Multivariable model-building: A pragmatic approach to

regression analysis based on fractional polynomials for modelling continuous variables.

John Wiley & Sons, 2008.

53. Sauerbrei, W., Meier-Hirmer, C., Benner, A., and Royston, P. Multivariable regression

model building by using fractional polynomials: description of SAS, STATA, and R

programs. Computational Statistics and Data Analysis, 50 (2006), pp. 3464-3485.

54. Sauerbrei, W., and Royston, P. Building multivariable prognostic and diagnostic models:

transformation of the predictors by using fractional polynomials. Journal of the Royal

Statistical Society, Series A (1999), Vol. 162, Part 1, pp. 71-94.

55. Sauerbrei, W., Royston, P., and Schumacher, M. Bootstrap methods for developing

predictive models [letter]. The American Statistician, 59 (2005), pp. 116-118.

56. Savolainen, P.T., Mannering, F.L., Lord, D., and Quddus, M.A. The statistical analysis of

highway crash-injury severities: A review and assessment of methodological alternatives.

Crash Analysis and Prevention, 43 (2011), pp. 1666-1676.

57. Sellers, K.F., and Shmueli, G. A flexible regression model for count data. Annals of Applied

Statistics, 4 (2) (2010), pp. 943-961.

58. Shankar, V.N., Albin, R.B., Milton, J.C., and Mannering, F.L. Evaluating median cross-over

likelihoods with clustered crash counts: an empirical inquiry using random effects negative

binomial model. Transportation Research Record, 1635 (1998), pp. 44-48.

59. Shankar, V.N., Mannering, F.L., and Barfield, W. Effect of roadway geometrics and

environmental factors on rural crash frequencies. Crash Analysis and Prevention, 27 (3)

(1995), pp. 371-389.

60. Shankar, V.N., Milton, J., and Mannering, F.L. Modeling crash frequency as zero-altered

probability processes: an empirical inquiry. Crash Analysis and Prevention, 29 (6) (1997),

pp. 829-837.

61. Shankar, V.N., Ulfarsson, G.F., Pendyala, R.M., and Nebergall, M.B. Modeling crashes

involving pedestrians and motorized traffic. Safety Science, 41 (7) (2003), pp. 627-640.

62. Shugan, S.M. Editorial: errors in the variables, unobserved heterogeneity, and other ways of

hiding statistical error. Marketing Science, 25 (3) (2006), pp. 203-216.

63. Silke, B., Kellett, J., Rooney, T., Bennett, K., and O’Riordan, D. An improved medical

admissions risk system using multivariable fractional polynomial logistic regression

modelling. QJ Med, 103 (2009), pp. 23-32.

64. Sittikariya, S., and Shankar, V.N. Modeling heterogeneity: Traffic crashes. VDM-Verlag,

Vol. 80 (2009).

65. Srinivasan, R., Carter, D., and Bauer, K. Safety performance function decision guide: SPF

calibration vs SPF development, Federal Highway Administration Office of Safety Reports,

FHWA-SA-14-004 (2013).

66. Tibshirani, R., James, G., Witten, D., and Hastie, T. An introduction to statistical learning

with applications in R. Springer-Verlag New York, 2013.

67. Train, K. Halton sequences for mixed logit. Working paper, University of California,

Department of Economics, Berkeley, 1999.

68. Ulfarsson, G., and Shankar, V.N. A crash count model based on multi-year cross-sectional

roadway data with serial correlation. Transportation Research Record, 1840 (2003), pp.

193-197.

69. Valecky, J. Modelling of individual insured crash risk for given motor-hull insurance

portfolio. International Days of Statistics and Economics, (2012).

70. Venkataraman, N.S., Shankar, V.N., Ulfarsson, G.F., and Deptuch, D. A heterogeneity-in-

means count model for evaluating the effects of interchange type on heterogeneous

influences of interstate geometrics on crash frequencies. Analytical Methods in Accident

Research, Vol. 2 (2014), pp. 12-20.

71. Venkataraman, N.S., Ulfarsson, G.F., Shankar, V.N., Oh, J., and Park, M. Model of

relationship between interstate crash occurrence and geometrics: Exploratory insights from

random parameter negative binomial approach. Transportation Research Record, No. 2236

(2011), pp. 41-48.

72. Venkataraman, N., Ulfarsson, G.F., and Shankar, V.N. Random parameter models of

interstate crash frequencies by severity, number of vehicles involved, collision and location

type. Crash Analysis and Prevention, 59 (2013), pp. 309-318.

73. Wang, X., and Abdel-Aty, M. Temporal and spatial analyses of rear-end crashes at

signalized intersections. Crash Analysis and Prevention, 38 (6) (2006), pp. 1137-1150.

74. Washington, S.P., Karlaftis, M.G., and Mannering, F.L. Statistical and econometric methods

for transportation data analysis. Chapman & Hall/CRC, 2010.

75. Xie, Y., and Zhang, Y. Crash frequency analysis with generalized additive models.

Transportation Research Record, 2061 (2008), pp. 39-45.

Appendix A

Description of homogeneous roadway segment dataset parameters

Parameter   Description
cyear       Crash year
srn         State Route Number
barm        Beginning Accumulated Route Mileage
earm        Ending Accumulated Route Mileage
length      length of segment (miles)
totalacc    total count of roadside, roadway, and other location crashes in segment
rdside      count of roadside crashes in segment
rdway       count of roadway crashes in segment
othloc      count of other location crashes in segment
pdo         count of reported Property Damage Only from crashes in segment
pinj        count of reported Possible Injury from crashes in segment
evi         count of reported Evident Injury from crashes in segment
sinj        count of reported Serious Injury from crashes in segment
fatal       count of reported Fatal from crashes in segment
unkown      count of reported Unknown Injury from crashes in segment
hiinj       count of crashes in segment reporting more than one injury
justinj     count of crashes in segment reporting one injury
loinj       count of crashes in segment reporting no injuries
veh1        count of crashes in segment involving 1 vehicle
veh2        count of crashes in segment involving 2 vehicles
veh3        count of crashes in segment involving 3 vehicles
veh4        count of crashes in segment involving 4 vehicles
veh5        count of crashes in segment involving 5 vehicles
othveh      count of crashes in segment involving more than 5 vehicles
rend        count of Rear End type crashes in segment
trend       count of Turning Rear End type crashes in segment
sdirtsw     count of Same Direction Turning Sideswipe type crashes in segment
sdirsw      count of Same Direction Sideswipe type crashes in segment
sdirt       count of Same Direction Turning type crashes in segment
sdiroth     count of Same Direction Others type crashes in segment
headon      count of Head On type crashes in segment
odirsw      count of Opposite Direction Sideswipe type crashes in segment
odirt       count of Opposite Direction Turning type crashes in segment
odiroth     count of Opposite Direction Others type crashes in segment
fobj        count of Fixed Object type crashes in segment
eang        count of Entering At Angle type crashes in segment
oturn       count of Overturned type crashes in segment
animal      count of Animal type crashes in segment
bicycle     count of Bicycle type crashes in segment
ped         count of Pedestrian type crashes in segment
onepark     count of One Parked vehicle type crashes in segment

onemove     count of One Parked, One Moving type crashes in segment
Entlvdr7    count of Entering/Leaving Driveway type crashes in segment
other       count of crashes classified as Other in segment
nostate     count of crashes classified as Not Stated in segment
truck       count of Truck type crashes in segment
lanein      Number of Lanes in Increasing Direction
lanede      Number of Lanes in Decreasing Direction
rwywidi     Roadway Width (ft) in Increasing Direction
rwywidd     Roadway Width (ft) in Decreasing Direction
hcptarm     Horizontal Curve PT Accumulated Route Mileage
hcpcarm     Horizontal Curve PC Accumulated Route Mileage
hcrdius     Radius of Curve (R)
hcmxel      Max Super Elevation (e)
hclnth      Length of Curve (L) in feet
hcca        Curve Left or Curve Right
vcbr        Beginning Vertical Curve Accumulated Route Mileage
vcva        Vertical Point of Intersection Accumulated Route Mileage
vcearm      Ending Vertical Curve Accumulated Route Mileage
vclnth      Length of Curve (ft)
vcpgah      Grade (%) ahead of Curve
vcpgbck     Grade (%) back of Curve
swidthl     Shoulder Width (ft) of outer portion of Decreasing Direction
swidthlc    Shoulder Width (ft) of median side of Decreasing Direction
swidthrc    Shoulder Width (ft) of median side of Increasing Direction
swidthr     Shoulder Width (ft) of outer portion of Increasing Direction
nlanei      number of lanes in increasing direction
nlaned      number of lanes in the decreasing direction
hcang1      horizontal curve angle
a1          difference in gradients for vertical curve
k1          K-value for vertical curve
dgc1        degree of curvature for horizontal curve
rlnw        average lane width (ft)
swr2367     dummy right shoulder width 2-3 ft increasing | 6-7 ft decreasing
swr4501     dummy right shoulder width 4-5 ft increasing | <1 ft decreasing
swr4523     dummy right shoulder width 4-5 ft increasing | 2-3 ft decreasing
swr8901     dummy right shoulder width 8-9 ft increasing | <1 ft decreasing
swr8910     dummy right shoulder width 8-9 ft increasing | >10 ft decreasing
swr1001     dummy right shoulder width >10 ft increasing | <1 ft decreasing
swr1023     dummy right shoulder width >10 ft increasing | 2-3 ft decreasing
swr1045     dummy right shoulder width >10 ft increasing | 4-5 ft decreasing
swc19       dummy center shoulder width 1-9 ft
aadt        Average Annual Daily Traffic
dbrdral     dummy for bridge rail
lbrdral     percent length of segment for bridge rail
dbrgstr     dummy for bridge structure
dcabnet     dummy for cabinet
dcblbar     dummy for cable barrier
lcblbar     percent length of segment for cable barrier
dconbar     dummy for concrete barrier

lconbar     percent length of segment for concrete barrier
dclvert     dummy for culvert
lclvert     percent length of segment for culvert
dclvend     dummy for culvert end
dcurb       dummy for curb
lcurb       percent length of segment for curb
dditch      dummy for ditch
lditch      percent length of segment for ditch
ddwngy      dummy for down guy
ldwngy      percent length of segment for down guy
ddwgync     dummy for down guy anchor
ddrglet     dummy for drainage inlet
ddrywel     dummy for drywell
dfence      dummy for fence
lfence      percent length of segment for fence
dglscrn     dummy for glare screen
lglscrn     percent length of segment for glare screen
dgrdral     dummy for guardrail
lgrdral     percent length of segment for guardrail
dgywire     dummy for guywire
lgywire     percent length of segment for guywire
dhydrnt     dummy for hydrant
dimpact     dummy for impact attenuator
dintrpt     dummy for intersection point
dmalbox     dummy for mailbox
dmfxobj     dummy for miscellaneous fixed object
dpdstal     dummy for pedestal
dpipend     dummy for pipe end
drdrect     dummy for directional land form
lrdrect     percent length of segment for directional land form
drgfall     dummy for regulatory outfall
drdaprc     dummy for road approach
drdslp      dummy for roadside slope
lrdslp      percent length of segment for roadside slope
drchcrp     dummy for rock outcropping
lrchcrp     percent length of segment for rock outcropping
dspbar      dummy for special use barrier
lspbar      percent length of segment for special use barrier
dstwpnd     dummy for storm water pond
dstwvlt     dummy for storm water vault
dsuprt      dummy for support
dtree       dummy for tree
dtregrp     dummy for tree group
ltregrp     percent length of segment for tree group
dwall       dummy for wall
lwall       percent length of segment for wall
dwhzrd      dummy for water hazard
lwthzrd     percent length of segment for water hazard
lnadt       Ln(AADT)

lnlength    Ln(Segment length)
srn3        State Route Dummy (If srn = 3, 1 else 0)
srn7        State Route Dummy (If srn = 7, 1 else 0)
srn21       State Route Dummy (If srn = 21, 1 else 0)
srn23       State Route Dummy (If srn = 23, 1 else 0)
srn25       State Route Dummy (If srn = 25, 1 else 0)
srn26       State Route Dummy (If srn = 26, 1 else 0)
srn28       State Route Dummy (If srn = 28, 1 else 0)
srn31       State Route Dummy (If srn = 31, 1 else 0)
srn97       State Route Dummy (If srn = 97, 1 else 0)
srn104      State Route Dummy (If srn = 104, 1 else 0)
srn123      State Route Dummy (If srn = 123, 1 else 0)
srn127      State Route Dummy (If srn = 127, 1 else 0)
srn129      State Route Dummy (If srn = 129, 1 else 0)
srn153      State Route Dummy (If srn = 153, 1 else 0)
srn155      State Route Dummy (If srn = 155, 1 else 0)
srn160      State Route Dummy (If srn = 160, 1 else 0)
srn165      State Route Dummy (If srn = 165, 1 else 0)
srn169      State Route Dummy (If srn = 169, 1 else 0)
srn172      State Route Dummy (If srn = 172, 1 else 0)
srn173      State Route Dummy (If srn = 173, 1 else 0)
srn206      State Route Dummy (If srn = 206, 1 else 0)
srn215      State Route Dummy (If srn = 215, 1 else 0)
srn223      State Route Dummy (If srn = 223, 1 else 0)
srn240      State Route Dummy (If srn = 240, 1 else 0)
srn241      State Route Dummy (If srn = 241, 1 else 0)
srn260      State Route Dummy (If srn = 260, 1 else 0)
srn261      State Route Dummy (If srn = 261, 1 else 0)
srn270      State Route Dummy (If srn = 270, 1 else 0)
srn274      State Route Dummy (If srn = 274, 1 else 0)
srn290      State Route Dummy (If srn = 290, 1 else 0)
srn302      State Route Dummy (If srn = 302, 1 else 0)
srn305      State Route Dummy (If srn = 305, 1 else 0)
srn307      State Route Dummy (If srn = 307, 1 else 0)
srn410      State Route Dummy (If srn = 410, 1 else 0)
srn432      State Route Dummy (If srn = 432, 1 else 0)
srn500      State Route Dummy (If srn = 500, 1 else 0)
srn502      State Route Dummy (If srn = 502, 1 else 0)
srn503      State Route Dummy (If srn = 503, 1 else 0)
srn504      State Route Dummy (If srn = 504, 1 else 0)
srn505      State Route Dummy (If srn = 505, 1 else 0)
srn507      State Route Dummy (If srn = 507, 1 else 0)
srn510      State Route Dummy (If srn = 510, 1 else 0)
srn532      State Route Dummy (If srn = 532, 1 else 0)
srn542      State Route Dummy (If srn = 542, 1 else 0)
srn544      State Route Dummy (If srn = 544, 1 else 0)
srn548      State Route Dummy (If srn = 548, 1 else 0)
srn702      State Route Dummy (If srn = 702, 1 else 0)
srn821      State Route Dummy (If srn = 821, 1 else 0)

srn900      State Route Dummy (If srn = 900, 1 else 0)
srn902      State Route Dummy (If srn = 902, 1 else 0)
srn906      State Route Dummy (If srn = 906, 1 else 0)
srn970      State Route Dummy (If srn = 970, 1 else 0)
srn50307    State Route Interaction Dummy (If srn = 503 or srn = 007, 1 else 0)
srn31165    State Route Interaction Dummy (If srn = 031 or srn = 165, 1 else 0)
srn0326     State Route Interaction Dummy (If srn = 003 or srn = 026, 1 else 0)
srn24695    State Route Interaction Dummy (If srn = 246 or srn = 095, 1 else 0)
vcurvel     length of vertical curve
vcl300      length of vertical curve < 300, 0 otherwise
vcl4530     length of vertical curve between 300 and 450, 0 otherwise
vcl7545     length of vertical curve between 450 and 750, 0 otherwise
vcl35075    length of vertical curve between 750 and 1350, 0 otherwise
vcl351      length of vertical curve >1350, 0 otherwise
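
As an illustration of how a few of the derived parameters listed above relate to the raw segment attributes, the following Python sketch computes lnadt, lnlength, the piecewise vertical-curve-length variables, and the srn0326 interaction dummy for one segment. The function name, the bin-edge dictionary, and the treatment of boundary values (intervals closed on the left and open on the right) are assumptions made for this example; the table does not state how values exactly equal to 300, 450, 750, or 1350 are assigned.

```python
import math

# Assumed bin edges for the piecewise vertical-curve-length variables.
# Boundary handling (lo <= x < hi) is an assumption, not stated in the table.
VCL_BINS = {
    "vcl300": (0.0, 300.0),
    "vcl4530": (300.0, 450.0),
    "vcl7545": (450.0, 750.0),
    "vcl35075": (750.0, 1350.0),
    "vcl351": (1350.0, math.inf),
}


def derive_covariates(aadt, length_miles, vcurvel, srn):
    """Return selected derived parameters for one roadway segment."""
    row = {
        "lnadt": math.log(aadt),             # Ln(AADT)
        "lnlength": math.log(length_miles),  # Ln(Segment length)
    }
    # Each vcl* variable carries the curve length if it falls in its bin, else 0.
    for name, (lo, hi) in VCL_BINS.items():
        row[name] = vcurvel if lo <= vcurvel < hi else 0.0
    # Interaction dummy: 1 if the segment lies on State Route 3 or 26.
    row["srn0326"] = 1 if srn in (3, 26) else 0
    return row


example = derive_covariates(aadt=2700, length_miles=0.05, vcurvel=400.0, srn=26)
```

The input values in `example` are hypothetical and are not taken from the dataset.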

Appendix B

MFP search results for 10 subsets

First stage MFP NB

Parameter   Transformation from 1st stage MFP
Ilnad1      $\mathit{LNADT}^{3} - 494.149308$
Ilnad2      $\mathit{LNADT}^{3} \times \mathrm{Ln}(\mathit{LNADT}) - 1021.709322$
Ilnle1      $\sqrt{\mathit{Lnlength} + 4.6095086336} - 1.257991$
Ilnle2      $(\mathit{Lnlength} + 4.6095086336)^{2} - 2.5044374$
Inlan1      nlanei - 1.1059299
Irlnw1      rlnw - 11.84303545
Idgc11      $\mathrm{Ln}((\mathit{dgc1} + 9.700655937 \times 10^{-6})/100) + 3.859161$
Idgc12      $\sqrt{(\mathit{dgc1} + 9.700655937 \times 10^{-6})/100} - 0.1452091$
Ivcl41      vcl4530 - 34.51615128
Ilclv1      lclvert - 0.0075738841
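
The first-stage transformations can be evaluated directly from the raw covariates. The Python sketch below is a minimal illustration under stated assumptions: the function name and dictionary layout are hypothetical, `lnadt` and `lnlength` are taken to be the natural logarithms defined in Appendix A, and the constants are copied from the table of first-stage transformations.

```python
import math


def first_stage_mfp_terms(lnadt, lnlength, nlanei, rlnw, dgc1, vcl4530, lclvert):
    """Evaluate the ten first-stage MFP terms for one segment (illustrative)."""
    # Shifted and scaled degree of curvature shared by Idgc11 and Idgc12.
    scaled_dgc = (dgc1 + 9.700655937e-6) / 100.0
    return {
        "Ilnad1": lnadt ** 3 - 494.149308,
        "Ilnad2": lnadt ** 3 * math.log(lnadt) - 1021.709322,
        "Ilnle1": math.sqrt(lnlength + 4.6095086336) - 1.257991,
        "Ilnle2": (lnlength + 4.6095086336) ** 2 - 2.5044374,
        "Inlan1": nlanei - 1.1059299,
        "Irlnw1": rlnw - 11.84303545,
        "Idgc11": math.log(scaled_dgc) + 3.859161,
        "Idgc12": math.sqrt(scaled_dgc) - 0.1452091,
        "Ivcl41": vcl4530 - 34.51615128,
        "Ilclv1": lclvert - 0.0075738841,
    }


# Hypothetical segment, not drawn from the dataset.
terms = first_stage_mfp_terms(
    lnadt=7.9, lnlength=-3.0, nlanei=1, rlnw=12.0,
    dgc1=2.0, vcl4530=0.0, lclvert=0.0,
)
```

Each term subtracts a centering constant, so a segment whose covariates are close to the values implied by those constants yields terms close to zero. The square root in Ilnle1 requires lnlength greater than -4.6095086336.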

10-K cut MFP NB, Cut # 1

Variable    Description                                                            Coeff.    S.E.     t-stat
Constant                                                                           -2.614    0.111    -23.630
ilnadt10K   ilnad1 * ilnad2 / 10000                                                -0.019    0.005     -3.680
Inlne       (ilnle1 * ilnle2) / (lnlength + 3.016931482)                            0.566    0.043     13.080
inlan1      nlanei - 1.1059299                                                     -4.152    0.633     -6.560
idgc112     Idgc1__1 * (idgc12 * idgc11)                                            0.023    0.010      2.220
srn0326     State Route Interaction Dummy (If srn = 003 or srn = 026, 1 else 0)     0.865    0.314      2.750
ilclv1      lclvert - 0.0075738841                                                  2.622    0.742      3.530
ilnad__1    lnadt - 7.899747253                                                     1.023    0.083     12.370
Dispersion parameter for count data model
Alpha                                                                              10.704    0.855

Number of obs = 6,333    Chi squared = 471.7    Prob>chi2 = 0.000
Convergent Log likelihood = -2021.5878    Pseudo R2 = 0.1044
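
The summary statistics reported under each model appear to be linked in the usual way. The Python sketch below assumes, without confirmation from the text, that "Chi squared" is a likelihood-ratio statistic against a constant-only model and that "Pseudo R2" is McFadden's measure, 1 - LL/LL0. Under those assumptions the constant-only log likelihood can be recovered from the two reported numbers, and the implied pseudo R2 agrees with the reported values for Cut # 1 and Cut # 2 to about three decimal places. The function names are hypothetical.

```python
def constant_only_loglik(loglik, lr_chi2):
    """Recover LL0 from LL and the LR statistic, assuming chi2 = 2 * (LL - LL0)."""
    return loglik - lr_chi2 / 2.0


def mcfadden_pseudo_r2(loglik, lr_chi2):
    """McFadden pseudo R2 = 1 - LL / LL0 (assumed definition)."""
    return 1.0 - loglik / constant_only_loglik(loglik, lr_chi2)


# Reported values for Cut # 1 and Cut # 2.
r2_cut1 = mcfadden_pseudo_r2(-2021.5878, 471.7)
r2_cut2 = mcfadden_pseudo_r2(-1923.4454, 430.36)
```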

10-K cut MFP NB, Cut # 2

Variable    Description                                           Coeff.    S.E.     t-stat
Constant                                                          -2.920    0.156    -18.700
ilnadt10K   ilnad1 * ilnad2 / 10000                               -0.012    0.005     -2.250
Inlne       (ilnle1 * ilnle2) / (lnlength + 3.041614263)           0.556    0.046     12.200
inlan1      nlanei - 1.1059299                                    -4.113    0.588     -7.000
idgc113     (idgc12 * idgc11)                                      1.621    0.381      4.260
ivcl41      vcl4530 - 34.51615128                                 -0.002    0.001     -2.610
ilclv1      lclvert - 0.0075738841                                 2.204    0.720      3.060
Ilnad__1    lnadt - 7.893131536                                    1.010    0.085     11.900
Idgc1__2    $\sqrt{(\mathit{dgc1} + 0.0000218)/100} - 6.7473$     -0.001    0.000     -4.210
Dispersion parameter for count data model
Alpha                                                             12.048    0.996

Number of obs = 6,299    Chi squared = 430.36    Prob>chi2 = 0.000
Convergent Log likelihood = -1923.4454    Pseudo R2 = 0.1006

10-K cut MFP NB, Cut # 3

Variable    Description                                                            Coeff.    S.E.     t-stat
Constant                                                                           -2.305    0.091    -25.330
ilnadt10K   ilnad1 * ilnad2 / 10000                                                -0.029    0.004     -6.690
Inlne       (ilnle1 * ilnle2) / (lnlength + 3.005234)                               0.641    0.048     13.220
srn0326     State Route Interaction Dummy (If srn = 003 or srn = 026, 1 else 0)     1.102    0.350      3.150
Dmfxobj     dummy for miscellaneous fixed object                                    0.662    0.218      3.040
ilnad__1    lnadt - 7.92576245                                                      1.019    0.089     11.470
idgc1__1    dgc1 - 2.22124971                                                       0.032    0.013      2.500
Dispersion parameter for count data model
Alpha                                                                              14.332    1.125

Number of obs = 6,316    Chi squared = 381.35    Prob>chi2 = 0.000
Convergent Log likelihood = -2052.3141    Pseudo R2 = 0.0850

10-K cut MFP NB, Cut # 4

Variable    Description                                     Coeff.    S.E.     t-stat
Constant                                                    -2.664    0.101    -26.270
inlne       (ilnle1 * ilnle2) / (lnlength + 3.039337)        0.505    0.044     11.490
inlan1      nlanei - 1.1059299                              -5.721    0.743     -7.700
Ilnad__1    lnadt - 7.92972                                  0.907    0.065     13.850
Dispersion parameter for count data model
Alpha                                                       13.293    1.015

Number of obs = 6,236    Chi squared = 402.03    Prob>chi2 = 0.000
Convergent Log likelihood = -2118.0267    Pseudo R2 = 0.0867
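
To show how a fitted model of this kind might be applied, the Python sketch below evaluates the Cut # 4 coefficients for a single segment. It rests on several assumptions that the tables do not state explicitly: a log link, so that the expected crash count is the exponential of the linear predictor; an NB2 variance of the form mu + Alpha * mu^2; and ilnle1 and ilnle2 defined as in the first-stage MFP table. The function names and input values are hypothetical.

```python
import math

# Coefficients and dispersion parameter as reported for Cut # 4.
CUT4_COEF = {"const": -2.664, "inlne": 0.505, "inlan1": -5.721, "ilnad__1": 0.907}
CUT4_ALPHA = 13.293


def cut4_terms(lnadt, lnlength, nlanei):
    """Build the Cut # 4 regressors (assumes first-stage ilnle1 and ilnle2)."""
    ilnle1 = math.sqrt(lnlength + 4.6095086336) - 1.257991
    ilnle2 = (lnlength + 4.6095086336) ** 2 - 2.5044374
    return {
        # Undefined at lnlength = -3.039337, where the denominator is zero.
        "inlne": (ilnle1 * ilnle2) / (lnlength + 3.039337),
        "inlan1": nlanei - 1.1059299,
        "ilnad__1": lnadt - 7.92972,
    }


def predict_cut4(lnadt, lnlength, nlanei):
    """Return (expected count, NB2 variance) under the assumed log link."""
    x = cut4_terms(lnadt, lnlength, nlanei)
    eta = CUT4_COEF["const"] + sum(CUT4_COEF[name] * value for name, value in x.items())
    mu = math.exp(eta)
    return mu, mu + CUT4_ALPHA * mu ** 2


# Hypothetical segment, not drawn from the dataset.
mu, variance = predict_cut4(lnadt=8.0, lnlength=-2.5, nlanei=1)
```

Because the coefficient on Ilnad__1 is positive, the predicted mean rises with lnadt when the other inputs are held fixed.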

10-K cut MFP NB, Cut # 5

Variable    Description                                                                   Coeff.      S.E.       t-stat
Constant                                                                                  4.835152    2.802623    1.73
ilnad1      $\mathit{LNADT}^{3} - 494.149308$                                             10.578      3.933       2.690
ilnad2      $\mathit{LNADT}^{3} \times \mathrm{Ln}(\mathit{LNADT}) - 1021.709322$         -2.715      1.004      -2.700
inlne       (ilnle1 * ilnle2) / (lnlength + 3.03676)                                      0.556       0.046      12.070
inlan1      nlanei - 1.1059299                                                            -1.611      0.316      -5.090
irlnw1      rlnw - 11.84303545                                                            0.110       0.032       3.470
idgc11      $\mathrm{Ln}((\mathit{dgc1} + 9.700655937 \times 10^{-6})/100) + 3.859161$    -0.072      0.023      -3.050
idgc12      $\sqrt{(\mathit{dgc1} + 9.700655937 \times 10^{-6})/100} - 0.1452091$         4.369       1.168       3.740
srn0326     State Route Interaction Dummy (If srn = 003 or srn = 026, 1 else 0)           0.894       0.339       2.640
Ilnad__1    lnadt - 7.89596                                                               249.566     95.500      2.610
Ilnad__2    lnadt^2 - 62.3462                                                             -63.864     24.088     -2.650
Dispersion parameter for count data model
Alpha                                                                                     12.797      1.016

Number of obs = 6,150    Chi squared = 375.44    Prob>chi2 = 0.000
Convergent Log likelihood = -2028.0334    Pseudo R2 = 0.0847

10-K cut MFP NB, Cut # 6

Variable    Description                                                              Coeff.    S.E.     t-stat
Constant                                                                             -3.245    0.374     -8.670
ilnad1      $\mathit{LNADT}^{3} - 494.149308$                                         0.051    0.012      4.370
ilnad2      $\mathit{LNADT}^{3} \times \mathrm{Ln}(\mathit{LNADT}) - 1021.709322$    -0.019    0.005     -4.010
inlne       (ilnle1 * ilnle2) / (lnlength + 3.046736)                                 0.529    0.044     11.950
inlan1      nlanei - 1.1059299                                                       -8.251    3.277     -2.520
idgc113     (idgc12 * idgc11)                                                         0.210    0.086      2.450
srn0326     State Route Interaction Dummy (If srn = 003 or srn = 026, 1 else 0)       0.748    0.327      2.290
ivcl41      vcl4530 - 34.51615128                                                    -0.001    0.001     -2.160
dcabnet     dummy for cabinet                                                         1.015    0.492      2.060
Idgc1__1    dgc1 - 2.0639                                                             0.030    0.013      2.270
Dispersion parameter for count data model
Alpha                                                                                10.704    0.855

Number of obs = 6,245    Chi squared = 501.61    Prob>chi2 = 0.000
Convergent Log likelihood = -2103.4755    Pseudo R2 = 0.1065

10-K cut MFP NB, Cut # 7

Variable    Description                                                                   Coeff.     S.E.     t-stat
Constant                                                                                  -1.852     0.184    -10.060
ilnad1      $\mathit{LNADT}^{3} - 494.149308$                                              0.507     0.104      4.890
ilnad2      $\mathit{LNADT}^{3} \times \mathrm{Ln}(\mathit{LNADT}) - 1021.709322$         -0.179     0.036     -5.030
inlne       (ilnle1 * ilnle2) / (lnlength + 3.01729)                                       0.514     0.047     11.020
idgc11      $\mathrm{Ln}((\mathit{dgc1} + 9.700655937 \times 10^{-6})/100) + 3.859161$    -0.066     0.025     -2.610
idgc12      $\sqrt{(\mathit{dgc1} + 9.700655937 \times 10^{-6})/100} - 0.1452091$          3.881     1.315      2.950
rlnvcl4     (vcl4530 - 34.515128) / (rlnw - 11.84303545)                                   0.0001    0.000     -2.110
dmfxobj     dummy for miscellaneous fixed object                                           0.550     0.233      2.360
Ilnad__1    lnadt - 7.86955                                                              -13.215     3.372     -3.920
Dispersion parameter for count data model
Alpha                                                                                     13.965     1.108

Number of obs = 6,217    Chi squared = 352.91    Prob>chi2 = 0.000
Convergent Log likelihood = -2003.4646    Pseudo R2 = 0.0809

10-K cut MFP NB, Cut # 8

Variable    Description                                   Coeff.    S.E.     t-stat
Constant                                                  -3.138    0.282    -11.140
ilnadt10k   ilnad1 * ilnad2 / 10000                       -0.019    0.005     -3.540
inlne       (ilnle1 * ilnle2) / (lnlength + 3.0456)        0.556    0.046     12.210
inlan1      nlanei - 1.1059299                            -7.219    2.339     -3.090
idgc113     (idgc12 * idgc11)                              0.216    0.084      2.560
dmfxobj     dummy for miscellaneous fixed object           0.565    0.212      2.660
Ilnad__1    lnadt - 7.90628                                1.051    0.084     12.520
Idgc1__1    dgc1 - 2.16389                                 0.029    0.013      2.180
Dispersion parameter for count data model
Alpha                                                     10.749    0.839

Number of obs = 6,257    Chi squared = 518.27    Prob>chi2 = 0.000
Convergent Log likelihood = -2056.1046    Pseudo R2 = 0.1119

10-K cut MFP NB, Cut # 9

Variable    Description                                                                   Coeff.    S.E.     t-stat
Constant                                                                                  -3.465    0.401     -8.640
inlne       (ilnle1 * ilnle2) / (lnlength + 3.00680)                                       0.556    0.048     11.690
inlan1      nlanei - 1.1059299                                                            -8.964    3.542     -2.530
irlnw1      rlnw - 11.84303545                                                             0.078    0.033      2.380
idgc11      $\mathrm{Ln}((\mathit{dgc1} + 9.700655937 \times 10^{-6})/100) + 3.859161$    -0.163    0.048     -3.370
idgc12      $\sqrt{(\mathit{dgc1} + 9.700655937 \times 10^{-6})/100} - 0.1452091$         14.029    4.561      3.080
srn0326     State Route Interaction Dummy (If srn = 003 or srn = 026, 1 else 0)            1.001    0.338      2.960
ivcl41      vcl4530 - 34.51615128                                                         -0.001    0.001     -2.200
ilnad__1    lnadt - 7.9171428                                                              0.789    0.070     11.330
idgc1__1    dgc1 - 2.06714                                                                -0.168    0.073     -2.300
Dispersion parameter for count data model
Alpha                                                                                     12.441    0.955

Number of obs = 6,285    Chi squared = 428.62    Prob>chi2 = 0.000
Convergent Log likelihood = -2068.8771    Pseudo R2 = 0.0939

10-K cut MFP NB, Cut # 10

Variable    Description                                   Coeff.    S.E.     t-stat
Constant                                                  -1.902    0.083    -22.960
ilnadt10K   ilnad1 * ilnad2 / 10000                       -0.035    0.004     -8.280
inlne       (ilnle1 * ilnle2) / (lnlength + 3.01385)       0.437    0.043     10.120
dmfxobj     dummy for miscellaneous fixed object           0.561    0.217      2.580
ilnad__1    lnadt - 7.90226                                1.022    0.084     12.140
Dispersion parameter for count data model
Alpha                                                     14.637    1.089

Number of obs = 6,260    Chi squared = 309.24    Prob>chi2 = 0.000
Convergent Log likelihood = -2200.7559    Pseudo R2 = 0.0656

Appendix C

Recoded MFP variables

Variable    Description
sd1         K-cut dummy variable, if k=1, 1 else 0
s1dgc1      interaction term, if sd1=1, value = degree of curvature for horizontal curve, else 0
s1lnadt     interaction term, if sd1=1, value = $\mathrm{Ln}(\mathit{AADT})$, else 0
s1lnad3     interaction term, if sd1=1, value = $\mathrm{Ln}(\mathit{AADT})^{3}$, else 0
s1srn       interaction dummy if sd1=1 and srn0326=1, 1 else 0
s1vcl       interaction term, if sd1=1, value = vcl4530, else 0
s1t2        if sd1=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100) \times \sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s1t3        if sd1=1, $\mathit{LNADT}^{3} \times \mathrm{Log}(\mathit{LNADT})$, else 0
s1t4        if sd1=1, $\mathit{LNADT}^{6} \times \mathrm{Log}(\mathit{LNADT})$, else 0
s1dmfx      if sd1=1 and miscellaneous fixed object dummy=1, 1 else 0
s1rlnw      if sd1=1, value = average lane width, else 0
s1t0        if sd1=1, $\sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s1u1        if sd1=1, $\mathit{dgc1} \times \sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s1u2        if sd1=1, $1/(3.01693 + \mathit{Lnlength})$, else 0
s1u3        if sd1=1, $\mathit{Lnlength}/(3.01693 + \mathit{Lnlength})$, else 0
s1u4        if sd1=1, $\mathit{Lnlength}^{2}/(3.01693 + \mathit{Lnlength})$, else 0
s1u5        if sd1=1, $\sqrt{4.60951 + \mathit{Lnlength}}/(3.01693 + \mathit{Lnlength})$, else 0
s1u6        if sd1=1, $\mathit{Lnlength}^{2} \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.01693 + \mathit{Lnlength})$, else 0
s1u7        if sd1=1, $\mathit{dgc1} \times \mathrm{Log}(9.70066 \times 10^{-8} + \mathit{dgc1}/100)$, else 0
s1u8        if sd1=1, $\mathit{dgc1} \times \sqrt{9.7066 \times 10^{-6} + \mathit{dgc1}} \times \mathrm{Log}(9.70066 \times 10^{-8} + \mathit{dgc1}/100)$, else 0
s1u9        if sd1=1, $\mathit{Lnlength} \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.01693 + \mathit{Lnlength})$, else 0

Sd2         K-cut dummy variable, if k=2, 1 else 0
s2dgc1      interaction term, if sd2=1, value = degree of curvature for horizontal curve, else 0
s2lclv      interaction term, if sd2=1, value = length of culvert, else 0
s2lnadt     interaction term, if sd2=1, value = $\mathrm{Ln}(\mathit{AADT})$, else 0
s2lnad3     interaction term, if sd2=1, value = $\mathrm{Ln}(\mathit{AADT})^{3}$, else 0
s2nlan      interaction term, if sd2=1, value = number of lanes in increasing direction, else 0
s2srn       interaction dummy if sd2=1 and srn0326=1, 1 else 0
s2vcl       interaction term, if sd2=1, value = vcl4530, else 0
s2t1        if sd2=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100)$, else 0
s2t3        if sd2=1, $\mathit{LNADT}^{3} \times \mathrm{Log}(\mathit{LNADT})$, else 0
s2dmfx      if sd2=1 and miscellaneous fixed object dummy=1, 1 else 0
s2rlnw      if sd2=1, value = average lane width, else 0
s2t0        if sd2=1, $\sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s2u1        if sd2=1, $1/\sqrt{2.18044 \times 10^{-5} + \mathit{dgc1}}$, else 0
s2u2        if sd2=1, $1/(3.04161 + \mathit{Lnlength})$, else 0
s2u3        if sd2=1, $\mathit{Lnlength}/(3.04161 + \mathit{Lnlength})$, else 0
s2u4        if sd2=1, $\mathit{Lnlength}^{2}/(3.04161 + \mathit{Lnlength})$, else 0
s2u5        if sd2=1, $\sqrt{4.60951 + \mathit{Lnlength}}/(3.04161 + \mathit{Lnlength})$, else 0
s2u6        if sd2=1, $\mathit{Lnlength} \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.04161 + \mathit{Lnlength})$, else 0
s2u7        if sd2=1, $\mathit{Lnlength}^{2} \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.04161 + \mathit{Lnlength})$, else 0

Sd3         K-cut dummy variable, if k=3, 1 else 0
s3lclv      interaction term, if sd3=1, value = length of culvert, else 0
s3lnadt     interaction term, if sd3=1, value = $\mathrm{Ln}(\mathit{AADT})$, else 0
s3lnad3     interaction term, if sd3=1, value = $\mathrm{Ln}(\mathit{AADT})^{3}$, else 0
s3nlan      interaction term, if sd3=1, value = number of lanes in increasing direction, else 0
s3vcl       interaction term, if sd3=1, value = vcl4530, else 0
s3t1        if sd3=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100)$, else 0
s3t2        if sd3=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100) \times \sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s3t3        if sd3=1, $\mathit{LNADT}^{3} \times \mathrm{Log}(\mathit{LNADT})$, else 0
s3t4        if sd3=1, $\mathit{LNADT}^{6} \times \mathrm{Log}(\mathit{LNADT})$, else 0
s3rlnw      if sd3=1, value = average lane width, else 0
s3t0        if sd3=1, $\sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s3u1        if sd3=1, $1/(3.00523 + \mathit{Lnlength})$, else 0
s3u2        if sd3=1, $\mathit{Lnlength}/(3.00523 + \mathit{Lnlength})$, else 0
s3u3        if sd3=1, $\mathit{Lnlength}^{2}/(3.00523 + \mathit{Lnlength})$, else 0
s3u4        if sd3=1, $\sqrt{4.60951 + \mathit{Lnlength}}/(3.00523 + \mathit{Lnlength})$, else 0
s3u5        if sd3=1, $\mathit{Lnlength} \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.00523 + \mathit{Lnlength})$, else 0
s3u6        if sd3=1, $\mathit{Lnlength}^{2} \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.00523 + \mathit{Lnlength})$, else 0

Sd4       K-cut dummy variable, if k=4, 1 else 0
s4dgc1    interaction term, if sd4=1, value = degree of curvature for horizontal curve, else 0
s4lclv    interaction term, if sd4=1, value = length of culvert, else 0
s4lnadt   interaction term, if sd4=1, value = $\mathrm{Ln}(\mathit{AADT})$, else 0
s4lnad3   interaction term, if sd4=1, value = $\mathrm{Ln}(\mathit{AADT})^3$, else 0
s4nlan    interaction term, if sd4=1, value = number of lanes in increasing direction, else 0
s4srn     interaction dummy, if sd4=1 and srn0326=1, 1 else 0
s4vcl     interaction term, if sd4=1, value = vcl4530, else 0
s4t1      if sd4=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100)$, else 0
s4t2      if sd4=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100) \times \sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s4t3      if sd4=1, $\mathit{LNADT}^3 \times \mathrm{Log}(\mathit{LNADT})$, else 0
s4t4      if sd4=1, $\mathit{LNADT}^6 \times \mathrm{Log}(\mathit{LNADT})$, else 0
s4dmfx    if sd4=1 and miscellaneous fixed object dummy=1, 1 else 0
s4rlnw    if sd4=1, value = average lane width, else 0
s4t0      if sd4=1, $\sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s4u1      if sd4=1, $1/(3.03934 + \mathit{Lnlength})$, else 0
s4u2      if sd4=1, $\mathit{Lnlength}/(3.03934 + \mathit{Lnlength})$, else 0
s4u3      if sd4=1, $\mathit{Lnlength}^2/(3.03934 + \mathit{Lnlength})$, else 0
s4u4      if sd4=1, $\sqrt{4.60951 + \mathit{Lnlength}}/(3.03934 + \mathit{Lnlength})$, else 0
s4u5      if sd4=1, $\mathit{Lnlength} \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.03934 + \mathit{Lnlength})$, else 0
s4u6      if sd4=1, $\mathit{Lnlength}^2 \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.03934 + \mathit{Lnlength})$, else 0


Sd5       K-cut dummy variable, if k=5, 1 else 0
s5dgc1    interaction term, if sd5=1, value = degree of curvature for horizontal curve, else 0
s5lclv    interaction term, if sd5=1, value = length of culvert, else 0
s5nlan    interaction term, if sd5=1, value = number of lanes in increasing direction, else 0
s5srn     interaction dummy, if sd5=1 and srn0326=1, 1 else 0
s5vcl     interaction term, if sd5=1, value = vcl4530, else 0
s5t1      if sd5=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100)$, else 0
s5t2      if sd5=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100) \times \sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s5t3      if sd5=1, $\mathit{LNADT}^3 \times \mathrm{Log}(\mathit{LNADT})$, else 0
s5t4      if sd5=1, $\mathit{LNADT}^6 \times \mathrm{Log}(\mathit{LNADT})$, else 0
s5dmfx    if sd5=1 and miscellaneous fixed object dummy=1, 1 else 0
s5t0      if sd5=1, $\sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s5u1      if sd5=1, $1/(3.03676 + \mathit{Lnlength})$, else 0
s5u2      if sd5=1, $\mathit{Lnlength}/(3.03676 + \mathit{Lnlength})$, else 0
s5u3      if sd5=1, $\mathit{Lnlength}^2/(3.03676 + \mathit{Lnlength})$, else 0
s5u4      if sd5=1, $\sqrt{4.60951 + \mathit{Lnlength}}/(3.03676 + \mathit{Lnlength})$, else 0
s5u5      if sd5=1, $\mathit{Lnlength} \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.03676 + \mathit{Lnlength})$, else 0
s5u6      if sd5=1, $\mathit{Lnlength}^2 \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.03676 + \mathit{Lnlength})$, else 0

Sd6       K-cut dummy variable, if k=6, 1 else 0
s6dgc1    interaction term, if sd6=1, value = degree of curvature for horizontal curve, else 0
s6lclv    interaction term, if sd6=1, value = length of culvert, else 0
s6lnadt   interaction term, if sd6=1, value = $\mathrm{Ln}(\mathit{AADT})$, else 0
s6lnad3   interaction term, if sd6=1, value = $\mathrm{Ln}(\mathit{AADT})^3$, else 0
s6nlan    interaction term, if sd6=1, value = number of lanes in increasing direction, else 0
s6srn     interaction dummy, if sd6=1 and srn0326=1, 1 else 0
s6t1      if sd6=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100)$, else 0
s6t2      if sd6=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100) \times \sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s6t3      if sd6=1, $\mathit{LNADT}^3 \times \mathrm{Log}(\mathit{LNADT})$, else 0
s6t4      if sd6=1, $\mathit{LNADT}^6 \times \mathrm{Log}(\mathit{LNADT})$, else 0
s6dmfx    if sd6=1 and miscellaneous fixed object dummy=1, 1 else 0
s6rlnw    if sd6=1, value = average lane width, else 0
s6t0      if sd6=1, $\sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s6dcab    interaction dummy, if sd6=1 and dcabnet=1, 1 else 0
s6u1      if sd6=1, $1/(3.04674 + \mathit{Lnlength})$, else 0
s6u2      if sd6=1, $\mathit{Lnlength}/(3.04674 + \mathit{Lnlength})$, else 0
s6u3      if sd6=1, $\mathit{Lnlength}^2/(3.04674 + \mathit{Lnlength})$, else 0
s6u4      if sd6=1, $\sqrt{4.60951 + \mathit{Lnlength}}/(3.04674 + \mathit{Lnlength})$, else 0
s6u5      if sd6=1, $\mathit{Lnlength} \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.04674 + \mathit{Lnlength})$, else 0
s6u6      if sd6=1, $\mathit{Lnlength}^2 \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.04674 + \mathit{Lnlength})$, else 0

Sd7       K-cut dummy variable, if k=7, 1 else 0
s7dgc1    interaction term, if sd7=1, value = degree of curvature for horizontal curve, else 0
s7lclv    interaction term, if sd7=1, value = length of culvert, else 0
s7lnadt   interaction term, if sd7=1, value = $\mathrm{Ln}(\mathit{AADT})$, else 0
s7lnad3   interaction term, if sd7=1, value = $\mathrm{Ln}(\mathit{AADT})^3$, else 0
s7nlan    interaction term, if sd7=1, value = number of lanes in increasing direction, else 0
s7srn     interaction dummy, if sd7=1 and srn0326=1, 1 else 0
s7vcl     interaction term, if sd7=1, value = vcl4530, else 0
s7t1      if sd7=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100)$, else 0
s7t2      if sd7=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100) \times \sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s7t3      if sd7=1, $\mathit{LNADT}^3 \times \mathrm{Log}(\mathit{LNADT})$, else 0
s7t4      if sd7=1, $\mathit{LNADT}^6 \times \mathrm{Log}(\mathit{LNADT})$, else 0
s7dmfx    if sd7=1 and miscellaneous fixed object dummy=1, 1 else 0
s7rlnw    if sd7=1, value = average lane width, else 0
s7t0      if sd7=1, $\sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s7u1      if sd7=1, $1/(3.01729 + \mathit{Lnlength})$, else 0
s7u2      if sd7=1, $\mathit{Lnlength}/(3.01729 + \mathit{Lnlength})$, else 0
s7u3      if sd7=1, $\mathit{Lnlength}^2/(3.01729 + \mathit{Lnlength})$, else 0
s7u4      if sd7=1, $\sqrt{4.60951 + \mathit{Lnlength}}/(3.01729 + \mathit{Lnlength})$, else 0
s7u5      if sd7=1, $\mathit{Lnlength} \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.01729 + \mathit{Lnlength})$, else 0
s7u6      if sd7=1, $\mathit{Lnlength}^2 \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.01729 + \mathit{Lnlength})$, else 0
s7u7      if sd7=1, $1/(-11.843 + \mathit{rlnw})$, else 0
s7u8      if sd7=1, $\mathit{vcl4530}/(-11.843 + \mathit{rlnw})$, else 0

Sd8       K-cut dummy variable, if k=8, 1 else 0
s8dgc1    interaction term, if sd8=1, value = degree of curvature for horizontal curve, else 0
s8lclv    interaction term, if sd8=1, value = length of culvert, else 0
s8lnadt   interaction term, if sd8=1, value = $\mathrm{Ln}(\mathit{AADT})$, else 0
s8lnad3   interaction term, if sd8=1, value = $\mathrm{Ln}(\mathit{AADT})^3$, else 0
s8nlan    interaction term, if sd8=1, value = number of lanes in increasing direction, else 0
s8srn     interaction dummy, if sd8=1 and srn0326=1, 1 else 0
s8vcl     interaction term, if sd8=1, value = vcl4530, else 0
s8t1      if sd8=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100)$, else 0
s8t2      if sd8=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100) \times \sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s8t3      if sd8=1, $\mathit{LNADT}^3 \times \mathrm{Log}(\mathit{LNADT})$, else 0
s8t4      if sd8=1, $\mathit{LNADT}^6 \times \mathrm{Log}(\mathit{LNADT})$, else 0
s8dmfx    if sd8=1 and miscellaneous fixed object dummy=1, 1 else 0
s8rlnw    if sd8=1, value = average lane width, else 0
s8t0      if sd8=1, $\sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s8u1      if sd8=1, $1/(3.04562 + \mathit{Lnlength})$, else 0
s8u2      if sd8=1, $\mathit{Lnlength}/(3.04562 + \mathit{Lnlength})$, else 0
s8u3      if sd8=1, $\mathit{Lnlength}^2/(3.04562 + \mathit{Lnlength})$, else 0
s8u4      if sd8=1, $\sqrt{4.60951 + \mathit{Lnlength}}/(3.04562 + \mathit{Lnlength})$, else 0
s8u5      if sd8=1, $\mathit{Lnlength} \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.04562 + \mathit{Lnlength})$, else 0
s8u6      if sd8=1, $\mathit{Lnlength}^2 \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.04562 + \mathit{Lnlength})$, else 0

Sd9       K-cut dummy variable, if k=9, 1 else 0
s9dgc1    interaction term, if sd9=1, value = degree of curvature for horizontal curve, else 0
s9lclv    interaction term, if sd9=1, value = length of culvert, else 0
s9lnadt   interaction term, if sd9=1, value = $\mathrm{Ln}(\mathit{AADT})$, else 0
s9lnad3   interaction term, if sd9=1, value = $\mathrm{Ln}(\mathit{AADT})^3$, else 0
s9nlan    interaction term, if sd9=1, value = number of lanes in increasing direction, else 0
s9srn     interaction dummy, if sd9=1 and srn0326=1, 1 else 0
s9vcl     interaction term, if sd9=1, value = vcl4530, else 0
s9t1      if sd9=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100)$, else 0
s9t2      if sd9=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100) \times \sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s9t3      if sd9=1, $\mathit{LNADT}^3 \times \mathrm{Log}(\mathit{LNADT})$, else 0
s9t4      if sd9=1, $\mathit{LNADT}^6 \times \mathrm{Log}(\mathit{LNADT})$, else 0
s9dmfx    if sd9=1 and miscellaneous fixed object dummy=1, 1 else 0
s9rlnw    if sd9=1, value = average lane width, else 0
s9u1      if sd9=1, $1/(3.0068 + \mathit{Lnlength})$, else 0
s9u2      if sd9=1, $\mathit{Lnlength}/(3.0068 + \mathit{Lnlength})$, else 0
s9u3      if sd9=1, $\mathit{Lnlength}^2/(3.0068 + \mathit{Lnlength})$, else 0
s9u4      if sd9=1, $\sqrt{4.60951 + \mathit{Lnlength}}/(3.0068 + \mathit{Lnlength})$, else 0
s9u5      if sd9=1, $\mathit{Lnlength} \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.0068 + \mathit{Lnlength})$, else 0
s9u6      if sd9=1, $\mathit{Lnlength}^2 \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.0068 + \mathit{Lnlength})$, else 0

Sd10      K-cut dummy variable, if k=10, 1 else 0
s10dgc1   interaction term, if sd10=1, value = degree of curvature for horizontal curve, else 0
s10lclv   interaction term, if sd10=1, value = length of culvert, else 0
s10lnadt  interaction term, if sd10=1, value = $\mathrm{Ln}(\mathit{AADT})$, else 0
s10lnad3  interaction term, if sd10=1, value = $\mathrm{Ln}(\mathit{AADT})^3$, else 0
s10nlan   interaction term, if sd10=1, value = number of lanes in increasing direction, else 0
s10srn    interaction dummy, if sd10=1 and srn0326=1, 1 else 0
s10vcl    interaction term, if sd10=1, value = vcl4530, else 0
s10t1     if sd10=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100)$, else 0
s10t2     if sd10=1, $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100) \times \sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s10t4     if sd10=1, $\mathit{LNADT}^6 \times \mathrm{Log}(\mathit{LNADT})$, else 0
s10dmfx   if sd10=1 and miscellaneous fixed object dummy=1, 1 else 0
s10rlnw   if sd10=1, value = average lane width, else 0
s10t0     if sd10=1, $\sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$, else 0
s10u1     if sd10=1, $1/(3.01385 + \mathit{Lnlength})$, else 0
s10u2     if sd10=1, $\mathit{Lnlength}/(3.01385 + \mathit{Lnlength})$, else 0
s10u3     if sd10=1, $\mathit{Lnlength}^2/(3.01385 + \mathit{Lnlength})$, else 0
s10u4     if sd10=1, $\sqrt{4.60951 + \mathit{Lnlength}}/(3.01385 + \mathit{Lnlength})$, else 0
s10u5     if sd10=1, $\mathit{Lnlength} \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.01385 + \mathit{Lnlength})$, else 0
s10u6     if sd10=1, $\mathit{Lnlength}^2 \times \sqrt{4.60951 + \mathit{Lnlength}}/(3.01385 + \mathit{Lnlength})$, else 0

lnad3     $\mathrm{Ln}(\mathit{AADT})^3$
t0full    $\sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$
t1full    $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100)$
t2full    $\mathrm{Log}(9.7006 \times 10^{-8} + \mathit{dgc1}/100) \times \sqrt{9.70066 \times 10^{-6} + \mathit{dgc1}}$
t3full    $\mathit{LNADT}^3 \times \mathrm{Log}(\mathit{LNADT})$
t4full    $\mathit{LNADT}^6 \times \mathrm{Log}(\mathit{LNADT})$
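As an illustration of how the cluster-specific length terms tabulated above could be computed, the following Python sketch evaluates the six Lnlength transforms that recur as u1 to u6 in K-cut clusters 3 through 10 and gates them by the cluster dummy. The function names (length_terms, cluster_interactions), the dictionary LNLENGTH_SHIFT, and the argument segment_cluster are hypothetical and are not part of the original estimation code; Lnlength is taken as an input rather than derived from a segment length. This is a minimal sketch of the variable definitions, not the procedure used to build the estimation dataset.

```python
import math

# Cluster-specific constant added to Lnlength in the denominator, copied from
# the table above for K-cut clusters 3 to 10 (these share the u1..u6 naming).
LNLENGTH_SHIFT = {
    3: 3.00523, 4: 3.03934, 5: 3.03676, 6: 3.04674,
    7: 3.01729, 8: 3.04562, 9: 3.0068, 10: 3.01385,
}


def length_terms(lnlength, shift):
    """Return the six Lnlength transforms (the u1..u6 pattern) for one segment."""
    denom = shift + lnlength               # e.g. 3.00523 + Lnlength
    root = math.sqrt(4.60951 + lnlength)   # sqrt(4.60951 + Lnlength)
    return [
        1.0 / denom,
        lnlength / denom,
        lnlength ** 2 / denom,
        root / denom,
        lnlength * root / denom,
        lnlength ** 2 * root / denom,
    ]


def cluster_interactions(k, segment_cluster, lnlength):
    """Gate the transforms by the dummy sd<k>: value if the segment is in cluster k, else 0."""
    sd = 1 if segment_cluster == k else 0  # hypothetical stand-in for the K-cut dummy
    terms = length_terms(lnlength, LNLENGTH_SHIFT[k])
    return {f"s{k}u{i}": sd * t for i, t in enumerate(terms, start=1)}
```

For example, cluster_interactions(3, 3, math.log(0.5)) returns the values of s3u1 through s3u6 for a segment assigned to cluster 3, while the same call with segment_cluster set to any other cluster returns zeros for all six terms.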


Vita

Baradhwaj Hariharan was born in Muscat, Oman in 1984. He received his B.E. degree in Mechatronics Engineering from Anna University, India in 2005, after which he worked in software programming until 2007. He then began his master's degree in Industrial Engineering at Texas A&M University, College Station, graduating in 2008, after which he worked in port operations in Muscat, Oman for two years. He began his Ph.D. in Civil Engineering at The Pennsylvania State University, with a focus on Transportation Engineering, in 2011. His field of specialization is applied econometric modeling for transportation safety, and his current research interests extend to developing mechatronics systems for ITS applications and to the analysis and understanding of big data.