1 2014 TRB Data Contest Combing Factor Analysis with Binary Logistic

2014 TRB Data Contest
Combing Factor Analysis with Binary Logistic Regression for Analysis of Driver Behavior in Dilemma Zone

Wenfu Wang (Corresponding Author) and Kushal Mehta
Department of Civil Engineering, University of Waterloo, Waterloo, Ontario N2L3G1, Canada
Corresponding author e-mail: [email protected]

Problem Formulation

This study aims to understand how different drivers behave in the dilemma zone while distracted by phone use. Using binary logistic regression and factor analysis, it predicts the probability that a driver stops (or goes through) at the intersection from a set of predictors. The performance of the binary logistic model alone is compared with that of a combined model structure of factor analysis and binary logistic regression.

Data Preparation

The data used in this study were collected at the University of Iowa National Advanced Driving Simulator (NADS), where drivers from three age groups were asked to drive past intersections while engaged in one of three secondary tasks (No Phone Call, Outgoing Call, Incoming Call). The original data were collected at 240 Hz, so the frame numbers provided were divided by 240 to obtain the time, in seconds, at which each event occurred. Several filtering rules were applied to the original datasets to ensure that the resulting variables add value to the model. Records with any of the following attributes were excluded from the analysis:

1) negative yellow phase length
2) negative red phase length
3) positive deceleration rate
4) negative acceleration rate

In the end, 812 of the original 1157 records were selected and used in this study. Table 1 lists all the potential input variables.
Table 1: Aggregated Variables for Driver Behaviors at Dilemma Zone

Variable                              Name    Coding
------------------------------------  ------  ---------------------------------------------------
Age Group                             MAge    Dummy variable, 1 = Middle Age, 0 = others
                                      OAge    Dummy variable, 1 = Older, 0 = others
Gender                                        Dummy variable, 1 = Male, 0 = others
Cell Phone Interface                  HF      Dummy variable, 1 = Hand free, 0 = others
                                      HS      Dummy variable, 1 = Headset, 0 = others
Call Interface                        OCall   Dummy variable, 1 = Outgoing Call, 0 = others
                                      ICall   Dummy variable, 1 = Incoming Call, 0 = others
Yellow Length                                 Scaled variable, unit = seconds
Acceleration Pedal Change Direction           Dummy variable, 1 = Depressing, 0 = Released
Acceleration Rate                             Scaled variable, unit =
Deceleration Rate                             Scaled variable, unit =
Distance at Green to Yellow                   Scaled variable, unit =
Velocity at Green to Yellow                   Scaled variable, unit =
Time Headway (Binned)                         Dummy variable, 1 = more than 3.06 seconds, 0 = others
Velocity at Yellow to Red                     Scaled variable, unit =

The red phase data were not used in this study, because they did not match the description provided (possibly due to over- or under-recording of the red phase). Time Headway combines two of the provided variables: Distance at Green to Yellow divided by Velocity at Green to Yellow. It was included because previous researchers have identified time headway as a significant predictor of passing events (1). Time Headway is coded as a dummy variable because most headway values clustered around either 3 seconds or 3.75 seconds and did not follow a normal distribution. The frame of acceleration pedal change 10% (Column F) was not used, because its values did not make intuitive sense. Distance from the Stop Line (Column I) was not used because there were too few observations of drivers stopping beyond the stop line. The Velocity at Stop Line (Column N) and the Frame at Stop Line (Column O) were not used because they are unrelated to the drivers' decision process as they drive past the intersection.
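The filtering rules from the Data Preparation section and the binned time headway described above could be sketched as follows. This is a minimal illustration only: the record field names (`yellow_len`, `red_len`, `accel`, `decel`) and the sample values are hypothetical, and distance and velocity are assumed to be in mutually consistent units so that their ratio is in seconds.

```python
SAMPLE_RATE_HZ = 240        # sampling rate stated in the text
HEADWAY_THRESHOLD_S = 3.06  # binning threshold listed in Table 1


def frame_to_seconds(frame):
    """Convert a simulator frame number to elapsed time in seconds."""
    return frame / SAMPLE_RATE_HZ


def keep_record(rec):
    """Return True if the record passes all four exclusion rules."""
    return (
        rec["yellow_len"] >= 0   # rule 1: no negative yellow phase length
        and rec["red_len"] >= 0  # rule 2: no negative red phase length
        and rec["decel"] <= 0    # rule 3: no positive deceleration rate
        and rec["accel"] >= 0    # rule 4: no negative acceleration rate
    )


def binned_headway(distance, velocity):
    """Time headway = distance / velocity, coded 1 if above the threshold."""
    if velocity <= 0:
        raise ValueError("velocity must be positive to compute a headway")
    return 1 if distance / velocity > HEADWAY_THRESHOLD_S else 0


# Hypothetical records, for illustration only.
records = [
    {"yellow_len": 3.5, "red_len": 2.0, "accel": 1.2, "decel": -2.5},
    {"yellow_len": -0.1, "red_len": 2.0, "accel": 1.0, "decel": -1.0},
    {"yellow_len": 3.5, "red_len": 2.0, "accel": -0.4, "decel": -1.0},
]
kept = [r for r in records if keep_record(r)]
```

Only the first hypothetical record survives the filter; the second violates rule 1 and the third violates rule 4.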
The dependent variable was derived from First Stop Frame (Column H) and was coded as a dummy variable with 1 = stop and 0 = go through.

Methods and Assumptions

Stop and go-through events were examined with a combination of factor analysis and binary logistic regression. Factor analysis was selected because factor scores can reveal the underlying patterns in the original data while reducing the data dimensions and resolving collinearity among the variables (2, 3).

Factor Analysis

The basic factor analysis equation can be represented in matrix form as follows:

$$Z = \lambda F + \varepsilon$$

where Z is an n by 1 vector of variables, λ is an n by m matrix of factor loadings, F is an m by 1 vector of factors, and ε is an n by 1 vector of errors (4). Factor loadings represent the correlation coefficients between variables and factors. A higher absolute loading indicates that the corresponding variable contributes more to the meaning of the factor, and vice versa. The extent to which a factor represents the variation in the data can be evaluated by its eigenvalue, and an eigenvalue larger than 1.0 indicates a significant factor (5). Varimax rotation is used in this study to produce orthogonal (uncorrelated) factors. The factor scores were then used as inputs to the binary logistic regression in the combined model structure.

Binary Logistic Regression Model

Binary logistic regression is a widely used method for predicting the probability of a binary outcome (here, a stop event or a go-through event) from the values of a set of explanatory variables (1). In logistic regression, the dependent variable is a logit, which is the natural log of the odds:

$$\operatorname{logit}(P) = \ln\left(\frac{P}{1-P}\right) = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

where P is the probability that the event coded as 1 occurs, a is a constant, X_1, ..., X_k are the predictor variables, and b_1, ..., b_k are the predictor coefficients.

Some Assumptions

It is assumed that all the input variables to the factor analysis and logistic models follow a normal distribution.
All categorical variables in this study were coded as 0 or 1 dummy variables, and this coding allows them to be treated as ordinary scaled variables. In addition, it is assumed that the factors and the random error in the factor analysis are uncorrelated.

Performance Measures

The input data were randomly divided into two groups: 70% of the data in the training (calibration) group and 30% in the testing (validation) group. The following four performance indexes were used to evaluate model performance:

Table 2: Model Performance Measures

Measure                      Description
---------------------------  ------------------------------------------
Sensitivity (Sen)            % of stop events predicted correctly
Specificity (Spe)            % of go-through events predicted correctly
False Positive Rate (FPR)    % of incorrect stop event predictions
False Negative Rate (FNR)    % of incorrect go-through event predictions

Higher Sen and Spe values together with lower FPR and FNR values indicate better model performance.

As described in the previous sections, two model structures were developed in this study:

a) Binary logistic regression model
b) Combined model with factor analysis scores as inputs to the binary logistic regression model

Results and Analysis

Model Structure a

The binary logistic regressions were performed in SPSS v17.0 (6). The forward stepwise method was used, with significance levels of 0.05 and 0.1 as the thresholds for variables entering and leaving the model, respectively. The established binary logistic model is as follows:

[equation missing]

The -2 log likelihood of the above model is 298.605, and the Cox & Snell R² is 0.526. The model indicates that older drivers are less likely to stop than young drivers, and that a longer yellow length together with a short time headway decreases the chance of stopping. These findings contradict previous observations (1). It is possible that correlation among the predictors has reduced the explanatory power of the model. In addition, phone usage (with any interface) was not found to influence the results.
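The 70/30 split and the four measures in Table 2 could be computed along the following lines. This is a sketch under stated assumptions: a stop event (coded 1) is treated as the positive class, FPR and FNR follow the standard confusion-matrix definitions (FP / (FP + TN) and FN / (FN + TP)), which may differ from the exact denominators the authors used, and the function names and sample labels are hypothetical.

```python
import random


def split_train_test(records, train_fraction=0.7, seed=0):
    """Randomly split records into training and testing groups."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def _ratio(numerator, denominator):
    """Safe division that returns NaN when a class is absent."""
    return numerator / denominator if denominator else float("nan")


def performance_measures(observed, predicted):
    """Sen, Spe, FPR and FNR with stop (1) as the positive class."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)  # stops predicted as stops
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)  # stops predicted as go-through
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)  # go-through predicted correctly
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)  # go-through predicted as stops
    return {
        "Sen": _ratio(tp, tp + fn),
        "Spe": _ratio(tn, tn + fp),
        "FPR": _ratio(fp, fp + tn),
        "FNR": _ratio(fn, fn + tp),
    }


# Hypothetical observed and predicted labels, for illustration only.
observed = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
measures = performance_measures(observed, predicted)

# Splitting 812 record indices gives 568 training and 244 testing records.
train, test = split_train_test(range(812))
```

Under these definitions FPR equals 1 - Spe and FNR equals 1 - Sen, so the four indexes carry two independent pieces of information.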
Model Structure b

The factor analysis was conducted in SPSS v17.0. The principal component method was used to extract the factors, and the Varimax method was used to rotate the factor loadings. The results showed that 5 out of 13 factors had an eigenvalue over 1.0, and these factors explained 62.2% of the variation (similar to R²) in the original data. The rotated factor loadings are listed as follows:

Table 3: Results of Rotated Factor Loadings

Variables                            Factor 1  Factor 2  Factor 3  Factor 4  Factor 5
-----------------------------------  --------  --------  --------  --------  --------
MAge                                   -.021      .006      .005     -.849      .044
OAge                                    .021      .019     -.002      .849     -.052
Gender                                  .224     -.011      .023      .055     -.356
HS                                      .004     -.859     -.014      .006     -.038
HF                                      .002      .863     -.005      .031      .021
OCall                                   .035      .016     -.867      .005      .022
ICall                                   .056      .011      .858      .001     -.047
Yellow Length                           .812     -.028      .043      .020      .365
Min Accel After Accel Pedal Change     -.912     -.038      .016     -.014      .239
Max Accel After Accel Pedal Change      .916      .003      .017     -.011      .070
Accel Pedal Change Direction            .006      .135     -.086     -.104      .216
Time Headway (Binned)                   .177     -.057      .047      .067      .897
Vel at Yellow to Red                   -.952      .013      .010     -.038      .141

Factor 1 in the above table represents the characteristics of the driver. Higher Factor 1 values are associated with higher acceleration, deceleration, and velocity, so responsive drivers would achieve a high score on Factor 1. Factor 2 represents the phone interface, and Factor 3 is related to the call interface. Factor 4 is associated with age, with older drivers scoring higher than younger drivers. Factor 5 is associated with time headway, with a longer time headway corresponding to a higher score. The factor scores were calculated using the regression method in SPSS, and the scores for the above 5 factors were input into the binary logistic regression model. The established model is as follows:

[equation missing]

The -2 log likelihood of the above model is 374.721, and the Cox & Snell R² is 0.458.
The model explained around 45.8% of the variation in the data, and a higher value of Factor 1 and a lower value of Factor 4 increase the likelihood of a stop event. Combining this with the information in Table 3, it is concluded that more responsive drivers and middle-aged or young drivers are more likely to stop than other drivers.
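The factor-analysis half of Model Structure b (principal component extraction, retention of factors with eigenvalue above 1.0, Varimax rotation, regression-method scores) can be approximated outside SPSS. The sketch below uses NumPy on synthetic data; it is not the authors' SPSS procedure and will not reproduce Table 3. The function names `varimax` and `factor_scores`, the synthetic data generator, and the convergence settings are all assumptions made for illustration.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a loading matrix toward the Varimax criterion."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like target of the Varimax criterion.
        target = rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt  # nearest orthogonal matrix to the target
        if s.sum() < criterion * (1 + tol):
            break  # criterion stopped improving
        criterion = s.sum()
    return loadings @ rotation


def factor_scores(data):
    """Return (rotated loadings, regression-method factor scores)."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0  # retain factors with eigenvalue above 1.0
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    rotated = varimax(loadings)
    weights = np.linalg.solve(corr, rotated)  # regression method: R^-1 * loadings
    return rotated, z @ weights


# Synthetic data: 6 observed variables driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = latent @ mixing + 0.5 * rng.normal(size=(500, 6))

rotated, scores = factor_scores(data)
```

The resulting `scores` columns are uncorrelated with unit variance, which is why they can be passed to a binary logistic regression without the collinearity problems of the raw predictors.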
