sustainability

Article

An Item Response Theory to Analyze the Psychological Impacts of Rail-Transport Delay

Mahdi Rezapour 1,*, Kelly Cuccolo 2, Christopher Veenstra 2 and F. Richard Ferraro 2

1 Wyoming Technology Transfer Center, University of Wyoming, Laramie, WY 82071, USA
2 Department of , University of North Dakota, Grand Forks, ND 58201, USA; [email protected] (K.C.); [email protected] (C.V.); [email protected] (F.R.F.)
* Correspondence: [email protected]

Abstract: Instruments have been used extensively by researchers in the literature to evaluate various aspects of public transportation, and important implications have been derived from them to improve those aspects. However, it is important that instruments designed to measure various stimuli meet criteria of reliability and validity so that they reflect the real impact of the stressors. Particularly, given the diverse range of commuter characteristics considered in this study, it is necessary to ensure that the instrument is reliable and accurate. This can be achieved by finding the relationship between the items' properties and the underlying unobserved trait being measured. Item response theory (IRT) assesses an instrument's reliability by examining the relationship between the unobserved trait and the various observed items. In this study, to determine whether our instrument suffers from any such problems, an IRT analysis was conducted. The analysis employed the graded response model (GRM) due to the ordinal nature of the responses. Various aspects of the instrument, such as the discriminability and informativeness of the items, were tested. For instance, it was found that while classical test theory

(CTT) confirmed the reliability of the instrument, IRT highlighted some concerns regarding it. Also, the person fit assessment, for instance, raised some concern that respondents answered some of the questions without interest, choosing answers randomly. Few studies have examined instruments' reliability in determining the psychological impacts of public transportation on commuters in the way performed here.

Keywords: item response theory; transport psychology; item information curve; test information; residuals; transport delay

Citation: Rezapour, M.; Cuccolo, K.; Veenstra, C.; Ferraro, F.R. An Item Response Theory to Analyze the Psychological Impacts of Rail-Transport Delay. Sustainability 2021, 13, 6935. https://doi.org/10.3390/su13126935

Academic Editor: Tomio Miwa

Received: 10 May 2021; Accepted: 17 June 2021; Published: 20 June 2021

1. Introduction

Public transportation systems are a crucial part of society. Their benefits can be

summarized as reductions in traffic congestion, carbon emissions, and pollution-related health concerns. However, public transport is not without shortcomings. Extensive efforts have been made by public transport engineers and planners to enhance public satisfaction. This has been achieved, for instance, through improvements in specific factors such as the reliability of public transport (e.g., [1]), which could lead to greater use of public transport. A big concern that still has not received much attention is the negative aspects of public transport that could impact the well-being of commuters.

Despite many efforts to improve public transport, the services in many areas still have shortcomings. One aspect of public transport that has received substantial attention is delay. Delay can be defined as part of waiting time: the difference between a service's expected arrival time and its actual arrival time. Commuters' expectations might be created by schedules presented by policy makers or by real-time information displays. However, if these are not available, or no precise information is provided, expectations might result from the past experiences of a commuter regarding the typical arrival and departure of a public transit service.


Despite the importance of delay for commuters' satisfaction with public transport [2], delay is sometimes inevitable, so it is important to determine how delay is transferred to commuters in terms of various psychophysical behaviors and how those feelings impact the perceived quality of the system.

To achieve that, researchers have designed instruments to measure commuters' feelings and perceptions. However, those instruments are not without shortcomings. Item response theory (IRT) can help to validate these scales by highlighting many of the instrument challenges that need to be resolved. Despite the importance of validating instruments before their application, relatively little research has been done on the topic in transport psychology.

1.1. Study Motivation

This study was conducted to evaluate and validate the instrument being used to measure the psychological impacts of delay on commuters. After using various IRT techniques to evaluate the reliability of the instrument, further measures were taken to highlight the underlying causes of any lack of fit. For instance, person fit assessment was employed to flag response patterns that were not in line with all other observations. Also, a large portion of this manuscript is devoted to the mathematical formulation of the methodology.

Despite the extensive efforts in studying and enhancing various aspects of public transport, not much work has been conducted evaluating the reliability of the instruments in that field. Thus, this study highlights several studies performed in other fields that implemented the IRT method. The next paragraph outlines a few studies conducted with the help of this technique.

IRT was used for analyzing the parenting stress index for parents having children with autism spectrum disorder [3]. The results suggested that changes in distress severity were often reflected in an associated change in item score; however, other items functioned poorly for discriminating between criteria. In another study, IRT was used to examine the cross-cultural comparability of standard scales of the occupational scales [4]. The IRT likelihood ratio model was used for differential item functioning (DIF) and differential test functioning (DTF) analyses. The Rasch method has also been used in different fields [5]. Depressive and anxiety symptoms in refugees were evaluated with the help of IRT [6]. The participants completed a patient health questionnaire related to depressive and anxiety symptoms. The results highlighted that interrelations of depressive and anxiety symptoms differed across residents and refugees.
IRT was also applied to the multidimensional assessment of parenting (MAPS); it confirmed the positive and negative dimensions and showed that the best-fitting model included six nested dimensions from the original model. IRT was used to measure the suitability of the author recognition test (ART) for native and nonnative English speakers [7]; the results showed an expected gradient. The autism-relevant quality of life in autistic adults was also psychometrically evaluated [8]. IRT has recently been used for the psychometric properties of Addenbrooke's cognitive examination (ACE-III) [9], for the self-reflection and insight scale [10], for examination of the original and short-form difficulties in emotion regulation scales [11], and for development and validation of the cancer knowledge scale for the general population [12].

Despite the efforts to identify the translated impacts of delay on commuters, the question of whether the measurements possess the requisite psychometric properties remains to be answered.

1.2. Problem Statement

It is hypothesized that delay is not only a matter of reaching a destination or the monetary value of expended time, but also a matter of the psychological well-being of commuters. Even the monetary valuation of time can be translated into psychophysical behaviors of individuals. Thus, the impacts of delay on commuters can be looked at from the perspective of its psychological implications for outcomes such as stress or anxiety. For instance, stress could lead to serious illnesses, including cardiovascular disease and suppressed immune functioning [13]. Various feelings alongside stress can also result from delay occurrence, including anxiety, fear, or even anger. For instance, delay was highlighted as a primary factor in the stressfulness of traveling by car, especially during high traffic volume [14]. Crowding, delay, and accessibility to a railway station have also been shown to be sources of commuters' anxiety [15].

A first step in addressing the huge cost of this stress is to understand the feelings that the respondent experiences regarding various associated stimuli. However, before conducting any analysis to identify the factors, it is important for a developed instrument to be validated to make sure it is an accurate representation of commuters' feelings. As such, this study was conducted to evaluate the instrument intended to measure the various psychological effects that commuters experience as a result of public transport delays.

The evaluation of the instrument could be done by either classical test theory (CTT) or IRT. CTT uses simplified reliability measures, e.g., Cronbach's alpha, while IRT offers the test information function, which shows the degree of precision at different values of theta.
Another important difference between the two methods is that CTT assumes the amount of error is the same across respondents, while IRT assumes that instruments are imprecise tools and that there is a difference between a person's true score and estimated score. In IRT, item response probability is a function of latent trait scores and item characteristics. In other words, while for IRT the outcome measure is based on the distribution of theta, for CTT the outcome measure is based on the sum score distribution. It should be noted that the items are expected to measure several underlying latent dimensions, the number of which is fewer than the number of items. Those items, along with their related classes, could be identified by factor analysis.
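The CTT/IRT contrast above can be made concrete with a small numeric sketch, using entirely hypothetical values: CTT yields one standard error of measurement for every respondent, while in IRT the standard error varies with theta through the information function.

```python
import numpy as np

# Illustrative contrast (hypothetical values): CTT assumes one standard
# error of measurement for every respondent, while IRT lets precision
# vary with the latent trait theta via the information function.
alpha_reliability = 0.85          # a single CTT reliability estimate
sd_x = 4.0                        # standard deviation of the sum score
sem_ctt = sd_x * np.sqrt(1 - alpha_reliability)  # one SEM for everyone

def info_2pl(theta, a, b):
    """Item information of a 2PL item: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

# Test information = sum of item information; SEM(theta) = 1/sqrt(I(theta))
items = [(1.7, -1.0), (2.4, 0.0), (1.2, 1.5)]   # hypothetical (a, b) pairs
thetas = np.array([-2.0, 0.0, 2.0])
test_info = sum(info_2pl(thetas, a, b) for a, b in items)
sem_irt = 1.0 / np.sqrt(test_info)
# Unlike sem_ctt, sem_irt differs across theta: precision is highest
# near the items' difficulty locations.
```

With these values the IRT standard error is smallest near theta = 0, where the hypothetical items are concentrated, and grows toward the extremes, while the CTT value is a single constant.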

2. Method

The methodology section is presented in the following structure. As the primary step of IRT is often factor analysis, the background of this method is briefly presented first, followed by the background of IRT. Then, the model implementations are detailed.

2.1. Factor Analysis

Factors, or latent variables, can be written in the form of [16]:

x = µ + Λz + u (1)

where x is a p × 1 vector of manifest (observed) variables, z represents m × 1 common factors, u represents p × 1 unique variables, Λ is a p × m factor matrix of partial regression weights of the manifest variables, or factor loadings, and the elements of µ are the corresponding intercepts. It should be noted that the data are usually treated as interval, but they could be assumed to be ordinal as well. While factor analysis is normally implemented on continuous variables following a multivariate normal distribution, it can also be implemented on ordinal-scale variables by using a polychoric correlation matrix. Although it was possible to highlight latent traits by trial and error through IRT, because we wanted to make a comparison between factor analysis (FA) and IRT, we implemented FA for identification of our factors, and later we checked those by IRT.
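Equation (1) can be illustrated with a minimal simulation, using hypothetical dimensions and loadings: under the usual assumptions (standard normal factors, independent diagonal-covariance unique terms), the implied covariance of x is ΛΛ' + Ψ, which a large simulated sample reproduces.

```python
import numpy as np

# A minimal simulation of the factor model x = mu + Lambda z + u
# (Equation (1)), with hypothetical dimensions: p = 4 observed items,
# m = 1 common factor. With z ~ N(0, I), u ~ N(0, Psi) diagonal, and
# z independent of u, the implied covariance of x is Lambda Lambda' + Psi.
rng = np.random.default_rng(0)
p, m, n = 4, 1, 200_000

mu = np.zeros(p)
Lam = np.array([[0.9], [0.8], [0.7], [0.6]])   # hypothetical loadings
Psi = np.diag([0.19, 0.36, 0.51, 0.64])        # unique variances

z = rng.standard_normal((n, m))
u = rng.standard_normal((n, p)) @ np.sqrt(Psi)  # scale each unique term
x = mu + z @ Lam.T + u

implied = Lam @ Lam.T + Psi        # model-implied covariance
sample = np.cov(x, rowvar=False)   # sample covariance of simulated data
# With a large n the two matrices agree closely.
```

The loadings were chosen so that each item has unit total variance, which makes the diagonal of the implied covariance easy to check by eye.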

2.2. Item Response Theory

Item response theory (IRT) can be considered a class of latent variable models linking possibly polytomous manifest variables to a latent variable [17]. The estimation results of this study are based on the graded response model (GRM), or ordered categorical response model. The GRM was used because the scores are in the form of ordinal, ordered polytomous responses. The objectives of the GRM are estimating the probability of the test subject choosing a particular response for each item and measuring how well an item measures the subject's latent trait. The plots used for diagnosis in this study were based on GRM results, so it is worth discussing that method. The GRM model, specifying the probability of endorsing the ith response, can be written as:

Pi(θm) = 1 / (1 + e^(−αi(θm − βi))) (2)

where θm is the latent trait/ability (latent score) for an individual m, highlighting the ability of that individual in response to various questions; βi is the location of the extremity boundary for item i (difficulty parameter); and αi is the discrimination parameter for item i, which, if constrained, would be constant across all items. For simplicity of explanation, we considered a dichotomously scored item; however, an extension to Likert-type items is straightforward. For instance, Equation (2) would turn into:

Pki(θm) = P*ki(θm) − P*(k+1)i(θm) (3)

where P*ki is the probability of endorsing option k or higher and P*(k+1)i is the probability of endorsing option k + 1 or higher. The marginal log likelihood for the mth individual respondent can be written as:

Lm(θ) = log p(xm; θ) = log ∫ p(xm | zm; θ) p(zm) dzm (4)

where it is assumed that respondent m is a random sample from a population parameterized by θ, with ability zm, and xm represents the choices of individual m. As the above equation does not have a closed form, Gauss–Hermite quadrature can be used to approximate the integral by a weighted average of the integrand at predetermined abscissas [18]. The difference between Gaussian quadrature and Markov chain Monte Carlo (MCMC), for instance, should be noted: Gauss quadrature can be considered a deterministic version of MCMC; in Gauss quadrature, the samples and weights (w) are fixed beforehand, but in MCMC they are random [18].
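The quadrature idea can be sketched in a few lines: the integral of a function against a standard normal density is replaced by a fixed weighted sum over predetermined nodes. The integrand below is a hypothetical 2PL response probability, so the result is the marginal probability of endorsement over the ability distribution.

```python
import numpy as np

# Gauss-Hermite approximation of the kind used for Equation (4):
# E[f(z)] for z ~ N(0, 1) is approximated by
# (1/sqrt(pi)) * sum_i w_i * f(sqrt(2) * t_i),
# where (t_i, w_i) are the fixed Gauss-Hermite nodes and weights.
nodes, weights = np.polynomial.hermite.hermgauss(21)

def marginal_endorse_prob(alpha, beta):
    """Marginal probability of endorsing a hypothetical 2PL item."""
    z = np.sqrt(2.0) * nodes                      # rescale for N(0, 1)
    p = 1.0 / (1.0 + np.exp(-alpha * (z - beta)))
    return np.sum(weights * p) / np.sqrt(np.pi)

approx = marginal_endorse_prob(alpha=1.69, beta=0.0)
# By the symmetry of beta = 0, the exact answer is 0.5.
```

Because the nodes and weights are fixed in advance, rerunning the approximation always gives the same answer, which is the deterministic character contrasted with MCMC above.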

2.3. Estimating Item Parameters

A few points need to be highlighted about parameter estimation:

2.3.1. For the GRM

a. Initial values are set a priori. The model is updated and adjusted until the adjustment becomes extremely small and the model is terminated.

b. A few variables within each factor are considered for each analysis. For that, the pattern of the whole factor's variables is considered instead of the individual effect of each variable. For instance, a pattern of 1_1_1_1_2 for five variables would be considered, where, in this case, the first four questionnaire attributes are 1 and the last one is 2.

c. As mentioned, the model parameters are based on the Gauss–Hermite quadrature approximation and are updated by means of an optimization process (e.g., a quasi-Newton algorithm). For the log likelihood, a conditional probability function of a two-parameter model is used, which is similar to Equation (2).

d. Based on Equation (4), the main part of the model parameter estimation is the likelihood function, which consists of estimating a conditional probability function of x based on z (nodes), multiplying the resultant values by the weights, and summing to form the likelihood value for the optimization process: an approximation of Equation (4) by the Gauss–Hermite method.
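Step d above can be sketched end to end for one respondent: evaluate the conditional probability of the response pattern at each node, weight, sum, and take the log. The item parameters and response pattern below are hypothetical, not the fitted values of this study.

```python
import numpy as np

# Sketch of step d: the marginal likelihood of one respondent's answers
# (Equation (4)) approximated with Gauss-Hermite nodes and weights.
# Cumulative GRM probabilities follow Equation (2); all parameter values
# and the response pattern are hypothetical.
nodes, weights = np.polynomial.hermite.hermgauss(15)
z = np.sqrt(2.0) * nodes                   # nodes rescaled for N(0, 1)
w = weights / np.sqrt(np.pi)               # weights rescaled to sum to 1

def cat_prob(theta, alpha, betas, k):
    """P(category k | theta) for a 5-category GRM item, k in 1..5."""
    cum = lambda b: 1.0 / (1.0 + np.exp(-alpha * (theta - b)))
    upper = [1.0] + [cum(b) for b in betas] + [0.0]
    return upper[k - 1] - upper[k]

items = [  # (alpha, [beta1..beta4]) -- hypothetical parameters
    (1.7, [-1.8, -0.9, 0.0, 1.8]),
    (2.4, [-1.5, -0.8, 0.0, 1.4]),
]
response = [2, 3]  # respondent's chosen category for each item

# Conditional probability of the response pattern at each node, then the
# weighted sum over nodes; its log is the marginal log likelihood L_m.
cond = np.ones_like(z)
for (alpha, betas), k in zip(items, response):
    cond *= np.array([cat_prob(t, alpha, betas, k) for t in z])
log_lik = np.log(np.sum(w * cond))
```

In a full estimation, this quantity would be summed over respondents and handed to a quasi-Newton optimizer over the item parameters, as step c describes.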

2.3.2. For Making the Plots

The results of the GRM were used. Two main plots, item response category characteristics (ICCs) and item information curves (IICs), are discussed:

a. ICCs highlight how the probability of responding to each item in a given category changes based on the latent variable of ability. In other words, they highlight the relative ability of an item to discriminate among contiguous trait scores at different locations along the trait continuum [19].

b. For the ICC, the continuous function of θ is created based on the two-parameter logit distribution function, considering the difficulty and discrimination parameters (see Equation (2)).

c. As can be seen from Equation (2), the probability of a respondent endorsing an unobserved trait of θ depends on two item parameters: difficulty (β) and discrimination (α).

d. The ICC plots can similarly be obtained from Equation (3) [20].

e. For plotting the ICC, we would have five plots using Equation (3). For instance, for the last of the five categories of the Likert scale, based on Equation (3), we have:

P5i(θm) = P*5i(θm) − P*6i(θm) (5)

Also, it is intuitive that the cumulative probabilities of endorsing the lowest category or higher, P*1i, and a (nonexistent) sixth category or higher, P*6i, are 1 and 0, respectively [13].
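The five ICC traces described in steps a–e can be generated directly: build a theta grid, evaluate the category probabilities of Equations (3) and (5) at each grid point, and plot each category's trace. The item parameters below are hypothetical values chosen for illustration.

```python
import math

# The ICC construction sketched as data: five GRM category curves
# (Equations (3) and (5)) over a theta grid, for one hypothetical item.

def cum(theta, alpha, beta):
    """Cumulative 2PL probability P* (Equation (2))."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

alpha, betas = 2.0, [-1.5, -0.5, 0.5, 1.5]          # hypothetical item
grid = [-4.0 + 8.0 * i / 99 for i in range(100)]     # 100 points on [-4, 4]

icc = []   # icc[k] is the trace for category k+1
for k in range(5):
    lower = [cum(t, alpha, betas[k - 1]) if k > 0 else 1.0 for t in grid]
    upper = [cum(t, alpha, betas[k]) if k < 4 else 0.0 for t in grid]
    icc.append([lo - hi for lo, hi in zip(lower, upper)])

# At every grid point the five category probabilities sum to one,
# mirroring the P*_1 = 1 and P*_6 = 0 boundary conditions noted above.
```

Plotting each of the five lists in `icc` against `grid` reproduces the shape of the ICC panels discussed in the Results section.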

2.3.3. For Plotting IICs

The process for depicting the IIC is based on the item information (I):

Ii(θm) = αi² × Pi(θm) × (1 − Pi(θm)) (6)

where αi is the discrimination parameter for item i and Pi(θm) is the probability of endorsing item i for an individual with ability θm. Again, here Pi is based on the two-parameter logistic probability. Based on Equations (6) and (2), an item yields the highest information at the point where θm = βi, i.e., when the item difficulty, βi, matches the individual trait, θm. An IIC is then plotted by connecting the points of Ii(θm).

a. For model parameter estimation, the two parameters are obtained from the matrix of the results, which has both parameters, beta and alpha. The difficulty parameters are given for each category, and each of those category values can be used for creating the IIC.

b. To have an acceptable horizontal axis for the figures, the range of −4 to +4 is created and divided into 100 values. The y axis, on the other hand, highlights probability, being between 0 and 1.

c. For the IIC process, it should be noted that the plots can be viewed as related to the first derivative of the ICCs, peaking at the difficulty values where the item has the highest discrimination, with less information further from the difficulty level.

The process was implemented in the R package ltm [19].
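Equation (6) and the peak-at-difficulty property can be checked numerically on the same −4 to +4, 100-point grid the text describes. The item parameters are hypothetical; the peak of a 2PL information curve is α²/4, reached at θ = β.

```python
import math

# Sketch of Equation (6): item information of a two-parameter item,
# evaluated on the -4..4 grid described above. The curve should peak
# where theta matches the difficulty beta.

def info(theta, alpha, beta):
    p = 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))
    return alpha**2 * p * (1.0 - p)

alpha, beta = 1.69, 0.5                              # hypothetical item
grid = [-4.0 + 8.0 * i / 99 for i in range(100)]
curve = [info(t, alpha, beta) for t in grid]

peak_theta = grid[curve.index(max(curve))]
# The maximum information of a 2PL item is alpha^2 / 4, at theta = beta.
```

Summing such curves across the items of a factor yields the test information function discussed in the Results section.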

3. Sample and Instruments

This section is presented in two subsections. The first outlines the design of the instrument, while the second gives an overview of the instrument's explanatory variables.

3.1. The Instrument

In the instrument, the commuters were asked to indicate to what degree they agreed that they experience various emotional or physical feelings while facing rail transport delay. The questionnaires were distributed to 419 commuters at the Serdang station, one of the main stations of Keretapi Tanah Melayu (KTM) in Malaysia. The surveys were distributed during the off-peak hours from 4 pm to 7 pm to minimize the measured scores. Questionnaires were translated into the local language, Malay, by a Malaysian PhD student in the field of education. The questionnaire had an introduction explaining the objective of the study and outlined the sections of the questionnaire that the commuters would answer. Respondents were also informed that they were under no obligation and could leave the questionnaire blank for any reason.

The questionnaire had 2 main parts: psychological effects (4 questions) and physical effects (14 questions). All questions were based on a 5-point Likert format. It was noted that, due to the behavior of respondents in finishing the task, some of the responses might result in incomplete or biased information retrieval, e.g., choosing the first response alternative or no information retrieval [21]. A proposed solution is to give an alternative of "I do not know" or "undecided" instead of reporting an opinion. As a result, we incorporated in all instrument questions, except for the first part, an alternative of "undecided". Undecided answers could be considered similar to a middle or neutral response [22].

To evaluate the feelings of commuters regarding delay, the respondents were asked in the questionnaire, for instance, "I feel angry when I face delay in KTM", and they indicated their level of agreement on the provided 5-point scale, with 1 representing "strongly agree" and 5 representing "strongly disagree".
The physical section of the instrument was based on the Cohen–Hoberman inventory of physical symptoms (CHIPS) [23], a list of 39 common physical symptoms used to highlight the relationship between negative life stress and various physical symptoms. These included factors such as back pain, diarrhea, and headache. The factors were limited to the 14 items considered most relevant or reasonably related to stress in travel. Also, all the effects were based on previous reviews of the literature.

Various sources were used for the design of the psychological aspects of the questionnaire. A self-reported measure of stress was developed and tested [24]. The scale included various physiological and psychological descriptors. Psychological factors included being angry, nervous, or stressed; some of the physical factors included neck pain and feeling tired. The design of this part of the survey was also similar to a previous study, which was conducted to illustrate the capability of a cognitive–motivational–relational theory for predicting emotions [25]. For that study, 15 different emotion-related items were identified, including negative emotions such as anger, anxiety, sadness, and disgust.

3.2. Data Descriptions

A total of 396 fully completed responses were collected and considered for the analysis. Again, the respondents were asked 5-point Likert questions about feelings that they might experience while facing delay. All questions were on the same scale with a similar direction, from "strongly agree" to "strongly disagree". The scale had the following alternatives: strongly agree (1), agree (2), undecided (3), disagree (4), strongly disagree (5). For instance, based on Table 1, feeling frustrated and angry were some of the feelings that respondents agreed most strongly that they experienced while facing delays of rail transport. Initial examination of the data revealed that, as expected, a significant majority of respondents rated the impact of delay very negatively and in favor of various emotional or physical feelings.

Table 1. Descriptive summary of important factors and response.

Variables                         Mean    Variance   Min   Max
Psychological feelings
B1, being angry                   1.840   0.868      1     5
B2, being sad                     2.50    1.410      1     5
B3, being frustrated              1.874   0.971      1     5
Physical feelings
C8, feeling motion sickness       3.212   1.348      1     5
C13, feeling stomach pain         3.306   1.236      1     5
Physical feelings
C1, neck pain                     2.306   1.292      1     5
C2, headache                      2.669   1.307      1     5
C4, muscle stiffness              2.230   1.220      1     5
C10, back pain                    2.248   1.280      1     5
C11, drawing sensation in body    1.977   1.060      1     5

4. Results

Results are presented in two main subsections. The results of the factor analysis are presented first, followed by the IRT results.

4.1. Factor Analysis

Several checks were conducted after the factor analysis. The first was the amount of variability explained by each factor. The amount of variance explained by the factors was estimated by the sum of squared loadings (SS loadings), or eigenvalues; that was divided by the sum of all variances to come up with the share of each factor. The results indicated that the average communality, or explained variance, is 56%. Based on the Kaiser rule, all the factors are important, as they have eigenvalues (SS) greater than 1.

To check the internal consistency measures of reliability, Cronbach's coefficient alpha and Guttman's lambda-6 were estimated. The internal consistency of the test, Cronbach's alpha, for all the factors was greater than 0.8, and dropping any variable from a factor resulted in a reduction in the Cronbach's alpha value. Also, all values of Guttman's lambda-6 were greater than 0.8, meaning that more than 80% of the variance was due to true score. It should be noted that all of these measures are categorized under CTT and were used to be compared with the IRT measures.
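The CTT index used above can be computed in a few lines. The small response matrix below is fabricated purely for illustration; it is not the study's data.

```python
import numpy as np

# Minimal computation of Cronbach's alpha, the CTT internal-consistency
# index: alpha = (k / (k - 1)) * (1 - sum(item variances) / var(sum score)).
# Rows are respondents, columns are items (fabricated illustrative data).
X = np.array([
    [1, 2, 1, 1],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 4, 5, 5],
    [2, 1, 2, 2],
])
k = X.shape[1]
item_var = X.var(axis=0, ddof=1).sum()   # sum of the item variances
total_var = X.sum(axis=1).var(ddof=1)    # variance of the sum score
alpha = (k / (k - 1)) * (1 - item_var / total_var)
```

Because these fabricated items move together strongly, alpha here comes out above 0.9; note this is a single index for the whole scale, in contrast with the theta-dependent information reported by IRT below.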

4.2. IRT Results

The scale contained 10 items distributed across 3 latent factors; those items were filtered, based on the FA, out of the total of 14 items. After establishing the factorial structure of delay, the difficulty, discrimination, and informativeness of the scale were examined using item response theory (IRT). For the GRM, an α value of >1.0 would be considered highly discriminant [26], while difficulty values would be within a range of −3 to 3. For instance, the item locations for C8 for each boundary were −1.83, −0.86, 0.03, and 1.78, respectively. Item C8 also had the lowest item location for the first latent variable, meaning that the item was the easiest of the attributes in that factor. The results also highlighted that the same item had the lowest discriminative value.

The extremity parameters (β) highlight the latent trait score at which the respondents had a 50% chance of selecting certain responses. For instance, looking at the output for the first boundary of the first item, the value is −1.83. This suggests that respondents with a latent trait score of −1.83 had a 50% chance of selecting the first option for that item. On the other hand, extremity 2 (β2) was −0.86, so those with latent trait scores of −0.86 had a 50% chance of selecting option 1 or option 2 for that item.


4.3. ICC

The figures related to each item in each latent factor (theta) are depicted in Figure 1. It should be noted that the figure interpretations are specific to each latent factor. The ICC was defined as a nonlinear relationship line representing the probability of endorsing an item response category as a function of θ (quantitative trait) [27]. The vertical axis is the probability of observing each response category (1–5).

For instance, C8 and C13 were related to stomach sickness. People with higher experience of those feelings are more likely to respond "strongly agree" and go on to the other questions. As can be seen from Figure 1, for instance, there is a higher proportion of the theta that agree that C8 is a resultant effect of delay on an individual than C13, and therefore, there is a higher area under the curve for the first item.

[Figure 1 panels: C8, C13; C1, C2, C4; C10, C11; B1, B2, B3]

Figure 1. Item category characteristics (ICCs) for various items within different latent factors.





Also, it can be observed that when comparing two items within each factor, items with flatter ICCs correspond to items with lower discrimination factors: for instance, consider C8 (α = 1.69) and C13 (α = 4).

4.4. IIC

The second figure relates to item information curves (IICs). The figures highlight how well each item measures the latent trait at various levels of the attributes. On the other hand, the test information function (TIF) measures how well the test measures the latent trait at different levels of the attributes (reliability). It should be highlighted that each unit of figures is related to the factors discussed in previous sections, in order. For instance, the first set of figures is related to motion sickness.

As can be seen from Figure 2 and Table 2, items with higher discrimination have more information, or area under the curve. For instance, for the first factor, C13 had a discrimination value of 4 versus 1.69 for C8, giving a larger peak in the figure. The same explanation could be made regarding the other figures.

Figure 2. IICs and TIFs based on the GRM model.



Table 2. Item response theory parameter estimates.

ID   Coefficients                      α      β1      β2      β3     β4
Motion sickness (Log.Lik: −1084.502, df = 10)
1    C8, feeling motion sickness       1.69   −1.83   −0.86   0.03   1.78
2    C13, feeling stomach pain         4.00   −1.51   −0.80   −0.02  1.36
Physiological effects of delay (Log.Lik: −2364, df = 25)
3    C1, neck pain                     2.83   −0.61   0.36    1.08   2.00
4    C2, headache                      2.70   −0.72   0.36    0.95   2.09
5    C4, muscle stiffness              2.13   −0.57   0.44    1.28   2.43
6    C10, back pain                    2.38   −0.59   0.52    1.12   2.22
7    C11, drawing sensation in body    2.38   −0.32   0.82    1.37   2.57
Psychological effects of delay (Log.Lik: −1364, df = 15)
8    B1, being angry                   2.29   −0.20   1.14    1.68   2.96
9    B2, being sad                     1.93   −0.98   0.37    1.00   2.07
10   B3, being frustrated              3.70   −0.18   0.99    1.48   2.09

As can be seen from the left part of Figure 2, for the first, second, and third latent factors, most of the information was provided by C13, C1, and B3, respectively. The wider gaps across items in the first and the last latent factors should be highlighted. Also, C10 and C11 provided similar information for the second latent factor. The test information function (TIF) is the sum of the item information across levels of theta; it highlights how well the set of items estimates theta, or ability. The TIF was obtained by aggregating the IICs across all N items as I(θ) = ∑_{i=1}^{N} I_i(θ). This measure is one of the advantages of IRT: the classical method assumes reliability is constant and summarizes it with a single index, e.g., Cronbach's alpha, whereas IRT replaces reliability with information. Based on the test information curves, it can be observed that the questions of the second and third sections mostly provided information for respondents with a high level of the trait: only about 30% of the information for the second factor fell in the negative range of theta, and this value dropped to about 25% for the third factor. From the figures, using item B3 as an example, the amount of information reaches its maximum at an ability level of around 1.5 and is about 3 over the ability range of 0 ≤ θ ≤ 2. Within this range, ability was estimated with some precision; outside it, the amount of information decreases rapidly, and the corresponding ability levels were not estimated very well. In summary, the test information curve, as the sum of the IICs, was also employed to provide a plot depicting where the trait is most discriminating [28]. As can be seen from the TIF, the items of the first latent factor provided information evenly for respondents with low and high ability (−2 to +2), whereas the second and third latent factors provided information mostly for the range of 0–3.
In other words, while the amount of information for ability levels in the interval (−4, 0) was about 50% of the total information for the first latent factor, it was only about 30% and 25% for the second and third latent factors, respectively. Having a small amount of information (IIC) means that ability cannot be estimated with precision and that estimates will be widely scattered about the true ability [26]. However, the type of instrument and its objectives should be kept in mind. To estimate reliability from information, the formula reliability = 1 − (1/information) was used; thus, for instance, reliability values of 0.70 to 0.90 correspond to information values of 3.3 to 10 [29]. It is now worth investigating what makes this reliability low for a few items. To do so, it is reasonable to present a descriptive summary of the items. This description is presented in Table 3. The table provides important information about the possible reasons for the lack of reliability of two of the factors.

Table 3. Description of items, number (%).

Items | Strongly Agree (1) | Agree (2) | Undecided (3) | Disagree (4) | Strongly Disagree (5)
Motion sickness
C8 | 41 (10%) | 68 (17%) | 95 (24%) | 150 (38%) | 42 (11%)
C13 | 34 (9%) | 56 (14%) | 105 (26%) | 157 (40%) | 44 (11%)
Physiological effects
C1 | 113 (29%) | 132 (33%) | 85 (22%) | 49 (12%) | 17 (4%)
C2 | 101 (26%) | 143 (36%) | 73 (18%) | 63 (16%) | 16 (4%)
C4 | 122 (31%) | 132 (33%) | 84 (21%) | 45 (12%) | 11 (3%)
C10 | 118 (30%) | 145 (37%) | 65 (16%) | 53 (13%) | 15 (4%)
C11 | 153 (39%) | 152 (38%) | 46 (12%) | 37 (9%) | 8 (2%)
Psychological effects
B1 | 168 (42%) | 159 (40%) | 37 (10%) | 28 (7%) | 4 (1%)
B2 | 86 (22%) | 144 (37%) | 73 (18%) | 68 (17%) | 25 (6%)
B3 | 165 (42%) | 160 (40%) | 38 (10%) | 22 (6%) | 11 (2%)

The plots highlighted that the information for theta was reliable for the first factor. Table 3 confirms that the distribution across the two beliefs (agree versus disagree) was even for C8 and C13. However, looking at the second and third classes, it can be observed that for all items within these classes, a significant majority of the respondents agreed with the feelings. For instance, for B1 (angry) and B3 (frustrated), only 8% of the respondents disagreed or strongly disagreed, i.e., stated that they did not experience those feelings. That is why the plots highlight a low reliability for theta greater than 0. Although the low percentage of responses using the disagree options is expected, given the severe shortcomings of the transport system due to delay, several recommendations can be proposed to address the lack of reliability for positive theta: (1) collecting more data so that more observations with positive beliefs can be obtained; (2) adjusting the questionnaire options by removing 1 (strongly agree) and 5 (strongly disagree); (3) rewording the questions and reevaluating the instrument.
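As a numerical check of the reliability = 1 − (1/information) conversion used above, the following sketch reproduces the quoted correspondence between reliability values of 0.70–0.90 and information values of roughly 3.3–10 [29].

```python
def reliability(information):
    """Convert test information to an IRT-based reliability estimate."""
    return 1.0 - 1.0 / information

# Information of ~3.3 gives reliability near the conventional 0.70
# floor; information of 10 gives 0.90, the upper end of the band
# cited from Nunnally [29].
print(round(reliability(3.3), 2))
print(round(reliability(10.0), 2))
```

The same conversion explains why the flat, low TIF regions for positive theta in the second and third factors translate into poor reliability there.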

4.5. Person Fit Assessment

For model fit assessment, the standardized item residual was also used [30]. This was calculated as (O_i − E_i)/√E_i, where O_i and E_i are the observed and expected values. We inspected the constructed residuals for all factors. The results indicated that, except for a few misfit observations, all items fit well. The residuals capture the discrepancy between the observed frequencies and those predicted by the estimated model. Generally, residuals greater than 3.5 are considered an indication of poor fit [19]. For the first latent factor, comprising items C8 and C13, there were 3 individuals who chose 1 (strongly agree) for C8 while also choosing strongly disagree for C13, resulting in a residual greater than 3.5. For the second latent factor, including items C1, C2, C4, C10, and C11, all the residuals were less than 3. For the last latent factor, a few respondents were highlighted as having residuals greater than 3.5. For instance, some respondents chose option 3, undecided, for all the items, highlighting a possible lack of interest. Also, a few respondents chose 1, 5, and 5 for B1, B2, and B3, respectively. Based on the above discussion, we expect that those respondents chose their answers randomly, or that those types of responses reflect some unique behaviors that might be better ignored. For instance, some respondents selected the "undecided" choice for all the items in the third class, which might indicate a lack of interest. On the other hand, some respondents who strongly agreed with one item indicated that they strongly disagreed with other, closely related items in the same class. A few recommendations can be proposed to address those issues. Based on the researchers' judgment and the results, those observations could be removed from the dataset. Another option is to collect more observations so that more respondents with those opinions could possibly be captured. Again, we believe that the model did a very good job in identifying those outliers, and we speculate that those answers were due to a lack of interest.
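A minimal sketch of the standardized residual check is given below. The expected count (0.4) is a hypothetical illustration, not a value reported in the study; it is chosen so that three observed respondents with a contradictory pattern yield a residual above the 3.5 cut-off [19].

```python
import math

def standardized_residual(observed, expected):
    """(O - E) / sqrt(E): flags response patterns the model predicts poorly."""
    return (observed - expected) / math.sqrt(expected)

# Hypothetical example: the fitted model expects ~0.4 respondents to
# combine "strongly agree" on C8 with "strongly disagree" on C13,
# but 3 such respondents were observed.
resid = standardized_residual(3, 0.4)
flagged = abs(resid) > 3.5          # cut-off used in the text [19]
print(round(resid, 2), flagged)
```

Patterns exceeding the cut-off are candidates for removal or follow-up, as discussed above.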

5. Discussion

Extensive efforts have been made in designing instruments for measuring the various impacts of stressors. However, a main issue with those instruments is that there is no reasonable consensus regarding their appropriateness. CTT has been used extensively, with the help of measures such as Cronbach's alpha, to assess reliability. However, one of the shortcomings of CTT is that it assumes a common reliability estimate across all individuals irrespective of their trait levels. Additionally, CTT, because of its score-summing process, disregards the underlying nature of the data. Our main objective in conducting this study was to check whether our instrument has the properties needed to measure the perceived feelings of commuters. Various measures, such as statistical summaries of the GRM, ICCs, IICs, and residuals, were used to check our instrument. Our instrument has two main subsections measuring various physiological and psychological effects of delay on commuters. However, the instrument was divided into three subsections by factor analysis, and a few items were dropped due to insufficient loadings. While CTT did not indicate any issues with our instrument in terms of reliability, IRT did suggest some shortcomings. The ICCs provided insights about the discrimination and difficulty of the individual items of our instrument, while the IICs highlighted the amount of information carried by each item in a latent class. The test information curve highlighted that two of the factors did not have enough reliability for positive values of theta (ability). To understand the underlying causes of that issue, we provided a frequency table of all respondents' answers. We found that the lack of reliability resulted from the respondents agreeing with the feelings for most of the items in those factors, thus leaving almost no disagreement with the questions.
On the other hand, residuals were used to highlight outlier responses that might have been answered by chance or out of a lack of interest. The residuals highlighted important observations that, based on our experience, were very likely answered by chance and needed to be removed from our dataset, e.g., when respondents chose "undecided" for all items of a factor. One limitation of this study could be the limited number of observations. For instance, a value of 500 has been suggested as the sample size for providing accurate parameter estimates [31]. However, it has also been argued that there is no golden rule for the required sample size [32]. In addition, while 500 was recommended as ideal for accurate parameter estimation of the GRM, the model could still be estimated successfully with 250 respondents [33]. Although this sample size imposed some limitations, such as the inability to model disagreement with various items in our instrument, the analysis provided important insights into the performance of our instrument. Because the studied system is notorious for long delays, which result in commuter dissatisfaction, it is expected that increasing the sample size would still not allow the collection of a sufficient number of observations from those choosing the "strongly disagree" options for items B1 and B3. However, the recommendations discussed above could be implemented to address the lack of reliability.

In summary, despite the approval of our instrument through the CTT method, IRT was implemented successfully and challenged our instrument by providing important information necessary for its improvement. Collecting more observations, rewording some of the items with extreme opinions, and removing outliers are recommended for future studies. More studies are needed on the use of the instrument for other aspects of transport associated with the included commuters' feelings. It has been highlighted that the instrument could achieve good performance across all items by applying minor corrections. Regarding the remaining items that were not included in any class, adding more related items or adjusting the included items is recommended.

Author Contributions: F.R.F.: supervision; C.V.: writing—review and editing, visualization; K.C.: writing—review and editing, visualization; M.R.: methodology. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Institutional Review Board Statement: The study was conducted according to the guidelines of the Declaration of Helsinki.

Informed Consent Statement: Not applicable.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Monchambert, G.; de Palma, A. Public transport reliability and commuter strategy. J. Urban Econ. 2014, 81, 14–29. [CrossRef]
2. Rezapour, M.; Ferraro, F.R. Rail transport delay and its effects on perceived importance of a real-time information. Front. Psychol. 2021, 12, 1863. [CrossRef]
3. Zaidman-Zait, A.; Mirenda, P.; Zumbo, B.D.; Wellington, S.; Dua, V.; Kalynchuk, K. An item response theory analysis of the Parenting Stress Index-Short Form with parents of children with autism spectrum disorders. J. Child Psychol. Psychiatry 2010, 51, 1269–1277. [CrossRef]
4. Tsutsumi, A.; Iwata, N.; Watanabe, N.; de Jonge, J.; Pikhart, H.; Fernández-lópez, J.A.; Xu, L.; Peter, R.; Knutsson, A.; Niedhammer, I. Application of item response theory to achieve cross-cultural comparability of occupational stress measurement. Int. J. Methods Psychiatr. Res. 2009, 18, 58–67. [CrossRef] [PubMed]
5. Borders, A.E.; Lai, J.; Wolfe, K.; Qadir, S.; Peng, J.; Kim, K.; Keenan-Devlin, L.; Holl, J.; Grobman, W. Using item response theory to optimize measurement of chronic stress in pregnancy. Soc. Sci. Res. 2017, 64, 214–225. [CrossRef]
6. Schlechter, P.; Wilkinson, P.O.; Knausenberger, J.; Wanninger, K.; Kamp, S.; Morina, N.; Hellmann, J.H. Depressive and anxiety symptoms in refugees: Insights from classical test theory, item response theory and network analysis. Clin. Psychol. Psychother. 2021, 28, 169–181. [CrossRef] [PubMed]
7. McCarron, S.P.; Kuperman, V. Is the author recognition test a useful metric for native and non-native English speakers? An item response theory analysis. Behav. Res. Methods 2021, 1–12. [CrossRef]
8. Williams, Z.J.; Gotham, K.O. Assessing general and autism-relevant quality of life in autistic adults: A psychometric investigation using item response theory. Autism Res. 2021. [CrossRef] [PubMed]
9. Calderón, C.; Beyle, C.; Véliz-García, O.; Bekios-Calfa, J. Psychometric properties of Addenbrooke's Cognitive Examination III (ACE-III): An item response theory approach. PLoS ONE 2021, 16, e0251137. [CrossRef]
10. Silvia, P.J. The self-reflection and insight scale: Applying item response theory to craft an efficient short form. Curr. Psychol. 2021, 1–11. [CrossRef]
11. Goldstein, B.L.; Briggs-Gowan, M.J.; Greene, C.C.; Chang, R.; Grasso, D.J. An Item Response Theory examination of the original and short forms of the Difficulties in Emotion Regulation Scale (DERS) in pregnant women. J. Clin. Psychol. 2021. [CrossRef] [PubMed]
12. Kimata, A.; Kumagai, K.; Kondo, N.; Adachi, K.; Fujita, R.; Tsuchiya, M. Development and validation of the Cancer Knowledge Scale for the general population: An item response theory approach. Patient Educ. Couns. 2021. [CrossRef] [PubMed]
13. Wener, R.; Evans, G.W.; Boately, P. Commuting stress: Psychophysiological effects of a trip and spillover into the workplace. Transp. Res. Rec. J. Transp. Res. Board 2005, 1924, 112–117. [CrossRef]
14. Wener, R.E.; Evans, G.W.; Phillips, D.; Nadler, N. Running for the 7:45: The effects of public transit improvements on commuter stress. Transportation 2003, 30, 203–220. [CrossRef]
15. Cheng, Y. Exploring passenger anxiety associated with train travel. Transportation 2010, 37, 875–896. [CrossRef]
16. Browne, M.W.; Arminger, G. Specification and estimation of mean- and covariance-structure models. In Handbook of Statistical Modeling for the Social and Behavioral Sciences; Springer: Berlin/Heidelberg, Germany, 1995; pp. 185–249.
17. Baker, F.B.; Kim, S. Item Response Theory: Parameter Estimation Techniques; CRC Press: Boca Raton, FL, USA, 2004.
18. Pinheiro, J.C.; Bates, D.M. Approximations to the log-likelihood function in the nonlinear mixed-effects model. J. Comput. Graph. Stat. 1995, 4, 12–35.
19. Rizopoulos, D. ltm: An R package for latent variable modeling and item response theory analyses. J. Stat. Softw. 2006, 17, 1–25. [CrossRef]
20. Embretson, S.E.; Reise, S.P. Item Response Theory; Psychology Press: London, UK, 2013.
21. Krosnick, J.A. Response strategies for coping with the cognitive demands of attitude measures in surveys. Appl. Cogn. Psychol. 1991, 5, 213–236. [CrossRef]
22. Groothuis, P.A.; Whitehead, J.C. Does don't know mean no? Analysis of 'don't know' responses in dichotomous choice contingent valuation questions. Appl. Econ. 2002, 34, 1935–1940. [CrossRef]
23. Cohen, S.; Hoberman, H.M. Positive events and social supports as buffers of life change stress. J. Appl. Soc. Psychol. 1983, 13, 99–125. [CrossRef]
24. Greller, M.; Parsons, C.K. Psychosomatic complaints scale of stress: Measure development and psychometric properties. Educ. Psychol. Meas. 1988, 48, 1051–1065. [CrossRef]
25. Lazarus, R.S. Progress on a cognitive-motivational-relational theory of emotion. Am. Psychol. 1991, 46, 819. [CrossRef] [PubMed]
26. Baker, F.B. The Basics of Item Response Theory; ERIC, 2001. Available online: https://www.ime.unicamp.br/~cnaber/Baker_Book.pdf (accessed on 3 June 2021).
27. Fraley, R.C.; Waller, N.G.; Brennan, K.A. An item response theory analysis of self-report measures of adult attachment. J. Pers. Soc. Psychol. 2000, 78, 350. [CrossRef] [PubMed]
28. Reise, S.P.; Waller, N.G. Item Response Theory for Dichotomous Assessment Data. In Measuring and Analyzing Behavior in Organizations: Advances in Measurement and Data Analysis; Drasgow, F., Schmitt, N., Eds.; Jossey-Bass: Hoboken, NJ, USA, 2002; pp. 88–122.
29. Nunnally, J.C. Psychometric Theory, 3rd ed.; Tata McGraw-Hill Education: New York, NY, USA, 1994.
30. Hambleton, R.K.; Swaminathan, H.; Rogers, H.J. Fundamentals of Item Response Theory; Sage: Thousand Oaks, CA, USA, 1991.
31. Jiang, S.; Wang, C.; Weiss, D.J. Sample size requirements for estimation of item parameters in the multidimensional graded response model. Front. Psychol. 2016, 7, 109. [CrossRef] [PubMed]
32. Morizot, J.; Ainsworth, A.T.; Reise, S.P. Toward modern . In Handbook of Research Methods in Personality Psychology; Guilford Press: New York, NY, USA, 2009; pp. 407–423.
33. Reeve, B.B.; Fayers, P. Applying item response theory modeling for evaluating questionnaire item and scale properties. Assess. Qual. Life Clin. Trials Methods Pract. 2005, 2, 55–73.