

Item grouping and item randomization in personality measurement

Kraig L. Schell (Department of Psychology, Angelo State University, United States)
Frederick L. Oswald (Department of Psychology, Rice University, United States)

Corresponding author: Kraig L. Schell, Box 10907 ASU Station, Angelo State University, San Angelo, TX 76909, United States. Tel.: +1 325 486 6128; fax: +1 325 942 2290. E-mail address: [email protected]

Article history: Received 22 October 2012; Received in revised form 6 March 2013; Accepted 12 March 2013; Available online 4 April 2013.

Abstract

This paper describes a study examining the impact of item order in personality measurement on reliability, measurement equivalence and scale-level correlations. A large sample of university students completed one of three forms of the International Personality Item Pool version of the Big Five personality inventory: items sorted at random, items sorted by factor, and items cycled through factors. Results showed that the underlying measurement model and the internal consistency of the IPIP Big Five scales were unaffected by differences in item order. Also, most of the scale-level correlations among factors were not significantly different across forms. Implications for the administration of tests and interpretation of test scores are discussed, and future research directions are offered.

Keywords: Personality; Measurement; Reliability; Item order

1. Introduction

The history of item-order research dates back over half a century. One of the earliest studies on the effects of item order, using a sample of unionized workers, employed an attitude instrument with 12 subscales (Kirchner & Uphoff, 1955). The analysis reported indicated that 11 of those 12 scale means did not differ significantly between tests with items that were sequenced in two different ways: items grouped by factor vs. items that were randomized. The sample sizes were small (the maximum N per group was 21), and of course, simple mean comparisons may have missed important relationships in these data; moreover, mean differences do not speak to the reliability or equivalence of each measure's internal structure. Nevertheless, this work represents an early contribution to the historical conclusion that item order is not a critical issue in the administration of self-report instruments.

Subsequent studies on the item-order effect (or a lack of one) do not appear in the research literature until much later, in the 1980s. The first of these studies (Schriesheim, 1981) tested a small sample of department store employees on two recognized leadership questionnaires, examining items that were grouped by construct vs. randomized (N = 40 in each of two groups, for each of two studies). Both questionnaires were designed to rate the employees' immediate supervisors on leadership behaviors, and item order was targeted as a possible source of leniency bias in those ratings. Although the evidence reported suggested a decrease in leniency bias for the ungrouped instruments, the findings were modest at best, again casting doubt on the actual impact of item sequencing.

Other studies provide evidence in support of item order effects. Across large samples of public school supervisors and administrators, Schurr and Henriksen (1983) reported significant effects of item order on the psychometric properties of a 61-item instrument that was presented in three forms: ungrouped, grouped but not labeled by factor, and grouped and labeled by factor. However, the instrument in question was an item checklist containing skills and activities that could serve as training topics in the organization sampled, and so the items under any given "factor" were quite diverse. These effects may not generalize to measures of latent psychological constructs such as personality traits. Another study conducted soon thereafter used self-report attitude measures, such as job and family satisfaction, and the authors reported that internal consistency statistics varied significantly between grouped-and-labeled, grouped-and-unlabeled, and ungrouped forms of administration (Solomon & Kopelman, 1984). Their data showed that the grouped-and-unlabeled form was associated with the highest Cronbach's alpha, whereas the randomized form was associated with the lowest, and these differences were statistically reliable. These two studies renewed interest in the issue of item order as a potential confound in psychological measurement. Finally, two other studies (Schriesheim, Kopelman, & Solomon, 1989; Schriesheim, Solomon, & Kopelman, 1989) examined item order effects in attitude measures from multiple perspectives: internal consistency, convergent/discriminant validity, and confirmatory factor structures.

In general, their data showed that the effect of item order was likely to be stronger in measures that were not as strong psychometrically compared to those that were. This might make conceptual sense: when the construct is not as salient or strong to the test-taker, factors such as unique content specific to items and item ordering have greater influences on test responses. Both studies also suggested that the grouped-item format was superior to random-item formats in most cases, although the authors do admit that the practical significance of item order on reliability is likely to be rather low (Schriesheim, Kopelman, et al., 1989; Schriesheim, Solomon, et al., 1989), and we admit that it is a relatively open question whether results in this research domain are replicable versus being more strongly tied to particular samples, measures and settings.

In line with this sentiment, it should be noted that not everyone studying item order effects at that time was drawing the same conclusions. In two papers, Knowles and colleagues argued that item order was a significant influence on the item-total correlations and alpha reliability in a personality instrument (Knowles, 1988; Knowles & Byers, 1996). Steinberg (1994) also suggested that item-trait correlations could be systematically affected by the grouping of items by each underlying factor that the measure is intended to reflect.

Finally, more recent studies have been conducted as well, using newer analysis techniques. For example, Franke (1997) showed evidence that grouping the items by their underlying factor negatively affected the internal consistency and changed mean values for subscales in a multi-factor instrument. She suggests that this occurs because the test had been validated using a particular item sequence, and as such, that fact had become part of the psychometric properties of the instrument itself. This is an interesting general point: validation evidence for a measure accumulated over time may be in part a function of the item order established for the measure. A more recent study (Frantom, Green, & Lam, 2002) seemed to indirectly support this idea with a 40-item opinion survey, arguing that item grouping by factor, analyzed with a Rasch measurement model approach, significantly increased person separation, making the items reflecting a particular factor more "sample-free."

1.1. Study rationale and hypotheses

Based on these studies and conclusions, we elected to use a well-established psychological measure of the Big Five personality factors, creating three different forms of the instrument that differed in item sequencing. These three item order strategies were: random item order, items grouped by factor, and an item/factor rotation strategy ("cycling"), where the items were grouped by factor and then selected one item at a time, rotating between the five factors in the same sequence. These three forms varied in terms of how obvious the grouping pattern was likely to be to participants, because we wanted to manipulate how well participants could figure out the thematic elements underlying the items, leading to either (a) a systematic bias due to self-presentation effects such as social desirability or self-monitoring, or (b) improved responses as a result of a clear understanding of what the items are attempting to measure. Also, one of our grouping strategies (cycling) had, to the best of our knowledge, yet to be tested for item order effects. Finally, we chose a personality measure because most previous studies (with the exception of Knowles, 1988) used either behavioral description items or attitude measures, so this study tests whether prior findings generalize to a different type of psychological construct.

Our hypotheses were based on the plausible possibility that, as the pattern of item sequencing makes the latent factors more transparent (factor grouping being the most transparent and random assortment being the least), the psychometric properties of the instrument should improve if the structural features of the test are an important part of the measurement model underlying it. Thus, we expected that obvious non-random sequencing (items grouped by factor) should increase the internal consistency of the instrument, since item grouping should encourage an understanding of their shared properties, an idea in line with prior studies (Melnick, 1993; Smith, 1983). Based on Schriesheim, Kopelman, et al. (1989), we also hypothesized that confirmatory factor analysis would reveal unequal factor loadings (metric non-invariance) between comparable items at the subscale level across the three forms. Third, we expected that observed scale-level correlations should be significantly different (i.e., show heterogeneity) across form types. Specifically, non-random item groupings should lead to lower correlations between factors, because the superficial differences between the items should be perceived more readily by respondents. To our knowledge, no study has directly compared scale-level correlations between groups exposed to the same instrument but with different item sequencing.

2. Method

2.1. Participants

Three hundred ninety-seven participants (129 men, 268 women) volunteered to complete the study. The average age for the sample was 21.4 years (SD = 6.3). All were students from a mid-sized university in the American southwest who completed the study in exchange for course credit.

2.2. Materials

The 50-item International Personality Item Pool (IPIP) version of the Big Five personality instrument was chosen for this study (http://ipip.ori.org). As portrayed on the IPIP website, the IPIP instrument has consistently been shown to be a valid and reliable measure of the Big Five traits. None of the form types described below reflects the "standard" item sequencing for this instrument, because no such standard exists of which we are aware. The items were not modified in any way, but their sequencing was adjusted depending on the form type:

Form C ("cycled") was constructed in two steps. First, the items were grouped according to their underlying scale membership as established in the development of the IPIP personality measure. Next, items were chosen using a rotating selection strategy, so that every fifth item belonged to the same factor. For example, items 1, 6, 11, and so on belonged to Extraversion; items 2, 7, 12, and so on belonged to Emotional Stability; items 3, 8, 13, and so on belonged to Openness; items 4, 9, 14, and so on belonged to Agreeableness; and items 5, 10, 15, and so on belonged to Conscientiousness.

Form R ("random") was constructed by randomly selecting items across all five traits. The same random assortment was used for all participants assigned to this experimental condition.

Form G ("grouped") was constructed by clustering items together according to their underlying scale membership. The order of factors was the same as for the cycled version: Extraversion, Emotional Stability, Openness, Agreeableness, and Conscientiousness.
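To make the three sequencing strategies concrete, the sketch below shows one way the grouped, cycled, and one-time-random orders could be generated from a factor-keyed item list. It is an illustration under our own assumptions: the placeholder item labels, the items_by_factor dictionary, and the fixed random seed are not part of the published materials.

```python
import random

# Hypothetical item labels: 10 per factor, listed in the factor order used
# for Forms C and G (Extraversion, Emotional Stability, Openness,
# Agreeableness, Conscientiousness). A real administration would use the
# actual IPIP item text.
FACTOR_ORDER = ["E", "ES", "O", "A", "C"]
items_by_factor = {f: [f"{f}{i}" for i in range(1, 11)] for f in FACTOR_ORDER}

def form_g():
    """Grouped form: all ten items of a factor are presented together."""
    return [item for f in FACTOR_ORDER for item in items_by_factor[f]]

def form_c():
    """Cycled form: rotate through the five factors one item at a time,
    so every fifth item belongs to the same factor (items 1, 6, 11, ...
    are Extraversion; items 2, 7, 12, ... are Emotional Stability; etc.)."""
    return [items_by_factor[f][i] for i in range(10) for f in FACTOR_ORDER]

def form_r(seed=2012):
    """Random form: a single fixed shuffle shown to every respondent."""
    order = form_g()
    random.Random(seed).shuffle(order)
    return order

print(form_c()[:10])  # ['E1', 'ES1', 'O1', 'A1', 'C1', 'E2', 'ES2', 'O2', 'A2', 'C2']
```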
2.3. Procedures

All three forms were constructed as online Web forms using a commercial survey site (http://www.survs.com). This site is capable of a professional survey presentation with fully automated data storage and formatting. Participants completed the surveys in a 10-station lab, supervised by lab staff. Upon arrival at the testing location, participants were randomly assigned to an Internet-capable computer to complete one of the three forms. Participants provided informed consent electronically and then completed the questionnaire that they were assigned. A debriefing message concluded the experimental session.

3. Results and discussion

3.1. Differences in internal consistency

Scale means, bivariate correlations and internal consistency reliability coefficients are displayed in Tables 1 and 2, separated by form type (Form R, Form C, or Form G). Our first hypothesis concerned differences in the internal consistency of each form type, and these statistics are located in Table 1. In addition, an average Cronbach's alpha was calculated for the entire 50-item instrument for each form type (Form G = 0.808; Form C = 0.803; Form R = 0.806). The scale means in Table 1 showed very little discrepancy between the three item sequences, and the reliability statistics also were almost identical. Because our hypotheses focused in part on the differences in internal consistency between the three forms, we applied the Hakstian–Whalen test for the equality of k sample alphas (Hakstian & Whalen, 1976). This analysis revealed no significant differences in internal consistency; therefore, the data failed to support our first hypothesis.

Table 1
Scale means, standard deviations and internal consistencies for Big Five factors on the IPIP, separated by item order type.

Form/measure    ES      EXT     OP      AGR     CN
Form G
  M             24.71   34.62   34.56   37.48   37.03
  SD            6.24    7.36    5.58    4.91    6.32
  α             0.82    0.90    0.72    0.75    0.86
Form C
  M             25.12   34.79   34.84   36.81   36.11
  SD             6.36    6.69    5.41    5.38    6.17
  α             0.82    0.86    0.72    0.78    0.84
Form R
  M             24.46   34.40   34.33   36.92   36.04
  SD            6.17    7.14    5.59    4.87    5.51
  α             0.83    0.87    0.75    0.75    0.84

Note: Form G (n = 113). Form C (n = 135). Form R (n = 149). ES = Emotional Stability, EXT = Extraversion, OP = Openness, AGR = Agreeableness, CN = Conscientiousness.

Table 2
Bivariate correlations between all pairs of Big Five personality traits, separated by form type.

Correlated pair    Form G    Form C    Form R    V
r(ES, E)           0.307     0.173     0.379     3.53
r(ES, O)           0.123     0.233     0.066     6.55*
r(ES, A)           0.302     0.416     0.494     3.31
r(ES, C)           0.279     0.322     0.412     1.57
r(E, O)            0.157     0.196     0.069     1.22
r(E, A)            0.078     0.031     0.157     1.15
r(E, C)            0.038     0.119     0.279     4.19
r(O, A)            0.065     0.143     0.100     0.38
r(O, C)            0.047     0.238     0.168     2.32
r(A, C)            0.287     0.409     0.478     3.19
|Mean|             0.168     0.228     0.260     0.58

Note: Form G (n = 113). Form C (n = 135). Form R (n = 149). ES = Emotional Stability. E = Extraversion. O = Openness. A = Agreeableness. C = Conscientiousness. |Mean| = mean of absolute values of correlations in that column. * = significant at p < .05.
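The alpha coefficients in Table 1 are ordinary Cronbach's alphas computed separately within each form. For readers reproducing them from raw item responses, a minimal sketch is given below; the respondents-by-items array and the simulated data are our own stand-ins, not the study's data, and the Hakstian–Whalen comparison of the three alphas is not reimplemented here.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) matrix of scored responses.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score).
    Reverse-keyed items are assumed to have been recoded before this is called.
    """
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Simulated stand-in for one 10-item scale from one form (not real data).
rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))
items = trait + rng.normal(scale=1.5, size=(200, 10))
print(round(cronbach_alpha(items), 2))
```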
3.2. Test of measurement invariance

Our second hypothesis concerned testing for measurement invariance across the three types of item sequencing. Structural equation modeling was used to assess this hypothesis using Amos (version 18). Data were checked for assumptions (i.e., multivariate normality) and were found to conform to expectations. Summary statistics from maximum likelihood estimation are shown in Table 3. Because the available sample was relatively small compared to recommended sample sizes for testing multi-group measurement invariance, we elected to test each individual trait separately rather than create a measurement model correlating all five traits with one another.

For all five factors, chi-square values for the unconstrained and constrained path models were compared. These comparisons were evaluated at 18 df and p = .05. All five comparisons are shown in Table 3, indicating that metric invariance (equal factor loadings between forms) was supported in all cases (i.e., all chi-square difference values were non-significant, p > .05). Based on the recommendations of Hu and Bentler (1999), fit statistics were also adequate for all five factors, as shown in Table 4, supporting the stability of the measurement model underlying each factor. Therefore, our second hypothesis regarding measurement non-equivalence was not supported.

Table 3
Summary statistics for measurement equivalence analyses, separated by factor.

Factor                Model                 χ²       df
Conscientiousness     Unconstrained         333.0    123
                      Groups held equal     304.2    105
Emotional Stability   Unconstrained         492.9    123
                      Groups held equal     474.3    105
Extraversion          Unconstrained         282.0    123
                      Groups held equal     258.3    105
Openness              Unconstrained         520.7    123
                      Groups held equal     500.2    105
Agreeableness         Unconstrained         339.2    123
                      Groups held equal     314.9    105

Note: All χ² differences between models within each factor were non-significant at the .05 level.

Table 4
Fit statistics for measurement equivalence analyses, separated by factor.

Factor                CFI      NFI      RMSEA
Emotional Stability   0.959    0.939    0.068
Extraversion          0.941    0.921    0.081
Openness              0.939    0.910    0.069
Agreeableness         0.932    0.897    0.065
Conscientiousness     0.966    0.938    0.053

Note: Fit statistics calculated for the entire sample (N = 397).
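The decisive quantity in Table 3 is the difference between each pair of nested models: 18 degrees of freedom separate the unconstrained and equal-loadings solutions, and the chi-square difference is referred to that distribution. A minimal sketch of that final step, using the Conscientiousness values as printed in Table 3, is shown below; the multi-group estimation itself was done in Amos in the original study and is not reproduced here.

```python
from scipy.stats import chi2

def chi_square_difference(chisq_1, df_1, chisq_2, df_2):
    """p-value for the chi-square difference between two nested models."""
    delta_chisq = abs(chisq_1 - chisq_2)
    delta_df = abs(df_1 - df_2)
    return delta_chisq, delta_df, chi2.sf(delta_chisq, delta_df)

# Conscientiousness, Table 3: unconstrained vs. groups held equal.
delta_chisq, delta_df, p = chi_square_difference(333.0, 123, 304.2, 105)
print(f"delta chi-square = {delta_chisq:.1f} on {delta_df} df, p = {p:.3f}")
# delta chi-square = 28.8 on 18 df; p falls just above .05, i.e., non-significant.
```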

3.3. Comparison of scale-level correlations

Finally, bivariate correlations among the five factors were examined as a function of the three different item sequences. Based on Table 2, it appeared that several correlations among factors were quite different depending on the item order that was used. Results for analyses comparing these correlations are shown in Table 2 in the column labeled "V"; V is a statistic based on the logic of Fisher's Z test and is intended to test the null hypothesis that the population correlation is equal across samples (Hays, 1994). It is also mathematically equivalent to a commonly used statistic in meta-analysis, Q (e.g., Borenstein, Hedges, Higgins, & Rothstein, 2010; Cooper, 2009; Hunter & Schmidt, 2004; Lipsey & Wilson, 2001). The V statistic is essentially a sum-of-squares value that is calculated according to the formula below and evaluated against the χ² distribution:

V = Σ_j (n_j − 3)(Z_j − U)²,

where the sum runs over the j sample correlations to be tested and U is the mean of the Fisher's Z values for those correlations, weighted by the inverse of their respective error variances:

U = Σ_j (n_j − 3) Z_j / Σ_j (n_j − 3).

Note that for all possible pairwise correlations between personality factors, as well as for the average correlations, the results showed that only one was significantly different across the three forms. While it is clear from Table 2 that the average correlation for Form R (items randomly assorted) was about 0.1 units higher than the average correlation for Form G (items arranged by factor), this observed difference was not statistically reliable. Therefore, our third hypothesis was not supported.
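As a worked check on the V statistic defined above, the sketch below applies Fisher's Z transformation to one row of Table 2 and forms the weighted sum of squares. The function and variable names are ours, and we assume the conventional k − 1 degrees of freedom (2 here, for three forms) when converting V to a p-value.

```python
import numpy as np
from scipy.stats import chi2

def v_statistic(correlations, group_sizes):
    """Heterogeneity of k independent correlations (Hays, 1994).

    V = sum_j (n_j - 3) * (Z_j - U)^2, where Z_j is Fisher's Z of the j-th
    correlation and U is the (n_j - 3)-weighted mean of the Z_j values.
    """
    z = np.arctanh(np.asarray(correlations, dtype=float))
    w = np.asarray(group_sizes, dtype=float) - 3.0
    u = np.sum(w * z) / np.sum(w)
    v = np.sum(w * (z - u) ** 2)
    p = chi2.sf(v, df=len(correlations) - 1)
    return v, p

# r(ES, E) from Table 2: Form G (n = 113), Form C (n = 135), Form R (n = 149).
v, p = v_statistic([0.307, 0.173, 0.379], [113, 135, 149])
print(f"V = {v:.2f}, p = {p:.3f}")  # V = 3.53, matching Table 2; not significant
```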
3.4. General discussion

This study focused on the effects of item ordering on a psychometric instrument. Specifically, we examined the similarities and differences in internal consistency and scale-level correlations across test forms that varied in their item ordering. Our data do not show an effect of item order on the internal consistency or measurement invariance of the instrument that we studied. An analysis of scale-level correlations revealed that observed variations were not significantly different either, even though the overall magnitude of correlation was weakest in the grouped-by-factor condition, a finding in line with prior research (Steinberg, 1994) but contradictory to other studies (Rost & Hoberg, 1997; Schriesheim, Kopelman, et al., 1989; Schriesheim, Solomon, et al., 1989). Thus, this study stands as evidence that item order does not affect the underlying measurement properties of psychological instruments, provided that those instruments are similar to the one used here (i.e., a well-validated instrument that targets well-defined constructs). The informed reader might consider the literature on computerized adaptive personality testing (CAPT; see Butcher, 2013, for a review) to be germane to our investigation as well. While it is clear that our findings should make CAPT users feel more at ease, there is one important difference to consider. Static versions of personality instruments cannot adapt dynamically to the responses that users input. Thus, CAPT provides a more thorough counterbalancing of order effects than static testing does, which should logically reduce or even eliminate any order effects that could exist. Our investigation is relevant despite the CAPT literature because order effects are more likely to emerge (if they exist empirically) in situations where the instrument cannot adapt to the user in real time.

We must note that we did not investigate metric invariance across the measurement model that correlates all the factors with one another, due to sample size concerns. We do believe that our sample size was large enough to provide acceptable power at the level of analysis that we employed, so we are not overly concerned that our data are actually indicative of a Type II error rather than a substantive finding. Nevertheless, a replication of this work with a larger sample would be useful as clarification of our results.

We observed that the absolute magnitude of the correlations between personality factors was stronger (although not significantly so) when items were randomly assorted than when they were grouped by factor. This trend was in line with what we expected, despite the lack of statistical significance, and we believe that one likely explanation is that respondents are able to infer a factor structure on the instrument when similar items are grouped together. In other words, the test-taker may be able to "draw the lines" separating questions into groups in their mind as they complete the instrument. When the items are random, this structure is more opaque and the instrument may be seen as more unitary. This is a different sort of thinking than may be found in other literature (Schriesheim, Kopelman, et al., 1989; Schriesheim, Solomon, et al., 1989), but it makes sense to incorporate the metacognitive experiences of test-takers, as those experiences could be imposed upon the instrument, accurately or not, as it is completed.

One question that this study did not address was whether item order might interact with the reverse-scoring of some items. All scales that we used contained 50% reverse-scored items with respect to their intended construct (i.e., each factor was measured using 10 items, 5 of which were reversed). Yet there is some evidence that reverse-scored items may form their own factors, lead to errors in responding, and potentially interfere with response tendencies on personality scales (e.g., Chen, Rendina-Gobioff, & Dedrick, 2010). This could be a hidden source of variance in our data, perhaps affecting our conclusions about the impact of item order generally. For example, if reverse-scored items help to disrupt metacognitive perceptions of the instrument's underlying intent or structure, then the effect of item order may be suppressed if the reverse-scoring factor is not explicitly modeled. While such a study would require a very large sample, the interaction between item order and reverse-scoring seems to be an important psychometric question that has yet to be fully explored.

The paradox inherent in a study like this one is that, although we were unable to confirm the hypotheses we set forth, we expect that psychometricians and test builders will be pleased to see the null results that we report. This study should be encouraging to those who construct and validate tests, as well as to those who make frequent use of such tests, since it appears that the order in which items are presented or listed is not associated with any significant negative consequences. Test builders are left to construct their instruments in the way that seems most logical to them, and at least based on these data, they are able to do so without having to worry a great deal about how that may impact the validity and reliability of the instrument itself.

Acknowledgements

The authors thank the Angelo State University graduate and undergraduate research assistants for their assistance. We also thank the anonymous reviewers of a previous draft of this paper for important feedback and for identifying a statistical error that affected data interpretation. Portions of this paper were presented at the Society for Industrial-Organizational Psychology conference, Chicago, IL, April 2011.

References

Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2010). A basic introduction to fixed-effect and random-effects models for meta-analysis. Research Synthesis Methods, 1, 97–111.

Butcher, J. N. (2013). Computerized personality assessment. In J. R. Graham, J. A. Naglieri, & I. B. Weiner (Eds.), Handbook of psychology: Assessment psychology (2nd ed., Vol. 10, pp. 165–191). Hoboken, NJ: Wiley and Sons.

Chen, Y.-H., Rendina-Gobioff, G., & Dedrick, R. F. (2010). Factorial invariance of a Chinese self-esteem scale for third and sixth grade students: Evaluating method effects associated with positively and negatively worded items. International Journal of Educational and Psychological Assessment, 6, 21–35.

Cooper, H. (2009). Research synthesis and meta-analysis: A step-by-step approach (4th ed.). Thousand Oaks, CA: Sage Publications.

Franke, G. H. (1997). "The whole is more than the sum of its parts": The effects of grouping and randomizing on the reliability and validity of questionnaires. European Journal of Psychological Assessment, 13, 67–74.

Frantom, C. G., Green, K. E., & Lam, T. C. (2002). Item grouping effects on invariance of attitude items. Journal of Applied Measurement, 3, 38–50.

Hakstian, A. R., & Whalen, T. E. (1976). A k-sample significance test for independent alpha coefficients. Psychometrika, 41, 219–231.

Hays, W. L. (1994). Statistics (5th ed.). Fort Worth, TX: Harcourt Brace.

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indices in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Thousand Oaks, CA: Sage Publications.

Kirchner, W. K., & Uphoff, W. H. (1955). The effect of grouping scale items in union attitude measurement. Journal of Applied Psychology, 39, 182–183.

Knowles, E. S. (1988). Item context effects on personality scales: Measuring changes the measure. Journal of Personality and Social Psychology, 57, 351–357.

Knowles, E. S., & Byers, B. (1996). Reliability shift in measurement reactivity: Driven by content engagement or self-engagement? Journal of Personality and Social Psychology, 70, 1080–1090.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage Publications.

Melnick, S. A. (1993). The effects of item grouping on the reliability and scale scores of an affective measure. Educational and Psychological Measurement, 53, 211–216.

Rost, D. H., & Hoberg, K. (1997). Changing the position of items in personality questionnaires: Methodological malpractice or tolerable practice? Diagnostica, 43, 97–112.

Schriesheim, C. A. (1981). The effect of grouping or randomizing items on leniency response bias. Educational and Psychological Measurement, 41, 401–411.

Schriesheim, C. A., Kopelman, R. E., & Solomon, E. (1989a). The effect of grouped versus randomized questionnaire format on scale reliability: A three-study investigation. Educational and Psychological Measurement, 49, 487–508.

Schriesheim, C. A., Solomon, E., & Kopelman, R. E. (1989b). Grouped versus randomized format: An investigation of scale convergent and discriminant validity using LISREL confirmatory factor analysis. Applied Psychological Measurement, 13, 19–32.

Schurr, K. T., & Henriksen, L. W. (1983). Effects of item sequencing and grouping in low-inference type questionnaires. Journal of Educational Measurement, 20, 379–391.

Smith, T. W. (1983). An experimental comparison of clustered and scattered scale items. Social Psychology Quarterly, 46, 163–168.

Solomon, E., & Kopelman, R. E. (1984). Questionnaire format and scale reliability: An examination of three modes of item presentation. Psychological Reports, 54, 447–452.

Steinberg, L. (1994). Context and serial-order effects in personality measurement: Limits on the generality of measuring changes the measure. Journal of Personality and Social Psychology, 66, 341–349.