Personality and Individual Differences 55 (2013) 317–321

Item grouping and item randomization in personality measurement

Kraig L. Schell (a), Frederick L. Oswald (b)
a Department of Psychology, Angelo State University, United States
b Department of Psychology, Rice University, United States

Article history: Received 22 October 2012; Received in revised form 6 March 2013; Accepted 12 March 2013; Available online 4 April 2013.

Abstract

This paper describes a study examining the impact of item order in personality measurement on reliability, measurement equivalence, and scale-level correlations. A large sample of university students completed one of three forms of the International Personality Item Pool (IPIP) version of the Big Five personality inventory: items sorted at random, items sorted by factor, and items cycled through factors. Results showed that the underlying measurement model and the internal consistency of the IPIP Big Five scales were unaffected by differences in item order. Also, most of the scale-level correlations among factors were not significantly different across forms. Implications for the administration of tests and the interpretation of test scores are discussed, and future research directions are offered.

Keywords: Personality; Measurement; Psychometrics; Validity; Reliability; Item order

© 2013 Elsevier Ltd. All rights reserved. doi: 10.1016/j.paid.2013.03.008

1. Introduction

The history of item-order research dates back over half a century. One of the earliest studies on the effects of item order, using a sample of unionized workers, employed an attitude instrument with 12 subscales (Kirchner & Uphoff, 1955). The analysis reported indicated that 11 of those 12 scale means did not differ significantly between tests whose items were sequenced in two different ways: items grouped by factor vs. items that were randomized. The sample sizes were small (the maximum N per group was 21); simple mean comparisons may have missed important relationships in these data, and mean differences do not speak to the reliability or equivalence of each measure's internal structure. Nevertheless, this work represents an early contribution to the historical conclusion that item order is not a critical issue in the administration of self-report instruments.

Subsequent studies on the item-order effect (or the lack of one) do not appear in the research literature until much later, in the 1980s. The first of these studies (Schriesheim, 1981) tested small samples of department store employees on two recognized leadership questionnaires, examining items that were grouped by construct vs. randomized (N = 40 in each of two groups, for each of two studies). Both questionnaires were designed to rate the employees' immediate supervisors on leadership behaviors, and item order was targeted as a possible source of leniency bias in those ratings. Although the evidence reported suggested a decrease in leniency bias for the ungrouped instruments, the findings were modest at best, again casting doubt on the actual impact of item sequencing.

Other studies provide evidence in support of item-order effects. Across large samples of public school supervisors and administrators, Schurr and Henriksen (1983) reported significant effects of item order on the psychometric properties of a 61-item instrument that was presented in three forms: ungrouped, grouped but not labeled by factor, and grouped and labeled by factor. However, the instrument in question was a checklist of skills and activities that could serve as training topics in the organization sampled, and so the items under any given "factor" were quite diverse. These effects may not generalize to measures of latent psychological constructs such as personality traits.

Another study conducted soon thereafter used self-report attitude measures, such as job and family satisfaction, and the authors reported that internal consistency statistics varied significantly between grouped and labeled, grouped and unlabeled, and ungrouped forms of administration (Solomon & Kopelman, 1984). Their data showed that the grouped and unlabeled form was associated with the highest Cronbach's alpha, whereas the randomized form was associated with the lowest, and these differences were statistically reliable. These two studies renewed interest in the issue of item order as a potential confound in psychological measurement. Finally, two other studies (Schriesheim, Kopelman, & Solomon, 1989; Schriesheim, Solomon, & Kopelman, 1989) examined item-order effects in attitude measures from multiple perspectives: internal consistency, convergent/discriminant validity, and confirmatory factor structures. In general, their data showed that the effect of item order was likely to be stronger in measures that were psychometrically weaker. This makes conceptual sense: when the construct is not salient or strong for the test-taker, factors such as unique item-specific content and item ordering have greater influence on test responses. Both studies also suggested that the grouped-item format was superior to random-item formats in most cases, although the authors admit that the practical significance of item order for reliability is likely to be rather low (Schriesheim, Kopelman, et al., 1989; Schriesheim, Solomon, et al., 1989), and we would add that it remains a relatively open question whether results in this research domain are replicable versus being more strongly tied to particular samples, measures, and settings.
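Because comparisons of Cronbach's alpha across administration forms recur throughout this literature, a minimal computational sketch of the statistic may be helpful. The function and simulated data below are our illustration only, not any cited author's code; it computes alpha from an n-respondents-by-k-items response matrix.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) response matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of scale totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical example: alpha for one 10-item subscale.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))              # shared trait score
responses = latent + rng.normal(size=(200, 10)) # trait plus item noise
print(round(cronbach_alpha(responses), 3))
```

Note that comparing alphas from independent groups requires a dedicated significance test (e.g., Feldt's test); the studies above report only that the observed differences were statistically reliable.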
In line with this cautionary sentiment about replicability, it should be noted that not everyone studying item-order effects at that time drew the same conclusions. In two papers, Knowles and colleagues argued that item order significantly influenced the item-total correlations and alpha reliability in a personality instrument (Knowles, 1988; Knowles & Byers, 1996). Steinberg (1994) also suggested that item-trait correlations could be systematically affected by grouping items under each underlying factor that the measure is intended to reflect.

Finally, more recent studies have been conducted as well, using newer analytic techniques. For example, Franke (1997) showed evidence that grouping items by their underlying factor negatively affected internal consistency and changed subscale means in a multi-factor instrument. She suggested that this occurs because the test had been validated using a particular item sequence, and that sequence had therefore become part of the psychometric properties of the instrument itself. This is an interesting general point: the validation evidence accumulated for a measure over time may be in part a function of the item order established for that measure. A more recent study (Frantom, Green, & Lam, 2002) seemed to indirectly support this idea with a 40-item opinion survey, arguing that grouping items by factor under a Rasch measurement model approach significantly increased person separation, making the items reflecting a particular factor more "sample-free."

We reasoned that as item sequencing becomes more obviously non-random (with grouping by factor being the most obvious arrangement and a random assortment being the least), the psychometric properties of the instrument should improve if the structural features of the test are an important part of the measurement model underlying it. Thus, we expected, first, that obvious non-random sequencing (items grouped by factor) would increase the internal consistency of the instrument, since item grouping should encourage an understanding of the items' shared properties, an idea in line with prior studies (Melnick, 1993; Smith, 1983). Second, based on Schriesheim, Kopelman, et al. (1989), we hypothesized that confirmatory factor analysis would reveal unequal factor loadings (metric non-invariance) between comparable items at the subscale level across the three forms. Third, we expected that observed scale-level correlations would differ significantly (i.e., show heterogeneity) across form types. Specifically, non-random item groupings should lead to lower correlations between factors, because the superficial differences between items should be perceived more readily by respondents. To our knowledge, no study has directly compared scale-level correlations between groups exposed to the same instrument with different item sequencings.
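To make the third hypothesis concrete: one standard way to test whether a scale-level correlation differs across independent groups is to transform each group's r with Fisher's z and compute a chi-square heterogeneity statistic. The sketch below illustrates that generic procedure; the correlations are invented, the group sizes simply split the study's total N three ways, and this is not necessarily the exact test used in the present study.

```python
import numpy as np
from scipy.stats import chi2

def correlation_heterogeneity(rs, ns):
    """Chi-square test that k independent correlations are equal.

    Fisher's r-to-z: z_i = atanh(r_i), with variance 1 / (n_i - 3).
    Q = sum w_i * (z_i - z_bar)^2 follows chi2(k - 1) under H0.
    """
    z = np.arctanh(np.asarray(rs, dtype=float))
    w = np.asarray(ns, dtype=float) - 3.0   # inverse sampling variances
    z_bar = np.sum(w * z) / np.sum(w)       # precision-weighted mean z
    q = np.sum(w * (z - z_bar) ** 2)
    df = len(rs) - 1
    return q, df, chi2.sf(q, df)

# Invented example: one pair of trait scales in the three form groups.
q, df, p = correlation_heterogeneity(rs=[0.25, 0.18, 0.31], ns=[132, 133, 132])
print(f"Q = {q:.2f}, df = {df}, p = {p:.3f}")
```

A non-significant Q would indicate that the scale-level correlation is statistically indistinguishable across item sequencings for that pair of traits.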
2. Method

2.1. Participants

Three hundred ninety-seven participants (129 men, 268 women) volunteered to complete the study. The average age of the sample was 21.4 years (SD = 6.3). All were students at a mid-sized university in the American Southwest who completed the study in exchange for course credit.

2.2. Materials

The 50-item International Personality Item Pool (IPIP) version of the Big Five personality instrument was chosen for this study (http://ipip.ori.org). As portrayed on the IPIP website, the IPIP instrument has consistently been shown to be a valid and reliable measure of the Big Five traits. None of the form types described below reflects the "standard" item sequencing for this instrument, because one does not exist.
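As an illustration of the three form types, the sketch below constructs grouped, cycled, and randomized orderings from a placeholder 50-item pool with 10 items per Big Five factor. The item labels and the exact cycling scheme are our assumptions for illustration, not the actual IPIP items or the precise forms administered in the study.

```python
import random

FACTORS = ["E", "A", "C", "N", "O"]  # Big Five factor labels
# Placeholder pool: "item i of factor f", standing in for the 50 IPIP items.
pool = {f: [f"{f}-item-{i + 1}" for i in range(10)] for f in FACTORS}

# Form 1: grouped by factor. All 10 items of a factor appear together.
grouped = [item for f in FACTORS for item in pool[f]]

# Form 2: cycled through factors. One item per factor, repeating the
# E, A, C, N, O cycle ten times, so same-factor items are spaced apart.
cycled = [pool[f][i] for i in range(10) for f in FACTORS]

# Form 3: random order. A single shuffle of the whole pool.
randomized = grouped[:]
random.seed(42)  # fixed seed for a reproducible illustration
random.shuffle(randomized)

print(grouped[:6], cycled[:6], randomized[:6], sep="\n")
```

Under this scheme, the grouped form makes the factor structure maximally salient, the cycled form places same-factor items at regular five-item intervals, and the randomized form serves as the baseline condition.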