Bowling Green State University ScholarWorks@BGSU

Psychology Faculty Publications

8-1998

Modeling Item-Level Data with Item Response Theory

Michael John Zickar Bowling Green State University, [email protected]

Follow this and additional works at: https://scholarworks.bgsu.edu/psych_pub

Part of the Psychology Commons

Repository Citation Zickar, Michael John, "Modeling Item-Level Data with Item Response Theory" (1998). Psychology Faculty Publications. 4. https://scholarworks.bgsu.edu/psych_pub/4

This Article is brought to you for free and open access by the Psychology at ScholarWorks@BGSU. It has been accepted for inclusion in Psychology Faculty Publications by an authorized administrator of ScholarWorks@BGSU.

Volume 7, Number 4, August 1998

Current Directions in Psychological Science. Published by Cambridge University Press.

Modeling Item-Level Data With Item Response Theory

Michael J. Zickar^1
Department of Psychology, Bowling Green State University, Bowling Green, Ohio

Psychology can be separated into two camps, a substantive camp that is primarily interested in understanding important aspects of human behavior and thought and a methodological camp that is primarily interested in developing tools that will be used by the substantive researchers to answer difficult questions. Ideally, the two camps should have much cross-fertilization, and the lines between them should be blurred. However, complex theories are often constructed before appropriate methods are developed for testing the hypotheses. Additionally, quantitative psychologists often develop techniques that might be difficult and impractical to apply initially to real-data problems. These novel techniques need to be refined through robustness studies and analytical work before they can be used with real-world data. A psychometric framework called item response theory (IRT) has undergone such prudent testing and refinement and is now ready to be implemented into the mainstream of psychological research and practice.

IRT has already had a major impact on educational testing through its impact on computerized adaptive testing (CAT). The precision of IRT-based item statistics allows computerized testing programs to choose items that provide maximum psychometric information for an individual examinee. This process allows adaptive tests to maintain measurement precision similar to that of conventionally administered tests even though fewer items are administered. In the 1990s, Educational Testing Service (ETS) implemented a CAT version of the Graduate Record Examination (GRE). ETS plans to phase out conventional paper-and-pencil administration of the GRE General test and administer only CAT versions by fall 1999. The success of adaptive testing would not be possible without development of IRT. In the future, it is likely that IRT will yield progress not only by improving measurement technologies, but also by making contributions in substantive areas, such as decision-making theory.

This article begins by explaining two fundamental concepts of IRT, the item response function and item information. Next, the flexibility of IRT is highlighted to demonstrate the types of data that can be modeled. Finally, the present and future impact of IRT on both practical and theoretical issues in psychology is discussed.

Recommended Reading

Drasgow, F., & Hulin, C.L. (1990). Item response theory. In M.D. Dunnette & L.M. Hough (Eds.), Handbook of industrial and organizational psychology, Vol. 1 (pp. 577-636). Palo Alto, CA: Consulting Psychologists Press.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory. Newbury Park, CA: SAGE.
van der Linden, W.J., & Hambleton, R.K. (Eds.). (1997). Handbook of modern item response theory. New York: Springer-Verlag.

TWO BASIC CONCEPTS OF IRT

Psychologists typically use many of the techniques developed through classical test theory (CTT) to evaluate tests and scales used in research and practice. Concepts such as reliability, item-total correlations, and the Spearman-Brown prophecy formula are all based on the CTT model, which posits that the observed score is a function of a true score and an error term. Many limitations of this framework have been noted. First, the CTT model focuses on scale-level measurement. The true score is explicitly linked to a particular instrument. Therefore, an individual's true score for one mathematical reasoning test is different from his or her true score for another mathematical reasoning test. Second, CTT does not allow for much precision in testing specific hypotheses about the measurement properties of scales. In CTT, each scale has a reliability that is used to characterize the measurement precision of the entire test. This concept fails to recognize that tests have differential capabilities in discriminating among different levels of examinees' abilities. For example, a mathematics test composed primarily of calculus items will be able to differentiate individuals high in math ability from those with average or below-average skills; this test, however, would not provide much differentiation between those of low ability and those of average ability. To appreciate the power of IRT and some of its advantages over CTT, it is necessary to discuss two basic concepts.^2

Item Response Function

IRT relates characteristics of items and characteristics of individuals to the probability of affirming, endorsing, or correctly answering individual items. The cornerstone of IRT is the item response function (IRF). This function is a nonlinear regression of the probability of affirming item i on a latent trait, θ, which represents the characteristic measured by the scale items (e.g., mathematical ability, extroversion, job satisfaction). Figure 1 presents a graphic representation of three IRFs.

There are many different forms of this regression line. For dichotomously scored items (e.g., right vs. wrong or true vs. false), the two-parameter logistic model (2PL) and the three-parameter logistic model (3PL) are commonly used. The formula for the 3PL model is

$$P(u_i = 1 \mid \theta) = c_i + \frac{1 - c_i}{1 + e^{-D a_i (\theta - b_i)}}$$

where D is a scaling constant (conventionally 1.7) and the probability that a person with a latent trait, θ, affirms an item i (i.e., u_i = 1) is a function of three parameters: a_i, a discrimination parameter that determines the slope of the IRF; b_i, a location parameter that determines the area of the θ continuum in which the IRF is most steep; and c_i, a pseudoguessing parameter that determines the probability that a respondent with an extremely low θ will endorse the item. The probability of affirming items with large a parameters varies sharply as a function of θ, whereas the probability of affirming items with low a parameters varies weakly as a function of θ. Items with low a parameters are generally considered poor items. Items with large positive b parameters will be endorsed only by respondents with large positive θs, whereas items with large negative b parameters will be endorsed by everyone except people with the most extreme negative θs. The c parameter introduces a nonzero lower asymptote to the IRF so that respondents with large negative θs will have a nonzero probability of affirming the item; this nonzero asymptote may result from guessing or other processes. The 2PL formula is a submodel of the 3PL formula and can be obtained by setting the c parameter to zero. This model has the implicit assumption that people with the lowest θ values will have a zero probability of affirming the item. An even simpler model, the one-parameter logistic model (1PL), is obtained by setting the a parameter to be constant across all items. Each of these models assumes that each item is measuring only one θ dimension.

The first hypothetical item presented in Figure 1 has a = 1.0, which is an average level of discrimination; b = 0.0, which is an average difficulty; and c = .25, which suggests that even individuals with extremely low ability have a 25% chance of answering the item correctly. Item 2 has the same discrimination and difficulty as Item 1; however, c = 0.0, suggesting that there is no guessing occurring with this item. As can be seen in Figure 1, the lower asymptote of the IRF for Item 2 is at 0.0. Finally, Item 3 is of lower discrimination (a = 0.60) and of lower difficulty (b = -1.5) than the previous two items. This item has c = 0.0; if the θ axis were extended further, the IRF would eventually reach 0.0. By plotting the IRFs, researchers can compare the functioning of items, determine the extent of guessing, and determine the range of θ in which an item is most discriminating.
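As a rough illustration of the 3PL formula, the following Python sketch evaluates the IRF for the three hypothetical items described for Figure 1. The function name `irf_3pl`, the `ITEMS` dictionary, and the default scaling constant `D = 1.7` are assumptions made for this example and are not taken from the article. Setting `c = 0` gives the 2PL case.

```python
import math


def irf_3pl(theta, a, b, c=0.0, D=1.7):
    """Probability of affirming an item under the 3PL model.

    theta: latent trait; a: discrimination; b: location; c: lower asymptote.
    D is an assumed scaling constant (1.7 is a common convention; pass 1.0
    for the plain logistic metric). Not hardened against overflow for
    extremely negative values of a * (theta - b).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Parameters of the three hypothetical items listed with Figure 1.
ITEMS = {
    "Item 1": {"a": 1.0, "b": 0.0, "c": 0.25},
    "Item 2": {"a": 1.0, "b": 0.0, "c": 0.0},
    "Item 3": {"a": 0.60, "b": -1.5, "c": 0.0},
}

if __name__ == "__main__":
    # Tabulate each IRF at a few points on the theta continuum.
    for name, params in ITEMS.items():
        row = [round(irf_3pl(theta, **params), 3) for theta in (-3, -1.5, 0, 1.5, 3)]
        print(name, row)
```

When θ equals b, an item with c = 0 yields a probability of .50, whereas Item 1 yields .625 at θ = 0 because its lower asymptote is .25.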

Information

[Figure 1: three item response curves plotted against theta, from -3.00 to 3.00. Item 1: a = 1.0, b = 0.0, c = 0.25; Item 2: a = 1.0, b = 0.0, c = 0.0; Item 3: a = 0.60, b = -1.5, c = 0.0.]

Fig. 1. Item response functions for three hypothetical items.

Another key IRT concept is information, which is used to quantify measurement precision. The information value for each item is computed on the basis of the IRF. The formula for item information is

$$I_i(\theta) = \frac{\left[P'(u_i = 1 \mid \theta)\right]^2}{P(u_i = 1 \mid \theta)\,\left[1 - P(u_i = 1 \mid \theta)\right]}$$

where P'(u_i = 1 | θ) is the first derivative (i.e., slope) of the IRF at a particular value of θ.
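The item information formula can be sketched in the same way. The code below is a minimal illustration that assumes the 3PL form of the IRF with a scaling constant `D = 1.7` (an assumption of this example, not a value confirmed by the text). The names `item_information` and `most_informative` are hypothetical. The second helper mimics, in a very simplified way, how an adaptive test might choose the item that provides the most information at a provisional θ.

```python
import math


def item_information(theta, a, b, c=0.0, D=1.7):
    """Item information: squared slope of the IRF divided by P * (1 - P).

    Assumes a 3PL item response function; D is an assumed scaling constant.
    """
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * logistic                        # IRF value at theta
    slope = D * a * (1.0 - c) * logistic * (1.0 - logistic)  # dP/dtheta
    return slope ** 2 / (p * (1.0 - p))


def most_informative(theta, items):
    """Return the name of the item with the largest information at theta."""
    return max(items, key=lambda name: item_information(theta, **items[name]))


# Parameters of the three hypothetical items listed with Figure 1.
ITEMS = {
    "Item 1": {"a": 1.0, "b": 0.0, "c": 0.25},
    "Item 2": {"a": 1.0, "b": 0.0, "c": 0.0},
    "Item 3": {"a": 0.60, "b": -1.5, "c": 0.0},
}

if __name__ == "__main__":
    for theta in (-2.0, 0.0):
        print(theta, most_informative(theta, ITEMS))
```

With these parameters, Item 2 is the most informative of the three at θ = 0, whereas Item 3 is the most informative at θ = -2. Item 1 is less informative than Item 2 at every θ because of its nonzero lower asymptote.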

Copyright © 1998 American Psychological Society

The standard error of measurement of θ can be computed directly from the item information function as follows:

$$SE(\theta) = \frac{1}{\sqrt{I(\theta)}}$$

where I(θ) is the information at θ, summed over the items administered. As can be seen in this equation, the standard error of measurement as conceptualized in IRT is conditioned on the level of θ. Unlike CTT, IRT allows measurement precision to vary across different regions of the measurement scale.

Figure 2 presents the information function for each of the three previously presented items. Several important observations can be made about the information function. First, information is maximized near where θ equals the value of the b parameter of each item. The functions for Items 1 and 2 have more information near the middle of the θ distribution than does the function for Item 3, which has its information maximized in the negative range of the θ continuum. This difference illustrates an interesting point: Items differ in their usefulness depending on the purpose. Item 3, which has a much lower discrimination parameter than Items 1 and 2, might be preferred if the purpose is to estimate the θ of a person with low ability because that item has a higher amount of information at the low end of the θ continuum. A second important observation, that guessing reduces information, can be seen by examining the difference between information functions for Item 1 and Item 2. Item 2 has greater information at all regions of the θ continuum, and especially at the lower regions of the θ continuum, where guessing will be prevalent among respondents. This decrease in the amount of information highlights the negative effects that guessing has on measurement precision.

[Figure 2: three item information curves (Item 1, Item 2, Item 3) plotted against theta, from -3.00 to 3.00.]

Fig. 2. Item information functions for three hypothetical items. See Figure 1 for the item parameters.

Computer adaptive tests work by administering items with large amounts of information near the region where θ is most likely to be. The ability estimate is revised after each item response, and then a computer algorithm selects the next item to present on the basis of the information level of items at the revised ability estimate. By choosing only items with large amounts of information, adaptive tests can maintain measurement precision at the levels of conventional tests even though fewer items are administered.

EXPANDING THE DOMAIN

Two unfounded criticisms of IRT are that IRT models are capable of modeling only items that measure cognitive ability and that IRT is incapable of modeling items that are not dichotomously scored. IRT models have been developed primarily with cognitive-ability items having clear right and wrong answers. These types of items lend themselves well to dichotomous scoring. However, dichotomous IRT models can be used for items that have two options, even though neither option is considered more correct than the other. For example, personality items that ask the respondent whether a particular description "describes yourself" or "does not describe yourself" can be modeled with dichotomous models. Dichotomous IRT models have been used to model items in diverse fields, such as personality (Zickar & Drasgow, 1996), drug-use screening (Kirisci, Tarter, & Hsu, 1994), and attitudes toward work (Roznowski, 1989).

In recent years, much of the basic research in IRT has focused on developing models for more complicated item types, such as items with more than two response options (i.e., polytomous items) and items that measure more than one dimension. Roznowski (1989), analyzing the Job Descriptive Index (JDI), collapsed a three-option response format into a dichotomous response scale because polytomous IRT was relatively unknown and untested at that time. She predicted that "algorithms that allow . . . analysis of [polytomous] JDI data using such methods likely will be forthcoming" (p. 813). She was correct.

Polytomous Models

Polytomous models are appropriate when items are not dichotomously scored and when the attractiveness of options differs for respondents of differing θ levels. The main difference between polytomous and dichotomous models is that IRFs are replaced by option response functions (ORFs). An ORF explicitly relates θ to the probability of choosing a particular option instead of the probability of answering an item correctly.

There are two classes of polytomous models. Models in the first class, denoted graded models, assume an a priori ordering of options in terms of valence, so that Option 1 has a more negative valence than Option 2, and Option 3 has a more positive valence than Options 1 and 2. This model is appropriate with Likert-type scales (e.g., when response options are "disagree," "neither agree nor disagree," "agree"), which are prevalent in personality and attitude measurement. An example of a graded model is Samejima's Graded Response Model (Samejima, 1969).

The other class of polytomous models, denoted nominal models, assumes no a priori ordering of response options within an item. An example of a nominal model is Bock's Nominal Model (Bock, 1972). A nominal model might be most appropriate for cognitive-ability items for which the incorrect options may imply different degrees of "wrongness," but often the test maker cannot identify the degree of wrongness without looking at the data in a pilot sample. For example, on a vocabulary test, one of the wrong answers may be more similar in meaning to the correct answer than other wrong answers are. For most personality tests, nominal models are probably inappropriate.

Polytomous models provide two advantages to researchers. First, these models can handle more flexible item types. Often (e.g., Roznowski, 1989) polytomous response scales are collapsed so that dichotomous models can be used. This is bad practice because the model used to analyze the data does not match the original response scale. Second, polytomous models can extract more information than dichotomous models used on the same items. This increased information helps improve the precision of θ estimates. Given these two advantages, polytomous IRT has been an important addition to the IRT toolbox.

Multidimensional IRT

The previously mentioned IRT models all assume that the latent trait, θ, is the only individual determinant of item responses. If this prescription were to be taken literally, the models could be used to analyze only scales with strict unidimensionality. However, Drasgow and Parsons (1983) demonstrated that deviations from strict unidimensionality will not destroy the fidelity of these models as long as an appropriate factor analysis demonstrates that the first factor has a much larger eigenvalue than secondary factors. Unfortunately, many psychological constructs, such as personality traits, may have higher levels of multidimensionality than the levels Drasgow and Parsons found acceptable. For these multidimensional constructs, IRT models that assume unidimensionality will not fit the data.

Recent work has focused on developing multidimensional IRT models (see Reckase, 1997). These techniques can be useful for modeling complex items, such as mathematical story problems, which require both quantitative and verbal skills. However, these techniques still require more work before they will be integrated into substantive research areas frequently. Future work with multidimensional IRT will focus on developing polytomous IRT models and also on making estimation programs more user-friendly. This work is important because it will help break through another limitation.

IRT-BASED TOOLS

Some of the most important consequences of IRT are the psychometric tools that rely on its basic concepts. These tools include adaptive testing and appropriateness measurement, among others.

Adaptive Testing

Adaptive testing presents a challenge for psychometricians because adaptive tests must be constructed to maximize the precision of ability estimates for each test taker. CTT methods are not very practical because the item parameters are linked to the sample for which the parameters are calibrated; IRT parameters are sample-invariant. Item parameters estimated from one sample can be applied to examinees who were not in the calibration sample. Another advantage is that item and ability parameters are on the same scale; this feature allows for relatively simple algorithms to be used to select items that are best suited for each individual. IRT models, along with the

development of low-cost desktop computers, have made adaptive testing practical. Adaptive testing is now used extensively by ETS, the U.S. Armed Services, and other licensure testing services. The popularity of adaptive testing in such high-volume, high-stakes testing programs is a testament to the confidence psychometricians have in IRT.

Appropriateness Measurement

Appropriateness measurement attempts to identify individuals who do not fit the existing model for responding to items (Levine & Rubin, 1979). This technique can be used to identify people who are unmotivated, cheating, faking, or otherwise answering in a way unlike most other respondents. The logic of appropriateness measurement is that a model is estimated on a group of respondents who can be considered to be normal or regular. Response patterns are checked against this normative pattern. Individuals whose patterns do not conform to this model are checked further. Additional investigation may reveal that the individual cheated or did not take the test seriously.

FUTURE DIRECTIONS

There are several directions that I predict future IRT research will follow. On the technical side, more sophisticated models will be developed to model more complex item types. Some examples of this trend can be seen in nonparametric models (e.g., Levine & Tsien, 1997) that provide more flexibility than the models discussed in this article, models that incorporate speed of responding (Roskam, 1997), and multidimensional models for polytomous data (Kelderman, 1997). These models will provide researchers even more modeling options.

Another important methodological research topic that needs more attention is evaluation of goodness of fit. IRT models, like structural equation models, need to be evaluated to see if the models fit the data. In the structural equations literature, there are many alternative goodness-of-fit indices, which evaluate model fit with different standards. In IRT, few options for evaluating model fit exist. Issues such as model modification indices and influence statistics have yet to be tackled in IRT.

The real contributions of IRT to the field of psychology are yet to be realized; the current challenge is to use IRT models and related tools to answer substantive problems. Thissen and Steinberg (1988) demonstrated that IRT models can be used in conjunction with experimental manipulations to test whether such manipulations alter the way respondents answer items. In an extension of this methodology, Highhouse and I used the 2PL model to model responses to risky-choice scenarios under a positive or negative framing condition (i.e., when positive or negative consequences of a decision, respectively, were emphasized; Zickar & Highhouse, 1998). Our results supported the traditional finding that respondents in negative frames are more likely to choose risky alternatives than are respondents in positive frames; however, when plotting IRFs for individual items administered under both positive and negative frames, we identified that the effect of framing differed across scenario types. The classic Asian Disease problem (Tversky & Kahneman, 1981), which is the most frequently used risky-choice scenario, had differences in discrimination and location parameters across experimental conditions, whereas other scenarios had differences in location parameters only across conditions. The detailed description of the IRFs allowed for fine distinctions between items to be noted. This IRT analysis suggested that the concept of framing needs to be reevaluated so that these differences can be explained.

More than 10 years ago, Roskam (1985) stated that "psychometric modeling has been and still is primarily developed as a technology for measurement, with only loose connections to substantive theorizing" (p. 9). With some exceptions, Roskam's statement still holds true today. With the increased accessibility of IRT estimation programs and the continued dissemination of IRT research in substantive journals, this important methodological technology should help advance the state of psychological knowledge.

Notes

1. Address correspondence to Michael J. Zickar, Department of Psychology, Bowling Green State University, Bowling Green, OH 43403; e-mail: [email protected].
2. For more information about the differences between CTT and IRT, the reader is referred to Hambleton and Jones (1993).

References

Bock, R.D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
Drasgow, F., & Parsons, C.K. (1983). Application of unidimensional item response theory models to multidimensional data. Applied Psychological Measurement, 7, 189-199.
Hambleton, R.K., & Jones, R.W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12, 38-47.
Kelderman, H. (1997). Loglinear multidimensional item response model for polytomously scored items. In W.J. van der Linden & R.K. Hambleton (Eds.), Handbook of modern item response theory (pp. 287-304). New York: Springer-Verlag.
Kirisci, L., Tarter, R.E., & Hsu, T. (1994). Fitting a two-parameter logistic item response model to clarify the psychometric properties of the drug use screening inventory for adolescent alcohol and drug abusers. Alcoholism: Clinical and Experimental Research, 18, 1335-1341.
Levine, M.V., & Rubin, D.B. (1979). Measuring the appropriateness of multiple-choice test scores. Journal of Educational Statistics, 4, 269-290.
Levine, M.V., & Tsien, S. (1997). A geometric approach to two dimensional measurement. In A.J. Marley (Ed.), Choice, decision, and measurement (pp. 207-223). Mahwah, NJ: Erlbaum.
Reckase, M.D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25-36.
Roskam, E.E. (1985). Current issues in item response theory. In E.E. Roskam (Ed.), Measurement and personality assessment (pp. 3-19). Amsterdam: Elsevier Science.
Roskam, E.E. (1997). Models for speed and time-limit tests. In W.J. van der Linden & R.K. Hambleton (Eds.), Handbook of modern item response theory (pp. 187-208). New York: Springer-Verlag.
Roznowski, M. (1989). An examination of the measurement properties of the Job Descriptive Index with experimental items. Journal of Applied Psychology, 74, 805-814.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika, 34(Suppl. 17).
Thissen, D., & Steinberg, L. (1988). Data analysis using item response theory. Psychological Bulletin, 104, 385-395.
Tversky, A., & Kahneman, D. (1981). The framing of decisions and the rationality of choice. Science, 221, 453-458.
Zickar, M.J., & Drasgow, F. (1996). Detecting faking on a personality instrument using appropriateness measurement. Applied Psychological Measurement, 20, 71-87.
Zickar, M.J., & Highhouse, S. (1998). Looking closer at the effects of framing on risky choice: An item response theory analysis. Organizational Behavior and Human Decision Processes, 75, 75-91.
