JOURNAL OF APPLIED MEASUREMENT, 5(l),95-1 10 Copyright' 2004

Rasch Model Estimation: Further Topics

John M. Linacre University ofthe Sunshine Coast Australia

Building on Wright and Masters (1982), several Rasch estimation methods are briefly described, including Marginal Maximum Likelihood Estimation (MMLE) and minimum chi-square methods. General attributes of Rasch estimation algorithms are discussed, including the handling of missing data, precision and accuracy, estimate consistency, bias and symmetry. Reasons for, and the implications of, measure misestimation are explained, including the effect of loose convergence criteria, and failure of Newton-Raphson iteration to converge . Alternative parameterizations of rating scales broaden the scope of Rasch measurement methodol- ogy-Requests

for reprints should be sent to John M. Linacre, P. O. Box 811322, Chicago, IL 60681-1322, e-mail : MLC& winsteps .com

96 LINACRE

Introduction ("step difficulties", "Rasch thresholds") . To is introduced for mathematical symmetry. Algebra- Rasch models have the algebraic form of ically, it cancels out. It is usually set at 0. But, logit-linear models. Techniques for estimating the choosing a large negative value for do enables parameter values of such models have a long his the computation of probabilities for a very wide tory in modern (Yule, 1925). Several of range of ability-difficulty differences . These the estimation methods currently in wide use are would otherwise cause exponential overflow in described in Wright and Masters (1982). This a computer's math processor. Computers are gen- paper documents some of the other estimation erally accommodating of floating-point under- methods now in use, and some current variants flow, treating it as equivalent to an arithmetical of earlier methods. Some statistical properties of zero, but they report computational failure on all estimation methods are also discussed. floating-pointoverflow. The estimation methods described in Wright Sufficient statistics for the person and item and Master (1982) are the "Normal Approxima- parameters are the sums of the scored observa- tion Algorithm" (PROX), "Pairwise Conditional tions, i.e., the raw scores, associated with those Estimation" (PAIR), "Unconditional Maximum parameters. For instance, Rn is the raw score as- Likelihood Estimation" (UCON), also called sociated with Rn, and S. with 8.. For each of the "Joint Maximum Likelihood Estimation" scale parameters, ti ., j=1, m, a (JMLE), and "Fully Conditional Estimation" is the count of observations in the associated cat- (FCON, CON), also called "Conditional Maxi- egory. Ronald Fisher (1922) writes ofa sufficient mum Likelihood Estimation" (CMLE) . statistic "that the statistic chosen should summa- Additional methods to be described here in- rize the whole of the relevant information sup- clude "Marginal Maximum Likelihood Estima- plied by the sample." A Rasch model goes fur- tion" (MMLE), "Extra-Conditional Maximum ther, specifying that the irrelevant information be Likelihood" (XMLE), "Minimum Chi-Square random noise with a certain distribution. Qual- Estimation", and loglinear estimation. ity-control fit statistics report the extent to which Refinements to estimation procedures in- data meet this specification . clude alternative rating scale parameterizations, Equation (1) is more conveniently expressed and alternatives to Newton-Raphson iteration. as: Statistical properties to be examined include nix consistency, bias and the effects ofmisestimation . log /o ni(x-i) A Rasch Rating Scale Model Following the notation in Wright and Mas- Here, is the probability of person n being ters (1982), a Rating Scale model (Andrich, observed in category x-1 of item i. This expres- sion of the Rasch model emphasizes that the 1978) which defines the probability, Tc,' of per probabilistic, log-odds, structure of the data, on son n of ability R. on the latent variable being observed in category x of item i with difficulty 8. the left, is modeled to be the manifestation of an as: additive combination of latent parameters on the right.

i-0 Estimation Methods 7tnia - m k

- I;)] This section discusses estimation methods fey_[#n (s; + J°0 k-0 omitted from Wright and Masters (1982). It largely overlaps material in Fischer and Molenaar where the categories are ordered from 0 to m, (1995) and Linacre (1989) to which the reader is and Jr. } are the rating scale structure parameters referred for a more technical exposition.

RASCH MODEL ESTIMATION: FURTHER TOPICS 97

Linacre (1999) compared current implemen- ematical fiction. A frequent choice is the normal tations of several Rasch estimation algorithms, distribution. Then the { R } can be replaced by a and concluded that, for practical purposes, "all normal distribution, parameterized with 0, of methods produce statistically equivalent esti- mean p and standard deviation a, i.e., of form mates" (p. 402). Rasch measurement has not yet N(g,a). progressed to the point that the slight differences Equation (1) then becomes, for any person between the estimates produced by different al- u in the sample, gorithms have any systematic substantive impact. Ofcourse they can have accidental consequences. ;+ For instance, a barely statistically significant dif- +~ eY_[0-(8 ~cnk= f - '-°k f(e) dB ference between two measures, might be com- -°° ~,ey[e + r, )] puted to be "just significant" when the measures k-0 l=0 are estimated using one method, but "just falling short of significance" when a different estima- where tion method is employed. This merely indicates (B-,U)2 the insecure nature of hairline decisions, whether 8 2Q2 of significance or of pass-fail decisions relative a 2)c e to a criterion point. Numerical estimates can be obtained by Newton- Novel estimation methods continue to be Raphson iteration or other techniques, see Fischer proposed, each with its own particular virtues. and Molenaar (1995). A constraint must be intro- For instance, Karabatsos (2001) proposes a non duced to make the estimates unique. In ConQuest, parametric method, perhaps immune to scaling the constraint is that the mean ofthe item difficul- distortions which Nickerson and McClelland ties is zero. In many IRT programs, the mean of (1984) perceive to be undetectable by numerical the person distribution, la, is set to zero. methods. Such methods have yet to reach pro- The PROX algorithm duction software, so their substantive impact is follows the distribu- tional train ofthought not yet known. The Rasch analyst, however, even further, applying nor- mal distributions to both should continue to exercise caution with respect persons and items. It takes advantage of an to claims such as "we suggest that our [Rasch approximate identity be- tween the Cumulative estimation] method is superior to all others cur- Normal (D and Logistic `F distributions, which rently available." (Sheng and Carriere, 2002). is (Camilli, 1994) Marginal Maximum Likelihood Estimation dO=T( x )=(D( (MMLE) f f (0) x/1 .702 )

MMLE is implemented in Item Response In fact, Joseph Berkson (1944), who coined the Theory (IRT) software, such as BILOG (Mislevy term "logit", proclaims that this approximation and Bock, 1996), and Rasch-specific software, is indeed good enough for practical work. It such as ConQuest (Wu, Adams and Wilson, greatly simplifies and speeds computation, be- 1998). Its advantage is that parameter estimates cause the integral now has a closed form solu- for very large samples and very long tests can be tion. This usually removes the need for integra- obtained. Its disadvantage is that assertions must tion by numerical quadrature, which itself is a be made about the sample distribution . A unique source of estimation error. In view of the fact feature is that this sample distribution is usually that the asserted person distribution is, at best, modeled to include persons with extreme (zero only an approximate match to the empirical dis- and perfect) scores. tribution, it is surprising that MMLE software Under MMLE, the sample distribution is does not routinely take advantage of this equiva- imagined to conform to some convenient math- lence.

98 LINACRE

As usually implemented, MMLE provides In clinical situations, however, the distribu- item and rating scale structure parameter esti- tions and inferences can be far different. If a typi- mates, but only summary estimates of the person cal clinical instrument, e.g ., a quality of life as measures. Individual person measures can be sessment, is applied to a random sample of the obtained by estimating the measure correspond- adult population, then the results can be expected ing to each raw score given the item and struc- to be highly skewed. The crucial clinical cases ture parameters, as in the JMLE method (Wright are likely to be in the long lower tail. If analysis and Masters, 1982, p. 77) . But this will not sum- combines clinical and non-clinical samples, then marize to the same person measure distribution the person distribution may be bimodal. Even in as the MMLE estimates. In particular, under the educational testing bimodal distributions may JMLE method, extreme scores do not yield esti- occur (Lee, 1991) . In clinical applications, mis- mable measures . estimation may lead to incorrect inferences about Another approach is to consider the raw the clinical state, or the clinical improvement, of score of person n, R., and to compute the prob- patients who are the focus of the assessment. ability, Pne that the person had a particular abil Accordingly, care must be taken to verify that ity, 0. This computation is performed over the the asserted person distribution is a reasonable entire MMLE ability distribution . The MMLE match to the empirical one. This can be done by person ability is inspection of the raw score distribution, or, for analyses involving missing data, comparison of +00 the MMLE person distribution withthat produced on = J Bgeiand9 (6) by JMLE or PAIR -00 Extra-Conditional Maximum Likelihood Estimates like these will necessarily summarize Estimation (XMLE) to the asserted MMLE distribution, no matter The familiar objection to the convenient what it is. Accordingly it is essential that they be JMLE method is that under certain conditions its characterized with fit statistics if any attempt is estimates are statistically inconsistent and, for to be made validate the reasonableness of the short tests or small samples, noticeably statisti- asserted distribution . cally biased (Haberman,1977; Wright 1988). The MMLE, and its close relative, PROX, can JMLE bias produces estimates that are too dis- mislead the unwary. An obvious drawback to persed. For a two-item dichotomous test, the these approaches is that they violate Rasch's pre JMLE item estimates are twice as dispersed as cept that "it ought to be possible to compare the CMLE ones (Andersen, 1973) . Wright (1988) stimuli belonging to the same class-'measuring discusses a simple correction which effectively the same thing'-independent of which particu- eliminates JMLE bias. lar individuals within a class considered were An analytic attempt to remedy the estimation- instrumental for the comparison." (Rasch, 1980, bias defect in JMLE, while maintaining flexibility p. xx) . Indeed on the page, Rasch has reported of data designs, is XMLE, "extra-conditional criticisms of "group-centered" statistics. Practi- MLE", originally XCON (Linacre, 1989), imple- cal considerations, however, can override ideal- mented in WINSTEPS (Linacre, 2002c). The istic ambitions . MMLE and PROX were devel- source of the bias in JMLE estimation is that it oped for the analysis of dichotomous educational acts as though it can estimate finite measures for tests in which it is reasonable to assume a extreme score vectors, even though it can't. XMLE unimodal, reasonably symmetric person distribu- adjusts for this by removing the possibility of ex- tion. Misestimation at the tales of the distribu- treme score vectors from the estimation space. tion, geniuses and dunces, would not lead to in- that correct inferences . Equation (1) specifies the probability Xn,=x and this is the probability used in JMLE.

RASCH MODEL ESTIMATION : FURTHER TOPICS 99

But the probability that Xni is part of a zero score m vector for person n is Ei= E k9rik (10) k=0 L )rn(0J - rl)rni0 i=1 The multinomial variance, Wni, of the observa- tion about its expectation is given by: Similarly for the perfect score vector, Tcnimi, and the extreme item score vectors 71,(),i and 7c,,,,i' m Accordingly, in XMLE, these four probabilities Wni= Y ( k - E i /T ink (11) are computed for each Xni, and then the probabili- k=0 ties used for estimation become: From these can be obtained a set of standardized P~im=1n - n,~nni - grin+ )Tii01 Y residuals, {Zni}, whose distribution approximates N(0,1): P_r x=1,m-t - P,su = (71,;) - 1x .,10 7rilol + 7r,,10) )rtlol Y l _ Xni - Eni Zni - (12) W ni where y is a local normalizing value such that X ' M, EP nW --0 is 0. Accumulating these, for any person n or item i, JMLE estimation is then executed with these yields an approximate chi-square statistic: adjusted probabilities . For instance, Wright and Masters (1982, p. 76) equation 4.4 .5 retains its z L L ( x ~, - &r ) z x = E Z. = E df. (13) form but with adjusted probabilities . i=1 i=1 w~; = L-1

al. L m where L is the length of the test taken by person = Rn - I Y_ kPnik n. On i=1k=0 a To minimize this for Rn, i.e., to find the value where a, is the log-likelihood of the the data. Rn of Rn which produces the best local fit ofthe data is the raw score of person n on a test of L items to the model, we use: each with categories numbered from 0 to m. This adjustment has the effect of reducing the probabilities of the extreme categories . Con- sequently the XMLE estimate corresponding to which yields relationships similar to: a score vector for an item or person, is more cen- (X . a ~ 2(x ;-E  - E- ; P,,k tral than the JMLE estimate. Thus XMLE essen- x°°= ;) wa E(k-F, )' (15) tially corrects the bias and inconsistency prob- lems, but, as always, raising other concerns which for which estimates can be obtained by Newton- are addressed later in this chapter. Raphson iteration. Minimum Chi-Square Estimation The first term of (15) sums to zero when the observed person raw score matches the expected This estimation method is older than any of score, but the sum of the second term exists if the others listed here or in Wright and Masters there is any imbalance in the residuals. Figure 1 (1982). It is applied to logit-linear models in Yule illustrates this with a 6-item dichotomous test. (1925). It produces parameter estimates which The item difficulties are symmetrically distrib- maximize the fit of the data to the model. The uted so that a raw score of 3 on the test, places expectation, Eni, corresponding to observation, the MLE person ability estimate in the center of X ., is: m the items. If the respondent had succeeded on the

100 LINACRE

three easier items, and failed on the three harder, be estimated with generally available statistical this would also have been the minimum chi- software, though LOGIMO (Kelderman and square ability estimate. This respondent, however, Steen, 1988) also takes advantage of this failed on an easier item, but succeeded on the conceptualization . most difficult item. The chi-squares for this re- The log-linear version of a Rasch model is sponse string for different ability levels are plot- based on cell frequencies in a contingency table. ted. The minimum chi-square, and so the ability The cell identification corresponds to the person estimate according to this method, is now at 0.33 response string. Thus the probability ofobserving logits, noticeably above the center of the test. a particular response string, S, which sums to raw Using the minimum chi-square method, per- score Rs, for a person of ability (3n is given by: sons with different response patterns, but with the same raw score, obtain different measures (16) characterized by different standard errors, and similarly for the items. This is usually consid- ered unacceptable for reporting purposes, even whereX" is the response to item i in string S which though PAIR, as implemented in RUMM2010 consists of L, the test length, responses . (Andrich, et al., 2000), reports different estimates Then expected frequency of this cell for the for dichotomous items with the same raw score, sample of N persons is thus: and IRTprograms, such as MULTILOG (Thissen, 1991), routinely report different estimates for Freq (S) _ (17) persons with the same raw score. Log-Linear Estimation So that, from (1), Strictly speaking, this isn't a different method log(H'rey(S))=-YX, .S ; - of estimation, but rather a different way of for- mulating the Rasch model. Typically, this formu lation is used so that Rasch item parameters can where 8. is the difficulty of item i and tik is the

-2 -1 Ability Estimate (Logits) Figure 1. Minimum chi-square estimation.

RASCH MODEL ESTIMATION : FURTHER TOPICS 10 1

Rasch threshold parameter at which categories persons do not respond to items . Typically, if a k-1 and k are equally probably. LR, is a term de- non-response is scored, it is keyed as "wrong" pendent on the raw score, R,, the person distribu- for a multiple-choice test, and as "don't know" tion and the item distribution, but not on the par- or "neutral" for an attitude survey. But this de- ticular response string. LRc is constant across all pends on the purpose of the analysis. In this response strings with the same raw score as S. present discussion, "missing data" means data for Item and scale structure parameters in this which no scored responses are available . formulation can be estimated using standard log- In principle, missing data are of no concern linear estimation methods. to the Rasch model. Equation (1) is an observa- tion-level model. Estimation methods which rely Technicalities of Estimation on sufficient statistics, such as JMLE and PAIR, Precision and Accuracy merely summarize the non-missing observations that are relevant to each parameter, and compare Precision is reproducibility, i.e., the extent them with their expectations. Thus, for JMLE or to which a measuring instrument agrees with it- XMLE, Equation (9) becomes self. Accuracy is the extent to which an instru ment reports the "truth". In thermometry, these _ - kP ik are seen to be clearly different, and sometimes aQ i=llXni1ii k=0 opposing, attributes (National Physical Labora- (19) tory, 1955) . Rasch software routinely reports pre- Rn = X , i=11 X  i?0 cision as estimate standard errors, in a manner familiar to statisticians and metrologists . End Some estimation methods require complete users, however, are rarely able to capitalize on data, or, a little more flexibly, data consisting of this information, and tend to regard Rasch mea- complete subsets. These methods include CMLE sures in the same way they have always regarded and log-linear approaches. Since complete data raw scores, i.e., as point estimates . permits optimization ofestimationroutines, some Accuracy of Rascb estimates, in terms ofan implementations of PROX, JMLE and MMLE external reference standard, such as a "standard also require complete data. When complete data meter", is not known or reported. Rasch measure are required, either casewise deletion or imputa- ment has not yet advanced that far. Internal ac- tion of missing values is performed. If only a few curacy is reported as the fit of the data to a Rasch observations are missing in the data set, casewise model, but it is largely unclear what level of ac- deletion is indicated. When performed, imputa- curacy is needed for any particular application . tion of missing values also requires correction of In practice, however, the alternatives to Rasch measure standard errors and fit statistics for the measurement, be they raw scores, descriptive IRT influence of the imputed data. . models, or qualitative approaches, are even less Estimate Consistency secure. Consequently, once blatant inaccuracies, such as wild guessing, miskeyed items and Consistency is the property that, given an misoriented rating scales, have been eliminated infinite amount of data which fit a statistical from the analysis, the resultant Rasch estimates model, the estimation procedure would recover are "accurate enough for government work." the values of the parameters used to generate those Missing Data data. Consistency differs from accuracy. Accuracy is the extent to which the estimates meet For conventional analyses, complete data external criteria of exactness. In Rasch analysis, exist when there is a scored observation for ev- these are usually quality-control fit criteria. ery person on every item. Missing data can oc Consistency can never be observed directly. cur when items are not administered to persons, It can only be inferred from the mathematical as in adaptive testing and test linking, or when properties ofthe estimation method. It is usually 102 LINACRE

assumed to be asymptotic, so that, as the amount Estimate Bias of data becomes large, estimates are expected to An estimation algorithm isunlikely to recover approach their generating values, but this is not true value of a parameter from finite data. In- necessarily true. For instance, a Monte Carlo the for each parameter an estimate is reported method which guessed at random would finally stead, precision, in the form of a of guess the correct estimate, but the last guess of a and its around its unknown true value. Since large number of guesses might not be any better that estimate is unknown, the estimate is usually than the first. the true value treated as the true value, and standard error is ap- Considerable effort has been expended criti- plied to the estimate. Since error distributions are cizing JMLE for its lack of consistency, e.g., not always of a simple form, measurement impre- Jansen, et al., (1988) . But consistency is a slip cision may be reported in different forms, e.g., as pery property. Consistency requires not only an plausible values in ConQuest. Nevertheless, it is infinite amount of data, but data that is infinite in assumed that the true value is somehow central in particular ways. Thus Haberman (1977) demon- the error distribution . strates that JMLE can be consistent if both the Estimation bias occurs when the estimated number of persons and number of items become values are, on average, higher or lower than the infinite together. Estimation methods which elimi- true values, and perhaps even outside the reported nate, in some way, the individual person param- precision . Though JMLE estimation bias has been eters fromestimation ofthe item parameters, such thoroughly discussed (Wright, 1988), that ofother as CMLE, PAIR and MMLE, are consistent if algorithms has been generally ignored. the number ofpersons becomes infinite (Pfanzagl, estimation CMLE is generally much less bi- 1994; Zwinderman 1995). CMLE, as usually For example, JMLE for short tests, but not totally implemented, is not consistent if the number of ased than persons remains finite, but the number of items unbiased. becomes infinite. Consider a 2-item dichotomous test admin- istered to 3 persons. The 64 possible data matri- The usual reason for inconsistency is the pos- ces are shown in Table 1 . Only the 6 matrices sibility ofextreme score vectors (zero and perfect shown in roman bold contain no extreme score raw scores). These imply that the corresponding vectors for either persons or items. When com- parameter is infinite or undefined . How able is a puting the likelihoods used in making its esti- person who succeeds on every item? How easy is mates, JMLE includes all 64 data matrices, de- an item on which everyone succeeds? Consistency spite the fact that 58 of them contain inestimable requires that, as the amount of data become infi- response strings. CMLE explicitly excludes data nite, the probability of an extreme score vector in matrices with inestimable person response strings the estimation space becomes zero. from its likelihood calculations. This eliminates For CMLE, consistency occurs in two stages. 56 of the 64 data matrices, keeping the 6 bolded Extreme person vectors are explicitly excluded matrices. CMLE also keeps the two data matri- from the estimation space. Only non-extreme ces with inestimable item response strings which person score vectors contribute to the likelihood are shown in italics at the diagonal corners of of the data, which is to be maximized . The likeli- Table 1 . These introduce bias into the CMLE hood of extreme item vectors is reduced to zero estimates. According to JMLE, the logit distance by expanding the person sample to infinity, so between the items is 21og(2) = 1.39 logits. Ac- that it becomes certain that there is at least one cording to CMLE, the distance is log(2) = 0.69 success and one failure on every item. logits. In fact, the exact MLE estimate for these Consistency is a theoretical property. Of data is that the items are infinitely far apart! practical concern is the extent to which estimates In this boundary case, the JMLE estimates based on finite data differ from their true values. are actually more accurate than the CMLE ones, This is termed estimation bias.

RASCH MODEL ESTIMATION: FURTHER TOPICS 103

though both are infinitely wrong. XMLE estima- comes negligibly small, even without explicit tion diverges on this analysis, which, though cor- correction. rect, is not helpful. Here is a case where incor- The small rect, but sample estimation bias for PAIR finite, estimates are more useful than and MMLE correct, depends on technical details of the infinite ones. implementation. If we double the sample size to 6 persons, Estimation Symmetry keeping the same success ratio, there are now 4096 possible data matrices, of which 62 are es As Rasch measurement is applied in ever timable. CMLE again includes the likelihood for more diverse fields, the problem ofestimate sym- 2 inestimable matrices and JMLE includes all metry is coming more into prominence. In the 4096 possible matrices, The CMLE item esti- educational and health care fields, it is generally mates remain 0.69 logits apart, and the JMLE obvious what are the objects of measurement, the estimate remains 1 .39 logits. The exact MLE es- students or patients to be measured, and what are timate is now .891ogits apart. For 9 persons, the the agents ofmeasurement, the test items designed exact MLE estimate is .74 logits apart. As sample to perform the probing. But, even from the earli- size increases, the bias in the CMLE estimates est days, there are cases where this is uncertain. rapidly disappears. For JMLE it never does. In 's (1969) analysis of traffic The mostextreme example ofestimation bias accidents, he considered the number of fatal ac- is that of JMLE estimates obtained from a two- cidents in specific short road segments over spe item dichotomous test which we have just exam cific periods of time. With current Rasch meth- ined. The reported dispersion of the two items is odology, this situation could be easily modeled twice that of the CMLE values. This motivated using the rating scale, O=no fatalities, 1=1 fatal Wright and Douglas (1977) to propose an (L-1)IL accident, 2=more than one fatal accident. But bias correction for short tests, which, according to what is the object of measurement and what is Jansen, et al., (1988), largely eliminates the bias the agent? Is the object of the investigation to for tests of over 10 items. But, for long tests, or measure "danger inherent in different road seg- short tests comprising rating scales with many ments" or "danger inherent in driving at differ- categories, the JMLE estimation bias also be- ent times of day"? The choice of what are the Table 1

All 64 possible data matrices, each comprising 2 dichotomous items (rows)administered to 3 persons (columns)

000 000 000 000 000 000 000 000 000 001 010 011 100 101 110 111 100 100 100 100 100 100 100 100 000 001 010 011 100 101 110 111 010 010 010 010 010 010 010 010 000 001 010 011 100 101 110 111 110 110 110 110 110 110 110 110 000 001 010 011 100 101 110 111 001 001 001 001 001 001 001 001 000 001 010 011 100 101 110 111 101 101 101 101 101 101 101 101 000 001 010 011 100 101 110 111 011 011 011 011 011 011 011 011 000 001 010 011 100 101 110 111 111 111 111 111 111 111 111 111 000 001 010 011 100 101 110 111 104 LINACRE

objects of measurement and what are the agents between the observed and the expected, and the is arbitrary. local slopes of relevant mathematical functions. well suited to the This may seem to be an academic discus- Newton-Raphson is particularly mathematically well- sion until the data matrix is constructed, and es- Rasch model because of the ogive. But there timation undertaken. Though Rasch models, like behaved nature of the logistic (1), are symmetric in the way objects and agents can be problems. enter into the analysis, estimation software rarely If there is only one estimate to be obtained, is. Most software provides a more detailed analy- e.g., the measure of a person on a set of items of sis for items than for persons. More perplexing known difficulty, then Newton-Raphson can work to the user, however, is the fact that transposing smoothly and quickly . In order to avoid overflow the data matrix can produce non-equivalent sets or loss of precision during computation, it is ad- ofestimates. JMLE and PROX estimates are sym- visable to choose an initial estimate more central metric. Perforce, MMLE must be asymmetric than the final estimate, and also to limit changes because its distributional specification is one- in the estimate to one logit per iteration. sided. CMLE and PAIR estimates are asymmet- Conceptually, Newton-Raphson is operating ric, but the extent of the mismatch depends on on a plane, a two-dimensional space. Under most the methods used to obtain object estimates from circumstances, however, many parameters are the agent estimates. estimated simultaneously. For instance, for a di- Newton-Raphson : A Cautionary Tale chotomous test of L items, L-1 free parameters are usually estimated and one constraint imposed. Except for the PROX method applied to This means that Newton-Raphson is operating in complete data (Cohen, 1979), the estimation of an L-dimensional space, but suggesting changes Rasch measures requires iteration. An initial es for each estimate in terms of a local 2-dimen- timate of a measure is made. The implications of sional space. This can leadto the situation shown this measure are compared with the data. A cor- in Figure 2. rection is made to the initial measure in an at- is made. Then, tempt to bring its implications into closer con- In Figure 2, an initial estimate function, and formity with the data. Many commonly used based on the slope ofthe likelihood the observed iterative methods are based on the Newton- the amount ofdiscrepancy between revised, second estimate Raphson approach. These methods provide ad- and expected values, a estimate produces a re- justments based on the sizes of discrepancies is produced. The second

Measure Estimate Figure 2 . Effect of misestimation on observation fit.

RASCH MODEL ESTIMATION : FURTHER Topics 105

vised slope and discrepancy, leading to a third ways be thought of as "close", i.e., "different from estimate. In principle, these estimates rapidly the estimate for an adjacent raw score", if the approach the best estimate, i.e., the value of the expected raw score corresponding to that esti- parameter most likely to have generated the ob- mate is within 0.5 score points of its observed served data. value. This places the estimate within 0.5 (SEZ) In Figure 2, however, the Newton-Raphson logits-of its most likely value, where SE is the approach is failing to converge on the best esti- standard error of the estimate. This level of pre- mate. The slope of the mathematical function is cision in estimation is not usually required in such that the estimates arejumping from one side complete data sets. For instance, for a typical of the best estimate to the other. This oscillating dichotomous item, with p-value of .8, adminis- behavior is usually an indication that the likeli- tered to 300 persons, the value of 0.5 (SEZ ) = hood function is rather flat, so that a range of .01 logits is too small to have any noticeable im- estimates are almost equivalent. pact on fit computations or almost any substan- tive decision, and much less than the difficulty In Figure 2, it is seen that the third estimate is estimate's S.E. of . 14 logits. slightly further away from the best estimate than was the initial estimate. This type of divergent For most estimation methods, it is the esti- behavior has been observed with 3-PL estimation mates corresponding to almost extreme scores (Stocking, 1989). I have also observed this to oc- that are the last to become close, i.e., to converge. cur with Rasch estimation ofincomplete data sets, Consequently, methods that announce conver- containing weakly connected subsets of data, or gence based on the small size ofa change in some when there are categories of a rating scale that are average or summary indicator may noticeably rarely observed. Under these circumstances, other misestimate almost extreme measures, usually by methods may be more robust. These methods in- making them too central. This also tends to make clude proportional curve-fitting (implemented in their fit too good. Winsteps) and Monte Carlo techniques. Figure 3 plots the reported standardized re- Misestimation, Convergence and Fit siduals (on the y-axis) for dichotomous observa- tions for different abilities relative to an item (the Paradoxically, as estimates of measures im- contours), and for different amounts of misesti- prove, their fit to the Rasch model can become mation (on the x-axis) . It is seen that that the stan- worse. Consider a dichotomous test, such as the dardized residuals corresponding to unexpected Knox Cube Test (Wright and Stone, 1979) . Sup- responses by persons far from an item are those pose that due to gross misestimation, every item that are most sensitive to misestimation . A one- was estimated to have the same difficulty, and logit misestimation associated with outlying un- every person was estimated to have an ability expectedresponses canreduce their standardized exactly targeted on that difficulty. Then the model residuals by 30%, effectively halving their expectation for each person on each item is .5, squared residuals. These are the values that most and the model binomial standard deviation of an influence conventional chi-square statistics, such observation about its expectation is 0.5. In fact, as Wright's OUTFIT statistic, and, in a similar the observations are failure (0) or success (1). way, likelihood-ratio tests. All the "observed-expected" residuals are 0.5, so every single standardized residual is 1 .0. Chi- The Growing Family of Rasch Models square statistics would report stochastically per- The "growing family of Rasch models" fect model fit! (Rost, 2000) presents both opportunities and chal- The moral of this story is that fit statistics lenges. Opportunities include widening the ap can be misleading if the estimates are far from plication of Rasch measurement methodology, the latent parameter values. An estimate can al- and making better use of it in current areas of

106 LINACRE

application . Challenges include more flexible Inpractice, arating scale definition is neither measurement models and methods of estimation. completely defined nor completely arbitrary. For categories, the implications of each Rating Scale Estimation scales of a few are usually fairly clear-cut, and, Rasch models, like Under Rasch model conditions, rating scale Equation (1) are usually straight-forward to esti- categories are conceptualized to be ordered in- mate. But as the number of categories increases, dicators of performance on an item. The essen and their meanings become less clear-cut, the pa- tial observation is the count of such categories rameterizationofindividual categories can become up from the lowest category, whether or not that insecure. Categories may have very low frequen- is the process used by the respondent on the way cies, or even not be observed. to being observed in that category. This implies There is a mathematical device to maintain that a category has a specific meaning, i.e., is unobserved intermediate categories in the rating "well-enough defined" (Masters and Wright, scale structure (Wilson, 1991). If category x is not 1984). An obvious rating scale is the two cat- observed, then Equation (2) can be rewritten, egory, "wrong", "right" scale associated with many dichotomous multiple-choice items . log ln - (2S, + Zx-I,x+I) (20) This contrasts with the arbitrary categoriza- =2 tion inherent in, say, the Graded Response model and other models which attempt to be "invariant where ,T,- , ,,+ , parameterizes thepoint at which cat- under the grouping of adjacent response catego- egories x-1 and x+1 are equally probable, and ries" (McCullagh, 1985, p.39) . For these models, Tc~ . x = 0. But this device may not be satisfactory categories, though ordered, are not qualitatively from the standpoint of inference, because it pre- different, but are merely defined for convenience. dicts that category x will never be observed. Such categorizations include arbitrary stratifica- RUMM2010 produces estimates for a tions ofpercents, times, distances, weights and the reparameterization of the rating scale structure like. These are often encountered in a large-scale in terms of its mean ("location"), dispersion surveys, such as a national census.

-2 -1 a 1 2 3 Logit Misestimate of Ability Measure Figure 3. Effect of misestimation on observation fit.

RASCH MODEL ESTIMATION : FURTHER TOPICS 107

("unit"), skewness and kurtosis. For long rating tance along the latent variable corresponds to a scales, this enables the inclusion of unobserved large distance along the rating scale. This is bad categories and regularizes the probabilistic rela- if each category is intended to correspond to a tionship of all the categories. For instance, if the broad step in development . analyst believes that the categories define a regu- Empirically incorrect category and thresh- lar progression, then this can be imposed on the old ordering are among aspects of the function- estimates. Consequently, local, accidental varia- ing ofrating scales discussed in Linacre (2002a). tion in the use of categories is smoothed out. Sim- Omitted, however, from that discussion is the plification, however, comes at a cost. Just as with condition necessary for the existence of super- the assertion of distributions for MMLE, care modal categories . Super-modal categories are must be taken to verify that the summary struc- those, that in some region of the latent variable, ture does not hide important features of the rat- have a higher probability of being observed than ing scale. For instance, a long bipolar scale might all other categories combined. This contrasts with be conveniently summarized by mean and dis- modal categories which may not go beyond be- persion parameters, but these might obscure a ing more probable than any other single category. malfunctioning central "don't know" category. The condition for supermodality is that the ik ad- Equation (2) can be expressed in terms of a scale vance by (or that ~/ 2 is) at least 1 .1 + m/10 logits. structure mean, lt, and dispersion, 4, as: Figure 4 illustrates non-modal, modal and super-modal categories. In this Figure, the model log =#,, - ( u, + ( 2x-m ) ~ ) (21) probabilities of a person, of given ability rela tive to the item difficulty, being observed in each In this formulation, negative ~ would indicate category are shown. Extreme categories, in this "disordered thresholds" . Disordered thresholds case 0 and 4, are always super-modal, because do not imply disordered categories . Disordered their probabilities asymptotically approach 1 .0 categories represent a substantive disagreement at the extremes ofthe latent variable. Category 3 between the ordering of the categories and the is here also super-modal as its peak probability orientation of the latent variable. An example is exceeds .5 . Category 2 is modal because its peak a negatively-worded statement on an attitude sur- probability is higher than the probability of any vey. Unless responses to that item are reverse- other category at some point on the latent vari- scored, its categories will be disordered, in fact able. Category 1 is non-modal. Either category 0 reverse-ordered, relative to the latent variable. or category 2 is always more probable than cat- Disordered categories are usually accomapnied egory 1 . It is seen that the points of equal prob- by gross category-level and item-level misfit. ability between categories 0 and 1, tip , and be- tween categories 1 and 2, tie, are in the reverse Disordered thresholds reflect the existence order of the categories along the latent variable. of non-modal categories, i.e., categories that do These are the "disordered thresholds" . not have a higher probability of being observed than the other categories at any point along the Combinations of item structures and item latent variable. In principal, these present no ad- discriminations ditional estimation problems unless the non- An area ofestimation that is only starting to modal category is unobserved, i.e., a sampling be addressed is the combining of different Rasch or incidental zero. In this case, the device de- mgdels in one analysis. A quality of life instru scribed in (20) above may be employed. ment, for instance, may include true-false items, Disordered thresholds indicate that a cat- Likert scales, frequency scales, intensity scales, egory corresponds to a narrow interval on the Poisson counts of events, Bernoulli trials and latent variable. This is good if the item is intended paired comparisons all in the same instrument. to be highly discriminating, so that a small dis- Rasch estimation methods vary greatly in their

108 LINACRE

capacity to encompass different models in one cern. is not yet so advanced, but, analysis. JMLE is proving to be the most flex- provided the impact of the sub-dimensions is seen ible. Facets (Linacre, 2002b) can co-calibrate all to be small, there is little practical motivation for the item types just listed. reporting two or more measures whose meaning, An obvious extension of Equation (2) to ac- for decision-makers, is identical. commodate several different rating scales struc- Differently discriminating sub-groups of tures within the same instrument is: items are more awkward to manage. These are equivalent to mixing Celsius and Fahrenheit tem peratures in the same analysis without conver- log = ,8, - ( 8i.t + a,, ) (22) like every ther- ~~nig(x-l) J sion. Of course, every item, just mometer, has slightly different discrimination . in where g indicates a group of items which share Linacre (2000) indicates that discriminations probit scale) can the same rating scale response structure . If all the range of 0.5-1 .5 (0.9-2.5 on items are assigned to the same one group, this is be usefully accommodated. the Rating Scale (Andrich, 1978) model. If each For wider ranges of discrimination, or for item is assigned to its own group, this is the "Un- sub-groups comprising relatively high or low dis- restricted" (Andrich, et al., 2000) or "Partial criminating items, OPLM (Verhelst, et al., 1995) Credit" model (Masters, 1982). permits the imputing ofdiscrimination parameters It is typical, however, for each item group- as though they are known constants. This differs ing to exhibit its own sub-dimension, and also to from the 2-PL IRT model in which discrimina- have its own level of discrimination on the latent tions are estimated simultaneously with the item variable . Different sub-dimensions would be difficulties . Verhelst and Glass (1995) suggest equivalent to incorporating temperature readings statistics which may be useful in guiding the from alcohol thermometers, mercury thermom- choice of the discrimination constants. A more eters and thermocouples in the same analysis. direct option may be to read them off a graph, Construction of thermometers is now so regular- such as that provided in Linacre (2000). An al- ternative, if the subgroups of items are long ized, that this is no longer thought to be of con-

-2 -1 0 1 2 3 Logit Measure relative to Item Difficulty Figure 4. Super-modal, modal and non-modal categories.Table 1 :

RASCH MODEL ESTIMATION : FURTHER TOPICS 10 9

enough, is to analyze each subgroup separately, Haberman, S. (1977). Maximum likelihood esti- and then to combine the resulting estimates after mates in exponential response models. The conversion to a common linear scale, in exactly Annals of Statistics, 5, 815-841 . the same way that Celsius and Fahrenheit tem- Jansen, P. G., Van den Wollenberg, A. L., and peratures are routinely combined. Wierda, F. W. (1988). Correcting uncondi- Conclusion tional parameter estimates in the Rasch model forinconsistency . Applied Psychological Mea- The remarkable variety and adaptability of surement, 12, 297-306. Rasch measure estimation algorithms already Karabatsos, G (2001). The Rasch model, addi- supports the analysis of a vast range of ordered tive conjoint measurement, and new models categorical data. Those few technical intricacies of probabilistic measurement theory. Journal that actually have substantive implications can ofApplied Measurement, 2, 389-423. usually be overcome by reconceptualizing the analytical problem or applying an alternative es- Kelderman, H., and Steen, R. (1988). LOGIMO Loglinear and Loglinear IRT timation method. The challenge is no longer to Model Analy- estimate measures, it is to understand and com- sis. Chicago : Scientific Software. municate their meaning. Lee, O.-K. (1991). Convergence: statistics or substance? Rasch Measurement Transactions, References 5, 172. Andersen, E. B. (1973). Conditional inference Linacre, J. M. (1989). Many-Facet Rasch Mea- for multiple-choice questionnaires . British surement. Chicago: MESA Press . Journal ofMathematical and Statistical Psy- Linacre, J. M. (1999). Understanding Rasch mea- chology, 26, 31-44. surement: estimation methods for Rasch Mea- Andrich, D. (1978). A rating scale formulation sures. Journal of Outcome Measurement, 3, for ordered response categories . 381-405 . Psychometrika, 43, 561-573. Linacre, J. M. (2000). Item discrimination and Andrich, D., Lyne, A., Sheridan, B ., and Luo, G. infit mean-squares . Rasch Measurement (2000) RUMM2010 Computer Program. Transactions, 14, 743. Perth, Australia: Rumm Laboratory Pty. Ltd. Linacre, J. M. (2002a). Understanding Rasch Berkson, J. (1944). Applications of the logistic measurement: Optimizing category effective- function to bio-assay. Journal of the Ameri- ness. Journal ofApplied Measurement, 3, 85- can Statistical Society 39, 357-365 106. Camilli, G. (1994). Origin of the scaling constant Linacre, J. M . (2002b). Facets Rasch Measure- d=1.7 in . Journal ofEdu- ment Computer Program . Chicago : Facets. cational and Behavioral Statistics, 19, 293-5. com Cohen, L. (1979). Approximate expressions for Linacre, J. M. (2002c) . WINSTEPS Rasch Mea- parameter estimates in the Rasch model . Brit- surement Computer Program . Chicago: ish Journal of Mathematical and Statistical Winsteps.com Psychology, 32, 1, 13-120. Masters, G N. (1982) A Rasch model for partial Fischer, G H, and Molenaar, l. W. (1995) Rasch credit scoring . Psychometrika, 47, 149-174. Models: Foundations, Recent Developments, Masters, G. N., and Wright, B. D. (1984). The andApplications. New York: SpringerVerlag. essential process in a family of measurement Fisher, R. A. (1922). On the mathematical foun- models . Psychometrika, 49, 529-544. dations of theoretical statistics. Proceedings McCullagh, P. (1985). Statistical and scientific ofthe Royal Society, 222, 309-368 . aspects of models for qualitative data. In P. 11 0 LINACRE

Nijkamp, et al., (Eds.), Measuring the tion of TestProperties. (Research Report RR- Unmeasurable (pp. 39-49). Dordrecht, The 89-5). Princeton, NJ : ETS. Netherlands: Martinus Nijhoff. Thissen, D. (1991). MULTILOG IRT Computer Mislevy, R. J., and Bock, R. D. (1996). BILOG Program . Chicago: Scientific Software. computer program. Chicago: Scientific Soft- Verhelst, N. D., and Glas, C. A. W. (1995) The ware International. one parameter logistic model. In G. H. Fischer, National Physical Laboratory. (1955) . Calibra- and I. W. Molenaar (Eds.), Rasch Models: tion of Temperature Measuring Instruments. Foundations, Recent Developments, and Ap- London: HMSO. plications. New York: Springer Verlag. Nickerson, C.A. & McClelland, G.H. (1984). Verhelst, N. D., Glas, C. A. W., and Verstralen, Scaling distortion in numerical conjoint mea- H. H. F. M. (1995). OPLM: One Parameter surement. Applied Psychological Measure- Logistic Model. Computer program and ment, 8, 182-198 . manual. Arnhem, The Netherlands: CITO. Pfanzagl, J. (1994). On item parameter estima- Wilson, M. (1991) Unobserved categories. Rasch tion in certain latent trait models. In G H. Measurement Transactions, S, 128. Fischer & D. Laming (Eds.), Contributions to Wright, B. D. (1988). The efficacy of uncondi- Mathematical Psychology, Psychometrics and tional maximum likelihood bias correction. Methodology. New York: Springer Verlag. Applied Psychological Measurement, 12, 315- Rasch, G (1960). Probabilistic Modelsfor Some 318. Intelligence and Attainment Tests. Wright, B. D., and Douglas, G A. (1977). Best Copenhagen: Danish Institute for Educaitonal procedures for sample-free item analysis. Ap- Research. (Expanded edition, 1980. Chicago : plied Psychological Measurement, 1, 281- University of Chicago Press.) 294. Rasch, G (1969). Models for description of the Wright, B . D ., and Masters, G N. (1982). Rating time-space distribution of traffic accidents. scale analysis. Chicago: MESA Press. Symposium on the Use ofStatistical Methods Best test in the Analysis of Road Accidents. Wright, B . D., and Stone, M. H. (1979). design. Press. Copenhagen: Organization for Economic Co- Chicago: MESA operation and Development. Wu, M. L., Adams, R. J ., and Wilson, M. R. ACER ConQuest: generalized item re- Rost, J. (2000). The growing family of Rasch (1998). sponse modelling software . models. Chapter 2 in A. Boomsma, M. A. J. Melbourne, Aus- tralia: Australian Council for Educational van Duijn, T. A. B. Snijders (Eds .), Essays on Research. Item Response Theory. New York: Springer Verlag. Yule, G. U. (1925). The growth of population and Sheng, S., and Carriere, K. C. (2002) An im- the factors which control it. Presidential ad- Journal of the Royal Statistical Soci- proved CML estimation procedure for the dress. ety, 88, 1-62. Rasch model with item response data. Statis- tics in Medicine, 21, 407-416. Zwinderman, A. H. (1995). Pairwise parameter estimation in Rasch models. Applied Psycho- Stocking, M. L. (1989). Empirical Estimation logical Measurement, 19, Errors in Item Response Theory as a Func- 369-375 .