JOURNAL OF APPLIED MEASUREMENT, 10(2), Copyright © 2009

Local Independence and Residual Covariance: A Study of Olympic Ratings

John M. Linacre
University of the Sunshine Coast, Australia

Rasch fit analysis has focused on tests of global fit and tests of the fit of individual parameter estimates. Critics have noted that slight, but pervasive, patterns of misfit to a Rasch model within the data may escape detection using these approaches. These patterns contradict the Rasch axiom of local independence, and so degrade measurement and may bias measures. Misfit to a Rasch model is captured in the observation residuals. Traces of pervasive, but faint, secondary dimensions within the observations may be identified using factor analytic techniques. To illustrate these techniques, the ratings awarded during the Pairs Figure Skating competition at the 2002 Winter Olympics are examined. The intention is to detect analytically the patterns of rater bias admitted publicly after the event. It is seen that the one-parameter-at-a-time fit statistics and differential item functioning approaches fail to detect the crucial misfit patterns. Factor analytic methods do. In fact, the competition was held in two stages. Factor analytic techniques already detect the rater bias after the first stage. This suggests that remedial rater retraining or other rater-related actions could be taken before the final ratings are collected.


Introduction

Local independence is a property required of measures. As Wright (1996) remarks, statistical independence in data occurs when the value of one datum has no influence on the value of another. Thus, the outcome of a "head" for a coin toss does not increase the probability that the next toss produces a "tail". Local independence specifies that the value of one datum has no influence on another once the underlying variable, the latent trait or dimension, has been accounted for (conditioned out). Local independence includes, but goes beyond, unidimensionality. Including the same math item twice in a math test would not alter its substantive dimensionality. Yet an examinee would be expected to either succeed or fail on both items together, so responses to the two identical items would not be locally independent.

Successful implementation of Rasch measurement requires items that approximate local independence. The relationship between fit statistics and dimensionality is not direct. Every item contains many dimensions. The intention of the test constructor is that, when a set of items is compiled into a test, what they share together will be the dimension that the test constructor intends, and, further, that particular dimension will overwhelm all the other little dimensional differences between the items. It can then be said that what the items share is their "dimension", and all the other dimensions in each item are acting like random noise.

Of course, this ideal is never realized in practice. Some items have more of what all items share, and some have less. This is basically what item-level fit statistics are reporting. If there is an item that only consists of material shared with other items, then it will tend to be overly predictable. An item of this type is the "overall impression" item commonly printed at the end of customer satisfaction surveys.

From a naive perspective, it would seem that the more highly predictable items in a test are the best items. If it is fairly certain that lower performers will fail on an item, and higher performers will succeed on it, then that item has high discrimination. In classical test theory (CTT), these are regarded as the best items. In Rasch theory, however, responses to such items may be seen to be too predictable from the responses to the other items. This indicates that they lack the local independence required for objective measurement. Indeed, even in CTT, extreme over-predictability leads to the "attenuation paradox" in which increases in test reliability actually reduce test statistical validity (Loevinger, 1954).

If there is an item with nothing in common with the other items, it will tend to be entirely unpredictable. It will indeed represent another dimension, but may be the only representative of that dimension. Such other dimensions include "misprints," "miskeys," "data entry errors" and the like. Most commonly encountered Rasch quality-control fit statistics are designed to flag statistically unexpected responses or response patterns by considering the performance of one respondent or one test item at a time. These statistics are very powerful for detecting guessing, carelessness, response sets, social conformity, miskeyed items, data entry errors and the like. These, however, are not what is usually meant by test multi-dimensionality. It cannot be determined from one-item-at-a-time statistics whether the off-dimension behavior in an item is shared with other items (i.e., forms another dimension) or is unique to each item. Even so, this approach is not hopeless, because the analyst can often guess at the definition of shared secondary dimensions by looking at the item content and response patterns.

Dimensionality Detection

What is required is a technique for discovering shared commonalities among some items which are not explained by the shared dimension. In fact, local deviations from the model specification of local independence "can be measured by the size of residual covariances. Unfortunately, some computer programs for fitting the Rasch model do not give any information about these. A choice would be to examine the covariance matrix of the item residuals, not the sizes of the

residuals themselves, to see if the items are indeed conditionally uncorrelated, as required by the principle of local independence" (McDonald, 1985, p. 212).

In fact, not only does the covariance matrix of item residuals identify correlated items, it also facilitates the identification of patterns of shared correlations through factor analysis of the matrix. This indicates which items are operating together in an other-dimensional way, and how much this perturbs the variance structure in the data.

Not all multidimensionality is unintended or unproductive. Papers which discuss other approaches to the investigation of multidimensionality, such as Wang, Wilson and Adams (1997) and Chen and Davison (1996), are included in the reference list.

A Factor-analytic Approach

Smith and Miao (1994) demonstrate that factor analysis is a useful method for identifying multidimensionality in data that has been constructed to be unidimensional. They analyze the matrix of raw responses, but this has the drawback that the first factor only approximates the Rasch dimension. Indeed, multiple factors have been reported in data known to fit the Rasch model. "One might expect the emergence of only one factor when a factor analysis would be performed on all newly defined subsets [of unidimensional items]. However, factor analysis of the newly defined subsets yielded two factors. Further inspection of the factor plot showed that the emergence of a second factor could be considered as an artefact due to the skewness of the subset scores" (Van der Ven and Ellis, 2000).

This suggests an improved method of factor analysis in which the non-linearity and distributional aspects of the raw scores are removed prior to the investigation of multi-dimensionality (Wright, 1996):

1. Perform a conventional Rasch analysis. The resulting Rasch dimension is analogous to a first factor in a factor analysis, but now in a linear framework.

2. For each observation, X, compute its expected value according to the measurement model, E, and the model variance of the observed about the expected, Q. (For dichotomous data, Q is Jacob Bernoulli's binomial variance.) The part of the observation not directly explained by the measurement model, the residual, is X-E. This is still in the raw score metric.

3. Normalize the residuals. Each residual indicates how much locally easier or harder that item was than expected (or how much more or less locally competent the examinee). These residuals are anticipated to be the outcome of an infinity of infinitesimal perturbations of the expected response, i.e., to follow a normal distribution. The model variance of that distribution, for each response, is Q. So the standardized residuals, (X-E)/√Q, are modeled to conform to an N(0,1) distribution.

4. For each pair of items, compute the Pearson correlation between the standardized residuals across all examinees who responded to both items. Potentially locally dependent pairs of items will have high positive correlations (e.g., items which embody the same perspective on a sub-dimension) or high negative correlations (e.g., items which embody opposing perspectives).

5. Perform a factor analysis of the item correlation matrix. The first factor in the residuals, reported here, is conceptually the second factor overall, because the Rasch dimension is the first factor overall. This second overall factor identifies the strongest shared pattern in local dependency among the items as reflected in their correlations. Subsequent factors in the residuals may also be useful diagnostically, but reflect weaker patterns in the data.

If the data accord with the Rasch model, i.e., the data exhibit local independence, the standardized residuals for each item will load on an individual item-specific, statistically unique, factor. There will be no commonality, and so no second shared dimension. When this holds, the data are effectively unidimensional or perhaps entirely random. Divgi (1986) provides an example

of randomness, the fitting of the Rasch model to coin tosses. In this case, the Rasch dimension has collapsed to a point.

Factor analysis of residuals is an attempt to find more dimensions in the data beyond the Rasch dimension, i.e., to falsify the Rasch model specification of local independence in the data. Accordingly the factor analytic approach of choice is the one that most strongly asserts that the residual variance is explained by shared dimensions. It is the principal components approach that does this. "If one were to make his choice entirely upon statistical considerations, a rather natural approach would be to represent the original set of variables in terms of a number of factors, determined in sequence so that at each successive stage the factor would account for the maximum variance. This statistically optimal solution, the method of principal axes [components], was first proposed by Pearson [1901], and in the 1930s, Hotelling provided the full development of the method." (Harmon, 1960, p. 4).

Two practical issues arise in identifying dimensions from principal components analysis of residuals (PCAR). First, when is a factor or component large enough to indicate the presence of substantive dimensions? Kaiser's (1960) rule of factor size, i.e., eigenvalue, larger than 1.0, would report many secondary dimensions, even for data simulated to fit a Rasch model. A substantive secondary dimension within the data would require at least two items to define it, so a minimum eigenvalue of 2.0 is more meaningful. Combining this with Cattell's (1966) graphical scree-plot approach would appear to give a more reasonable report on dimensionality for longer tests. A criterion value for no secondary dimensionality in the data can be estimated by means of data simulated from the empirically-derived measure distribution.

The second practical issue is how the PCAR components are to be interpreted as dimensions. In the conventional common-factor approach, the factors are made oblique and rotated until each factor can be interpreted as "one dimension". In the PCAR approach conceptualized here, the components are orthogonal and unrotated. The first PCAR component explains as much of the residual variance in the data as possible. This variance has had the Rasch dimension removed. Consequently it reflects a contrast, not between the Rasch dimension and a secondary dimension, but between two secondary dimensions. For instance, in an arithmetic test, the Rasch dimension would be defined by easier and harder "arithmetic" items. The first component in the PCAR could contrast "addition" with "subtraction". Thus PCAR components are contrast factors, contrasting one sub-dimension with another.

A Case Study: Olympic Pairs Figure Skating

The ratings of the 2002 Olympic Pairs Figure Skating competition in Salt Lake City, Utah, provide an illustration of the diagnostic characteristics of different approaches to detection of failures in local independence. After this competition was completed, one judge admitted to bias in the awarding of ratings in order to favor skaters also known to be favored by certain of the other judges. This case study attempts to identify this aberrant judge based solely on the reported ratings.

The Pairs event takes place in two stages. All 20 pairs competing skate a "short" program containing compulsory elements. Then they skate a long "free" program. Each program is rated by nine judges on two aspects of skating performance, "technical merit" and "artistic impression". The rating scale extends from 0.0 to 6.0, but only the upper end of the scale is used in Olympic competition.

Though the awarding of medals is based on the ratings, this is not done directly. Instead, a complex set of procedures is followed, intended to increase the fairness of the final result. The judges, however, are not only expert raters of skating performance, they also have an expert understanding of the scoring procedures. In this paper, only the reported ratings are considered. Subsequent scoring procedures are not addressed.
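Before turning to the skating data, the residual-based procedure of steps 1 to 5 above, together with the simulated eigenvalue baseline just discussed, can be sketched in code. This is an illustrative sketch only: it assumes a dichotomous Rasch model with made-up person and item distributions, not the rating-scale analysis of the skating data reported below, and it is not the Winsteps implementation.

```python
import numpy as np

def residual_pca(X, theta, b):
    """Steps 2-5: standardized residuals, their inter-item Pearson
    correlations, and the eigenvalues/loadings of that correlation matrix.
    X: persons-by-items matrix of 0/1 responses;
    theta, b: person measures and item difficulties from step 1."""
    E = 1.0 / (1.0 + np.exp(b[None, :] - theta[:, None]))  # expected values
    Q = E * (1.0 - E)                        # binomial model variance
    Z = (X - E) / np.sqrt(Q)                 # standardized residuals ~ N(0,1)
    R = np.corrcoef(Z, rowvar=False)         # item-pair residual correlations
    eigvals, loadings = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]        # largest component first
    return eigvals[order], loadings[:, order]

# Simulation baseline for the eigenvalue criterion: even data that fit
# the Rasch model exactly yield a first residual eigenvalue above
# Kaiser's 1.0, which is why a larger criterion (2.0, or a simulated
# baseline) is preferred.
rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1.0, 500)            # assumed person distribution
b = np.linspace(-2.0, 2.0, 20)               # assumed item difficulties
E = 1.0 / (1.0 + np.exp(b[None, :] - theta[:, None]))
X = (rng.random(E.shape) < E).astype(float)  # Rasch-fitting data
eigvals, loadings = residual_pca(X, theta, b)
```

In a real analysis, an observed first residual eigenvalue well above this simulated baseline (and above 2.0) points to a substantive secondary dimension, and the loadings on that component identify which items contrast.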

What the judge did

Consider the self-admitted biased ratings of the French judge in this competition. This bias concerns the awarding of the Gold and Silver medals. In Table 1, the ratings of all the judges for the top two pairs are shown. The French judge rated in favor of the leading Russian pair (Berezhnaya and Sikharulidze: BS) as opposed to the leading Canadian pair (Sale and Pelletier: SP). Notice the subtlety of the bias. It is not apparent to the eye. The French judge generally awarded ratings of 5.8. Exceptions were a rating of 5.9 to the Russian pair and 5.7 to the Canadian pair. This slight favoring of the Russian pair as opposed to the Canadian pair was enough to give the Russians the Gold Medal. None of the French judge's ratings, however, were out of line with those given by other judges. Indeed, the USA judge rated the Canadians lower at 5.6. So, can the French judge's bias be detected, and, if so, how?

Constructing Measures of Performance

An apparent intention of the judging plan is that pairs can exhibit different levels of competence on different aspects of different programs, but the judges should maintain the same level of leniency or severity throughout the judging session. A convenient way to analyze this is to conceptualize each pair's performance on each aspect of each program as a separate case, with its own measure of competence. The judges then become the "items" against which those case-competencies are rated. Table 2 shows the entire dataset of 9 judges rating the 20 pairs who participated in the Olympic event, with the pairs listed in their rank order at the end of the competitive phase of the competition.

For analytic convenience, the original ratings have been converted to integers by ignoring decimal points. It is seen that only ratings in the range from 29 to 59 were observed. 60 corresponds to a flawless performance and is rarely awarded. For the purposes of this analysis, ratings below 29 and above 59 were treated as structural zeroes and were not modeled. A rating of 30 was also not observed, but this was treated as an incidental zero, and maintained in the ordinal structure for the purposes of constructing Rasch measures (Wilson, 1991). The judges were modeled to share the same understanding of the rating scale, i.e., to rate in accord with the Andrich (1978) "Rating Scale" version of a Rasch model.

This dataset was subjected to a Rasch measurement analysis using Winsteps (Linacre, 2002). The resulting performance measures cover a range of 25 logits, from the best scored performance, by the Russian pair, BS, for their Free program Artistic impression, to the worst scored performance, by the Armenian pair, KZ, for their Short program Technical merit.

Figure 1 displays the top measures. Each pair appears 4 times, comprising Free and Short programs, Technical merit and Artistic impression. The total raw scores for BS and SP were the same. BS had the highest scored performance component and SP the lowest scored for these two pairs, but 3 of SP's components were higher scored than 3 of BS's.
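Under the Andrich Rating Scale model just described, each judge-performance encounter has a model expected rating E and model variance Q. A minimal sketch of that computation follows; the measure, severity, and threshold values used in the example are hypothetical, not the estimates from this analysis.

```python
import numpy as np

def rsm_expectation(measure, severity, thresholds):
    """Andrich Rating Scale model: the probability of category k is
    proportional to exp(sum over j<=k of (measure - severity - tau_j)),
    where the tau_j are the rating-scale thresholds shared by all judges.
    Returns the expected rating E and the model variance Q."""
    steps = measure - severity - np.asarray(thresholds, dtype=float)
    logits = np.concatenate(([0.0], np.cumsum(steps)))  # category log-measures
    p = np.exp(logits - logits.max())                   # numerical stability
    p /= p.sum()                                        # category probabilities
    k = np.arange(p.size)
    E = float((k * p).sum())                            # expected category
    Q = float(((k - E) ** 2 * p).sum())                 # model variance
    return E, Q

# Hypothetical values: pair-performance measure 1.2 logits, judge
# severity 0.3 logits, three thresholds for a four-category scale.
E, Q = rsm_expectation(1.2, 0.3, [-1.0, 0.0, 1.0])
```

The residual for an observed rating X is then X - E, and the standardized residual (X - E)/√Q, exactly as in the dichotomous case.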

Table 1
Ratings of Top Two Pairs

Russian pair: BEREZHNAYA Elena / SIKHARULIDZE Anton (BS)

Program  Aspect               Russia  China  USA  France  Poland  Canada  Ukraine  Germany  Japan
Short    Technical merit      5.8     5.8    5.7  5.8     5.8     5.8     5.8      5.8      5.7
Short    Artistic impression  5.8     5.8    5.8  5.8     5.9     5.8     5.8      5.8      5.8
Free     Technical merit      5.8     5.8    5.7  5.8     5.7     5.7     5.8      5.8      5.7
Free     Artistic impression  5.9     5.9    5.9  5.9     5.9     5.8     5.9      5.8      5.9

Canadian pair: SALE Jamie / PELLETIER David (SP)

Program  Aspect               Russia  China  USA  France  Poland  Canada  Ukraine  Germany  Japan
Short    Technical merit      5.7     5.7    5.6  5.7     5.8     5.8     5.7      5.8      5.6
Short    Artistic impression  5.8     5.9    5.8  5.8     5.8     5.9     5.8      5.9      5.8
Free     Technical merit      5.8     5.9    5.8  5.8     5.8     5.9     5.8      5.9      5.8
Free     Artistic impression  5.8     5.8    5.9  5.8     5.8     5.9     5.8      5.9      5.9

Columns: nationality of judge, in the judge order of Table 2 and Table 4.

Table 2
Data file of Pair ratings

1 BS-Rus S T 58 58 57 58 58 58 58 58 57  1 BEREZHNAYA Elena / SIKHARULIDZE Anton : RUS
1 BS-Rus S A 58 58 58 58 59 58 58 58 58  2 BEREZHNAYA Elena / SIKHARULIDZE Anton : RUS
1 BS-Rus F T 58 58 57 58 57 57 58 58 57  3 BEREZHNAYA Elena / SIKHARULIDZE Anton : RUS
1 BS-Rus F A 59 59 59 59 59 58 59 58 59  4 BEREZHNAYA Elena / SIKHARULIDZE Anton : RUS
2 SP-Can S T 57 57 56 57 58 58 57 58 56  5 SALE Jamie / PELLETIER David : CAN
2 SP-Can S A 58 59 58 58 58 59 58 59 58  6 SALE Jamie / PELLETIER David : CAN
2 SP-Can F T 58 59 58 58 58 59 58 59 58  7 SALE Jamie / PELLETIER David : CAN
2 SP-Can F A 58 58 59 58 58 59 58 59 59  8 SALE Jamie / PELLETIER David : CAN
3 SZ-Chn S T 57 58 56 57 57 57 56 57 56  9 SHEN Xue / ZHAO Hongbo : CHN
3 SZ-Chn S A 57 57 57 57 56 56 57 56 55 10 SHEN Xue / ZHAO Hongbo : CHN
3 SZ-Chn F T 57 57 58 58 57 57 57 57 57 11 SHEN Xue / ZHAO Hongbo : CHN
3 SZ-Chn F A 57 58 58 57 57 57 58 57 57 12 SHEN Xue / ZHAO Hongbo : CHN
4 TM-Rus S T 56 56 54 56 56 56 56 54 54 13 TOTMIANINA Tatiana / MARININ Maxim : RUS
4 TM-Rus S A 56 57 55 56 56 54 56 55 55 14 TOTMIANINA Tatiana / MARININ Maxim : RUS
4 TM-Rus F T 58 56 56 57 56 55 55 55 56 15 TOTMIANINA Tatiana / MARININ Maxim : RUS
4 TM-Rus F A 56 56 56 56 56 54 57 55 57 16 TOTMIANINA Tatiana / MARININ Maxim : RUS
5 IZ-USA S T 55 54 53 55 55 55 55 55 55 17 INA Kyoko / ZIMMERMAN John : USA
5 IZ-USA S A 57 56 56 56 56 56 57 56 55 18 INA Kyoko / ZIMMERMAN John : USA
5 IZ-USA F T 55 56 57 56 54 55 56 56 55 19 INA Kyoko / ZIMMERMAN John : USA
5 IZ-USA F A 57 57 58 56 56 57 56 57 57 20 INA Kyoko / ZIMMERMAN John : USA
6 PT-Rus S T 55 55 54 56 55 55 55 54 55 21 PETROVA Maria / TIKHONOV Alexei : RUS
6 PT-Rus S A 56 55 54 57 55 54 56 54 56 22 PETROVA Maria / TIKHONOV Alexei : RUS
6 PT-Rus F T 57 55 55 55 54 56 55 55 55 23 PETROVA Maria / TIKHONOV Alexei : RUS
6 PT-Rus F A 56 55 54 55 55 54 56 54 56 24 PETROVA Maria / TIKHONOV Alexei : RUS
7 ZS-Pol S T 50 50 49 50 48 48 48 47 49 25 ZAGORSKA Dorota / SIUDEK Mariusz : POL
7 ZS-Pol S A 54 54 53 55 54 54 55 52 54 26 ZAGORSKA Dorota / SIUDEK Mariusz : POL
7 ZS-Pol F T 54 56 55 54 54 54 55 54 56 27 ZAGORSKA Dorota / SIUDEK Mariusz : POL
7 ZS-Pol F A 55 55 55 55 54 55 55 53 56 28 ZAGORSKA Dorota / SIUDEK Mariusz : POL
8 BD-Cze S T 51 52 49 52 53 50 49 52 50 29 BERANKOVA Katerina / DLABOLA Otto : CZE
8 BD-Cze S A 54 53 51 53 54 51 49 54 51 30 BERANKOVA Katerina / DLABOLA Otto : CZE
8 BD-Cze F T 52 54 52 53 52 53 52 53 53 31 BERANKOVA Katerina / DLABOLA Otto : CZE
8 BD-Cze F A 54 55 53 54 54 54 54 53 55 32 BERANKOVA Katerina / DLABOLA Otto : CZE
9 PT-Chn S T 51 50 48 50 48 48 50 50 49 33 PANG Qing / TONG Jian : CHN
9 PT-Chn S A 53 53 53 53 50 51 51 53 51 34 PANG Qing / TONG Jian : CHN
9 PT-Chn F T 54 55 53 54 52 53 53 54 52 35 PANG Qing / TONG Jian : CHN
9 PT-Chn F A 53 54 52 54 52 51 53 52 52 36 PANG Qing / TONG Jian : CHN
10 LF-Can S T 47 49 46 46 45 44 45 44 46 37 LARIVIERE Jacinthe / FAUSTINO Lenny : CAN
10 LF-Can S A 50 53 50 51 50 50 49 47 51 38 LARIVIERE Jacinthe / FAUSTINO Lenny : CAN
10 LF-Can F T 51 53 48 52 52 52 51 50 50 39 LARIVIERE Jacinthe / FAUSTINO Lenny : CAN
10 LF-Can F A 53 53 50 53 53 53 52 50 52 40 LARIVIERE Jacinthe / FAUSTINO Lenny : CAN
11 ZZ-Chn S T 52 53 53 52 53 50 52 53 52 41 ZHANG Dan / ZHANG Hao : CHN
11 ZZ-Chn S A 50 51 50 50 53 48 48 52 50 42 ZHANG Dan / ZHANG Hao : CHN
11 ZZ-Chn F T 52 52 51 51 51 52 51 51 52 43 ZHANG Dan / ZHANG Hao : CHN
11 ZZ-Chn F A 51 49 47 49 50 49 49 49 50 44 ZHANG Dan / ZHANG Hao : CHN
12 LA-Can S T 46 46 44 47 43 45 42 48 45 45 LANGLOIS Anabelle / ARCHETTO Patrice : CAN
12 LA-Can S A 51 51 51 53 48 51 48 52 51 46 LANGLOIS Anabelle / ARCHETTO Patrice : CAN
12 LA-Can F T 50 50 51 52 51 51 50 52 51 47 LANGLOIS Anabelle / ARCHETTO Patrice : CAN
12 LA-Can F A 52 52 52 52 52 52 51 52 52 48 LANGLOIS Anabelle / ARCHETTO Patrice : CAN
13 SD-USA S T 46 45 46 45 44 45 44 45 46 49 SCOTT Tiffany / DULEBOHN Philip : USA
13 SD-USA S A 53 52 52 52 51 52 50 50 53 50 SCOTT Tiffany / DULEBOHN Philip : USA
13 SD-USA F T 49 49 49 49 47 48 46 48 50 51 SCOTT Tiffany / DULEBOHN Philip : USA
13 SD-USA F A 52 52 53 50 50 51 49 51 53 52 SCOTT Tiffany / DULEBOHN Philip : USA
14 KJ-Ger S T 49 49 49 44 46 47 44 45 49 53 KAUTZ Mariana / JESCHKE Norman : GER
14 KJ-Ger S A 49 51 49 47 47 48 44 47 50 54 KAUTZ Mariana / JESCHKE Norman : GER
14 KJ-Ger F T 46 49 46 46 47 45 48 48 49 55 KAUTZ Mariana / JESCHKE Norman : GER
14 KJ-Ger F A 47 51 46 47 47 45 49 46 50 56 KAUTZ Mariana / JESCHKE Norman : GER
15 SM-Ukr S T 49 45 46 47 47 44 43 42 44 57 SAVCHENKO Aliona / MOROZOV Stanislav : UKR
15 SM-Ukr S A 51 50 48 50 51 50 46 47 48 58 SAVCHENKO Aliona / MOROZOV Stanislav : UKR
15 SM-Ukr F T 48 47 47 47 48 47 48 49 48 59 SAVCHENKO Aliona / MOROZOV Stanislav : UKR
15 SM-Ukr F A 50 47 46 48 50 46 48 50 48 60 SAVCHENKO Aliona / MOROZOV Stanislav : UKR
16 CP-Ukr S T 48 46 48 45 47 46 45 44 46 61 CHUVAEVA Tatiana / PALAMARCHUK Dmitri : UKR
16 CP-Ukr S A 49 50 50 48 49 49 48 48 49 62 CHUVAEVA Tatiana / PALAMARCHUK Dmitri : UKR
16 CP-Ukr F T 45 45 46 46 46 46 46 46 46 63 CHUVAEVA Tatiana / PALAMARCHUK Dmitri : UKR
16 CP-Ukr F A 44 46 44 46 46 44 47 45 47 64 CHUVAEVA Tatiana / PALAMARCHUK Dmitri : UKR
17 BB-Svk S T 39 41 39 41 40 37 39 38 36 65 BESTANDIGOVA Olga / BESTANDIG Jozef : SVK
17 BB-Svk S A 44 45 44 45 45 41 44 43 41 66 BESTANDIGOVA Olga / BESTANDIG Jozef : SVK
17 BB-Svk F T 40 43 43 43 41 42 40 41 43 67 BESTANDIGOVA Olga / BESTANDIG Jozef : SVK
17 BB-Svk F A 42 43 43 44 42 41 43 41 42 68 BESTANDIGOVA Olga / BESTANDIG Jozef : SVK
18 PS-Uzb S T 38 38 40 39 34 38 34 31 29 69 PONOMAREVA Natalia / SVIRIDOV Evgeni : UZB
18 PS-Uzb S A 43 43 45 45 43 43 41 40 39 70 PONOMAREVA Natalia / SVIRIDOV Evgeni : UZB
18 PS-Uzb F T 41 43 41 42 41 43 40 42 40 71 PONOMAREVA Natalia / SVIRIDOV Evgeni : UZB
18 PS-Uzb F A 42 42 40 43 41 43 41 42 39 72 PONOMAREVA Natalia / SVIRIDOV Evgeni : UZB
19 CP-Ita S T 36 37 36 40 35 38 36 34 36 73 COBISI Michela / de PRA Ruben : ITA
19 CP-Ita S A 42 40 42 42 42 42 42 41 42 74 COBISI Michela / de PRA Ruben : ITA
19 CP-Ita F T 41 40 40 41 39 38 41 40 41 75 COBISI Michela / de PRA Ruben : ITA
19 CP-Ita F A 41 41 39 42 40 37 42 39 41 76 COBISI Michela / de PRA Ruben : ITA
20 KZ-Arm S T 35 34 35 32 35 34 33 32 32 77 KRASILTSEVA Maria / ZNACHKOV Artem : ARM
20 KZ-Arm S A 40 39 40 38 40 38 40 38 37 78 KRASILTSEVA Maria / ZNACHKOV Artem : ARM
20 KZ-Arm F T 39 39 38 38 39 37 39 39 39 79 KRASILTSEVA Maria / ZNACHKOV Artem : ARM
20 KZ-Arm F A 38 38 38 38 38 36 40 37 39 80 KRASILTSEVA Maria / ZNACHKOV Artem : ARM

1-20: Rank order of pair immediately after competition; 1-80: Case number
BS-Rus: Initials of pair skaters and nationality
Programs: S = Short, F = Free
Aspects: T = Technical merit, A = Artistic impression
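The Rasch analysis just described produces, for every rating, an expected value E and model variance Q; the diagnostic statistics in the sections below are functions of the resulting standardized residuals. A minimal sketch of those computations, using hypothetical observed and expected values rather than the Winsteps estimates:

```python
import numpy as np

def standardized_residuals(X, E, Q):
    """z = (X - E) / sqrt(Q); |z| > 1.96 flags a rating as
    unexpected at roughly p < .05 (two-sided)."""
    X, E, Q = (np.asarray(a, dtype=float) for a in (X, E, Q))
    return (X - E) / np.sqrt(Q)

def infit_outfit(X, E, Q):
    """Mean-square fit statistics over one judge's (or one
    performance's) ratings. OUTFIT is the unweighted mean of the
    squared standardized residuals; INFIT is information-weighted,
    sum((X - E)^2) / sum(Q). Values near 1.0 indicate model-conforming
    ratings; > 1.0 noisy; < 1.0 overly predictable (locally dependent)."""
    X, E, Q = (np.asarray(a, dtype=float) for a in (X, E, Q))
    z2 = (X - E) ** 2 / Q
    return ((X - E) ** 2).sum() / Q.sum(), z2.mean()

# Hypothetical ratings by one judge, with model expectations/variances
X = [57, 58, 55, 50]
E = [56.5, 57.2, 55.9, 51.4]
Q = [1.1, 0.9, 1.2, 1.0]
z = standardized_residuals(X, E, Q)
infit, outfit = infit_outfit(X, E, Q)
```

These are sketches of the standard Pearsonian mean-square formulas, not the exact Winsteps computations.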

Figure 1. Measures for top-scored skating performances.

Logit
Measure   Pair Performance
19        1 BS-Rus F A
18
17        2 SP-Can F A
          2 SP-Can F T   2 SP-Can S A
16        1 BS-Rus S A
15
14
13        1 BS-Rus S T
12        1 BS-Rus F T
11        3 SZ-Chn F A
          3 SZ-Chn F T
10        2 SP-Can S T
          3 SZ-Chn S T   5 IZ-USA F A
9

Unexpected Ratings

In the same way as lucky guesses on a multiple-choice test manifest themselves as unexpected responses (Wright, 1977), a blatant mis-rating by a judge might also be detected as an unexpected rating. The Rasch model does predict that unexpected ratings will be observed on occasion, but, when that happens, further investigation is always merited. Is the observation merely the outcome of a confluence of random components, or is it the result of the intervention of another dimension spoiling the outcome that would be expected under conditions of local independence?

Table 3 shows the most unexpected ratings (p < .05) in this competition in order of decreasing unexpectedness. Only one rating of BS or SP appears on this list, and this is by the Polish judge. Even this is a rating of 5.9 for the Russian pair, BS, when just over 5.8 is its expectation. Under the Olympic scoring procedures, however, changing this rating to 5.8 would be unlikely to affect the outcome, because 5.9 was discounted in the Olympic scoring system as an extreme high rating.

Table 3
Unexpected Ratings by Skating Judges

Observed  Expected  Residual  St. Res.  Judge   Pair
29        35.36     –6.36     –3.02     9 Jap   18 PS-Uzb S T
59        58.10     .90       2.61      5 Pol   1 BS-Rus S A
49        51.89     –2.89     –2.52     7 Ukr   8 BD-Cze S A
48        44.46     3.54      2.41      8 Ger   12 LA-Can S T
58        56.19     1.81      2.39      1 Rus   4 TM-Rus F T
39        42.25     –3.25     –2.39     9 Jap   18 PS-Uzb S A
44        47.56     –3.56     –2.33     4 Fra   14 KJ-Ger S T
48        50.89     –2.89     –2.28     3 USA   10 LF-Can F T
44        47.39     –3.39     –2.22     7 Ukr   14 KJ-Ger S A
37        39.79     –2.79     –2.20     6 Can   19 CP-Ita F A
40        35.43     4.57      2.18      3 USA   18 PS-Uzb S T
53        50.14     2.86      2.17      5 Pol   11 ZZ-Chn S A
49        45.91     3.09      2.11      1 Rus   15 SM-Ukr S T
54        55.72     –1.72     –2.10     6 Can   4 TM-Rus F A
36        38.74     –2.74     –2.04     9 Jap   17 BB-Svk S T
48        50.59     –2.59     –2.02     5 Pol   12 LA-Can S A
45        42.28     2.72      1.99      3 USA   18 PS-Uzb S A

The suspect French judge, 4, appears once on the list, rating a German pair with 4.4, where 4.8 would be expected. But this pair ended 14th in the competition and their scores had no bearing on the awarding of medals. In fact, of the 17 significantly unexpected ratings, only 4 were awarded to pairs finishing in the top half. Yet these

are important. For instance, the unexpectedly low rating given by the Canadian judge, 6, to the Russian pair TM, who finished fourth, could have affected their medal hopes. On the other hand, the Russian judge gave this pair an even more unexpected, high rating.

Unexpected Response Strings

Though inspection of individual unexpected ratings does not reveal the known judge bias, an accumulation of small unexpectednesses might violate the Rasch model (Klauer, 1995) and so be diagnosed as judge misbehavior. In this instance the fit statistics used to investigate this are the Pearsonian INFIT and OUTFIT statistics (Wright and Masters, 1982).

The fit analysis for the judges is reported in Table 4. It is seen that the Japanese judge is inconsistent, and then only slightly, in Rasch model terms (mean-square > 1.0). The other judges are somewhat overly predictable, i.e., fail to exhibit the local independence specified by the model. This is a known feature of the Olympic judging procedures, in which ratings deemed divergent, i.e., apparently too independent, are referred back to the judges for emendation before the ratings are made public.

Table 4
Judge fit analysis

Infit   Outfit
Mnsq    Mnsq    Judge
1.23    1.07    9 Jap
.95     .98     8 Ger
.96     .98     3 USA
.94     .91     7 Ukr
.82     .86     6 Can
.80     .68     4 Fra
.71     .78     5 Pol
.64     .62     2 Chn
.57     .58     1 Rus

The fit analysis for the pair performances is reported in Table 5. The 6 which received the most inconsistent ratings are shown. The Short program of Uzbek pair 18 provoked the most disagreement among the judges. The relatively low-scoring Technical component of the Short program for the Canadian pair 2, SP, was also on this list. As can be seen in Table 1, 4 judges awarded this performance 5.7, 3 judges 5.8 and 2 judges 5.6. Surprisingly, it was the USA and Japanese judges who awarded the low scores. The French judge rated this performance 5.7, in line with consensus.

Table 5
Fit analysis for pair performances

Infit   Outfit
Mnsq    Mnsq    Pair
2.52    2.35    18 PS-Uzb S T
2.02    1.92    8 BD-Cze S A
1.68    1.66    14 KJ-Ger S T
1.64    1.67    18 PS-Uzb S A
1.53    1.50    2 SP-Can S T
1.42    1.42    15 SM-Ukr S T

Differential Rater Function Investigation

In this dataset, the judges are acting as the items. Each skating pair has 4 data records. Lack of locally independent rating by the judges, i.e., bias in respect of skating pairs, could therefore be conceptualized as equivalent to differential item functioning (DIF). Accordingly, estimates of DIF were obtained for all judges across all contestants, with four data records per contestant.

Table 6 shows the largest and also the most statistically significant differences in judge leniency between contesting pairs. The computation used is described in Linacre (1996). It is seen that the largest logit biases are all toward the Canadian SP pair, and are by the Canadian and German judges. These biases, however, have large standard errors and so are not statistically significant at the .05 level. Nevertheless they do support the viewpoint that some of the judges were noticeably biased towards the Canadian pair, indeed apparently more than counterbalancing the misbehavior of the French judge towards the Russian pair. Statistically significant biases affect mainly lower placed contestants. The French judge appears in this table only once, for her ratings of the pairs from Germany and Uzbekistan. These

Table 6 Sizeable Judge Bias terms. Bias Significance Judge Pair Pair Size S.E. t d.f. Name Largest size: 2 SP 4 TM –5.87 2.67 –2.19 6 Can 2 SP 4 TM –5.86 2.68 –2.19 6 Ger 2 SP 1 BS –5.86 2.79 2.10 6 Can 2 SP 7 ZS –5.85 2.65 –2.21 6 Ger 2 SP 6 PT –5.64 2.67 –2.11 6 Ger 2 SP 10 LF –5.64 2.64 –2.14 6 Ger 2 SP 9 PT –5.29 2.64 –2.00 6 Can 2 SP 17 BB –5.28 2.64 –2.00 6 Can 2 SP 18 PS –5.17 2.63 –1.97 6 Ger 2 SP 11 ZZ –5.15 2.64 –1.95 6 Can 2 SP 14 KJ –5.05 2.63 –1.92 6 Can 2 SP 19 CP –5.04 2.63 –1.92 6 Ger 2 SP 17 BB –5.02 2.64 –1.90 6 Ger Most unexpected: 14 KJ 18 PS –2.58 .53 –4.83 6 9 Jap 13 SD 18 PS –2.43 .56 –4.35 6 9 Jap 10 LF 12 LA 1.96 .54 3.60 6 8 Ger 7 ZS 18 PS –2.28 .64 –3.59 6 9 Jap 7 ZS 12 LA 2.17 .62 3.53 6 8 Ger 16 CP 18 PS –1.80 .52 –3.47 6 9 Jap 18 PS 19 CP 1.76 .52 3.37 6 9 Jap 3 SZ 14 KJ 2.42 .76 3.17 6 9 Jap 14 KJ 18 PS 1.52 .49 3.13 6 4 Fra Negative “Size” values indicate more leniency to the first pair listed. are irrelevant to the awarding of medals. Standard performed. Figure 2 displays the results. The x- DIF analysis has not identified the relevant judge axis is the Rasch “pair performance” dimension misbehavior. with higher-scoring performances to the right. The y-axis is the first component in the residuals. Multi-dimensional Responses Patterns This is the biggest variance component in the off- Slight patterns of bias across groups of Rasch-dimension patterns within the standardized persons or judges act as minor dimensions in the residuals. data. These will manifest themselves as covari- In Figure 2, most performances are marked ances among the observations (Glas and Verhelst, with lozenges. Only the pair performances for 1995, Sec. 5.2.4), and specifically among the BS and SP are identified explicitly. It is seen that residuals remaining after model expectations have there is a bifurcation, but not a strong one. Table been removed from the observations. 
Linacre (1998a) suggests that the most indicative form of the residuals is that standardized by its model standard deviation. Linacre (1998b) suggests that a principal components analysis is a useful means of identifying dimensionality in the data.

Accordingly, a principal components analysis of the pair performance residual correlations is performed. Figure 2 displays the results. The x-axis is the Rasch "pair performance" dimension, with higher-scoring performances to the right. The y-axis is the first component in the residuals. This is the biggest variance component in the off-Rasch-dimension patterns within the standardized residuals.

In Figure 2, most performances are marked with lozenges. Only the pair performances for BS and SP are identified explicitly. It is seen that there is a bifurcation, but not a strong one. Table 7, however, identifies the judges most strongly contributing to this component. The USA judge has rated higher than expected 8 of the 9 performances which most positively load on this component, and rated lower than expected 5 of the 9 most negatively loaded performances. The Ukrainian judge, by contrast, has rated lower than expected 7 of the most highly positively loaded

pairs, and rated higher than expected 8 of the most negatively loaded pairs. The Russian judge, on the other hand, is rating these 18 performances almost entirely as the model would predict. The French judge is now seen to be firmly in the "Eastern" camp with Ukraine, Poland and perhaps Russia and China. They contrast with the "Western" camp of USA, Canada, Germany and Japan.

Figure 2. Rasch dimension and first principal component in Pair performance residuals.

Table 7
Judge behavior on first component in pairs residuals.

		Top 9 Performances	Bottom 9 Performances
Judge	High	Exp.	Low	High	Exp.	Low
Judges favoring top of Fig. 2 ("West"):
3 USA	8	1	0	1	3	5
6 Can	4	5	0	1	1	7
8 Ger	3	5	1	0	4	5
9 Jap	6	3	0	2	4	3
Judges favoring bottom of Fig. 2 ("East"):
7 Ukr	0	2	7	8	1	0
4 Fra	0	2	7	6	3	0
5 Pol	0	3	6	3	5	1
1 Rus	1	6	2	1	8	0
2 Chn	2	5	2	3	5	1

These results suggest that a principal components analysis of the inter-judge correlations in the residuals would be fruitful. This is shown in Figure 3. The x-axis is the Rasch "rater leniency" dimension. The Chinese judge is the most lenient rater, and the Polish and USA judges the most severe. The y-axis, the first component in the residuals, now neatly contrasts the "Eastern" and "Western" rater blocks. The Russian judge is clearly part of the "Eastern" block, but not so the Chinese judge. Thus this method of analysis does not report an "Eastern" factor and a "Western" factor, but rather a factor which contrasts "Eastern" and "Western".

The French judge is now seen to be more "Eastern" than even the former Soviet-bloc countries. It may well have been this uncharacteristic behavior that most raised the ire of the spectators at the event. The French judge's misbehavior, however, is subtle, and was just sufficient to give the Gold medal to the Russian pair. This was later amended by the Olympic authorities, so that both the Russian and the Canadian pairs were awarded Gold medals.

In fact, if the French judge's slightly aberrant rating pattern for the Gold and Silver medalists had been her only instance of biased rating, she could have made a very strong case that she had exhibited no bias at all. It was her larger pattern of "Eastern" rating, most of which was inconsequential, that tripped her up.

Quality Control and Early Intervention

In high-stakes tests, such as the Olympic Games, it is obviously better to identify and correct judge misbehavior before real damage is done than to attempt to remedy it afterwards. A possible course of action would have been to apply the most successful of the dependency-detection techniques to the data as they stood after the first stage of the judging, the Short program, before judging commenced on the second stage, the Long program. This is shown in Figure 4. Three judges appear as outliers, and so their ratings would merit close inspection, and they themselves might be subject to retraining. These judges are the French, Ukrainian and Japanese judges. The Japanese judge was conspicuous for general misfit (Table 4). The French and Ukrainian judges may already be exhibiting the bias that later became public knowledge (Figure 3).

Conclusion

This analysis demonstrates that lack of local
independence can be a major threat to measure (or raw score) validity. It also demonstrates that

Figure 3. Rasch dimension and first principal component in Judge residuals.
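The contrast shown in Figure 3 comes from a principal components analysis of the inter-judge residual correlations. A minimal numpy sketch of that computation, run here on simulated standardized residuals with two built-in judge camps of sizes 4 and 5 (the seed, camp sizes and noise level are illustrative assumptions, not the competition data):

```python
import numpy as np

def first_component_loadings(z):
    """Judge loadings on the largest principal component of the
    judge-by-judge correlation matrix of standardized residuals z."""
    r = np.corrcoef(z, rowvar=False)  # inter-judge residual correlations
    _, evecs = np.linalg.eigh(r)      # eigh returns eigenvalues in ascending order
    return evecs[:, -1]               # eigenvector of the largest eigenvalue

# Toy residuals: 40 performances by 9 judges; the first 4 judges share one
# off-dimension pattern, the last 5 the opposite pattern, plus noise.
rng = np.random.default_rng(0)
pattern = rng.standard_normal(40)
camp = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0, -1.0])
z = np.outer(pattern, camp) + 0.3 * rng.standard_normal((40, 9))

loadings = first_component_loadings(z)  # opposite-signed for the two camps
```

As in Figure 3, the analysis does not yield one factor per camp: a single component separates the two blocks by the sign of their loadings.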


it is not the extent or the size of violation that is crucial, but rather its impact on critical aspects of the measuring system.

The Japanese judge was the least predictable, but that did not matter substantively, because that judge's most unexpected ratings affected the pairs who were 17th and 18th in the overall ranking. The French judge's slight, but focused, bias along the "East"-"West" sub-dimension could not be identified in individual ratings, or even by consideration of all that judge's ratings, but required an investigation of off-dimensional patterns within the total set of ratings.

This case study supports the view that principal components analysis of residuals is a powerful means of investigating failures of local independence in data that are indicative of latent sub-dimensions.

Figure 4. Rasch dimension and first principal component in Judge residuals after Short program.

References

Andrich, D. A. (1978). A rating scale formulation for ordered response categories. Psychometrika, 43, 561-573.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276.

Chen, T.-H., and Davison, M. L. (1996). A multidimensional scaling, paired comparisons approach to assessing unidimensionality in the Rasch model. In G. Engelhard, Jr., and M. Wilson (Eds.), Objective measurement: Theory into practice. Vol. 3 (Chapter 16). Greenwich, CT: Ablex.

Divgi, D. R. (1986). Does the Rasch model really work for multiple choice items? Not if you look closely. Journal of Educational Measurement, 23(4), 283-298.

Embretson, S. E. (1997). Structured ability models in tests designed from cognitive theory. In M. Wilson, G. Engelhard, Jr., and K. Draney (Eds.), Objective measurement: Theory into practice. Vol. 4 (Chapter 12). Greenwich, CT: Ablex.

Fischer, G. H. (1997). Structural Rasch models: Some theory, applications and software. In M. Wilson, G. Engelhard, Jr., and K. Draney (Eds.), Objective measurement: Theory into practice. Vol. 4 (Chapter 10). Greenwich, CT: Ablex.

Glas, C. A. W., and Verhelst, N. D. (1995). Testing the Rasch model. In G. H. Fischer, and I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 69-95). New York: Springer-Verlag.

Gonzalez, E., Adams, R. J., Wu, M., and Ludlow, L. (1997). Item response theory in the Third International Mathematics and Science Study. In M. Wilson, G. Engelhard, Jr., and K. Draney (Eds.), Objective measurement: Theory into practice. Vol. 4 (Chapter 9). Greenwich, CT: Ablex.

Harman, H. H. (1960). Modern factor analysis. Chicago: University of Chicago Press.

Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20, 141-151.

Klauer, K. C. (1995). The assessment of person fit. In G. H. Fischer, and I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments and applications (pp. 97-110). New York: Springer-Verlag.

Linacre, J. M. (1996). DIF in polytomous items. Rasch Measurement Transactions, 10(3), 520.

Linacre, J. M. (1998a). Detecting multidimensionality: which residual data-type works best? Journal of Outcome Measurement, 2(3), 266-283.

Linacre, J. M. (1998b). Structure in Rasch residuals: why principal components analysis? Rasch Measurement Transactions, 12(2), 636.

Linacre, J. M. (2002). Winsteps Rasch model computer program. Chicago: Winsteps.com.

Loevinger, J. (1954). The attenuation paradox in test theory. Psychological Bulletin, 51, 493-504.

McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ: Lawrence Erlbaum.

Smith, R. M., and Miao, C. Y. (1994). Assessing unidimensionality for Rasch measurement. In M. Wilson (Ed.), Objective measurement: Theory into practice. Vol. 2 (Chapter 18). Norwood, NJ: Ablex.

Van der Ven, A. H. G. S., and Ellis, J. L. (2000). A Rasch analysis of Raven's standard progressive matrices. Personality and Individual Differences, 29(1), 45-64.

Wang, W.-C., Wilson, M., and Adams, R. J. (1997). Rasch models for multi-dimensionality between and within items. In M. Wilson, G. Engelhard, Jr., and K. Draney (Eds.), Objective measurement: Theory into practice. Vol. 4 (Chapter 8). Greenwich, CT: Ablex.

Wang, W.-C., Wilson, M., and Adams, R. J. (2000). Interpreting the parameters of a multidimensional Rasch model. In M. Wilson and G. Engelhard, Jr. (Eds.), Objective measurement: Theory into practice. Vol. 5 (Chapter 12). Westport, CT: Ablex.

Wilson, M. (1991). Unobserved categories. Rasch Measurement Transactions, 5(1), 128.

Wilson, M., Engelhard, G., Jr., and Draney, K. (Eds.). (2000). Objective measurement: Theory into practice. Vol. 4. Greenwich, CT: Ablex.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.

Wright, B. D. (1996). Local dependency, correlations and principal components. Rasch Measurement Transactions, 10(3), 509-511.

Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.