<<

BehaviorResearch Methods, Instruments, & 1988, 20 (5), 471-480 A 77 program for a nonparametric item response model: The Mokken scale analysis

JOHANNES KINGMA University of Utah, Salt Lake City, Utah and TERRY TAERUM University ofAlberta, Edmonton, Alberta, Canada

A nonparametric item response theory model-the Mokken scale analysis (a stochastic elabo­ ration of the deterministic Guttman scale}-and a program that performs this analysis are described. Three procedures of scaling are distinguished: a search procedure, an evaluation of the whole set of items, and an extension of an existing scale. All procedures provide a coeffi­ cient of scalability for all items that meet the criteria of the Mokken model and an item coeffi­ cient of scalability for every item. Four different types of reliability coefficient are computed both for the entire set of items and for the scalable items. A test of robustness of the found scale can be performed to analyze whether the scale is invariant across different subgroups or samples. This robustness test serves as a goodness offit test for the established scale. The program is writ­ ten in FORTRAN 77. Two versions are available, an SPSS-X procedure program (which can be used with the SPSS-X mainframe package) and a stand-alone program suitable for both main­ frame and microcomputers.

The Mokken scale model is a stochastic elaboration of which both mainframe and MS-DOS versions are avail­ the well-known deterministic Guttman scale (Mokken, able. These programs, both named Mokscal, perform the 1971; Mokken & Lewis, 1982; Mokken, Lewis, & Mokken scale analysis. Before presenting a review of the Sytsma, 1986). The Mokken model is applied to dichoto­ program and the various , we describe in short­ mous items for which one alternative is designated as posi­ hand the underlying assumptions of the Mokken model, tive with respect to the latent variable of interest (attitude, which are crucial for the appreciation of this rather ability, etc.). This model belongs to the general family unknown model. of the popular item response theory models (Mokken & Lewis, 1982). The Mokken model has been shown to be THE MOKKEN MODEL very useful for scaling social attitudes, political knowledge, political participation, and political efficacy The Mokken model treats the attitude, the ability, or (Gillespie, Ten Vergert, & Kingma, 1987a, 1987b; Lip­ any other variable of interest as a single latent trait on pert, Schneider, & Wakenhut, 1978; Mokken, 1969a, which the person' location is represented by the param­ 1969b; Stokman, 1977; Stokman & Van Schuur, 1980). eter (J and the item's location is represented by the In psychological research, the Mokken model has been parameter o. Given a reasonably unidimensional set of applied successfully to the construction of developmen­ items-that is, one dominated by the latent variable tal scales of Piagetian concepts (Kingma, 1984; Kingma measured-the person's parameter (0) can be estimated & Loth, 1985; Kingma & Reuvekamp, 1984; Kingma & by the number of items to which a person responds posi­ Ten Vergert, 1985; Kingma & Van den Bos, in press), tively, and the item parameter (0) can be estimated by the as well as to cumulative scaling of problem solving tasks proportion of people who respond positively to that item. (Henning, 1976) and sleep quality (Mokken & Lewis, The former is referred to as the person's scale score and 1982). The purpose of this paper is to describe a user the latter as the item difficulty. Items that fit the Mokken procedure FORTRAN 77 program (which can be used model must meet the so-called assumption of double mo­ with SPSS-X) and a stand-alone Fortran 77 program of notony.

Preparation of this paper was supported by Grant Al235 from the Natural Sciences and Engineering Research Council of Canada and by The Assumption of Double Monotony Grant H56-314 from the Netherlands Organization for the Advancement Assume that n persons answer k dichotomous items Xi' of Pure Scientific Research (Z.W.O.). Requests for reprints may be ad­ ... XI" where Xi=1 or 0 if a person's answer to item i dressed to Terry Taerum, Department of Computing Services, 319C is positive or negative, respectively. The Mokken model General Services Building, University of Alberta, Edmonton, Alberta TOO 2El, Canada, or to Johannes Kingma, Department of Educational specifies the relation between item Xi and the latent trait Psychology, University of Utah, Salt Lake City, Utah 84H2. in terms of an ICC, denoted by P(Xi IOJ). As formal ex-

471 Copyright 1988 Psychonomic Society, Inc. 472 KINGMA AND TAERUM pression indicates, this curve represents the probability 1 of a positive response on item Xi' given respondentj's The Cross-Tabulation of Two Items location (8) on the latent trait. An important feature of Response to Item j the Mokken model is that, unlike the parametric latent Response to Item i 1 0 Row Total trait models (Hulin, Drasgow, & Parsons, 1983), it makes 1 (I,I) f(1,O) f(I,.) no assumption about the functional form of the item o f(O,I) f(O,O) 1(0,.) characteristic curve (ICC). Consequently, the Mokken Column Total f(. ,I) f(. ,0) f(., .) model is referred to as a nonparametric model, and the Note-Item i is assumed to be more difficult than itemj. A 1 denotes resulting scale scores and item difficulties constituteor­ a positive response; a 0 denotes a negative response. dinal rather than interval or ratio values. Instead,the onlyconstraint thatthe Mokken modelputs timate of the index of item pair homogeneity, measures on the ICCs is referred to as the assumption ofdouble the proportionaldifferencebetweencell frequencyof the monotony. The first requirement of thisassumption is that error expectedunder marginalindependence and the for any item in a Mokkenscale, the probabilityof a posi­ actual cell frequency: tive response increases as 8 increases. To put this more formally, for any two persons i andj where 8i < 8J> the H. = [e(I,O) - f(I,O)] probability of a positive response on any itemin the scale .) e(l,O) , (1) is less for person i. This requirement has been defined as monotone homogeneity (Mokken & Lewis, 1982; where e(I,O) = f(l, .)f(. ,O)/f(.,.) and i < j. Molenaar& Sytsma, 1984),whichis a necessaryprecon­ Readers familiar with the convention of using letters dition for unidimensional measurement. a, , , and d to represent the cell frequencies of a 2x 2 The secondrequirement to fulfill the assumption of dou­ table may fmd this notation more convenient: ble monotony is that for any value of 8, the probability Hij = (ad - bc)/(a+b)(b+d). (2) of a positive responsedecreaseswith the difficultyof the item. This means that the order of item difficulties re­ They may also notice the similarity between this index mains invariant over the values of 8, or, stated graphi­ and other measures for 2x2 tables. When items are in­ cally, the ICCs do not intersect. Giventhesetwo require­ dependent, Hij will be zero; whenthe error cell is empty, ments, it becomespossibleto define unambiguously the Hij will equal unity. difficulty of an item with a probability of .50. To deter­ The estimate of the coefficient of item homogeneity, mine whether a set of items forms a Mokken scale, two Hi' is given in Equation 3: test procedures are performed, (3) Testing the Monotone Homogeneity of a Set of Items where i »i. The estimate aggregates the observed and To test the requirement of monotone homogeneity, expected frequencies for the error cell for each 2 x 2 ta­ Mokken developedthree related coefficients of scalabil­ ble that cross-classifies item i with the other items in the ity, which are based on the assumption of marginal in­ scale. Hi is analogous to the item-totalcorrelation used dependence (Mokken, 1971).The first, Hib measuresthe in reliabilityanalysis(Sytsma& Molenaar, 1987).It will homogeneity or association betweeneach pair of items. be zero whenitem i is independent of the other itemsused The second, Hi> measuresthe homogeneity of a particu­ in the scale. It willattaina maximum valueof unitywhen lar item with respect to all other items and is obtained the error cell in each of the relevant2x 2 tablesis empty. by aggregating acrossthecoefficients of the relevantitem In termsof Equation 2, the numeratorof Equation 3 con­ pairs. The third, H, measures the homogeneity of the scale sists of the sum of the differences between the diagonal as a whole by aggregating across the coefficientsof the and off-diagonal cross-productterms for the 2 x 2 table. individual items. The denominatorof Equation 3 sums the product of the Accordingto Loevinger(1948), the coefficientfor the appropriate marginals for these tables. homogeneity of an itempair, Hib essentially measures the The estimate of the coefficient of scale homogeneity, association in the 2x 2 table that is obtained by cross­ H, is given in Equation4, whichaggregates the observed classifyingthe two items (see Table 1). Let item i define and expected error frequencies used to calculateR j for the rows and itemj define the columns, where item i is all item pairs: more difficult than itemj. Under Guttman's determinis­ ~1t-l~1t ~It-l '{""It I-" tic model, we expect the top-right cell, the error cell, to H = L.i=lL.tj=i+leij - Loti=1 L"j=i+lJij (4) be empty [i.e.,f(I,O) = 0]. Under the model of statisti­ E~;f E'=i+1 eij , cal independence (or no association between i andj), we anticipatethat the expectedvalue of the frequencyof the wherei = j -1. Againin termsof Equation 2, the numer­ error cell, defined as e(1,O), equals the product of mar­ ator of Equation 4 consists of the sum of the differences ginal frequencies dividedby the samplesize [i.e., e(l,O) between the diagonal andoff-diagonal cross-product terms = f(l, .)f(. ,O)/f(. ,.)]. Given in Equation 1, Hib the es- for all 2 x 2 tables. The denominatorof Equation4 sums MOKKEN SCALE 473 the products of the appropriate marginal rows of these In practice, when an item does not meet these criteria, tables. H will be zero when all items are mutually indepen­ it is eliminated from the scale. Subsequently, the coeffi­ dent, and it will attain the maximum value of unity when cients Hib 8;, and H are recomputed for the remaining the error cells for all the 2 x2 tables are empty. subset of items, and a new inspection has to be performed An asymptotic sampling theory of the estimates of the whether or not these coefficients meet the criteria. This scale coefficients, H and Hi' has been developed (Mok­ process is repeated until a sufficiently strong Mokken scale ken, 1971, pp. 157-160). On the basis ofthis theory, the is obtained. null hypothesis of random response, also called the null case (H=O or Hi=O), can be tested. A statistic called Monotonicity in the Item Difficulties DELTA STAR has been derived from the estimate ofthe The estimates of the coefficients of scalability (H, 8;, scale coefficient H, and on item level the estimate of Hi and Hij) are used to test the monotone homogeneity of has been employed (see Mokken, 1971). The statistic the items. This test is a necessary, but not sufficient, con­ DELTA STAR allows us to test the more specific null dition to meet the assumption ofdouble monotony under­ hypothesis of random response: lying the Mokken model. The test ofthe monotonicity of item difficulties, together with the test for monotone Ho : E(H) = E(DELTA STAR) = O. homogeneity of the items, provides a sufficient condition to test the assumption of double monotony. The null hypothesis ofrandom response is accepted when The test of the monotonicity of the item difficulties in­ the numerical value of DELTA STAR is smaller than volves an inspection ofthe P and Po matrices, which con­ some critical value c defined by the researcher, where c tain the probabilities of two positive and two negative is the z score from the standard normal distribution, which responses, respectively, to all possible pairs ofitems. This is obtained as follows: test is based on another assumption ofthe Mokken model, • a namely the assumption of local stochastic independence c = z for o, = lhk(k-l)' (5) of item responses for a fixed subject parameter value. More specifically, item responses are conditionally in­ where a is the confidence level chosen by the researcher dependent, given the same value of 8, so that the condi­ and lhk(k-l) is the number ofitem pairs. When DELTA tional probability ofjoint responses for persons with the STAR is greater than c, the null case is rejected; that is, same value of 8 equals the product ofthe marginal prob­ it may be concluded that the coefficient is not based on abilities of these responses. The test of monotonicity in random response. Using this test on item level-that is, item difficulties analyzes the observable unconditional testing 8; against the c criterion-enables the researcher probabilities of each item pair in the P and Po matrices. to weed out those items that are based on random response Since local independence is assumed for a given (J value, (or chance only). the unobservable conditional probabilities imply the ob­ The test of the null case is the first step toward delet­ servable unconditional probabilities (see Mokken, 1971). ing items that do not satisfy the Mokken scaling model. For example, when items 1, 2, and 3 represent decreas­ However, the researcher may also impose (a priori) ad­ ing difficulty, the (conditional) probability of a pair of ditional requirements on the coefficients Hi and H in such positive responses will be the greatest for items 2 and 3, a way that both Hi and H must exceed a numerical value next greatest for items 1 and 3, and finally for items I u (where 0 < u < 1, and JL will be gradually increased, and 2. Similarly, the (conditional) probability of a pair e.g., from .3 to .5). Only those items are selected of which ofnegative responses will be the greatest for 1 and 2, fol­ the Hi exceeds this a priori criterion (and consequently lowed by 1 and 3, and 2 and 3. the scale coefficient H of the selected items will also be In the P and Po matrices, the rows and columns are or­ greater than the critical value). The possibility of gradu­ dered, respectively, from left to right and from top to bot­ ally imposing stronger requirements on the items is es­ tom according to decreasing levels of item difficulty. To pecially worthwhile for the construction ofthe stochastic meet the criterion of monotonicity in item difficulties, the Mokken scale in new research areas (Kingma & Ten Ver­ probability ofa pair ofpositive responses should increase gert, 1985). in the P matrix from left to right in each row and from The coefficients of scale and item homogeneity allow top to bottom in each column. Similarly, the probability the researcher to judge whether the scale or the scalable of a pair of negative responses should decrease in the Po set ofitems as a whole and the scalability ofthe individual matrix. The statistical significance ofdepartures from this items satisfy the established criteria. Mokken (1971) es­ pattern can be tested by means of a simple runs test tablished a set of criteria for the coefficients H, Hi, and (Molenaar, 1983a). In practice, those items that violate Hij in order to analyze the homogeneity ofa scale. First, significantly the criterion ofmonotonicity in item difficul­ all Hij should be greater than zero. Second, all Hi (and ties are removed from the scale, and the analysis is thus all H) should be greater than a predetermined con­ repeated until no significant disturbances are observed in stant u. The following classification of scale has frequently both matrices. been used (Kingma, 1984): .50 :s H, a strong scale; .40 A Mokken scale has been found when the items meet -s H < 50, a scale of medium strength; and .30 :s H the criterion of double monotony; that is, the must < 40, a weak scale. be in agreement with the assumption of monotone homo- 474 KINGMA AND TAERUM geneity, as well as with the assumption of monotonicity (again, subject to sampling error). Thisfeature is referred in item difficulties. to as specific objectivity (Rasch, 1960). The Mokkentest can also be performedon item level A Test for Robustness of the Scale as by replacing Hihfor H; and S(Hlh) for S(Hh) in Equations Goodness of Fit Criterion 6 and 7. Once the researcher fmds a stochastic Mokken scale, he/she can test the robustness of the scale (the non-null Confidence Intervals case)acrossdifferentsubgroups of the sample. The sam­ Since the Mokken robustness analysis provides esti­ ple maybe dividedintogroupsaccording to differentex­ matesfor bothS(H) andS(Hi)for the sampledata, a con­ ternalcriteria(e.g., age, sex, religion, andsocioeconomic fidence intervalcan be asymptotically determined for H status). Ifsuchan external criterion(e.g., socioeconomic and each HI' Let d be the z score from the standard nor­ status) is adopted, one may test the null hypothesis that maldistribution withtwo-sided probability a. For a large thecoefficients of scalability havethesamevalues foreach sample sizen, an approximate confidence interval for the subgroup. Therefore, theprobability distribution ofH and population coefficient H at probability level Iha is Hi must be estimated for these subsamples by means of Equations 3 and4. Mokken (1971, pp. 164-167)showed H-dS(H) :5 H :5 H+dS(H), (8) that for large n, the quantity (Hn-H)/S(Hn) is approxi­ whereHis obtainedwithEquation 4. Similarly, thecon­ mately standard normally distributed, where S2(Hn) = fidence interval of the population coefficient HI for item i lin H~EnHn is the estimated asymptotic variance of the can be obtained by substituting HI for Hand S(HI) for subsample, and H; is the estimate of the scalecoefficient S(H) in Equation 8. for subgroup n, andS(Hn) istheestimated asymptotic ­ dard deviation of the subsample scale coefficient. The Reliability of Test Scores in Mokken's mean value for the different sample coefficients H, is Ii Nonparametric Item Response Model defined as follows: Mokken's approachto reliability estimation avoids the - E:=IHhlS2(Hh) assumptions of itemequivalence that are commonly em­ ployedin classical test theory. The assumption of double H; = E:=1 11S2(H , (6) h) monotony permitsan estimation methodfor the reliabil­ The statistic T is used to test the null hypothesis that the ity coefficient Qu, which is not based on replication or values H, for subgroups h (h=1, 2, ... , p) are equal internal consistency. In theP matrix,the nondiagonal cell (HI=H2=... =Hp ). T has been defined as follows: entriescontainthe probabilities of the positiveresponses to each item pair. Ifwe could estimatethe values of the p (H -H)2 diagonal elements of thisP matrix,wewouldhavean ap­ T= E -'--h_---'---- (7) h=! S2(Hh) , proximation for all items of their probability of a posi­ tive response on two differenttest occasions. This prob­ Under the null hypothesis, T has approximately a chi­ ability is denoted by!(i.I). Mokken devised two methods square distribution withp-l degreesof freedom. Ifthe to estimate thesediagonal elements fromoff-diagonal ele­ null hypothesis cannot be rejected (i.e., if T is smaller ments of the P matrix. The estimate of fu,l) is defined than a critical value dependent on the number of the for the method of extrapolation as degrees of freedom and ~ a significance level a), the j _ 1<. ,1) !(i,I+1) mean value for H, (i.e., H) can be considered a pooled (9) estimateH; (now supposed all equal)for the p samples. J(i,i) -.f)(. ,1+1) ' The samples or subgroups must be disjoint. where1<. ,i) and1<. ,1+1) are, respectively, the sums of the The Mokken test for robustness is considereda good­ probabilities of the ith and the ith +1 columns of the nessof fittest,foran internal criterion (e.g., scoregroups; P matrix, and !(I,I+1) is the (i,Hl)th off-diagonal see Molenaar, 1983b) as wellas for an externalcriterion element. (i.e., when the scale coefficient remains equivalent for When};I,I-1) is taken as the off-diagonal element, then the different subgroups). This test is comparable to the the methodof extrapolation defines the estimate of fu,l) Anderson goodness of fit test in the one-parameter item as follows: response model, or the Raschmodel (Anderson, Kearny, & Everett, 1968; Gustafson, 1980; Molenaar, 1983b). j - 1<. ,1>!(i,I-1) (10) J(i,i) -• In summary, the Mokken test provides a statistical ba­ 1<.,1-1) sis to decidewhetherthe ordering of a set of items is in­ variantacrossdifferent subpopulations. Ifthe assumption According to the methodof interpolation, the estimate of double monotony holds for a population, the order of of the diagonal elementfu,l) is defined as the item difficulties should remain invariant (subject to j _ ~ + (};.,I)-1<.,i+1»(fu,I+1)-!(i,I-l) (11) sampling error) for any sample from the population, )(i,l) - )(1,1-1) (.~ _~) . )(. ,1+1) Jt: ,1-1) whereas the order of the respondents (based on scale scores) willremain invariant to different samples of items The method of interpolation is used for i = 2, 3, ... , MOKKEN SCALE 475 k - 1, whereasthe methodof extrapolationhas to be em­ *5a Value labels iteml 1 'Test1' 2 'Test2' ployed for i = I and i = k. Having found the estimates *5b I group 1 'Male' 2 'Female' I

11-20 are read from record 6 according to their column items may form together with the 5 old items into a non­ specifications. parametric Mokken scale. On many occasions, the researcher may use a spread­ Record 9 of the command file contains different op­ sheet program for data entry in which the variables are tions. Because the program Mokscal is for dichotomous separated by blanks, commas, or so forth. The program items only, "options positive = 2" indicates that a score Mokscal is able to read this type of data, because it also of 2 on every item is considered the positive alternative accepts FORTRAN format statements in the data list (the default value is positive = 1; i.e., when an option specification. Assume that a set of data has been entered is not stated on the option list, the program Mokscal con­ by means of a program, there is one record siders that a score of 1 is considered the positive alterna­ per subject, and 20 variables have to be read. The vari­ tive). The a level for testing whether an item the scale ables are separated by blanks and variable 1 is located has to be set to a constant c by the researcher. In our ex­ in column 1. Then the data list specification may be: ample command file, a = .01 (the default value is a = .05). The option "lowerbound" specifies the lowerbound / vl to v20 (fLO, 19f2.0). of scale coefficient H, item coefficient Hi' and coefficient When the spreadsheet program delivers commas or other Hij for each item pair i andj. The default value for the symbols to separate the variables, the format statement lowerbound is .50. The procedure to be used to analyze for this example for the data list specification is: whether a set or subset of items form a Mokken scale is declared by the option "type." Type = search is used / vI to v20 (fLO, 19(x(f1.0») when the research is exploring whether a set or a subset In summary, the data list specification enables the of items may be considered a Mokken scale. Type = test researcher flexible use of the program Mokscal, without is used for testing whether a set of items as a whole (as the need to alter the original (ASCll) data me for differ­ specified in record 7) will form a Mokken scale. When ent Mokken scale analyses. type = robust, the Mokken test for robustness is per­ The fifth record in our example command me is op­ formed for the variables as specified in record 7. The tional (thus the ). It specifies the value for group variable also has to be declared on this record. The a particular variable (i.e., a name is attached to a partic­ researcher wants to test the robustness of the scale for ular variable). These value label names (descriptive labels) both sex groups, record 7 of our example command me are printed on the output instead of the variable name, becomes: as specified in the data list specification on record 4. The / item2 item4 to item7 item18 to item20 by group(2). use of value labels often makes the output more reada­ ble. In our example command file, the value label attached In record 10, print = narrow is employed when one wants to item 1 when item 1 = 1 is Testl; when item 1 = 2, to view the output on the console (i.e., the output will the value label is Test2. The value label for each variable have a record length of 80 characters). This option may is separate from the value label for the preceding vari­ also be used when printers of microcomputers do not have able by a slash. For example, a slash precedes the value a condensed mode for printing a record length of 132 label assignment for group on record 5b. characters. The width default value has been set to 132 Record 6 of the command me specifies that after the characters. The option "print" enables the researcher to data are read (either by SPSS-X or by the stand-alone pro­ specify which part of the output will be printed or writ­ gram read shell), the user procedure part of the Mokscal ten to a me; however, in exploring a set of items, one program starts. may bypass the print of all output. For example, when On record 7, the researcher may specify which vari­ one is interested only in the H coefficients, the print op­ ables are included in the actual Mokken scale analysis. tion may be set to print=H. In our example, command items 2, 4-7, and 9-20 are in­ The command me is employed in the common batch cluded for the scale analysis. procedures for SPSS-X jobs when the user procedure me Record 8 is used only when the extension of an exist­ of Mokscal is used. The stand-alone program is started ing Mokken scale is the focus of interest. Note that this by calling Mokken ($run Mokken on MTS operating sys­ slash select statement is employed only with the option tem; Mokken, followed by a carriage return, on micro­ = type = search statement (see record 9). In other cases, computers). Subsequently, the program asks for the name the select record is not included in the command me. Sup­ of the command file. Following this, the program asks pose that a researcher has found in previous research that for the name of the output me or output device. 5 items of interest form a nonparametric Mokken scale. The program employs different types of algorithms for Suppose also that the researcher wants to extend this scale the search test, and robustness test procedures. by adding 15 new items to the 5 old ones. By means of the select statement of record 8, the researcher can specify Search Procedure the old items as a starting point of the scale analysis (in When exploring a set of items, the researcher may be our example, command me items 2, 3, and 18-20). The interested primarily in constructing a new stochastic Mok­ program Mokscal searches through the set of 15 new items ken scale that (on the basis of the content) is thought to to determine whether the whole set or a subset of the new be more or less homogeneous. The logic behind such a MOKKEN SCALE 477 search procedure runs as follows: a selection of items has the set of items, as well as that the choice is unique: to be made (the number of items as large as possible) that (4) The scale consisting of the already selected items and meet the criteria of scalability as described previously. item i, should have the highest value H relative to any In addition to the statistical criteria, the approach of the other item from the set of remaining items. (5) When a search procedure is a familiar one in statistical methods unique choice is impossible, because some items may meet aiming for optimal selection of p items from a set of k criteria 1 to 3 and they may also have equal maximum items. In theory, such selection is a very simple proce­ H values, then the item i, that has the greatest Hi, value dure, because for givenp, all possible p-tuples from the with respect to the already selected items is chosen by set of k items have to be inspected. However, such a selec­ the program. (6) When more than one item still can be tion procedure is rather inefficient (tedious and time­ chosen, then the most difficult is chosen by the program consuming), because astronomically great numbers of Mokscal (i.e., the item with the lowestp value, or propor­ p-tuples may be involved. tion correct responses) in the sequence of sample The program Mokscal provides an alternative selection difficulties. procedure. The method developed is a straightforward At the end of this search procedure, the resulting scale maximization of the scale coefficient H. The maximiza­ has to be evaluated on the basis of the P and Pomatrices. tion procedure consists of a stepwise technique of con­ In the P matrix, the proportions of correct responses to structing a scale from a set of k items. The lowerbound each item pair i and j should increase from left to right of H is set to the critical value c (see input description). in each row and from top to bottom in each column. In The program computes the cross-tabulations (2 x2 tables) the Po matrix, the proportions of incorrect responses to for all item pairs (see Table 1). The maximization proce­ each item pair i and j should decrease from left to right dure starts with the choice ofthe best pair of items. This in each row and from top to bottom in each column. On choice is based on the following criteria: (1) the coeffi­ the basis of a simple run test (see Molenaar, 1983a), the cient Hi} must be significantly greater than zero at a prede­ program Mokscal marks those cell entries in which the fmed level of confidence a (see input description); (2) Hi} proportion differs significantly from the expected pattern. is the greatest for all possible item pairs that meet crite­ When significant disturbances are observed in either the rion 1; and (3) Hi} must be equal to or greater than the P or the Pomatrices, the set of items selectedby the search chosen lowerbound c (see input description). procedure may not be considered a stochastic Mokken The algorithm used in the program Mokscal ensures scale, because they do not meet one of the assumptions uniqueness of the selection of the item pair when more of double monotony, that is, the monotony in item difficul­ item pairs are equally eligible. In this circumstance, the ties. In practice, the researcher deletes items that are starting pair of items contains the most difficult items in responsible for these disturbances (by altering the specifi­ the sequence ofsample difficulties. Ifno unique pair can cation ofthe items to be searched in record 7 ofthe com­ be found, the pair of next most difficult items is chosen. mand file), and he/she repeats the search procedure for Following the choice of an optimal starting pair of items, these remaining items. When no significant disturbances single items are added to the scale in consecutive steps. are observed in either the P or the Po matrices, it may In general, this recursive process of adding an item to the be concluded that these constitute a stochastic Mokken scale can be described as follows: Let two items be scale, because they meet the criterion of double monot­ selected from a set of r items as a starting pair for the ony (the assumption of monotone item homogeneity has new scale. Then from the remaining set, (r-2) items the already been met by those items selected during the search rth item (i,) will be added to that scale only when it meets procedure, and the lack of disturbances in either the P the following three criteria: (1) Item i, should have a posi­ or the Pomatrices implies that the items of the scale also tive correlation with each item that has already been meet the assumption of monotonicity in item difficulties). selected for the scale. An item that correlates negatively In addition, the program Mokscal does not stop when with one ofthese already scalable items is rejected by the a final scale is established for a subset of items. The pro­ program in the consecutive steps for the construction of gram also analyses whether the remaining items (that do the new scale at hand. (2) The item coefficient of scal­ not fit the first final scale) constitute a separate scale. ability, H;, of item i, with respect to the already selected The search procedure is ended when no items in the items should be significantly greater than zero at the remaining set meet criteria 1 to 3, or when no more items selected a level (see input description), according to the are available, or when the set of items eligible in terms one-sided test based on the statistic DELTA STAR. of criterion 1 is empty. (3) The coefficient Hi with respect to the already selected items should be greater than, or equal to, the critical Extension of an Existing Mokken Scale lowerbound c (see input description). Once a Mokken scale has been found and its useful­ The three preceding criteria are employed by the pro­ ness has been shown, it might be worthwhile to extend gram Mokscal to determine the set or subset ofitems that the scale. Improvements and extensions of these scales are in agreement with the definition of a Mokken scale. may be made by adding new items and trying to vary In addition, the following three criteria are employed by difficulty levels for better differentiation. For this pur­ the program to assure that the best item is chosen from pose, the search procedure can be modified. Instead of 478 KINGMA AND TAERUM beginning with the selection of the initial best pair, items items being selected for the scale (from the set of k items), are selected as the start set, and, subsequently, the search withj, items being retained in the remaining pool (after procedure of adding new items starts as previously de­ eliminating the negatively correlated items). The tests scribed. The Mokken scale program performs this exten­ necessary for the selection of the (g + l)th item are then sion procedure when the following record is added in the computed by the program at significance level: command file before the options (see input description): • a a - (16) /SELECT var2 var5 var6 '" vark (g+I) - Ihk(k-l) + E1=di . where var2, var5, var6, and vark are the items of the origi­ nal scale (in the options type = search). Those new items The Evaluation of a Whole Set that meet the first three criteria of the search procedure of Items as One Scale are added to the original scale. Further inspection of the Suppose it has been shown in previous research that P and Po matrices must show that these new selected items a set of k items forms a stochastic Mokken scale. When meet the assumptions of double monotony. Items that vio­ this set of items is used in another study, the Mokken scale late these requirements have to bedeleted, and the proce­ program may be employed to evaluate whether the whole dure of the search for the extension has to be repeated set of k items again forms a Mokken scale (under the same for the remaining set of items. As in the SEARCH proce­ restrictions as in the earlier study, i.e., the same alpha dure, the Mokken scale program searches the items that level and lowerbound; see input description) by setting do not fit on the extended scale to determine whether these type = test (record 9 of the command file; see input items (or subsets) may form another scale. description). The Mokken scale program evaluates whether the whole set of items forms one scale (i.e., no Level of Significance in the Search Procedure search procedure is performed). The algorithm at the ba­ In the search procedure of the Mokken scale program, sis of this test procedure computes the intercorrelations a number of statistical tests are performed at each step of all item pairs, the item coefficients Hi, and the scale to determine the ultimate selection ofan item. It may be coefficient H. Two statistical levels of evaluation can be argued that due to repeated testing, the alpha risk may distinguished in this test procedure. First, the by-the­ increase. Consequently, the obtained results may have a program computed values of the coefficients Hiand Hare very low probability of being repeated in other ex­ tested one-sided against the null hypothesis of random periments. response using the statistic DELTA STAR. DELTA To avoid this risk of capitalizing on chance, the sam­ STAR must exceed some critical z value at an ex level of ple may be randomly split into two halves to replicate the analysis; however, the power of the tests will be reduced. acrit = Ihr(1-r) , (17) The Mokken scale program offers an alternative; that is, a gradually increasing protection level is employed to where a in the numerator is the confidence level defined moderate the effects of possible chance results by impos­ by the researcher (see input description) and r is the num­ ing a continual reduction of the level of significance in ber of items. The critical z value is derived by the pro­ the consecutivesteps. This algorithm runs as follows: First gram from a table of the standard normal distribution for at the beginning of the search procedure, a level of sig­ the computed acrit. In addition to this procedure, it is nificance (mostly a = .05 or .01; see input description) evaluated that no item is negatively correlated. The sec­ is chosen. Second, from an initial set ofk items, the test ond step involves the evaluation of whether these coeffi­ for the selection of the best starting pair is performed at cients exceed the lowerbound value c (see input descrip­ the actual level of significance given in Equation 5: tion). When the whole set of items satisfies these two steps, the hypothesis of scalability is accepted. However, • a a - when one or more of the H/s coefficients have negative 1 - Ihk(k-l) ' values or values that do not differ significantly from zero, where the denominator indicates the number ofcompari­ the hypothesis of the scalability of the set of items as a sons involved. Third, after eliminating those items that whole is rejected. do not meet the first criterion of the search procedure (i.e., positive correlation), it is supposed that the remaining set Test for Robustness of the Mokken Scale of items consists of h items. The tests necessary for the The robustness (or invariance) of a found Mokken scale selection of the third item are performed at level of sig­ for different samples (or subsamples) can be tested by nificance: means of the Mokken test. The group variable has to be defined in record 7 and type robust in record 9 (see • a = input description). al = Ihk(k-l) +h . (15) The Mokken test option computes the T statistic for both Fourth, the significance (protection) level used in the the scale as a whole and for each of the items. T has an Mokken scale program can be generally defined for g approximate chi-square distribution with p- I degrees of MOKKEN SCALE 479 freedom (wherep is the number of groups). When T ex­ HI coefficients, and the statistics DELTA STAR of the ceeds the critical value, the scalesmay not be considered nonscalable items that did not fit the final scale; (3) P, invariantacrossthe groups. The T statistics on itemlevel Po, and HI} matrices and the four reliability coefficients providesthe researcherinformation aboutwhichitemsare for the final scale. responsiblefor such a significantdifference. By deleting The search procedure is repeated by the program for those items with significant1'; values (in a stepwiseman­ those items that did not fit into the first final scale. The ner) and subsequently rerunning the programwiththe op­ sameoutputas described aboveis obtained for theseitems. tion type = robust, a scale maybe obtainedthat is robust The procedure to extend an already existing Mokken for the different subgroups(for a practical application of scale (select = vari to varn and type = search) provides this approach, see Gillespie et al., 1987a;Kingma & Ten outputthat is almostidenticalto that of the search proce­ Vergert, 1985). dure. The only differenceis that the outputdoes not con­ tain information aboutthestartpair, butstartsat thatplace Output in the outputthat containsthe information aboutall items The Mokken scale program (both the user procedure of the already existing scale, followed by the stepwise out­ file for SPSS-X and the stand-alone programs for main­ put for each new itemaddedto this scale, as in the search frame and microcomputers) provides the following out­ procedures. put for the search procedure (type = search): (1) A The output for the evaluationof a set of items as one description of the command file is given, followed by a scale (type = test) is as follows: (1) the P, Po, and Hlj specification of which columns and records are read in matrices for all items; (2) the four reliabilitycoefficients the data file. (2) An example of the syntax used in the for the wholeset of items; (3) the scalecoefficientHand command fileis printed,enabling the researcher to change scale DELTASTAR;and (4) thep values, the Hlcoeffi­ very quickly an error made. (3) A specification of the cients,andthestatistics DELTASTARforeachindividual items used in the actual analysis are displayed and the item. Ifall HI and HI}meetthe criteria (see search proce­ number of cases read from the data file; (4) The P, Po, dure), thenthe programstops.Ifsomeitemsare rejected, andHijmatrices are givenfor all items;(5) Four reliability becausethey do not meet thesecriteria, the programpro­ coefficientsare provided for the whole set of items: one videsadditional output(identical to the searchprocedure) is based on Mokken's methodof extrapolationto find the for both the scalable and unscalable items. diagonal elements of the P matrix, the second is based The following outputis givenfor the test of the robust­ on Mokken's methodof interpolation,the third is the MS ness of the found Mokkenscale across different samples reliability coefficient (Molenaar& Sytsma, 1984; Sytsma or subgroups (type = robust): (1) for all samples together, & Molenaar, 1987, and the last is the Kuder Richardson the descriptionof the commandfile, the input format of coefficient of homogeneity (the KR20 and the item­ the items specified, a list of the items used in the anal­ corrected total correlations). (6) The phi inverse co­ ysis, and the numberof casesreadfromthe datafile; (2) a efficient-that is, the z value-for the best start pair, the survey of the total scale scores for total population (all scale coefficientH, and the scale DELTA STAR statis­ samples), the percentageand the different response pat­ tic, as well as the actual a level used, are given. On item terns that constitutea particular total score (itemsare or­ level HI coefficients, the statisticsDELTASTARand the dered inthesepatterns according to decreasing difficulty), p values (proportion correct responses) are given. and a surveyfor each sampleor subgroup; (3) the P, Po,

In the second step of adding an item to the scale, the and Hlj matrices for all samples together and the four reli­ outputconsistsof (1) the scalecoefficientH for the three abilitycoefficients; (4) theE, Eo, H, S2(H), andthestatis­ items,the statistic DELTASTARfor the scaleas a whole, tic DELTA STAR for both the scale and for each in­ andthe actuala levelused; (2) on itemlevel,thep values, dividualitem level and the item difficulties; (5) the same HI values for each of the three items, and their statistics outputfor eachsubsample; (6) the meanof the coefficient DELTA STAR; and (3) the actual a level used for test­ Ii, the statistic T, the degrees of freedom, and the prob­ ing and the phi inverse coefficient (i.e., the z score of a ability level for the whole scale, as well as for the in­ significance level; see Equations 14and 15). For the third dividual items. and following step, this intermediateoutput is given, so that for each step the changes in the item coefficientsHI Further Elaborations and Discussions and the statistics DELTA STAR can be evaluated. on the Mokken Scale When the search procedure of the Mokken scale pro­ The Mokken scale analysis is relatively unknown in gram has found a final scalefor the selection of itemsfrom NorthAmerica. SeeMokken (1971) fora thorough review the original set, the output is as follows: (1) the scale of this nonparametric item responsetheory model. More coefficient H, the statistic DELTA STAR of the found recently, several discussions about the usefulness of the scale, thep valuesof the scalableitems, their coefficients Mokken scale analysis haveappeared in the literature (Jan­ HI and their statistics DELTA STAR, the phi inverse sen, Roskam, & Van den Wollenberg, 1984; Roskam, coefficient, and the actual a level used; (2) the p values, Van den Wollenberg, & Jansen, 1986; Sytsma, 1984, 480 KINGMA AND TAERUM

1986). Further refinements of the model were reported mental scalefor seriation. Educational & Psychological Measurement, by Molenaar (1982, 1983a) and Sytsma and Molenaar 44, 1-23. (1987). These refinements are incorporated in the present KINGMA, J., & TEN VERGERT, E. M. (1985). A nonparametric scale analysis of the developmentof conversation. Applied Psychological program. Measurement, 9, 375-387. KINGMA, J., & VAN DEN Bos, K. (in press). On the sequentiality incon­ Limitations cept development. Cognitive Systems. The Mokken scale program is limited at present to the LIPPERT, E., ScHNEIDER, P., & WAKENHUT, R. (1978). Die Verwen­ dung der Skalierungsberfahren von Mokken und Rasch zur Uber­ analysis of a maximum of 100 items. This limitation may prufung und Revision von Einstellungsskallen. Diagnostica, 24, be modified by altering dimensions of the arrays. The pro­ 252-274. gram accepts scores from an unlimited number of subjects. LoEVINGER, J. (1948). The techniqueof homogeneous tests compared with some aspects of "scale analysis" and . Psycho­ PROGRAM AVAILABILITY logical Bulletin, 45, 507-530. MOKKEN, R. J. (l969a). Dutch-American comparisonsof the "sense of political efficacy." Quality & Quantity, 3, 125-152. Program listings of both the SPSS-X user procedure file MOKKEN, R. J. (I969b). Dutch-American comparisonsof the "sense and the stand-alone programs (both mainframe and micro­ of politicalefficacy": Some remarks on cross-cultural robustnessof computer versions) can be obtained from either author scales. Acta Politica, 4, 425-440. MOKKEN, R. J. (1971). A theory and procedure ofscale analysis. The on a 5.25-in. flexible disk (MS-DOS operating system) Hague, The Netherlands: Mouton. for a nominal fee. For the microcomputer, a compiled MOKKEN, R. J., & LEWtS, C. (1982). A nonparametricapproach to the program is available for mM PCIAT microcomputer or analysis of dichotomous item responses. Applied Psychology Mea­ compatibles with math coprocessor. A user's guide and surement, 6, 417-430. example input and output will accompany the listings of MOKKEN, R. J., LEWIS, C., & SYTSMA, K. (1986). Rejoinder to "The Mokken scale: A critical discussion." Applied Psychological Mea­ the source codes. surement, 10, 279-285. MOLENAAR, I. W. (1982). Mokken scaling revisited. Kwamitatieve REFERENCES Methoden, 8, 145-164. MOLENAAR, I. W. (l983a). Rasch, Mokken en schoolbeleving[Rasch, ANDERSON, J., KEARNEY, G. E., & EVERETT, A. V. (1968). An evalu­ Mokkenand schoolexperience]. In S. Lindenberg& F. N. Stokman ation of Rasch's structural model for test items. British Journal of (Eds.), Modellen in de Sociologie [Models in Sociology] (pp. 195­ Mathematical Psychology, 21, 231-238. 213). Deventer, The Netherlands: Van Loghum Slaterus. GILLESPIE, M., TEN VERGERT, E. M., & KINGMA, J. (l987a). Using MOLENAAR, I. W. (l983b). Some improveddiagnostics for failure of Mokken methods to develop robustcross-nationalscales: American the Rasch model. Psychometrika, 48, 49-72. and West German attitudes toward abortion. Social Indicators MOLENAAR, I. W., & SYTSMA, K. (1984). Internalconsistency and reli­ Research, 19, 74-95. ability in Mokken's nonparametric item response model. Tydschrift GILLESPIE, M., TEN VERGERT, E. M., & KINGMA, J. (l987b). Using Voor Onderwysresearch, 5, 257-268. Mokken scale analysis to develop unidimensional scales: Do the six RASCH, G. (1960). Probabilisticmodels for some intelligence and at­ abortion items in the NORC GSS form one or two scales? Quality tainment tests. Copenhagen: Denrnarks Paedagogiske Institut. & Quantity: International Journal of Methodology, 21, 1-15. ROSKAM, E. E., VAN DEN WOLLENBERG, A. L., & JANSEN, P. G. W. GUSTAFSON, J. (1980). Testing and obtaining fit of data to the Rasch (1986). The Mokken scale:A criticaldiscussion. AppliedPsychological model. Bulletin of British Psychological Society, 33, 205-233. Measurement, 10,265-277. HENNING, H. J. (1976). Die Technikder MokkenSkalenanalysen. Psy­ STOKMAN, F. N. (1977). Roll calls and sponsorships:A methodologi­ chologische Beitrage, 18, 410-430. cal analysis ofthird world group formation in the United Nations. HULIN, C. L., DRASGow, F., & PARSONS, C. K. (1983). Item response Leiden, The Netherlands: Sythof. theory. Homewood, IL: Dow Jones-Irwin. STOKMAN, F., & VAN SCHUUR, W. (1980). Basic scaling. Quality & JANSEN, P. G. W., ROSKAM, E. E., & VAN DEN WOLLENBERG, A. L. Quantity, 14, 5-30. (1984). Discussion on the usefulnessof the Mokken procedure for SYTSMA, K. (1984). Usefulnonparametric scaling. Psychologische Bei­ nonparametric scaling. Psychologische Beitrage, 26, 722-735. trage, 26, 423-437. KINGMA, J. (1984). A comparison of four methods of scaling for the SYTSMA, K. (1986). Another note on the usefulness of Mokken scal­ acquisitionof early numberconcept.JournalofGeneralPsychology, ing. Psychologische Beitrage, 28, 425-432. 110, 23-45. SYTSMA, K., & MOLENAAR, I. W. (1987). Reliabilityoftest scores in KINGMA, J., & Lora, F. L. (1985). The validationof a developmental nonparametric item response theory. Psychometrika, 52, 79-97. scale for seriation. Educational & PsychologicalMeasurement, 45, 321-328. (Manuscript received December 8, 1987; KINGMA, J., & REUVEKAMP, J. (1984). The constructionof a develop- revision accepted for publication May 20, 1988.)