ORIGINAL ARTICLE Item response theory: How Mokken scaling can be used in clinical practice

Roger Watson, L Andries van der Ark, Li-Chan Lin, Robert Fieo, Ian J Deary and Rob R Meijer

Aims. To demonstrate the principles and application of Mokken scaling. Background. The history and development of Mokken scaling is described, some examples of applications are given, and some recent development of the method are summarised. Design. Secondary analysis of data obtained by cross-sectional survey methods, including self-report and observation. Methods. Data from the Edinburgh Feeding Evaluation in Dementia scale and the Townsend Functional Ability Scale were analysed using the Mokken scaling procedure within the ‘R’ statistical package. Specifically, invariant item ordering (the extent to which the order of the items in terms of difficulty was the same for all respondents whatever their total scale score) was studied. Results. The Edinburgh Feeding Evaluation in Dementia scale and the Townsend Functional Ability Scale showed no violations of invariant item ordering, although only the Townsend Functional Ability Scale showed a medium accuracy. Conclusion. Mokken scaling is an established method for item response theory analysis with wide application in the social sciences. It provides psychometricians with an additional tool in the development of questionnaires and in the study of individuals and their responses to latent traits. Specifically, with regard to the analyses conducted in this study, the Edinburgh Feeding Evaluation in Dementia scale requires further development and study across different levels of severity of dementia and feeding difficulty. Relevance to clinical practice. Good scales are required for assessment in clinical practice and the present paper shows how a relatively recently developed method for analysing Mokken scales can contribute to this. The two scales used as examples for analysis are highly clinically relevant.

Key words: activities of daily living, dementia, item response theory, Mokken scaling, nurses, nursing, , scalability

Accepted for publication: 12 July 2011

naire is a valuable instrument for the measurement of Introduction phenomena such as psychological morbidity, quality of The use of questionnaires is important and widespread in life and clinical symptoms. Also, in nursing and other nursing research and practice. A well-designed question- areas of social research, the measurement of attitudes,

Authors: Roger Watson, PhD, Professor of Nursing, School of Centre for Cognitive Ageing and Cognitive Epidemiology, The Nursing & Midwifery, The University of Sheffield, Sheffield, UK and University of Edinburgh, Edinburgh, UK; Rob R Meijer, PhD, School of Nursing & Midwifery, University of Western Sydney, Professor of Psychometric and Statistical Techniques, Department of Campbelltown, Australia; L Andries van der Ark, PhD, Associate Psychometrics and Statistics, University of Groningen, Groningen, Professor, Department of Methodology and Statistics, Tilburg The Netherlands University, Tilburg, The Netherlands; Li-Chan Lin, PhD, Professor, Correspondence: Roger Watson, Professor of Nursing, School of School of Nursing, National Yang-Ming University, Taipei, Taiwan; Nursing & Midwifery, The University of Sheffield, Sheffield S7 5AU, Robert Fieo, PhD, Research Scientist, Columbia University, Division UK. of Geriatric Psychiatry, New York, NY, USA; Ian J Deary, PhD, E-mail: roger.watson@sheffield.ac.uk Professor of Differential , Department of Psychology,

2011 Blackwell Publishing Ltd, Journal of Clinical Nursing 1 doi: 10.1111/j.1365-2702.2011.03893.x R Watson et al.

opinions and educational achievement is common and result from high precision or relatively small measurement valuable. error (Alagumalai & Curtis 2005).This information (i.e. error The development of questionnaires requires some obliga- that varies by person performance) can be used to identify the tory steps including the clarification of the concept being most sensitive part of the instrument or scale under investi- studied, the selection of items, validation of content, estab- gation (Wilson 2005). The range of item response theories lishing the reliability of the items, and then further steps to will not be covered here but the method that has most investigate the construct validity of the questionnaire (Bann- recently come to characterise and be described as an example igan & Watson 2009). This field of research is known as of item response theory (although it differs from some other psychometrics (Rust & Golombock 1999) and depends methods) is Rasch modelling (Meijer & Sijtsma 1990). heavily on methods developed in psychology which are However, a convenient, useful and relatively easy to equally applicable across the different fields where question- understand method which follows the principles of item naires are used. The methods employed to ensure that response theory has also been developed and this is known as questionnaires are psychometrically sound range from com- Mokken scaling after Robert Mokken (http://staff.science. mon sense (selection of items and validation of content) to uva.nl/~mokken/; retrieved 30 September 2010) who initially some sophisticated mathematical and statistical methods for developed it. The remainder of this paper will explain the establishing reliability and validity. principles of Mokken scaling and provide some recent examples of its application.

Classical test theory Background The methods used to establish reliability and validity rely heavily on what is referred to as classical test theory, which Mokken scaling is a non-parametric method for investigating includes methods such as Cronbach’s alpha for the estimation the relationship between items and latent traits which evolved of reliability of a test score. Classical test theory – which will from the Guttman method of investigating hierarchies of not be expounded on further here – is concerned with the items within scales. estimation of measurement error and establishing, within the bounds of the methods available, an estimate of the true Guttman scaling score. Guttman scaling, developed by Louis Guttman (Stouffer et al. 1950, Katz 1988) is deterministic in the sense that it Item response theory does not take any stochastic elements into account. This A more recently developed and very powerful alternative to means that it does not view the relationship between an item classical test theory is item response theory (Hulin et al. and the latent trait that is being measured in terms of 1983) which seeks to solve much the same problems as probability (as in the probability of a score on the item being classical test theory and, indeed, can often be complementary related to the extent to which the latent trait is present); to classical test theory in terms of identifying sets of items rather, it sees the relationship as being one of complete that measure the same concept (i.e. that are unidimensional) discrimination between the presence or absence of the latent (van Schuur 2003). This is exemplified in the case of the trait based on the endorsement or lack of endorsement of an EdFED (Edinburgh Feeding Evaluation in Dementia) scale item or an item step response (i.e. the response to any one of (Watson & Deary 1994, 1997, Watson 1996) where both the points on, for example, a Likert scale as opposed to a exploratory and Mokken scaling identified the response to a dichotomous item – we will only consider same set of items to measure feeding difficulty in older people dichotomous items here in relation to Guttman scales for with dementia. Item response theory is less concerned with clarity and will return to polytomous items under Mokken scores on sets of items (test level scores) and more concerned scaling). This relationship is shown in Fig. 1. with the responses to individual items. Additionally, item The way Guttman scaling works can be demonstrated by response theory enhances the interpretive power by providing considering the questions listed in Table 1. The idea behind measurement precision that varies with a person’s ability these questions – which, for the purposes of this example are level (Hambleton et al. 1991). The discrepancies between deliberately ordered as they appear in the table – is to observed and true scores indicate how much the test score measure the attitude of individuals towards the supporters of differs from the true scores, and are summarised by the an opposing football team. The questionnaire is based on the standard error of measurement; better reliability estimates fact that most people will not openly express any particular

2 2011 Blackwell Publishing Ltd, Journal of Clinical Nursing Original article Item response theory

1 more popular item before they endorse a ‘harder’ or less popular item – and things are rarely like that in reality and we are more likely to obtain a response pattern like the one shown in Table 3. In reality, answer patterns are subject to p(θ) more than the underlying construct or trait; answer patterns are influenced by factors like mood of the respondent and interpretation of the question (Kempen et al. 1995). There- fore, the relationship to item responses and the construct is θ better defined as probabilistic rather than deterministic.

Figure 1 An example of an item behaving according to the deter- Furthermore, to be sure that a scale follows the expectation ministic Guttman model along a latent trait h on the abscissa with the of clear-cut pass/fail point (i.e. deterministic nature) for each probability of a positive response to the items on the ordinate. person, the differences between item difficulties must be large (Fisher & Fisher 1993). Thus, the sensitivity of such scales to

Table 1 Questions used to demonstrate hierarchical item ordering small changes in functioning within individuals over time or to small differences between individuals can be reduced quite Item Label Response substantially (Finch et al. 1994). Without explaining the 1 I like supporters of opposing football teams Yes/no Guttman method in detail, the method is concerned with 2 I would sit next to an opposing supporter on a bus Yes/no calculating acceptable levels of reproducibility and scalability 3 I would speak to a supporter of an opposing team Yes/no (Menzel 1953, Schuessler 1971) based on the number of 4 I would invite an opposing supporter to my house Yes/no violations of the underlying pattern. This involves identifying 5 I would allow an opposing supporter to marry one Yes/no of my children the incidences when an individual does not answer in a way that is consistent with the majority of the other respondents and comparing these incidences with the number of correct prejudice towards other people but, when pushed further; for responses and calculating coefficients that tell us how well the example, inviting such a person to their house or welcoming data conform to a Guttman scale. them into their family, then they will eventually ‘draw the line’ somewhere. A set of responses to the questions is shown Mokken scaling in Table 2 and this demonstrates that all the respondents, except one, show no general prejudice; however, as we Guttman scaling is described as deterministic in that it does progress through the questions, they become increasingly not allow for randomness: a score on the scale should closely hard for these respondents to endorse with only one person indicate the responses to the items in the scale and items are endorsing all of the items. Of course, the response pattern written (or discarded) to achieve this. Mokken scaling refers shown here is perfect: people always endorse an ‘easier’ or to a series of methods for investigating whether stochastic

Table 2 Responses to questions in Table 1 Item Respondent 1 Respondent 2 Respondent 3 Respondent 4 Respondent 5 showing perfect Guttman scalability 1 No Yes Yes Yes Yes 2 No Yes Yes Yes Yes 3 No No Yes No Yes 4 No No Yes No No 5 No No Yes No No

Table 3 Responses to questions in Table 1 Item Respondent 1 Respondent 2 Respondent 3 Respondent 4 Respondent 5 showing violations of Guttman scalability* 1 No Yes Yes Yes Yes 2 No No Yes Yes No 3No Yes Yes No Yes 4 Yes No Yes No No 5 No No Yes No No

*Violations of Guttman scalability are shown in bold.

2011 Blackwell Publishing Ltd, Journal of Clinical Nursing 3 R Watson et al.

versions of the Guttman scale hold in the data. These 1 stochastic versions of a Guttman scale are known as the Item 2 Item 1 model of monotone homogeneity (MHM) and the double monotonicity model (DMM), which fall within the broad p(θ) spectrum of methods described as item response theory (IRT). The scores on a scale, representing the presence of the latent trait, are related to the scores on individual items by probability (Sijtsma & Molenaar 2002); that is, an item is θ more likely to be endorsed if the person scores high in the latent trait; an individual is more likely to agree to an item Figure 3 Two items showing different levels of difficulty and dis- crimination: item 1 is more difficult than item 2 and item 2 is more indicating that they are anxious if they suffer from anxiety – discriminating than item 1. but the model does allow for error and individual counter- intuitive responses to some items. Thus, if we represent the Discrimination and difficulty response to an item on a Guttman scale as shown in Fig. 1, ICCs, as shown in Fig. 3 can differ in steepness and those that we can see that the response to the item is either endorsement are steeper are described as more discriminating than those or not; that is ‘yes’ or ‘no’ to an item where agreement with a that are less steep. Item 1 is more difficult than item 2 because statement is required – this is a deterministic model. for the majority of trait levels, the probability of endorsing However, Fig. 2 shows how responses to an item are viewed item 2 is greater than the probability of endorsing item 1. In in a stochastic model such as the MMH; as the amount of the this sense ‘difficulty’ refers to the ease with which an item is latent trait increases, the probability of endorsing the item endorsed by respondents. Ideally, in general trait measure- also increases and it does so in a characteristic fashion to be ment, what is required is highly discriminating items with described below. varying levels of difficulty.

Item characteristic curve Assumptions of IRT

The behaviour of individual items in a scale relative to the Before proceeding to describe Mokken scaling in more detail, latent trait is described by item characteristic curves (ICCs) three assumptions of IRT which are common across IRT (Hambleton & Swaminathan 1985). If the Greek symbol methods should be explained. These assumptions are: unidi- theta (h) represents the latent trait then the ICC for any mensionality, local independence, and monotonicity. IRT particular item represents P(h) – the probability of an assumes that, for the items that form a scale, there is a individual’s score on an item being obtained in the presence dominant single latent trait that determines the answers to of a particular level of the latent trait. As the latent trait the items; this is known as unidimensionality (Hulin et al. increases, then the probability of a score on the item increases 1983). This does not mean that, in a large set of items, more in a non-linear fashion described by a probability distribution than one dimension may not exist; rather, that sets of items function as shown in Fig. 2. The shape of and relationship fitting an IRT model are unidimensional. This distinction is between ICCs is part of Mokken scaling analysis and this will noteworthy if we emphasise that, despite meeting the be described below. assumption of unidiminsionality for this scale, minor abilities can still influence response patterns. ‘It has been long argued 1 that responses to a set of items are multiply determined, in that several minor abilities are required to respond to items’ (Nandakumar 1994, p. 17). IRT also assumes local stochastic

p(θ) independence of items in a scale, which means that an individual’s responses to items in a scale are dependent on the individual’s level on the latent trait being measured: the response to one item is not influenced by the score on another (Nunnally 1978). It should be emphasised that this is, largely, θ an assumption as complete local stochastic independence is

Figure 2 An item responding stochastically in the presence of latent very hard to achieve, especially where items in a scale form an trait h on the abscissa with the probability of a positive response to item chain where success on one item depends on success on the items on the ordinate. the previous item or where items overlap (Balazs & De Boeck

4 2011 Blackwell Publishing Ltd, Journal of Clinical Nursing Original article Item response theory undated). Montonicity refers to the increasing probability of the score on the item increasing as the level if the latent trait Item 2 Item 1 increases; this is illustrated in Fig. 2.

Item 3 The model of monotone homogeneity The next important stage of the Mokken procedure is investigating if items fit the MMH (Mokken & Lewis 1982). The MMH is a very unrestrictive model, assuming only the three common assumptions in IRT: unidimensionality, local independence, and monotonicity. While unidimensionality Figure 5 Three items with items 1 and 2 showing DMM and item 3 and local independence cannot be visualised, Fig. 4 illustrates violating DMM. one item (item 1) which has monotonicity and one (item 2) which has not. MMH means that, for all items, as the score Mokken scaling analysis on the latent trait increases, the score on the items in the scale should also increase. The MMH allows for the ordering of Mokken scaling is an improved version of Guttman scaling in persons on the latent trait by the sum of the item score – an the sense that it is stochastic and, thereby, less restrictive in essential requirement for a psychological test. The software nature: it does not assume that the responses of individuals to for Mokken scale analysis (Molenaar & Sijtsma 2000) pro- items in a scale are deterministic. Therefore, Mokken scaling vides diagnostics which allow you to detect items which produces a similar outcome to Guttman scaling but is likely violate MMH and remove them from the analysis. to include more items from a pool of items into a hierarchical scale and also to provide more information about the The double monotonicity model behaviour of individual items. For scale developers, Mokken The next stage of the Mokken scaling procedure is investi- scaling is an additional tool for the identification of unidi- gating whether the items fit the double monotonicity model mensionsal sets of items and it provides additional informa- (DMM) (Mokken & Lewis 1982). In addition to the three tion about the relationship between items in such scales. assumptions of the MMH, the DMM assumes that the ICCs do not intersect. This is shown in Fig. 5 where items 1 and 2 Loevinger’s coefficient (H) do not intersect and item 3 intersects items 1 and 2; therefore, The first parameter to be established in Mokken scaling is item 3 violates the DMM. The DMM allows for the ordering whether or not, based on their mean scores, the items in a of persons on the latent trait by the sum score and allows for questionnaire form a hierarchy. In Mokken scaling this is an invariant ordering (IIO) of the items in terms of difficulty. done by inspecting the item scores relative to one another to The IIO property is crucial for establishing hierarchical see how consistently they are ordered. This is precisely the scales. The order of the items in terms of difficulty should be same as Guttman scaling and it is low violations of Guttman the same for all respondents whatever their latent trait value ordering that lead to the initial inclusion of items in a Mok- (Sijtsma & Junker 1996). The software for Mokken scale ken scale. The principals of this can easily be demonstrated by analysis also provides diagnostics which allow you to detect looking at Table 4 which shows, in a 2 · 2 contingency table items which violate the DMM and remove them from the (based on Niemo¨ ller & van Schuur 1983), two items i and j analysis. and all possible patterns of endorsement. If item i has greater ‘difficulty’ than item j then responses in cells a, c and d are all 1 acceptable, but responses in cell b are not, as these are clearly Item 2 Item 1 violations of the Guttman ordering and arise from respon- dents who have scored item i as being less difficult than item j. p(θ) Cell b, therefore, is the error cell and the proportion of responses in cell b to those in the remaining cells gives an estimation of the scalability: the extent to which items form a hierarchy. This proportion is used to compute item-pair

scalability coefficient Hij: if there are no observations in error θ cell b, then Hij equals 1 and items i and j form a perfect Figure 4 Two items with Item 1 showing MMH and Item 2 violating Guttman scale. As the number of observations in the error MMH. cell increase Hij decreases. Additionally, there is an item

2011 Blackwell Publishing Ltd, Journal of Clinical Nursing 5 R Watson et al.

Table 4 Contingency table showing possible response categories to of two steps: first, based on statistical testing, items are two items i and j detected (and removed if requested) that clearly violate IIO. T Item j Second, a diagnostic analogous to H, called Htrans or H is used to establish IIO; values of HT > 0Æ3 are considered to Item i 10 indicate a scale with IIO (Ligtvoet et al. 2010). The methods 1 a (1, 1) b (1, 0) for investigating IIO will be described below. Likewise, 0 c (0, 1) d (0, 0) establishing IIO is possible and the software for these analyses will be described below and some examples of the application

scalability coefficient Hi, indicating the scalability of an of Mokken scaling will be provided. individual item, and a scalability coefficient H, indicating the scalability of the set of items. A set of items forms a so-called Software for Mokken scaling Mokken scale if two conditions are met: (1) for all item pairs,

scalability coefficient Hij is greater than zero and (2) scala- The Mokken scaling procedure (MSP) (Molenaar & Sijtsma

bility coefficient Hi is greater than some a priori chosen cri- 2000) is commercially available for Windows and this terion c (c usually equals 0Æ3, but this is up to the researcher). software is capable of analysing polytomous items and of In addition, the following guidelines for H are common: investigating scalability, MMH and DMM for dichotomous 0Æ30–0Æ40 is called a weak scale, 0Æ40–0Æ50 is called a medium and polytomous items. It is also capable of analyzing IIO for scale, and >0Æ50 is called a strong scale Molenaar and Si- dichotomous items but not, in its most recent release, of jtsma (2000). analyzing IIO for polytomous items. However, public domain software called R contains Mokken scaling analysis (van der Polytomous items Ark 2007) and is capable of analysing IIO for polytomous Thus far we have been concerned with dichotomous items items and generating HT (van der Ark 2010). For specific where the non-intersecting ICCs as implied by the DMM is details on how to run these software packages, manuals are equivalent to IIO. Mokken scaling has been developed to available with step-by-step guidance. The outputs from MSP analyse polytomous (Sijtsma et al. 1990, Hemker & Sijtsma and R have some similarities and some differences. MSP is 1995) items. The principles remain the same, but the analysis probably more ‘user friendly’ and produces outputs that are is not simply of ICCs, but of the responses to each of the more directly interpretable and which can be imported levels in the items (e.g. 1–5 on a Likert type scale). The relatively easily into publications. R, on the other hand, is resulting relationship between these responses and the score more sophisticated in its analytical possibilities but, as this is on the latent trait can be represented using item step char- public domain, the output options are less sophisticated. acteristic curves or item step response functions (ISRFs). Mokken scaling is only one method for investigating IRT These are the responses of each of the steps in the scale. For models; other IRT methods, such as Rasch modelling and the example, in a Likert scale with five response categories, there Generalized Partial Credit Model will not be described here. are four steps between the five categories, so there are four However, other software for analyzing items is available and ISRFs. ISRFs are central to the analysis of polytomous items TestGraf, for example, is also public domain software for item using Mokken scaling. The procedure for establishing if a set analysis (http://www.psych.mcgill.ca/faculty/ramsay/TestGraf. of polytomous items forms a Mokken scale follows the same html; retrieved 1 October 2010) and can be a useful adjunct pattern as that for dichotomous items. Without a detailed for investigating item properties. explanation, the parameters of H, MHH and DMM should hold for ISRFs and diagnostics exist to establish this in ver- Applications of Mokken scaling sions of the Mokken scaling procedure that have been developed to process polytomous items. Mokken scaling has been applied to a wide range of sets of An important difference between dichotomous items and items in psychology, social science, and nursing; it polytomous items is that for polytomous items, the DMM has been applied to psychological questionnaires (Moorer & does not automatically yield items that are invariantly ordered Suurmeijer 1994, Watson et al. 2007, 2008a,b, Bedford et al. in terms of difficulty. Hence, for polytmous items, establishing 2010a,b, Deary et al. 2010, Stewart et al. 2010), attitude the DMM is not enough for establishing IIO (Sijtsma et al. measurements (Gillespie 1988, Tenvegert et al. 1992), qual- 2011). Recently, Ligtvoet et al. (2010) developed a method to ity of life (Moorer et al. 2001, Ringdal et al. 1999, 2003, investigate whether IIO holds. The method should be applied Thompson & Watson 2010), disability (Kempen et al. 1995, after the conventional Mokken scaling procedure and consists van Boxel et al. 1995), nursing (Watson 1996, Dijkstra et al.

6 2011 Blackwell Publishing Ltd, Journal of Clinical Nursing Original article Item response theory

1999, 2000, Lin et al. 2008), psychiatry (de Jong & Mole- Mokken scaling further confirmed that this was the case. naar 1987, Meijer & Baneke 2001), pain (von Korff et al. Subsequent analysis of separate datasets have confirmed 1992), sleep (Kingshott et al. 1998) and activities of daily both the factor structure of the EdFED and the hierarchical living (Kempen & Suurmeijer 1990, 1991, Suurmeijer & ordering of the behavioural items and this holds for a recent Kempen 1990, Fieo et al. 2010) inventories. Specific exam- dataset obtained from Taiwan using a translated version of ples of recent applications include the use of MSP to analyse the EdFED scale (Lin et al. 2008). The EdFED items form a sets of items in inventories of psychological distress such as hierarchy, in terms of difficulty, from a general refusal to the General Health Questionnaire (Watson et al. 2008) and eat, up to being unable to swallow and allowing food to fall the Clinical Outcomes in Routine Evaluation-Outcome out of the mouth. Measure (CORE-OM) (Bedford et al. 2010).

The present study Psychological morbidity The application of MSP to inventories measuring psycho- Therefore, the psychometric properties of the EdFED and logical morbidity and personality disturbance show hierar- Townsend scales have been intensively studied, including the chies of items ranging from relatively mild distress to more use if IRT. Both demonstrate four of the essential features of extreme forms of distress and these are often anchored at the Mokken scales but neither – in common with many published most difficult end of the scale in items which indicate feelings Mokken scales with polytomous items – has been investi- of worthlessness or suicidal ideation. For example, Mokken gated for IIO, which can only be implied from the above scaling of one version the General Health Questionnaire parameters. We considered that these two short scales were (GHQ), the GHQ-30, shows a hierarchy of items running in good candidates for a study of IIO that would complement terms of difficulty from being unable to face up to problems the existing information on these scales, complete the present and being constantly under strain to feeling that life is not paper, and illustrate the principles and application of IIO. worth living (Watson et al. 2008a). Mokken scaling of the With the advent of the R programme containing the Mokken CORE-OM shows a hierarchy of items running in terms of scaling procedure and, within that, the ability to check for difficulty from not being able to achieve things and being IIO in polytomous items, it is now possible to analyse the unable to ignore problems to making plans to end one’s life Townsend and the EdfED for IIO. (Bedford et al. 2010). Clearly, in these cases, the intervening items are also logically related to the anchoring items in the Methods range. The EdFED database came from a study by Lin et al. ADL scales (2008) and comprised data from 477 older people with An ADL (Activities of Daily Living; Townsend 1962) scale, dementia (mean age 79Æ5 [SD 9Æ62] years; 67Æ7% male) in the Townsend Functional Ability Scale (Townsend), has nine special licensed long-term care facilities in Taiwan. been analysed using MSP (Fieo et al. 2010) and this shows a The Townsend database came from a study by Fieo et al. hierarchy of items in terms of difficulty from tying a knot in (2010) and comprised data from 548 older people aged 79 a piece of string to cutting one’s own toe nails indicating and 42Æ3% male. The analysis here is concerned only with that, as ADL function is increasingly impaired, it is fine items that showed Mokken scaling properties in the movements that are affected. The above inventories are self- previous studies; i.e. six of 11 items in the case of the administered, but MSP has been applied to observational EdFED and seven of nine items in the case of the data and one example of this is the development of the Townsend. Data from both sets of items were imported EdFED scale (Watson 1994 a,b, 1997, Watson & Deary into the R package by converting SPSS databases to 1996, 1997, Watson et al. 2001, 2002). The EdFED was *.Rdata files and, using the default settings, checked for developed from a set of items designed to measure feeding violations of IIO and coefficient HT by running the difficulty in older people with dementia. Early versions of ‘check.iio’ procedure for both sets of data. the EdFED were analysed using factor analysis and Guttman scaling and it was shown that there were distinct underlying Results dimensions to the EdFED and that one of these was related solely to the behavioural aspects of feeding displayed by the The results are shown in Table 5. For both scales there are no older people with dementia. The factor analysis and the significant violations of IIO for either set of items; therefore, Guttman scaling confirmed this and the later application of no items required removal and only one step was required.

2011 Blackwell Publishing Ltd, Journal of Clinical Nursing 7 R Watson et al.

Table 5 Violations of IIO and coefficient HT for the EdFED and the MSP and where items have been shown to be unidmensional TFAS and hierarchical and where ISRFs have been shown to

EdFED Step TFAS Step adhere to the MMH and the DMM. The unique contribu- tion of this paper has been to investigate these sets of items Items* 1 Items* 1 for IIO. Refusing to eat 0 Tie knot 0 IIO was shown to be a safe assumption for items from the Refusing to open mouth 0 Wash or bathe 0 Turning head away 0 Reach overhead 0 Townsend but not for the EdFED, despite there being no Spitting 0 Get on bus 0 violations of IIO in either set of items. This is good news for Refusing to swallow 0 Up and down stairs 0 users of the Townsend and little more needs to be said here – Leaving mouth open 0 Cut toe nails 0 the items extracted by the MSP are hierarchical and do not Coefficient HT 0Æ27 0Æ48 overlap in the sample of individuals tested. However, what *Items are abbreviated. about the EdFED scale? The uniqueness and utility of the EdFED scale have been highlighted (Amella 2002, Amella & Stockdell 2008, Chang & Roberts 2008, Aselage 2010, The EdFED items show HT of 0Æ27 and those from the Aselage & Amella 2010) and it has been recommended by the Townsend show HT of 0Æ48. The retention of all items The Hartford Institute for Geriatric Nursing, New York suggests that there is evidence of IIO in both sets of items but University, College of Nursing, and the Alzheimer’s Associ- it is stronger for the Townsend items. According to Ligtvoet ation in the USA (http://consultgerirn.org/uploads/File/ et al. (2010), HT may be interpreted in the same way as trythis/try_this_d11_1.pdf; retrieved 10 October 2010). The H meaning that HT < 0Æ3 indicates that item ordering is too EdFED shows good psychometric properties (Watson et al. inaccurate to be useful and 0Æ4 £ HT < 0Æ5 means that 2001, 2002), is responsive to different levels of dementia and accuracy of item ordering is medium. has been used as an outcome measure in recent studies of feeding difficulty (Lin et al. 2010a,b). Clearly, the EdFED is a Discussion useful instrument but the indications here, based on the work of Ligtvoet et al. (2010), are that the items lack either This paper has described the development of Mokken scaling discrimination or are located very close on the latent trait of analysis from its roots in the non-stochastic, deterministic feeding difficulty. Analysis using TestGraf (unpublished) scaling method of Guttman through its evolution as a method suggests that ICCs are relatively steep and, therefore, for the analysis of dichotomous items and polytomous items. discriminating. It is likely, therefore, that the explanation The basic precepts of IRT: MMH (unidmensionality, local for inaccurate IIO is closeness of items on the latent trait of stochastic independence), DMM and IIO have been consid- feeding difficulty in dementia. ered and how these apply to Mokken scaling and how these compare and contrast with classical test theory. The compu- Conclusion tational software, principally MSP and R, available for Mokken scaling analysis have been described along with Mokken scaling is a useful way of investigating the behaviour some applications of Mokken scaling in different fields of of items in scales in response to varying levels of a latent trait. study in the social sciences. Mokken scaling has recently been enhanced by the introduc- However, recent exchanges between scaling practitioners tion of software to enable analysis of IIO. The Townsend, (Watson & Deary 2009, Bedford et al. 2010, Meijer 2010) a scale to measure ADL, shows good IIO and the EdFED, have revealed some misunderstanding of the extent to which a scale to measure feeding difficulty in older people with IIO is both understood and can be implied from the dementia, does not. Further development of the EdFED scale Mokken scaling analysis procedures and software available is required to increase its utility. Possibly, additional items are for polytomous items. In addition, recent additions to the required and its properties need to be studied further across a literature (Ligtvoet et al. 2010, Sijtsma et al. 2011) have range of severity of dementia. demonstrated that the concept of IIO, as it applies to polytomous items, is now much better understood and how, Relevance to clinical practice with the advent of new computational software, it can be addressed in sets of polytomous items. Therefore, the Generally, in clinical practice, we require good measure- original contribution of this paper is to re-visit two existing ment scales with good psychometric properties including databases which have previously been analysed using the IIO. Sequential functional loss scales are more informative

8 2011 Blackwell Publishing Ltd, Journal of Clinical Nursing Original article Item response theory than those simply summing functional loss, with predictive Acknowledgements value to the clinician monitoring, for example, an older patient: if the sequence is out of order or accelerated, the This study was unfunded. need for interventions may be indicated (Daltroy et al. 1992). Additionally, examination of the sequence of loss Contributions may help characterise adaptations to impairment and differences among subgroups. The Townsend shows good Study design: RW, RF, L-CL; data collection and analysis: IIO and is likely to be useful in its present form in clinical RF, L-CL, RW, IJD and manuscript preparation: RW, LAvA, practice. The EdFED remains, demonstrably, a useful RF, RM. clinical tool but its applicability across a wide range of levels of feeding difficulty and dementia remains problem- Conflict of interest atic. In addition, clearly, replication of the EdFED study is necessary. None.

References

Alagumalai S & Curtis D (2005) Classical Bannigan K & Watson R (2009) Reliability ing Care Dependency Scale. Journal of test theory. In Applied Rasch Mea- and validity in a nutshell. Journal of Clinical Nursing 8, 380–388. surement: A Book of Exemplars (Ala- Clinical Nursing 18, 3237–3243. Dijkstra A, Brown L, Havens B, Romeron gumalai S, Curtis D & Hungi eds). Bedford A, Watson R, Lyne J, Tibbles J, TI, Zanotti R, Dassen T & van den Springer, Dordrecht, pp. 1–14. Davies F & Deary IJ (2010a) Mokken Heuvel W (2000) An international Amella EJ (2002) Resistance at mealtimes scaling and principal components psychometric testing of the care for persons with dementia. The Journal analysis of the CORE-OM in a large dependency scale. Journal of Advanced of Nutrition, Health & Aging 6, 117– clinical sample. Clinical Psychology Nursing 31, 944–952. 22. and Psychotherapy 17, 51–56. Fieo R, Watson R, Deary IJ & Starr JM Amella EJ & Stockdell R (2008) Edin- Bedford A, Watson R, Henry JD, Crawford (2010) A revised activities of daily liv- burgh Feeding in Dementia Scale: JR & Deary IJ (2010b) Mokken scaling ing/instrumental activities of daily living Developing a plan of care for meal- analyses of the Personal Disturbance instrument increases interpretive power: times. American Journal of Nursing Scale (DSSI/sAD) in large clinical and theoretical application for functional 108, 46–54. non-clinical samples. Personality and tasks. Gerontology 56, 483–490. van der Ark LA (2007) Mokken scale anal- Individual Differences 50, 38–42. Finch M, Kane RL & Philip I. (1994) ysis in R. Journal of Statistical Software van Boxel Y, Roest FHJ, Bergen MP & Stam Developing a new metric for ADLs. 20, 1–19. HJ (1995) Dimensionality and hierar- Journal of the American Geriatric van der Ark LA (2010) Getting started with chical structure of disability measure- Society 43, 877–884. Mokken scale analysis in R. http:// ment. Archives of Physical Medicine Fisher WP & Fisher AG (1993) Applications spitswww.uvt.nl/~avdrark/research/ and Rehabilitation 76, 1152–1155. of Rasch analysis to studies in occupa- mokkenstart.pdf (retrieved 1 October Chang C-C & Roberts BL (2008) Feeding tional therapy. Physical Medicine and 2010). difficulty in older people with demen- Rehabilitation Clinics of North Amer- Aselage M (2010) Measuring mealtime dif- tia. Journal of Clinical Nursing 17, ica New Developments in Functional ficulties: eating, feeding and meal 2266–2274. Assessment 4, 551–569. behaviours in older adults with Daltroy LH, Logigian M, Iversen MD & Gillespie M, Tenvengert EM & Kingsma J dementia. Journal of Clinical Nursing Liang MH (1992) Does musculoskeletal (1988) Using Mokken methods to 19, 621–631. function deteriorate in a predictable develop robust cross-national scales: Aselage M & Amella EJ (2010) An evolu- sequence in the elderly? Arthritis Care American and West-German attitudes tionary analysis of mealtime difficulties Research 5, 146–150. toward abortion. Social Indicators in older adults with dementia. Journal Deary IJ, Wilson JA, Carding PN, Mac- Research 20, 181–203. of Clinical Nursing 19, 33–41. Kenzie K & Watson R (2010) From Hambleton RK & Swaminathan H (1985) Balazs K & de Boeck P (undated) Detecting dysphonia to dysphoria: Mokken Item Response Theory: Principles and local item dependence stemming scaling shows a strong, reliable hierar- Applications. Kluwer-Nijhoff Publish- for minor dimensions. Technical Report chy of voice symptoms in the Voice ing, Boston. 0684 IAP Statistics Network, Inter- Symptom Scale questionnaire. Journal Hambleton RK, Swaminathan H & Rogers university Attraction Pole. http://www. of Psychosomatic Research 68, 67–71. HJ (1991) Fundamentals of Item stat.ucl.ac.be/Iapdp/tr2006/TR0684. Dijkstra A, Buist G, Moorer P & Dassen T Response Theory. Sage, Newbury Park, pdf (retrieved 30 September 2010). (1999) Construct validity of the Nurs- CA.

2011 Blackwell Publishing Ltd, Journal of Clinical Nursing 9 R Watson et al.

Hemker BT & Sijtsma K (1995) Selection of spaced retrieval and Montessori-based D, Schofield N & Whitely P eds). Fran- unidimensional scales from a multi- activities in improving eating ability for ces Pinter, London, pp. 120–170. dimensional item bank in the poly- residents with dementia. International Nunnally JP (1978) Psychometric Theory. tmous Mokken IRT model. Applied Journal of Geriatric Psychiatry 25, McGraw-Hill, New York. Psychological Measurement 19, 337– 953–959. Ringdal K, Ringdal GI, Kaasa S, Bjordal K, 352. Lin L-C, Watson R & Wu S-C (2010b) Wisløff BF, Sundstrøm S & Hjermstad Hulin LH, Drasgow Y & Parsons CK (1983) What is associated with low food intake MJ (1999) Assessing the consistency of Item Response Theory: Application to in older people with dementia? Journal psychometric properties of the HRQoL Psychological Measurement. Dow of Clinical Nursing 19, 53–59. scales within EORTC QLQ-C30 across Jones-Irvin, Homewood, IL. Meijer RR (2010) A comment on Watson, populations by means of the Mokken de Jong A & Molenaar IW (1987) An Deary, and Austin (2007) and Watson, scaling model. Quality of Life Research application of Mokken’s model for Roberts, Gow, and Deary (2008): how 8, 25–43. stochastic, cumulative scaling in psy- to investigate whether personality items Ringdal GI, Jordhøy MS & Kaasa S (2003) chiatric research. Journal of Psychiatric form a hierarchical scale? Personality Measuring quality of palliative care: Research 21, 137–149. and Individual Differences 48, 502–503. psychometric properties of the FAM- Katz E (1988) In memoriam: Louis Gutt- Meijer RR & Baneke JJ (2001) Analyzing CARE scale. Quality of Life Research man, 1916–1987. Public Opinion psychopathology items: a case for 12, 167–176. Quarterly 52, 240–242. nonparametric item response theory Rust J & Golombock S (1999) Modern Kempen GIJM & Suurmeijer ThPBM modeling. Psychological Methods 9, Psychometrics. Routlegde, London. (1990) Depression, loneliness, physical 354–368. Schuessler K (1971) Analyzing Social Data: disability and the utilization of profes- Meijer RR & Sijtsma K (1990) Theoretical A Statistical Orientation. Houghton sional home care among older adults. and empirical comparison of the Mok- Mifflin Company, Boston. International Journal of Health Sci- ken and the Rasch approach to IRT. van Schuur WH (2003) Mokken scale ences 1, 271–276. Applied Psychological Measurement analysis: between the Guttman scale Kempen GIJM & Suurmeijer ThPBM 14, 283–298. and parametric item response theory. (1991) Factors influencing professional Menzel H (1953) A new coefficient or sca- Political Analysis 11, 139–163. home care utilization among the logram analysis. Public Opinion Sijtsma K & Junker BW (1996) A survey of elderly. Social Science and Medicine 32, Quarterly 17, 268–280. theory and methods of invariant item 77–81. Mokken RJ & Lewis C (1982) A nonpara- ordering. British Journal of Mathemat- Kempen GIJM, Myers AM & Powell LA metric approach to the analysis of ical and Statistical Psychology 49, (1995) Hierarchical structure in ADL dichotomous item responses. Applied 79–105. and IADL: analytical assumptions and Psychological Measurement 6, 417– Sijtsma K & Molenaar IW (2002) An applications for clinicians and 430. Introduction to Nonparametric Item researchers. Journal of Clinical Epide- Molenaar IW & Sijtsma K (2000) MSP5 for Response Theory. Sage Publications, miology 48, 1299–1995. Windows: a program for Mokken Scale Thousand Oaks. Kingshott R, Douglas N & Deary I (1998) Analysis for Polytomous Items. iec Sijtsma K, Debets P & Molenaar IW (1990) Mokken scaling of the Epworth Sleep- ProGAMMA, Groningen. Mokken scaling analysis for polychot- iness Scale items in patients with the Moorer P & Suurmeijer TBPM (1994) A omous items: theory, a computer pro- sleep apnoea/hypopnoea syndrome. study of unidimensionality and cumu- gramme and an empirical application. Journal of Sleep Research 7, 293–294. lativeness of the MOS Short-Form Quality & Quantity 24, 173–188. von Korff MV, Ormel J, Keefe FJ & General Health Survey. Psychological Sijtsma K, Meijer RR & van der Ark Dworkin SF (1992) Grading the Reports 74, 467–470. LA (2011) Mokken scale analysis as severity of chronic pain. Pain 50, 133– Moorer P, Suurmeijer ThBPM, Foets M & time goes by: an update for scaling 149. Molenaar IW (2001) Psychometric practitioners. Personality and Individ- Ligtvoet R, van der Ark LA, te Marvelde JM properties of the RAND-36 among ual Differences 50, 31–37. & Sijtsma K (2010) Investigating an three chronic diseases (multiple sclero- Stewart ME, Watson R, Clark A, Ebmeier invariant item ordering for polytom- sis, rheumatic diseases and COPD) in KP & Deary IJ (2010) A hierarchy of ously scored items. Educational and the Netherlands. Quality of Life happiness? Mokken scaling analysis of Psychological Measurement 70, 578– Research 10, 637–645. the Oxford Happiness Inventory. Per- 595. Nandakumar R (1994) Assessing dimen- sonality and Individual Differences 48, Lin L-C, Watson R, Lee Y-C, Chou Y-C & sionality of a set of item responses- 845–848. Wu S-C (2008) Edinburgh Feeding comparison of different approaches. Stouffer SA, Guttman L, Suchman EA, La- Evaluation in Dementia (EdFED) scale: Journal of Educational Measurement zarsfeld PF, Star SA & Clausen JA cross-cultural validation of the Chinese 31, 17–35. (1950) Measurement and Prediction, version. Journal of Advanced Nursing Niemo¨ ller K & van Schuur W (1983) Sto- Vol. 4. Princeton University Press, 62, 116–123. chastic models for unidimensional scal- Princeton, NJ. Lin L-C, Huang Y-J, Su S-G, Watson R, ing: Mokken and Rasch. In Data Suurmeijer ThPBM & Kempen GIJM Tsailk BW-J & Wu S-C (2010a) Using Analysis and the Social Sciences (McKay (1990) Behavioural changes as an out-

10 2011 Blackwell Publishing Ltd, Journal of Clinical Nursing Original article Item response theory

come of disease: the development of an elderly people with dementia. Interna- Watson R, Green S & Legg L (2001) The instrument. International Journal of tional Journal of Nursing Studies 33, Edinburgh Feeding Evaluation in Health Sciences 1, 189–194. 385–393. Dementia Scale #2 (EdFED#2): conver- Tenvegert E, Gillespie MW, Kingma J & Watson R & Deary IJ (1994) Measuring gent and discriminant validity. Clinical Klasen H (1992) Abortion attitudes, feeding difficulty in patients with Effectiveness in Nursing 5, 44–46. 1984–1987–1988: effects of item dementia: multivariate analysis of Watson R, MacDonald J & McReady T ordering and dimensionality. Percep- feeding problems, nursing interventions (2002) The Edinburgh Feeding Evalua- tual and Motor Skills 74, 627– and feeding difficulty. Journal of tion in Dementia Scale #2: inter- and 642. Advanced Nursing 20, 283–287. intra-rater reliability. Clinical Effec- Thompson DR & Watson R (2010) Mok- Watson R & Deary IJ (1996) Is there a tiveness in Nursing 5, 184–186. ken scaling of the Myocardial Infarc- relationship between feeding difficulty Watson R, Deary I & Austin E (2007) Are tion Dimensional Assessment Scale and nursing intervention in elderly personality trait items reliably more or (MIDAS). Journal of Evaluation in people with dementia? NT Research 1, less ‘difficult’? Mokken scaling of the Clinical Practice 17, 156–159. 44–45. NEO-FFI. Personality and Individual Townsend P (1962) The Last Refuge. Watson R & Deary IJ (1997) Feeding diffi- Difference 43, 1460–1469. Routledge & Kegan Paul, London. culty in elderly people with dementia: Watson R, Deary IJ & Shipley B (2008a) A Watson R (1994a) Measuring feeding diffi- confirmatory factor analysis. Interna- hierarchy of distress: Mokken scaling of culty in patients with dementia: devel- tional Journal of Nursing Studies 34, the GHQ-30. Psychological Medicine oping a scale. Journal of Advanced 405–414. 28, 575–579. Nursing 19, 257–263. Watson R & Deary IJ (2009) Reply to: A Watson R, Roberts(nee Shipley) B, Gow A Watson R (1994b) Measuring feeding diffi- comment on Watson, Deary, and & Deary IJ (2008b) A hierarchy of culty in patients with dementia: repli- Austin (2007) and Watson, Roberts, items within Eysenck’s EPI. Personality cation and valiadation of the EdFEd Gow, and Deary (2008): how to and Individual Differences 45, 333– Scale#1. Journal of Advanced Nursing investigate whether personality items 335. 19, 850–855. form a hierarchical scale? Personality Wilson M (2005) Constructing Measures: An Watson R (1996) Mokken scaling procedure and Individual Differences 48, 504– Item Response Modelling Approach. (MSP) applied to feeding difficulty in 505. Erlbaum, Mahwah, NJ, p. 187.

The Journal of Clinical Nursing (JCN) is an international, peer reviewed journal that aims to promote a high standard of clinically related scholarship which supports the practice and discipline of nursing.

For further information and full author guidelines, please visit JCN on the Wiley Online Library website: http://wileyonlinelibrary.com/journal/jocn

Reasons to submit your paper to JCN: High-impact forum: one of the world’s most cited nursing journals and with an impact factor of 1Æ228 – ranked 23 of 85 within Thomson Reuters Journal Citation Report (Social Science – Nursing) in 2009. One of the most read nursing journals in the world: over 1 million articles downloaded online per year and accessible in over 7000 libraries worldwide (including over 4000 in developing countries with free or low cost access). Fast and easy online submission: online submission at http://mc.manuscriptcentral.com/jcnur. Early View: rapid online publication (with doi for referencing) for accepted articles in final form, and fully citable. Positive publishing experience: rapid double-blind peer review with constructive feedback. Online Open: the option to make your article freely and openly accessible to non-subscribers upon publication in Wiley Online Library, as well as the option to deposit the article in your preferred archive.

2011 Blackwell Publishing Ltd, Journal of Clinical Nursing 11