PSYCHOMETRIKA—VOL. 78, NO. 1, 116–133 JANUARY 2013 DOI: 10.1007/S11336-012-9293-1
HOW SHOULD WE ASSESS THE FIT OF RASCH-TYPE MODELS? APPROXIMATING THE POWER OF GOODNESS-OF-FIT STATISTICS IN CATEGORICAL DATA ANALYSIS
ALBERTO MAYDEU-OLIVARES FACULTY OF PSYCHOLOGY, UNIVERSITY OF BARCELONA
ROSA MONTAÑO UNIVERSIDAD DE SANTIAGO DE CHILE
We investigate the performance of three statistics, R1, R2 (Glas in Psychometrika 53:525–546, 1988), and M2 (Maydeu-Olivares & Joe in J. Am. Stat. Assoc. 100:1009–1020, 2005, Psychome- trika 71:713–732, 2006) to assess the overall fit of a one-parameter logistic model (1PL) estimated by (marginal) maximum likelihood (ML). R1 and R2 were specifically designed to target specific assump- tions of Rasch models, whereas M2 is a general purpose test statistic. We report asymptotic power rates under some interesting violations of model assumptions (different item discrimination, presence of guess- ing, and multidimensionality) as well as empirical rejection rates for correctly specified models and some misspecified models. All three statistics were found to be more powerful than Pearson’s X2 against two- and three-parameter logistic alternatives (2PL and 3PL), and against multidimensional 1PL models. The results suggest that there is no clear advantage in using goodness-of-fit statistics specifically designed for Rasch-type models to test these models when marginal ML estimation is used. Key words: discrete data, power, IRT, maximum likelihood.
1. Introduction
Broadly speaking, item response theory (IRT) refers to the class of latent trait models for discrete multivariate data obtained by coding the responses to a set of questionnaire items, such as those found in educational tests, personality inventories, etc. Rasch-type models are a subset of IRT models, so named after the pioneering work of Rasch (1960). Rasch-type models are characterized by two properties (McDonald, 1999): (a) the sum score is a sufficient statistic for the latent traits, and (b) comparisons of subpopulations are made independently of the item or items used for the comparison (the so-called specific objectivity property). Although only highly restrictive IRT models can satisfy these properties, their mathematical potential has led some researchers to prefer them to all other IRT models. Thus, we may distinguish between two traditions in IRT modeling: a model-based tradition and a data-based tradition. In the model- based tradition, a model with appealing mathematical properties is selected first (a Rasch-type model) and tests are designed to fit the model. By contrast, in a data-based tradition, different models within the IRT family are explored to find the best fitting model for the available data. Because of the availability of sufficient statistics for the latent traits that do not depend on item parameters, estimation methods (conditional maximum likelihood, or CML) and goodness- of-fit testing procedures have been developed specifically for Rasch-type models (for an overview
This research was supported by an ICREA-Academia Award and Grant SGR 2009 74 from the Catalan Government, and by Grants PSI2009-07726 and PR2010-0252 from the Spanish Ministry of Education awarded to the first author, and by a Dissertation Research Award of the Society of Multivariate Experimental Psychology awarded to the second author. The authors are indebted to the reviewers and to David Thissen for comments that improved the manuscript. Requests for reprints should be sent to Alberto Maydeu-Olivares, Faculty of Psychology, University of Barcelona, P. Valle de Hebrón, 171, 08035 Barcelona, Spain. E-mail: [email protected]
116 © 2012 The Psychometric Society ALBERTO MAYDEU-OLIVARES AND ROSA MONTAÑO 117 of Rasch modeling, see Fischer & Molenaar, 1995). However, for Rasch-type models, it is not clear whether there is any advantage in using the procedures specifically designed for these tests or if, on the other hand, those general procedures applicable to other IRT models that do not possess sufficient statistics can actually yield more accurate results. For item parameter estimation techniques and after Thissen’s (1982) pioneering work (1982), there seems to be a consensus (Pfanzagel, 1993; see also De Leeuw & Verhelst, 1986) that the more general marginal maximum likelihood estimation procedure (MML or simply ML) generally implemented via the EM algorithm (see Bock & Aitkin, 1981) is preferable to the CML procedure originally favored by Rasch modelers. However, many Rasch modelers still pre- fer CML because no distribution needs to be assumed for the latent traits. Finally, the reader should also note that MML estimation is sometimes referred to in the literature as full informa- tion maximum likelihood (FIML) (e.g., Jöreskog & Moustaki, 2001). However, no such consensus has emerged regarding the use of goodness-of-fit testing pro- cedures. Indeed, there is a large number of goodness-of-fit statistics specifically proposed for Rasch-type models (see Andersen, 1973; van den Wollenberg, 1982; Suárez-Falcon and Glas, 2003; and the excellent review by Glas & Verhelst, 1995), in addition to the general procedures proposed available for testing multivariate discrete data models, and particularly IRT models; see the reviews by Mavridis, Moustaki, and Knott (2007), Swaminathan, Hambleton, and Rogers (2007), and Maydeu-Olivares and Joe (2008). The purpose of this article is to compare the per- formance of certain goodness-of-fit statistics to test Rasch-type models. To do so, we concentrate on models for binary data. More specifically, we use the one-parameter logistic model, that is, the random effects version of Rasch’s 1960 model. The statistics being compared are R1, R2, and M2. The statistics R1 and R2 were proposed by Glas (1988) to assess the fit of the one- parameter logistic model, and M2 was proposed by Maydeu-Olivares and Joe (2005, 2006)for testing general composite null hypotheses in multivariate discrete data The remaining sections of this article are divided as follows. Sections 2 and 3 provide the- oretical background. The R1, R2, and M2 test statistics are described and the details of their asymptotic distribution are provided for composite nulls, and also under a sequence of local alternatives. Section 3 describes a procedure first used by Reiser (2008) to approximate asymp- totically the power of the statistics for specific alternatives without having to use simulations. In many ways, this procedure is the categorical data counterpart to the procedure proposed by Satorra and Saris in structural equation modeling (1985). Section 4 compares the performance of the statistics. We report asymptotic power rates under some interesting violations of model assumptions (different item discrimination, presence of guessing, and multidimensionality) as well as empirical rejection rates for correctly specified models and some misspecified models. Finally, Section 5 provides two numerical examples using real data.
2. Rasch-Type Models for Binary Data
Consider n binary items Yi , whose categories have been coded as 0 or 1. Rasch (1960) proposed the following model: ξ P(Yi = 1|ξ)= , (1) ξ + δi where ξ denotes the latent trait, and δi denotes the item difficulty parameter. The model can be reparameterized using ξ = exp(η) and δi = exp(bi) to yield exp(η) 1 P(Yi = 1|η) = = , (2) exp(η) + exp(bi) 1 + exp[−(η − bi)] 118 PSYCHOMETRIKA which is a special case of the three parameter logistic (3PL) model, 1 P(Yi = 1|η) = ci + (1 − ci) , (3) 1 + exp[−(ai(η − bi))] introduced in Lord and Novick (1968). In (2), η and bi denote the (reparameterized) latent trait and item difficulty parameter, respectively. In turn, the item parameters ai and ci in (3) may be interpreted as discrimination and guessing parameters, respectively. Rasch (1960) treated the latent trait parameters ξ (or equivalently η) as fixed effects, so that the distribution of ξ (or equivalently η) is not specified. He proposed identifying the model by introducing the constraint n bi = 0(4) i=1 and estimating the mean and variance of the latent trait. The model can be equivalently identified by fixing the mean and variance of the latent trait to some constants, say 0 and 1, estimating the bi without the constraint (4), and rewriting (2)as 1 P(Yi = 1|η) = . (5) 1 + exp[−a(η − bi)] Note that the discrimination parameter a in (5) is common to all items. In recent times, the latent trait η in (5) has most often been treated as a random effect, gener- ally by specifying a standard normal distribution for η. To distinguish between the two variants of the model, we shall refer to the fixed-effects version of the model (i.e., Equation (2) with the constraint (4) and latent trait mean and variance to be estimated) as the Rasch model; and we shall refer to the random-effects version (given by Equation (5) with mean zero and unit variance for η), as the one-parameter logistic model or the 1PL. A parametric latent trait distribution need not be assumed for the 1PL, as the latent trait density can be estimated nonparametrically. How- ever, in this article, we shall assume that the latent trait follows a standard normal distribution for all the models considered, meaning the 1PL, the 3PL model (3), and in the special case where all ci parameters are equal to zero, the two-parameter logistic model, 2PL. Readers should also note that regardless of whether we assume the Rasch model or the 1PL (i.e., regardless of whether the latent trait is treated as fixed- or random-effect), the specific objectivity property holds. In other words, for any item the log-odds value of two subpopulations is equal to the difference in their trait values; see Irtel (1995) for a less restrictive definition of specific objectivity that applies to the 2PL.
3. Goodness-of-Fit Assessment in Binary IRT Models
Consider modeling N observations on n binary random variables. The observed responses n can then be gathered in an n-dimensional contingency table with C = 2 cells. Let πc be the prob- ability of one such cell, c = 1,...,C, and let pc be the observed proportion. Also, let π(θ) be the C-dimensional vector of model probabilities expressed as a function of q model parameters θ to be estimated from the data. Then the (composite) null hypothesis to be tested is H0 : π = π(θ) against H1 : π = π(θ). 2 The two standard goodness-of-fit statistics for discrete data are Pearson’s statistic, X = C −ˆ 2 ˆ 2 = C ˆ N c=1(pc πc) /πc, and the likelihood ratio statistic G 2N c=1 pc ln(pc/πc), where ˆ πˆc = πc(θ) denotes the probability of cell c under the model. Asymptotic p-values for both statistics can be obtained using a chi-square distribution with C −q −1 degrees of freedom when maximum likelihood estimation is used. However, these asymptotic p-values are only reliable ALBERTO MAYDEU-OLIVARES AND ROSA MONTAÑO 119 when all expected frequencies are large (>5 is the usual rule of thumb). Unfortunately, as the number of cells in the table increases, the expected frequencies must be small (Bartholomew & Tzamourani, 1999) because the sum of all C probabilities must be equal to 1. As a result, in multivariate discrete data analysis the asymptotic p-values for these statistics can hardly ever be used. To overcome these limitations, a number of authors (e.g., Christoffersson, 1975;Reiser, 1996, 2008; Bartholomew & Leung, 2002; Maydeu-Olivares & Joe, 2005, 2006; Cai, Maydeu- Olivares, Coffman, & Thissen, 2006) have proposed testing using limited information, that is, pooling cells of the contingency table a priori so that the resulting statistics have a known asymp- totic null distribution.
3.1. M2 and the Mr Family of Test Statistics Maydeu-Olivares and Joe (2005) proposed testing using a quadratic form in residual mo- ments of the multivariate Bernoulli distribution (Teugels, 1990) up to the smallest order at which the model is identified. The family of statistics they proposed is ˆ ˆ ˆ Mr = N pr − π r (θ) Cr pr − π r (θ) , (6) = −1 − −1 −1 −1 −1 = (c) (c) (c) −1 (c) Cr r r r r r r r r r r r r r , (7) ˆ ˆ = r n where Cr denotes Cr evaluated at θ.In(6), π r denotes the s i=1 i vector of moments of the multivariate Bernoulli distribution up to order r, and pr denotes its√ sample counterpart. ∂πr (θ) = N(p − π ) In (7), r ∂θ , and r denotes the asymptotic covariance matrix of r r .Also, (c) (c) (c) r is the s × (s − q) orthogonal complement of r (i.e., it satisfies r r = 0). Mr is a family of statistics comprising M1,M2,...,Mn.InM1 only univariate information is used, in M2 only univariate and bivariate information is used, and so forth up to Mn,afull information statistic that is algebraically equal to Pearson’s X2 when ML estimation is used. For the chi-square approximation to the distribution of Mr to be accurate, only the expected frequencies of the moments of order min (2r, n) need to be large. Thus, the smaller the r used for testing, the more accurate the asymptotic approximation in small samples. Because most IRT models are identified from univariate and bivariate information, M2 is the statistic of choice within this family for testing IRT models. Note that only expected frequencies for sets of four variables are involved in the computation of M2. Actually, the moments of the multivariate Bernoulli distribution are simply marginal proba- bilities obtained by a linear transformation of the cell probabilities, π n = Tπ,or ⎛ ⎞ ⎛ ⎞ ˙ π˙ 1 T1 ⎜ ⎟ ⎜ ˙ ⎟ ⎜ π˙ 2 ⎟ ⎜ T2 ⎟ ⎜ . ⎟ = ⎜ . ⎟ π, ⎝ . ⎠ ⎝ . ⎠ ˙ ˙ π n Tn ˙ n where π r is the r -dimensional vector of rth order moments. This transformation is illustrated here for n = 3 variables: 120 PSYCHOMETRIKA
∂π(θ) ˙ ˙ Note that = T = T , where T = (T ... T ) is an s × C matrix, and = T T , r r r ∂θ r 1√ r r r r where is the asymptotic covariance matrix of N(p − π), = D − ππ , with D = diag(π). Thus, Mr may be written as = − ˆ ˆ − ˆ Mr N p π(θ) Tr Cr Tr p π(θ) . (8) ˆ Maydeu-Olivares√ and Joe (2005) showed that if the model is identified from π r , and if θ is an N-consistent and asymptotically normal estimator (not just for the ML estimator), Mr is asymptotically distributed as a chi-square with s − q degrees of freedom under the null hypoth- esis. This follows from the asymptotic normality of the vector of cell residuals and by noting, using (7), that
r = Cr r Cr . (9)
Also, note that M2 is simply a quadratic form in residual means and cross-products, since the = ˙ ˙ ˙ = = ˙ = = = elements of π 2 (π 1 π 2) are of the type πi Pr(Yi 1), and πij Pr(Yi 1,Yj 1).The degrees of freedom available for testing when using M2 are n(n + 1)/2 − q.
3.2. R1, R2, and the Family of Generalized Pearson Statistics
The statistic R1 proposed by Glas (1988) has a similar form to Mr as given in (8), = − ˆ ˆ − ˆ = −1 R1 N p π(θ) TR1CR1TR1 p π(θ) , CR1 TR1DTR1 , (10) ˆ ˆ with CR1 denoting CR1 evaluated at θ,butTR1 is an (n(n − 1) + 2) × C block diagonal matrix, = (0) (n) TR1 diag(TR1,...,TR1), and CR1 has a much simpler form than Cr . Furthermore,
R1 = R1CR1R1, (11) and R1 = TR1TR1. The relationship π R1 = TR1π is illustrated below for n = 3 items
, (12)