
Comparing Conditional and Marginal Direct Estimation of Subgroup Distributions


RESEARCH REPORT

January 2003 RR-03-02


Matthias von Davier

Research & Development Division, Princeton, NJ 08541


Educational Testing Service, Princeton, NJ

January 2003

Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from: Research Publications Office Mail Stop 10-R Educational Testing Service Princeton, NJ 08541

Abstract

Many large-scale assessment programs in education utilize “conditioning models” that incorporate both cognitive item responses and additional respondent background variables relevant for the population of interest. The set of respondent background variables serves as a predictor for the latent traits (proficiencies/abilities) and is used to obtain a conditional prior distribution for these traits. This is done by estimating a linear regression, assuming normality of the conditional trait distributions given the set of background variables. Multiple imputations, or plausible values, of trait parameter estimates are used on top of the conditioning model as a computationally convenient approach to generating consistent estimates of the trait distribution characteristics for subgroups in complex assessments. This report compares, on the basis of simulated and real data, the conditioning method with a recently proposed method of estimating subgroup distributions that assumes marginal normality. Study I presents simulated data examples where the marginal normality assumption leads to a model that produces appropriate estimates only if subgroup differences are small. In the presence of larger subgroup differences that cannot be fitted by the marginal normality assumption, however, the proposed method produces subgroup mean and variance estimates that differ strongly from the true values. Study II extends the findings on the marginal normality estimates to real data from large-scale assessment programs such as the National Assessment of Educational Progress (NAEP) and the National Adult Literacy Survey (NALS). The research presented in Study II shows differences between the two methods that are similar to the differences found in Study I. The consequences of relying upon the assumption of marginal normality in direct estimation are discussed.

Key words: conditioning models, large-scale assessments, NAEP, NALS, direct estimation


Acknowledgements

I would like to thank John Mazzeo for valuable comments on previous versions of this document, which improved both content and presentation. Any remaining errors are mine.


Introduction

Large-scale assessments such as the National Assessment of Educational Progress (NAEP) estimate the distribution of academic achievement for policy-relevant subgroups. Examples of estimates provided by large-scale assessments are means and percentages above cut points for the subgroups of interest. Many large-scale assessments such as NAEP use a sparse matrix sample design in which the number of cognitive items per respondent is kept relatively small. Using such designs allows the assessment to provide a broad coverage of the content domain while keeping the subjects’ testing time brief. This implies that individual ability estimates based on these kinds of assessments would have a large measurement error component, which has to be taken into account when reporting aggregate statistics for subgroups.

Direct estimation procedures, by which these estimates are obtained without the generation of individual scores, have been the approach most commonly taken to address this analysis challenge. Typically, these procedures have made use of background variables along with the cognitive item responses to ensure a higher degree of accuracy in estimating subgroup characteristics compared to only using the cognitive responses. Moreover, matrix sampling makes it impossible to compare subjects, or groups of subjects, based on their observed item responses. Therefore, large-scale assessments using matrix sampling rely on item response theory (IRT) models (Lord & Novick, 1968; Rasch, 1960).

To estimate the subgroup statistics of interest, ETS has employed since 1984 a particular approach of integrating achievement data (item responses) and background information, such as subgroup membership and additional student variables, into a hierarchical IRT model. This approach may be referred to as “direct estimation” because ETS estimates group statistics without the use of individual test scores. For the purposes of this report, I refer to this approach as ETS-DE.
The core features of the ETS-DE approach include:

1. A population model that assumes proficiencies are normally distributed conditional on a large number of background variables (grouping variables and other covariates). As a consequence, the marginal distribution (overall and for major reporting subgroups) is a mixture of normals.
2. The generation of a posterior latent trait distribution of proficiency for each individual in the sample, which is based on an estimate of (1); a separately estimated set of IRT parameters that are treated as fixed and known; the cognitive item responses; the respondents’ group membership; and other covariates. The mixture of these individual posterior distributions provides the estimate of the actual subgroup distributions.
3. The integration over posterior distributions of examinees and some of the model parameters (the parameters of the population model defined later) in (1) to obtain estimates of means, percentages above achievement levels, etc.
4. The use of normal approximations for the individual posteriors and a multiple-imputation approach (the so-called plausible values) to approximate the integration in (3). Imputations are used in conjunction with conditioning models based on both cognitive item responses and background information. The imputations are used as a mere convenience in order to simplify the integration in (3) and to provide data that can be used with standard tools by secondary analysts.

Cohen and Jiang (1999) propose an alternative approach to direct estimation of subpopulation characteristics (which I refer to as CJ-DE in this report) that does not utilize additional background variables. Cohen and Jiang assume that CJ-DE provides consistent subgroup estimates without the use of background variables. The core features of CJ-DE include:

1. A population model that assumes marginal normality, i.e., the ability distributions of all subgroups align in such a way that the joint distribution is normal.
2. A measurement model for the categorical grouping variables that assumes an underlying continuous latent variable whose joint distribution with proficiency is normal.
3. Use of a set of fixed/known IRT model parameters.
4. Item responses that are used together with a single grouping variable only (the one used for reporting); i.e., no additional covariates like other reporting variables or their interactions are used in the population model.
5. A direct calculational approach that bypasses the generation of individual posterior distributions and the generation of plausible values.
Both approaches, ETS-DE and CJ-DE, may be referred to as “direct estimation” because they estimate group statistics without the use of individual test scores. ETS-DE uses a more general model, which includes grouping variables as well as additional background information and no specific assumption regarding the marginal proficiency distribution. CJ-DE includes the assumption of marginal normality and ignores all the additional background information other than a single grouping variable. This report presents a comparison of ETS-DE and CJ-DE using simulated and real data.

The ETS-DE Methodology

For obtaining estimates of subpopulation distributions, ETS-DE involves a two-phase procedure that uses achievement data (item responses) and respondents’ background information. Key references for a more detailed outline of the conditioning model used by the ETS-DE method are Mislevy (1991), Mislevy, Beaton, Kaplan, and Sheehan (1992), and Thomas (1993, 2002). The two phases of the method, which sometimes are confused when discussed in secondary literature, are:

1. Estimation of parameters for the conditioning, or population, model.
2. Production of plausible values from individual posterior distributions given the model parameters, item responses, and background data.

The Conditioning Model

The method used for analyzing large-scale assessments at ETS uses both item responses and background information, sometimes numbering up to one hundred conditioning variables. Assume that there are k scales in the assessment and that each proficiency scale follows a unidimensional IRT model with the usual assumption of local independence given $\theta_k$, i.e.,

$$P(x_{1k}, \ldots, x_{J(k)k} \mid \theta_k) = \prod_{j=1}^{J(k)} P(x_{jk} \mid \theta_k), \qquad k = 1, \ldots, K. \tag{1}$$

The conditioning model combines the k-scale IRT model with a k-dimensional multivariate latent regression model in order to maximize the likelihood based on the posterior distribution of the latent trait $\theta = (\theta_1, \ldots, \theta_K)$:

$$L(\theta \mid x, y) = f(\theta \mid x, y) \propto \prod_{k=1}^{K} \prod_{j=1}^{J(k)} P(x_{jk} \mid \theta_k)\, \phi(\theta \mid y) \tag{2}$$

where the prior $\phi(\theta \mid y)$ is assumed to be normal with $\theta \mid y \sim N(\Gamma' y, \Sigma)$. The latent trait $\theta$ is unobserved and must be inferred from the observed item responses. The predictor y is a vector of individual values on a set of conditioning variables, $\Gamma$ is a matrix of regression weights, and $\Sigma$ is the residual variance-covariance matrix.

Note that at ETS, three software programs are currently available to carry out the estimation: NGROUP, BGROUP, and CGROUP. All implementations are based on the EM (expectation-maximization) algorithm. In the E-step, the posterior distribution of $\theta$ given item responses and conditional on the background variables is computed for each individual. These estimates are then used in the M-step to obtain the regression weights $\Gamma$ and the residual variance-covariance matrix $\Sigma$. The approaches implemented in NGROUP, BGROUP, and CGROUP differ with respect to how each carries out the E-step:

1. NGROUP assumes that the item likelihood $\prod_{j=1}^{J(k)} P(x_{jk} \mid \theta_k)$ can be approximated by a multivariate normal distribution and has limited use. (It may be used only for generating starting values for CGROUP or with extremely long scales.)
2. BGROUP does not assume any specific form of the item likelihood and uses a numerical quadrature in the E-step. To date, BGROUP has been shown to not be computationally feasible in more than two dimensions.
3. CGROUP is designed to be computationally feasible for more than two dimensions (it uses a Laplace approximation in the E-step). CGROUP is used most frequently in NAEP since most subject areas have multiple scales and require reporting on a composite.

In NAEP and other large-scale assessments analyzed at ETS, the estimation of the conditioning model for multivariate latent traits is carried out with BGROUP and CGROUP. This report uses CGROUP as the basis for evaluating the differences in direct estimation between the conditional normality approach (ETS-DE, as implemented in CGROUP) and the marginal normality approach (CJ-DE, as implemented in the AM software, see below) since CGROUP has been the program most frequently used for NAEP analysis purposes.
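To make the E-step concrete, the following is a minimal one-dimensional sketch of a quadrature-based posterior computation in the spirit of Equation (2). It is not the BGROUP implementation: the 2PL item response function, the item parameters, the regression weights, and the function names are all assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def item_likelihood(theta, responses, items):
    """Product over items of P(x_j | theta); a 2PL model is assumed here."""
    like = 1.0
    for x, (a, b) in zip(responses, items):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        like *= p if x == 1 else 1.0 - p
    return like

def posterior_moments(responses, items, prior_mean, prior_sd, n_points=81):
    """Posterior mean and variance of theta on an equally spaced grid."""
    lo = prior_mean - 6.0 * prior_sd
    step = 12.0 * prior_sd / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    # Unnormalized posterior: item likelihood times conditional normal prior
    weights = [item_likelihood(t, responses, items) *
               normal_pdf(t, prior_mean, prior_sd) for t in grid]
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, grid)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, grid)) / total
    return mean, var

# Hypothetical six-item scale: (discrimination, difficulty) pairs
items = [(1.0, -1.0), (1.2, -0.5), (0.8, 0.0), (1.0, 0.5), (1.1, 1.0), (0.9, 1.5)]
gamma = [0.0, 0.8]   # hypothetical regression weights (intercept, one covariate)
y = [1.0, 1.0]       # hypothetical background vector for one respondent
prior_mean = sum(g * v for g, v in zip(gamma, y))   # Gamma'y
post_mean, post_var = posterior_moments([1, 1, 1, 0, 1, 0], items, prior_mean, 1.0)
```

In an EM cycle, posterior moments of this kind would be accumulated over respondents and passed to the M-step regression; the sketch stops at the per-respondent computation.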

Plausible Values

The second phase of ETS-DE involves the production of plausible values, which provide a computationally tractable approach to integrating the posterior distributions of respondents to estimate the target statistics in subgroups of interest. Using plausible values provides a means for estimating the error in the estimates due to the proficiencies being latent (i.e., only indirectly observed) and the uncertainty about the regression parameters in the population model. In addition, plausible values provide a set of quantities that researchers can use with commercial statistical software to conduct a wide variety of secondary analyses.

The BGROUP, CGROUP, and NGROUP programs generate multiple imputations for each respondent based on the estimates of $\Gamma$ and $\Sigma$ and on the respondent’s background data y and item responses x. These plausible values are drawn from the k-dimensional posterior $N(E(\theta \mid y, x), \Sigma(\theta \mid y, x))$. In other words, the approach assumes that $\theta$ given y and x is approximately normally distributed. This conditional normality is a less restrictive assumption than the marginal normality assumption on which CJ-DE relies (Cohen & Jiang, 1999). The marginal distribution in the ETS-DE conditioning model is therefore rather flexible: it is not limited to the normal distribution but is a mixture of the conditional posterior distributions for the given set of item responses and background variables.

In order to carry the variability due to measurement and parameter estimation errors through all subsequent analyses, a number of plausible values have to be drawn for each respondent. As a rule of thumb, five to ten plausible values are drawn in most large-scale assessment analyses. These plausible values are aggregated to provide consistent estimates of group means, variances, and percentages above cut points for the subgroups defined by the reporting variables. Plausible values drawn from a population model that uses item responses and a large amount of background information are a valuable source for studying relationships between the proficiency scales and secondary variables.
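The aggregation step can be sketched as follows, assuming each respondent's posterior has already been summarized by a mean and a standard deviation. The posterior values are made up, simple random sampling is assumed for the sampling variance, and the combination uses the standard multiple-imputation rule (average within-imputation variance plus (1 + 1/M) times the between-imputation variance); none of this is taken from the ETS programs.

```python
import random
import statistics

def draw_plausible_values(post_mean, post_sd, n_draws, rng):
    """Draw plausible values from a normal approximation of one posterior."""
    return [rng.gauss(post_mean, post_sd) for _ in range(n_draws)]

def combine(estimates, sampling_variances):
    """Combine M per-imputation estimates into a point estimate and a standard error."""
    m = len(estimates)
    point = statistics.fmean(estimates)
    within = statistics.fmean(sampling_variances)   # average sampling variance
    between = statistics.variance(estimates)        # variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return point, total ** 0.5

rng = random.Random(2003)
# Hypothetical posterior means and SDs for 200 respondents in one subgroup
posteriors = [(rng.gauss(0.0, 0.8), 0.6) for _ in range(200)]
M = 5
pv = [draw_plausible_values(mu, sd, M, rng) for mu, sd in posteriors]

estimates, variances = [], []
for m in range(M):
    column = [row[m] for row in pv]                  # m-th plausible value of everyone
    estimates.append(statistics.fmean(column))
    variances.append(statistics.variance(column) / len(column))  # SRS assumed
group_mean, group_se = combine(estimates, variances)
```

Each statistic of interest is computed once per set of plausible values and only then combined; averaging the plausible values within a respondent first would understate the variability.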

The CJ-DE Methodology

Marginal-normality-based direct estimation, or CJ-DE (Cohen & Jiang, 1999), is a recently proposed method of estimating subgroup statistics based on a number of assumptions regarding (a) the marginal distribution of the latent trait and (b) its relation to a set of group indicator variables. The following studies use simulated and real data to compare the results from the ETS-DE and CJ-DE methods. The study of real data offers a determination as to whether CJ-DE yields estimates consistent with the results of more general models.

The software package AM (Cohen, 1998) implements the CJ-DE approach and is available for the Windows operating system. The software provides modules for CJ-DE and additional modules for univariate and composite regressions of the latent trait on a number of predictors, which is referred to as marginal maximum likelihood (MML) regression in the AM package. While the focus of this study is to compare CJ-DE with the ETS-DE conditioning approach, AM's MML regression was used to make sure that both software programs, AM and CGROUP, agree on the data structure.

AM provides two procedures for CJ-DE that were developed “…to consistently estimate subpopulation distributions when the groups are defined by values of a [nominal or ordinal variable]” (Cohen, 1998; Cohen & Jiang, 1999). The AM modules implementing CJ-DE are referred to as “Ordinal Table” (OT) and “Nominal Table” (NT) in the software, depending on the scale level of the grouping variable. Both the OT and the NT modules assume that the latent trait is marginally normally distributed (Cohen, 1998; Cohen & Jiang, 1999), so that the estimates of a finite mixture of subgroup distributions have to fit this assumption. In contrast, the conditional normality estimation (ETS-DE, which is used in NAEP's conditioning model and other large-scale assessment programs) does not rely on assuming a certain form of the trait parameters’ marginal distribution. The marginal distribution in the conditioning model is a mixture of normals. In addition, NAEP uses a multinomial distribution to approximate the marginal distribution of $\theta$ for item calibration (Yamamoto & Mazzeo, 1992), so that the item parameters used in the conditioning model are not based on a certain form of the marginal trait distribution.

Central Assumptions Driving CJ-DE

Cohen and Jiang (1999) propose to use the following approach in order to estimate subgroup statistics:

a) Assume a latent trait $\theta \sim N(\mu, \sigma)$. $\theta$ is usually unobserved and has to be inferred from the subject's responses to a number of items $(x_1, \ldots, x_k)$.

b) Assume that there are m groups, where the group membership $g_i$ indicates the maximum on a number of m unobserved variables $y_1, \ldots, y_m$. That means the group membership of individual i equals k ($g_i = k$) if, for the unobserved variables, $y_{ki} > y_{li}$ for all $l \neq k$.

c) Assume that for $k = 1, \ldots, m$, a linear relationship exists between $\theta$ and $y_k$ (i.e., $y_k = a_k + b_k \theta + e_k$) with mutually independent $e_k$. Given $\theta$, the error term $e_k$ is assumed to be N(0,1).

d) Assume that conditional on $\theta$, the $y_i$ are mutually independent, i.e.,

$$P(y_i \in A,\, y_j \in B \mid \theta) = P(y_i \in A \mid \theta) \cdot P(y_j \in B \mid \theta)$$


Assumption (a) forces the ability distribution to be marginally normal. Assumption (c) also is very strong and “may not be true but is a common and powerful one” (Cohen & Jiang, 1999). Assumptions (b) and (d) are used for defining the conditional density of group membership given $\theta$, where $U_k$ denotes the set of group indices other than k:

$$f(g = k \mid \theta) = \int f(y_k \mid \theta)\, P(y_k > y_j\ \forall j \in U_k \mid \theta)\, dy_k = \int f(y_k \mid \theta) \prod_{j \in U_k} P(y_k > y_j \mid \theta)\, dy_k \tag{4}$$

This conditional density, together with assumption (c) and the assumption of marginal normality (a), yields

$$f(\theta, g = k) = \phi(z_\theta) \int \phi(y_k - a_k - b_k\theta) \prod_{j \in U_k} P(y_k > y_j \mid \theta)\, dy_k \tag{5}$$

where $\phi$ denotes the normal density and $z_\theta$ is $\theta$ standardized with respect to the marginal distribution in (a). One more replacement uses the second part of assumption (c), namely that the error term e in the linear relation $y_j = a_j + b_j\theta + e$ is assumed to be N(0,1). This yields

$$P(y_k - y_j > 0 \mid \theta) = P(a_j + b_j\theta + e < y_k) = P(e < y_k - a_j - b_j\theta) = \Phi(y_k - a_j - b_j\theta) \tag{6}$$

where $\Phi$ denotes the normal distribution function. It follows that

$$f(g = k, \theta) = \phi(z_\theta) \int \phi(y_k - a_k - b_k\theta) \prod_{j \in U_k} \Phi(y_k - a_j - b_j\theta)\, dy_k \tag{7}$$

Finally, the conditional density of $\theta$ given group g = k is obtained by

$$f(\theta \mid g = k) = \frac{\phi(z_\theta) \int \phi(y_k - a_k - b_k\theta) \prod_{j \in U_k} \Phi(y_k - a_j - b_j\theta)\, dy_k}{\int \phi(z_\theta) \int \phi(y_k - a_k - b_k\theta) \prod_{j \in U_k} \Phi(y_k - a_j - b_j\theta)\, dy_k\, d\theta} \tag{8}$$

which is used to compute the conditional means and variances given subgroup g = k (see Cohen & Jiang, 1999). We may now define

$$E(\theta^n \mid g = k) = \int \theta^n f(\theta \mid g = k)\, d\theta \tag{9}$$

in order to obtain the conditional moments of $\theta$. The parameters $a_1, b_1, \ldots, a_m, b_m$ and $\mu, \sigma$ of $f(\theta \mid g = k)$ are estimated by maximizing the likelihood function based on the individual likelihood terms

$$L(\mu, \sigma, a_1, b_1, \ldots \mid x, g = k) = \int p(x \mid \theta)\, f(g = k, \theta)\, d\theta \tag{10}$$

for a subject in group g = k with observed responses $x = (x_1, \ldots, x_j)$, and $f(g = k, \theta)$ as defined by Equation (7).

The two approaches taken by ETS-DE and CJ-DE differ strongly with respect to the information incorporated in estimating subgroup characteristics. ETS-DE uses extensive background (conditioning) information including grouping variables in addition to the cognitive item responses. In contrast, CJ-DE includes only the grouping variable together with the item responses but draws on a number of strong assumptions regarding the shape of the marginal ability distribution and the relation between $\theta$ and the grouping variable. The following section presents examples of the differences found between both approaches with respect to recovering known subgroup characteristics of simulated data.
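As a numerical check on the integral in Equation (7), the sketch below evaluates it without the $\phi(z_\theta)$ factor, i.e., the probability of membership in group k given $\theta$, by trapezoid quadrature over $y_k$. The values of a and b and the function names are hypothetical; this is not the AM implementation. Because exactly one of the latent variables is the maximum, the probabilities over the m groups should sum to one for any $\theta$.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def group_prob_given_theta(k, theta, a, b, n_points=2001, half_width=10.0):
    """Trapezoid approximation of the y_k integral in Equation (7),
    omitting the phi(z_theta) factor."""
    center = a[k] + b[k] * theta           # conditional mean of y_k
    step = 2.0 * half_width / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        y = center - half_width + i * step
        value = phi(y - center)            # density of y_k given theta
        for j in range(len(a)):
            if j != k:                     # j runs over U_k
                value *= Phi(y - a[j] - b[j] * theta)
        weight = 0.5 if i in (0, n_points - 1) else 1.0
        total += weight * value
    return total * step

# Hypothetical parameters for m = 3 groups
a = [0.0, 0.3, -0.2]
b = [0.5, -0.4, 0.1]
probs = [group_prob_given_theta(k, 0.7, a, b) for k in range(3)]
```

With identical parameters for all groups the three probabilities are equal, which gives a second quick sanity check.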

Study I: Simulation Results

The examples presented in this section compare ETS-DE and CJ-DE based on simulated data where each simulee responds to a limited set of test items and is additionally characterized by a small set of background variables. The simulated data sets resemble some characteristics of NAEP, such as the number of items per subscale. Short subscales in NAEP typically consist of an average of 6 items across booklets; long subscales consist of approximately 12 items. The number of subscales or dimensionality of the latent trait, k=3 in the simulations, also is found in NAEP. The number of background variables in the simulation is smaller than what is typically used in NAEP’s conditioning approach. While NAEP’s conditioning model may include up to hundreds of background variables, the simulated data used in the present study limit the number of background variables to the three made-up variables GROUP, SES, and GENDER. Four distinct data sets were simulated following a 2 x 2 design, varying:

1. The number of items per subscale (6 versus 12 items).
2. The dependency of the latent traits on the background variables: Setup 1 had a strong dependency leading to multimodal marginal trait distributions, while Setup 2 had a weak dependency resulting in unimodal, but possibly platykurtic, marginals.

Using two different linear models created the two levels of dependency of the latent traits on the background variables. Two different sets of regression weights were used to generate the three-dimensional trait parameters $(\theta_1, \theta_2, \theta_3)$. Each latent trait value $\theta_i$ for i in 1, 2, 3 was generated based on a linear model

$$\theta_i = \beta_1 y_{\mathrm{GENDER},i} + \beta_2 y_{\mathrm{SES},i} + \beta_3 y_{\mathrm{GROUP},i} + e_i \tag{11}$$

incorporating fictitious GENDER, SES, and GROUP effects together with normally distributed residuals $e_i$. GENDER, SES, and GROUP accounted for a varying percentage of variance for the three trait components (see regression results below). The trait variable $(\theta_1, \theta_2, \theta_3)$ and its component-wise linear relation to GENDER, SES, and GROUP were unaffected by additional fictitious design variables WEIGHT, STRATA, and CLUSTER. The latter variables have been included to check whether zero correlations are recovered in the same way by the regression modules of CJ-DE and ETS-DE.

Setup 1, which includes one bimodal and one multimodal marginal, was included to examine how CJ-DE performs in situations where its marginal distribution assumptions are clearly violated. Setup 2 represents a more typical situation in which the marginal distributions are unimodal but more platykurtic than the normal (see Figure 1). Data were generated for the six-item test for both Setups 1 and 2; the item parameters used to generate the data are given in Appendix A and B. However, only the six-item test is presented for Setup 2, since the pattern of results obtained for the two test lengths was similar in Setup 1.
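How a strong group effect produces a non-normal marginal can be illustrated with a small simulation. This is only a sketch and not the report's data generator: it uses a single two-category grouping variable, and the group means (-0.85 and 0.85) and within-group standard deviation (0.55) are assumed values rounded from the TRUE columns of Table 6a for Scale 1.

```python
import random
import statistics

def simulate_trait(n, group_means, within_sd, rng):
    """Draw a group label and a normally distributed trait value per simulee."""
    thetas, groups = [], []
    for _ in range(n):
        g = rng.randrange(len(group_means))   # equal group probabilities assumed
        groups.append(g)
        thetas.append(rng.gauss(group_means[g], within_sd))
    return thetas, groups

rng = random.Random(1)
thetas, groups = simulate_trait(20000, [-0.85, 0.85], 0.55, rng)

# The marginal SD mixes within-group and between-group variability:
# sqrt(0.55**2 + 0.85**2) is about 1.01
marginal_sd = statistics.pstdev(thetas)
first_group = [t for t, g in zip(thetas, groups) if g == 0]
within_group_sd = statistics.pstdev(first_group)
```

The marginal standard deviation is close to one although each subgroup has a standard deviation of about 0.55, which is the pattern a single marginal normal cannot represent.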


Figure 1. Histograms of marginal distributions for Setups 1 (left) and 2 (right).


Figure 1 shows histograms with integrated density plots for Setup 1 (left column) and Setup 2 (right column), crossed by the three simulated latent traits (top to bottom row). Setup 1 results in a clearly bimodal marginal for Dimension 1, whereas in Setup 2 the marginals are platykurtic or skewed, but not obviously multimodal. In Setup 1, the proportion of variance of $\theta$ accounted for by the fictitious GROUP and GENDER variables produced bimodal (for GENDER) or multimodal marginal distributions. In Setup 2, the proportion of variance explained by the fictitious conditioning variables GENDER, SES, and GROUP was reduced, so that the resulting marginal distributions are unimodal but platykurtic.

The marginal distribution of $\theta_1$ is a mixture of two subpopulations where the mean difference between subgroups is due to the fictitious GENDER variable. $\theta_2$ is a mixture of five normals with common variance but slightly different means due to the five-category variable GROUP.

The third variable, $\theta_3$, can be viewed as the “control dimension” in both setups (i.e., the subgroup distributions are all identical as there is no effect of the conditioning variables on latent trait $\theta_3$). Setup 2 can be viewed as a less extreme, non-bimodal version of Setup 1 with higher intercorrelations between the variables. The data generated by both setups were analyzed with the ETS-DE and CJ-DE approaches to direct estimation. The results of both methods were compared to the true values obtained from analyzing the actual values used for generating the item responses. Tables 1a and 1b show the marginal correlations obtained from analyzing the simulees’ generating values, both for the 6- and the 12-item data sets.

Table 1a
Marginal Distributions in Setup 1, Correlations Between Dimensions

        [,1]        [,2]        [,3]
[1,]    1.0000000   0.3985606   0.1620800
[2,]    0.3985606   1.0000000   0.1832677
[3,]   -0.1620800   0.1832677   1.0000000

Table 1b
Marginal Distributions in Setup 2, Correlations Between Dimensions

        [,1]        [,2]        [,3]
[1,]    1.0000000   0.6499676   0.5401054
[2,]    0.6499676   1.0000000   0.7718106
[3,]    0.5401054   0.7718106   1.0000000

The following sections present results based on the generating true values on the one hand and the two approaches to direct estimation of subgroup statistics on the other. To clarify that the expected differences between CJ-DE and ETS-DE are the result of differences in model assumptions, the agreement of both software packages on the correlational structure of the simulated data was assessed. To check this, the recovery of regression weights and the residual variance covariance matrix of both AM (the software used for CJ-DE) and CGROUP (the software for ETS-DE) was analyzed.

Regression Module Comparison

The regression module comparison is a check of agreement between both programs using the same data. The regression of the three-dimensional latent trait on the variables INTER (explicit intercept), GENDER, SES, GROUP, STRATA, and CLUSTER was compared. The results in Table 2 are obtained by analyzing the generating vectors (the TRUE columns in the tables below) with standard regression procedures. The entries in the ETS-DE and MML columns stem from analyzing the item response data with the conditioning model incorporated in ETS-DE and with AM’s MML regression module. The MML regression module, however, is different from the direct estimation proposed by Cohen and Jiang (1999). The MML regression module closely resembles the regression part of the ETS-DE approach in the one-dimensional case and consequently should yield similar results when used with the same set of background variables. MML regression does not include the marginal normality assumptions used by CJ-DE.

Table 2 shows the estimates of the linear model for the three-dimensional variable. The estimates show that the GENDER variable has the largest effect on $\theta_1$, whereas the GROUP variable has the highest impact on $\theta_2$, and the effects for $\theta_3$ are close to zero for all methods, as expected.

Table 2
Regression Coefficients for the Six-item Simulated Data Set, Setup 1

            Scale 1                   Scale 2                   Scale 3
Effect      TRUE    ETS-DE  MML       TRUE    ETS-DE  MML       TRUE    ETS-DE  MML
Constant   -3.460  -3.530  -3.480    -2.970  -2.590  -2.570     0.040   0.070   0.070
CLUSTER    -0.030  -0.050  -0.050     0.000  -0.060  -0.060    -0.040  -0.050  -0.050
STRATA     -0.010  -0.010  -0.010     0.000   0.000   0.000     0.010  -0.010  -0.010
GENDER      1.680   1.760   1.740     0.430   0.340   0.340     0.000   0.020   0.030
GROUP       0.190   0.220   0.220     0.600   0.640   0.640     0.060   0.050   0.050
SES         0.280   0.280   0.280     0.270   0.210   0.210    -0.020   0.020   0.020

MML regression and the regression that is part of ETS-DE agree closely on the estimates for this setup. Both ETS-DE and MML regression produce estimates close to those in the TRUE columns, even though the number of items per scale (six) is comparatively small (i.e., the inference on $\theta$ used by ETS-DE and MML regression is subject to a rather large measurement error). Table 3 shows the respective results based on the 12-item data set.

Table 3
Regression Coefficients for the 12-item Simulated Data Set, Setup 1

            Scale 1                   Scale 2                   Scale 3
Effect      TRUE    ETS-DE  MML       TRUE    ETS-DE  MML       TRUE    ETS-DE  MML
Constant   -3.560  -3.540  -3.533    -2.940  -2.890  -2.883    -0.070  -0.030  -0.038
CLUSTER     0.000   0.000   0.004    -0.010  -0.020  -0.016     0.040   0.040   0.043
STRATA      0.000  -0.010  -0.008     0.000   0.000  -0.003    -0.010   0.000   0.000
GENDER      1.700   1.720   1.718     0.470   0.510   0.505    -0.060  -0.080  -0.085
GROUP       0.140   0.140   0.136     0.610   0.620   0.617    -0.050  -0.050  -0.047
SES         0.300   0.290   0.290     0.220   0.180   0.183     0.080   0.040   0.045

MML regression and ETS-DE recover the regression weights more closely if the number of items is doubled. Note that both methods also agree with the TRUE columns for Scale 3, where there is no impact on the latent variable, and as expected, all three columns show values close to zero. Table 4 shows the residual correlations and variances as they were obtained using the true values from the simulations as well as the corresponding values produced by the ETS-DE regression and MML regression algorithms.

Table 4
Residual Correlations With Variances in the Diagonal, Six-item Simulated Data Set

         TRUE                      ETS-DE                    MML regression
Scale    1       2       3         1       2       3         1       2       3
1        0.188  -0.025   0.203     0.199  -0.033   0.246     0.188  -0.025   0.249
2                0.194  -0.214             0.214  -0.289             0.209  -0.293
3                        0.996                     1.127                     1.155

Table 5 shows the results for the 12-item data set. ETS-DE and MML regression reproduce the residual correlations and variances in a very similar way, both for the 6-item and the 12-item data set in Setup 1.

Table 5
Residual Correlations With Variances in the Diagonal, 12-item Simulated Data Set

         TRUE                      ETS-DE                    MML regression
Scale    1       2       3         1       2       3         1       2       3
1        0.176  -0.036   0.186     0.183  -0.125   0.227     0.183  -0.133   0.231
2                0.196  -0.218             0.167  -0.143             0.167  -0.144
3                        0.991                     0.960                     0.971

Subgroup Distribution Recovery

ETS-DE and CJ-DE implement two very different approaches to direct estimation. While ETS-DE assumes that the latent trait is conditionally normal given a vector of background data, CJ-DE assumes that the marginal latent distribution is normal, regardless of potentially large subgroup differences in complex samples. These two approaches are compared in this section with respect to the recovery of subgroup distributions. This analysis uses the exemplary data previously introduced as Setup 1 (6 and 12 items) and Setup 2 (6 items).

As shown in the previous section, the ETS-DE regression and the MML regression as implemented in the software packages CGROUP and AM agree on these data sets and reproduce the true regression parameters in a very similar way. In contrast, ETS-DE and CJ-DE incorporate different assumptions regarding the marginal distribution of the latent traits. Recall that the marginal distributions for Setup 1 are bimodal for Scale 1 and multimodal for Scale 2, because the background variable GENDER (two subgroups) explains a major part of the variance for Scale 1 whereas the background variable GROUP (five subgroups) has a strong impact on Scale 2. It can be expected that the marginal normality assumption of CJ-DE, which is violated for Scales 1 and 2, will result in differences between subgroup mean estimates of ETS-DE and the true values on the one hand, and CJ-DE on the other hand.

Table 6a
Subgroup Means and Standard Deviations for the Six-item Data Set, Setup 1

                 Mean                                     Standard deviation
Scale  Group     TRUE     ETS-DE          CJ-DE           TRUE    ETS-DE  CJ-DE
1      ALL      -0.004    0.027 (.039)    -/-             1.001   1.040   -/-
       Female   -0.849   -0.852 (.030)   -0.466 (.074)    0.548   0.565   0.947
       Male      0.840    0.907 (.040)    0.343 (.116)    0.527   0.545   0.936
2      ALL      -0.002   -0.032 (.047)    -/-             0.997   0.960   -/-
       Female   -0.213   -0.205 (.057)   -0.193 (.088)    0.991   0.968   0.972
       Male      0.208    0.140 (.058)    0.135 (.113)    0.957   0.919   0.972
3      ALL       0.014   -0.012 (.045)    -/-             1.003   1.065   -/-
       Female    0.000   -0.024 (.057)   -0.032 (.058)    1.029   1.083   1.076
       Male      0.027    0.000 (.064)   -0.003 (.059)    0.975   1.045   1.08

Note. The results of CJ-DE direct estimation reported here are ones closest to the true values from one out of four trials with AM’s “slog through” option. Rows with large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

Table 6a shows the TRUE values for the six-item data set in Setup 1 (i.e., the values obtained by analyzing the generating data) as well as the subgroup means and standard deviations as estimated by ETS-DE and CJ-DE. In addition, the values in parentheses next to the subgroup mean estimates show the associated standard errors, computed either with Rubin’s imputation formula in the case of ETS-DE or as given by the Taylor series estimates in the case of CJ-DE. The Taylor series estimates are given by the CJ-DE direct estimation procedure and are recommended by Cohen and Jiang (1999) as yielding appropriate estimates for complex samples. Here, the Taylor series standard error estimates for Scales 1 and 2 are larger than the imputation-based estimates.

Table 6b gives a more condensed overview of the same results. Instead of individual subgroup means, the table gives standardized mean differences

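$$Z_{\text{ETS-DE}} = \frac{M_{\text{ETS-DE}} - \text{true}}{se(D_{\text{ETS-DE}})} \qquad (12)$$

$$Z_{\text{CJ-DE}} = \frac{M_{\text{CJ-DE}} - \text{true}}{se(D_{\text{CJ-DE}})} \qquad (13)$$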

as well as the variance ratio of estimated variance divided by true variance. Here se(D) stands for the standard error of the difference. Assuming the TRUE values to be fixed target statistics, se(D) equals the standard error associated with the respective estimate given either by ETS-DE or CJ-DE. If the difference between the two estimates of a certain subgroup mean is standardized, se(D) equals the square root of the sum of the squared standard errors of the two statistics. The standardized mean differences between CJ-DE and TRUE should be ~N(0,1) if the CJ-DE model holds. The variance ratios given in Table 6b should be close to 1 if the approach recovers the values in the TRUE column.
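The two summary statistics can be reproduced from the entries of Table 6a. The sketch below uses the Scale 1 values for the Female subgroup; the function names are illustrative, and the optional second standard error covers the case described above in which two estimates (rather than an estimate and a fixed target) are compared.

```python
from math import sqrt

def standardized_difference(estimate, target, se_estimate, se_target=0.0):
    """Z = (estimate - target) / se(D), as in Equations 12 and 13.

    With se_target = 0 the target is treated as a fixed value; otherwise
    se(D) is the root of the sum of the two squared standard errors.
    """
    se_d = sqrt(se_estimate ** 2 + se_target ** 2)
    return (estimate - target) / se_d

def variance_ratio(sd_estimate, sd_target):
    """Estimated variance divided by target variance, from standard deviations."""
    return (sd_estimate / sd_target) ** 2

# Scale 1, Female subgroup, six-item data set, Setup 1 (values from Table 6a)
true_mean, true_sd = -0.849, 0.548
cj_mean, cj_se, cj_sd = -0.466, 0.074, 0.947

z_cj = standardized_difference(cj_mean, true_mean, cj_se)
ratio_cj = variance_ratio(cj_sd, true_sd)
print(round(z_cj, 3), round(ratio_cj, 3))   # 5.176 2.986
```

Both values agree with the CJ-DE entries for this subgroup in Table 6b.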

Table 6b Subgroup Standardized Mean Differences and Variance Ratios for the Six-item Data Set, Setup 1

                 Standardized mean difference    Variance ratio
Scale  Group     TRUE    ETS-DE   CJ-DE          TRUE    ETS-DE  CJ-DE
1      Female    0.000   –0.099   5.176          1.000   1.063   2.986
1      Male      0.000   1.693    –4.284         1.000   1.069   3.154
2      Female    0.000   0.140    0.227          1.000   0.954   0.962
2      Male      0.000   –1.178   –0.646         1.000   0.922   1.032
3      Female    0.000   –0.419   –0.552         1.000   1.108   1.093
3      Male      0.000   –0.425   –0.508         1.000   1.149   1.218

Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

The differences from the expected values are as hypothesized; CJ-DE shows large differences for Scale 1, for which the marginal normality assumption does not hold. The absolute standardized mean differences are 5.176 for the female subgroup and 4.284 for the male subgroup. The variance ratios indicate that CJ-DE overestimates the subgroup variances by a factor of ~3 for Scale 1. Table 7a gives the mean and standard deviation for the 12-item data set in Setup 1 while Table 7b gives the standardized mean differences and variance ratio. Table 7b enables a direct comparison against the values 0 (zero) for the expected mean differences and 1 (one) for the expected variance ratio if the models behind the approaches fit the data.
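The gap between the marginal spread and the subgroup spread on Scale 1 can be made concrete with the law of total variance. The sketch below takes the TRUE Female and Male means and standard deviations for Scale 1 from Table 6a and treats the margin as a two-component normal mixture. The equal subgroup weights and the normal shape of each subgroup distribution are assumptions made here for illustration; the helper names are not from the report.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sd):
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

def mixture_moments(weights, means, sds):
    """Mean and SD of a finite mixture via the law of total variance."""
    total_mean = sum(w * m for w, m in zip(weights, means))
    within = sum(w * s ** 2 for w, s in zip(weights, sds))
    between = sum(w * (m - total_mean) ** 2 for w, m in zip(weights, means))
    return total_mean, sqrt(within + between)

def count_modes(weights, means, sds, lo=-4.0, hi=4.0, steps=8000):
    """Count local maxima of the mixture density on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    dens = [sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))
            for x in xs]
    return sum(1 for i in range(1, steps) if dens[i - 1] < dens[i] > dens[i + 1])

weights = [0.5, 0.5]       # assumption: equally sized subgroups
means = [-0.849, 0.840]    # TRUE subgroup means, Scale 1, Table 6a
sds = [0.548, 0.527]       # TRUE subgroup standard deviations, Scale 1, Table 6a

marg_mean, marg_sd = mixture_moments(weights, means, sds)
n_modes = count_modes(weights, means, sds)
print(round(marg_sd, 3), n_modes)
```

Under these assumptions the implied marginal standard deviation is close to the value of about 1 reported in the ALL row, although each subgroup has a standard deviation of only about 0.55, and the mixture density has two modes. A single normal distribution with standard deviation near 1 therefore cannot describe the margin and the subgroups at the same time.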

Table 7a Subgroup Mean and Standard Deviation for GENDER for the 12-item Data Set, Setup 1

                 Mean                                   Standard deviation
Scale  Group     TRUE     ETS-DE         CJ-DE          TRUE    ETS-DE  CJ-DE
1      ALL       0.004    .016(.035)     -/-            1.008   0.995   -/-
1      Female    –0.854   –.832(.029)    –.707(.072)    0.529   0.515   0.858
1      Male      0.862    .865(.031)     .704(.063)     0.527   0.524   0.746
2      ALL       0.000    .008(.039)     -/-            1.002   0.991   -/-
2      Female    –0.234   –.252(.049)    –.255(.089)    0.978   0.959   1.003
2      Male      0.234    .269(.050)     .248(.124)     0.971   0.952   1.003
3      ALL       –0.004   –.003(.036)    -/-            0.996   0.993   -/-
3      Female    0.031    .034(.047)     .045(.050)     0.989   0.967   0.988
3      Male      –0.041   –.042(.052)    –.040(.047)    1.001   0.981   0.988

Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

Table 7b Subgroup Standardized Mean Differences and Variance Ratios for the 12-item Data Set, Setup 1

                 Standardized mean difference    Variance ratio
Scale  Group     TRUE    ETS-DE   CJ-DE          TRUE    ETS-DE  CJ-DE
1      Female    0.000   0.760    2.042          1.000   0.948   2.631
1      Male      0.000   0.097    –2.508         1.000   0.989   2.004
2      Female    0.000   –0.367   –0.236         1.000   0.962   1.052
2      Male      0.000   0.699    0.113          1.000   0.961   1.067
3      Female    0.000   0.063    0.280          1.000   0.956   0.998
3      Male      0.000   –0.020   0.021          1.000   0.960   0.974

Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

In the 12-item case, the current CJ-DE implementation does not converge with the default settings for Scale 1; it needs to be put into the “slog through” mode, and the number of iterations needs to be increased from 50 to 500. ETS-DE also reproduces the subgroup means and standard deviations accurately for the 12-item data set. As in the six-item case, the differences between subgroup standard deviations are not reproduced by CJ-DE.

The second reporting variable with a strong impact on one of the latent trait components is GROUP, a variable with the categories 1 to 5. Table 8a shows the standardized mean differences and variance ratios for the six-item data from Setup 1. As in the above analysis with the grouping variable GENDER, the algorithm for CJ-DE needs to be put in the “slog through” mode in AM to converge in this example. The marginal normality assumption does not hold for Scales 1 and 2, the first and second components of the three-dimensional latent trait in the example data. It can be expected that CJ-DE, using the marginal normality assumption, will not match the true subgroup means and variances as closely as ETS-DE does in the analysis of the GROUP reporting variable.

The standardized mean differences for CJ-DE in Scale 2 indicate two subgroups for which CJ-DE estimates deviate significantly from the true values. For Group 1, the absolute standardized mean difference between CJ-DE and the true value is 4.45, and for Group 5 it is 5.62. In contrast, the standardized mean differences between ETS-DE and the true values are all in the expected range. Table 8a also shows that CJ-DE overestimates the subgroup variances for all subgroups on Scale 2 by a factor of between 1.77 and 2.54.

Table 8a Subgroup Standardized Mean Differences and Variance Ratios for GROUP for the Six-item Data Set, Setup 1

                   Standardized mean differences    Variance ratio
Scale  Subgroups   TRUE     ETS-DE    CJ-DE         TRUE     ETS-DE   CJ-DE
1      Group 1     0.0000   0.1928    0.7942        1.0000   1.0659   1.2226
1      Group 2     0.0000   0.6372    0.0748        1.0000   0.9503   1.2346
1      Group 3     0.0000   0.5593    –0.0784       1.0000   1.0092   1.3660
1      Group 4     0.0000   0.8930    0.4622        1.0000   1.0661   1.4910
1      Group 5     0.0000   0.6808    0.0856        1.0000   1.0927   1.2583
2      Group 1     0.0000   0.4218    4.4504        1.0000   1.0483   2.1302
2      Group 2     0.0000   0.1192    –1.0500       1.0000   0.8695   2.0350
2      Group 3     0.0000   0.0802    –1.0467       1.0000   0.9320   2.1604
2      Group 4     0.0000   –0.5602   –1.6938       1.0000   1.1380   2.5416
2      Group 5     0.0000   –1.7368   –5.6242       1.0000   0.9781   1.7734
3      Group 1     0.0000   0.5929    0.4584        1.0000   1.0153   0.9944
3      Group 2     0.0000   –0.7004   –0.3818       1.0000   1.0841   1.1513
3      Group 3     0.0000   –0.4016   –0.0861       1.0000   1.0557   1.0988
3      Group 4     0.0000   –1.5818   –0.7226       1.0000   1.0774   1.1999
3      Group 5     0.0000   –1.0241   –0.6721       1.0000   1.1177   1.1141

Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

Simulation Results: Setup 2

Truly multimodal distributions are rarely found in real data, even though results from large-scale assessments show variables that account for large differences in average achievement between subgroups. Setup 2 was designed to be a less extreme version of the same model used for Setup 1 and was made more realistic by allowing larger between-scale correlations as they can be found in many large-scale assessment programs. Analyses like the ones presented for Setup 1 were carried out with the six-item data set in Setup 2 in order to obtain additional results from this less extreme case. Table 8b shows the comparison of CJ-DE MML regression and ETS-DE regression estimates with the regression coefficients based on the true values.

Table 8b Regression Coefficients for the Six-item Simulated Data Set, Setup 2

          Scale 1                     Scale 2                     Scale 3
Effect    TRUE     ETS-DE   MML       TRUE     ETS-DE   MML       TRUE     ETS-DE   MML
INTER     -3.186   -3.162   -3.100    -2.953   -2.996   -2.973    0.123    0.138    0.132
CLUSTER   -0.033   -0.021   -0.018    0.000    -0.013   -0.014    -0.040   -0.080   -0.080
STRATA    -0.016   -0.020   -0.020    -0.005   -0.002   -0.002    -0.010   0.000    0.000
GENDER    1.203    1.199    1.178     0.495    0.484    0.482     -0.014   0.015    0.018
GROUP     0.285    0.267    0.258     0.537    0.558    0.556     0.053    0.094    0.096
SES       0.394    0.382    0.373     0.313    0.330    0.327     0.000    -0.023   -0.024

Both ETS-DE and MML regression reproduce the regression weights based on the true values closely for this data set. This indicates that AM’s MML regression and ETS-DE agree on the underlying relationship between the reporting variables and the latent trait variables, so that the basis on which CJ-DE marginal direct estimation and ETS-DE’s conditioning model are compared is the same. Table 9 shows the residual correlations and variances for the true residuals and for the estimates as obtained by MML regression and ETS-DE.

Table 9 Residual Correlations With Variances in the Diagonal, Six-Item Data Set, Setup 2

         TRUE                       ETS-DE                     MML
Scale    1       2        3         1       2        3         1       2        3
1        0.410   –0.077   0.200     0.437   –0.102   0.330     0.412   –0.104   0.332
2                0.297    –0.228            0.349    –0.109            0.345    –0.106
3                         0.996                      1.036                      1.069

The two approaches reproduce the residual covariance matrix in a very similar way. The differences between ETS-DE and CJ-DE are even smaller than the small differences of the two approaches to the true values. The results on the regression part of ETS-DE and the MML regression module of AM give no indication that the basic relationships between the three latent traits and the subgroup variables are represented differently by the two approaches to direct estimation.

For the reporting variable GENDER in Setup 2, Table 10 shows the respective standardized mean differences and variance ratios.

Table 10 Standardized Mean Differences and Variance Ratios for GENDER for the Six-item Data Set, Setup 2

                 Standardized mean difference    Variance ratio
Scale  Group     TRUE    ETS-DE   CJ-DE          TRUE    ETS-DE  CJ-DE
1      Female    0.000   0.544    1.088          1.000   1.082   1.218
1      Male      0.000   –0.450   –1.517         1.000   0.969   1.019
2      Female    0.000   0.069    0.057          1.000   1.098   1.126
2      Male      0.000   –0.045   –0.277         1.000   1.084   1.117
3      Female    0.000   –0.641   –0.281         1.000   1.026   1.044
3      Male      0.000   –0.164   –0.015         1.000   0.990   1.086

The results for Setup 2 show smaller but noticeable differences between the estimates of CJ-DE on the one hand, and the true values and ETS-DE on the other hand. The marginal distributions in Setup 2 deviate to a lesser extent from the CJ-DE normality assumption, so the subgroup estimates of CJ-DE seem to be affected less by this moderate model violation than in Setup 1.

Table 11 shows the results for the reporting variable GROUP, which again could not be estimated by CJ-DE using the default options and which is the one with the strongest effect on Scale 2. As expected, the standardized mean differences between the true values and CJ-DE for Scale 2 are larger than the differences between the true values and ETS-DE. In addition, the variance ratios for CJ-DE range from about 1.48 to 1.92 for Scale 2, indicating that CJ-DE overestimates subgroup variances here.

The results for both reporting variables GENDER and GROUP are similar with respect to where CJ-DE deviates from the TRUE values and the ETS-DE approach: The GENDER effect is largest for Scale 1, where CJ-DE deviates most when reporting GENDER subgroup means. Similarly, for Scale 2, where the GROUP reporting variable has a strong effect on Latent Trait 2, CJ-DE deviates most when reporting on the GROUP subgroups.

Table 11 Standardized Mean Differences and Variance Ratios for GROUP for the Six-item Data Set, Setup 2

                   Standardized mean difference     Variance ratio
Scale  Subgroups   TRUE     ETS-DE    CJ-DE         TRUE     ETS-DE   CJ-DE
1      Group 1     0.0000   0.0349    0.6512        1.0000   0.9608   0.9649
1      Group 2     0.0000   –0.0490   –0.2745       1.0000   1.1590   1.2968
1      Group 3     0.0000   0.3333    –0.0897       1.0000   0.8876   0.8820
1      Group 4     0.0000   0.1477    –0.5000       1.0000   0.9061   0.9119
1      Group 5     0.0000   –0.2414   –0.3534       1.0000   1.0628   1.1324
2      Group 1     0.0000   –0.5567   3.2474        1.0000   1.1540   1.9239
2      Group 2     0.0000   0.0300    0.0300        1.0000   1.0930   1.7859
2      Group 3     0.0000   0.4177    –1.0127       1.0000   1.1113   1.6488
2      Group 4     0.0000   0.6000    –1.2143       1.0000   1.2346   1.7322
2      Group 5     0.0000   –0.3088   –3.4118       1.0000   1.1239   1.4805
3      Group 1     0.0000   –0.3298   –0.6702       1.0000   0.9496   0.9625
3      Group 2     0.0000   0.1038    0.6321        1.0000   1.1183   1.1314
3      Group 3     0.0000   0.1954    0.3448        1.0000   1.0988   1.1816
3      Group 4     0.0000   –0.7143   –0.5048       1.0000   1.0020   1.0563
3      Group 5     0.0000   –0.8095   –0.4762       1.0000   0.8931   1.0000

Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

Conclusions: Study I

In the examples presented above, AM’s MML regression module yields results similar to those obtained with the regression part of the ETS-DE methodology. Regression coefficients, residual correlations, and variances are reproduced in much the same way as ETS-DE recovers these parameters. These results cannot be generalized, as they are currently based on only a few simulated data sets. Nevertheless, all examples presented here indicate that both software programs agree on the basic correlational relationships in the data as given by the AM MML regression module and ETS-DE’s regression estimates.

In contrast to the close agreement of ETS-DE regression and AM’s MML regression, the AM module for CJ-DE (the marginal normality direct estimation approach) diverges from the ETS-DE results and the true values if the marginal distributions are non-normal. The exemplary data sets were constructed and simulated in a way to show where discrepancies can be expected, and the results so far match the expectations. Setup 1 was constructed to study how CJ-DE performs if marginal distributions are bimodal or multimodal, and CJ-DE did not converge with the default settings for the scales that violated the assumptions used in the marginal direct estimation approach. Setup 2 represents a “milder” version of model violation for CJ-DE and shows that even under this setup, where the multimodality of the marginal is less obvious, the CJ-DE estimates differ from the true values and from the values produced by ETS-DE, the conditional normality direct estimation.

Assuming that the latent trait is normally distributed across groups may lead to an inappropriate model because of strong monotonicity assumptions in the IRT model (note that IRT serves as the basis for both ETS-DE and CJ-DE). For the 1PL and 2PL IRT models as well as the (generalized) partial credit models, a simple statistic of the observed responses, the weighted sum of scores, is sufficient for estimating the latent trait. Even for the 3PL, the monotonicity of the success probability $P(X=1 \mid \theta)$ in the latent trait and in the item parameters ensures a relationship between the observed distribution of the raw scores and the unobserved (but not arbitrary!) distribution of the latent trait. As an example, if a test is administered to two different samples that differ a lot in their ability distributions (e.g., a reading test taken by both a group of kindergarten students and a group of third graders), it seems unreasonable to assume a joint normal distribution. A model assuming marginal normality would force both distributions under one mode and produce biased estimates of differences between these two groups and other groups defined by additional reporting variables.

The simulated data examples revealed effects of CJ-DE in the presence of non-normal marginal distributions: systematic deviations from the true values in the mean and in the variance estimates. In contrast, no indication of systematic differences between the true values and the ETS-DE approach was found in the examples analyzed here. From the perspective of , the differences in the subgroup mean estimates of CJ-DE are easier to detect, because in extreme cases CJ-DE reports when it fails to converge. Nevertheless, when using AM’s “slog through” estimation option and increasing the number of iterations, there may be no indication of nonconvergence. The effects of CJ-DE when estimating subgroup variances are more difficult to detect, as this can only be accomplished by additional analysis using other, less restrictive methods.
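The sufficiency of the weighted sum of scores under the 2PL model, referred to in the conclusions above, can be checked numerically. In the sketch below, two different response patterns with the same weighted score have log-likelihoods that differ only by a constant that does not depend on the latent trait, so both patterns carry the same information about it. The item parameters and response patterns are hypothetical and were chosen only so that the weighted scores coincide.

```python
from math import exp, log

def log_likelihood_2pl(theta, responses, slopes, difficulties):
    """Log-likelihood of a 0/1 response pattern under the 2PL model."""
    total = 0.0
    for x, a, b in zip(responses, slopes, difficulties):
        p = 1.0 / (1.0 + exp(-a * (theta - b)))
        total += log(p) if x == 1 else log(1.0 - p)
    return total

# Hypothetical item parameters (not taken from the report)
slopes = [1.0, 1.0, 2.0, 0.5]
difficulties = [-0.5, 0.3, 0.0, 1.0]

# Two different patterns with the same weighted score sum(a_i * x_i) = 2.0
pattern_a = [1, 1, 0, 0]
pattern_b = [0, 0, 1, 0]

differences = []
for theta in (-2.0, 0.0, 1.5):
    diff = (log_likelihood_2pl(theta, pattern_a, slopes, difficulties)
            - log_likelihood_2pl(theta, pattern_b, slopes, difficulties))
    differences.append(diff)
    print(theta, round(diff, 6))   # the difference is 0.2 at every theta
```

Because the difference is constant in theta, any prior distribution for the latent trait leads to the same posterior for both patterns. This is one way to see why the observed score distribution constrains the latent trait distribution.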

Study II: Comparing Marginal Direct Estimation and Conditional Direct Estimation Subgroup Statistics for NAEP and NALS Data

Study I showed that the marginal direct estimation (CJ-DE) method relies strongly on the assumption that the latent trait is marginally normally distributed. The CJ-DE method as implemented in the AM software (Cohen, 1998) does not reproduce subgroup means and variances appropriately in cases where a significant part of subgroup differences is explained by the grouping variable of interest. The examples presented here help in studying consequences of this effect of marginal direct estimation in large-scale assessment data analysis. Assessments across a number of countries, states, regions, or other grouping variables cannot assume a certain form of marginal distribution of the trait across the groups (Yamamoto & Mazzeo, 1992). In addition, assuming that subgroup variances are homogeneous (i.e., that the trait[s] vary to a similar degree within all groups) might be too restrictive to fit diverse populations.

Data from large-scale assessment programs provide a source to study differences between CJ-DE and ETS-DE in a realistic data analysis setting. Using real data with operational reporting variables enables one to formulate expectations about whether certain variances should be equal or for which subgroups differences may be expected. This adds a different perspective to what was examined in Study I, where known parameters were compared with CJ-DE and ETS-DE estimates.

NAEP Math Assessment, Grade 4

As the first real data example, results were compared for ETS-DE and CJ-DE on data from an assessment given to a nationally representative sample of 13,855 students in the fourth grade for the National Assessment of Educational Progress (NAEP). The assessment, administered in 2000, used a sparse matrix sample design where examinees were given a 45-minute test of mathematics items consisting of a mixture of multiple choice and constructed response items. The 173-item pool was divided into 13 blocks of items (separately-timed sections). The blocks were assembled into 26 booklets based on a BIB (balanced incomplete block) design (Braswell et al., 2001). Each booklet contained three blocks of items, which were classified into five content-area scales: numeracy and operations, measurement, geometry, data analysis, and algebra. A typical examinee answered from 6 to 12 items per scale. A multiscale IRT model estimated with PARSCALE was used to calibrate the IRT item parameters for each of the five scales.

The following exhibits show results based on the ETS-DE methodology using 381 background variables in addition to item responses in order to obtain subgroup estimates. The 381 background variables are factor scores based on a principal component analysis that was conducted using the variables available from the background questionnaire (see Braswell et al., 2001, for details on the NAEP 2000 math assessment and the available background data). The operational NAEP 2000 item parameters were used in a five-dimensional run with CGROUP, the current software implementation of the multidimensional ETS-DE approach.

The ETS-DE approach was found to work accurately in recovering subgroup means and variances in Study I and serves as a benchmark for CJ-DE, which has been proposed for use for subgroup reporting (Cohen & Jiang, 1999). In contrast to CJ-DE, the ETS-DE approach assumes conditional normality of the latent traits with a large set of background variables. Given that a large number of background variables are used that explain a significant portion of the latent trait variance, this approach is capable of modeling complex mixtures of abilities resulting in non-normal population and subgroup distributions. To compare the results of ETS-DE and CJ-DE, the operational data and NAEP 2000 math item parameters were imported into the software that implements CJ-DE.
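The booklet layout described above (13 blocks, 26 booklets, three blocks per booklet) matches the parameters of a balanced incomplete block design in which every pair of blocks appears together in exactly one booklet. The sketch below builds one such design by cyclic development of two base booklets and checks the balance properties. The base booklets are an illustrative textbook-style construction and an assumption made here; they are not the operational NAEP booklet map, which may also balance properties (such as block position) that are not modeled.

```python
from itertools import combinations
from collections import Counter

N_BLOCKS = 13
# Illustrative base booklets, developed cyclically modulo 13 (an assumption,
# not the operational NAEP assignment of blocks to booklets)
BASE_BOOKLETS = [(0, 1, 4), (0, 2, 7)]

booklets = [tuple(sorted((block + shift) % N_BLOCKS for block in base))
            for base in BASE_BOOKLETS for shift in range(N_BLOCKS)]

# How often each block is used, and how often each pair of blocks shares a booklet
block_counts = Counter(block for booklet in booklets for block in booklet)
pair_counts = Counter(pair for booklet in booklets
                      for pair in combinations(booklet, 2))

print(len(booklets))                                # 26 booklets
print(set(block_counts.values()))                   # {6}: each block is in 6 booklets
print(len(pair_counts), set(pair_counts.values()))  # 78 pairs, each occurring once
```

In a design of this kind, every block is paired with every other block equally often, which is what allows covariances between items from different blocks to be estimated even though no examinee sees the full item pool.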

School Type

The first reporting variable used in this comparison is School Type, which has three categories in NAEP: Public, Private, and Catholic. The subsequent tables offer a comparison between CJ-DE and ETS-DE, the benchmark, on the basis of standardized mean differences and variance ratios similar to the exhibits in the previous part of the report. Table 12a shows the reference values estimated by ETS-DE in the untransformed latent trait scale, not in the NAEP reporting scale. The untransformed latent trait scale is implicitly given by the item parameters as calibrated with the PARSCALE software. PARSCALE defaults to the marginal latent trait moments $M(\theta)=0$ and a standard deviation $S(\theta)=1$.

Table 12a ETS-DE Estimates of the Means and Standard Deviations in the Latent Trait (Theta) Scale for School Type Subgroups

            Mean                          Standard deviation
Scale       Public   Private  Catholic    Public   Private  Catholic
NUM&OPER    –0.047   0.430    0.368       1.021    0.913    0.842
MEASURMT    –0.053   0.480    0.402       1.060    0.928    0.897
GEOMETRY    –0.034   0.299    0.267       1.012    0.913    0.821
DATA ANL    –0.045   0.327    0.425       1.103    0.969    0.880
ALGEBRA     –0.047   0.429    0.358       1.081    0.969    0.886

The Private and Catholic school categories have a mean that is about 0.35 to 0.52 standard deviations higher than the one for Public schools, whereas the respective standard deviations for these subgroups are slightly lower than the subgroup standard deviation for the Public school category across all five scales of the NAEP math assessment. Table 12b gives the corresponding standardized mean differences and variance ratios. The table shows these values for the School Type subgroups, where the differences are formed by “CJ-DE minus ETS-DE” and the ratios are “CJ-DE divided by ETS-DE.”

Table 12b Standardized Mean Differences and Variance Ratios for School Type Subgroups

            Standardized mean difference    Variance ratio
Scale       Public   Private  Catholic      Public   Private  Catholic
NUM&OPER    0.047    –0.245   –0.737        0.931    1.129    1.338
MEASURMT    0.039    0.137    –0.293        0.911    1.143    1.235
GEOMETRY    0.074    0.658    –1.390        0.902    1.087    1.359
DATA ANL    0.071    –0.420   –0.330        0.820    1.044    1.253
ALGEBRA     0.149    –1.226   –0.439        0.832    1.010    1.211

Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

ETS-DE and CJ-DE provide quite similar subgroup mean estimates for most of the five scales in the three subgroups, but there are differences in the subgroup standard deviations reported by the two methods. The ETS-DE method reports that the Catholic school subgroup has a smaller standard deviation than the Public school subgroup on all five scales, whereas the CJ-DE method reports more similar standard deviations across the three subgroups. In Study I, using simulated data examples, it was found that CJ-DE does not recover differences in subgroup standard deviations correctly. The ETS-DE method, however, was found to recover this type of subgroup heteroscedasticity in the simulated examples, and ETS-DE reflects differences between subgroup variances in the NAEP example reported here.

Race/Ethnicity

The next variable analyzed is Race/Ethnicity, which has four categories in the NAEP 2000 data: WHI/AI/O (White, American Indian, Other), AFRAM (African American), HISPANIC (Hispanic American), and ASIAM (Asian American). Table 13 below shows the subgroup mean differences between CJ-DE and ETS-DE and the corresponding variance ratios for this reporting variable.

Table 13 Race/Ethnicity Subgroup Reports Generated Based on the NAEP 2000 Grade 4 Math Data

            Standardized mean difference             Variance ratio
Scale       WHI/AI/O  AFRAM    HISPANIC  ASIAM       WHI/AI/O  AFRAM   HISPANIC  ASIAM
NUM&OPER    -0.259    0.482    0.348     -0.285      0.996     0.952   0.849     0.729
MEASURMT    -0.109    -0.172   0.459     0.091       0.955     0.935   0.830     0.738
GEOMETRY    -0.282    0.012    0.658     0.222       0.976     0.901   0.784     0.715
DATA ANL    -0.094    0.743    -0.254    -0.715      0.889     0.778   0.691     0.730
ALGEBRA     -0.305    0.490    0.477     -0.388      0.878     0.815   0.728     0.717

Note. Large differences from the expected values given the more general model are printed in boldface.

The subgroup mean differences indicate that the estimates of the two methods do not differ significantly from each other. CJ-DE reproduces the ETS-DE mean estimates satisfactorily for the race subgroup variable. The standard deviation estimates given by CJ-DE differ from what is reported by the ETS-DE method for the subgroups AFRAM, HISPANIC, and ASIAM. The standard deviation estimates provided by CJ-DE are about 0.7 times the size of the respective ETS-DE estimate. In contrast, CJ-DE yields a standard deviation more similar to ETS-DE for the WHI/AI/O subgroup.

Individualized Education Plan

Table 14 shows the subgroup mean differences of CJ-DE estimates against the ETS-DE analysis and the corresponding variance ratios for the dichotomous grouping variable IEP (Individualized Education Plan). There is a large mean difference between the two subgroups IEP and non-IEP. The IEP group means are approximately 0.9 standard deviations smaller than the non-IEP group estimates across all five scales (see Appendix C, where the ETS-DE estimates for the reporting variable IEP are given). Based on the findings of Study I, it can be expected that CJ-DE mean estimates will not reflect the large difference between the IEP and the non-IEP subgroups. The standardized mean differences and variance ratios for the IEP reporting variable are given in Table 14.

Table 14 IEP Subgroup Reports Based on the NAEP 2000 Grade 4 Math Data

            Standardized mean difference    Variance ratio
Scale       IEP      Non-IEP                IEP      Non-IEP
NUM&OPER    5.207    –0.556                 0.841    1.025
MEASURMT    4.145    –1.206                 0.835    0.996
GEOMETRY    4.832    0.042                  0.830    1.002
DATA ANL    4.783    0.275                  0.685    0.921
ALGEBRA     5.639    –0.998                 0.621    0.961

Note. Large differences from the expected values given the more general model are printed in boldface.

The CJ-DE estimates show large differences from the IEP group means as provided by ETS-DE. CJ-DE reports consistently smaller mean differences between IEP and non-IEP subgroups, so that the corresponding mean difference between CJ-DE and ETS-DE is a large positive number. The same was found in Study I (see above) using simulated data when the absolute mean differences between subgroups are large. These results support the conjecture that CJ-DE direct estimation of subgroup mean differences deviates from more general models in the presence of large between-group differences. Compared to ETS-DE, CJ-DE slightly underestimates the IEP subgroup variances for the subscales NUM&OPER, MEASURMT, and GEOMETRY. For the subgroup variances of ALGEBRA and DATA ANL, the CJ-DE estimates are only about 0.7 times the size of the corresponding ETS-DE estimates.

National Adult Literacy Survey

The second real data set used in this comparison is taken from the National Adult Literacy Survey (NALS) administered in 1992. This data set consists of 21,363 subjects and contains a sparse matrix sample of 713 items from three content domains of literacy: quantitative, prose, and document. NALS

…measured literacy along three dimensions, prose literacy, document literacy, and quantitative literacy, designed to capture an ordered set of information-processing skills and strategies that adults use to accomplish a diverse range of literacy tasks. The literacy scales make it possible to profile the various types and levels of literacy among different subgroups in our society (“Defining and measuring literacy,” n.d.).

The exemplary comparisons presented here utilize the NALS main assessment data file and the operational item parameters, which were used with the CGROUP program, the current implementation of the ETS-DE approach. The same data and item parameters were imported into the implementation of the CJ-DE approach, the AM software. Similar to the preceding analyses, a number of policy-relevant grouping variables from the NALS data file were chosen to compare the subgroup distribution estimates as given by the ETS-DE and the CJ-DE approach. Table 15 shows variance ratios and standardized mean differences between the estimates of ETS-DE and CJ-DE for the grouping variable REGION with four subgroups.

Table 15 Variance Ratio and Standardized Mean Differences Between CJ-DE and ETS-DE Estimates for REGION as Defined in the NALS 1992 Data

           Standardized mean difference        Variance ratio
REGION     Prose    Document  Quantitative     Prose   Document  Quantitative
MIDWEST    –0.560   –1.189    –0.418           1.012   0.981     0.964
N-EAST     –0.154   0.193     –0.277           0.828   0.809     0.831
SOUTH      –0.112   –0.275    –0.298           0.770   0.743     0.748
WEST       0.771    1.029     0.892            0.717   0.708     0.740

Note. The direction of the difference is (CJ-DE minus ETS-DE) and the direction of the ratio is (CJ-DE divided by ETS-DE). Variance ratios smaller than 0.75 are printed in boldface.

The results indicate that all four subgroup mean estimates given by ETS-DE and CJ-DE agree relatively well. In contrast to the agreement between ETS-DE and CJ-DE for the means of the region subgroups, the variance estimates for the regions SOUTH and WEST given by CJ-DE are only about 0.75 times as large as the variance estimates given by ETS-DE.

The next NALS reporting variable used in the comparison is BORN IN, which has five categories: USA, SPAN (Spanish-speaking world), EUROP, ASIA, and OTHER. Table 16 shows the standardized mean differences and variance ratios of CJ-DE compared to the ETS-DE estimates for this reporting variable.

Table 16 Variance Ratio and Standardized Mean Differences Between CJ-DE and ETS-DE Estimates for the Grouping Variable BORN IN as Defined in the NALS 1992 Data

           Standardized mean differences       Variance ratio
BORN IN    Prose    Document  Quantitative     Prose   Document  Quantitative
USA        –2.687   –2.899    –2.475           0.923   0.890     0.884
SPAN       8.892    7.552     7.142            0.413   0.409     0.455
EUROP      0.616    0.980     0.415            0.559   0.592     0.655
ASIA       2.093    1.930     1.789            0.539   0.505     0.553
OTHER      1.404    1.293     1.182            0.588   0.564     0.627

Note. The direction of the difference is (CJ-DE minus ETS-DE) and the direction of the ratio is (CJ-DE divided by ETS-DE). Variance ratios smaller than 0.75 and standardized mean differences larger than 2.2 in absolute value are printed in boldface.

There are discrepancies between the subgroup mean estimates of CJ-DE and ETS-DE for the USA and SPAN subgroups. The CJ-DE estimates for USA are about 2.5 to 2.9 standard units lower than the ETS-DE estimates for the three literacy scales. The standardized differences between the CJ-DE mean estimates and the ETS-DE estimates for SPAN lie between about 7.1 and 8.9 across the three scales, indicating that CJ-DE differs significantly from the ETS-DE estimates. The variance ratio for four subgroups (SPAN, EUROP, ASIA, and OTHER) is between 0.4 and 0.66 across all three subscales of the NALS data, indicating that the CJ-DE estimates are systematically smaller than the ETS-DE estimates in this case.

The final comparison of CJ-DE and ETS-DE on the basis of the NALS data is based on the reporting variable “Years living in the USA.” This reporting variable has nine categories, ranging from “1-5 years in the USA” to “Ever live in the USA,” in 5 to 10 year intervals (see below). Table 17 shows the standardized mean differences between CJ-DE and ETS-DE and the variance ratios for the three literacy scales across the nine subgroups.

Table 17 Variance Ratio and Standardized Mean Differences Between CJ-DE and ETS-DE Estimates for “Years Living in the USA” as Defined in NALS 1992 Data

              Standardized mean difference        Variance ratio
Yrs in USA    Prose    Document  Quantitative     Prose   Document  Quantitative
1–5           12.060   11.418    8.791            0.432   0.410     0.432
6–10          2.762    2.725     2.584            0.545   0.527     0.572
11+           4.121    4.581     3.361            0.427   0.434     0.471
16+           2.863    2.333     2.740            0.440   0.436     0.486
21+           1.461    2.035     2.208            0.520   0.518     0.565
31+           0.777    1.151     0.913            0.558   0.602     0.609
41+           0.187    0.370     0.104            0.467   0.473     0.499
51+           –0.441   –0.035    –0.322           0.939   0.943     0.979
Ever          –3.182   –3.471    –2.770           0.937   0.902     0.896

Note. The direction of the difference is (CJ-DE minus ETS-DE) and the direction of the ratio is (CJ-DE divided by ETS-DE). Variance ratios smaller than 0.75 and standardized mean differences larger than 2.2 in absolute value are printed in boldface.

The subgroup mean estimates of CJ-DE are between 2.3 and 12 standardized units larger than the corresponding estimates given by ETS-DE for the subgroups “1–5 years in the USA,” “6–10,” “11+,” and “16+.” The mean estimate for subgroup “Ever live in the USA” is between about 2.8 and 3.5 standard units smaller for CJ-DE as compared to ETS-DE.

The variance estimates given by CJ-DE for the first seven subgroups, from “1–5” to “41+,” are systematically smaller than what ETS-DE reports. The variance ratio lies between 0.41 and 0.61 in these subgroups across all three scales. In contrast, the CJ-DE subgroup variance estimates for “Ever” and “51+” are close to what ETS-DE yields, as the variance ratio is close to 1.

Note that the subgroups of US residents with a comparably small number of years residing in the United States are the subgroups with a comparably larger difference to the total mean (see Appendix D). For these subgroups, CJ-DE yields estimates that deviate more from what is given by the more general ETS-DE approach, whereas subgroups closer to the total mean (“Ever” and “51+”) receive estimates that agree more closely with the ETS-DE approach.

Conclusions: Study II

The results reported in Study II show similarities with the results obtained in Study I, which used simulated data. In the case of simulated data, CJ-DE differs from the values obtained by ETS-DE and from the true values obtained from analyzing the simulated proficiency values used for generating the response data. The assumption of marginal normality leads to discrepancies between CJ-DE and the true values in the presence of large subgroup mean differences and in cases where the subgroup variances are heteroscedastic.

Recall Cohen and Jiang's (1999) direct estimation model, where the conditional density of $\theta$ given subgroup membership $g = k$ is derived based on the marginal normality assumption. This density depends on the marginal parameters $\mu$ and $\sigma$ and subgroup parameters $(a_1, b_1, \ldots, a_G, b_G)$. Essentially, the marginal normal density $\phi((\theta - \mu)/\sigma)$ acts as a prior for the conditional density

\[
f(\theta \mid g = k) = \frac{\phi\left((\theta - \mu)\sigma^{-1}\right) f(g = k \mid \theta)}{\int \phi\left((\theta^{*} - \mu)\sigma^{-1}\right) f(g = k \mid \theta^{*})\, d\theta^{*}} \tag{14}
\]

which prevents the conditional densities from fitting larger subgroup mean differences. This may explain why the CJ-DE standard deviation estimates are less variable across subgroups and why the restriction of the standard deviation is correlated with the distance of the corresponding subgroup mean from the total mean. A thorough analysis of the marginal direct estimation model (Cohen & Jiang, 1999) should reveal that this restriction of the parameter space is caused by the assumption of marginal normality. This assumption forces the mixture of subgroup distributions to fit under the unimodal normal distribution.

The conclusion in Study II, which uses real data from NAEP and NALS, corresponds closely to the findings of Study I, which compares CJ-DE and ETS-DE based on simulated data examples, even though in real data applications the true values usually are unknown. In the presence of large subgroup mean differences, CJ-DE yields less extreme subgroup estimates than ETS-DE, which also was found in the comparison in Study I of both methods to the true values.
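The restriction implied by Equation 14 can be illustrated numerically. The following Python fragment is a minimal sketch, not the AM implementation: it assumes a multinomial-logit form for f(g = k | θ) with intercepts a_k and slopes b_k, which is an illustrative choice made here, and the parameter values are hypothetical. It integrates the normal prior times the membership probability on a grid to obtain each subgroup's share, mean, and variance.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def membership_probs(theta, a, b):
    """ASSUMPTION: multinomial-logit form for f(g = k | theta)."""
    logits = [a_k + b_k * theta for a_k, b_k in zip(a, b)]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def subgroup_moments(a, b, mu=0.0, sigma=1.0, half_width=8.0, n=3201):
    """Return (share, mean, variance) per subgroup via a Riemann sum."""
    lo = mu - half_width * sigma
    step = 2.0 * half_width * sigma / (n - 1)
    groups = len(a)
    mass = [0.0] * groups
    first = [0.0] * groups
    second = [0.0] * groups
    for i in range(n):
        theta = lo + i * step
        prior = step * normal_pdf(theta, mu, sigma)
        for k, p in enumerate(membership_probs(theta, a, b)):
            mass[k] += prior * p
            first[k] += prior * p * theta
            second[k] += prior * p * theta * theta
    out = []
    for k in range(groups):
        mean = first[k] / mass[k]
        out.append((mass[k], mean, second[k] / mass[k] - mean * mean))
    return out


# Hypothetical parameters: three groups with strongly different slopes.
moments = subgroup_moments(a=[0.0, 0.0, 0.0], b=[-2.0, 0.0, 2.0])
for k, (share, mean, var) in enumerate(moments, start=1):
    print(f"group {k}: share={share:.3f} mean={mean:.3f} variance={var:.3f}")

# With mu = 0 and sigma = 1 the pooled second moment is fixed at 1.
total_share = sum(share for share, _, _ in moments)
total_second_moment = sum(
    share * (var + mean * mean) for share, mean, var in moments
)
print(f"sum of shares = {total_share:.6f}")
print(f"sum of share * (variance + mean^2) = {total_second_moment:.6f}")
```

Under these assumptions the subgroup shares sum to one and the pooled second moment is fixed by μ and σ, so moving subgroup means away from the total mean leaves less room for within-group variance. This is in line with the pattern described above, although the sketch does not reproduce any of the reported estimates.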


Additionally, the variance estimates given by CJ-DE tend to be more similar across subgroups, both as compared to the ETS-DE estimates and as compared to the true values in Study I. The CJ-DE variance estimates seem to be increasingly restricted with increasing difference of the subgroup mean from the total mean. As noted in the introduction, CJ-DE uses a number of assumptions to derive a conditional subgroup density while maintaining the restriction of normality of the marginal density. This assumption of a normal marginal distribution of the latent trait is believed to reflect common practice in large-scale assessment applications of IRT (see Cohen & Jiang, 1999). However, NAEP and other large-scale assessments do not rely on this assumption. Appendix E gives an example of how to avoid the assumptions of CJ-DE when using AM in order to estimate a less restrictive model with this software. The results of using simulated data and the results of using real data both show that the assumptions used in CJ-DE lead to discrepancies when analyzing complex samples where the assumptions are not met by the data. The operational ETS-DE approach does not put the normality assumption in the marginal distribution, but in the conditional distribution of the latent trait given the item responses and a large number of background variables. The conditioning approach utilized by ETS-DE is therefore more general and enables it to fit non-normal distributions, as the conditional means given the background model are not assumed to follow a specific distribution. In light of the systematic differences seen in both Study I and Study II, using methods such as CJ-DE that rely on item responses only and replace valuable background information with a number of assumptions does not seem defensible for the analysis of large-scale assessment data.
This also holds for trend studies, where the assessment of change relies even more on maximizing the comparability of results and the accuracy of the mean and variance estimates obtained across time points and subgroups.


References

Braswell, J. S., Lutkus, A. D., Grigg, W. S., Santapau, S. L., Tay-Lim, B., & Johnson, M. (2001). The nation’s report card: Mathematics 2000. Washington, DC: National Center for Education Statistics.

Cohen, J. D. (1998). AM online help content—Preview. Washington, DC: American Institutes for Research.

Cohen, J. D., & Jiang, T. (1999). Comparison of partially measured latent traits across normal populations. Journal of the American Statistical Association, 94(448), 1035-1044.

Defining and measuring literacy. (n.d.). In National assessments of adult literacy. Retrieved December 6, 2002, from http://nces.ed.gov/naal/defining/defining.asp

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56(2), 177-196.

Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29(2), 133-161.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press.

Thomas, N. (1993). Asymptotic corrections for multivariate posterior moments with factored likelihood functions. Journal of Computational and Graphical Statistics, 2, 309-322.

Thomas, N. (2002). The role of secondary covariates when estimating latent trait population distributions. Psychometrika, 67(1), 33-48.

Yamamoto, K., & Mazzeo, J. (1992). Item response theory scale linkage in NAEP. Journal of Educational Statistics, 17(2), 155-173.


Notes

1 The item parameters in the k-scale IRT model are assumed to be known constants.

2 The overall mean and standard deviation reported here are estimates by ETS-DE and the TRUE data; CJ-DE does not provide overall means and standard deviations.

3 This indicates that the Catholic school category is more homogeneous as compared to the two other categories. The Public school category consistently has the largest standard deviations across all five scales.


Appendix A Item Parameters of the Simulated Three-scale Six-item Data Set

Scale           Slope      Difficulty
Scale 1
  [1,]      1.0707435    –0.423607249
  [2,]      1.1946191     0.369087609
  [3,]      1.1356097    –0.008368651
  [4,]      1.1029780    –0.434542858
  [5,]      0.6926124    –0.320136837
  [6,]      0.8034373     0.817567985
Scale 2
  [7,]      0.9617609     0.003169065
  [8,]      1.1004634     1.327405006
  [9,]      0.9115646     0.451618136
  [10,]     1.0574126    –2.053570652
  [11,]     1.1098851     0.006184470
  [12,]     0.8589135     0.265193973
Scale 3
  [13,]     1.2621460     1.339141978
  [14,]     0.8917393    –0.220816527
  [15,]     0.9161605     0.758596816
  [16,]     0.9253288    –0.066838528
  [17,]     0.7505099    –0.099260362
  [18,]     1.2541155    –1.710823377

Note. The guessing parameter was 0.1 for all items.


Appendix B Item Parameters of the Simulated 3-scale 12-item Data Set

Scale           Slope     Difficulty
Scale 1
  [1,]      1.0301048    –0.40334405
  [2,]      1.0807597    –0.12162779
  [3,]      1.0250148    –0.29599706
  [4,]      0.8097633     0.13585131
  [5,]      1.0834746    –0.10137978
  [6,]      1.0881449     1.18682432
  [7,]      0.8241556     0.58488677
  [8,]      1.0754401     0.95989977
  [9,]      0.8284506    –1.57049425
  [10,]     1.0272207    –0.24556290
  [11,]     1.1410092    –0.53788298
  [12,]     0.9864615     0.40882663
Scale 2
  [13,]     1.0131312     0.56203695
  [14,]     1.0604981     0.63205024
  [15,]     1.2831725    –0.41368560
  [16,]     1.1636971    –0.90477486
  [17,]     0.9043142     0.01714852
  [18,]     0.9837799    –0.84975192
  [19,]     1.0296239     0.63169027
  [20,]     1.2039188     0.04996556
  [21,]     0.6799550     0.77051519
  [22,]     0.9778539    –0.91851904
  [23,]     1.0707815     0.19213650
  [24,]     0.6292741     0.23118819
Scale 3
  [25,]     1.1981095     0.24101286
  [26,]     1.0874208    –0.18829633
  [27,]     0.9684248    –0.58984308
  [28,]     0.8853709    –0.95740524
  [29,]     0.9017118    –0.19778461
  [30,]     1.0488593    –1.42372395
  [31,]     1.0086545     0.17463042
  [32,]     0.8052735     1.48726305
  [33,]     1.2051341     1.30940643
  [34,]     1.0667933     0.28232721
  [35,]     0.8616304    –0.32302987
  [36,]     0.9626173     0.18544311

Note. The guessing parameter was 0.1 for all items.


Appendix C ETS-DE Estimates for IEP Subgroup Means and Standard Deviations

                  Mean          Standard deviation
              IEP   Non-IEP        IEP   Non-IEP
NUM&OPER   –0.910     0.091      1.050     0.966
MEASURMT   –0.841     0.054      1.100     1.016
GEOMETRY   –0.844     0.076      1.043     0.958
DATA ANL   –0.852     0.103      1.199     1.043
ALGEBRA    –0.927     0.099      1.244     1.009


Appendix D Means and Standard Deviations for ETS-DE Estimates for “Years Living in the USA” as Defined in NALS 1992 Data

                           Mean                         Standard deviation
Yrs in USA     Prose   Document   Quantitative     Prose   Document   Quantitative
1-5           –1.287     –1.154         –1.043     1.549      1.584          1.548
6-10          –1.181     –1.029         –0.987     1.325      1.364          1.305
11+           –1.228     –1.094         –1.026     1.506      1.513          1.448
16+           –0.900     –0.874         –0.777     1.507      1.514          1.447
21+           –0.714     –0.691         –0.565     1.389      1.396          1.346
31+           –0.463     –0.513         –0.351     1.349      1.298          1.298
41+           –0.616     –0.749         –0.585     1.459      1.448          1.417
51+            0.068     –0.139          0.062     1.033      1.032          1.018
Ever           0.102      0.094          0.084     1.065      1.076          1.085

Note. The subgroup means are reported as differences from the total mean.


Appendix E Using AIR’s AM Software for Secondary Analyses

Studies I and II have shown that AM’s procedure for CJ-DE, a direct estimation approach relying on a marginal normality assumption, does not seem suitable for data where the normality of the latent trait across subgroups cannot be warranted. The CJ-DE approach has been developed “to consistently estimate subpopulation distributions when the groups are defined by values of a [nominal or ordinal variable]” (Cohen & Jiang, 1999). The two procedures implementing CJ-DE are referred to as “Ordinal Table” (OT) and “Nominal Table” (NT) in the AM software, depending on the grouping variable’s scale level.

In contrast to the findings concerning CJ-DE, AM’s MML regression procedure reproduced the results of analyzing the true values, which served as the basis for the simulated data, quite well, in much the same way the ETS-DE approach does. In the simulated data examples with known true regression coefficients, ETS’s method and AM’s MML regression agreed closely when estimating regression parameters for the full conditioning model. AM’s MML regression module cannot be used “as is” for reporting purposes, because additional steps are necessary in order to produce subgroup statistics based on the regression results. The goal of this appendix is to explore ways to use MML regression and other modules of AM and to provide a guideline on how to put together analysis steps that yield results with the AM software that more closely resemble the true values and the ETS-DE conditioning model estimates.

AM was used in the examples presented below in a multistep procedure for producing subgroup statistics without using AM’s CJ-DE modules. This step-by-step procedure lacks the convenience of the operational ETS-DE approach in that it requires manual concatenation of separate intermediate results produced by AM’s procedures.
Therefore, the goal of the study presented here is not to provide an alternative to ETS-DE, but to test whether AM can be used for secondary analyses.

The approaches taken by the ETS-DE conditioning model on the one hand and AM’s CJ-DE direct estimation module on the other hand differ strongly with respect to the information incorporated in estimating subgroup characteristics. ETS-DE uses extensive background (conditioning) information, including grouping variables, in addition to the observed item responses. CJ-DE, in contrast, only includes one grouping variable at a time together with the item responses but draws on a number of strong assumptions regarding the shape of the marginal ability distribution and the relation between the latent trait and the group indicator variable.

Issues in Model Selection

Assumptions about the population structure are central in the process of building a model for complex survey data. The question is what kinds of assumptions are viewed as appropriate for the comparison of multiple subgroups with respect to their means and variances.

Figure E1. Subgroup distributions with normality assumption on the marginal level.

In the case depicted in Figure E1, the overall distribution is assumed to be normal, and the sum of all subgroup distributions has to accommodate this shape. It follows that the shapes of the subgroup distributions are no longer free; they have to fit under the overall normal shape and their sum has to be equal to that shape. This assumption is central to CJ-DE and makes it inappropriate for more complex real data.

A less restrictive assumption is that all subgroups are normally distributed and share the same variance but may vary with respect to their means and sizes. This assumption can be modeled by a regression with contrast-coded subgroup indicators, which can be done in many software packages as well as in ETS-DE and AM. This drops the assumption of marginal normality and with it the main feature of CJ-DE as proposed by Cohen and Jiang (1999). The effect of relaxing the marginal normality assumption is illustrated in Figure E2.

Figure E2. Subgroup distributions with normality assumption in all subgroup levels.

This less restrictive assumption obviously allows a larger range of cases to be fitted as compared to CJ-DE. This approach can be taken by using AM’s MML means procedure, even though that procedure will not yield subgroup variance estimates. If only a few subgroups are used, the homoscedasticity assumption within subgroups limits the ability to fit more general marginal distributions. A useful extension would be to assume a separate variance for each subgroup. In AM, this assumption can be accommodated by using MML regression together with filtering the data as many times as there are subgroups. But even this limits the subgroup distributions to be normal, which still seems too restrictive if, for example, there is a strong indication that some subgroups are composites. This is one of the reasons why an even more general approach is needed for estimating subgroup means and variances consistently in complex samples.

The approach taken by ETS-DE increases the number of conditioning subgroups by using a conditioning model that incorporates a large number of background variables. This conditioning model, using regression techniques, assumes that the conditional distributions are normal with conditional mean $E(\theta \mid y_1, \ldots, y_k)$ and conditional variance $V(\theta \mid y_1, \ldots, y_k)$, where $y_1, \ldots, y_k$ is a vector of background variables that incorporates more than only the grouping variables. Although homoscedasticity is assumed given the vector of background variables $y_1, \ldots, y_k$, no such assumption is made regarding the subgroup variances. This means that the subgroup distribution is an aggregation over many individual conditional normal distributions defined in the ETS-DE conditioning model. The ETS-DE conditioning model allows fitting complex composite marginal distributions like the ones found in large-scale assessments that consist of a number of subgroup distributions with different shapes and sizes. Figure E3 shows two normal subgroup distributions (dashed lines) and a non-normal subgroup distribution (solid line), which, when combined, result in a heavy-tailed marginal distribution.
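The aggregation of conditional normal distributions into a subgroup distribution can be sketched as follows. This Python fragment is an illustration only, not the ETS-DE implementation: it assumes that every respondent in a subgroup contributes one normal component with equal weight (sampling weights are ignored), that all components share one residual variance, and that the conditional means are given. The function name and the example values are hypothetical.

```python
from statistics import fmean, pvariance


def aggregate_conditional_normals(conditional_means, residual_variance):
    """Mean and variance of an equal-weight mixture of N(m_i, residual_variance).

    By the law of total variance, the mixture variance is the common
    residual variance plus the spread of the conditional means.
    """
    mean = fmean(conditional_means)
    variance = residual_variance + pvariance(conditional_means, mu=mean)
    return mean, variance


# Hypothetical conditional means for two subgroups, same residual variance.
RESIDUAL_VARIANCE = 0.5
homogeneous_group = [0.1, 0.0, -0.1, 0.0]
heterogeneous_group = [-1.5, -0.5, 0.5, 1.5]

for label, means in (
    ("homogeneous", homogeneous_group),
    ("heterogeneous", heterogeneous_group),
):
    mean, variance = aggregate_conditional_normals(means, RESIDUAL_VARIANCE)
    print(f"{label}: mean={mean:.3f} variance={variance:.3f}")
```

Under these assumptions, two subgroups that share the same residual variance can still have different subgroup variances, because the conditional means are spread differently within each subgroup. The resulting mixture also need not be normal.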


Figure E3. Subgroup distributions with and without normality assumption on the subgroup level.

A major advantage of choosing this most general approach is that it allows maximum flexibility of the model in order to accommodate as many shapes of subgroup distributions as possible. This is necessary even if only the first one or two moments of the subgroup distributions are to be estimated, because strong a priori assumptions about the shape of the subgroup distributions will degrade these estimates (see Studies I and II).

MINTER & POSTER AM Software Tweaks

The following analyses are based on findings that emerged while experimenting with the AM software in order to explore ways to circumvent the shortcomings of CJ-DE. MINTER (short for “multiple intercept only runs”) is based on the MML regression module of AM and somewhat resembles the CJ-DE direct estimation approach without the restrictive marginal normality assumption. This manual procedure first selects observations by means of each possible outcome of the grouping variable. Second, MML regressions are estimated for each of the filtered subsamples without background variables, i.e., an MML regression with the intercept term only. This has to be carried out once for every combination of scale and subgroup (3*5=15 times in our example) in order to estimate subgroup means and variances for all possible scale-by-subgroup combinations.

In trying to resemble the conditioning approach that is operational at ETS, another approach can be taken with AM that still lacks the advantages of simultaneous estimation of the conditioning effects for all scales and all subgroups of interest. This second approach, referred to as POSTER here, is carried out by utilizing posterior means. In a first step, the POSTER procedure uses an MML regression model (the AM MML regression procedure with all background variables). This resembles the ETS-DE methodology for the conditioning model in the unidimensional case. A second step is needed in order to produce posteriors (option “posteriors” in the respective MML-regression icon’s popup menu invoked by the right mouse button). A third step uses AM’s descriptives procedure, where the “POSTM0” variable (or, if this was carried out more than once, POSTM1 or POSTM2 and so on) is selected as the dependent variable, and the grouping variable of interest is used as the independent variable. This uses the posterior means instead of plausible values to compute subgroup distributions. The POSTER procedure includes invoking 3*3=9 AM procedures manually. The results for the example data described as Setup 2 in Study I are given in Table E1.
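The bookkeeping behind the two manual procedures can be sketched as a run plan. The following Python fragment only enumerates the runs described above; it does not call AM, which is operated through its graphical interface, and the labels and dictionary keys are illustrative names rather than AM options.

```python
from itertools import product

# Example layout from Setup 2: three scales and five subgroups.
SCALES = ["Scale 1", "Scale 2", "Scale 3"]
GROUPS = ["Group 1", "Group 2", "Group 3", "Group 4", "Group 5"]

# Illustrative labels for the three POSTER steps described in the text.
POSTER_STEPS = [
    "MML regression with all background variables",
    "produce posteriors",
    "descriptives of posterior means by grouping variable",
]

# MINTER: one intercept-only MML regression per scale and filtered subgroup.
minter_runs = [
    {"scale": scale, "filter": group, "model": "intercept only"}
    for scale, group in product(SCALES, GROUPS)
]

# POSTER: three consecutive steps per scale.
poster_runs = [
    {"scale": scale, "step": step}
    for scale, step in product(SCALES, POSTER_STEPS)
]

print(f"MINTER runs: {len(minter_runs)}")
print(f"POSTER runs: {len(poster_runs)}")
```

Listing the runs in this way makes it easier to keep track of which intermediate results still have to be concatenated by hand.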


Table E1 Subgroup GROUP Statistics for the Six-item Data Set, Setup 2

           Standardized mean difference         Variance ratio
                TRUE    MINTER    POSTER      TRUE    MINTER    POSTER
Scale 1
  Group 1     0.0000   –1.6881   –1.3284    1.0000    1.5112    0.9868
  Group 2     0.0000    0.0943   –0.6406    1.0000    1.4201    0.9248
  Group 3     0.0000   –0.1171   –0.5323    1.0000    1.1225    0.8128
  Group 4     0.0000   –0.8333   –0.3529    1.0000    0.8609    0.9026
  Group 5     0.0000    0.3019    0.4328    1.0000    1.3861    1.0022
Scale 2
  Group 1     0.0000   –0.6129   –0.3846    1.0000    0.6906    0.6104
  Group 2     0.0000    0.3448   –0.8571    1.0000    0.5898    0.6746
  Group 3     0.0000   –0.5593   –0.5417    1.0000    0.6758    0.6624
  Group 4     0.0000    0.7326   –0.3333    1.0000    1.5398    0.6721
  Group 5     0.0000   –1.9000   –0.4815    1.0000    0.3810    0.6808
Scale 3
  Group 1     0.0000    0.5570   –1.8235    1.0000    1.0907    0.6282
  Group 2     0.0000    0.2174    0.4651    1.0000    1.1610    0.7020
  Group 3     0.0000   –0.4667    3.5000    1.0000    1.0165    0.6466
  Group 4     0.0000    0.2985    1.5758    1.0000    1.0452    0.5591
  Group 5     0.0000   –0.1918   –3.8387    1.0000    1.3006    0.7309

Conclusions

AM’s MML regression module is, to a certain extent, useful for studying subgroup differences. Compared to the CJ-DE modules of AM, the MML regression module agreed more closely both with the study based on real data and with the true known values and ETS-DE’s results in the simulation study. MML regression and the regression part of the ETS-DE conditioning model share common features when using the same set of background variables as predictors and restricting the comparison to the regression part of the programs. The implementation of the ETS-DE conditioning model incorporates features like multivariate latent traits and the flexibility to handle all relevant reporting variables in a joint set of subgroups, which make this approach more suitable for operational reporting and research. AM appeals to secondary users through its easy-to-use interface and its capability for user-selected preliminary analyses that do not require the high accuracy achieved by incorporating large numbers of background variables. To promote the conditioning model for reporting in large-scale assessments, producing a user interface for the program implementing ETS-DE seems to be easier and to require less verification of the models and estimation methods than adding missing features to AM and testing the reliability of its modules throughout.
