Comparing Conditional and Marginal Direct Estimation of Subgroup Distributions
RESEARCH REPORT RR-03-02, January 2003

Matthias von Davier
Research & Development Division, Educational Testing Service, Princeton, NJ 08541

Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from: Research Publications Office, Mail Stop 10-R, Educational Testing Service, Princeton, NJ 08541.

Abstract

Many large-scale assessment programs in education utilize “conditioning models” that incorporate both cognitive item responses and additional respondent background variables relevant for the population of interest. The set of respondent background variables serves as a predictor for the latent traits (proficiencies/abilities) and is used to obtain a conditional prior distribution for these traits. This is done by estimating a linear regression, assuming normality of the conditional trait distributions given the set of background variables. Multiple imputations of the trait parameters, the so-called plausible values, are drawn on top of the conditioning model as a computationally convenient way to generate consistent estimates of trait distribution characteristics for subgroups in complex assessments. This report compares, on the basis of simulated and real data, the conditioning method with a recently proposed method of estimating subgroup distribution statistics that assumes marginal normality. Study I presents simulated data examples in which the marginal normality assumption leads to a model that produces appropriate estimates only if subgroup differences are small.
In the presence of larger subgroup differences that cannot be fitted under the marginal normality assumption, however, the proposed method produces subgroup mean and variance estimates that differ strongly from the true values. Study II extends the findings on the marginal normality estimates to real data from large-scale assessment programs, namely the National Assessment of Educational Progress (NAEP) and the National Adult Literacy Survey (NALS). The research presented in Study II shows differences between the two methods that are similar to those found in Study I. The consequences of relying on the assumption of marginal normality in direct estimation are discussed.

Key words: conditioning models, large-scale assessments, NAEP, NALS, direct estimation

Acknowledgements

I would like to thank John Mazzeo for valuable comments on previous versions of this document, which improved both content and presentation. Any remaining errors are mine.

Introduction

Large-scale assessments such as the National Assessment of Educational Progress (NAEP) estimate the distribution of academic achievement for policy-relevant subgroups. Examples of estimates provided by large-scale assessments are means and percentages above cut points for the subgroups of interest. Many large-scale assessments such as NAEP use a sparse matrix sampling design in which the number of cognitive items per respondent is kept relatively small. Such designs allow the assessment to provide broad coverage of the content domain while keeping each subject's testing time brief. This implies that individual ability estimates based on these kinds of assessments would have a large measurement error component, which has to be taken into account when reporting aggregate statistics for subgroups. Direct estimation procedures, by which these estimates are obtained without the generation of individual scores, have been the approach most commonly taken to address this analysis challenge.
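The distortion caused by that measurement error component can be illustrated with a small simulation. The sketch below is not taken from the report; it assumes a simple Rasch model with made-up item difficulties and shows that, with only a few items per respondent, the variance of EAP point scores understates the true ability variance, which is the kind of bias direct estimation is designed to avoid.

```python
# Hypothetical sketch (not from the report): with a sparse design of few
# items per respondent, subgroup statistics computed from individual EAP
# point scores are distorted; here the score variance is shrunken well
# below the true ability variance.
import numpy as np

rng = np.random.default_rng(0)
n, n_items = 5000, 8                      # sparse design: few items each
theta = rng.normal(0.0, 1.0, n)           # true abilities, variance 1
b = np.linspace(-1.5, 1.5, n_items)       # assumed, fixed item difficulties

# simulate Rasch responses: P(x = 1 | theta) = logistic(theta - b)
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
x = (rng.random((n, n_items)) < p).astype(int)

# EAP scores via quadrature with the (correct) N(0, 1) prior
grid = np.linspace(-4, 4, 81)
prior = np.exp(-0.5 * grid**2)
pg = 1.0 / (1.0 + np.exp(-(grid[:, None] - b[None, :])))   # (81, n_items)
loglik = x @ np.log(pg).T + (1 - x) @ np.log(1 - pg).T     # (n, 81)
post = np.exp(loglik) * prior
post /= post.sum(axis=1, keepdims=True)
eap = post @ grid                          # posterior mean per respondent

print(f"true ability variance: {theta.var():.2f}")   # close to 1
print(f"EAP score variance:    {eap.var():.2f}")     # noticeably smaller
```

Even though the prior and the item parameters are exactly correct here, the variance of the point scores is attenuated; aggregating such scores by subgroup would understate subgroup variances.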
Typically, these procedures have made use of background variables along with the cognitive item responses to achieve greater accuracy in estimating subgroup characteristics than the cognitive responses alone would allow. Moreover, matrix sampling makes it impossible to compare subjects, or groups of subjects, on the basis of their observed item responses alone. Therefore, large-scale assessments using matrix sampling rely on item response theory (IRT) models (Lord & Novick, 1968; Rasch, 1960).

To estimate the subgroup statistics of interest, ETS has since 1984 employed a particular approach that integrates achievement data (item responses) and background information, such as subgroup membership and additional student variables, into a hierarchical IRT model. This approach may be referred to as “direct estimation” because ETS estimates group statistics without the use of individual test scores. For the purposes of this report, I refer to this approach as ETS-DE. The core features of the ETS-DE approach include:

1. A population model that assumes proficiencies are normally distributed conditional on a large number of background variables (grouping variables and other covariates). As a consequence, the marginal distribution (overall and for major reporting subgroups) is a mixture of normals.

2. The generation of a posterior latent trait distribution of proficiency for each individual in the sample, based on an estimate of (1); a separately estimated set of IRT parameters that are treated as fixed and known; the cognitive item responses; the respondents’ group membership; and other covariates. The mixture of these individual posterior distributions provides the estimate of the actual subgroup distributions.

3. The integration over posterior distributions of examinees and some of the model parameters (the parameters of the population model defined later) in (1) to obtain estimates of means, percentages above achievement levels, etc.

4.
The use of normal approximations for the individual posteriors and a multiple-imputation approach (the so-called plausible values) to approximate the integration in (3). Imputations are used in conjunction with conditioning models based on both cognitive item responses and background information. The imputations are a mere convenience that simplifies the integration in (3) and provides data that secondary analysts can use with standard tools.

Cohen and Jiang (1999) propose an alternative approach to direct estimation of subpopulation characteristics (which I refer to as CJ-DE in this report) that does not utilize additional background variables. Cohen and Jiang assume that CJ-DE provides consistent subgroup estimates without the use of background variables. The core features of CJ-DE include:

1. A population model that assumes marginal normality, i.e., the ability distributions of all subgroups align in such a way that the joint distribution is normal.

2. A measurement model for the categorical grouping variables that assumes an underlying continuous latent variable whose joint distribution with proficiency is normal.

3. Use of a set of fixed/known IRT model parameters.

4. Item responses that are used together with a single grouping variable only, the one used for reporting; i.e., no additional covariates such as other reporting variables or their interactions enter the population model.

5. A direct computational approach that bypasses the generation of individual posterior distributions and the generation of plausible values.

Both approaches, ETS-DE and CJ-DE, may be referred to as “direct estimation” because they estimate group statistics without the use of individual test scores. ETS-DE uses the more general model, which includes grouping variables as well as additional background information and makes no specific assumption about the marginal proficiency distribution.
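The tension between the two population models can be made concrete with a small simulation. The sketch below is illustrative only (it is not the report's Study I design): under the ETS-DE view, a marginal that mixes two conditionally normal subgroups is itself approximately normal only when the subgroup gap is small, which is exactly the regime in which a marginal-normality model can hope to fit.

```python
# Illustrative sketch (assumed setup, not the report's analysis): a 50/50
# mixture of two unit-variance normals separated by `delta` departs from
# normality (negative excess kurtosis) as the subgroup gap grows.
import numpy as np

rng = np.random.default_rng(1)

def mixture_excess_kurtosis(delta, n=200_000):
    """Sample a 50/50 mixture of N(-delta/2, 1) and N(+delta/2, 1)
    and return its excess kurtosis (0 for an exact normal)."""
    g = rng.integers(0, 2, n)                     # subgroup indicator
    x = rng.normal((g - 0.5) * delta, 1.0, n)     # conditionally normal
    z = (x - x.mean()) / x.std()
    return (z**4).mean() - 3.0

for delta in (0.2, 1.0, 2.0, 3.0):
    print(f"subgroup gap {delta}: excess kurtosis "
          f"{mixture_excess_kurtosis(delta):+.2f}")
```

For a small gap the mixture is nearly indistinguishable from a single normal; for a gap of two or three within-group standard deviations the marginal is clearly flattened, so a model that forces marginal normality cannot reproduce both subgroup means and variances.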
CJ-DE includes the assumption of marginal normality and ignores all additional background information other than a single grouping variable. This report presents a comparison of ETS-DE and CJ-DE using simulated and real data.

The ETS-DE Methodology

To obtain estimates of subpopulation distributions, ETS-DE involves a two-phase procedure that uses achievement data (item responses) and respondents’ background information. Key references for a more detailed outline of the conditioning model used by the ETS-DE method are Mislevy (1991); Mislevy, Beaton, Kaplan, and Sheehan (1992); and Thomas (1993, 2002). The two phases of the method, which sometimes are confused when discussed in the secondary literature, are:

1. Estimation of parameters for the conditioning, or population, model.

2. Production of plausible values from individual posterior distributions given the model parameters, item responses, and background data.

The Conditioning Model

The method used for analyzing large-scale assessments at ETS uses both item responses and background information, sometimes numbering up to one hundred conditioning variables. Assume that there are K scales in the assessment and that each proficiency scale follows a unidimensional IRT model with the usual assumption of conditional independence given \theta, i.e.,

P(x_{11}, \ldots, x_{J(K)K} \mid \theta) = \prod_{k=1}^{K} \prod_{j=1}^{J(k)} P(x_{jk} \mid \theta_k)    (1)

The conditioning model combines the K-scale IRT model with a K-dimensional multivariate latent regression model in order to maximize the likelihood based on the posterior distribution of the latent trait \theta = (\theta_1, \ldots, \theta_K):

L(\theta \mid x, y) = f(\theta \mid x, y) \propto \left[ \prod_{k=1}^{K} \prod_{j=1}^{J(k)} P(x_{jk} \mid \theta_k) \right] \phi(\theta \mid y)    (2)

where the prior \phi(\theta \mid y) is assumed to be normal, with \theta \mid y \sim N(\Gamma' y, \Sigma). The latent trait \theta is unobserved and must be inferred from the observed item responses. The predictor y is a vector of individual values on a set of conditioning variables, \Gamma is a matrix
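For a single scale (K = 1) with a Rasch measurement model, the posterior in Eq. (2) and the subsequent drawing of plausible values can be sketched with simple quadrature. Everything numeric below is a made-up illustration: the item difficulties, the regression coefficients Gamma, and the residual SD sigma are hypothetical values, not estimates from any assessment.

```python
# Hedged sketch of Eq. (2) for one scale: the individual posterior is the
# Rasch likelihood of the item responses times a normal prior whose mean
# Gamma'y depends on the respondent's conditioning variables y.
# Gamma, sigma, and the item difficulties b are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)

b = np.array([-1.0, -0.3, 0.4, 1.1])      # fixed/known item difficulties
Gamma = np.array([0.2, 0.5, -0.3])        # hypothetical regression weights
sigma = 0.8                               # residual SD of theta given y

def posterior(x, y, grid=np.linspace(-4, 4, 161)):
    """Discretized posterior f(theta | x, y) on a quadrature grid."""
    mu = Gamma @ y                                        # conditional prior mean
    prior = np.exp(-0.5 * ((grid - mu) / sigma) ** 2)     # phi(theta | y), unnormalized
    p = 1.0 / (1.0 + np.exp(-(grid[:, None] - b)))        # Rasch P(x_j = 1 | theta)
    lik = np.prod(np.where(x, p, 1.0 - p), axis=1)        # prod_j P(x_j | theta)
    post = lik * prior
    return grid, post / post.sum()

# one respondent: item responses and conditioning variables
x = np.array([1, 1, 0, 0])
y = np.array([1.0, 0.0, 1.0])             # e.g., intercept plus two dummies
grid, post = posterior(x, y)

# plausible values: multiple independent draws from the individual posterior
pvs = rng.choice(grid, size=5, p=post)
print("posterior mean:", (grid * post).sum())
print("plausible values:", pvs)
```

Pooling such draws over respondents, rather than averaging point scores, is what lets subgroup means, variances, and percentages above cut points be estimated consistently; the plausible values merely approximate the integration over the individual posteriors described in feature (3) above.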