Power Analysis for the Wald, LR, Score, and Gradient Tests in a Marginal Maximum Likelihood Framework: Applications in IRT

Power analysis for the Wald, LR, score, and gradient tests in a marginal maximum likelihood framework: Applications in IRT Felix Zimmer1, Clemens Draxler2, and Rudolf Debelak1 1University of Zurich 2The Health and Life Sciences University March 9, 2021 Abstract The Wald, likelihood ratio, score and the recently proposed gradient statistics can be used to assess a broad range of hypotheses in item response theory models, for instance, to check the overall model fit or to detect differential item functioning. We introduce new methods for power analysis and sample size planning that can be applied when marginal maximum likelihood estimation is used. This avails the application to a variety of IRT models, which are increasingly used in practice, e.g., in large-scale educational assessments. An analytical method utilizes the asymptotic distributions of the statistics under alternative hypotheses. For a larger number of items, we also provide a sampling-based method, which is necessary due to an exponentially increasing computational load of the analytical approach. We performed extensive simulation studies in two practically relevant settings, i.e., testing a Rasch model against a 2PL model and testing for differential item functioning. The observed distributions of the test statistics and the power of the tests agreed well with the predictions by the proposed methods. We provide an openly accessible R package that implements the methods for user-supplied hypotheses. Keywords: marginal maximum likelihood, item response theory, power analysis, Wald test, score test, likelihood ratio, gradient test We wish to thank Carolin Strobl for helpful comments in the course of this research. This work was supported by an SNF grant (188929) to Rudolf Debelak. Correspondence concerning this manuscript should be addressed to Felix Zimmer, Department of Psychology, University of Zurich, Binzmuehlestrasse 14, 8050 Zurich, Switzerland. Email: [email protected] 1 1 Introduction When applying item response theory (IRT), it is often necessary to check whether a model fits the data or whether the parameters differ between the groups of participants. To test these research questions, the Wald, likelihood ratio (LR), and score statistics are established tools (for an introduction, see e.g., Glas & Verhelst, 1995). The determination of their statistical power serves critical purposes during several phases of a research project (e.g., Cohen, 1992; Faul et al., 2009). Before data collection, it can be determined how many participants are required to achieve a certain level of power. For model fitting purposes, it answers the question of how many participants are needed to reject a wrongly assumed model with a predetermined desired probability. This is a reasonable approach of empirically substantiating interpretations based on the model, e.g., considering the participants' abilities. Furthermore, since the three statistics are only asymptotically equivalent (Silvey, 1959; Wald, 1943), one may infer which of the tests has the highest power in finite sample sizes, in particular, in cases of practically relevant sample sizes. After data collection, a power analysis provides an additional perspective on the observed effect and the size of the sample. Assuming that the null hypothesis is indeed false, one can infer the probability of rejecting it for any sample size. This can be applied when planning the sample size for an adequately powered replication study. For its critical roles during the research process, power analysis has been a major dis- cussion point in the course of the recent replication crisis (e.g. Cumming, 2014; Dwyer et al., 2018): estimating and reporting the statistical power has been firmly established as a research standard (National Academies of Sciences, En- gineering, and Medicine, 2019). However, in IRT, methods of power analysis are rarely presented. As one exception, Maydeu-Olivares and Monta~no(2013) have described a method of power analysis for some tests of contingency tables, such as the M2 test. Draxler (2010) and Draxler and Alexandrowicz (2015) provided formulas to calculate the power or sample size of the Wald, LR, and score tests in the context of exponential family models and conditional maximum likelihood (CML) estimation. Two examples of exponential family models in IRT are the Rasch (Rasch, 1960) and the partial credit models (Masters, 1982), in which the parameters represent the item difficulties. Several estimation methods are available for this family of models, in particular, CML and marginal maximum likelihood (MML) estimation (for an overview, see Baker & Kim, 2004). The advantage of the CML approach lies in the reliance on the participants' overall test score as a sufficient statistic for their underlying ability. However, models that do not belong to an exponential family cannot be estimated by CML because they do not offer a sufficient statistic. This paper treats a general class of IRT models, where each item is described by one or more types of parameters. Examples are models that extend the Rasch model by a discrimination parameter (two-parameter logistic model, 2PL) or a guessing parameter (three-parameter logistic model, 3PL; Birnbaum, 1968). These more complex IRT models are becoming increasingly important. In the Trends in International Mathematics and Science Study (TIMSS, Martin et 1 al., 2020), for example, the scaling model was changed from a Rasch to a 3PL model in 1999. More recently, the methodology in the Program for International Student Assessment (PISA, OECD, 2017) was changed from a Rasch to a 2PL model (see also Robitzsch et al., 2020). Yet, a power analysis for the Wald, LR and score tests in IRT models is currently limited by the requirement that both the null and alternative hypotheses belong to an exponential family model where each item is described by only one parameter type. An approach applicable to MML would allow either hypotheses to refer to more complex models and enable, e.g., a power analysis for a test of a Rasch model against a 2PL model. Recently, the gradient statistic was introduced as a complement to the Wald, LR, and score statistics (Lemonte, 2016; Terrell, 2002). It is asymptotically equivalent to the other three statistics without uniform superiority of one of them (Lemonte & Ferrari, 2012). It is easier to compute in many instances because it does not require the estimation of an information matrix. Concern- ing analytical power analysis in IRT, Lemonte and Ferrari (2012) provided a comparison of the power for exponential family models. Draxler et al. (2020) showcased an application in IRT where the gradient test provided a compara- tively higher power. To our knowledge, the gradient statistic and its analytical power have not yet been formulated or evaluated for linear hypotheses that form the research context of this paper. Furthermore, the gradient test has not been discussed in the MML framework in psychometrics. To address these gaps in the present literature, we propose the gradient statistic for arbitrary linear hypotheses and introduce an analytical power analysis for a general class of IRT models and MML estimation for all four mentioned statistics. The main obstacle for this is a strongly increased computational load of the analytical approach when larger numbers of items are considered. For this scenario, we present a sampling-based method that builds on and complements the analytical method. We subsequently evaluate the procedures in a test of a Rasch against a 2PL model and a test for differential item functioning (DIF) in extensive simulation studies. We contrast the power of the computationally simpler gradient test with that of the other, more established tests. The impli- cations of misspecification regarding the person parameter distribution are also investigated briefly. Furthermore, the application to real data is showcased in the context of an assessment of academic skills. We provide an R package that implements the power analysis for user-defined parameter sets and hypotheses at https://github.com/flxzimmer/pwrml. Finally, we discuss some limitations and give an outlook on possible extensions. 2 Power Analysis We will first define a general IRT model for which we assume local independence of items and unidimensional person parameters that are independent and iden- tically distributed. The probability distribution of the item response function l is expressed as fβ,θv (x), where β represents the item parameters β 2 R and θv denotes the unidimensional person parameter for person v = 1 : : : n. Fur- 2 thermore, let X be a discrete random variable with realizations x 2 f1; ::; KgI . Here, K is the number of different response categories and I is the number of items. Each possible value x is therefore a vector of length I that represents one specific pattern of answers across all items. The vector β depends on the specific model, but shall generally have length l. For the Rasch model, for example, there is only the difficulty parameter so l is equal to the number of items I. We consider the test of a linear hypothesis using the example of testing a Rasch model against a 2PL model. The use of such linear hypotheses provides a flexible framework for power analyses. The null hypothesis is expressed as T (β) = c or equivalently Aβ = c; (1) where T is a linear transformation of the item parameters, T : Rl ! Rm with m ≤ m l. Let c 2 R be a vector of constants and A 2 Mml(R) the unique matrix that represents T . In the following, we will denote a set of item parameters that follow the null hypothesis by β0 2 B0 = fβjAβ = cg. Similarly, we will refer to parameters that follow the alternative as βa 2 Ba = fβjAβ 6= cg. To describe the 2PL model for the test of a Rasch against a 2PL model, let the probability of a positive answer to item i = 1 :::I be given by 1 Pβ,θ(xi = 1) = (2) 1 + exp(−(aiθ + di)) with discrimination parameter ai and difficulty parameter di.

Power Analysis for the Wald, LR, Score, and Gradient Tests in a Marginal Maximum Likelihood Framework: Applications in IRT

Three Statistical Testing Procedures in Logistic Regression: Their Performance in Differential Item Functioning (DIF) Investigation

Testing for INAR Effects

Comparison of Wald, Score, and Likelihood Ratio Tests for Response Adaptive Designs

Robust Score and Portmanteau Tests of Volatility Spillover Mike Aguilar, Jonathan B

Econometrics-I-11.Pdf

Rao's Score Test in Econometrics

An Improved Sample Size Calculation Method for Score Tests in Generalized Linear Models Arxiv:2006.13104V1 [Stat.ME] 23 Jun 20

Lagrange Multiplier Test

Lecture 02: Statistical Inference for Binomial Parameters

Skedastic: Heteroskedasticity Diagnostics for Linear Regression

Piagnostics for Heteroscedasticity in Regression by R. Dennis Cook and Sanford Weisberg

Chapter 1 Introduction