Power analysis for the Wald, LR, score, and gradient tests in a marginal maximum likelihood framework: Applications in IRT

Felix Zimmer1, Clemens Draxler2, and Rudolf Debelak1

1University of Zurich 2The Health and Life Sciences University

March 9, 2021

Abstract The Wald, likelihood ratio, score, and the recently proposed gradient tests can be used to assess a broad range of hypotheses in item response theory models, for instance, to check the overall model fit or to detect differential item functioning. We introduce new methods for power analysis and sample size planning that can be applied when marginal maximum likelihood estimation is used. This enables the application to a variety of IRT models, which are increasingly used in practice, e.g., in large-scale educational assessments. An analytical method utilizes the asymptotic distributions of the statistics under alternative hypotheses. For a larger number of items, we also provide a sampling-based method, which is necessary because the computational load of the analytical approach increases exponentially with the number of items. We performed extensive simulation studies in two practically relevant settings, i.e., testing a Rasch model against a 2PL model and testing for differential item functioning. The observed distributions of the test statistics and the power of the tests agreed well with the predictions of the proposed methods. We provide an openly accessible R package that implements the methods for user-supplied hypotheses.

Keywords: marginal maximum likelihood, item response theory, power analysis, Wald test, score test, likelihood ratio, gradient test

We wish to thank Carolin Strobl for helpful comments in the course of this research. This work was supported by an SNF grant (188929) to Rudolf Debelak. Correspondence concerning this manuscript should be addressed to Felix Zimmer, Department of Psychology, University of Zurich, Binzmuehlestrasse 14, 8050 Zurich, Switzerland. Email: [email protected]

1 Introduction

When applying item response theory (IRT), it is often necessary to check whether a model fits the data or whether the parameters differ between groups of participants. To test such research questions, the Wald, likelihood ratio (LR), and score statistics are established tools (for an introduction, see e.g., Glas & Verhelst, 1995). The determination of their statistical power serves critical purposes during several phases of a research project (e.g., Cohen, 1992; Faul et al., 2009). Before data collection, it can be determined how many participants are required to achieve a certain level of power. For model fitting purposes, it answers the question of how many participants are needed to reject a wrongly assumed model with a predetermined desired probability. This is a reasonable approach to empirically substantiating interpretations based on the model, e.g., regarding the participants' abilities. Furthermore, since the three statistics are only asymptotically equivalent (Silvey, 1959; Wald, 1943), one may ask which of the tests has the highest power in finite samples, in particular, at practically relevant sample sizes. After data collection, a power analysis provides an additional perspective on the observed effect and the size of the sample. Assuming that the null hypothesis is indeed false, one can infer the probability of rejecting it for any sample size. This can be applied when planning the sample size for an adequately powered study. Owing to its critical roles during the research process, power analysis has been a major discussion point in the course of the recent replication crisis (e.g., Cumming, 2014; Dwyer et al., 2018): estimating and reporting the statistical power has been firmly established as a research standard (National Academies of Sciences, Engineering, and Medicine, 2019). However, in IRT, methods of power analysis are rarely presented. As one exception, Maydeu-Olivares and Montaño (2013) have described a method of power analysis for some tests of contingency tables, such as the M2 test. Draxler (2010) and Draxler and Alexandrowicz (2015) provided formulas to calculate the power or sample size of the Wald, LR, and score tests in the context of exponential family models and conditional maximum likelihood (CML) estimation. Two examples of exponential family models in IRT are the Rasch (Rasch, 1960) and the partial credit models (Masters, 1982), in which the parameters represent the item difficulties. Several estimation methods are available for this family of models, in particular, CML and marginal maximum likelihood (MML) estimation (for an overview, see Baker & Kim, 2004). The advantage of the CML approach lies in the reliance on the participants' overall test score as a sufficient statistic for their underlying ability. However, models that do not belong to an exponential family cannot be estimated by CML because they do not offer a sufficient statistic. This paper treats a general class of IRT models, where each item is described by one or more types of parameters. Examples are models that extend the Rasch model by a discrimination parameter (two-parameter logistic model, 2PL) or a guessing parameter (three-parameter logistic model, 3PL; Birnbaum, 1968). These more complex IRT models are becoming increasingly important. In the Trends in International Mathematics and Science Study (TIMSS, Martin et

al., 2020), for example, the scaling model was changed from a Rasch to a 3PL model in 1999. More recently, the methodology in the Program for International Student Assessment (PISA, OECD, 2017) was changed from a Rasch to a 2PL model (see also Robitzsch et al., 2020). Yet, a power analysis for the Wald, LR, and score tests in IRT models is currently limited by the requirement that both the null and alternative hypotheses belong to an exponential family model where each item is described by only one parameter type. An approach applicable to MML would allow either hypothesis to refer to more complex models and enable, e.g., a power analysis for a test of a Rasch model against a 2PL model. Recently, the gradient statistic was introduced as a complement to the Wald, LR, and score statistics (Lemonte, 2016; Terrell, 2002). It is asymptotically equivalent to the other three statistics without uniform superiority of one of them (Lemonte & Ferrari, 2012). It is easier to compute in many instances because it does not require the estimation of an information matrix. Concerning analytical power analysis in IRT, Lemonte and Ferrari (2012) provided a comparison of the power for exponential family models. Draxler et al. (2020) showcased an application in IRT where the gradient test provided a comparatively higher power. To our knowledge, the gradient statistic and its analytical power have not yet been formulated or evaluated for the linear hypotheses that form the research context of this paper. Furthermore, the gradient test has not been discussed in the MML framework. To address these gaps in the present literature, we propose the gradient statistic for arbitrary linear hypotheses and introduce an analytical power analysis for a general class of IRT models and MML estimation for all four mentioned statistics. The main obstacle for this is a strongly increased computational load of the analytical approach when larger numbers of items are considered. For this scenario, we present a sampling-based method that builds on and complements the analytical method. We subsequently evaluate the procedures in a test of a Rasch against a 2PL model and a test for differential item functioning (DIF) in extensive simulation studies. We contrast the power of the computationally simpler gradient test with that of the other, more established tests. The implications of misspecification regarding the person parameter distribution are also investigated briefly. Furthermore, the application to real data is showcased in the context of an assessment of academic skills. We provide an R package that implements the power analysis for user-defined parameter sets and hypotheses at https://github.com/flxzimmer/pwrml. Finally, we discuss some limitations and give an outlook on possible extensions.

2 Power Analysis

We will first define a general IRT model for which we assume local independence of items and unidimensional person parameters that are independent and identically distributed. The item response function is expressed as fβ,θv(x), where β ∈ R^l represents the item parameters and θv denotes the unidimensional person parameter for person v = 1, ..., n.

Furthermore, let X be a discrete random variable with realizations x ∈ {1, ..., K}^I. Here, K is the number of different response categories and I is the number of items. Each possible value x is therefore a vector of length I that represents one specific pattern of answers across all items. The vector β depends on the specific model, but shall generally have length l. For the Rasch model, for example, there is only the difficulty parameter, so l is equal to the number of items I.
We consider the test of a linear hypothesis using the example of testing a Rasch model against a 2PL model. The use of such linear hypotheses provides a flexible framework for power analyses. The null hypothesis is expressed as

T(β) = c or, equivalently, Aβ = c,    (1)

where T is a linear transformation of the item parameters, T: R^l → R^m with m ≤ l. Let c ∈ R^m be a vector of constants and A ∈ M_{m×l}(R) the unique matrix that represents T. In the following, we will denote a set of item parameters that follow the null hypothesis by β0 ∈ B0 = {β | Aβ = c}. Similarly, we will refer to parameters that follow the alternative as βa ∈ Ba = {β | Aβ ≠ c}.
To describe the 2PL model for the test of a Rasch against a 2PL model, let the probability of a positive answer to item i = 1, ..., I be given by

Pβ,θ(xi = 1) = 1 / (1 + exp(−(ai θ + di)))    (2)

with discrimination parameter ai and difficulty parameter di. The discrimination parameters of the 2PL model are allowed to differ between the items, while they take on a common value in the Rasch model. We can summarize the discrimination and difficulty parameters of the 2PL model introduced in (2) in a vector:

β = (a1, d1, ..., aI, dI)'.    (3)

The Rasch model is nested within the 2PL model and can be obtained if we set all discrimination parameters to a common value. One way to express the associated linear hypothesis T and design matrix A is

T(β) = Aβ = (a1 − a2, ..., ai−1 − ai, ..., aI−1 − aI)' = 0,    (4)

A = [ 1  0  −1  0  ...  0  0 ]
    [          ...           ]    (5)
    [ 0  ...  0  1  0  −1  0 ],

i.e., A is the (I − 1) × 2I matrix whose i-th row contains a 1 in the column of ai, a −1 in the column of ai+1, and zeros elsewhere.
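To make the construction in (3)–(5) concrete, the following R sketch builds the design matrix A for the Rasch-versus-2PL hypothesis for an arbitrary number of items. It is a minimal illustration under the parameter ordering of (3); the helper name build_A_rasch_vs_2pl is ours and not part of the accompanying package.

# Minimal sketch: design matrix A for the hypothesis of equal discriminations,
# assuming the parameter vector is ordered beta = (a_1, d_1, ..., a_I, d_I).
build_A_rasch_vs_2pl <- function(n_items) {
  A <- matrix(0, nrow = n_items - 1, ncol = 2 * n_items)
  for (i in seq_len(n_items - 1)) {
    A[i, 2 * i - 1] <- 1    # column of a_i
    A[i, 2 * i + 1] <- -1   # column of a_(i+1)
  }
  A
}

n_items <- 4
A <- build_A_rasch_vs_2pl(n_items)
beta_rasch <- as.vector(rbind(rep(1, n_items), rnorm(n_items)))  # common a = 1, arbitrary d
A %*% beta_rasch  # all entries zero: a Rasch parameter set satisfies A beta = c = 0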

An analytical approach for power analysis is based on the fact that the statistics under the null hypothesis asymptotically follow a different distribution than under an alternative hypothesis. Under the null hypothesis, they asymptotically follow a central χ2 distribution (Silvey, 1959; Terrell, 2002; Wald, 1943). This is equivalent to a noncentral χ2 distribution with the noncentrality parameter λ = 0. Under an alternative hypothesis, they asymptotically follow the same noncentral χ2 distribution with λ ≠ 0 (Lemonte & Ferrari, 2012). For a proof of the asymptotic distribution under an alternative, we need to assume that parameters βa converge to β0 for n → ∞. A similar method was used for CML and exponential family models (Draxler & Alexandrowicz, 2015) and for MML and tests based on contingency tables (Maydeu-Olivares & Montaño, 2013). The asymptotic distribution of the statistics depends on the consistency of the maximum likelihood (ML) parameter estimator (e.g., Casella & Berger, 2002). The proof of the consistency itself relies on regularity conditions, such as the identifiability of parameters. Under the null hypothesis and weak regularity conditions, the estimated parameters β̂ converge to the true parameters β0 and are asymptotically multivariate normally distributed,

β̂ → β0 (n → ∞)  ⟹  √n (β̂ − β0) →d N[0, Σβ0].

Now, if the estimate β̂ converges to a set of parameters following the alternative, βa, it follows that

β̂ → βa (n → ∞)  ⟹  √n (β̂ − β0) → ∞ (n → ∞).

This is the case regardless of the specific choice of βa and β0, since √n (β̂ − β0) → √n δ (n → ∞) with δ = βa − β0 ≠ 0. Since asymptotic normality does not apply in this case, the statistics will not follow a χ2 distribution. Instead, as Wald (1943) noted, the power of the respective tests converges to 1 for n → ∞. Therefore, to derive an asymptotic distribution under the alternative, we need to consider a deviation that shrinks with the sample size. We set

βn = β0 + δn / √n    (6)

and assume δn is chosen in a way that βn follows the alternative hypothesis for all n ∈ N. In this case,

β̂ → βn (n → ∞)  ⟹  √n (β̂ − β0) →d N[δn, Σβn],

from which we can conclude the asymptotic noncentral χ2 distribution. Moreover, Silvey (1959) notes the asymptotic equality of the Wald, LR, and score statistics for this scenario. The procedure to apply (6) to finite samples is as follows. Given an alternative βa, a null parameter set βr as described in the sequel in (11), and a sample size n, we simply set δn = √n (βa − βr). Then, βn = βa. With βn defined this

way, the proposed distribution will hold asymptotically, and the error for the finite sample case may be investigated for practical relevance. By application of these technical details, the noncentrality parameters λ are obtained by evaluating the statistics at the population parameters (Silvey, 1959). Specifically,

λ(β, n) = S(β, n),    (7)

where S represents the Wald, LR, score, or gradient statistic. The parameter set β represents the population parameter of the consistent ML estimator β̂. The noncentrality parameters also depend on an assumed person parameter distribution that we specify in (9), but omit from the notation. As outlined above, in case β follows an alternative, β ∈ Ba, the noncentrality parameters in (7) accurately describe the respective distributions for n → ∞ and β converging to β0. We may also calculate the noncentrality parameters to approximate the distributions in finite samples and for fixed values of β. As they rely on an asymptotic result, the agreement of the corresponding expected and observed distributions will be higher in larger samples and for lower differences |β − β0|, i.e., lower effect sizes. The reliance on these results can be considered common practice according to Agresti (2013) and has been shown to involve only minor errors in a simulation study by Draxler and Alexandrowicz (2015) in the CML context. Also, note that the test of the null hypothesis, i.e., a test against a central χ2 distribution, involves a similar assumption. To illustrate the procedure, we again consider a test of a Rasch against a 2PL model. Assuming variable discrimination parameters, and therefore a deviation from the Rasch model, all four statistics are expected to behave differently than under the null hypothesis. Using the noncentrality parameters in (7), we obtain an expected distribution for each of the statistics under the alternative (Figure 1). Note that in contrast to the asymptotic case, the noncentrality parameters are not necessarily the same for the four tests. For each of the statistics, the power is represented by the area under its curve that lies above the critical value for rejecting the null hypothesis. The formulas for the noncentrality parameters in (7) are further explained in section 2.1. For an assessment of the observed and expected distributions, we conduct extensive simulation studies for some practically relevant use cases and common sample sizes in section 3.
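Once a noncentrality parameter is available, the power of each test follows directly from the noncentral χ2 distribution, and the relationship can be inverted to plan a sample size. The following base-R sketch shows both directions; it assumes that the per-observation noncentrality λ(β, 1) has already been obtained by one of the methods described in sections 2.1 and 2.2, and the numerical values are purely hypothetical.

# Minimal sketch: power and required sample size from a per-observation
# noncentrality parameter lambda1 = lambda(beta, 1), using lambda(beta, n) = n * lambda1.
power_from_ncp <- function(n, lambda1, df, alpha = 0.05) {
  crit <- qchisq(1 - alpha, df = df)              # critical value under H0
  1 - pchisq(crit, df = df, ncp = n * lambda1)    # area above crit under H1
}

sample_size_for_power <- function(lambda1, df, target = 0.80, alpha = 0.05) {
  f <- function(n) power_from_ncp(n, lambda1, df, alpha) - target
  ceiling(uniroot(f, interval = c(2, 1e7))$root)  # smallest n reaching the target power
}

# Hypothetical example: df = 9 (Rasch vs 2PL with 10 items) and lambda(beta, 1) = 0.02
power_from_ncp(n = 1000, lambda1 = 0.02, df = 9)
sample_size_for_power(lambda1 = 0.02, df = 9)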

2.1 Expected Noncentrality Parameters

Two concepts necessary for the estimation of the expected noncentrality parameters are the expected covariance matrix and the expected restricted parameters, which we will briefly introduce.

2.1.1 Expected Covariance Matrix

Let the expected covariance matrix for a parameter set β in a sample of size n be denoted by Σ(β, n). Generally, it is given by the inverse of the information matrix, Σ(β, n) = I⁻¹(β, n).


Figure 1: Distributions of the Wald, LR, score, and gradient statistics under the null and an alternative hypothesis in a test of a Rasch versus a 2PL model. The curve labelled "Null" represents a central χ2 distribution that applies to all four statistics under the null hypothesis. For the other four curves, the colored areas under the curve represent the power of the corresponding test under the alternative.

A range of different methods has been suggested to estimate the information matrix (for an overview, see Yuan et al., 2014). In general, one can distinguish expected and observed information matrices (Efron & Hinkley, 1978). Observed information matrices require empirical data for their calculation, while expected information matrices can also be calculated in the absence of data and are therefore suitable for a power analysis. The Fisher expected information matrix is given by

IF(β, n) = −n Ex[l̈β(x)],

where

Ex[l̈β(x)] = Σ_{x∈X} l̈β(x) gβ(x),

l̈β(x) = ∂² lβ(x) / ∂β²,

lβ(x) = log(gβ(x)),    (8)

gβ(x) = Eθ[fβ,θ(x)] = ∫ fβ,θ(x) Φ(θ) dθ.    (9)

Here, X is the set of all possible response patterns, and fβ,θ is the probability distribution of the respective IRT model. Instead of θv, which refers to the ability parameter of person v in an observed dataset, we consider θ here as a general person parameter over which we take the expectation across an assumed population distribution Φ, e.g., a standard normal distribution. The probability of observing a response pattern x given β is denoted by gβ(x). Finally, the product n Ex[l̈β(x)] uses the independence of observations, i.e., persons are assumed to be drawn randomly from the population of interest. It follows that

Σ(β, n) = (1/n) Σ(β, 1).    (10)

The variances of the parameter estimators, which lie on the diagonal of the covariance matrix, decrease accordingly with increasing sample size.
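As a small worked illustration of (8)–(10), the following base-R sketch computes the marginal pattern probabilities gβ(x) of a 2PL model by numerical integration over a standard normal Φ and approximates the Fisher expected information per observation with a numerical Hessian. This is a didactic sketch under our own parameter ordering (a1, d1, ..., aI, dI), not the implementation of the accompanying package; the numDeriv package is assumed to be available.

library(numDeriv)  # assumed available; provides a numerical Hessian

n_items <- 4
patterns <- as.matrix(expand.grid(rep(list(0:1), n_items)))  # all 2^I response patterns

# Marginal probability g_beta(x) of one response pattern under a 2PL model, eq. (9),
# integrating the conditional pattern probability over a standard normal Phi.
g_pattern <- function(beta, x) {
  a <- beta[seq(1, 2 * n_items, by = 2)]
  d <- beta[seq(2, 2 * n_items, by = 2)]
  integrand <- function(theta) {
    sapply(theta, function(t) {
      p <- 1 / (1 + exp(-(a * t + d)))
      prod(p^x * (1 - p)^(1 - x))
    }) * dnorm(theta)
  }
  integrate(integrand, -Inf, Inf)$value
}

# Fisher expected information per observation: I_F(beta, 1) = -E_x[d^2 log g_beta(x) / d beta^2].
fisher_expected <- function(beta) {
  info <- matrix(0, length(beta), length(beta))
  for (r in seq_len(nrow(patterns))) {
    x <- patterns[r, ]
    H <- numDeriv::hessian(function(b) log(g_pattern(b, x)), beta)
    info <- info - H * g_pattern(beta, x)
  }
  info
}

beta <- c(1.2, 0.3, 0.8, -0.5, 1.0, 0.1, 1.5, 0.7)  # (a_1, d_1, ..., a_4, d_4)
Sigma1 <- solve(fisher_expected(beta))               # expected covariance for n = 1, eq. (10)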

2.1.2 Expected Restricted Parameters

Consider the case that the LR statistic is used to differentiate between two models, e.g., a Rasch and a 2PL model. As outlined in section 2.1.4, ML estimation is performed for both models, resulting in two separate estimated parameter sets. In the following, we want to infer some properties of the respective expected parameter sets to inform a power analysis. We refer to the ML parameters of the nested model (here, the Rasch model) as the restricted parameters βr. Considering the corresponding linear hypothesis, the restricted parameters follow the null hypothesis, βr ∈ B0. Out of all parameters that follow the null hypothesis, β0 ∈ B0, βr exhibit the highest likelihood. We may generally define them as

βr = arg max_{β0∈B0} Σ_{x∈X} lβ0(x) gβ(x)    (11)

for an analytical power analysis. Here, l is the log probability of a specific response pattern, as defined in (8). Note that gβ as specified in (9) depends on the true item parameters β. To find the maximum in (11), the algorithm by Nelder and Mead (1965) is used throughout this paper. It is the default general-purpose optimization algorithm used in the "stats" package in R (R Core Team, 2020). When β ∈ Ba follows an alternative hypothesis, the restricted parameters obtained by ML estimation cannot be the data generating model, βr ≠ β. For the purpose of a power analysis, we assume that an ML estimator β̂r converges to a unique βr for n → ∞. In our example, this implies that the ML estimates of the restricted model, which assume a common value for the discrimination parameters, are unique. As will also be illustrated in our simulation study in section 3, this assumption typically holds.
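As an illustration of (11), the following sketch finds the expected restricted parameters when the restricted model is a Rasch-type model (a single common discrimination and free difficulties) and the data-generating model is a 2PL model with unequal discriminations. It uses optim() with its Nelder-Mead default; the function names and the specific parameter values are ours and purely illustrative.

n_items <- 4
patterns <- as.matrix(expand.grid(rep(list(0:1), n_items)))

# Marginal probability of a pattern given discriminations a and difficulties d,
# with a standard normal person parameter distribution, as in (9).
g_pattern <- function(a, d, x) {
  integrand <- function(theta) {
    sapply(theta, function(t) {
      p <- 1 / (1 + exp(-(a * t + d)))
      prod(p^x * (1 - p)^(1 - x))
    }) * dnorm(theta)
  }
  integrate(integrand, -Inf, Inf)$value
}

# True (alternative) parameters: a 2PL model with unequal discriminations.
a_true <- c(0.7, 1.0, 1.3, 1.6)
d_true <- c(-0.5, 0.0, 0.5, 1.0)
g_true <- apply(patterns, 1, function(x) g_pattern(a_true, d_true, x))

# Objective of (11): the expected log-likelihood of a restricted parameter set
# beta0 = (a, d_1, ..., d_I) with one common discrimination a.
expected_loglik <- function(beta0) {
  g0 <- apply(patterns, 1, function(x) g_pattern(rep(beta0[1], n_items), beta0[-1], x))
  sum(log(g0) * g_true)
}

# Maximize with the Nelder-Mead algorithm (the default method of optim()).
fit <- optim(par = c(1, rep(0, n_items)), fn = expected_loglik,
             control = list(fnscale = -1, maxit = 2000))
beta_r <- fit$par  # expected restricted parameters: common a and item difficulties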

2.1.3 Wald Statistic

We may now derive the noncentrality parameters (7) using a similar approach as in Draxler (2010) and Draxler and Alexandrowicz (2015). Let X̂ denote an observed dataset and hX̂(x) denote the frequency of a response pattern x in the dataset X̂. The estimated item parameters β̂ and the restricted estimates β̂r are retrieved using a consistent ML estimator. The Wald statistic S1 is based on the parameter estimates and their covariances (Wald, 1943; see also Glas and Verhelst, 1995). The statistic is given by

S1(β̂, X̂) = (Aβ̂ − c)' [A Σ(β̂, X̂) A']⁻¹ (Aβ̂ − c),    (12)

where Σ(β̂, X̂) is a variance–covariance matrix using the estimated parameters and the observed dataset. If we replace the estimator β̂ with its population parameter β, and consider an expected covariance matrix for a sample of size n, we get

λ1(β, n) = S1(β, n)
         = (Aβ − c)' [A Σ(β, n) A']⁻¹ (Aβ − c)
         = n (Aβ − c)' [A Σ(β, 1) A']⁻¹ (Aβ − c),

where Σ(β, n) is an expected variance–covariance matrix as specified in section 2.1.1 and the last equality uses the inverse proportionality of the covariance matrix (10). If β follows the null hypothesis, then Aβ − c = 0 and the noncentrality parameter is λ1(β, n) = 0 for any sample size n.

2.1.4 Likelihood Ratio Statistic

The LR statistic directly compares the likelihoods of a restricted model and an unrestricted model (Silvey, 1959; see also Glas and Verhelst, 1995). The statistic for an observed dataset is given by

S2(β̂, β̂r, X̂) = 2 Σ_{x∈X} (lβ̂(x) − lβ̂r(x)) hX̂(x),

where l is given in (8). ML estimation is performed for both the unrestricted and the restricted parameter set. Analogous to the Wald statistic, the noncentrality parameter is given as

λ2(β, n) = S2(β, βr, n)
         = 2 Σ_{x∈X} (lβ(x) − lβr(x)) n gβ(x)
         = 2n Σ_{x∈X} (lβ(x) − lβr(x)) gβ(x),

where βr is calculated as in (11) with respect to a null hypothesis Aβ = c introduced in (1). Note that the expected frequency of each response pattern x in the population is given by its probability under the true parameters, gβ(x), multiplied by the sample size. This uses the assumption of independent and identically distributed person parameters. As we noted above, βr is equal to β when the null hypothesis holds; the noncentrality parameter is then λ2(β, n) = 0.

2.1.5 Score Statistic

The concept of score statistics is to estimate only the restricted parameter set and to consider the gradient of its likelihood (Rao, 1948; see also Glas and Verhelst, 1995). When the absolute gradient at the restricted parameters is high, we may conclude a bad model fit and discard the null hypothesis. This is based on the assumption that the gradient of the likelihood is close to zero at the true parameter values. The score statistic is given by

S3(β̂r, X̂) = (Σ_{x∈X} l̇β̂r(x) hX̂(x))' Σ(β̂r, X̂) (Σ_{x∈X} l̇β̂r(x) hX̂(x))    (13)

for an observed dataset, where l̇ is the first derivative of (8). For the noncentrality parameter, we arrive at

λ3(β, n) = S3(β, n)    (14)
         = (Σ_{x∈X} l̇βr(x) n gβ(x))' Σ(βr, n) (Σ_{x∈X} l̇βr(x) n gβ(x))
         = n (Σ_{x∈X} l̇βr(x) gβ(x))' Σ(βr, 1) (Σ_{x∈X} l̇βr(x) gβ(x)),

where we can immediately confirm that λ3(β, n) = 0 under the null hypothesis.

2.1.6 Gradient Statistic

The gradient statistic combines the Wald and score statistics to eliminate the need to calculate an information matrix. It was formulated by Terrell (2002) for the case that A is the identity matrix. Subsequently, it was extended to composite hypotheses, where each row of the A matrix contains one nonzero

entry (e.g., Lemonte, 2016). We propose a generalization to arbitrary linear hypotheses. First, consider an alternative version of the Wald statistic in (12) that uses a covariance matrix Σ(β̂r, X̂) that is evaluated at the restricted estimates rather than at the unrestricted estimates of the parameters. We can express it as a vector product

S1*(β̂, X̂) = [B(Aβ̂ − c)]' B(Aβ̂ − c),    (15)

where B is a solution of B'B = [A Σ(β̂r, X̂) A']⁻¹. Secondly, consider an equivalent expression of the score statistic in (13),

S3*(β̂r, X̂) = k' [A Σ(β̂r, X̂) A'] k = k' B⁻¹ (B')⁻¹ k,    (16)

where k is a vector of Lagrange multipliers (see e.g., Silvey, 1959). It is the solution of

Σ_{x∈X} l̇β̂r(x) hX̂(x) = −A'k.    (17)

The gradient statistic is then defined as the product of the square roots of (15) and (16):

S4(β̂r, X̂) = k' B⁻¹ B (Aβ̂ − c) = k' (Aβ̂ − c).    (18)

By replacing the frequency hX̂(x) by its expected value n gβ(x) in the derivation of the Lagrange multiplier (17), we obtain

λ4(β, n) = n k' (Aβ − c)

as the noncentrality parameter.

2.2 Sampling-based Noncentrality Parameters

In the above sections, we presented an analytical method for determining power in the Wald, LR, score, and gradient tests that is applicable to a general class of IRT models outside of the exponential family. The necessary computations quickly become prohibitive for a higher number of items I since calculations over all unique response patterns are required. One example is the term

Σ_{x∈X} l̇βr(x) gβ(x)

in the score statistic (14), where we need to sum over all K^I possible response patterns in X. A questionnaire with a five-point Likert scale and 100 items implies calculations for each of ∼ 7.89 · 10^69 unique patterns, which is infeasible even for modern computers. For this scenario, we propose a sampling-based approach to approximate the noncentrality parameters (7). This approach builds upon the assumption of asymptotically χ2-distributed statistics, the proportionality of the noncentrality parameter to the sample size, λ(β, n) = n λ(β, 1), as well as the asymptotic convergence of the ML estimators. Given the hypothesized population parameters of the model under the alternative hypothesis, β, and a sample of size n, the steps to calculate λ(β, 1) are:

1. Generate an artificial dataset X̂ of size n using the item response function fβ,θv and a person parameter distribution Φ.

2. Perform ML estimation and calculate the desired statistics.

3. Fit a noncentral χ2 distribution to the statistics and obtain estimates of the noncentrality parameters, using λ(β, 1) = λ(β, n)/n.

A sample size n can be chosen freely according to the available computational resources. With increasing n, the estimated noncentrality parameters converge to the noncentrality parameters calculated using the analytical method. As for the analytical method, the resulting noncentrality parameters can be used to approximate the power curve at arbitrary sample sizes; a sketch of these steps is given below.
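The following sketch outlines the three steps for the Rasch-versus-2PL hypothesis and the LR statistic only, using the mirt package. It reflects our reading of the procedure rather than the implementation in the accompanying package: the way the log-likelihoods are extracted, the moment-based noncentrality estimate in step 3, and the lognormal spread of the generating discriminations are assumptions on our side.

library(mirt)  # assumed installed; details may differ between versions

set.seed(1)
n_items <- 50
n_big   <- 100000  # large artificial sample, as in the 50-item conditions

# Step 1: generate an artificial dataset under the alternative (a 2PL model).
a_true <- rlnorm(n_items, meanlog = 0, sdlog = 0.07)  # illustrative spread of discriminations
d_true <- rnorm(n_items)
dat <- simdata(a = a_true, d = d_true, N = n_big, itemtype = "dich")

# Step 2: ML estimation of the restricted and unrestricted models and the LR statistic.
fit_rasch <- mirt(dat, 1, itemtype = "Rasch", verbose = FALSE)
fit_2pl   <- mirt(dat, 1, itemtype = "2PL",   verbose = FALSE)
lr_stat <- 2 * (extract.mirt(fit_2pl, "logLik") - extract.mirt(fit_rasch, "logLik"))

# Step 3: estimate the per-observation noncentrality parameter. Here a simple
# moment-based estimate is used (E[chi2(df, ncp)] = df + ncp), treating the
# observed statistic as an estimate of df + lambda(beta, n_big).
df <- n_items - 1
lambda1 <- max(lr_stat - df, 0) / n_big

# The power at any target sample size then follows as in the analytical approach:
1 - pchisq(qchisq(0.95, df), df, ncp = 1000 * lambda1)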

3 Evaluation

The outlined simulation study aims to test a) whether the distributions of the test statistics are accurately described by the proposed noncentral χ2 distributions and b) whether the observed power of the statistics is consistent with the respective predictions. The simulation conditions differ in the type of postulated hypothesis, the number of items, the postulated effect size, and the sample size. As a result, we compare the observed and expected statistics with regard to their distribution and power. The respective agreement of the expected and observed distribution and hit rate under the null hypothesis is considered as a benchmark. With this design, we can provide empirical evidence for the approximation of the statistics by noncentral χ2 distributions given practically relevant effect and sample sizes. In a second section, a similar procedure is used to evaluate whether an incorrect specification of the person parameter distribution leads to different results. This is relevant with respect to the robustness of the procedure in case neither the null nor the alternative hypothesis is true. We provide the complete code for all analyses performed in the supplementary material.

3.1 Design

In this section, we perform a simulation study with 36 conditions by fully crossing two different types of hypotheses, two numbers of items, three sample sizes, and three effect sizes. Under each condition, 500 artificial datasets are generated.

Type of hypothesis. The types of the postulated hypotheses are the test of a Rasch model against a 2PL model and a test for DIF in the 2PL model that we briefly introduce below.

Rasch against 2PL. One property of the Rasch model is that items exhibit equal discrimination parameters. If this is not the case, a 2PL model will generally provide a better fit for the data. A test of the null hypothesis of equal

discrimination parameters is therefore relevant to possibly discard a wrongly assumed Rasch model. We consider the 2PL model introduced in (2) and the linear hypothesis of equal discrimination parameters expressed in (4) with the design matrix A given in (5). Note that there are many different formulations of the design matrix that imply identical restricted and unrestricted models. All equivalent matrices are given by B = CA with an invertible matrix C. As can be seen from (12) and (18), the Wald and gradient statistics remain unchanged for all such choices of C. Since the implied restricted and unrestricted models are identical, the LR and score statistics are also not affected by the specific choice of C. The degrees of freedom of the associated expected noncentral distribution are equal to the number of rows in A, i.e., I − 1 (e.g., Glas & Verhelst, 1995).

DIF. We test for DIF in one item between two groups, A and B. If DIF is present, the response function differs between the two groups. For items affected by DIF, test takers with the same abilities might have different solution probabilities. This can be caused by differences in the d as well as in the a parameters. Consider the null hypothesis that the parameters of the first item are equal in both groups,

(a1A, d1A)' = (a1B, d1B)'.

Therefore, let β be structured so that the parameters of the first item in (3) are replaced by (a1A, d1A, a1B, d1B)'. The remaining items are estimated jointly for both groups and thereby serve as an anchor for the comparison of the item difficulties in the first item. The linear hypothesis T and design matrix A take the form

T(β) = (a1A − a1B, d1A − d1B)' = 0,

A = [ 1  0  −1  0  0  ...  0 ]
    [ 0  1  0  −1  0  ...  0 ].

Note that, as described above, there are many equivalent ways to set up the matrix A. The corresponding expected noncentral χ2 distribution has two degrees of freedom, i.e., the number of rows in A. For simplicity of presentation, a standard normal distribution of the person parameters is assumed in each group. If this assumption cannot be justified, one can, e.g., estimate the parameters of the person parameter distribution in each group and add corresponding rows and columns to the design matrix. Also, one may choose other anchoring strategies and, for example, estimate the remaining item parameters separately in each group.
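As a concrete illustration of this setup, the following short R sketch builds the parameter vector and the design matrix for DIF in the first item; the ordering of β and the variable names are ours.

# Minimal sketch: beta and A for DIF in item 1 of an I-item 2PL model, with
# beta = (a_1A, d_1A, a_1B, d_1B, a_2, d_2, ..., a_I, d_I).
I_items  <- 10
len_beta <- 4 + 2 * (I_items - 1)

A <- matrix(0, nrow = 2, ncol = len_beta)
A[1, c(1, 3)] <- c(1, -1)  # a_1A - a_1B
A[2, c(2, 4)] <- c(1, -1)  # d_1A - d_1B
c_vec <- c(0, 0)

# Under the null hypothesis, the group-specific parameters of item 1 coincide:
beta_null <- c(1.0, 0.2, 1.0, 0.2, rep(c(1, 0), I_items - 1))
A %*% beta_null - c_vec  # equals zero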

Number of items and sample sizes. Two different numbers of items are used, 10 and 50. The sample sizes are 500, 1000, and 3000.

Effect Sizes. Three different effect sizes labelled "no," "small," and "large" are used. The d parameters are drawn from a standard normal distribution for both types of hypotheses and all effect sizes. For the Rasch against 2PL hypothesis, the a parameters are drawn from a lognormal distribution with standard deviations of 0.15 (0.07) for the large and 0.1 (0.04) for the small effect size condition for 10 and 50 items, respectively. The standard deviation of the a parameters is chosen smaller for the 50 item condition than for the 10 item condition because this type of deviation from the Rasch model is more easily detected using larger itemsets, and we want to ensure a broad spectrum of resulting power values. For the DIF hypothesis and the small effect size, the item parameters for the first item were set to a1A = 1.1, d1A = 0.1 in the first group, and to a1B = 0.9, d1B = −0.1 in the second group. The respective group differences are increased from 0.2 to 0.4 for the large effect size. The a parameters for the remaining items are drawn from a lognormal distribution with a standard deviation of 0.1.

Estimation of statistics and noncentrality parameters. All analyses are performed with R (R Core Team, 2020). The package "mirt" (Version 1.33.2, Chalmers, 2012) is used to fit the IRT models to the artificial datasets. The maximum number of cycles used in the expectation maximization algorithm (Bock & Aitkin, 1981) is set to 5,000. For the case of 10 items, both the analytical and the sampling-based power analysis approach are used, while for 50 items, only the sampling-based approach is feasible for the computational reasons outlined in section 2.2. In the sampling-based approach, the freely selectable sample size for the artificial dataset X̂ is set to 1,000,000 for 10 items and 100,000 for 50 items. Analogously, calculation of the Fisher expected information matrix used in the observed Wald and score statistics is feasible for 10 items, but not for 50 items. In this case, we also employ a sampling-based approach using an increased sample size and an observed information matrix. Since it is the default option in mirt, the method described by Oakes (1999) is used (see Chalmers, 2018). It can be calculated quickly and converges toward the Fisher expected information matrix in large samples. It is also applied in the sampling-based power analysis approach. The increased sample size for the approximation of the Fisher expected information matrix is 100,000 for 10 items and 10,000 for 50 items. For the observed statistics in the sampling-based power analysis approach, we increase these numbers by a factor of 10 since they make up only a small part of the overall computation time.
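The following lines sketch how a single artificial dataset might be fit with these settings in mirt. The exact argument values reflect our reading of the description above (the NCYCLES limit and the Oakes information matrix); labels such as SE.type = "Oakes" may differ between mirt versions.

library(mirt)

set.seed(2)
dat <- simdata(a = rlnorm(10, 0, 0.15), d = rnorm(10), N = 500, itemtype = "dich")

# Unrestricted 2PL model with an increased EM cycle limit and Oakes-type standard errors.
fit_2pl <- mirt(dat, model = 1, itemtype = "2PL", SE = TRUE, SE.type = "Oakes",
                technical = list(NCYCLES = 5000), verbose = FALSE)

# Restricted model for the Rasch-versus-2PL hypothesis.
fit_rasch <- mirt(dat, model = 1, itemtype = "Rasch", SE = TRUE, SE.type = "Oakes",
                  technical = list(NCYCLES = 5000), verbose = FALSE)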

Analysis methods. For each artificial dataset, the respective hypothesis is tested by fitting the implied IRT models and calculating the statistics. QQ-plots are used for visual evaluation of the resulting distributions of the statistics. To provide a further means of investigation, the distribution of the observed values is tested against the expected distribution using Kolmogorov–Smirnov tests. Although the results of these tests depend on the number of iterations, they provide further information for comparing the conditions. A significance

level of .05 and a Bonferroni correction for the number of performed tests for each combination of hypothesis type and number of items, i.e., the number of conditions shown in each QQ-plot, are applied. For the plots displaying the observed and expected hit rate, a 95% confidence envelope is plotted as a visual anchor (see also Fox, 2016). The width of the confidence envelope is calculated according to

1.96 · √(nr · pe · (1 − pe)) / nr,

where nr denotes the number of simulation runs and pe the expected hit rate.

3.2 Results

3.2.1 Agreement of the Distributions

In all but one condition, no significant deviation between the expected and observed distributions could be found. The respective QQ-plots are included in Appendix A. The sampling-based approach displays an equally good fit as the analytical method for the 10 item conditions. For 50 items and the Rasch versus 2PL hypothesis, some deviations between the expected and observed distributions are visible for the Wald statistic (Figure 2). These differences are also statistically significant according to Kolmogorov–Smirnov tests (Appendix B). As theoretically expected, we note that the agreement increases with higher sample sizes. Under some conditions, we can observe that the lines disperse at the higher expected values. This is expected to a certain degree since visual deviations from the diagonal are more likely at higher expected values, because the number of values falling in this range is smaller and the resulting pattern is more volatile.

3.2.2 Agreement of Power

Figures 3 and 4 visualize the expected and observed hit rates for 10 and 50 items. The table in Appendix C displays the results in more detail for each condition. It can be seen that the expected and observed hit rates are largely in agreement under most conditions. Most observed values are within the confidence envelope, which reflects the expected variability given perfect agreement and the number of simulations. For the condition of Rasch against 2PL and 50 items, the observed hit rate of the Wald test is below the expectation. This observation reflects the lower agreement of the observed and expected distributions for these conditions. The gradient statistic has a numerically larger power than the other three statistics under most conditions.

3.3 Performance under Misspecification

We aim to investigate the robustness of the method in case neither the null nor the alternative hypothesis is true. Therefore, we generate artificial data with two non-normal distributions of the person parameter and otherwise proceed analogously to the evaluation for the Rasch against the 2PL hypothesis and 10 items presented above.


Figure 2: QQ-plots for the Rasch versus 2PL hypothesis with 50 items and the sampling-based power analysis method. Combinations where Kolmogorov–Smirnov tests at the 5% level indicated significant deviations from the expected χ2 distributions are marked with a star in the respective panel; see Appendix A for the full results. A Bonferroni correction for 27 comparisons was applied.


Figure 3: Observed and expected hit rates using the analytical power analysis approach. The gray area represents a 95% confidence envelope of the expected hit rate.


Figure 4: Observed and expected hit rates using the sampling-based power analysis approach. The gray area represents a 95% confidence envelope of the expected hit rate.

We refrain from reporting Kolmogorov–Smirnov tests for the agreement of distributions since significant deviations are expected. The goal of this simulation study is rather to gain a qualitative understanding of the extent of the deviations. To keep our presentation concise, we use only the analytical power analysis method in the following; however, analogous results can be expected for the sampling-based approach.

Non-normal person parameter distributions. Uniform and skewed normal distributions are used. The parameters of both are set so that the expected mean and standard deviation are 0 and 1, respectively. The remaining degree of freedom (i.e., the alpha parameter of the skewed normal distribution) is chosen as 4. The skewness of the skewed normal distribution is therefore ∼ 0.784.

3.3.1 Results

Figures 5 and 6 show the QQ-plots for the uniform and skewed-normal distributions. Figures 7 and 8 show the corresponding expected and observed hit rates. The table in Appendix D displays the results for the observed and expected hit rates in more detail. For the uniform distribution condition, the statistics exhibit higher agreement under the null hypothesis than under the alternative hypothesis. Specifically, the observed power is higher than expected for larger sample sizes. For the skewed-normal distribution and both the null and alternative hypotheses, the agreement is lower with higher sample sizes. However, the size of the statistic is not uniformly underestimated, as it lies below the expectation for the condition of a large sample and a large effect size. This is also reflected in the observed versus the expected hit rate, where most, but not all, conditions show a higher hit rate than expected.

4 Real Data Application

The introduced method is now applied to data for which the fit of a Rasch against a 2PL model has been tested in the literature. Bock and Lieberman (1970) tested this hypothesis using the Law School Admission Test (LSAT), separately for sections 6 and 7 (termed LSAT6 and LSAT7). The datasets are publicly available in the mirt package (Chalmers, 2012). They consist of five variables and 1000 observations each. The estimated parameters when fitted without restrictions are given in Table 1. Apparently, the items in LSAT7 show considerable variance in the discrimination parameters. In a formal test, the null hypothesis of equal discrimination parameters in the LSAT7 is to be rejected for all four statistics,

χ2(4, N = 1,000) = [10.53, 12.19, 11.47, 13.98], p = [.032, .016, .022, .007].


Figure 5: QQ-plots for the Rasch versus 2PL hypothesis with 10 items using the analytical power analysis and uniformly distributed person parameters.

Table 1: Estimated Parameters in the LSAT6 and LSAT7 datasets

Dataset   Parameter   Item 1   Item 2   Item 3   Item 4   Item 5
LSAT6     a            0.825    0.723    0.890    0.689    0.658
          d            2.773    0.990    0.249    1.285    2.054
LSAT7     a            0.988    1.081    1.706    0.765    0.736
          d            1.856    0.808    1.804    0.486    1.855


Figure 6: QQ-plots for the Rasch versus 2PL hypothesis with 10 items using the analytical power analysis and skewed-normal distributed person parameters.


Figure 7: Observed and expected hit rates for the Rasch versus 2PL hypothesis with 10 items using the analytical power analysis and uniformly distributed person parameters. The gray area represents a 95% confidence envelope of the expected hit rate.

For the LSAT6 dataset, no significant difference could be detected for any of the statistics,

χ2(4, N = 1,000) = [.70, .57, .44, .59], p = [.952, .967, .979, .964].

For the power analysis, we used the parameter estimates given in Table 1 as the 2PL model item parameters under the alternative hypothesis, a sample size of n = 1,000, and the analytical method. The resulting power values for the Wald, LR, score, and gradient statistics were 8.7%, 9.0%, 9.0%, and 9.1%, respectively, for the LSAT6 and 74.1%, 85.9%, 84.4%, and 90.5%, respectively, for the LSAT7. The general relationship of power and sample size is displayed in Figure 9. For example, it follows that for a replication of the finding in the LSAT7 with a power of 80%, a smaller sample size of n = 764 would be sufficient for the gradient statistic. The finding that a sample size of n = 1,134 is required for the same level of power with the Wald statistic underscores the importance of the choice of statistic. On the other hand, to reject the null hypothesis (i.e., the Rasch model) when considering the LSAT6, a considerably larger minimum sample size (n = 15,621) would be required. For the LSAT6, in addition to the non-significant hypothesis test, this suggests that the Rasch model already provides a good description of the data.


Figure 8: Observed and expected hit rates for the Rasch versus 2PL hypothesis with 10 items using the analytical power analysis and skewed-normal distributed person parameters. The gray area represents a 95% confidence envelope of the expected hit rate.


Figure 9: Power curves in a test of equal discrimination parameters of the LSAT6 and LSAT7 datasets.

5 Discussion

In this paper, we proposed two new methods of power analysis to test linear hypotheses using the Wald, LR, score, and gradient tests, applicable to a general class of IRT models and MML estimation. In particular, we introduced an analytical as well as a sampling-based approach to approximate the distribution of the statistics under alternative hypotheses. In extensive simulation studies involving two practically relevant hypotheses and MML estimation, we demonstrated that the generated approximation suffices for practical purposes and may be applied in power estimation and sample size planning. Under the studied conditions, the power of the relatively new and computationally advantageous gradient statistic was on par with the other three statistics. To make the new methods publicly available, we provide an implementation in the form of an R package available at https://github.com/flxzimmer/pwrml. It offers power analysis and sample size planning for user-defined parameter sets and hypotheses.
In the results of the simulation studies, we observed that the Wald statistic is lower than expected in some conditions for the Rasch versus 2PL hypothesis and 50 items. This decrease in agreement from the 10 item to the 50 item condition is, however, not visible for the DIF hypothesis. One possible explanation for this is the difference in the degrees of freedom in the Rasch against 2PL conditions. Here, the degrees of freedom are 9 for 10 items and 49 for 50 items, while they are 2 in both cases for the DIF hypothesis. If we assume a simple deviation of the statistic, i.e., that the Wald statistic is slightly lower than expected, a higher number of degrees of freedom implies a higher power to detect the deviation both visually and statistically. For the Wald statistic specifically, this is mediated by a higher number of summands in (12). We conclude that, since some disagreement is also visible under the null hypothesis, accuracy generally increases with a lower effect size and a higher sample size. Especially for the Wald statistic and high degrees of freedom, one needs to ensure a large enough sample for the observed statistics to converge sufficiently close toward the asymptotic distributions assumed by the proposed methods.
When the assumption about the person parameter distribution is wrong, we show that the accuracy of the power analysis is reduced and that the power estimate is rather conservative in the two scenarios. Under a skewed normal person parameter distribution, however, we also observed a condition where the power was lower than expected. Therefore, we recommend a thorough verification of the assumptions made when using the presented method. Note that in case the person parameters are expected to be non-normally distributed, our approaches can also be applied to estimate the power or required sample size. For the analytical approach, one may therefore replace the person parameter distribution Φ in (9). For the sampling-based approach, one may choose a different person parameter distribution in step 1.
In the application example, we demonstrated the calculation of power and the planning of the sample size on an empirical dataset. It was shown how one may infer which of the four statistics has the highest power and

accordingly requires the lowest sample size to reject a false null hypothesis. Although the gradient statistic is less established than the other three statistics (Draxler et al., 2020), we found that it exhibits the highest power in this specific scenario. This adds to the finding that the gradient test numerically offered the highest power in our simulation study. Hence, we call for further investigation of the pros and cons of the gradient statistic in IRT applications. This is especially relevant to applications with higher numbers of items (e.g., 100), since calculating the gradient statistic generally involves the lowest computational effort of the four discussed statistics.

5.1 Limitations

Computational limits of the presented analytical method are reached quickly when large itemsets are considered. The required calculation steps grow with 2^I for dichotomous items since they imply going over all possible patterns for all studied statistics; therefore, for larger itemsets, only the sampling-based approach is feasible. The accuracy of the sampling-based approach was similar to the analytical approach for smaller itemsets. Since the sampling-based approach can be expected to converge to the analytical approach with increasing computational resources, we recommend using the analytical method as long as it is feasible with respect to the number of items. To facilitate the decision on the approach to be used, we provide a function to estimate the computation time of the analytical approach in the R package.
The general line of argumentation in this paper involves the application of asymptotic mathematical results to finite samples. It must be expected that the results are more prone to error with smaller sample sizes and larger effect sizes; however, we note that the dependence on the sample size also applies to the statistics under a true null hypothesis. To avoid this problem, one can use simulation-based approaches (e.g., Wilson et al., 2020). The main disadvantage of these is that, for sample size planning in particular, a high computational load can be expected to approximate the relationship between sample size and power. Therefore, in practice, assuming that sample sizes are usually sufficiently large and that mild deviations from the null hypothesis are already considered relevant, the approaches presented in this paper might be preferable.
One important step in an analysis of power is the selection of plausible and practically relevant deviations from the hypothesis or model of interest. There are usually multiple plausible alternative models against which the studied tests should have power. In the instance of a Rasch model as the null hypothesis, one might, besides the presence of DIF or varying item discriminations, also consider the influence of guessing. The 3PL model extends the 2PL model by a guessing parameter that lies in the range of 0 to 1, where 0 represents the absence of guessing and the sufficiency of a 2PL model. When a 3PL model is fit to data generated by a Rasch model, the guessing parameters cannot be expected to be 0 on average since they lie on the boundary of the parameter space. Consequently, the Wald, LR, and gradient statistics deviate from a central χ2 distribution in this scenario and the hypothesis test cannot be performed as

usual. However, the score test is still applicable in this scenario because it uses only the fit of the 2PL model. Accordingly, our approaches to power analysis can be reasonably applied only to the score test for this hypothesis type.

5.2 Outlook

This work can be extended along the dimensions of the ML estimator used and the hypotheses evaluated. Although we set a focus on MML estimation, the presented methods can also be combined with other consistent ML estimators, e.g., pairwise maximum likelihood estimation (Katsikatsou et al., 2012). In the evaluation section, we focused on two alternative hypotheses that are relevant in practice; however, they represent only a fraction of the hypotheses already available under the framework of linear hypotheses. Future studies may explore further alternatives, such as DIF in more than one item. The presented analytical framework as well as the code implementation have therefore been designed for a straightforward extension to further usage scenarios.
From the aspect of practical relevance, effect sizes must be considered. The researcher faces the task of selecting a suitable alternative model and thereby a relevant effect size. This must be done carefully, as the effect size should be large enough to represent a relevant violation, but also not so large that practically relevant violations might be overlooked when the sample size is chosen in accordance with it. A suitable discussion of this practical problem is given by Draxler (2010). Furthermore, there are general measures of difference between statistical models available (e.g., the Kullback–Leibler divergence; Kullback & Leibler, 1951). We can also consider specific item parameter sets and their distances from the respective null models as an effect size (Steinberg & Thissen, 2006). Yet, identical parameter differences can yield differences in the sample sizes required for detection since they are contingent on their absolute location. Thus, as noted by Draxler (2010), we might also consider the noncentrality parameters as an additional descriptor of effect size. Future studies are needed to further investigate the feasibility of the suggested effect sizes in IRT.
Another area of extension is the implementation of further covariates. An example is the group variable in the DIF hypothesis. Kim et al. (1995) presented an approach to extend the Wald test to multiple groups and variable group sizes. Our approach can be easily extended to provide a power analysis of all four of the statistics in this scenario. Covariates can also influence the person parameter, e.g., when two groups have normally distributed θ parameters with different means. We can then estimate the means of the person parameter distribution separately for each group and calculate the statistics and their analytical power by including the parameters in the β vector.
In our application of the linear hypothesis framework to investigate DIF, we used an implicit anchor to compare the parameter values of the first item. Specifically, we jointly estimated the remaining items to put the values under consideration on a common scale. To avoid misspecification, it is important to ensure that no DIF is present in these items, which might be difficult to

establish in practice. One workaround is to estimate all items separately for both groups and apply an anchoring strategy afterwards (for an overview, see Kopf et al., 2015). The interplay between such anchoring strategies and power may be the subject of further studies. The presented methods could be used to further investigate the general result that power increases with a higher number of items used in the anchor (e.g., Kopf et al., 2013).
In case an implementation of model fitting is only available for the model representing the null hypothesis but not for the alternative, we may still apply the score test for both fit and power analysis. For example, we may accommodate non-linear covariates that impact the item difficulty parameter d. This is possible because the score test uses only the restricted model. In addition to extensions to different models, we can also consider further tests. For example, an analytical power analysis of the LR test by Andersen (1973) for assessing a global model fit is not yet available (Baker & Kim, 2004).

References

Agresti, A. (2013). Categorical Data Analysis (3rd ed., Vol. 792). Wiley.
Andersen, E. B. (1973). Conditional Inference and Models for Measuring (Vol. 5). Mentalhygiejnisk forlag.
Baker, F. B., & Kim, S.-H. (2004). Item Response Theory. CRC Press.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Addison-Wesley.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459. https://doi.org/10.1007/BF02293801
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35(2), 179–197. https://doi.org/10.1007/BF02291262
Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd Edition). Duxbury.
Chalmers, R. P. (2012). mirt: A Multidimensional Item Response Theory Package for the R Environment. Journal of Statistical Software, 48(6). https://doi.org/10.18637/jss.v048.i06
Chalmers, R. P. (2018). Numerical approximation of the observed information matrix with Oakes' identity. The British Journal of Mathematical and Statistical Psychology, 71(3), 415–436. https://doi.org/10.1111/bmsp.12127
Cohen, J. (1992). Statistical Power Analysis. Current Directions in Psychological Science, 1(3), 98–101. https://doi.org/10.1111/1467-8721.ep10768783
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
Draxler, C. (2010). Sample Size Determination for Rasch Model Tests. Psychometrika, 75(4), 708–724. https://doi.org/10.1007/s11336-010-9182-4
Draxler, C., & Alexandrowicz, R. W. (2015). Sample Size Determination Within the Scope of Conditional Maximum Likelihood Estimation with Special Focus on Testing the Rasch Model. Psychometrika, 80(4), 897–919. https://doi.org/10.1007/s11336-015-9472-y
Draxler, C., Kurz, A., & Lemonte, A. J. (2020). The gradient test and its finite sample size properties in a conditional maximum likelihood and psychometric modeling context. Communications in Statistics - Simulation and Computation, 1–19. https://doi.org/10.1080/03610918.2019.1710193
Dwyer, D. B., Falkai, P., & Koutsouleris, N. (2018). Machine Learning Approaches for Clinical Psychology and Psychiatry. Annual Review of Clinical Psychology, 14, 91–118. https://doi.org/10.1146/annurev-clinpsy-032816-045037
Efron, B., & Hinkley, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika, 65(3), 457–483. https://doi.org/10.1093/biomet/65.3.457

26 Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: tests for correlation and regression analy- ses. Behavior research methods, 41 (4), 1149–1160. https://doi.org/10. 3758/BRM.41.4.1149 Fox, J. (2016). Applied and generalized linear models (Third Edition). SAGE. Glas, C. A. W., & Verhelst, N. D. (1995). Testing the Rasch Model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch Models (pp. 69–95). Springer- Verlag. Katsikatsou, M., Moustaki, I., Yang-Wallentin, F., & J¨oreskog, K. G. (2012). Pairwise likelihood estimation for models with ordinal data. Computational Statistics & Data Analysis, 56 (12), 4243–4258. https://doi.org/10.1016/j.csda.2012.04.010 Kim, S.-H., Cohen, A. S., & Park, T.-H. (1995). Detection of Differential Item Functioning in Multiple Groups. Journal of Educational Measurement, 32 (3), 261–276. https://doi.org/10.1111/j.1745-3984.1995.tb00466.x Kopf, J., Zeileis, A., & Strobl, C. (2013). Anchor methods for DIF detection: A comparison of the iterative forward, backward, constant and all-other anchor class. https://doi.org/10.5282/UBM/EPUB.14759 Kopf, J., Zeileis, A., & Strobl, C. (2015). Anchor Selection Strategies for DIF Analysis: Review, Assessment, and New Approaches. Educational and Psychological Measurement, 75 (1), 22–56. https://doi.org/10.1177/ 0013164414529792 Kullback, S., & Leibler, R. A. (1951). On Information and Sufficiency. The Annals of , 22 (1), 79–86. https://doi.org/10. 1214/aoms/1177729694 Lemonte, A. J. (2016). The gradient test: Another likelihood-based test. Else- vier/AP Academic Press is an imprint of Elsevier. Lemonte, A. J., & Ferrari, S. L. P. (2012). The local power of the gradient test. Annals of the Institute of Statistical Mathematics, 64 (2), 373–381. https://doi.org/10.1007/s10463-010-0315-4 Martin, M. O., von Davier, M., & Mullis, I. V. S. (Eds.). (2020). Methods and Procedures: TIMSS 2019 Technical Report. TIMSS & PIRLS Interna- tional Study Center. Masters, G. N. (1982). A rasch model for partial credit scoring. Psychometrika, 47 (2), 149–174. https://doi.org/10.1007/BF02296272 Maydeu-Olivares, A., & Monta˜no,R. (2013). How should we assess the fit of Rasch-type models? Approximating the power of goodness-of-fit statis- tics in categorical data analysis. Psychometrika, 78 (1), 116–133. https: //doi.org/10.1007/s11336-012-9293-1 National Academies of Sciences, Engineering, and Medicine. (2019). Repro- ducibility and Replicability in Science. https://doi.org/10.17226/25303 Nelder, J. A., & Mead, R. (1965). A Simplex Method for Function Minimiza- tion. The Computer Journal, 7 (4), 308–313. https://doi.org/10.1093/ comjnl/7.4.308

27 Oakes, D. (1999). Direct calculation of the information matrix via the EM. Jour- nal of the Royal Statistical Society: Series B (Statistical Methodology), 61 (2), 479–482. https://doi.org/10.1111/1467-9868.00188 OECD. (2017). PISA 2015 Technical Report. OECD Publishing. R Core Team. (2020). R: A Language and Environment for Statistical Comput- ing. https://www.R-project.org/ Rao, C. R. (1948). Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44 (1), 50–57. https: //doi.org/10.1017/s0305004100023987 Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Danish Institute for Educational Research. Robitzsch, A., L¨udtke, O., Goldhammer, F., Kroehne, U., & K¨oller,O. (2020). Reanalysis of the German PISA Data: A Comparison of Different Ap- proaches for Trend Estimation With a Particular Emphasis on Effects. Frontiers in Psychology, 884. https://doi.org/10.3389/fpsyg. 2020.00884 Silvey, S. D. (1959). The Lagrangian Multiplier Test. The Annals of Mathemat- ical Statistics, 30 (2), 389–407. http://www.jstor.org/stable/2237089 Steinberg, L., & Thissen, D. (2006). Using effect sizes for research reporting: examples using item response theory to analyze differential item func- tioning. Psychological methods, 11 (4), 402–415. https://doi.org/10. 1037/1082-989x.11.4.402 Terrell, G. R. (2002). The gradient statistic. Computing Science and Statistics, 34 (34), 206–215. Wald, A. (1943). Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations is Large. Transactions of the Ameri- can Mathematical Society, 54 (3), 426. https://doi.org/10.2307/1990256 Wilson, D. T., Hooper, R., Brown, J., Farrin, A. J., & Walwyn, R. E. (2020). Ef- ficient and flexible simulation-based sample size determination for clini- cal trials with multiple design parameters. Statistical methods in medical research, 962280220975790. https://doi.org/10.1177/0962280220975790 Yuan, K.-H., Cheng, Y., & Patton, J. (2014). Information matrices and standard errors for MLEs of item parameters in IRT. Psychometrika, 79 (2), 232– 254. https://doi.org/10.1007/s11336-013-9334-4

Appendix A

[Figure A1 ("Rasch vs 2PL, 10 items, analytical power"): QQ-plots of observed versus expected quantiles of the Wald, LR, score, and gradient statistics for n = 500, 1000, and 3000 under H0 (no effect), H1 (small effect), and H1 (large effect).]

Figure A1: QQ-plots for the Rasch versus 2PL hypothesis with 10 items and the analytical power analysis method. Kolmogorov–Smirnov tests at the 5% level indicated no significant deviations from the expected χ2 distributions. A Bonferroni correction for 27 comparisons was applied.
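The Kolmogorov–Smirnov checks reported in Figures A1 to A5 and Tables B1 and B2 can be reproduced in principle with a few lines of R. The following sketch compares simulated test statistics to their expected (noncentral) chi-square distribution and applies the Bonferroni threshold for 27 conditions; the degrees of freedom, the noncentrality parameter, and the simulated statistics are placeholders, not values taken from this paper.

set.seed(1)
df_expected  <- 9     # placeholder: equals the number of restricted parameters
ncp_expected <- 12    # placeholder noncentrality under the alternative; 0 under H0
stats <- rchisq(500, df = df_expected, ncp = ncp_expected)   # stand-in for 500 simulated statistics

ks <- ks.test(stats, "pchisq", df = df_expected, ncp = ncp_expected)
alpha_bonf <- 0.05 / 27   # Bonferroni correction for the 27 conditions per QQ-plot
c(D = unname(ks$statistic), p = ks$p.value,
  significant = as.numeric(ks$p.value < alpha_bonf))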

[Figure A2 ("Rasch vs 2PL, 10 items, sampling-based power"): QQ-plots of observed versus expected quantiles of the Wald, LR, score, and gradient statistics for n = 500, 1000, and 3000 under H0 (no effect), H1 (small effect), and H1 (large effect).]

Figure A2: QQ-plots for the Rasch versus 2PL hypothesis with 10 items and the sampling-based power analysis method. Kolmogorov–Smirnov tests at the 5% level indicated no significant deviations from the expected χ2 distributions. A Bonferroni correction for 27 comparisons was applied.

[Figure A3 ("DIF, 10 items, analytical power"): QQ-plots of observed versus expected quantiles of the Wald, LR, score, and gradient statistics for n = 500, 1000, and 3000 under H0 (no effect), H1 (small effect), and H1 (large effect).]

Figure A3: QQ-plots for the DIF hypothesis with 10 items and the analytical power analysis method. Kolmogorov–Smirnov tests at the 5% level indicated no significant deviations from the expected χ2 distributions. A Bonferroni correction for 27 comparisons was applied.

[Figure A4 ("DIF, 10 items, sampling-based power"): QQ-plots of observed versus expected quantiles of the Wald, LR, score, and gradient statistics for n = 500, 1000, and 3000 under H0 (no effect), H1 (small effect), and H1 (large effect).]

Figure A4: QQ-plots for the DIF hypothesis with 10 items and the sampling-based power analysis method. Kolmogorov–Smirnov tests at the 5% level indicated no significant deviations from the expected χ2 distributions. A Bonferroni correction for 27 comparisons was applied.

[Figure A5 ("DIF, 50 items, sampling-based power"): QQ-plots of observed versus expected quantiles of the Wald, LR, score, and gradient statistics for n = 500, 1000, and 3000 under H0 (no effect), H1 (small effect), and H1 (large effect).]

Figure A5: QQ-plots for the DIF hypothesis with 50 items and the sampling-based power analysis method. Kolmogorov–Smirnov tests at the 5% level indicated no significant deviations from the expected χ2 distributions. A Bonferroni correction for 27 comparisons was applied.

Appendix B

Table B1: Kolmogorov–Smirnov tests of the observed distribution versus the expected distribution for the Rasch vs 2PL hypothesis

                                      Wald           LR             Score          Gradient
Items  Effect size  n     Method      D      p       D      p       D      p       D      p
10     no           500   analytical  0.029  .808    0.032  .670    0.035  .563    0.042  .342
10     no           1000  analytical  0.025  .904    0.033  .656    0.028  .839    0.031  .725
10     no           3000  analytical  0.035  .557    0.037  .496    0.038  .459    0.038  .468
10     small        500   analytical  0.028  .843    0.050  .167    0.057  .082    0.059  .063
10     small        1000  analytical  0.034  .625    0.058  .067    0.046  .236    0.055  .095
10     small        3000  analytical  0.038  .464    0.037  .490    0.039  .417    0.039  .439
10     large        500   analytical  0.029  .786    0.045  .273    0.042  .331    0.049  .181
10     large        1000  analytical  0.049  .180    0.038  .456    0.035  .586    0.034  .598
10     large        3000  analytical  0.034  .624    0.043  .319    0.044  .284    0.037  .490
10     no           500   sampling    0.029  .808    0.032  .670    0.035  .563    0.042  .342
10     no           1000  sampling    0.025  .904    0.033  .656    0.028  .839    0.031  .725
10     no           3000  sampling    0.035  .557    0.037  .496    0.038  .459    0.038  .468
10     small        500   sampling    0.029  .805    0.051  .152    0.057  .076    0.060  .057
10     small        1000  sampling    0.036  .520    0.060  .056    0.047  .210    0.057  .081
10     small        3000  sampling    0.043  .317    0.039  .419    0.041  .366    0.041  .379
10     large        500   sampling    0.031  .725    0.049  .175    0.047  .210    0.054  .103
10     large        1000  sampling    0.041  .373    0.028  .823    0.024  .933    0.024  .936
10     large        3000  sampling    0.050  .164    0.063  .040    0.065  .031    0.058  .069
50     no           500   sampling    0.114  <.001*  0.033  .644    0.027  .871    0.045  .252
50     no           1000  sampling    0.076  .006    0.058  .067    0.063  .038    0.064  .034
50     no           3000  sampling    0.031  .720    0.035  .578    0.031  .723    0.037  .515
50     small        500   sampling    0.136  <.001*  0.041  .362    0.046  .242    0.052  .135
50     small        1000  sampling    0.098  <.001*  0.053  .122    0.058  .069    0.051  .155
50     small        3000  sampling    0.065  .028    0.048  .193    0.047  .216    0.047  .222
50     large        500   sampling    0.150  <.001*  0.029  .782    0.026  .896    0.043  .319
50     large        1000  sampling    0.093  <.001*  0.028  .829    0.024  .925    0.034  .610
50     large        3000  sampling    0.048  .198    0.035  .564    0.034  .596    0.037  .493
Note. n: sample size. The degrees of freedom are 500 for each test. * significant after Bonferroni correction for n = 27 comparisons, i.e., the number of conditions in each QQ-plot.

Table B2: Kolmogorov–Smirnov tests of the observed distribution versus the expected distribution for the DIF hypothesis

                                      Wald           LR             Score          Gradient
Items  Effect size  n     Method      D      p       D      p       D      p       D      p
10     no           500   analytical  0.038  .455    0.038  .464    0.040  .394    0.038  .478
10     no           1000  analytical  0.038  .473    0.037  .492    0.039  .438    0.037  .494
10     no           3000  analytical  0.066  .027    0.067  .023    0.065  .029    0.067  .023
10     small        500   analytical  0.031  .723    0.037  .512    0.030  .751    0.040  .397
10     small        1000  analytical  0.038  .479    0.037  .484    0.036  .536    0.037  .487
10     small        3000  analytical  0.031  .733    0.033  .646    0.033  .639    0.034  .609
10     large        500   analytical  0.052  .130    0.042  .336    0.041  .380    0.040  .415
10     large        1000  analytical  0.047  .213    0.049  .177    0.047  .216    0.050  .171
10     large        3000  analytical  0.036  .536    0.034  .624    0.035  .581    0.033  .629
10     no           500   sampling    0.038  .455    0.038  .464    0.040  .394    0.038  .478
10     no           1000  sampling    0.038  .473    0.037  .492    0.039  .438    0.037  .494
10     no           3000  sampling    0.066  .027    0.067  .023    0.065  .029    0.067  .023
10     small        500   sampling    0.031  .731    0.036  .528    0.030  .755    0.040  .408
10     small        1000  sampling    0.037  .495    0.037  .506    0.036  .542    0.037  .506
10     small        3000  sampling    0.032  .700    0.034  .601    0.034  .628    0.035  .570
10     large        500   sampling    0.057  .076    0.047  .209    0.046  .239    0.045  .260
10     large        1000  sampling    0.057  .075    0.061  .047    0.059  .059    0.062  .042
10     large        3000  sampling    0.034  .593    0.040  .410    0.039  .437    0.046  .249
50     no           500   sampling    0.033  .646    0.034  .610    0.035  .589    0.032  .674
50     no           1000  sampling    0.049  .189    0.053  .123    0.054  .113    0.054  .113
50     no           3000  sampling    0.024  .945    0.023  .957    0.018  .996    0.023  .947
50     small        500   sampling    0.035  .564    0.025  .914    0.027  .851    0.022  .963
50     small        1000  sampling    0.030  .775    0.034  .621    0.031  .737    0.034  .625
50     small        3000  sampling    0.045  .258    0.046  .244    0.051  .141    0.044  .278
50     large        500   sampling    0.068  .018    0.042  .332    0.044  .284    0.038  .462
50     large        1000  sampling    0.047  .217    0.069  .016    0.065  .028    0.076  .007
50     large        3000  sampling    0.063  .036    0.069  .017    0.076  .006    0.069  .017
Note. n: sample size. The degrees of freedom are 500 for each test. * significant after Bonferroni correction for n = 27 comparisons, i.e., the number of conditions in each QQ-plot.

Appendix C

Table C1: Observed and expected hit rates for the Rasch vs 2PL hypothesis

[Table C1 body: observed hit rate (OHR), expected hit rate (EHR), and 95% confidence envelope (Env) of the Wald, LR, score, and gradient tests for 10 and 50 items, no, small, and large effect sizes, n = 500, 1000, and 3000, and the analytical and sampling-based methods.]

Note. n: sample size. OHR: observed hit rate. EHR: expected hit rate. Env: 95% confidence envelope of the expected hit rate.
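The 95% confidence envelopes reported in Tables C1, C2, and D1 can be obtained, for example, from a normal approximation to the binomial distribution of the hit rate over 500 simulation replications. The following R sketch shows this construction; it is our assumption of one plausible computation, not necessarily the authors' exact procedure, although it reproduces the reported H0 envelope of [0.031, 0.069].

envelope <- function(ehr, reps = 500, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)      # 1.96 for a 95% envelope
  se <- sqrt(ehr * (1 - ehr) / reps)    # binomial standard error of the hit rate
  round(c(lower = ehr - z * se, upper = ehr + z * se), 3)
}
envelope(0.050)   # [0.031, 0.069], matching the H0 rows
envelope(0.161)   # close to the reported envelope for a small effect at n = 500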

Table C2: Observed and expected hit rates for the DIF hypothesis

                                      Wald                           LR                             Score                          Gradient
Items  Effect size  n     Method      OHR    EHR    Env              OHR    EHR    Env              OHR    EHR    Env              OHR    EHR    Env
10     no           500   analytical  0.036  0.050  [0.031, 0.069]   0.048  0.050  [0.031, 0.069]   0.046  0.050  [0.031, 0.069]   0.052  0.050  [0.031, 0.069]
10     no           1000  analytical  0.048  0.050  [0.031, 0.069]   0.050  0.050  [0.031, 0.069]   0.050  0.050  [0.031, 0.069]   0.052  0.050  [0.031, 0.069]
10     no           3000  analytical  0.060  0.050  [0.031, 0.069]   0.060  0.050  [0.031, 0.069]   0.060  0.050  [0.031, 0.069]   0.060  0.050  [0.031, 0.069]
10     small        500   analytical  0.140  0.161  [0.129, 0.194]   0.170  0.163  [0.130, 0.195]   0.162  0.163  [0.130, 0.195]   0.174  0.163  [0.131, 0.195]
10     small        1000  analytical  0.264  0.286  [0.247, 0.326]   0.284  0.289  [0.249, 0.329]   0.284  0.289  [0.249, 0.329]   0.290  0.290  [0.250, 0.330]
10     small        3000  analytical  0.684  0.711  [0.672, 0.751]   0.690  0.717  [0.677, 0.756]   0.688  0.716  [0.677, 0.756]   0.690  0.718  [0.679, 0.758]
10     large        500   analytical  0.486  0.510  [0.466, 0.554]   0.506  0.530  [0.486, 0.574]   0.498  0.528  [0.484, 0.572]   0.514  0.535  [0.491, 0.579]
10     large        1000  analytical  0.846  0.822  [0.788, 0.855]   0.852  0.840  [0.808, 0.872]   0.854  0.838  [0.806, 0.870]   0.858  0.844  [0.812, 0.876]
10     large        3000  analytical  1.000  0.999  [0.997, 1.000]   1.000  1.000  [0.998, 1.000]   1.000  1.000  [0.998, 1.000]   1.000  1.000  [0.998, 1.000]
10     no           500   sampling    0.036  0.050  [0.031, 0.069]   0.048  0.050  [0.031, 0.069]   0.046  0.050  [0.031, 0.069]   0.052  0.050  [0.031, 0.069]
10     no           1000  sampling    0.048  0.050  [0.031, 0.069]   0.050  0.050  [0.031, 0.069]   0.050  0.050  [0.031, 0.069]   0.052  0.050  [0.031, 0.069]
10     no           3000  sampling    0.060  0.050  [0.031, 0.069]   0.060  0.050  [0.031, 0.069]   0.060  0.050  [0.031, 0.069]   0.060  0.050  [0.031, 0.069]
10     small        500   sampling    0.140  0.162  [0.129, 0.194]   0.170  0.163  [0.131, 0.195]   0.162  0.163  [0.130, 0.195]   0.174  0.163  [0.131, 0.196]
10     small        1000  sampling    0.264  0.287  [0.247, 0.326]   0.284  0.290  [0.250, 0.330]   0.284  0.289  [0.249, 0.329]   0.290  0.290  [0.251, 0.330]
10     small        3000  sampling    0.684  0.712  [0.673, 0.752]   0.690  0.718  [0.679, 0.757]   0.688  0.717  [0.677, 0.756]   0.690  0.719  [0.680, 0.759]
10     large        500   sampling    0.486  0.503  [0.459, 0.547]   0.506  0.522  [0.478, 0.566]   0.498  0.520  [0.476, 0.563]   0.514  0.526  [0.482, 0.570]
10     large        1000  sampling    0.846  0.815  [0.781, 0.849]   0.852  0.833  [0.800, 0.865]   0.854  0.830  [0.798, 0.863]   0.858  0.836  [0.804, 0.869]
10     large        3000  sampling    1.000  0.999  [0.997, 1.000]   1.000  0.999  [0.997, 1.000]   1.000  0.999  [0.997, 1.000]   1.000  1.000  [0.998, 1.000]
50     no           500   sampling    0.040  0.050  [0.031, 0.069]   0.048  0.050  [0.031, 0.069]   0.048  0.050  [0.031, 0.069]   0.050  0.050  [0.031, 0.069]
50     no           1000  sampling    0.070  0.050  [0.031, 0.069]   0.074  0.050  [0.031, 0.069]   0.074  0.050  [0.031, 0.069]   0.076  0.050  [0.031, 0.069]
50     no           3000  sampling    0.052  0.050  [0.031, 0.069]   0.054  0.050  [0.031, 0.069]   0.058  0.050  [0.031, 0.069]   0.056  0.050  [0.031, 0.069]
50     small        500   sampling    0.164  0.190  [0.155, 0.224]   0.174  0.192  [0.157, 0.226]   0.174  0.192  [0.157, 0.226]   0.178  0.192  [0.157, 0.226]
50     small        1000  sampling    0.348  0.345  [0.303, 0.386]   0.356  0.349  [0.307, 0.390]   0.350  0.349  [0.307, 0.390]   0.356  0.349  [0.307, 0.391]
50     small        3000  sampling    0.794  0.804  [0.769, 0.839]   0.794  0.809  [0.775, 0.844]   0.794  0.809  [0.775, 0.844]   0.794  0.810  [0.776, 0.844]
50     large        500   sampling    0.560  0.583  [0.540, 0.626]   0.572  0.596  [0.553, 0.639]   0.570  0.590  [0.546, 0.633]   0.576  0.601  [0.558, 0.644]
50     large        1000  sampling    0.894  0.883  [0.854, 0.911]   0.898  0.892  [0.865, 0.919]   0.900  0.887  [0.859, 0.915]   0.900  0.895  [0.868, 0.922]
50     large        3000  sampling    1.000  1.000  [0.999, 1.000]   1.000  1.000  [0.999, 1.000]   1.000  1.000  [0.999, 1.000]   1.000  1.000  [0.999, 1.000]
Note. n: sample size. OHR: observed hit rate. EHR: expected hit rate. Env: 95% confidence envelope of the expected hit rate.

Appendix D

Table D1: Observed and expected hit rates under misspecification of the person parameter distribution

[Table D1 body: observed hit rate (OHR), expected hit rate (EHR), and 95% confidence envelope (Env) of the Wald, LR, score, and gradient tests under uniform and skewed normal person parameter distributions for no, small, and large effect sizes and n = 500, 1000, and 3000.]

Note. n: sample size. OHR: observed hit rate. EHR: expected hit rate. Env: 95% confidence envelope of the expected hit rate. Herein, the Rasch vs 2PL hypothesis was tested using 10 items and the analytical power analysis method.
