Psychometrika, DOI: 10.1007/s11336-017-9554-0

GENERALIZED FIDUCIAL INFERENCE FOR LOGISTIC GRADED RESPONSE MODELS

Yang Liu

UNIVERSITY OF CALIFORNIA, MERCED

Jan Hannig

THE UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL

Samejima's graded response model (GRM) has gained popularity in the analysis of ordinal response data in psychological, educational, and health-related assessment. Obtaining high-quality point and interval estimates for GRM parameters has attracted a great deal of attention in the literature. In the current work, we derive generalized fiducial inference (GFI) for a family of multidimensional graded response models, implement a Gibbs sampler to perform fiducial estimation, and compare its finite-sample performance with several commonly used likelihood-based and Bayesian approaches via three simulation studies. It is found that the proposed method is able to yield reliable inference even in the presence of small sample sizes and extreme generating parameter values, outperforming the other candidate methods under investigation. The use of GFI as a convenient tool to quantify sampling variability in various inferential procedures is illustrated by an empirical data analysis using patient-reported emotional distress data.

Key words: generalized fiducial inference, confidence interval, Markov chain Monte Carlo, Bernstein–von Mises theorem, item response theory, graded response model, bifactor model.

1. Introduction

Ordinal rating scales frequently appear in psychological, educational, and health-related measurement. For instance, Likert-type items are routinely used to elicit responses to, e.g., the degree to which a statement can be concurred with, the frequency of substance use, or the severity of a disease's interference with daily activities. Another example, more common in proficiency assessments, is assigning partial credit to constructed responses that agree only in part with the answer key, or that reflect progressive levels of mastery. The logistic graded response model (GRM), first introduced in Samejima's (1969) Psychometrika monograph, has become a standard statistical tool for analyzing ordinal response data. The GRM models an item response as an ordinal logistic regression (Agresti, 2002; also known as a proportional odds model) on one or more latent variables representing the underlying constructs of interest. Heuristically, an ordinal response is treated as a discrete realization of a continuous but latent propensity that is related to individual differences in the target constructs as well as item characteristics. The relative position of a particular response category on the latent continuum is gauged by the adjacent item difficulty parameters, which are transformations of the slope and intercept parameters in the regression. The GRM reduces to the two-parameter logistic (2PL; Birnbaum, 1968) model when there are only two response categories.

Maximum likelihood (ML) estimation of the GRM parameters via Newton-type (e.g., Bock & Lieberman, 1970; Haberman, 2013) or Expectation–Maximization (EM; Bock & Aitkin, 1981)

Electronic supplementary material The online version of this article (doi:10.1007/s11336-017-9554-0) contains supplementary material, which is available to authorized users. Correspondence should be made to Yang Liu, Psychological Sciences, School of Social Sciences, Humanities, and Arts, University of California, Merced, 5200 North Lake Road, Merced, CA 95343, USA. Email: [email protected]

© 2017 The Psychometric Society

algorithms has been implemented in software packages such as Mplus (Muthén & Muthén, 2012), flexMIRT (Houts & Cai, 2013), IRTPRO (Cai, Thissen, & du Toit, 2011), and the R package mirt (Chalmers, 2012). One technical aspect that requires special handling is the intractable integration over the latent variable space involved in the GRM likelihood function. Simple outer-product rectangular or Gauss–Hermite quadrature approximation functions efficiently when the latent dimensionality is low (e.g., less than 3). For high-dimensional GRMs, adaptive quadrature (e.g., Haberman, 2006; Schilling & Bock, 2005) or stochastic approximation techniques (e.g., Cai, 2010a, 2010b; Meng & Schilling, 1996) must be invoked to overcome the well-known "curse of dimensionality". When the test is short and the sample size is small, instability of ML estimation has been noted by Thissen and Hill (2004) and Hill (2004). In particular, if the test consists of only two three-category graded items, Thissen and Hill (2004) identified cases in which the likelihood keeps increasing as the slope increases, and consequently the EM iterations cannot terminate properly.

A confidence interval (CI) captures the uncertainty of parameter estimation due to sampling variability. Most often, CIs reported in GRM applications are constructed by inverting the Wald test, i.e., the ML estimate plus or minus the standard error times the appropriate normal quantile determined by the nominal coverage level. Standard error calculation for item response theory (IRT) models has been extensively studied; more details can be found in Yuan, Cheng, and Patton (2014) and Cai (2008). Because the Wald-type CI relies on a normal approximation to the sampling distribution of the ML estimates, its performance is largely contingent on how well that approximation holds. For example, Liu and Hannig (2016) noticed for binary logistic IRT models that generating parameter values close to the boundary of the parameter space are likely to produce skewed sampling distributions for the ML estimates, which further induces under-coverage of Wald-type CIs. The Delta method (e.g., Lehmann, 1999, pp. 85–93) and resampling procedures such as bootstrapping (Efron & Tibshirani, 1994) are often resorted to when a reparameterization is intended. The Delta method yields Wald-type CIs for transformed parameters using a first-order Taylor series expansion argument, and thus suffers from the drawbacks of using a normal approximation. While resampling methods may lead to better CIs for parameters near the boundary, their empirical performance is seldom studied in the IRT literature.

Bayesian estimation for unidimensional (Curtis, 2010; Kieftenbeld & Natesan, 2012) and multidimensional (Edwards, 2010) GRMs via Markov chain Monte Carlo (MCMC) sampling has also been proposed and evaluated in the literature. With the help of general-purpose MCMC samplers such as OpenBUGS (Spiegelhalter, Thomas, Best, & Lunn, 2010), JAGS (Plummer, 2013a, 2013b), and Stan (Carpenter et al., 2016), arbitrary functionals of the posterior distribution can be efficiently approximated by Monte Carlo methods, from which point estimates and CIs of (transformations of) model parameters can be obtained. Kieftenbeld and Natesan (2012) reported that the Bayesian estimator resulting from a specific prior configuration performs better than the ML estimator in recovering the difficulty parameters when the sample size is small.
It is well known that the finite-sample behavior of Bayesian inference is determined to a great extent by the choice of prior distributions relative to the true data-generating parameters, which may hinder the generalizability of Kieftenbeld and Natesan's findings. Unfortunately, prior sensitivity in the estimation of GRM parameters has not been systematically studied, and thus Bayesian methods should be used with caution.

In the current research, generalized fiducial inference (GFI; Hannig, 2009, 2013; Hannig, Iyer, Lai, & Lee, 2015) is derived for a general class of multidimensional GRMs, complementing the existing full-information inferential frameworks. This recent theoretical extension of Fisher's (1930, 1933, 1935) fiducial inference serves as a middle ground between likelihood-based and Bayesian methods. Inferential procedures rest on a probability distribution defined on the parameter space, namely a fiducial distribution, which is derived using only the information contained in the data. Consequently, it inherits all the flexibility of Bayesian methods, but requires no prior knowledge of model parameters. It has been demonstrated in applications that GFI not only offers asymptotically optimal inference but often outperforms ML and Bayesian approaches in small samples as well (e.g., Cisewski & Hannig, 2012; Hannig, 2009; Liu & Hannig, 2016). In the current work, we show that GFI, when applied to the GRM, again delivers added value over conventional likelihood-based and Bayesian methods: We continue to see that GFI is well-behaved even in extreme conditions (small sample and skewed item parameters) where both ML and Bayesian approaches may fail.

2. Theory

2.1. Generalized Fiducial Inference

We first introduce the generic recipe of GFI. More detailed descriptions can be found in Hannig (2009) and Liu and Hannig (2016). For a fixed family of parametric models and an observed data set, the goal of GFI is to find a fiducial distribution that quantifies the propensity or plausibility of different parameter values in generating the observed data, in the absence of prior knowledge thereof. This is achieved by Fisher's signature role-switching argument between data and parameters, which also serves as the foundation in defining the likelihood function from a probability density function. The fiducial argument operates on the data generating equation (DGE; also known as the structural equation; Hannig, 2009):

$$\mathbf{Y} = g(\boldsymbol{\theta}, \mathbf{U}), \quad (1)$$

which describes the data Y as a function of the parameters θ ∈ Θ and random variables U following a completely known distribution. For fixed data Y = y, properly solving for θ from Eq. 1, i.e., expressing the parameters as a function of the data and random components, translates the known probability measure of U to the parameter space and produces a fiducial distribution. Consider the set inverse of the DGE (Eq. 1):

$$Q(\mathbf{y}, \mathbf{u}) = \{\boldsymbol{\theta} : g(\boldsymbol{\theta}, \mathbf{u}) = \mathbf{y}\}, \quad (2)$$

which is composed of all solutions of θ from the DGE for fixed y and u.¹ In general, Eq. 2 may contain more than one element for some values of u, and may be empty for others; in terms of finding a solution θ, they correspond to under-identified and over-identified systems, respectively. Here, drawing an analogy to solving a system of linear equations might be helpful. When Eq. 2 consists of multiple elements, it resembles a linear system that has fewer equations than variables, and thus more than one solution is admissible. A general solution for such an under-identified system can be denoted v(Q(y, u)), in which v(·) is some user-defined selection rule that chooses a point from the set determined by Eq. 2. On the other hand, when Eq. 2 is an empty set, it is similar to a linear system that has more equations than variables, in which case conflict may arise and no solution can be found. Because we assume the model is correctly specified, at least the true parameter values should be contained in the set inverse evaluated at y and the u that generates y; intuitively, u values leading to an empty set inverse are deemed not helpful to the inference of θ. To prevent this from happening, we concentrate on the set of u such that Eq. 2 is nonempty. Subsequently, a fiducial distribution can be defined as

$$v(Q(\mathbf{y}, \mathbf{U}^*)) \mid \{Q(\mathbf{y}, \mathbf{U}^*) \neq \emptyset\}, \quad (3)$$

in which U* has the same distribution as, but is independent of, the data-generating U.²

¹ As a notational convention, we use the corresponding lowercase letter for a fixed value of the random variable.
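As a toy illustration of this recipe (ours, not part of the original derivation), consider a single observation from a normal location model with the DGE

$$Y = \mu + U, \qquad U \sim N(0, 1).$$

For fixed Y = y, the set inverse Q(y, u) = {y − u} is always a singleton, so no selection rule is needed, and the fiducial distribution of μ is simply the distribution of y − U*, namely N(y, 1). In the GRM considered below, the discreteness of the responses makes Q(y, u) a set (in fact a polytope) rather than a point, so both the selection rule v(·) and the conditioning on non-emptiness in Eq. 3 become essential.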


Figure 1. Generating a single trichotomous response Y_ij from a unidimensional GRM.

Hannig (2009, 2013) commented that Eq. 3 depends on the choice of the DGE (Eq. 1) and the selection rule v(·), which implies that the fiducial distribution may not be uniquely defined. In our application of GFI to the GRM, a specific DGE and a non-informative selection rule are introduced. We remark that GFI is not intended to resolve the theoretical and philosophical controversies of Fisher's original fiducial inference, which has led to heated debates over decades (see Zabell, 1992 for a review). Some of the "deficiencies" of Fisher's theory remain in GFI; for example, it does not always yield exact inference in the frequentist sense. We also do not attempt to pursue prolonged philosophical discussions on the meaning of fiducial probability. In contrast, we emphasize that GFI, analogous to ML, provides an inferential recipe that can be easily implemented for various statistical models. It often yields asymptotically correct inferential procedures and often exhibits satisfactory small-sample performance, as suggested by Monte Carlo studies.

2.2. GFI for the GRM

For ease of explication, we only present here the derivation of GFI for trichotomous graded items and a unidimensional latent variable; for a general GRM with more than three response categories and a multidimensional latent variable, the formulae are provided in Appendix A.³ Our notation follows the mixed-effects modeling convention more closely than the default notation in the IRT literature. Conditional on the unidimensional latent variable Z_i, respondent i's response to a trichotomous item j under the GRM is a single multinomial trial with probabilities:

$$P\{Y_{ij} = k \mid Z_i = z_i\} = \begin{cases} 1 - \Psi(\alpha_{j1} + \beta_j z_i), & k = 0;\\ \Psi(\alpha_{j1} + \beta_j z_i) - \Psi(\alpha_{j2} + \beta_j z_i), & k = 1;\\ \Psi(\alpha_{j2} + \beta_j z_i), & k = 2. \end{cases} \quad (4)$$

Equation 4 is often referred to as the item response function (IRF), in which Ψ(·) denotes the standard logistic cumulative distribution function (cdf), α_j1 and α_j2 are intercept parameters subject to the order constraint α_j1 > α_j2, and β_j denotes the slope parameter. In Appendix A, the IRF of a general logistic GRM is denoted by f_j(θ_j, k | z_i), in which θ_j collects all the slope and intercept parameters for item j. An illustration of generating a trichotomous response Y_ij from the GRM (Eq. 4) can be found in Fig. 1. Given Z_i and fixed item parameters, we draw A_ij ~ Logistic(0, 1) and compare it with α_j1 + β_j Z_i and α_j2 + β_j Z_i: If A_ij > α_j1 + β_j Z_i, then Y_ij = 0; if A_ij ≤ α_j2 + β_j Z_i, then Y_ij = 2; otherwise, Y_ij = 1. The following DGE summarizes all three cases in one formula:

$$Y_{ij} = I\{A_{ij} \le \alpha_{j1} + \beta_j Z_i\} + I\{A_{ij} \le \alpha_{j2} + \beta_j Z_i\}, \quad (5)$$

in which I{·} is the indicator function.
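To make the DGE concrete, the following R sketch (our own illustration; the variable names and the particular parameter values are arbitrary) simulates trichotomous responses to one item directly from Eq. 5.

## Simulate n responses to one trichotomous graded item via the DGE in Eq. 5.
set.seed(1)
n      <- 200        # number of respondents (illustrative)
alpha1 <- 2.5        # intercept alpha_j1 (order constraint: alpha_j1 > alpha_j2)
alpha2 <- 0          # intercept alpha_j2
beta   <- 2          # slope beta_j

z <- rnorm(n)        # latent variable Z_i ~ N(0, 1)
a <- rlogis(n)       # pivotal quantity A_ij ~ Logistic(0, 1)

## Y_ij = I{A_ij <= alpha_j1 + beta_j Z_i} + I{A_ij <= alpha_j2 + beta_j Z_i}
y <- (a <= alpha1 + beta * z) + (a <= alpha2 + beta * z)
table(y)             # frequencies of the ordered categories 0, 1, 2

Applying the same construction item by item, with item-specific parameters, produces a full response matrix.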

² The added asterisk is used to distinguish a random variable from its data-generating counterpart, which is adopted as a notational convention in the sequel.
³ All appendices are included as online supplemental materials.


Figure 2. The set inverse of a single response entry Y_ij for a trichotomous graded item; the three panels correspond to the cases Y_ij = 0, 1, and 2, respectively. The vertical wireframes on the first and third panels give the boundary of the order constraint imposed on the intercept parameters: α_j1 > α_j2. Arrows point into the corresponding half-spaces.

Our formulation of the DGE closely follows Hannig's (2009) general discussion of the multinomial model and extends Liu and Hannig's (2016) treatment of binary item responses. In Eq. 5, A_ij and Z_i together are identified as the pivotal component U in the general formula (Eq. 1). The set inverse function of Eq. 5 is a subset of the three-dimensional parameter space of two intercepts and one slope:

$$Q_{ij}(y_{ij}, a_{ij}, z_i) = \bigl\{(\alpha_{j1}, \alpha_{j2}, \beta_j) \in \mathbb{R}^3 : \alpha_{j1} > \alpha_{j2};\ a_{ij} > \alpha_{j1} + \beta_j z_i \text{ if } y_{ij} = 0;\ \alpha_{j2} + \beta_j z_i < a_{ij} \le \alpha_{j1} + \beta_j z_i \text{ if } y_{ij} = 1;\ a_{ij} \le \alpha_{j2} + \beta_j z_i \text{ if } y_{ij} = 2\bigr\}. \quad (6)$$

Geometrically, Eq. 6 corresponds to an intersection of two half-spaces: If y_ij = 1, i.e., the middle category, Eq. 6 is the intersection of a_ij > α_j2 + β_j z_i and a_ij ≤ α_j1 + β_j z_i; if y_ij = 0 or 2, i.e., the extreme categories, Eq. 6 reduces to the intersection of a_ij > α_j1 + β_j z_i or a_ij ≤ α_j2 + β_j z_i with the order constraint α_j1 > α_j2. A graphical illustration of Eq. 6 is provided as Fig. 2.

The set inverse for n independent and identically distributed (i.i.d.) responses to a trichotomous item j, denoted Y_(j) = (Y_ij)_{i=1}^n, is given by the intersection of all individual set inverse functions (Eq. 6):

$$Q_j(\mathbf{y}_{(j)}, \mathbf{a}_{(j)}, \mathbf{z}) = \bigcap_{i=1}^{n} Q_{ij}(y_{ij}, a_{ij}, z_i), \quad (7)$$

in which a_(j) = (a_ij)_{i=1}^n and z = (z_i)_{i=1}^n are realizations of the logistic and normal random variables. An intersection is taken because the same intercept and slope parameters appear in the data-generating equations of all n responses, and the set inverse should contain the values of those item parameters that are consistent with all the equations. Figure 3 depicts the set inverse for five responses to the item: A three-dimensional polytope is obtained as the intersection of the corresponding half-spaces.
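The defining inequalities of Eqs. 6 and 7 can also be checked numerically: a candidate parameter point belongs to the item-level set inverse if and only if it satisfies the order constraint and every observation's half-space. The sketch below is our own illustration (the function and variable names are not from the authors' software); by construction, the data-generating values always pass the check.

## Membership check for the set inverse Q_j of Eqs. 6-7 (illustrative sketch).
in_set_inverse <- function(alpha1, alpha2, beta, y, a, z) {
  if (alpha1 <= alpha2) return(FALSE)          # order constraint alpha_j1 > alpha_j2
  upper <- alpha1 + beta * z                   # boundary between y = 0 and y >= 1
  lower <- alpha2 + beta * z                   # boundary between y <= 1 and y = 2
  all((y != 0 | a >  upper) &                  # y = 0 requires a > upper
      (y != 1 | (a > lower & a <= upper)) &    # y = 1 requires lower < a <= upper
      (y != 2 | a <= lower))                   # y = 2 requires a <= lower
}

set.seed(1)
n <- 5; alpha1 <- 2.5; alpha2 <- 0; beta <- 2
z <- rnorm(n); a <- rlogis(n)
y <- (a <= alpha1 + beta * z) + (a <= alpha2 + beta * z)   # DGE of Eq. 5

in_set_inverse(alpha1, alpha2, beta, y, a, z)   # TRUE: generating values are members
in_set_inverse(-1, -2, 0.5, y, a, z)            # an arbitrary point may fail the check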


Figure 3. The set inverse of five i.i.d. responses for a 3-category graded item. Arrows point into the corresponding half-spaces. The intersection of all half-spaces is shown as the polytope surrounded by its highlighted edges and vertices.

It is noted that the intersection of half-spaces may not always form a closed polytope; unbounded polyhedrons are more frequently observed when the sample size is small and/or the data-generating parameters are extreme.

Next, consider a sample of i.i.d. responses Y = (Y_i)_{i=1}^n = ((Y_ij)_{j=1}^m)_{i=1}^n to a test of m graded items, each of which is calibrated by a form of Eq. 4. Because items do not share parameters in the current model specification, the set inverse function for the entire set of item response data is

$$Q(\mathbf{y}, \mathbf{a}, \mathbf{z}) = \bigtimes_{j=1}^{m} Q_j(\mathbf{y}_{(j)}, \mathbf{a}_{(j)}, \mathbf{z}), \quad (8)$$

in which × denotes the Cartesian product. A fiducial distribution can be constructed following the general recipe (Eq. 3):

$$v(Q(\mathbf{y}, \mathbf{A}^*, \mathbf{Z}^*)) \mid \{Q(\mathbf{y}, \mathbf{A}^*, \mathbf{Z}^*) \neq \emptyset\}, \quad (9)$$

in which A* and Z* are i.i.d. copies of A and Z, respectively. Note that both A and Z are continuous random variables, and thus we do not differentiate Q(y, A*, Z*) from its closure, i.e., the polyhedrons with attained boundaries. A random variable following the distribution given by Eq. 9 is termed a generalized fiducial quantity (GFQ), denoted R(y).

Any deterministic or stochastic rule v(·) that selects a unique point from the set inverse Q(y, A*, Z*) can be applied in the definition of a fiducial distribution (Eq. 9). We first consider a specific selection rule that identifies the extremal point of Q(y, A*, Z*) along a fixed direction d = (d_j)_{j=1}^m on the parameter space, which projects onto the extremal vertex of the polyhedron Q_j(y_(j), A*_(j), Z*) along d_j for each j. A Bernstein–von Mises theorem can be established for the resulting fiducial distribution (Theorem 1). In practice, however, there is no clear indication of which direction d should be preferred, and thus we recommend the selection rule that selects with equal probability an extremal point of the overall set inverse, in parallel with Liu and Hannig's (2016) choice for binary logistic IRT models.

2.3. Asymptotic Properties

Let g_n^d(θ | y) be the density function of the fiducial distribution (Eq. 9) associated with the selection of the extremal point along direction d, and let R_d(y) denote the corresponding GFQ. As the main theoretical result of this paper, we establish that g_n^d(θ | y) satisfies a Bernstein–von Mises theorem (Theorem 1). We extend Liu and Hannig's (2016) work in two respects: a) the theorem is proved for an exact fiducial distribution, not an empirical Bayesian approximation, and b) the class of GRMs being considered here subsumes the binary logistic models as special cases. The proof is outlined in Appendix B; arguments similar to those already elaborated in Liu and Hannig's (2016) proof are omitted.

Theorem 1 (Bernstein–von Mises). Suppose that item response data Y = (Y_i)_{i=1}^n are i.i.d. with probability mass function:

$$f(\boldsymbol{\theta}_0, \mathbf{y}_i) = \int_{\mathbb{R}^r} \prod_{j=1}^{m} f_j(\boldsymbol{\theta}_{0,j}, y_{ij} \mid \mathbf{z}_i)\, d\Phi(\mathbf{z}_i), \quad (10)$$

in which θ_0 = (θ_{0,j})_{j=1}^m ∈ Θ ⊂ R^q denotes the true parameter values, Θ is the parameter space, f_j(θ_{0,j}, y_ij | z_i) is the IRF of item j, and Φ(·) denotes the probability measure of an r-dimensional standard normal distribution. Also write

$$I(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}\!\left[\frac{\partial}{\partial\boldsymbol{\theta}}\log f(\boldsymbol{\theta}, \mathbf{Y}_i)\,\frac{\partial}{\partial\boldsymbol{\theta}'}\log f(\boldsymbol{\theta}, \mathbf{Y}_i)\right] = -E_{\boldsymbol{\theta}}\!\left[\frac{\partial^2}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\log f(\boldsymbol{\theta}, \mathbf{Y}_i)\right] \quad (11)$$

as the Fisher information matrix, and I_0 = I(θ_0) for simplicity. Assume that

(i) m ≥ r + 1;

(ii) for all θ, θ′ ∈ Θ such that θ ≠ θ′, f(θ, y_i) ≠ f(θ′, y_i) for some response pattern y_i;
(iii) θ_0 is in the interior of Θ;
(iv) the Fisher information matrix I_0 is positive definite.

Let ḡ_n^d(h | y) denote the fiducial density of √n [R_d(y) − θ_0], H_n be the correspondingly scaled parameter space, and φ_{I_0^{-1} S_n, I_0^{-1}} be the density of N(I_0^{-1} S_n, I_0^{-1}), in which

$$S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left.\frac{\partial}{\partial\boldsymbol{\theta}} \log f(\boldsymbol{\theta}, \mathbf{Y}_i)\right|_{\boldsymbol{\theta} = \boldsymbol{\theta}_0} \quad (12)$$

denotes the sample score function evaluated at θ_0. Then,

$$\int_{H_n} \bigl| \bar{g}_n^d(\mathbf{h} \mid \mathbf{Y}) - \phi_{I_0^{-1} S_n,\, I_0^{-1}}(\mathbf{h}) \bigr|\, d\mathbf{h} \xrightarrow{\ P_{\boldsymbol{\theta}_0}\ } 0, \quad (13)$$

where →^{P_{θ_0}} denotes convergence in probability under θ_0.

We discovered in pilot simulation studies that the polytopes involved in the set inverse function are typically small in hyper-volume when the sample size is large (> 500), and thus conjecture that the Bernstein–von Mises phenomenon applies to all valid selection rules. Similar to Liu and Hannig's (2016) Theorem 2, it can be established for the special case r = 1 that the diameter of the set inverse converges to 0 at the rate 1/n; therefore, the conjecture is verified for unidimensional GRMs (see Liu, 2015 for more details).

Let X ~ N(0, I_0^{-1}); by the Central Limit Theorem, I_0^{-1} S_n →^d X, in which →^d denotes convergence in distribution. As a corollary of Theorem 1, a location functional, e.g., the mean or median, of the scaled fiducial distribution √n [R_d(Y) − θ_0] converges in distribution to the same functional applied to N(I_0^{-1} S_n, I_0^{-1}), which gives X in the limit; see Theorem 10.8 of van der Vaart (2000) and Theorem 5.5.3 of Bickel and Doksum (2015). We also know that the ML estimate of the GRM parameters, denoted θ̂, satisfies √n(θ̂ − θ_0) →^d X; therefore, the fiducial mean or median is asymptotically equivalent to the ML estimate.

Another notable consequence of Theorem 1 is that fiducial percentile CIs have the correct frequentist coverage as the sample size tends to infinity. Let θ be a single coordinate of θ, and R(y) be the corresponding coordinate of the GFQ R_d(y); also write X and σ_0^2 for the corresponding entries of X and of the diagonal of the inverse Fisher information matrix I_0^{-1}, respectively. The 100(1 − α)th percentile (0 < α < 1) of R(y), denoted r_α(Y), satisfies √n [r_α(Y) − θ_0] →^d X + σ_0 z_α. It follows that

$$P\{\theta_0 \le r_\alpha(\mathbf{Y})\} = P\{\sqrt{n}\,[r_\alpha(\mathbf{Y}) - \theta_0] \ge 0\} \to P\{X + \sigma_0 z_\alpha \ge 0\} = 1 - \alpha. \quad (14)$$

In other words, the coverage of the one-sided 100(1 − α)th fiducial percentile CI (−∞, r_α(Y)] converges to its nominal level.

Related to Eq. 14 is the definition of a confidence distribution (CD; e.g., Efron, 1998; Schweder & Hjort, 2002; Xie & Singh, 2013), which is also historically rooted in Fisher's fiducial inference. A CD for a single parameter θ (possibly in the presence of nuisance parameters) is a data-dependent distribution, represented by a random variable C(y), defined such that its upper α quantile c_α(y) is the upper limit of a one-sided 100(1 − α)% CI under the true model: i.e., P{θ_0 ≤ c_α(Y)} = 1 − α. In many practical problems, however, only asymptotic CDs (ACDs) can be found: P{θ_0 ≤ c_α(Y)} only converges to, rather than matches exactly, the nominal level 1 − α. Equation 14 indicates that the GFQ of each item parameter θ is an ACD. Various existing inference tools, such as Bayesian posterior distributions and bootstrap distributions, are also ACDs, provided suitable regularity conditions are satisfied. The comparative performance of GFI versus other ACDs in finite samples is studied via Monte Carlo simulations in Sect. 4.
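Operationally, the percentile CI of Eq. 14 and the fiducial median are read off the retained Monte Carlo draws produced by the sampler described in Sect. 3. A minimal sketch, in which the vector `draws` is a stand-in for the post burn-in, thinned fiducial draws of a single item parameter (fabricated here from a normal distribution only so that the code runs):

## Fiducial median and equi-tailed 95% percentile CI from retained draws.
set.seed(1)
draws <- rnorm(5000, mean = 2, sd = 0.4)   # stand-in for fiducial draws of one parameter

median(draws)                              # fiducial median (point estimate)
quantile(draws, probs = c(0.025, 0.975))   # lower and upper 95% confidence limits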

3. Computation

Consider the data y as fixed. Recall that the generalized fiducial distribution (Eq. 9) is determined by the distribution of A* and Z* truncated to the set {Q(y, A*, Z*) ≠ ∅}. Algorithm 1 defines a Gibbs sampler for the truncated sampling of A* and Z*. With starting values of A* and Z*, and a large bounding set on the parameter space, the size of which is determined by the order constraints for intercepts and a pre-specified upper bound M for the absolute values of item parameters (Algorithm S.4 in Appendix F), the algorithm at each cycle updates sequentially each component of A* and Z* conditional on the current values of all other variates and the key restriction Q(y, A*, Z*) ≠ ∅ (Algorithms S.1 and S.2 in Appendices C and D). The representation of the implied set inverse must be updated after each conditional sampling step, in order to yield the desired truncation at the next conditional sampling step; the details are relegated to Algorithm

S.3 in Appendix E. As the number of cycles tends to infinity, the generated Markov chain stabilizes around the joint distribution of A*, Z* given Q(y, A*, Z*) ≠ ∅. At the end of each cycle, an extremal point of the updated set inverse is selected, which is regarded as a realization of the GFQ.

Algorithm 1 A Gibbs sampler.
1:  Starting values for A* and Z* (Algorithm S.4)
2:  do cycles s = 1, ..., S
3:    do observations i = 1, ..., n
4:      do items j = 1, ..., m
5:        Unlink observation i from the interior polytope
6:      end do
7:      do dimensions d = 1, ..., r
8:        Update Z*_id (Algorithm S.2)
9:      end do
10:     do items j = 1, ..., m
11:       Update A*_ij (Algorithm S.1)
12:       Update the jth polytope (Algorithm S.3)
13:     end do
14:   end do
15:   do items j = 1, ..., m
16:     Select with equal probability a vertex of the updated interior polytope
17:   end do
18: end do

For each i, the operation in Line 5 of Algorithm 1 is needed prior to executing any updating steps for this particular observation. When neither half-space given by observation i is interior, no extra computation is needed there. When i constitutes the interior polytope, however, the unlinking step is computationally challenging. Currently, Line 5 is achieved by intersecting the initial bounding box with the half-spaces for all but the ith observation (i.e., repeatedly running Algorithm S.3). Fortunately, the interior polytope is usually determined by very few observations, and we only need to run the unlinking once for each combination of i and j when observation i is interior.

The proposed sampler relies on updating bounded polytopes, and thus we need to specify an upper limit 0 < M < ∞ for the absolute values of the item parameters. It was briefly noted in Sect. 2.2 that the intersections of the generated half-spaces (Eq. 7) are not necessarily bounded and that infinity can be a feasible solution of the DGE. When the sample size is small, the unbounded cases are more likely to appear. Because Algorithm 1 operates on the restricted parameter space Θ ∩ {θ : ‖θ‖_∞ ≤ M}, in which ‖·‖_∞ denotes the L^∞-norm, unbounded polyhedrons are truncated at the bound ‖θ‖_∞ = M. Consequently, larger M values may produce longer-tailed fiducial distributions in small samples. To reduce the impact of the tuning parameter M on the performance of the sampler, we propose a workaround naturally adapted to the GRM, which amounts to introducing artificial extremal response categories for each item; the details can be found in Appendix G. As demonstrated in our first simulation study (Sect. 4.1; n = 200, m = 2, and K_1 = K_2 = 3), using M = 20 or 200 does not substantially change the fiducial point estimates. In practice, we consider M = 20 a generous enough bound for item parameters: A slope of 20 on the logit scale implies that a one-unit increase in the latent variable leads to e^20 ≈ 4.85 × 10^8 times the odds of endorsing higher response categories.

Factors affecting the computational complexity of Algorithm 1 are the sample size n, the test length m, and the number of dimensions q_j = r_j + K_j − 1 of each polytope. A major limitation of the proposed algorithm is that the dimensionality of the polytope, and thus the computational time of the polytope-updating step (Line 12 of Algorithm 1), depends on both the dimensionality of the latent variable and the number of response categories. As a result, fitting a unidimensional model to five-category items takes about the same time as fitting a three-dimensional model to three-category items (both yield q_j = 5). We also observed that for some parameters the autocorrelations of the generated Markov chains grow as the sample size increases. We leave the improvement of computational efficiency as a topic for future research. Because our simulation studies (Sect. 4) suggest that GFI only outperforms ML when the sample size is small, and the computational time of Algorithm 1 is then reasonable, it is still practically desirable to invest more time to secure higher-quality inference. In large samples, however, the Wald-type CI often works well due to the asymptotic theory and is in the meantime more computationally economical than MCMC-based methods (i.e., fiducial and Bayesian methods).

4. Monte Carlo Simulations

Three simulation studies are reported in this section. In the first study (Sect. 4.1), we revisited Thissen and Hill's (2004) inquiry of fitting a unidimensional GRM to two items, using ML, GFI, and Bayesian inference. We are interested in the accuracy of the resulting point estimators under various tuning configurations of the estimation algorithms. Next, the finite-sample behavior of the previously discussed implementation of GFI is evaluated when applied to unidimensional (Sect. 4.2) and bifactor (Sect. 4.3) GRMs with more items. Comparing fiducial interval estimators with existing likelihood-based and Bayesian approaches is the focus of our attention. A Fortran program that implements the proposed Gibbs sampler (Algorithm 1) was used to obtain Monte Carlo samples from the fiducial distribution of item parameters. The source code of the program has been posted on the first author's personal website (http://faculty.ucmerced.edu/yliu85) and is also available upon request by email. A majority of the computations involved in the simulation studies were completed on the parallel computing cluster KillDevil located at the University of North Carolina at Chapel Hill.

4.1. Study 1: Two Items

Tests consisting of only two ordinal items, albeit not common, exist in large-scale writing assessments (e.g., the writing items in Haberman, 2013, p. 29), introductory texts on IRT modeling (e.g., the political attitude example in Thissen & Steinberg, 1988), and software documentation. Fitting a unidimensional GRM to only two items using ML has proved to be challenging: The log-likelihood surface can be flat, resulting in convergence issues for the optimization algorithm. In this study, we demonstrate using Monte Carlo simulations that sampling-based estimation techniques, including both fiducial and Bayesian methods, are superior to ML in terms of producing more accurate point estimates. We also reveal that the quality of fiducial point estimates is not substantially affected by the tuning parameter M, i.e., the largest absolute value allowed for item parameters (see Sect. 3 and Appendix G), whereas the Bayesian estimator using a Cauchy prior vastly outperforms the one using a uniform prior.

4.1.1. Simulation Design and Candidate Methods

The accuracy of the point estimates was summarized across 500 simulated data sets; each data set is composed of 200 response patterns to two (unidimensional) three-category graded items with identical item parameters: α_j1 = 2.5, α_j2 = 0, β_j = 2, for j = 1, 2.

To initiate the proposed fiducial MCMC sampler, we set (1, −1) as starting values for the intercepts, and 1 for the slopes. The starting values for the normal variates z_0 were generated from a standard normal distribution unconditionally, and those for the logistic variates a_0 were generated by Algorithm S.4 described in Appendix F. Item parameters were restricted to the bounding set defined by Eq. S.39. Two bounding-set sizes were considered here: M = 20 and 200; the resulting point estimators are referred to as FID20 and FID200, respectively. Note that item parameters are defined on the logit scale. In the literature, we have never seen reported item parameter estimates with absolute values greater than 20; hence, setting M = 20 should not pose a substantial restriction on the parameter space. In each replication of the simulation, the sampler was run for 60,000 cycles. We visually examined the resulting trace plots in a pilot study and decided that 60,000 cycles are sufficient for the generated Markov chain to attain stationarity. In addition, we burned in the first 10,000 cycles to remove the influence of starting values and used a thinning interval of 10 to reduce the autocorrelation of the generated Markov chain, as well as the usage of computer storage. Fiducial medians of item parameters were estimated based on 5000 draws in each replication; the median, which is the default choice in the GFI literature (e.g., Cisewski & Hannig, 2012; Hannig, 2009), serves as a more stable central tendency estimate than the mean when the target distribution is long-tailed.

ML estimates were computed via the Bock–Aitkin EM algorithm using Mplus 7.0 (Muthén & Muthén, 2012). The integral in the likelihood function was approximated using 49 equally spaced rectangular quadrature points on the interval [−5, 5]. The maximum number of iterations was fixed at 10,000, and the convergence criterion for the EM algorithm was set to either 10^−4 (the default in mirt; the resulting estimates are labeled ML-4) or 10^−6 (the default in IRTPRO; the resulting estimates are labeled ML-6). The same starting values for item intercepts and slopes were used as in fiducial estimation.

Two objective Bayesian methods with diffuse priors, reflecting our lack of a priori knowledge of item parameters, were also included in the comparison: The priors were derived from a wide uniform distribution (from −20 to 20) and a standard Cauchy distribution, respectively. For each item, the slope followed a uniform or Cauchy distribution, and independently the two intercepts were the order statistics of a uniform or Cauchy distribution.⁴ While diffuse uniform priors have been widely used in the Bayesian literature, our adoption of the Cauchy prior is largely exploratory. It was motivated by Gelman, Jakulin, Pittau, and Su's (2008) recommendations on the use of weakly informative priors for logistic regression models. Posterior medians (denoted BAYC and BAYU for the Cauchy and uniform priors, respectively) were used as point estimates. JAGS (Plummer, 2013a, 2013b) and its R interface package rjags (Plummer, 2013b) were used for Bayesian estimation. The JAGS code for fitting a unidimensional GRM can be found in Curtis (2010, Section 5). A single Markov chain of 60,000 cycles was generated with a burn-in period of 10,000 and a thinning interval of 10, which paralleled the sampler setup for fiducial estimation.
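The post-processing implied by these settings (60,000 cycles, 10,000 burn-in cycles, thinning interval of 10, hence 5000 retained draws) takes only a few lines of R; the matrix `chain` below is a stand-in for raw sampler output with one row per cycle and one column per item parameter, not output from the actual Fortran or JAGS programs.

## Burn-in, thinning, and medians for a raw Markov chain (stand-in data).
n_cycles <- 60000; burn_in <- 10000; thin <- 10

chain <- matrix(rnorm(n_cycles * 6), nrow = n_cycles,
                dimnames = list(NULL, c("a11", "a12", "b1", "a21", "a22", "b2")))

keep     <- seq(burn_in + thin, n_cycles, by = thin)   # retained cycles
retained <- chain[keep, , drop = FALSE]

nrow(retained)               # 5000 retained draws
apply(retained, 2, median)   # per-parameter medians (fiducial or posterior)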

4.1.2. Results

For each parameter, the median absolute error (MAE) of all six candidate point estimators was calculated and tabulated in Table 1. MAE is calculated as Median|θ̃ − θ_0|, in which θ̃ denotes a point estimator and the median is taken over 500 replications. The median MAE across the six parameters was also included in the table as an overall measure of estimation error.

The results listed in Table 1 are in favor of GFI with parameter bound M = 20 (FID20): In general, it yields more accurate estimates than the other candidate methods. Increasing the bound to M = 200 (FID200) does not noticeably change the MAE of point estimates. Here, using the median instead of the mean is crucial in maintaining the stability of point estimation when a larger M is specified. The trace plots for the slope parameter β_1 in the first replication are shown in the upper panels of Fig. 4; it is noted that the marginal fiducial distribution for this parameter has a long tail.

⁴ Note that a different order-statistic approach applied to the difficulty parameters (see Eq. 17) was recommended by both Curtis (2010) and Kieftenbeld and Natesan (2012).

Table 1. Median absolute error (MAE) of the candidate point estimators (n = 200, m = 2, K1 = K2 = 3).

Parameter FID20 FID200 ML-4 ML-6 BAYC BAYU

α11      0.24  0.26  0.62  0.79  0.24  0.95
α12      0.17  0.19  0.19  0.22  0.16  0.23
β1       0.25  0.28  0.84  1.04  0.30  1.09
α21      0.23  0.26  0.56  0.74  0.26  0.97
α22      0.16  0.18  0.18  0.21  0.15  0.23
β2       0.25  0.30  0.82  1.04  0.29  1.17
Overall  0.24  0.26  0.59  0.77  0.25  0.96

Note: The column-wise median MAE is tabulated in the last row as an overall accuracy measure.


Figure 4. Trace plots for β1 produced by the four sampling-based methods (FID20, FID200, BAYC, BAYU) in the first replication. upper panels of Fig. 4; it is noted that the marginal fiducial distribution for this parameter has a long tail. The ML estimates for α j1 = 2.5 and β j = 2, j = 1, 2, are inaccurate. The MAE is even larger under a tighter convergence criterion (ML-6); with this setting, 4–5% of the estimates have absolute values over 20. In comparison, α12 = α22 = 0 are relatively well estimated. Our findings suggest limited usefulness of ML estimation for fitting a unidimensional GRM to only two items in small samples. The divergent performance of the two Bayesian estimators highlights the dominant role of prior specification in the two-item problem. The widely used uniform prior (BAYU) results in even more problematic estimates than ML, and its use is not recommended; the trace plot (the lower-right panel of Fig. 4) for parameter β1 in the first replication indicates that the Markov chain does not mix well within 60,000 cycles. In contrast, using a Cauchy prior (BAYC) not only YANG LIU AND JAN HANNIG improves the mix of the chain (see the lower-right panel of Fig. 4) but also leads to well-behaved point estimates comparable to FID20.

4.1.3. Discussion

We conclude that both GFI and the Bayesian inference derived from a Cauchy prior are superior to the gold-standard ML estimation in this two-item calibration problem. Moreover, GFI circumvents prior specification, which proves to be an overriding factor in determining the quality of Bayesian estimates. The only tuning parameter M in the proposed Gibbs sampling algorithm is found to have a minor impact on the point estimates. Although fitting the GRM to only two items is somewhat unusual in practice, the results highlight the usefulness of GFI when the conventional ML and sub-optimal Bayesian inference cannot be relied on.

4.2. Study 2: A Unidimensional Model

4.2.1. Simulation Design

The second simulation study aims at comparing the performance of likelihood-based, Bayesian, and fiducial point estimates and CIs for various types of item parameters. Graded response data were generated from unidimensional GRMs under a fully factorial design involving two sample size levels, n = 100 and 500, and two test length levels, m = 9 and 18. Compared to Study 1, the sizes of the simulated data in Study 2 more closely resemble those encountered in real data analyses. All items had five ordered response categories (K_j = 5 for all j). Results were accumulated across 500 simulated data sets in each condition.

For m = 9, the true item parameters were determined by two factors: communality and skewness. In the factor analysis literature, communality for each item measures the proportion of variance explained by the latent variables. Under the logit parameterization of the unidimensional GRM (Eq. S.1), the explained variance is approximated by the squared value of the standardized factor loading parameter (see, e.g., Wirth & Edwards, 2007):

$$\lambda_j = \frac{\beta_j/1.7}{\sqrt{1 + (\beta_j/1.7)^2}}, \quad (15)$$

in which 1.7 is the constant that matches the standardized logistic and normal cdfs. Values 0.1, 0.5, and 0.9 were selected to represent low, medium, and high levels of communality, respectively. Skewness refers to the degree to which the intercept parameters α_jk are centered around zero. Here, we manipulated a standardized version of the intercept, namely, the threshold parameter:

$$\tau_{jk} = \frac{-\alpha_{jk}/1.7}{\sqrt{1 + (\beta_j/1.7)^2}}, \quad k = 1, 2, 3, 4. \quad (16)$$

Values (−0.75, −0.55, −0.05, 0.75), (−0.25, −0.05, 0.45, 1.25), and (0.25, 0.45, 0.95, 1.75) were used as the symmetric, moderately skewed, and skewed threshold conditions, respectively. The nine combinations of three communality and three threshold levels yielded the data-generating parameter values for all items in the test; the true parameter values are tabulated in Table 2. For m = 18, the first half of the items had the same parameter values as listed in Table 2; the second half had the same factor loading parameters, and threshold parameters with the same absolute values but with reversed signs and ordering. We remark that the parameter values considered here are more extreme than those used in many simulation studies (e.g., Forero, Maydeu-Olivares, & Gallardo-Pujol, 2009). Highly skewed or highly discriminating items are by no means rare in practice, especially in health-related surveys; an example would be an item about suicide in a scale measuring depressive symptoms (e.g., the Hamilton Depression Rating Scale; Hamilton, 1960).

Table 2.
Data-generating parameter values for the unidimensional GRM (m = 9).

1  Low     Symmetric  loading 0.32  slope 0.57  thresholds (−0.75, −0.55, −0.05, 0.75)  intercepts (1.34, 0.99, 0.09, −1.34)   difficulties (−2.37, −1.74, −0.16, 2.37)
2  Low     Moderate   loading 0.32  slope 0.57  thresholds (−0.25, −0.05, 0.45, 1.25)   intercepts (0.45, 0.09, −0.81, −2.24)  difficulties (−0.79, −0.16, 1.42, 3.95)
3  Low     Skewed     loading 0.32  slope 0.57  thresholds (0.25, 0.45, 0.95, 1.75)     intercepts (−0.45, −0.81, −1.70, −3.14) difficulties (0.79, 1.42, 3.00, 5.53)
4  Medium  Symmetric  loading 0.71  slope 1.70  thresholds (−0.75, −0.55, −0.05, 0.75)  intercepts (1.80, 1.32, 0.12, −1.80)   difficulties (−1.06, −0.78, −0.07, 1.06)
5  Medium  Moderate   loading 0.71  slope 1.70  thresholds (−0.25, −0.05, 0.45, 1.25)   intercepts (0.60, 0.12, −1.08, −3.01)  difficulties (−0.35, −0.07, 0.64, 1.77)
6  Medium  Skewed     loading 0.71  slope 1.70  thresholds (0.25, 0.45, 0.95, 1.75)     intercepts (−0.60, −1.08, −2.28, −4.21) difficulties (0.35, 0.64, 1.34, 2.47)
7  High    Symmetric  loading 0.95  slope 5.10  thresholds (−0.75, −0.55, −0.05, 0.75)  intercepts (4.03, 2.96, 0.27, −4.03)   difficulties (−0.79, −0.58, −0.05, 0.79)
8  High    Moderate   loading 0.95  slope 5.10  thresholds (−0.25, −0.05, 0.45, 1.25)   intercepts (1.34, 0.27, −2.42, −6.72)  difficulties (−0.26, −0.05, 0.47, 1.32)
9  High    Skewed     loading 0.95  slope 5.10  thresholds (0.25, 0.45, 0.95, 1.75)     intercepts (−1.34, −2.42, −5.11, −9.41) difficulties (0.26, 0.47, 1.00, 1.84)

Apart from the original slope-intercept and the standardized loading-threshold parameterizations, the item difficulty (also known as intensity) parameter is also of interest:

$$\delta_{jk} = -\alpha_{jk}/\beta_j. \quad (17)$$

The kth difficulty parameter, k = 1, ..., K_j − 1, gives the latent variable value at which response categories {0, ..., k − 1} and {k, ..., K_j − 1} are equally likely to be endorsed. In the scenario of assigning partial credit in an educational test, δ_jk gauges the difficulty of obtaining an item score higher than or equal to k. The true values of these parameters can also be found in Table 2.
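Because the loading, threshold, and difficulty scales are deterministic functions of the slopes and intercepts, draws on the transformed scales are obtained by pushing each retained slope-intercept draw through Eqs. 15-17. A minimal R sketch for one unidimensional item, with made-up draws purely for illustration:

## Transform slope-intercept draws to loadings, thresholds, and difficulties (Eqs. 15-17).
beta  <- c(1.8, 2.1, 1.6)                    # illustrative slope draws for one item
alpha <- rbind(c(2.4, 1.3, 0.1, -1.4),       # illustrative intercept draws (one row per
               c(2.6, 1.4, 0.2, -1.2),       # draw), respecting alpha_j1 > ... > alpha_j4
               c(2.2, 1.1, 0.0, -1.5))

scale_draw <- sqrt(1 + (beta / 1.7)^2)       # one scaling factor per draw

lambda <- (beta / 1.7) / scale_draw          # Eq. 15: standardized loadings
tau    <- (-alpha / 1.7) / scale_draw        # Eq. 16: thresholds; the length-3 vector pairs
                                             # with rows because nrow(alpha) == length(beta)
delta  <- -alpha / beta                      # Eq. 17: difficulties (same row-wise pairing)

lambda; tau; delta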

4.2.2. Candidate Methods

The fiducial sampling algorithm was configured similarly to Study 1, with two exceptions: a) (1.5, 0.5, −0.5, −1.5) were now used as starting values for the four intercepts of each item, and b) the tuning parameter M, which determines the size of the initial bounding set on the parameter space, was set to 20. Similar to Study 1, we estimated each parameter by the fiducial median. Equi-tailed percentile fiducial CIs of model parameters were calculated at the 95% nominal level: the lower and upper confidence bounds for a particular item parameter are set to the 2.5th and 97.5th empirical percentiles of a random sample produced by the sampler. Fiducial distributions for transformed parameters (i.e., threshold, loading, and difficulty parameters) were approximated by transforming the Monte Carlo samples drawn from the original fiducial distribution under the slope-intercept parameterization.

The posterior median was used as the Bayesian point estimator. Bayesian equi-tailed percentile CIs were constructed in a fashion similar to the fiducial ones. For conciseness, only the Cauchy prior was considered for this part, since it surpassed the uniform prior in the previous simulation study: For each item, the slope followed a standard Cauchy prior distribution, and the intercepts were Cauchy order statistics independent of the slope. The JAGS control parameters remained the same as in the first simulation.

For ML estimation, we used 10^−4 as the convergence criterion in this section. Wald-type CIs for item slopes and intercepts were obtained from two types of standard errors resulting from two commonly used sample estimates of the Fisher information matrix: i.e., the cross-product form (in Mplus, set estimator = MLF) and the Hessian form (set estimator = ML). For transformed parameters, the Delta method is used, which is the default in Mplus. A sampling-based interval estimator, referred to as ML Monte Carlo subsequently, was also examined: 5000 random draws were generated from N(θ̂, [nI(θ̂)]^−1), i.e., a normal approximation to the sampling distribution of the ML estimates under the original slope-intercept parameterization, from which equi-tailed 95% CIs were calculated. This approach can be considered an approximation to the parametric bootstrap and was also referred to as a multiple-imputation method by Yang, Hansen, and Cai (2012). Due to its reliance on the normal approximation, ML Monte Carlo is computationally much more efficient than MCMC-based methods (i.e., fiducial and Bayesian methods); however, the resulting interval estimates are wider for certain parameters in small samples (see Fig. 5).
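A sketch of the ML Monte Carlo interval just described, assuming the ML solution `theta_hat` and its estimated covariance matrix `V_hat` (i.e., [nI(θ̂)]^−1) have already been obtained from an IRT program; the numeric values are placeholders, and MASS::mvrnorm supplies the multivariate normal draws.

## ML Monte Carlo CIs: draw from N(theta_hat, V_hat), transform, take percentiles.
library(MASS)   # provides mvrnorm()

theta_hat <- c(beta = 1.8, a1 = 2.4, a2 = 1.3, a3 = 0.1, a4 = -1.4)  # placeholder estimates
V_hat     <- diag(c(0.09, 0.06, 0.04, 0.04, 0.06))                   # placeholder covariance

set.seed(1)
draws <- mvrnorm(5000, mu = theta_hat, Sigma = V_hat)

## Equi-tailed 95% CI for the standardized loading (Eq. 15) of this item.
lambda_draws <- (draws[, "beta"] / 1.7) / sqrt(1 + (draws[, "beta"] / 1.7)^2)
quantile(lambda_draws, probs = c(0.025, 0.975))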

4.2.3. Results

Figure 5 summarizes the MAE for three types of point estimates, and the empirical coverage and length for five types of CIs. A CI is said to be liberal/conservative if its empirical coverage falls below 93.1%/above 96.9%, i.e., outside of a 95% normal-approximation confidence interval for the nominal coverage level over 500 replications. To better visualize the length comparison, we calculated the log ratio between the length of a CI and that of the fiducial CI (LLR): For example, LLR = log(1.1) = 0.095 if a CI is 10% wider than the fiducial CI, LLR = log(0.9) = −0.105 if it is 10% narrower, and LLR = 0 if it has the same length. The median LLR (MLLR) across 500 replications was computed as an overall length measure.


Figure 5. The median absolute error (MAE), empirical coverage, and median log length ratio (MLLR) of various candidate methods in Study 2. Five-category ordinal response data were generated from unidimensional graded response models (GRMs) under four sample size and test length combinations (n = 100, 500; m = 9, 18). The graphical table has five rows and two columns. Each row corresponds to one type of parameters. The first column contains boxplots for the MAE of the ML estimate (M), Bayesian posterior median (B), and fiducial median (F). The second column summarizes the coverage and length of Hessian-form Wald (H), cross-product-form Wald (C), ML Monte Carlo (M), Bayesian percentile (B), and fiducial percentile (F) 95% confidence intervals (CIs). Each cell in the second column has two panels. The top panel displays the empirical coverage, in which the 95% normal-approximation confidence limits for the nominal level are marked with dashed lines. The bottom panel displays the MLLR, in which 0 indicates that a CI is on average as long as the fiducial CI. An arrow and the median MLLR value are shown in place of the boxplot when the cross-product-form CI is much wider than other candidate methods.

For conciseness, boxplots for MAE, coverage, and MLLR by types of estimators, parameters, and conditions are presented in Fig. 5. Detailed coverage and length results for individual parameters can be found in Appendix H.

The MAE results suggest that the three types of point estimators are mostly comparable. The fiducial median is slightly more accurate for large slopes, especially when the sample size is small (n = 100), which corresponds to smaller third quartiles in the boxplots. For threshold parameters, a wider range of MAE values (i.e., wider boxes) is observed for the Bayesian posterior median in small samples.

We observe that generally the fiducial percentile CIs exhibit on-target coverage. Moreover, they are at least as short as other interval estimators in most scenarios. For the difficulty parameters of the low-communality items (items 1 to 3), the fiducial CIs are less efficient than the cross-product-form Wald intervals when n = 100 and m = 9, and slightly less efficient than the Bayesian intervals when n = 100 and m = 18; see Figures S.2 and S.4 in Appendix H. Consequently, we conclude that GFI is the most reliable approach in recovering item parameters among the four candidates considered in the current work.

The Hessian-form Wald CI, having been regarded as the gold-standard interval estimator associated with ML estimation, is the most comparable alternative to the fiducial CI. However, it can be liberal when applied to the loading and difficulty parameters in small samples (n = 100). Figures S.2 and S.4 in Appendix H suggest that the low coverage occurs for medium and high loadings (items 4, 6, 7, 8, and 9). In those cases, the true parameter values are close to the boundary of the parameter space (i.e., 1), and the quadratic approximation to the log-likelihood fails. Similar reasoning applies to the under-coverage for the difficulty parameters of the low-communality items (items 1 to 3): When a small slope co-exists with a large intercept, the resulting difficulty parameters tend to be large. When m = 18, improvement in the Hessian-form CIs' coverage is observed, especially for the difficulty parameters: It produces shorter intervals than the other methods while maintaining adequate coverage for most difficulty parameters of items 1–3. Besides, the Hessian-form Wald CI yields slightly longer intervals than the fiducial approach for extreme slope values when the sample size is small.

The cross-product-form Wald CIs, on the other hand, are too conservative in small sample conditions, resulting in substantially wider intervals than other candidate methods. The pattern is most salient when n = 100 and m = 18, in which case the coverages are almost always 1 and the intervals are about three times as wide as their competitors. Note that the unidimensional GRM is barely identified in this extreme condition: The number of parameters is 18 × 5 = 90, which is likely to cause numerical difficulty in inverting the cross-product information matrix; that may underlie the excessively conservative results.

ML Monte Carlo CIs, by construction, yield virtually the same results as the Hessian-form Wald CI under the original parameterization; the difference between the two methods is only expected for transformed parameters.
Through Monte Carlo sampling, the Hessian-form CI's problematic coverage for loading and difficulty parameters resulting from poor normal approximations (in small samples) is rectified. In the meantime, however, efficiency is sacrificed to a certain degree, which is reflected in the MLLR boxplots (Fig. 5). A closer inspection of Figures S.2 and S.4 in Appendix H reveals that for high factor loadings (items 7–9), as well as the corresponding threshold and difficulty parameters, ML Monte Carlo CIs can be more than 10% wider than the fiducial CIs.

As shown in Fig. 5 and Figure S.2 in Appendix H, the specific type of Bayesian CI considered in the simulation study can be very liberal for extreme intercept, threshold, and difficulty parameters when the test is short (m = 9) and the sample size is small (n = 100). The coverage can be as low as 0.75 under the nominal level of 0.95. Although coverage improves as the sample size increases, it can still fall short of the nominal level when n = 500. The coverage is improved in longer tests (m = 18), where the Bayesian CI behaves similarly to the fiducial and Hessian-form Wald ones. The observed inferior performance in short tests and small samples might be traced to the particular prior distribution we used; future research is encouraged to explore alternative prior configurations for improvement.

4.2.4. Discussion

The simulation results suggest that GFI often provides reliable and efficient interval estimates for various GRM parameters, even when the sample size is small, the true values are extreme, and/or the local quadratic approximation of the log-likelihood is poor. The particular Bayesian CI does not handle extreme intercept/threshold/difficulty parameters properly. In small samples, the Hessian-form Wald CI suffers from liberal coverage when the transformed parameter values are close to the boundary. Improvement is found when Monte Carlo sampling is used instead of the normal approximation, at a significant cost of efficiency nonetheless. The cross-product-form Wald CI, on the other hand, is almost always too conservative in small samples.

4.3. Study 3: A Bifactor Model

4.3.1. Simulation Design and Candidate Methods

In the third simulation study, we compare fiducial, likelihood-based, and Bayesian interval estimators in recovering bifactor model parameters. The simulated test comprises 8 trichotomous items, i.e., K_j = 3 for all j. Two sample size conditions were considered, n = 200 and 500; under each condition, 500 data sets were simulated. The data-generating model has a general latent variable that loads on all items in the test and two secondary latent variables that load on the first and second half of the test, respectively; the three latent variables are orthogonal to each other. Candidate interval estimators under comparison and configurations of estimation algorithms remain the same as in Study 2.

Similar to the previous study, we converted item parameters to the standardized scale and manipulated their values. The standardized factor loading vector under a multidimensional GRM can be expressed as

$$\boldsymbol{\lambda}_j = \frac{\boldsymbol{\beta}_j/1.7}{\sqrt{1 + \boldsymbol{\beta}_j'\boldsymbol{\beta}_j/1.7^2}}, \quad (18)$$

a generalization of Eq. 15. In the data-generating model, the loading/slope vector for each item j has only two non-zero elements, corresponding to the related primary and secondary latent variables for that item. In our simulation study, the two design factors that determine the true loading parameters are: a) two levels of communality λ_j'λ_j, with values 0.2 and 0.8, and b) four levels of relative impact of the general factor, explaining 80%, 60%, 40%, and 20% of the common variance, respectively. Similar to Eq. 16, the multidimensional version of the threshold parameter is defined as

$$\tau_{jk} = \frac{-\alpha_{jk}/1.7}{\sqrt{1 + \boldsymbol{\beta}_j'\boldsymbol{\beta}_j/1.7^2}}, \quad k = 1, 2. \quad (19)$$

The threshold parameters used in data generation were (−1.25, 0.25) for odd items, and (−0.25, 1.25) for even items. We are also interested in a multidimensional generalization of the item difficulty parameter:

$$\delta_{jk} = -\frac{\alpha_{jk}}{\sqrt{\boldsymbol{\beta}_j'\boldsymbol{\beta}_j}}. \quad (20)$$

The absolute value of δ_jk is the Euclidean distance from the mean of the latent variable, i.e., z_i = 0, to the (r − 1)-dimensional linear subspace α_jk + β_j'z_i = 0 (see Reckase, 2009), at which categories higher than or equal to k are as likely to be selected as categories lower than k.⁵ δ_jk has the same sign as −α_jk. When δ_jk < 0, the logistic function value Ψ(α_jk + β_j'0) > 1/2. It means that an average person in the population has more than a 50% chance of endorsing a category higher than or equal to k; intuitively speaking, threshold k is "easy to pass". Similarly, δ_jk > 0 implies that threshold k is "hard to pass". Table 3 displays the data-generating values of the five types of parameters.

Table 3.
Data-generating parameter values for the bifactor GRM (m = 8).

1  Low   general variance 80%  loadings (0.40, 0.20, 0.00)  thresholds (−1.25, 0.25)  slopes (0.76, 0.38, 0.00)  intercepts (2.38, −0.48)  difficulties (−2.80, 0.56)
2  Low   general variance 60%  loadings (0.35, 0.28, 0.00)  thresholds (−0.25, 1.25)  slopes (0.66, 0.54, 0.00)  intercepts (0.48, −2.38)  difficulties (−0.56, 2.80)
3  Low   general variance 40%  loadings (0.28, 0.35, 0.00)  thresholds (−1.25, 0.25)  slopes (0.54, 0.66, 0.00)  intercepts (2.38, −0.48)  difficulties (−2.80, 0.56)
4  Low   general variance 20%  loadings (0.20, 0.40, 0.00)  thresholds (−0.25, 1.25)  slopes (0.38, 0.76, 0.00)  intercepts (0.48, −2.38)  difficulties (−0.56, 2.80)
5  High  general variance 80%  loadings (0.80, 0.00, 0.40)  thresholds (−1.25, 0.25)  slopes (3.04, 0.00, 1.52)  intercepts (4.75, −0.95)  difficulties (−1.40, 0.28)
6  High  general variance 60%  loadings (0.69, 0.00, 0.57)  thresholds (−0.25, 1.25)  slopes (2.63, 0.00, 2.15)  intercepts (0.95, −4.75)  difficulties (−0.28, 1.40)
7  High  general variance 40%  loadings (0.57, 0.00, 0.69)  thresholds (−1.25, 0.25)  slopes (2.15, 0.00, 2.63)  intercepts (4.75, −0.95)  difficulties (−1.40, 0.28)
8  High  general variance 20%  loadings (0.40, 0.00, 0.80)  thresholds (−0.25, 1.25)  slopes (1.52, 0.00, 3.04)  intercepts (0.95, −4.75)  difficulties (−0.28, 1.40)
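The multidimensional transformations in Eqs. 18-20 replace the scalar slope with the item's slope vector; the short R sketch below (ours) applies them to the generating values of item 5 in Table 3 and recovers the tabulated loading, threshold, and difficulty values.

## Multidimensional transformations (Eqs. 18-20) for one bifactor item (item 5 of Table 3).
beta  <- c(3.04, 0.00, 1.52)   # slopes on the general and two secondary factors
alpha <- c(4.75, -0.95)        # intercepts, alpha_j1 > alpha_j2

scale_j <- sqrt(1 + sum(beta^2) / 1.7^2)

(beta / 1.7) / scale_j         # Eq. 18: loadings, approximately (0.80, 0.00, 0.40)
(-alpha / 1.7) / scale_j       # Eq. 19: thresholds, approximately (-1.25, 0.25)
-alpha / sqrt(sum(beta^2))     # Eq. 20: difficulties, approximately (-1.40, 0.28)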

⁵ Eq. S.1 implies P{Y_ij ≥ k | Z_i = z_i} = Ψ(α_jk + β_j'z_i).

4.3.2. Results

Similar to the summary of the unidimensional results, we present in Fig. 6 boxplots for the MAE of point estimates, and the empirical coverage and MLLR of interval estimates. Readers are referred to Appendix H for more detailed coverage and length results for each item parameter.

The three point estimators behave similarly in terms of MAE for most parameters under both sample size conditions. When the sample size is small (n = 200), the fiducial median is slightly more accurate for slopes in general. Meanwhile, the Bayesian posterior median has noticeably taller boxes for slopes and loadings, indicating larger estimation error for some parameters.

There were 9 replications in which the Hessian matrix evaluated at the ML estimates was not positive definite when n = 200, and 6 such replications when n = 500; in those cases, the Hessian-form intervals cannot be calculated. The coverage of Hessian-form CIs can be problematic for slopes, loadings, and difficulties. Figure S.6 in Appendix H shows that the liberal coverage for primary slopes is more severe for high communality items (items 1–4), whereas that for secondary slopes is more severe for low communality items (items 5–8). Similar to the unidimensional results reported in Sect. 4.2, the Hessian-form CI under-covers when the parameter is closer to the boundary of the parameter space, including large loadings and difficulties for low communality items. The performance improves as the sample size increases; however, liberal coverage is observed for most loading parameters even when n = 500 (see Figure S.7 in Appendix H).

The cross-product-form CI outperforms the Hessian-form CI for slope, intercept, and difficulty parameters: It often exhibits on-target coverage without trading in too much efficiency. It is the most efficient CI for the secondary slopes and intercepts of low communality items (items 1–4) among the methods exhibiting adequate coverage. However, even the cross-product-form CI is not able to attain the nominal level for the loading parameters when the true values are close to the boundary (see Figures S.6 and S.7 in Appendix H).

Since the normal approximation is problematic under the original slope-intercept parameterization, the ML Monte Carlo method does not completely address the coverage problem for transformed parameters in small samples. As shown in Figures S.6 and S.7 in Appendix H, it still under-covers most loading parameters and the difficulty parameters of low communality items (items 1–4). In addition, the ML Monte Carlo interval is noticeably wider than the other candidates for certain parameters, e.g., the loadings for item 8 and the thresholds for items 5 and 8.

The fiducial CI is able to maintain on-target coverage for most parameters. Its length is often comparable to other methods; however, for the slopes and intercepts of low communality items (items 1–4), the fiducial CI is wider than the normal-approximation CIs. Figures S.6 and S.7 in Appendix H further show that even on those occasions when the fiducial CI is slightly liberal, it usually still performs better than other candidate methods.

The Bayesian CI fares slightly better than the Hessian-form CI, but not as well as the fiducial CI. For high communality items, it still yields liberal coverage for slopes and the corresponding loadings (see Figures S.6 and S.7 in Appendix H).
Improvement in both coverage and efficiency is observed when n = 500; however, the coverage is still lower than the nominal level for several parameters, and the Bayesian CIs can still be wider than the other candidates.
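For concreteness, the three summary measures reported in Fig. 6 can be computed from replication-level output roughly as follows. This is a sketch with hypothetical arrays; MAE is taken here as the median absolute error and MLLR as the median log length ratio relative to the fiducial CI, per the figure caption.

```python
import numpy as np

def mae(estimates, true_value):
    """Median absolute error of a point estimator across replications."""
    return np.median(np.abs(estimates - true_value))

def coverage(lower, upper, true_value):
    """Empirical coverage: proportion of intervals containing the true value."""
    return np.mean((lower <= true_value) & (true_value <= upper))

def mllr(lower, upper, fid_lower, fid_upper):
    """Median log length ratio relative to the fiducial interval:
    0 means the candidate CI is on average as long as the fiducial CI."""
    return np.median(np.log((upper - lower) / (fid_upper - fid_lower)))

# Hypothetical arrays, one entry per Monte Carlo replication for one parameter.
rng = np.random.default_rng(1)
true_value = 1.25
est = true_value + rng.normal(0, 0.2, size=500)
half = 1.96 * 0.2
lo, hi = est - half, est + half                        # a Wald-type interval
fid_lo, fid_hi = est - 1.1 * half, est + 1.1 * half    # stand-in for fiducial limits

print(mae(est, true_value), coverage(lo, hi, true_value), mllr(lo, hi, fid_lo, fid_hi))
```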

4.3.3. Discussion In general, the findings in this section confirm our recommendations made in Study 2 (Sect. 4.2): GFI is reliable and well suited for small-sample calibrations, whereas likelihood-based and Bayesian inference may not be trusted under certain parameterizations. We also observe that, under the more complex bifactor GRM, the performance of the candidate CIs under certain parameterizations can still be problematic even when n = 500, which suggests that n = 500 may still not be a "large enough" sample under the bifactor model.

Table 3.
Data-generating parameter values for the bifactor GRM (m = 8).

Item | Communality | General variance | λj1 | λj2 | λj3
1 | Low | 80% | 0.40 | 0.20 | 0.00
2 | Low | 60% | 0.35 | 0.28 | 0.00
3 | Low | 40% | 0.28 | 0.35 | 0.00
4 | Low | 20% | 0.20 | 0.40 | 0.00
5 | High | 80% | 0.80 | 0.00 | 0.40
6 | High | 60% | 0.69 | 0.00 | 0.57
7 | High | 40% | 0.57 | 0.00 | 0.69
8 | High | 20% | 0.40 | 0.00 | 0.80

(The full table also lists, for each item, the corresponding slopes βj1–βj3, thresholds τj1 and τj2, intercepts αj1 and αj2, and difficulties δj1 and δj2.)
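As a quick check on the structure of Table 3, the sketch below recovers each item's communality and general-factor variance share from the tabulated loadings, assuming the standard factor-analytic identity that communality equals the sum of squared loadings; the printed values match the table up to rounding.

```python
import numpy as np

# Loadings (lambda_j1, lambda_j2, lambda_j3) for the eight items in Table 3.
loadings = np.array([
    [0.40, 0.20, 0.00],   # item 1
    [0.35, 0.28, 0.00],   # item 2
    [0.28, 0.35, 0.00],   # item 3
    [0.20, 0.40, 0.00],   # item 4
    [0.80, 0.00, 0.40],   # item 5
    [0.69, 0.00, 0.57],   # item 6
    [0.57, 0.00, 0.69],   # item 7
    [0.40, 0.00, 0.80],   # item 8
])

communality = (loadings ** 2).sum(axis=1)            # total variance explained per item
general_share = loadings[:, 0] ** 2 / communality    # share attributable to the general factor

for j, (c, g) in enumerate(zip(communality, general_share), start=1):
    print(f"item {j}: communality = {c:.2f}, general variance = {g:.0%}")
```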


Figure 6. The median absolute error (MAE), empirical coverage, and median log length ratio (MLLR) of various candidate methods in Study 3. Three-category ordinal response data were generated from a bifactor graded response model (GRM) under two sample size conditions (n = 200, 500); the true parameters are tabulated in Table 3. The graphical table has five rows and two columns. Each row corresponds to one type of parameter. The first column contains boxplots for the MAE of the ML estimate (M), Bayesian posterior median (B), and fiducial median (F). The second column summarizes the coverage and length of Hessian-form Wald (H), cross-product-form Wald (C), ML Monte Carlo (M), Bayesian percentile (B), and fiducial percentile (F) 95% confidence intervals (CIs). Each cell in the second column has two panels. The top panel displays the empirical coverage, in which the 95% normal-approximation confidence limits for the nominal level are marked with dashed lines. The bottom panel displays the MLLR, in which 0 indicates that a CI is on average as long as the fiducial CI.

5. Empirical Example

Ordinal response data from the Patient-Reported Outcomes Measurement Information System (PROMIS; Irwin et al., 2010) study of chronic illness are analyzed using the proposed implementation of GFI. We illustrate that GFI, similar to Bayesian inference, is more flexible and straightforward than traditional likelihood-based methods in terms of accounting for the sampling variability of parameter estimates, a prominent aspect of many inferential procedures that has long been ignored in practice. Meanwhile, GFI does not require prior specification, bypassing the subjective judgment involved therein, which calls for expertise and scrutiny in practice.

The data set comprises 455 complete responses to 22 short-form items that are designed to measure three aspects of emotional distress: anger (6 items), anxiety (8 items), and depression (8 items). Although the proposed method is able to handle missing data in a natural fashion, we applied listwise deletion in order to mimic the scenario of small-sample calibrations. All items have five response categories. The common response scale ranges from 0 to 4: 0 = never, 1 = almost never, 2 = sometimes, 3 = often, and 4 = almost always. The item stems are tabulated in Table 4.

We fit a bifactor model with a primary dimension on which all items load, and three secondary dimensions for items measuring specifically anger, anxiety, and depression, respectively. Parameter estimates from the default limited-information estimator in Mplus (i.e., estimator = WLSMV) were obtained in advance and set as the starting values for item parameters in the proposed sampler (Algorithm 1); the corresponding factor score estimates (i.e., save = fscores) were used as the starting values for the normal variates. Other specifications of the sampling algorithm remain unchanged.

The highest fiducial density regions (at nominal coverage levels 75, 90, and 95%) for each item's primary and secondary factor loading pairs are displayed in Figs. 7, 8, and 9. To obtain the contours on each two-dimensional parameter space, we used the R package ks (Duong, 2014) to estimate a two-dimensional density by kernel smoothing from the 5000 fiducial samples. We selected the bandwidth by the plug-in method (Wand & Jones, 1994) using the function Hpi() in the ks package. The implementation relies on the optimizer nlm or optim, which was found to be slow when the number of data points is large. As a result, a further thinned sample of 500 Monte Carlo draws was extracted for bandwidth selection, while the entire sample was still used for the subsequent density estimation.

Table 4.
PROMIS emotional distress short-form items.

Label | Item stem
Ang1 | I felt fed up
Ang2 | I felt mad
Ang3 | I felt upset
Ang4 | I was so angry I felt like throwing something
Ang5 | I was so angry I felt like yelling at somebody
Ang6 | When I got mad, I stayed mad
Anx1 | I worried about what could happen to me
Anx2 | I was afraid that I would make mistakes
Anx3 | I felt nervous
Anx4 | I felt like something awful might happen
Anx5 | I felt scared
Anx6 | I worried when I went to bed at night
Anx7 | I thought about scary things
Anx8 | I felt worried
Dep1 | I felt alone
Dep2 | I felt like I couldn't do anything right
Dep3 | I felt everything in my life went wrong
Dep4 | I felt sad
Dep5 | I thought that my life was bad
Dep6 | I could not stop feeling sad
Dep7 | I felt lonely
Dep8 | I felt unhappy
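For readers who prefer a programmatic sketch of the contour construction just described, the following Python analogue uses scipy's gaussian_kde in place of the plug-in bandwidth estimator in ks, and simulated draws in place of the actual fiducial sample; it shows how highest-density-region thresholds can be obtained from a thinned bandwidth-selection subsample and the full set of draws.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)
# Stand-in for 5000 fiducial draws of one item's (primary, secondary) loading pair.
draws = rng.multivariate_normal([0.7, 0.3], [[0.004, 0.001], [0.001, 0.006]], size=5000)

# Bandwidth factor chosen from a thinned subsample (cf. the thinning described above),
# then a density estimate built from the entire sample with that factor.
thinned = draws[::10]
bw = gaussian_kde(thinned.T).factor
kde = gaussian_kde(draws.T, bw_method=bw)
dens = kde(draws.T)                      # density of each draw under the estimate

# Highest-density-region thresholds: the contour at threshold t_c encloses
# roughly the fraction c of the fiducial mass.
levels = {c: np.quantile(dens, 1 - c) for c in (0.75, 0.90, 0.95)}
medians = np.median(draws, axis=0)       # fiducial medians (cross symbols in Figs. 7-9)
print(levels, medians)
# Plotting the kde on a grid and drawing contours at these levels (e.g., with
# matplotlib's contour) would reproduce regions analogous to those in Figs. 7-9.
```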


Figure 7. Two-dimensional confidence regions for the primary and secondary loadings of anger items. The points are 500 draws selected via a further thinning interval of 10. The three contours shown on each panel are the 75%, 90%, and 95% highest fiducial density regions. The fiducial medians are marked by the cross symbols, and their numerical values are also displayed next to the axes. The diagonal dashed line indicates that an item contributes evenly to the primary and secondary factors.

We are able to visualize the relative contributions of the primary and secondary factors to each item on those bivariate plots: The primary factor dominates if the point cloud is below the diagonal line. This is the case for the anger item "I felt upset" (Ang3), the anxiety item "I was afraid that I would make mistakes" (Anx2), and all but the two locally dependent depression items. This pattern also suggests that a testlet response model (Bradlow, Wainer, & Wang, 1999), which requires the relative contribution of the primary and secondary factors to be constant among all items within the same symptom domain (i.e., a testlet), would not fit these data well. In addition, we observe that the depression dimension is dominated by the "alone/lonely" pair (Dep1 and Dep7): They are the only items with significant secondary loadings. It implies that this particular secondary dimension only captures the wording similarity of the two locally dependent items above and beyond the general construct of emotional distress (e.g., Liu & Thissen, 2014; Thissen & Steinberg, 2010, p. 131). Non-elliptical point clouds of fiducial samples are also observed for those two items.

We also study the impact of sampling error in reliability analysis. Various methods have been proposed to quantify the reliability of a scale in the context of IRT. A popular method is to compute the Fisher information matrix with respect to the latent variables $\mathbf{z}_i$ for fixed item parameters $\boldsymbol{\theta}$; for easy reference, we denote it by $J(\mathbf{z}_i, \boldsymbol{\theta})$. Also let $f(\boldsymbol{\theta}, \mathbf{y}_i \mid \mathbf{z}_i) = \prod_{j=1}^{22} f_j(\boldsymbol{\theta}_j, y_{ij} \mid \mathbf{z}_i)$ be the conditional probability of an individual response pattern $\mathbf{y}_i$. It can be verified by direct calculation that

$$J(\mathbf{z}_i, \boldsymbol{\theta}) = -E_{\boldsymbol{\theta}}\left[\frac{\partial^2}{\partial\mathbf{z}_i\,\partial\mathbf{z}_i^\top}\log f(\boldsymbol{\theta}, \mathbf{y}_i \mid \mathbf{z}_i)\right] = \sum_{j=1}^{22}\sum_{k=0}^{4}\frac{1}{f_j(\boldsymbol{\theta}_j, k \mid \mathbf{z}_i)}\left[\frac{e^{\alpha_{jk}+\boldsymbol{\beta}_j^\top\mathbf{z}_i}}{[1+e^{\alpha_{jk}+\boldsymbol{\beta}_j^\top\mathbf{z}_i}]^2} - \frac{e^{\alpha_{j,k+1}+\boldsymbol{\beta}_j^\top\mathbf{z}_i}}{[1+e^{\alpha_{j,k+1}+\boldsymbol{\beta}_j^\top\mathbf{z}_i}]^2}\right]^2 \boldsymbol{\beta}_j\boldsymbol{\beta}_j^\top. \qquad (21)$$
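A direct transcription of Eq. 21 may make the computation concrete. The sketch below sums over whatever items are supplied rather than the 22 PROMIS items, and the item values shown are invented for illustration only.

```python
import numpy as np

def fisher_information(z, alphas, betas):
    """Latent-variable Fisher information J(z, theta) of Eq. 21 for the graded model.

    alphas[j] holds the intercepts (alpha_j1, ..., alpha_j,K-1) of item j in
    decreasing order, betas[j] its slope vector; boundary curves use the logistic
    form P{Y_ij >= k | z} = exp(a + b'z) / (1 + exp(a + b'z)).
    """
    J = np.zeros((len(z), len(z)))
    for alpha_j, beta_j in zip(alphas, betas):
        eta = alpha_j + beta_j @ z
        p_ge = np.concatenate(([1.0], 1 / (1 + np.exp(-eta)), [0.0]))    # boundary probs
        dpsi = np.concatenate(([0.0], np.exp(eta) / (1 + np.exp(eta)) ** 2, [0.0]))
        f_jk = p_ge[:-1] - p_ge[1:]                                      # category probs
        w = ((dpsi[:-1] - dpsi[1:]) ** 2 / f_jk).sum()                   # scalar weight
        J += w * np.outer(beta_j, beta_j)
    return J

# Two illustrative five-category items on a two-dimensional latent space (made-up values).
alphas = [np.array([2.0, 0.5, -0.5, -2.0]), np.array([1.5, 0.0, -1.0, -2.5])]
betas = [np.array([1.2, 0.4]), np.array([0.9, 0.7])]
J = fisher_information(np.zeros(2), alphas, betas)
print(J)   # its inverse gives the asymptotic covariance of the ML latent-variable estimates
```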


Figure 8. Two-dimensional confidence regions for the primary and secondary loadings of anxiety items. The points are 500 draws selected via a further thinning interval of 10. The three contours shown on each panel are the 75%, 90%, and 95% highest fiducial density regions. The fiducial medians are marked by the cross symbols, and their numerical values are also displayed next to the axes. The diagonal dashed line indicates that an item contributes evenly to the primary and secondary factors.

The inverse of $J(\mathbf{z}_i, \boldsymbol{\theta})$ gives the asymptotic covariance matrix for the ML estimates of the latent variables. For the four-dimensional model being fitted here, $\mathbf{z}_i \in \mathbb{R}^4$. For each dimension $d \in \{1, 2, 3, 4\}$, we define the marginal standard error function (MSEF) as follows:


$$\upsilon_d(z_{id}, \boldsymbol{\theta}) = \int \sqrt{\sigma_d^2(\mathbf{z}_i, \boldsymbol{\theta})}\,\varphi(\mathbf{z}_{i,-d})\,d\mathbf{z}_{i,-d}, \qquad (22)$$

in which $\sigma_d^2(\mathbf{z}_i, \boldsymbol{\theta})$ is the $d$th diagonal element of $J(\mathbf{z}_i, \boldsymbol{\theta})^{-1}$, and the integral is taken with respect to the remaining three dimensions, denoted by $\mathbf{z}_{i,-d}$, with $\varphi$ denoting their (standard normal) density. $\upsilon_d(z_{id}, \boldsymbol{\theta})$ gauges the average precision of the scale at each level $z_{id}$ of a particular dimension $d$. Most often in practice, item parameters need to be calibrated, and thus the plug-in version of the asymptotic covariance matrix evaluated at the point estimates is subject to sampling variability. With the aid of a fiducial sample of item parameters, the carry-over impact of this sampling variability can be easily integrated into reliability analysis (cf. the multiple-imputation approach of Yang et al., 2012). For fixed $z_{id}$, $\upsilon_d(z_{id}, \boldsymbol{\theta})$ is just a single transformed parameter, and thus its 95% fiducial CI can be computed as usual. Pooling across all $z_{id} \in \mathbb{R}$, we obtain a pointwise confidence band (CB) for the MSEF. Three-dimensional integration is needed in the numerical evaluation of Eq. 22; to reduce the computational burden, we use 21 quadrature points on each dimension. The fiducial medians and 95% pointwise CBs for the MSEFs are shown in Fig. 10.

Figure 9. Two-dimensional confidence regions for the primary and secondary loadings of depression items. The points are 500 draws selected via a further thinning interval of 10. The three contours shown on each panel are the 75%, 90%, and 95% highest fiducial density regions. The fiducial medians are marked by the cross symbols, and their numerical values are also displayed next to the axes. The diagonal dashed line indicates that an item contributes evenly to the primary and secondary factors.

We observe in Fig. 10 that the depression-specific latent variable is poorly measured at all levels; this is anticipated because the dimension is only effectively indicated by the "lonely/alone" pair. The general dimension is often more precisely assessed compared to the specific symptom dimensions, for the reason that all 22 items load on the general dimension and provide information about individual differences in emotional distress symptoms. Because the threshold parameters are skewed, the measurement error at the low end of the latent continuum is larger.

Figure 10. The fiducial median and 95% confidence bands (CBs) for the marginal standard error curves of the four dimensions. Pointwise CBs are shown in colored dashed lines. Note that the lower-right panel for depression has a different y-axis from the rest.
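A rough sketch of how Eq. 22 and the pointwise band can be evaluated numerically is given below. The quadrature rule is a normalized rectangular rule on a standard normal grid (the paper's exact rule is not specified beyond the 21 points per dimension), and info_fn stands for any function returning $J(\mathbf{z}_i, \boldsymbol{\theta})$, e.g., the fisher_information sketch after Eq. 21 evaluated at one fiducial draw of the item parameters; a toy information function is used here so the sketch runs standalone.

```python
import itertools
import numpy as np
from scipy.stats import norm

def msef(d, z_d_grid, info_fn, n_dim=4, n_quad=21):
    """Marginal standard error function (Eq. 22) for dimension d (0-based)."""
    nodes = np.linspace(-4.0, 4.0, n_quad)
    w1 = norm.pdf(nodes)
    w1 /= w1.sum()                                   # normalized 1-D weights
    other = [k for k in range(n_dim) if k != d]      # dimensions integrated out
    out = np.empty(len(z_d_grid))
    for i, z_d in enumerate(z_d_grid):
        total = 0.0
        for idx in itertools.product(range(n_quad), repeat=n_dim - 1):
            z = np.empty(n_dim)
            z[d] = z_d
            z[other] = nodes[list(idx)]
            cov = np.linalg.inv(info_fn(z))          # asymptotic covariance at z
            total += w1[list(idx)].prod() * np.sqrt(cov[d, d])
        out[i] = total
    return out

def toy_info(z):
    """Stand-in information function; replace with the GRM information of Eq. 21."""
    return np.eye(4) * (1.0 + z @ z)

grid = np.linspace(-2, 2, 5)
print(msef(0, grid, toy_info))
# Evaluating msef at each fiducial draw of the item parameters and taking the
# 2.5th, 50th, and 97.5th percentiles across draws yields the pointwise band in Fig. 10.
```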

6. Discussion and Conclusion

We have derived generalized fiducial inference (GFI) for a family of multidimensional graded response models (GRMs). It can be rigorously established that GFI for the GRM yields asymptotically correct inference in the frequentist sense, equivalent to the likelihood-based and Bayesian methods that have been extensively studied in the literature. Furthermore, we have shown by Monte Carlo simulations that GFI using the proposed Gibbs sampler is reliable for parameter recovery, even in situations in which the sample size is too small and/or the data-generating parameters are too extreme for the likelihood-based and Bayesian counterparts to behave well. The usefulness and flexibility of the proposed method have been illustrated with an empirical example. We conclude that GFI is a preferred inferential framework for calibrating ordinal items in small samples, and that sampling variability, which is a more salient issue in small-sample data analyses, can be accounted for easily with a Monte Carlo approximation of the fiducial distribution.

There are several remaining challenges to be addressed by future research. First, theoretical explanations of the superiority of fiducial interval estimates in small samples should be sought. The higher-order expansion of the fiducial distribution function was recently studied by Pal Majumder and Hannig (2016), in which a shrinkage argument (Datta & Mukerjee, 2004; Ghosh & Bickel, 1990) was exploited to derive conditions under which the fiducial probability and the frequentist coverage probability are first- and second-order matching. Inspired by their work, we conjecture that fiducial CIs for GRM item parameters may have more favorable higher-order asymptotic properties compared to the normal-approximation CIs. A more solid justification for the use of GFI in small samples could be established by examining the asymptotic expansion of the fiducial distribution.

Second, efforts should be devoted to improving the computational efficiency of the sampling algorithm. The proposed Gibbs sampler slows down significantly as the dimensionality of the polytopes increases; the chains may also become slow-mixing in large samples. Although it has been argued that the sampler is useful in small samples, wherein the advantage of fiducial CIs is the most salient, a more efficient sampling algorithm could further enhance the usefulness of GFI in real data problems. Alternative Monte Carlo methods such as sequential Monte Carlo (SMC; Doucet, De Freitas, & Gordon, 2001) have been successfully used for GFI in the context of linear-normal mixed effects models (Cisewski & Hannig, 2012). Owing to their non-iterative nature, SMC samplers can be more efficient than the Gibbs sampler, and their use in GFI for IRT modeling should be explored.

Finally, the application of GFI to other item response models should be pursued. In the current work, we focus only on a family of GRMs in which the covariance structure of the latent variables is known, e.g., exploratory and bifactor models. In practice, however, simple-structure models (also known as independent cluster models) and general two-tier models (i.e., replacing the general factor in a bifactor model by multiple factors with an unconstrained covariance structure; Cai, 2010c) might be preferred over exploratory models for estimation efficiency and ease of interpretation.
In addition, it is also of interest to derive GFI for unordered polytomous item response models such as Bock's nominal response model (Bock, 1972), and for models with categorical latent variables such as cognitive diagnostic models (Rupp, Templin, & Henson, 2010) and latent class models (Lazarsfeld & Henry, 1968). We have observed in our simulation studies that GFI is more reliable than ML in the presence of empirical model identification difficulties, which renders it a promising alternative for psychometric models in which the ML estimator is known to be ill-behaved.

Acknowledgments

We are grateful to Drs. David Thissen, Daniel Bauer, Patrick Curran, and Andrea Hussong from the Department of Psychology at the University of North Carolina at Chapel Hill, and Drs. Shelby Haberman and Yi-Hsuan Lee at Educational Testing Service (ETS), for their valuable advice and feedback on this paper. The work was sponsored by the Harold Gulliksen Psychometric Research Fellowship generously offered by ETS. Jan Hannig's research was supported in part by the National Science Foundation under Grant Nos. 1512945 and 1633074.

References

Agresti, A. (2002). Categorical data analysis. Hoboken, NJ: Wiley.
Bickel, P. J., & Doksum, K. A. (2015). Mathematical statistics: Basic ideas and selected topics (2nd ed., Vol. I). Boca Raton, FL: CRC Press.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Reading, MA: Addison-Wesley.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37(1), 29–51.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459.
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35(2), 179–197.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64(2), 153–168.
Cai, L. (2008). SEM of another flavour: Two new applications of the supplemented EM algorithm. British Journal of Mathematical and Statistical Psychology, 61(2), 309–329.
Cai, L. (2010a). High-dimensional exploratory item factor analysis by a Metropolis-Hastings Robbins-Monro algorithm. Psychometrika, 75(1), 33–57.
Cai, L. (2010b). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35(3), 307–335.
Cai, L. (2010c). A two-tier full-information item factor analysis model with applications. Psychometrika, 75(4), 581–612.
Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO for Windows [Computer software manual]. Lincolnwood, IL: Scientific Software International.
Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., et al. (2016). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1–32.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. http://www.jstatsoft.org/v48/i06/.
Cisewski, J., & Hannig, J. (2012). Generalized fiducial inference for normal linear mixed models. The Annals of Statistics, 40(4), 2102–2127.
Curtis, S. M. (2010). BUGS code for item response theory. Journal of Statistical Software, 36(1), 1–34.
Datta, G. S., & Mukerjee, R. (2004). Probability matching priors: Higher order asymptotics. New York: Springer.
Doucet, A., De Freitas, N., & Gordon, N. (2001). An introduction to sequential Monte Carlo methods. New York: Springer.
Duong, T. (2014). ks: Kernel smoothing [Computer software manual]. R package version 1.9.3. http://CRAN.R-project.org/package=ks.
Edwards, M. C. (2010). A Markov chain Monte Carlo approach to confirmatory item factor analysis. Psychometrika, 75(3), 474–497.
Efron, B. (1998). R. A. Fisher in the 21st century. Statistical Science, 13(2), 95–114.
Efron, B., & Tibshirani, R. (1994). An introduction to the bootstrap. Boca Raton, FL: CRC Press. Retrieved from https://books.google.com/books?id=gLlpIUxRntoC.
Fisher, R. A. (1930). Inverse probability. Proceedings of the Cambridge Philosophical Society, 26, 528–535.
Fisher, R. A. (1933). The concepts of inverse probability and fiducial probability referring to unknown parameters. Proceedings of the Royal Society of London Series A, 139(838), 343–348.
Fisher, R. A. (1935). The fiducial argument in statistical inference. Annals of Eugenics, 6(4), 391–398.
Forero, C. G., Maydeu-Olivares, A., & Gallardo-Pujol, D. (2009). Factor analysis with ordinal indicators: A Monte Carlo study comparing DWLS and ULS estimation. Structural Equation Modeling, 16(4), 625–641.
Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360–1383.
Ghosh, J., & Bickel, P. J. (1990). A decomposition for the likelihood ratio statistic and the Bartlett correction: A Bayesian argument. Annals of Statistics, 18(3), 1070–1090.
Haberman, S. J. (2006). Adaptive quadrature for item response models. ETS Research Report Series, 2006(2), 1–10.
Haberman, S. J. (2013). A general program for item-response analysis that employs the stabilized Newton-Raphson algorithm. ETS Research Report Series, 2013(2). doi:10.1002/j.2333-8504.2013.tb02339.x.
Hamilton, M. (1960). A rating scale for depression. Journal of Neurology, Neurosurgery, and Psychiatry, 23(1), 56–62.
Hannig, J. (2009). On generalized fiducial inference. Statistica Sinica, 19(2), 491.
Hannig, J. (2013). Generalized fiducial inference via discretization. Statistica Sinica, 23(2), 489–514.
Hannig, J., Iyer, H., Lai, R. C. S., & Lee, T. C. M. (2015). Generalized fiducial inference: A review (Unpublished manuscript).
Hill, C. D. (2004). Precision of parameter estimates for the graded item response model (Unpublished master's thesis). The University of North Carolina at Chapel Hill.
Houts, C. R., & Cai, L. (2013). flexMIRT user's manual version 2: Flexible multilevel multidimensional item analysis and test scoring [Computer software manual]. Chapel Hill, NC: Vector Psychometric Group.
Irwin, D. E., Stucky, B., Langer, M. M., Thissen, D., DeWitt, E. M., Lai, J. S., et al. (2010). An item response analysis of the pediatric PROMIS anxiety and depressive symptoms scales. Quality of Life Research, 19(4), 595–607.
Kieftenbeld, V., & Natesan, P. (2012). Recovery of graded response model parameters: A comparison of marginal maximum likelihood and Markov chain Monte Carlo estimation. Applied Psychological Measurement, 36(5), 399–419.
Lehmann, E. (1999). Elements of large-sample theory. New York, NY: Springer. Retrieved from https://books.google.com/books?id=geIoxvgTXlEC.
Liu, Y. (2015). Generalized fiducial inference for graded response models (Doctoral dissertation). Retrieved from ProQuest Dissertations and Theses (Accession No. UNC15157).
Liu, Y., & Hannig, J. (2016). Generalized fiducial inference for binary logistic item response models. Psychometrika, 81(2), 290–324.

Liu, Y., & Thissen, D. (2014). Comparing score tests and other local dependence diagnostics for the graded response model. British Journal of Mathematical and Statistical Psychology, 67(3), 496–513.
Meng, X. L., & Schilling, S. (1996). Fitting full-information item factor models and an empirical investigation of bridge sampling. Journal of the American Statistical Association, 91(435), 1254–1267.
Muthén, L. K., & Muthén, B. O. (2012). Mplus user's guide [Computer software manual]. Los Angeles, CA: Muthén & Muthén.
Pal Majumder, A., & Hannig, J. (2016). Higher order asymptotics of generalized fiducial distribution (Unpublished manuscript).
Plummer, M. (2013a). JAGS version 3.4.0 user manual [Computer software manual]. http://sourceforge.net/mcmc-jags/files/Manuals/3.x/.
Plummer, M. (2013b). rjags: Bayesian graphical models using MCMC [Computer software manual]. R package version 3-10. http://CRAN.R-project.org/package=rjags.
Reckase, M. (2009). Multidimensional item response theory. New York: Springer.
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic assessment: Theory, methods, and applications. New York: Guilford.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika monograph (Vol. 17). Richmond, VA: Psychometric Society.
Schilling, S., & Bock, R. D. (2005). High-dimensional maximum marginal likelihood item factor analysis by adaptive quadrature. Psychometrika, 70(3), 533–555.
Schweder, T., & Hjort, N. L. (2002). Confidence and likelihood. Scandinavian Journal of Statistics, 29(2), 309–332.
Spiegelhalter, D., Thomas, A., Best, N., & Lunn, D. (2010). OpenBUGS version 3.1.1 user manual. http://www.openbugs.info/.
Thissen, D., & Hill, C. D. (2004). Infinite slope estimates in item response theory. Presentation at the annual meeting of the Psychometric Society, Monterey, CA, June 14–17.
Thissen, D., & Steinberg, L. (1988). Data analysis using item response theory. Psychological Bulletin, 104(3), 385–395.
Thissen, D., & Steinberg, L. (2010). Using item response theory to disentangle constructs at different levels of generality. In S. Embretson (Ed.), Measuring psychological constructs: Advances in model-based approaches (pp. 123–144). Washington, DC: American Psychological Association.
van der Vaart, A. W. (2000). Asymptotic statistics. New York: Cambridge University Press.
Wand, M. P., & Jones, M. C. (1994). Kernel smoothing. London: Chapman and Hall.
Wirth, R., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12(1), 58.
Xie, M., & Singh, K. (2013). Confidence distribution, the frequentist distribution estimator of a parameter: A review. International Statistical Review, 81(1), 3–39.
Yang, J. S., Hansen, M., & Cai, L. (2012). Characterizing sources of uncertainty in item response theory scale scores. Educational and Psychological Measurement, 72(2), 264–290.
Yuan, K. H., Cheng, Y., & Patton, J. (2014). Information matrices and standard errors for MLEs of item parameters in IRT. Psychometrika, 79(2), 232–254.
Zabell, S. L. (1992). R. A. Fisher and fiducial argument. Statistical Science, 7(3), 369–387.

Manuscript Received: 17 FEB 2016 Final Version Received: 21 OCT 2016