psychometrika
doi: 10.1007/s11336-017-9554-0

GENERALIZED FIDUCIAL INFERENCE FOR LOGISTIC GRADED RESPONSE MODELS

Yang Liu
UNIVERSITY OF CALIFORNIA, MERCED

Jan Hannig
THE UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL

Samejima's graded response model (GRM) has gained popularity in the analyses of ordinal response data in psychological, educational, and health-related assessment. Obtaining high-quality point and interval estimates for GRM parameters attracts a great deal of attention in the literature. In the current work, we derive generalized fiducial inference (GFI) for a family of multidimensional graded response models, implement a Gibbs sampler to perform fiducial estimation, and compare its finite-sample performance with several commonly used likelihood-based and Bayesian approaches via three simulation studies. It is found that the proposed method is able to yield reliable inference even in the presence of small sample size and extreme generating parameter values, outperforming the other candidate methods under investigation. The use of GFI as a convenient tool to quantify sampling variability in various inferential procedures is illustrated by an empirical data analysis using patient-reported emotional distress data.

Key words: generalized fiducial inference, confidence interval, Markov chain Monte Carlo, Bernstein–von Mises theorem, item response theory, graded response model, bifactor model.

Electronic supplementary material: The online version of this article (doi:10.1007/s11336-017-9554-0) contains supplementary material, which is available to authorized users. Correspondence should be made to Yang Liu, Psychological Sciences, School of Social Sciences, Humanities, and Arts, University of California, Merced, 5200 North Lake Road, Merced, CA 95343, USA. Email: [email protected] © 2017 The Psychometric Society

1. Introduction

Ordinal rating scales frequently appear in psychological, educational, and health-related measurement. For instance, Likert-type items are routinely used to elicit responses regarding, e.g., the degree to which a statement can be concurred with, the frequency of substance use, or the severity of a disease's interference with daily activities. Another example, more common in proficiency assessments, is assigning partial credit to constructed responses that agree only in part with the answer key, or that reflect progressive levels of mastery.

The logistic graded response model (GRM), first introduced in Samejima's (1969) Psychometrika monograph, has become a standard statistical tool for analyzing ordinal response data. The GRM models an item response as an ordinal logistic regression (Agresti, 2002; also known as a proportional odds model) on one or more latent variables representing the underlying constructs of interest. Heuristically, an ordinal response is treated as a discrete realization of a continuous but latent propensity that is related to individual differences in the target constructs as well as to item characteristics. The relative position of a particular response category on the latent continuum is gauged by the adjacent item difficulty parameters, which are transformations of the slope and intercept parameters in the regression. The GRM reduces to the two-parameter logistic (2PL; Birnbaum, 1968) model when there are only two response categories.
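To make the preceding description concrete, one common slope–intercept form of the logistic GRM is sketched below; the notation (slopes a_j, intercepts b_{jk}, latent variables θ_i) is illustrative and is not necessarily the parameterization adopted later in this paper. Item j has ordered categories k = 0, 1, ..., K_j − 1, and the cumulative response probabilities are

\[
P(Y_{ij} \ge k \mid \theta_i) = \frac{1}{1 + \exp\{-(a_j^\top \theta_i + b_{jk})\}}, \qquad k = 1, \dots, K_j - 1,
\]
\[
P(Y_{ij} = k \mid \theta_i) = P(Y_{ij} \ge k \mid \theta_i) - P(Y_{ij} \ge k + 1 \mid \theta_i),
\]

with the conventions P(Y_{ij} ≥ 0 | θ_i) = 1 and P(Y_{ij} ≥ K_j | θ_i) = 0, and with intercepts b_{j1} > ... > b_{j,K_j−1} so that the cumulative probabilities are properly ordered. In the unidimensional case, the item difficulty parameters mentioned above arise as −b_{jk}/a_j, and setting K_j = 2 recovers the 2PL model.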
Maximum likelihood (ML) estimation of the GRM parameters via Newton-type (e.g., Bock & Lieberman, 1970; Haberman, 2013) or Expectation–Maximization (EM; Bock & Aitkin, 1981) algorithms has been implemented in software packages such as Mplus (Muthén & Muthén, 2012), flexMIRT (Houts & Cai, 2013), IRTPRO (Cai, Thissen, & du Toit, 2011), and the R package mirt (Chalmers, 2012). One technical aspect that requires special handling is the intractable integration over the latent variable space involved in the GRM likelihood function. Simple outer-product rectangular or Gauss–Hermite quadrature approximations work efficiently when the latent dimensionality is low (e.g., less than 3). For high-dimensional GRMs, adaptive quadrature (e.g., Haberman, 2006; Schilling & Bock, 2005) or stochastic approximation techniques (e.g., Cai, 2010a, 2010b; Meng & Schilling, 1996) must be invoked to overcome the well-known "curse of dimensionality". When the test is short and the sample size is small, instability of ML estimation has been noted by Thissen and Hill (2004) and Hill (2004). In particular, if the test consists of only two three-category graded items, Thissen and Hill (2004) identified cases in which the likelihood keeps increasing as the slope increases, and consequently the EM iterations cannot terminate properly.

A confidence interval (CI) captures the uncertainty of parameter estimation due to sampling variability. Most often, CIs reported in GRM applications are constructed by inverting the Wald test, i.e., the ML estimate plus or minus the standard error times the appropriate normal quantile determined by the nominal coverage level. Standard error calculation for item response theory (IRT) models has been extensively studied; more details can be found in Yuan, Cheng, and Patton (2014) and Cai (2008). Because the Wald-type CI relies on a normal approximation of the ML estimates' sampling distribution, its performance is largely contingent upon the deviation from such an approximation. For example, Liu and Hannig (2016) noticed for binary logistic IRT models that generating parameter values close to the boundary of the parameter space are likely to produce skewed sampling distributions for the ML estimates, which further induces under-coverage of Wald-type CIs.

The Delta method (e.g., Lehmann, 1999, pp. 85–93) and resampling procedures such as bootstrapping (Efron & Tibshirani, 1994) are often resorted to when interval estimation for a reparameterization is intended. The Delta method yields Wald-type CIs for transformed parameters using a first-order Taylor series expansion argument, and thus suffers from the drawbacks of using a normal approximation. While resampling methods may lead to better CIs for parameters near the boundary, their empirical performance is seldom studied in the IRT literature.
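For reference, the Wald-type interval and its Delta-method extension described above can be written out explicitly; the scalar notation below is generic and not tied to any particular GRM parameterization. The Wald interval is

\[
\hat{\theta} \pm z_{1-\alpha/2}\, \widehat{\mathrm{se}}(\hat{\theta}),
\]

where \hat{\theta} is the ML estimate, \widehat{\mathrm{se}}(\hat{\theta}) its estimated standard error, and z_{1−α/2} the standard normal quantile matching the nominal coverage level 1 − α. For a smooth transformation η = h(θ), the Delta method substitutes a first-order approximation to the standard error,

\[
\widehat{\mathrm{se}}\{h(\hat{\theta})\} \approx \bigl|h'(\hat{\theta})\bigr|\, \widehat{\mathrm{se}}(\hat{\theta}),
\]

so the resulting interval for η inherits the same dependence on a normal approximation.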
Bayesian inference for unidimensional (Curtis, 2010; Kieftenbeld & Natesan, 2012) and multidimensional (Edwards, 2010) GRMs via Markov chain Monte Carlo (MCMC) sampling has also been proposed and evaluated in the literature. With the help of generic Gibbs samplers such as openBUGS (Spiegelhalter, Thomas, Best, & Lunn, 2010), JAGS (Plummer, 2013a, 2013b), and Stan (Carpenter et al., 2016), arbitrary functionals of the posterior distribution can be efficiently approximated by Monte Carlo methods, from which point estimates and CIs of (transformations of) model parameters can be obtained. Kieftenbeld and Natesan (2012) reported that the Bayesian estimator resulting from a specific prior configuration performs better than the ML estimator in recovering the difficulty parameters when the sample size is small. It is well known that the finite-sample behavior of Bayesian inference is determined to a great extent by the choice of prior distributions relative to the true data-generating parameters, which may hinder the generalizability of Kieftenbeld and Natesan's findings. Unfortunately, prior sensitivity in the estimation of GRM parameters has not been systematically studied, and thus Bayesian methods should be used with caution.

In the current research, generalized fiducial inference (GFI; Hannig, 2009, 2013; Hannig, Iyer, Lai, & Lee, 2015) is derived for a general class of multidimensional GRMs, complementing the existing full-information inferential frameworks. This recent theoretical extension of Fisher's (1930, 1933, 1935) fiducial inference serves as a middle ground between likelihood-based and Bayesian methods. Inferential procedures rest on a probability distribution defined on the parameter space, namely a fiducial distribution, which is derived using only the information contained in the data. Consequently, it inherits all the flexibility of Bayesian methods but requires no prior knowledge of model parameters. It has been demonstrated in applications that GFI not only offers asymptotically optimal inference but also often outperforms ML and Bayesian approaches in small samples (e.g., Cisewski & Hannig, 2012; Hannig, 2009; Liu & Hannig, 2016). In the current work, we show that GFI, when applied to the GRM, again delivers added value over conventional likelihood-based and Bayesian methods: we continue to see that GFI is well-behaved even in extreme conditions (small samples and skewed item parameters) where both ML and Bayesian approaches may fail.

2. Theory

2.1. Generalized Fiducial Inference

We first introduce the generic recipe of GFI; more detailed descriptions can be found in Hannig (2009) and Liu and Hannig (2016). For a fixed family of parametric models and an observed data set, the goal of GFI is to find a fiducial distribution that quantifies the propensity, or plausibility, of different parameter values in generating the observed data, in the absence of prior knowledge thereof. This is achieved by Fisher's signature role-switching argument between data and parameters, which also serves as the foundation for defining the likelihood function from a probability density function. The fiducial argument operates on the data generating equation (DGE; also known as the structural equation; Hannig, 2009):

Y = g(θ, U),  (1)

which describes the data Y as a function of the parameters θ ∈ Θ and random variables U following a completely known distribution.
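As a concrete illustration of (1), the sketch below writes down one possible DGE for a unidimensional logistic GRM and uses it to simulate responses; the slope–intercept parameterization, the standard normal latent traits, and all parameter values are assumptions made for illustration rather than the specification developed later in the paper.

import numpy as np

def grm_dge(a, b, z, u):
    """One possible DGE Y = g(theta, U) for a unidimensional logistic GRM:
    theta = (a, b) collects the item parameters, and U = (z, u) collects the
    latent traits z and the item-level Uniform(0, 1) variates u.

    a : (J,) slopes; b : (J, K-1) intercepts, decreasing within each row so
    that the cumulative probabilities are ordered.
    Returns an (n, J) array of ordinal responses taking values in {0, ..., K-1}.
    """
    # cumulative probabilities P(Y_ij >= k | z_i) for k = 1, ..., K-1
    logits = a[None, :, None] * z[:, None, None] + b[None, :, :]
    cum = 1.0 / (1.0 + np.exp(-logits))
    # the response equals the number of cumulative thresholds the uniform falls below
    return (u[:, :, None] <= cum).sum(axis=2)

rng = np.random.default_rng(0)
n = 500                                               # illustrative sample size
a = np.array([1.2, 0.8, 1.5])                         # hypothetical slopes
b = np.array([[1.0, -0.5], [0.8, -0.8], [1.5, 0.0]])  # hypothetical intercepts
z = rng.standard_normal(n)                            # latent traits, assumed N(0, 1)
u = rng.uniform(size=(n, a.size))                     # uniform components of U
y = grm_dge(a, b, z, u)                               # simulated ordinal responses

Holding a realized U fixed and asking which item parameters could have generated the observed Y is, informally, the role-switching inversion that the fiducial argument formalizes in what follows.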