Parameter Recovery for the Four-Parameter Unidimensional Binary IRT Model: A
Comparison of Marginal Maximum Likelihood and Markov Chain Monte Carlo
Approaches
A dissertation presented to
the faculty of
The Gladys W. and David H. Patton College of Education of Ohio University
In partial fulfillment
of the requirements for the degree
Doctor of Philosophy
Hoan Do
April 2021
© 2021 Hoan Do. All Rights Reserved.
This dissertation titled
Parameter Recovery for the Four-Parameter Unidimensional Binary IRT Model: A
Comparison of Marginal Maximum Likelihood and Markov Chain Monte Carlo
Approaches
by
HOAN DO
has been approved for
the Department of Educational Studies
and The Gladys W. and David H. Patton College of Education by
Gordon P. Brooks
Professor of Educational Studies
Renée A. Middleton
Dean, The Gladys W. and David H. Patton College of Education
Abstract
DO, HOAN, Ph.D., April 2021, Educational Research and Evaluation
Parameter Recovery for the Four-Parameter Unidimensional Binary IRT Model: A
Comparison of Marginal Maximum Likelihood and Markov Chain Monte Carlo
Approaches
Director of Dissertation: Gordon P. Brooks
This study assesses the parameter recovery accuracy of MML and two MCMC methods, Gibbs sampling and HMC, under the four-parameter unidimensional binary item response function. Data were simulated under a fully crossed design with three sample size levels (1,000, 2,500, and 5,000 respondents) and two latent trait distributions
(normal and negatively skewed). Results indicated that, in general, MML was affected more substantially by latent trait skewness but also benefited more strongly from sample size increases than MCMC. The two MCMC methods remained advantageous, with lower RMSE of item parameter recovery across all conditions under investigation, but increasing sample size correspondingly narrowed the gap between MML and MCMC regardless of latent trait distribution. Gibbs and HMC produced nearly identical outcomes across all conditions, and no considerable difference between the two MCMC methods was detected.
Specifically, when θs were generated from a normal distribution, MML and
MCMC estimated the b, c and d parameters with little mean bias, even at N = 1,000.
Estimates of the a parameter were positively biased for MML and negatively biased for
MCMC, and mean bias for all methods was considerably large in absolute value (> 0.10) even at N = 5,000. MML item parameter recovery became less biased than Gibbs and
HMC at N = 5,000. Under normal θ, all methods consistently improved RMSE of item parameter recovery as sample size increased, except for MCMC estimation of the c parameter, which did not exhibit a clear trend.
When latent trait scores were skewed to the left, the quality of item parameter recovery by both MML and MCMC generally deteriorated. Under skewed θ, MML's total error of item parameter recovery diminished as more examinees took a test, yet the larger sample sizes did not appear to reduce mean bias. Indeed, MML became increasingly negatively biased in estimating the d parameter as sample size increased, and mean biases for the other item parameters remained considerably large at N = 5,000. For Gibbs and HMC, increasing sample size under skewed θ benefited only the mean bias of item slope recovery while rendering their estimation of the other item parameters more negatively biased. In addition, unlike MML, there was no appreciable RMSE improvement in the b and d parameter estimation by the two
MCMC methods as more cases were drawn from a skewed θ distribution.
Sample size and latent trait distribution had little observable effect on person parameter recovery on average. Both MML-EAP and MCMC were essentially unbiased and had similar RMSE of trait score estimation across all conditions.
Dedication
This dissertation is dedicated to my mother, Tam Nguyen.
Acknowledgments
The completion of my dissertation would not have been possible without the support and guidance of my professors. I would like to express my gratitude to Dr. Gordon Brooks, my advisor and dissertation chair, for encouraging me to go back to graduate school, allowing me to pursue the research topic I am interested in, and helping me formulate the research questions clearly. Under Dr. Brooks's supervision, I gained research methodology knowledge, statistical programming skills, critical perspectives on the research and knowledge production enterprise, a sense of humor, and five pounds of belly fat. The side effect, of course, is attributed to my following Dr. Brooks too closely in his footsteps, and I take complete responsibility for it.
Dr. Bruce Carlson has always been an academic inspiration. His course on
Bayesian analysis laid a strong foundation for my pursuit of this dissertation topic. His questions pushed me to think more philosophically beyond the technical contents of my study. One can only feel overwhelmed by his knowledge and devotion to academic rigor.
I am grateful to have his instruction and guidance.
Dr. Sebastián Díaz has always been more than a professor to me. In him, I find a mentor, an advocate, and a friend. His encouragement made the dissertation research process less mentally brutal, and his support for me as a doctoral student over these years made graduate school more enjoyable. Discussions with him helped me develop a more practical approach to research and academic work. I am thankful for the well-rounded education I have received from Dr. Díaz.
I am grateful to have Dr. Adah Ward Randolph as my professor, dissertation committee member, and sister. Dr. Randolph helped me broaden my research methodological repertoire, strengthen my writing skills, and improve many aspects of my dissertation. She taught me how to position ourselves and navigate academia as feminists of color. The critical thinking skills and commitment to justice I learnt from Dr. Randolph are meaningful lifelong lessons, and I am forever thankful.
I would like to thank the Ohio Supercomputer Center for granting me the resources and helping me through the process of running my R code in the Linux system, and my friend Nina Adanin for setting up a group of office computers for my simulation. I am grateful to my family, especially my sister, my niece and nephew, for their support. My sister, Loan Do, covered many family duties for me while I was engaged in coursework and research at graduate school. Without my sister's sacrifice, my doctoral journey would not bear fruit. Finally, I would like to thank my friends, An Dinh, Mai
Tran, Thuy Ho, Duong Tran, Hai Mai, and Linda Sauer, for their moral support, free meals, and free trips to Kroger. They made my graduate school experience a happy one.
Table of Contents
Page
Abstract ...... 3
Dedication ...... 5
Acknowledgments...... 6
List of Tables ...... 12
List of Figures ...... 13
Chapter 1: Introduction ...... 14
Overview of IRT ...... 14
Assumptions of IRT ...... 16
Major Types of IRT Models ...... 17
Unidimensional IRT Models for Binary Data ...... 19
The Rasch/One-parameter IRT Model...... 20
The Two-parameter IRT Model ...... 21
The Three-parameter IRT Model ...... 23
The Lesser-known Four-parameter IRT Model ...... 26
Parameter Estimation Approaches in IRT ...... 30
Joint Maximum Likelihood Estimation (JML) ...... 31
Marginal Maximum Likelihood Estimation (MML) ...... 31
Fully Bayesian Approach: Markov Chain Monte Carlo Estimation ...... 33
Problem Statement ...... 34
Research Objectives ...... 35
Research Question 1 ...... 36
Research Question 2 ...... 36
Significance of the Study ...... 36
Scope of the Study ...... 36
Definition of Terms...... 38
Structure of the Dissertation ...... 43
Chapter Summary ...... 43
Chapter 2: Literature Review ...... 45
History of the Four-parameter IRT Model...... 45
Barton and Lord's (1981) Pioneering Study and Disappointing Results...... 45
Reise and Waller's (2003) Findings and Renewed Interests ...... 47
Increasing Applications of the 4PM across Disciplines ...... 50
Theoretical Discussions of the 4PM ...... 53
The Form, Nature and Utility of the Upper Asymptote Parameter ...... 55
Marginal Maximum Likelihood (MML) with Expectation-Maximization Algorithm ...... 59
Bock and Lieberman's (1970) Proposal with Gauss-Hermite Quadrature ...... 60
Bock and Aitkin's (1981) MML with EM Algorithm ...... 62
Improvements on MML with EM ...... 63
Bayesian Approach to Estimation ...... 64
Bayes' Rule ...... 64
MCMC and Its Role in Bayesian Estimation ...... 67
Gibbs Sampling...... 70
Hamiltonian Monte Carlo ...... 73
Studies on Comparisons of MML, Gibbs, and HMC in IRT ...... 75
Unidimensional IRT Models for Dichotomous Data ...... 75
Unidimensional IRT Models for Polytomous Data ...... 77
Multidimensional IRT Models ...... 79
Research on the Four-parameter IRT Model Estimation ...... 83
Factors Manipulated in IRT Parameter Recovery Studies ...... 90
Sample Size ...... 90
Latent Trait Distribution ...... 92
Issues for Consideration with MML ...... 93
Prior Distribution for Abilities ...... 93
Quadrature Points...... 94
Person Parameter Estimation Method ...... 94
Issues for Consideration with MCMC ...... 95
Priors for Person and Item Parameters...... 95
Convergence Assessment...... 96
Chapter Summary ...... 101
Chapter 3: Methodology ...... 102
Monte Carlo Research...... 102
Research Design...... 105
Replications...... 106
Analytical Outcomes ...... 107
Data Generation ...... 109
Item characteristics ...... 109
Sample Size ...... 111
Latent Trait Distribution ...... 112
Data Calibration ...... 114
The R Programming Environment...... 114
MML with the mirt Package ...... 115
Gibbs with JAGS and HMC with Stan via R Interface Packages...... 116
Bayesian Priors for the 4PM Parameters ...... 117
Markov Chain Configurations ...... 118
Verification of Algorithm ...... 122
Manual Verification ...... 123
Modular Testing ...... 123
Checking against Known Solutions ...... 125
Sensitivity Testing ...... 126
Stress Testing ...... 127
Chapter Summary ...... 129
Chapter 4: Results ...... 130
Model Calibration and Convergence ...... 130
Data Analytic Results ...... 132
Item Discrimination Parameters ...... 142
Item Difficulty Parameters ...... 145
Item Lower Asymptote Parameters ...... 148
Item Upper Asymptote Parameters ...... 150
Person Parameters ...... 153
Supplemental Analyses ...... 155
Answer to Research Question 1 ...... 160
Answer to Research Question 2 ...... 161
Chapter Summary ...... 163
Chapter 5: Discussions ...... 165
Summary of Findings ...... 165
Discussions ...... 166
Calibration Time ...... 183
Practical Implications...... 185
Recommendations for Future Research ...... 188
Frequentist versus Bayesian Philosophies and Applications: A Commentary ...... 190
Ethical Reflections ...... 200
References ...... 205
Appendix A: Calibrations by MML with N = 10,000 ...... 233
Appendix B: Latent Trait Distributions in the Main Simulation versus the Follow-up Simulation (N = 2,500) ...... 234
Appendix C: Calibrations by MML and MCMC under Negatively Skewed Latent Trait with More High-Ability Examinees (N = 1,000) ...... 235
Appendix D: Calibrations by MML and MCMC under Negatively Skewed Latent Trait with More High-Ability Examinees (N = 2,500) ...... 236
Appendix E: Calibrations by MML and MCMC at N = 1,000 under Normal Latent Trait when MML Converged Technically ...... 237
Appendix F: Calibrations by MML and MCMC at N = 1,000 under Negatively Skewed Latent Trait when MML Converged Technically ...... 238
Appendix G: R Script for the Main Simulation Data Generation ...... 239
Appendix H: R Script for MML Estimations ...... 243
Appendix I: R Script for HMC Estimations ...... 245
Appendix J: Gibbs.txt ...... 248
Appendix K: R Script for Gibbs Estimations ...... 249
Appendix L: R Script for Follow-up (MML Converged Only Technically) ...... 252
Appendix M: R Script for Follow-up (Skewed θ Distribution with More High-Ability Respondents) ...... 255
List of Tables
Page
Table 1 Sample Descriptive Statistics of Item and Person Parameter Distributions with N = 1,000 ...... 125
Table 2 Comparison of Bias and RMSE under Normal and Negatively Skewed θ with N = 1,000...... 126
Table 3 Percentage of Practical Convergence Given Technical Convergence in MML ...... 131
Table 4 Comparison of MML, Gibbs and HMC under Normal θ and N = 1,000 ...... 133
Table 5 Comparison of MML, Gibbs and HMC under Normal θ and N = 2,500 ...... 134
Table 6 Comparison of MML, Gibbs and HMC under Normal θ and N = 5,000 ...... 135
Table 7 Comparison of MML, Gibbs and HMC under Negatively Skewed θ and N = 1,000...... 136
Table 8 Comparison of MML, Gibbs and HMC under Negatively Skewed θ and N = 2,500...... 137
Table 9 Comparison of MML, Gibbs and HMC under Negatively Skewed θ and N = 5,000...... 138
Table 10 Relative Efficiency of MCMC versus MML across Measurement Conditions ...... 140
Table 11 Comparison of MML, Gibbs and HMC Item Parameter Recovery in Extreme Situations...... 156
Table 12 MML-EAP, Gibbs and HMC Person Parameter Recovery across True θ Ranges ...... 158
List of Figures
Page
Figure 1 Characteristic Curves of Two Hypothetical Items Modeled with the 1PM ...... 21
Figure 2 Characteristic Curves of Two Hypothetical Items Modeled with the 2PM ...... 22
Figure 3 Characteristic Curve of a Hypothetical Item Modeled with the 3PM ...... 25
Figure 4 Characteristic Curve of a Hypothetical Item Modeled with the 4PM ...... 27
Figure 5 Test Information and Standard Error of the 3PM and the 4PM ...... 89
Figure 6 Sample PSRF as a Function of Chain Length and Burn-in with Gibbs ...... 119
Figure 7 Sample PSRF as a Function of Chain Length and Burn-in with HMC ...... 119
Figure 8 Sample Trace Plots of Item and Person Parameters with Gibbs ...... 121
Figure 9 Sample Trace Plots of Item and Person Parameters with HMC ...... 121
Figure 10 Normal and Negatively Skewed Latent Distributions with 1,000 Cases ...... 124
Figure 11 Trace Plot and Plot of R̂ against Iterations of an Item Discrimination Parameter in an Unidentified 4PM Estimated with Gibbs Sampling ...... 128
Figure 12 Trace Plot and Plot of R̂ against Iterations of an Item Discrimination Parameter in an Unidentified 4PM Estimated with HMC ...... 128
Figure 13 Item Discrimination Estimation Bias by MML, Gibbs and HMC under Normal θ ...... 142
Figure 14 Item Discrimination Estimation Bias by MML, Gibbs and HMC under Skewed θ ...... 143
Figure 15 Item Difficulty Estimation Bias by MML, Gibbs and HMC under Normal θ ...... 145
Figure 16 Item Difficulty Estimation Bias by MML, Gibbs and HMC under Skewed θ ...... 146
Figure 17 Item Lower Asymptote Estimation Bias by MML, Gibbs and HMC under Normal θ ...... 148
Figure 18 Item Lower Asymptote Estimation Bias by MML, Gibbs and HMC under Skewed θ ...... 149
Figure 19 Item Upper Asymptote Estimation Bias by MML, Gibbs and HMC under Normal θ ...... 150
Figure 20 Item Upper Asymptote Estimation Bias by MML, Gibbs and HMC under Skewed θ ...... 151
Figure 21 Latent Trait Estimation Bias by MML, Gibbs and HMC under Normal θ ...... 153
Figure 22 Latent Trait Estimation Bias by MML, Gibbs and HMC under Skewed θ ...... 154
Figure 23 Latent Trait Estimation Bias by MML-EAP, Gibbs and HMC across θ Intervals ...... 159
Chapter 1: Introduction
The utility of a statistical model depends to a great extent on how accurately its parameters can be estimated. The purpose of this study is to evaluate the quality of parameter recovery by three estimation procedures for the four-parameter item response theory (IRT) model, which has gained increasing research attention and application in educational and psychological measurement and other fields over the past two decades. In order to provide the necessary background for the research questions to be addressed, this first chapter begins with a short introduction to IRT, a description of popular IRT models with a particular focus on unidimensional IRT models for binary data, and a brief nontechnical explanation of the estimation approaches under examination. The latter part lays out the problem statement and the research questions and defines the scope of the current investigation.
Overview of IRT
IRT is a theory of mental measurement that aims at establishing the relationship between observable item responses and the underlying (i.e., unobservable, latent) trait of the respondents or examinees¹. IRT is not a theory in the conventional sense, because it does not explain the cognitive mechanisms behind respondents' answers in test or scale administrations. Rather, IRT consists of multiple statistical models to characterize person and item traits, and concretize these traits in the form of model parameter estimates in order to predict observed responses (de Ayala, 2009; Hambleton & Jones, 1993). The basic premise of IRT is that the relationship between examinees' latent traits, scale item characteristics, and responses can be specified mathematically. Characteristics of a test in its entirety and measurement error can be derived from person and item parameter estimates (Hambleton & Swaminathan, 1985; Suen, 1990).
¹ In this study, test-takers, examinees and respondents are used interchangeably to denote persons to whom a scale (test/survey/clinical assessment instrument) is administered in a variety of contexts, including but not limited to cognitive assessment.
Like other theories of measurement, two fundamental aspects of IRT are estimates of characteristics of stimulees (e.g., respondents) and estimates of characteristics of stimuli (e.g., items of a scale). Several distinctive features of IRT set this mental appraisal approach apart from other measurement frameworks, however. The latent variable in IRT is assumed to be continuous, as opposed to latent class theory in which the latent variable is categorical. In terms of parameter estimates, IRT also differs from Classical Test
Theory in two fundamental ways: (1) latent traits (i.e., person parameters) and item characteristics (i.e., item parameters) are independent of each other, and (2) both person and item parameters are expressed on the same scale. In IRT, the characteristics of respondents' latent traits do not depend on the scale used to measure them, and the characteristics of a scale do not depend on whom it is administered to. Because the characteristics of respondents and stimuli are independent of each other, they have the property of invariance. Both person and item parameters are located on the same latent continuum, and the latent scale serves as the reference. IRT emphasizes the relation between person parameters and the latent scale, and the relation between item parameters and the latent
Osterlind, 2010).
Assumptions of IRT
Like many other statistical models, IRT is based on certain assumptions about test takers, or survey respondents, and scale items. The first assumption is that the location of respondents is fixed while responding to a scale. This assumption does not usually hold in real world assessments because test-takers learn something as they interact with scale items. However, the change in person location is assumed to be small and negligible.
Second, the item characteristics, such as item difficulty, are assumed to be constant across scale administration contexts. It means the difficulty of the item does not change, although the parametric representation of the item might change in accordance with the abilities of the examinee sample. Another important assumption is termed conditional or local independence, which states that examinees’ responses to one item are determined by their location on the latent continuum and are independent of their responses to other items. The fourth assumption is the functional form assumption, which means the behavior patterns of data correspond to the function specified by the model. The next assumption is optimal performance, that is, test takers are assumed to exercise their best effort to tackle a scale item. Finally, researchers also distinguish between unidimensional
IRT and multidimensional IRT. In unidimensional IRT, there is an additional assumption that observations on scale items are a function of one latent variable. For example, if the responses to a group of math test items depend on essentially only one math proficiency variable, a unidimensional IRT model can be fitted. In contrast, if responses also depend on language proficiency (such as for test takers who speak English as a foreign language), then this assumption is violated, and a multidimensional IRT model should be considered instead (de Ayala, 2009; Embretson & Reise, 2000; Osterlind,
2010; Reckase, 2009; Suen, 1990). Reckase (2009) discussed further assumptions associated with multidimensional IRT, including (a) the assumption of a continuous mathematical function between the person location in a multidimensional space and the probability of getting an item correct (no discontinuities) and (b) the monotonicity assumption, which states that the probability of a correct response to an item increases as the location of examinees increases on any of the coordinate dimensions.
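The local independence assumption has a direct computational payoff: conditional on θ, the joint probability of a whole response pattern factors into a product of per-item probabilities. A minimal Python sketch (illustrative only, using a two-parameter logistic item response function; the item parameter values are hypothetical, and the dissertation's own simulations are written in R):

```python
import math

def irf_2pl(theta, a, b):
    """Two-parameter logistic IRF: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def pattern_likelihood(theta, items, responses):
    """Likelihood of a full response pattern. Local independence lets the
    joint probability factor into one term per item, conditional on theta."""
    like = 1.0
    for (a, b), x in zip(items, responses):
        p = irf_2pl(theta, a, b)
        like *= p if x == 1 else 1.0 - p
    return like

# Hypothetical (a, b) pairs for a three-item test
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]
print(pattern_likelihood(0.0, items, [1, 1, 0]))   # ≈ 0.264
```

This factored likelihood is exactly the quantity that JML maximizes directly and that MML integrates over the latent trait distribution, so the assumption underpins every estimation method compared in this study.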
Major Types of IRT Models
A number of popular IRT models are used to cater for differences in types of data
(e.g., dichotomous vs. polytomous, ordered vs. unordered), the dimension(s) underlying performance, the number of parameters necessary to adequately represent data, and the choice of mathematical expression (e.g., logistic vs. normal ogive) across research settings. For example, if responses to a test question are dichotomous (i.e., answers are coded as right and wrong) and if the difficulty level is the only item parameter to characterize item functioning, then the probability that test-takers derive the correct answer is a function of their ability and item difficulty, resulting in the Rasch/one-parameter model. If items are allowed to vary in terms of their capacity to discriminate examinees on the latent continuum, then the item discrimination parameter is added, yielding the two-parameter model. The three-parameter model takes into account the probability that low-ability examinees select the correct option based on partial information or guesses, and the four-parameter model considers the probability that high-ability respondents select the wrong option due to carelessness or fatigue (Barton & Lord,
1981; Birnbaum, 1968; Embretson & Reise, 2000; Hambleton & Swaminathan, 1985).
IRT modeling is not limited to cognitive testing with typically dichotomous data, but extends to characterize other types of data outside the cognitive domain, such as data from the administration of a questionnaire to which participants respond by endorsing a point on a scale from 1 to 6 (e.g., 1=strongly disagree, 2=moderately disagree, 3=mildly disagree, 4=mildly agree, 5=moderately agree, 6=strongly agree).
Several IRT approaches to modeling this ordered and polytomous data type are available.
In the Graded Response Model (GRM; Samejima, 1969, 1997), which is probably the most popular way to model Likert-type data, selection of a category is compared with selection of other categories below it. For example, the probability of a respondent selecting option 4 or higher (i.e., 5 or 6) is compared to the probability of this respondent choosing option 1, 2, or 3. The Partial Credit Model (PCM) and Generalized Partial
Credit Model (GPCM) provide an alternative solution to ordered polytomous data modeling by conceptualizing a transition location from one point to another point on the scale (Masters, 1982; Muraki, 1990, 1992). With the Likert data example above, PCM
(Masters, 1982) calculates transition location parameters to represent the "difficulty" of selecting one point over the point before it on the scale (e.g., 2 versus 1, 3 versus 2, and so on), and GPCM allows the discrimination parameter to vary across items (Muraki,
1990, 1992). GRM, PCM and GPCM share a similar feature in conceptualizing ordered polytomous data as a series of dichotomous data points and applying the IRT model for dichotomous data at different cut points on the polytomous scale. However, while PCM and GPCM dichotomize one category response versus another category response, GRM pits one category response against a group of responses. The Nominal Response Model
(NRM; Bock, 1972, 1997) accommodates IRT analysis of polytomous data without an inherent order by treating each alternative as a contrast to the baseline option and estimating parameters for each alternative. PCM and GPCM have been shown to be special cases of NRM with constraints to account for the ordered nature of the response options (see Huggins-Manley & Algina, 2015, and the references therein). All the aforementioned models assume a unidimensional latent variable is being measured. If more than one latent trait underlies the response data, a multidimensional IRT model is needed (Reckase, 2009). When the normal ogive function rather than the logistic function is chosen to express the relation between parameters and the probability of a particular response option, a normal ogive model results instead of a logistic model (Suen, 1990).
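The GRM's cumulative comparison described above ("category k or higher" versus the categories below it) can be made concrete: each of the K − 1 thresholds defines a 2PL-style cumulative curve, and differencing adjacent curves yields the probability of each category. A hedged Python sketch (not the dissertation's code; the discrimination and threshold values for this hypothetical six-point Likert item are invented for illustration):

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Graded Response Model: category probabilities obtained by
    differencing cumulative ("this category or higher") logistic curves.
    thresholds must be strictly increasing."""
    cum = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    probs = [1.0 - cum[0]]                                  # lowest category
    probs += [cum[k - 1] - cum[k] for k in range(1, len(cum))]
    probs.append(cum[-1])                                   # highest category
    return probs

# Hypothetical 6-point Likert item: five increasing thresholds
probs = grm_category_probs(theta=0.5, a=1.3,
                           thresholds=[-2.0, -1.0, 0.0, 1.0, 2.0])
print([round(p, 3) for p in probs], round(sum(probs), 6))
```

Because the cumulative curves telescope when differenced, the six category probabilities always sum to one, which is what makes the "series of dichotomies" view of GRM internally consistent.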
Unidimensional IRT Models for Binary Data
For dichotomous data, the most popular IRT models are the unidimensional
Rasch/one-parameter, two-parameter and three-parameter models. Another, much lesser-known unidimensional binary IRT model is the four-parameter IRT model, which has regained research interest only recently after several decades of neglect. The following section briefly describes (1) the one-, two-, and three-parameter IRT models, (2) the reason the four-parameter model has been used less commonly than its simpler counterparts, and (3) how the long-neglected four-parameter IRT model is gaining recognition in diverse disciplines, along with the discussions surrounding its nature and use.
When the more common logistic density is employed for the item response function in lieu of the normal ogive for mathematical convenience, the result is the one-parameter logistic model (1PL), two-parameter logistic model (2PL), three-parameter logistic model (3PL), or four-parameter logistic model (4PL), respectively, without change in model fit (Embretson & Reise, 2000; Suen, 1990). In this study, the more inclusive terms (the 1PM, 2PM, 3PM, and 4PM) are used to acknowledge that these models can also be expressed with the normal density function.
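As a concrete illustration of the 4PL form just named, the lower asymptote c lifts the floor of the curve (pseudo-guessing) and the upper asymptote d caps its ceiling (carelessness or slipping); setting c = 0 and d = 1 recovers the 2PL. A minimal Python sketch with hypothetical parameter values (the dissertation's own estimation code is in R):

```python
import math

def irf_4pl(theta, a, b, c, d):
    """Four-parameter logistic IRF: probability of a correct response.
    a = discrimination, b = difficulty,
    c = lower asymptote, d = upper asymptote."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta = b, the curve sits midway between its two asymptotes:
print(irf_4pl(theta=0.0, a=1.5, b=0.0, c=0.2, d=0.9))   # 0.55
# With c = 0 and d = 1, the 4PL reduces to the 2PL:
print(irf_4pl(theta=0.0, a=1.5, b=0.0, c=0.0, d=1.0))   # 0.5
```

Note that even an examinee far above b cannot exceed probability d, and one far below b cannot fall under c, which is precisely the behavior the four-parameter model adds over the three-parameter model.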
The Rasch/One-parameter IRT Model
Suppose we have a test item with dichotomously coded answers (1=correct,
0=incorrect) and we can represent a test-taker’s ability and an item’s difficulty level on a hypothesized latent variable continuum ranging from -∞ to +∞. We denote this person’s location as θ, and item difficulty as b. The one-parameter model estimates θ and only one item-specific parameter: item difficulty or location. As an item becomes more difficult, its location is shifted toward the right end of the latent scale (i.e., toward +∞). The equation to express the probability of endorsing the target option (e.g., correct answer) as a function of examinees’ abilities and characteristics of an item, or item response function
(IRF), can be written as: