Parameter Recovery for the Four-Parameter Unidimensional Binary IRT Model: A

Comparison of Marginal Maximum Likelihood and Markov Chain Monte Carlo

Approaches

A dissertation presented to

the faculty of

The Gladys W. and David H. Patton College of Education of Ohio University

In partial fulfillment

of the requirements for the degree

Doctor of Philosophy

Hoan Do

April 2021

© 2021 Hoan Do. All Rights Reserved.

This dissertation titled

Parameter Recovery for the Four-Parameter Unidimensional Binary IRT Model: A

Comparison of Marginal Maximum Likelihood and Markov Chain Monte Carlo

Approaches

by

HOAN DO

has been approved for

the Department of Educational Studies

and The Gladys W. and David H. Patton College of Education by

Gordon P. Brooks

Professor of Educational Studies

Renée A. Middleton

Dean, The Gladys W. and David H. Patton College of Education

Abstract

DO, HOAN, Ph.D., April 2021, Educational Research and Evaluation

Parameter Recovery for the Four-Parameter Unidimensional Binary IRT Model: A

Comparison of Marginal Maximum Likelihood and Markov Chain Monte Carlo

Approaches

Director of Dissertation: Gordon P. Brooks

This study assesses the parameter recovery accuracy of MML and two MCMC methods, Gibbs and HMC, under the four-parameter unidimensional binary item response function. Data were simulated under a fully crossed design with three sample size levels (1,000, 2,500, and 5,000 respondents) and two types of latent trait distribution (normal and negatively skewed). Results indicated that, in general, MML was more strongly affected by latent trait skewness but also benefited more from sample size increases than MCMC. The two MCMC methods remained advantageous, with lower RMSE of item parameter recovery across all conditions under investigation, but increasing sample size correspondingly narrowed the gap between MML and MCMC regardless of latent trait distribution. Gibbs and HMC provided nearly identical outcomes across all conditions, and no considerable difference between the two MCMC methods was detected.

Specifically, when θs were generated from a normal distribution, MML and MCMC estimated the b, c, and d parameters with little mean bias, even at N = 1,000. Estimates of the a parameter were positively biased for MML and negatively biased for MCMC, and mean bias for all methods was considerably large in absolute value (> 0.10) even at N = 5,000. MML item parameter recovery became less biased than Gibbs and HMC at N = 5,000. Under normal θ, all methods consistently improved RMSE of item parameter recovery as sample size increased, except for MCMC estimation of the c parameter, which did not exhibit a clear trend.

When latent trait scores were skewed to the left, the quality of item parameter recovery generally deteriorated for both MML and MCMC. Under skewed θ, MML's total error of item parameter recovery diminished as more examinees took the test, yet sample size increases did not appear to benefit mean bias. Indeed, MML became increasingly negatively biased in estimating the d parameter as sample size increased, and mean biases in estimating the other item parameters remained considerably large at N = 5,000. For Gibbs and HMC, sample size increases under skewed θ benefited only the mean bias of item slope recovery while rendering their estimation of the other item parameters more negatively biased. In addition, unlike MML, there was no appreciable RMSE improvement in b and d parameter estimation by the two MCMC methods as more cases were drawn from a skewed θ distribution.

Sample size and latent trait distribution had little observable effect on person parameter recovery on average. Both MML-EAP and MCMC were essentially unbiased and had similar RMSE of trait score estimation across all conditions.


Dedication

This dissertation is dedicated to my mother, Tam Nguyen.


Acknowledgments

The completion of my dissertation would not have been possible without the support and guidance of my professors. I would like to express my gratitude to Dr. Gordon Brooks, my advisor and dissertation chair, for encouraging me to go back to graduate school, allowing me to pursue the research topic I am interested in, and helping me formulate the research questions clearly. Under Dr. Brooks's supervision, I gained research methodology knowledge, statistical programming skills, critical perspectives on the research and knowledge production enterprise, a sense of humor, and five pounds of belly fat. The side effect, of course, is attributed to my following too closely in Dr. Brooks's footsteps, and I take complete responsibility for it.

Dr. Bruce Carlson has always been an academic inspiration. His course on Bayesian analysis laid a strong foundation for my pursuit of this dissertation topic. His questions pushed me to think more philosophically, beyond the technical contents of my study. One can only feel overwhelmed by his knowledge and devotion to academic rigor. I am grateful to have had his instruction and guidance.

Dr. Sebastián Díaz has always been more than a professor to me. In him, I find a mentor, an advocate, and a friend. His encouragement made the dissertation research process less mentally brutal, and his support for me as a doctoral student over these years made graduate school more enjoyable. Discussions with him helped me develop a more practical approach to research and academic work. I am thankful for the well-rounded education I have received from Dr. Díaz.

I am grateful to have Dr. Adah Ward Randolph as my professor, dissertation committee member, and sister. Dr. Randolph helped me broaden my research methodological repertoire, strengthen my writing skills, and improve many aspects of my dissertation. She taught me how to position ourselves and navigate academia as feminists of color. The critical thinking skills and commitment to justice I learned from Dr. Randolph are meaningful lifelong lessons, and I am forever thankful.

I would like to thank the Ohio Supercomputer Center for granting me the resources and helping me through the process of running my code in the Linux system, and my friend Nina Adanin for setting up a group of office computers for my simulation. I am grateful to my family, especially my sister, my niece, and my nephew, for their support. My sister, Loan Do, covered many family duties for me while I was engaged in coursework and research at graduate school. Without my sister's sacrifice, my doctoral journey would not have borne fruit. Finally, I would like to thank my friends, An Dinh, Mai Tran, Thuy Ho, Duong Tran, Hai Mai, and Linda Sauer, for their moral support, free meals, and free trips to Kroger. They made my graduate school experience a happy one.


Table of Contents

Page

Abstract ...... 3
Dedication ...... 5
Acknowledgments ...... 6
List of Tables ...... 12
List of Figures ...... 13
Chapter 1: Introduction ...... 14
Overview of IRT ...... 14
Assumptions of IRT ...... 16
Major Types of IRT Models ...... 17
Unidimensional IRT Models for Binary Data ...... 19
The Rasch/One-parameter IRT Model ...... 20
The Two-parameter IRT Model ...... 21
The Three-parameter IRT Model ...... 23
The Lesser-known Four-parameter IRT Model ...... 26
Parameter Estimation Approaches in IRT ...... 30
Joint Maximum Likelihood Estimation (JML) ...... 31
Marginal Maximum Likelihood Estimation (MML) ...... 31
Fully Bayesian Approach: Markov Chain Monte Carlo Estimation ...... 33
Problem Statement ...... 34
Research Objectives ...... 35
Research Question 1 ...... 36
Research Question 2 ...... 36
Significance of the Study ...... 36
Scope of the Study ...... 36
Definition of Terms ...... 38
Structure of the Dissertation ...... 43
Chapter Summary ...... 43
Chapter 2: Literature Review ...... 45
History of the Four-parameter IRT Model ...... 45
Barton and Lord's (1981) Pioneering Study and Disappointing Results ...... 45
Reise and Waller's (2003) Findings and Renewed Interests ...... 47
Increasing Applications of the 4PM across Disciplines ...... 50
Theoretical Discussions of the 4PM ...... 53
The Form, Nature and Utility of the Upper Asymptote Parameter ...... 55
Marginal Maximum Likelihood (MML) with Expectation-Maximization Algorithm ...... 59
Bock and Lieberman's (1970) Proposal with Gauss-Hermite Quadrature ...... 60
Bock and Aitkin's (1981) MML with EM Algorithm ...... 62
Improvements on MML with EM ...... 63
Bayesian Approach to Estimation ...... 64
Bayes' Rule ...... 64
MCMC and Its Role in Bayesian Estimation ...... 67
Gibbs Sampling ...... 70
Hamiltonian Monte Carlo ...... 73
Studies on Comparisons of MML, Gibbs, and HMC in IRT ...... 75
Unidimensional IRT Models for Dichotomous Data ...... 75
Unidimensional IRT Models for Polytomous Data ...... 77
Multidimensional IRT Models ...... 79
Research on the Four-parameter IRT Model Estimation ...... 83
Factors Manipulated in IRT Parameter Recovery Studies ...... 90
Sample Size ...... 90
Latent Trait Distribution ...... 92
Issues for Consideration with MML ...... 93
Prior Distribution for Abilities ...... 93
Quadrature Points ...... 94
Person Parameter Estimation Method ...... 94
Issues for Consideration with MCMC ...... 95
Priors for Person and Item Parameters ...... 95
Convergence Assessment ...... 96
Chapter Summary ...... 101
Chapter 3: Methodology ...... 102
Monte Carlo Research ...... 102
Research Design ...... 105
Replications ...... 106
Analytical Outcomes ...... 107
Data Generation ...... 109
Item characteristics ...... 109
Sample Size ...... 111
Latent Trait Distribution ...... 112
Data Calibration ...... 114
The R Programming Environment ...... 114
MML with the mirt Package ...... 115
Gibbs with JAGS and HMC with Stan via R Interface Packages ...... 116
Bayesian Priors for the 4PM Parameters ...... 117
Markov Chain Configurations ...... 118
Verification of Algorithm ...... 122
Manual Verification ...... 123
Modular Testing ...... 123
Checking against Known Solutions ...... 125
Sensitivity Testing ...... 126
Stress Testing ...... 127
Chapter Summary ...... 129
Chapter 4: Results ...... 130
Model Calibration and Convergence ...... 130
Data Analytic Results ...... 132
Item Discrimination Parameters ...... 142
Item Difficulty Parameters ...... 145
Item Lower Asymptote Parameters ...... 148
Item Upper Asymptote Parameters ...... 150
Person Parameters ...... 153
Supplemental Analyses ...... 155
Answer to Research Question 1 ...... 160
Answer to Research Question 2 ...... 161
Chapter Summary ...... 163
Chapter 5: Discussions ...... 165
Summary of Findings ...... 165
Discussions ...... 166
Calibration Time ...... 183
Practical Implications ...... 185
Recommendations for Future Research ...... 188
Frequentist versus Bayesian Philosophies and Applications: A Commentary ...... 190
Ethical Reflections ...... 200
References ...... 205
Appendix A: Calibrations by MML with N = 10,000 ...... 233
Appendix B: Latent Trait Distributions in the Main Simulation versus the Follow-up Simulation (N = 2,500) ...... 234
Appendix C: Calibrations by MML and MCMC under Negatively Skewed Latent Trait with More High-Ability Examinees (N = 1,000) ...... 235
Appendix D: Calibrations by MML and MCMC under Negatively Skewed Latent Trait with More High-Ability Examinees (N = 2,500) ...... 236
Appendix E: Calibrations by MML and MCMC at N = 1,000 under Normal Latent Trait when MML Converged Technically ...... 237
Appendix F: Calibrations by MML and MCMC at N = 1,000 under Negatively Skewed Latent Trait when MML Converged Technically ...... 238
Appendix G: R Script for the Main Simulation Data Generation ...... 239
Appendix H: R Script for MML Estimations ...... 243
Appendix I: R Script for HMC Estimations ...... 245
Appendix J: Gibbs.txt ...... 248
Appendix K: R Script for Gibbs Estimations ...... 249
Appendix L: R Script for Follow-up (MML Converged Only Technically) ...... 252
Appendix M: R Script for Follow-up (Skewed θ Distribution with More High-Ability Respondents) ...... 255


List of Tables

Page

Table 1 Sample Descriptive Statistics of Item and Person Parameter Distributions with N = 1,000 ...... 125
Table 2 Comparison of Bias and RMSE under Normal and Negatively Skewed θ with N = 1,000 ...... 126
Table 3 Percentage of Practical Convergence Given Technical Convergence in MML ...... 131
Table 4 Comparison of MML, Gibbs and HMC under Normal θ and N = 1,000 ...... 133
Table 5 Comparison of MML, Gibbs and HMC under Normal θ and N = 2,500 ...... 134
Table 6 Comparison of MML, Gibbs and HMC under Normal θ and N = 5,000 ...... 135
Table 7 Comparison of MML, Gibbs and HMC under Negatively Skewed θ and N = 1,000 ...... 136
Table 8 Comparison of MML, Gibbs and HMC under Negatively Skewed θ and N = 2,500 ...... 137
Table 9 Comparison of MML, Gibbs and HMC under Negatively Skewed θ and N = 5,000 ...... 138
Table 10 Relative Efficiency of MCMC versus MML across Measurement Conditions ...... 140
Table 11 Comparison of MML, Gibbs and HMC Item Parameter Recovery in Extreme Situations ...... 156
Table 12 MML-EAP, Gibbs and HMC Person Parameter Recovery across True θ Ranges ...... 158


List of Figures

Page

Figure 1 Characteristic Curves of Two Hypothetical Items Modeled with the 1PM ...... 21
Figure 2 Characteristic Curves of Two Hypothetical Items Modeled with the 2PM ...... 22
Figure 3 Characteristic Curve of a Hypothetical Item Modeled with the 3PM ...... 25
Figure 4 Characteristic Curve of a Hypothetical Item Modeled with the 4PM ...... 27
Figure 5 Test Information and Standard Error of the 3PM and the 4PM ...... 89
Figure 6 Sample PSRF as a Function of Chain Length and Burn-in with Gibbs ...... 119
Figure 7 Sample PSRF as a Function of Chain Length and Burn-in with HMC ...... 119
Figure 8 Sample Trace Plots of Item and Person Parameters with Gibbs ...... 121
Figure 9 Sample Trace Plots of Item and Person Parameters with HMC ...... 121
Figure 10 Normal and Negatively Skewed Latent Distributions with 1,000 Cases ...... 124
Figure 11 Trace Plot and Plot of R̂ against Iterations of an Item Discrimination Parameter in an Unidentified 4PM Estimated with Gibbs Sampling ...... 128
Figure 12 Trace Plot and Plot of R̂ against Iterations of an Item Discrimination Parameter in an Unidentified 4PM Estimated with HMC ...... 128
Figure 13 Item Discrimination Estimation Bias by MML, Gibbs and HMC under Normal θ ...... 142
Figure 14 Item Discrimination Estimation Bias by MML, Gibbs and HMC under Skewed θ ...... 143
Figure 15 Item Difficulty Estimation Bias by MML, Gibbs and HMC under Normal θ ...... 145
Figure 16 Item Difficulty Estimation Bias by MML, Gibbs and HMC under Skewed θ ...... 146
Figure 17 Item Lower Asymptote Estimation Bias by MML, Gibbs and HMC under Normal θ ...... 148
Figure 18 Item Lower Asymptote Estimation Bias by MML, Gibbs and HMC under Skewed θ ...... 149
Figure 19 Item Upper Asymptote Estimation Bias by MML, Gibbs and HMC under Normal θ ...... 150
Figure 20 Item Upper Asymptote Estimation Bias by MML, Gibbs and HMC under Skewed θ ...... 151
Figure 21 Latent Trait Estimation Bias by MML, Gibbs and HMC under Normal θ ...... 153
Figure 22 Latent Trait Estimation Bias by MML, Gibbs and HMC under Skewed θ ...... 154
Figure 23 Latent Trait Estimation Bias by MML-EAP, Gibbs and HMC across θ Intervals ...... 159

Chapter 1: Introduction

The utility of a statistical model depends to a great extent on how accurately its parameters can be estimated. The purpose of this study is to evaluate the quality of parameter recovery by three estimation procedures for the four-parameter item response theory (IRT) model, which has gained increasing research attention and application in educational and psychological measurement and other fields over the past two decades. In order to provide the necessary background for the research questions to be addressed, this first chapter begins with a short introduction to IRT, a description of popular IRT models with a particular focus on unidimensional IRT models for binary data, and a brief nontechnical explanation of the estimation approaches under examination. The latter part lays out the problem statement and the research questions and defines the scope of the current investigation.

Overview of IRT

IRT is a theory of mental measurement that aims at establishing the relationship between observable item responses and the underlying (i.e., unobservable, latent) trait of the respondents or examinees.1 IRT is not a theory in the conventional sense, because it does not explain the cognitive mechanisms behind respondents' answers in test or scale administrations. Rather, IRT consists of multiple statistical models to characterize person and item traits, and concretize these traits in the form of model parameter estimates in order to predict observed responses (de Ayala, 2009; Hambleton & Jones, 1993). The basic premise of IRT is that the relationship between examinees' latent traits, scale item characteristics, and responses can be specified mathematically. Characteristics of a test in its entirety and measurement error can be derived from person and item parameter estimates (Hambleton & Swaminathan, 1985; Suen, 1990).

1 In this study, test-takers, examinees, and respondents are used interchangeably to denote persons to whom a scale (test/survey/clinical assessment instrument) is administered in a variety of contexts, including but not limited to cognitive assessment.

Like other theories of measurement, two fundamental aspects of IRT are estimates of characteristics of stimulees (e.g., respondents) and estimates of characteristics of stimuli (e.g., items of a scale). Several distinctive features of IRT set this mental appraisal approach apart from other measurement frameworks, however. The latent variable in IRT is assumed to be continuous, as opposed to latent class theory in which the latent variable is categorical. In terms of parameter estimates, IRT also differs from Classical Test

Theory in two fundamental ways: (1) latent traits (i.e., person parameters) and item characteristics (i.e., item parameters) are independent of each other, and (2) both person and item parameters are expressed on the same scale. In IRT, the characteristics of respondents' latent traits do not depend on the scale used to measure them, and the characteristics of a scale do not depend on whom it is administered to. Because the characteristics of respondents and stimuli are independent of each other, they have the property of invariance. Both person and item parameters are located on the same latent continuum, and the latent scale serves as the reference. IRT emphasizes the relation between person parameters and the latent scale, and the relation between item parameters and the latent scale, not the relation between person and item parameters (de Ayala, 2009; Lord, 1953; Osterlind, 2010).

Assumptions of IRT

Like many other statistical models, IRT is based on certain assumptions about test takers, or survey respondents, and scale items. The first assumption is that the location of respondents is fixed while responding to a scale. This assumption does not usually hold in real world assessments because test-takers learn something as they interact with scale items. However, the change in person location is assumed to be small and negligible.

Second, item characteristics, such as item difficulty, are assumed to be constant across scale administration contexts. That is, the difficulty of an item does not change, although the parametric representation of the item might change in accordance with the abilities of the examinee sample. Another important assumption is termed conditional or local independence, which states that examinees' responses to one item are determined by their location on the latent continuum and are independent of their responses to other items. The fourth assumption is the functional form assumption, which means the behavior patterns of the data correspond to the function specified by the model. The next assumption is optimal performance; that is, test takers are assumed to exercise their best effort to tackle a scale item. Finally, researchers also distinguish between unidimensional IRT and multidimensional IRT. In unidimensional IRT, there is an additional assumption that observations on scale items are a function of one latent variable. For example, if the responses to a group of math test items depend on essentially only one math proficiency variable, a unidimensional IRT model can be fitted. In contrast, if responses also depend on language proficiency (such as for test takers who speak English as a foreign language), then this assumption is violated, and a multidimensional IRT model should be considered instead (de Ayala, 2009; Embretson & Reise, 2000; Osterlind, 2010; Reckase, 2009; Suen, 1990). Reckase (2009) discussed further assumptions associated with multidimensional IRT, including (a) the assumption of a continuous mathematical function between the person location in a multidimensional space and the probability of getting an item correct (no discontinuities) and (b) the monotonicity assumption, which states that the probability of a correct response to an item increases as the location of examinees increases on any of the coordinate dimensions.

Major Types of IRT Models

A number of popular IRT models are used to cater for differences in types of data (e.g., dichotomous vs. polytomous, ordered vs. unordered), the dimension(s) underlying performance, the number of parameters necessary to adequately represent data, and the choice of mathematical expression (e.g., logistic vs. normal ogive) across research settings. For example, if responses to a test question are dichotomous (i.e., answers are coded as right and wrong) and if the difficulty level is the only item parameter to characterize item functioning, then the probability of the correct answer derived by test-takers is a function of their ability and item difficulty, resulting in the Rasch/one-parameter model. If items are allowed to vary in terms of their capacity to discriminate examinees on the latent continuum, then the item discrimination parameter is added, yielding the two-parameter model. The three-parameter model takes into account the probability that low-ability examinees select the correct option based on partial information or guesses, and the four-parameter model considers the probability that high-ability respondents select the wrong option due to carelessness or fatigue (Barton & Lord, 1981; Birnbaum, 1968; Embretson & Reise, 2000; Hambleton & Swaminathan, 1985).

IRT modeling is not limited to cognitive testing with typically dichotomous data, but is extended to characterize other types of data outside the cognitive domain, such as data from the administration of a questionnaire to which participants respond by endorsing a point on a scale from 1 to 6 (e.g., 1 = strongly disagree, 2 = moderately disagree, 3 = mildly disagree, 4 = mildly agree, 5 = moderately agree, 6 = strongly agree).

Several IRT approaches to modeling this ordered and polytomous data type are available.

In the Graded Response Model (GRM; Samejima, 1969, 1997), which is probably the most popular way to model Likert-type data, selection of a category is compared with selection of the categories below it. For example, the probability of a respondent selecting option 4 or higher (i.e., 5 or 6) is compared to the probability of this respondent choosing option 1, 2, or 3. The Partial Credit Model (PCM) and Generalized Partial Credit Model (GPCM) provide an alternative solution to ordered polytomous data modeling by conceptualizing a transition location from one point to another point on the scale (Masters, 1982; Muraki, 1990, 1992). With the Likert data example above, PCM (Masters, 1982) calculates transition location parameters to represent the "difficulty" of selecting one point over the point before it on the scale (e.g., 2 versus 1, 3 versus 2, and so on), and GPCM allows the discrimination parameter to vary across items (Muraki, 1990, 1992). GRM, PCM, and GPCM share a similar feature in conceptualizing ordered polytomous data as a series of dichotomous data points and apply the IRT model for dichotomous data at different cut points on the polytomous scale. However, while PCM and GPCM dichotomize one category response versus another category response, GRM pits one category response against a group of responses. The Nominal Response Model (NRM; Bock, 1972, 1997) accommodates IRT analysis of polytomous data without an inherent order by treating each alternative as an opposite to the baseline option and estimating parameters for each alternative. PCM and GPCM have been shown to be special cases of NRM with constraints to account for the ordered nature of the response options (see Huggins-Manley & Algina, 2015, and the references therein). All the aforementioned models assume a unidimensional latent variable is being measured. If more than one latent trait underlies the response data, a multidimensional IRT model is needed (Reckase, 2009). When the normal ogive function rather than the logistic function is chosen to express the relation between parameters and the probability of a particular response option, a normal ogive model results instead of a logistic model (Suen, 1990).

Unidimensional IRT Models for Binary Data

For dichotomous data, the most popular IRT models are the unidimensional Rasch/one-parameter, two-parameter, and three-parameter models. Another, much lesser known unidimensional binary IRT model is the four-parameter IRT model, which has regained research interest only recently after several decades of neglect. The following section briefly describes (1) the one-, two-, and three-parameter IRT models, (2) the reason the four-parameter model has been less commonly used than its simpler counterparts, and (3) how the long-neglected four-parameter IRT model is gaining recognition in diverse disciplines, and the discussions surrounding its nature and use.

When the more common logistic density is employed for the item response function in lieu of the normal ogive for mathematical convenience, it results in the one-parameter logistic model (1PL), two-parameter logistic model (2PL), three-parameter logistic model (3PL), and four-parameter logistic model (4PL), respectively, without change in model fit (Embretson & Reise, 2000; Suen, 1990). In this study, the more inclusive terms (the 1PM, 2PM, 3PM, and 4PM) are used to denote the fact that these models can also be expressed with the normal density function.

The Rasch/One-parameter IRT Model

Suppose we have a test item with dichotomously coded answers (1 = correct, 0 = incorrect) and we can represent a test-taker's ability and an item's difficulty level on a hypothesized latent variable continuum ranging from -∞ to +∞. We denote this person's location as θ and item difficulty as b. The one-parameter model estimates θ and only one item-specific parameter: item difficulty or location. As an item becomes more difficult, its location is shifted toward the right end of the latent scale (i.e., toward +∞). The equation expressing the probability of endorsing the target option (e.g., the correct answer) as a function of examinees' abilities and the characteristics of an item, or item response function (IRF), can be written as:

P(x = 1 \mid \theta_i, a, b_j) = \frac{e^{a(\theta_i - b_j)}}{1 + e^{a(\theta_i - b_j)}} \qquad (1)

For the 1PM, all items have an identical capacity to discriminate respondents on the latent scale (i.e., the item discrimination parameter is constant for all items), which explains why the discrimination parameter, abbreviated as a, is not subscripted. If a is equal to 1, the resulting model is a Rasch model. The item characteristic curves in the one-parameter/Rasch model look parallel because the identical discrimination parameter means all items discriminate test-takers to the same degree. The inflection point of a 1PM item curve is where θ = b, which implies that the probability of an examinee with θ = b having the correct answer is P(x = 1) = .5. Two illustrative item characteristic curves are displayed in Figure 1. Both items have the same discrimination parameter a = 1.67, but item 1 (red curve, difficulty parameter b1 = -1.26) is easier than item 2 (black curve, difficulty parameter b2 = 1.74).

Figure 1

Characteristic Curves of Two Hypothetical Items Modeled with the 1PM
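For readers who want to reproduce curves like those in Figure 1, the 1PM IRF in Equation 1 is easy to compute directly. The following is a minimal R sketch (illustrative only, not the simulation code of this study); the parameter values match the two hypothetical items above.

# Minimal sketch of the 1PM item response function (Equation 1)
irf_1pm <- function(theta, a, b) {
  exp(a * (theta - b)) / (1 + exp(a * (theta - b)))
}

theta <- seq(-4, 4, by = 0.1)              # grid of latent trait values
p1 <- irf_1pm(theta, a = 1.67, b = -1.26)  # item 1 (easier, red curve)
p2 <- irf_1pm(theta, a = 1.67, b = 1.74)   # item 2 (harder, black curve)

# At theta = b, the probability of a correct response is .5
irf_1pm(theta = -1.26, a = 1.67, b = -1.26)  # returns 0.5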

The Two-parameter IRT Model

In reality, test items usually vary in their ability to distinguish examinees of differing latent trait levels. The 2PM was introduced to address the differential 22 sensitivities of test items at various locations on the θ continuum (Birnbaum, 1958,

1968). The 2PM IRF is written via a logistic function as:

P(x = 1 \mid \theta_i, a_j, b_j) = \frac{e^{a_j(\theta_i - b_j)}}{1 + e^{a_j(\theta_i - b_j)}} \qquad (2)

The item discrimination parameter, aj, is now subscripted to convey its variation across items in a scale. Theoretically, item discrimination parameters can be both positive and negative, but typically test items are piloted and removed from the official test/item bank if they exhibit negative discrimination. The discrimination parameter is proportional to the slope at the inflection point on the item characteristic curve, hence another term slope parameter. As a increases in the positive direction, the slope becomes steeper, and the item’s discriminating capacity becomes higher accordingly. Figure 2 shows the characteristic curves of two items in a data set modeled with the 2PM.

Figure 2

Characteristic Curves of Two Hypothetical Items Modeled with the 2PM


Because item discrimination parameters are allowed to vary across items, the two item curves are no longer parallel. Similar to the 1PM, the inflection point of a 2PM item curve is where θ = b. The vertical line stemming from the location of item 1 (b1 = -0.26) crosses the item 1 curve at its inflection point, where the discrimination parameter is a1 = 1.46.

For test-takers with θ equal to the item difficulty parameter b1 = -0.26, the chance of endorsing the correct answer to item 1 is 50%. Similarly, item 2 is most discriminatory at θ = b2 = 1.52, at which point the item curve is steepest, with a discrimination parameter equal to 0.72.
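The θ = b property quoted above can be verified numerically. A minimal R sketch with the item 1 values (illustrative only):

# 2PM IRF; at theta = b the exponent is zero, so P = .5 regardless of a
irf_2pm <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
irf_2pm(theta = -0.26, a = 1.46, b = -0.26)  # returns 0.5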

The Three-parameter IRT Model

In authentic testing contexts, it is reasonable to assume that test-takers at the low end of the ability scale might endorse the correct option in a difficult item by luck or by use of partial information. In this case, the lower asymptote of the item characteristic curve might not be zero but is more appropriately conceptualized as asymptotic with a non-zero value. Thus, it is necessary to develop a model that reflects this pseudo-chance in item responses. The 3PM expands on the 2PM by incorporating the pseudo-chance (or pseudo-guessing) parameter, denoted as c (Birnbaum, 1968). This parameter equals the probability of examinees getting an item right, P(x = 1), when their θ score (i.e., ability) approaches -∞, and is also referred to as the lower asymptote parameter. The probability for a test-taker to answer an item x correctly under the three-parameter logistic model is expressed in the following IRF:

P(x = 1 \mid \theta_i, a_j, b_j, c_j) = c_j + (1 - c_j)\,\frac{e^{a_j(\theta_i - b_j)}}{1 + e^{a_j(\theta_i - b_j)}} \qquad (3)

in which θi is the latent trait score of examinee i, and aj, bj, and cj are the discrimination, difficulty, and lower asymptote parameters of item j, respectively. In the 3PM, the discrimination parameter is still proportional to the slope at the inflection point on the item characteristic curve where θ = b. The slope is equal to .25a(1 - c), and as the parameter c increases, the slope (and hence the item's discriminating capacity) decreases.

When the c parameter is added to the model, the difficulty parameter is now redefined as the point where probability of getting an item correct equals .5(c + 1) rather than .5, because this probability is now constrained by the lower asymptote. The probability of the correct answer at θ = b is higher than .5 when the lower asymptote c is larger than zero. Figure 3 illustrates the characteristic curve of one item modeled with the 3PM. This item is characterized by three parameters: the difficulty parameter b = 0.66, the discrimination parameter a = 0.76, and the pseudo-chance parameter c = .11. The lower asymptote implies the chance of endorsing the correct answer to this item for very low ability examinees equals .11.
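Stated compactly, the two consequences of adding the c parameter described above are

P(x = 1 \mid \theta = b_j) = c_j + \frac{1 - c_j}{2} = \frac{1 + c_j}{2}, \qquad \left. \frac{\partial P}{\partial \theta} \right|_{\theta = b_j} = \frac{a_j (1 - c_j)}{4}.

For the item in Figure 3, for instance, the probability of a correct response at θ = b is (1 + .11)/2 ≈ .56.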


Figure 3

Characteristic Curve of a Hypothetical Item Modeled with the 3PM

It is worth noting that in IRT methodology, the lower asymptote parameter is also frequently referred to as the guessing parameter. However, it is rare that test takers arrive at the right answer by mere guessing. Real test data analytic results show that the c parameter is usually lower than the probability of producing the correct answer by blind guessing alone (i.e., 1 divided by the number of response options), and that the c parameter appears even in easy test items. It is possible that test strategies like reasoning with incomplete information and the attractiveness of cleverly designed distracters all play a part in test-takers' endorsement among alternatives. Therefore, the c parameter should not be interpreted as representing the probability of a correct choice due to guessing only. It is more appropriately viewed as representing an interaction between a respondent and an item (de Ayala, 2009; Little, 1962; Lord, 1974).

The Lesser-known Four-parameter IRT Model

While the 3PM takes into account the possibility that low-ability examinees can derive a correct answer with partial knowledge or test-taking experience, the upper asymptote of the item response curve remains 1, which assumes that the probability of high-ability examinees correctly responding to scale items is 1. This is not always the case, because even strong exam candidates can make mistakes on questions deemed easier than their ability level due to carelessness or fatigue effects. The four-parameter IRT model (Barton & Lord, 1981; Magis, 2013; McDonald, 1967) with an upper asymptote d less than 1 was introduced to capture this non-zero possibility of a "slip" (i.e., 1 - d) among test-takers. In this unidimensional binary IRT model with four parameters, the probability of the right answer (or the answer coded as 1 in the binary code) to an item is expressed with the four-parameter model IRF as

P(x = 1 \mid \theta_i, a_j, b_j, c_j, d_j) = c_j + (d_j - c_j)\,\frac{e^{a_j(\theta_i - b_j)}}{1 + e^{a_j(\theta_i - b_j)}} \qquad (4)

in which θi is the latent trait score of examinee i, and aj, bj, cj, and dj are the discrimination, difficulty, lower asymptote, and upper asymptote parameters of item j, respectively. The subscript of θ denotes the unique estimate of latent trait level for each scale respondent, and the subscripts of the four item parameters indicate that these values are allowed to vary across scale items. The characteristic curve of an item modeled with the 4PM is illustrated in Figure 4. This item is characterized by four parameters: difficulty parameter b = 1.00, discrimination parameter a = 1.20, lower asymptote parameter c = .20, and upper asymptote parameter d = .90. The upper asymptote implies that the chance of endorsing the wrong answer to this item, even for very high ability examinees, equals 10%.

Figure 4

Characteristic Curve of a Hypothetical Item Modeled with the 4PM
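Because the unidimensional binary IRT models are nested (as the next paragraph details), a single R function can generate all four characteristic curves. The sketch below is illustrative; its defaults reproduce the Figure 4 item.

# General binary IRF; defaults match the hypothetical Figure 4 item.
# d = 1 gives the 3PM; additionally c = 0 gives the 2PM; a common a
# (e.g., a = 1) then gives the 1PM/Rasch model.
irf_4pm <- function(theta, a = 1.20, b = 1.00, c = 0.20, d = 0.90) {
  c + (d - c) / (1 + exp(-a * (theta - b)))
}

irf_4pm(5)         # high ability: approaches the upper asymptote d = .90
irf_4pm(5, d = 1)  # 3PM special case: approaches 1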

In the same way that the 3PM expanded on the 2PM with the inclusion of the lower asymptote parameter, the 4PM built upon the 3PM with the upper asymptote parameter lower than one. All unidimensional binary IRT models have hierarchically nested relationships, such that the constraint of the d parameter at one turns the 4PM into the 3PM, fixing the c parameter at zero makes the 3PM functionally equivalent to the

2PM, and the 2PM becomes the 1PM/Rasch model when the a parameter is constrained to be equal across items (or fixed at one for all items).

While the 3PM is the most frequently employed approach to modeling unidimensional binary test data, the 4PM has not enjoyed similar popularity, despite the initial conceptualization of freeing the upper asymptote by McDonald in the 1960s. Part of the reason lies in the lack of readily available psychometric software to accommodate upper asymptote estimation in the 4PM until relatively recently. The other reason, and arguably the primary one, is the discouraging result of the preliminary study of the usefulness of the

4PM by Barton and Lord in the early 1980s. In their investigation, Barton and Lord

(1981) found little value of fitting the model with an upper asymptote parameter less than one. Compared to the 3PM estimates, the introduction of the non-one upper asymptote resulted in negligible changes in both likelihood and ability estimates generally, hence the authors’ discouragement of the use of the model. The increased computational complexity in exchange for minor gains was also put forward as another unappealing feature of the 4PM applications. This concern is understandable, given the non-existence of sufficient statistics, model identification, and non-convergence issues already widely documented with estimation of three item parameters (Baker & Kim, 2004; de Ayala,

2009; Embretson & Reise, 2000; Maris & Bechger, 2009; Osterlind, 2010; San Martín et al., 2009). However, it is noteworthy that rather than estimate the 4PM, Barton and Lord fixed the upper asymptote values to .99 and .98 and examined the changes in estimations.

The values of .99 and .98 were arguably too close to one to make any considerable difference in parameter estimates when compared to results obtained from the 3PM.

Barton and Lord's (1981) conclusion on the disappointing value of the non-unity upper asymptote remained the state of knowledge about the 4PM until almost two decades later. One empirical investigation which deserves credit for sparking a fresh look into the usefulness of the 4PM in educational and psychological research is the psychopathology study by Reise and Waller (2003). These authors fitted the 3PM to data produced by the Minnesota Multiphasic Personality Inventory - Adolescent version with scores keyed in both pathology and non-pathology directions. Parameter estimates

The presence of a non-trivial lower asymptote in the reverse-keyed data implied the presence of the non-unity upper asymptote in the original keyed data. Based on this surprising finding with differentially keyed data, Reise and Waller (2003) argued for the necessity of the upper asymptote parameter (i.e., the 4PM) to model psychopathology item characteristics. Applied researchers employing IRT have also suggested freeing the upper asymptote in other fields, such as criminology (Osgood et al., 2002) and genetics research (Tavares et al., 2004).

Since the pioneering article by Reise and Waller (2003), the 4PM has found its data characterization utility in a variety of fields, including computerized adaptive testing

(CAT; Liao et al., 2012; Rulison & Loken, 2009; Yen, Ho, Liao, & Chen, 2012; Yen, Ho,

Laio, et al., 2012), cognitive appraisal such as financial literacy, mathematics, reading, and physics testing (Barnard-Brak et al., 2018; Culpepper, 2017; Sideridis et al., 2016;

Walstad & Rebeck, 2017), health behavior assessment (Culpepper, 2016), food security

(Gregory, 2019), and further applications in psychopathology and personality assessment

(Feuerstahler & Waller, 2014; Waller & Feuerstahler, 2017; Waller & Reise, 2010). The substantive meaning of the upper asymptote parameter varies across disciplines and will be detailed in Chapter 2.

The proliferation of the 4PM applications should not be equated with the idea that the usefulness or appropriateness of the 4PM to model data is unequivocal among researchers. Indeed, discussions on the nature of the upper asymptote and utility of the

4PM continue in tandem with an increasing number of studies using an additional upper asymptote parameter to model data (Green, 2011; Hambleton & Swaminathan, 1985;

Osgood et al., 2002; Reise & Waller, 2003; Waller & Reise, 2010), in the same way that the forerunner 3PM has stirred debates from its early days until today (e.g., Maris &

Bechger, 2009; Wainer, 1983). Regarding the necessity of the 4PM, I concur with Reise and Waller (2003) that an attempt should be made to find the best fitting model whenever possible, while acknowledging that estimating more parameters does not necessarily provide better fitting and more interpretable models. The choice of model should be guided by desired scale properties, appropriateness of assumptions, estimation purpose and model-data fit, among other criteria (Embretson & Reise, 2000). When there are reasons to believe that modeling an additional upper asymptote might allow better reflection of the data generation process and bring stronger model-data agreement than the traditional 3PM, characterizing data with the 4PM could be considered. In this context, an important issue ahead to tackle is how to obtain accurate estimates of parameters for the 4PM.

Parameter Estimation Approaches in IRT

The fruitfulness of modeling data with IRT depends to a large part on the accurate and efficient recovery of person and item parameters, hence the extensive treatment of 31 estimation techniques in IRT methodology literature. In this section, key approaches to

IRT parameter estimation are sketched.

Joint Maximum Likelihood Estimation (JML)

JML (Birnbaum, 1968) derives estimates for both item and person parameters in an iterative procedure with two steps. In the first step, one set of parameters is estimated with the temporary estimates for the other set. It is commonplace for test-takers to outnumber scale items; therefore, item parameters are estimated first because interim estimates of abilities are more informative for item parameter estimation than vice versa.

In the second step, estimated item parameter values from the first step are used to feed the ability estimation. This iterative process goes on until the number of pre-defined steps has been reached, or the difference between results from two consecutive cycles becomes miniscule and negligible (smaller than the predetermined convergence criterion).

JML has several drawbacks, including biased estimates of person and item parameters in short scales, inconsistency and lack of improvement in item parameter estimation accuracy in the presence of increased sample size, failure to estimate abilities of test-takers with extreme score patterns (all right or all wrong answers), and the need to recalibrate tests when items are removed (Baker & Kim, 2004; de Ayala, 2009;

Embretson, 2000; Lord, 1986; Suen, 1990).

Marginal Maximum Likelihood Estimation (MML)

MML (Bock & Lieberman, 1970; Bock & Aitkin, 1981) separates the estimation of person parameters from the estimation of item parameters. In the first step in MML, test-takers are treated as being randomly drawn from a population and their abilities are 32 assumed to follow a distribution. Thus, the likelihood function is integrated over the assumed population distribution of abilities, and the unwieldy task of estimating both person and item parameters is reduced to estimating only item parameters. Person parameters are then estimated in the second step, typically with the Bayesian inference approach (Baker & Kim, 2004; de Ayala, 2009). MML for IRT models was first described by Bock and Lieberman (1970) but their solution to the integral equation was cumbersome. The expectation-maximization (EM) algorithm proposed by Dempster et al.

(1977) and extended by Bock and Aitkin (1981) turned the item parameter estimation into a computationally manageable task.
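In symbols, the marginalization described above can be sketched as follows, with notation assumed here: y_ij is person i's response to item j, P_j(θ) is item j's IRF, and g(θ) is the assumed population distribution of abilities. MML maximizes the marginal likelihood

L(a, b, c, d) = \prod_{i=1}^{N} \int_{-\infty}^{\infty} \prod_{j=1}^{J} P_j(\theta)^{y_{ij}} \left[ 1 - P_j(\theta) \right]^{1 - y_{ij}} g(\theta) \, d\theta,

in which the integral is approximated by a finite sum over quadrature points in the Bock and Aitkin (1981) implementation, making the EM iterations computationally manageable.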

MML with the EM algorithm (MML for short) is popular because of its many advantages. MML produces more consistent results than JML, even when more examinees take the test. MML also offers applicability to all IRT model types and improves efficiency over JML because of the reduction in computational task. In the IRT literature, MML has become the dominant parameter estimation method and is considered the gold standard, while JML use is gradually declining. However, MML still has inherent disadvantages because the estimation process depends on population distribution of abilities, and the wrong assumption about this distribution results in biased and inefficient estimates (de Ayala, 2009; Embretson & Reise, 2000). In addition, MML cannot estimate item parameters with extreme response patterns (all right or all wrong answers), and it might converge slowly and produce out-of-bounds parameter estimates

(Baker & Kim, 2004).
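For orientation, this kind of MML-EM calibration of the 4PM is available in R through the mirt package (the MML implementation used later in this study). A minimal sketch, assuming resp is a hypothetical N × J matrix of binary responses:

# MML (EM) calibration of the unidimensional 4PM with mirt
library(mirt)

fit <- mirt(resp, model = 1, itemtype = "4PL")  # unidimensional 4PM
coef(fit, IRTpars = TRUE, simplify = TRUE)      # a, b, g (= c), u (= d)
theta_eap <- fscores(fit, method = "EAP")       # EAP person estimates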

Fully Bayesian Approach: Markov Chain Monte Carlo Estimation

Contrary to the classical/frequentist statistics which produce point estimates of parameters, the Bayesian method treats a parameter as a random variable and aims to build a whole distribution of probable estimates conditional on data (Baker & Kim, 2004;

Gelman et al., 2013; Kruschke, 2015). The use of Bayes’ theorem allows us to combine the pre-existing knowledge (in the form of prior distribution) with data (in the form of the likelihood function) to construct the probability distribution of possible parameter values

(known as the posterior distribution). Results from the Bayesian analysis offer a group of parameter candidates, and allow each possible parameter value to take on its own probability. Point and interval estimates can then be derived from posterior distribution summaries. Because parameters in IRT modeling are continuous rather than categorical, the primary goal of Bayesian estimation in IRT is to characterize the posterior density of person and item parameters. It has been suggested that Bayesian methods might be useful for estimating complex IRT models and for small data sets (Baker, 1998; Baker & Kim,

2004; Swaminathan & Gifford, 1982), and for avoidance of inadmissible parameter values (Lord, 1986).
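To make the mechanics concrete, the relationship just described can be written out. With ξ standing in for the collection of item parameters (a notational convenience assumed here) and y for the observed responses,

p(\theta, \xi \mid y) = \frac{p(y \mid \theta, \xi) \, p(\theta) \, p(\xi)}{p(y)} \propto p(y \mid \theta, \xi) \, p(\theta) \, p(\xi),

where the normalizing constant p(y) is the high-dimensional integral that makes direct analytical evaluation of the posterior difficult and motivates the sampling methods discussed next.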

However, an off-putting feature of the Bayesian approach lies in its computationally complicated mathematics. The use of conjugate priors makes Bayesian analytics less demanding, yet the constraint of prior distributions for parameters into the limited family of conjugacy models is not easily justified. The grid method is helpful in simpler models but becomes infeasible in more complex ones (Gelman et al., 2013;

Kruschke, 2015). Fortunately, Markov Chain Monte Carlo (MCMC) algorithms have been developed to effectively sample from the joint posterior distribution and turn

the Bayesian analytical task toward simulation. As a parameter estimation approach, MCMC found its psychometric applications around 1990 and has become increasingly popular in the Bayesian IRT literature, especially when an MML algorithm has not been developed for a particular model or when models involve high dimensionality or multi-level structures for which MML cannot work effectively and efficiently (Junker et al., 2016; Kim & Bolt, 2007; Patz & Junker, 1999a, 1999b). Within the MCMC methodology, Gibbs sampling (Albert, 1992; Gelfand & Smith, 1990; Geman & Geman,

1984) has been widely examined in IRT modeling, while Hamiltonian Monte Carlo

(HMC; Gelman et al., 2013; Kruschke, 2015; Neal, 2011) represents a relatively new method for which little research into its utility in IRT model estimations has been conducted (see also Cai, 2010a, 2010b; Kuo & Sheng, 2016; and Patz & Junker, 1999a,

1999b, for other MCMC methods in IRT model estimations).
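To make the sampling setup concrete, a minimal JAGS specification of the 4PM is sketched below. The priors shown are illustrative placeholders only, not the priors adopted in this study (those are specified in Chapter 3).

# Illustrative JAGS (Gibbs) model for the 4PM; priors are placeholders
jags_4pm <- "
model {
  for (i in 1:N) {
    theta[i] ~ dnorm(0, 1)
    for (j in 1:J) {
      p[i, j] <- c[j] + (d[j] - c[j]) * ilogit(a[j] * (theta[i] - b[j]))
      y[i, j] ~ dbern(p[i, j])
    }
  }
  for (j in 1:J) {
    a[j] ~ dlnorm(0, 4)    # positive discriminations
    b[j] ~ dnorm(0, 0.25)  # diffuse difficulties (precision scale)
    c[j] ~ dbeta(2, 8)     # lower asymptotes pulled toward zero
    d[j] ~ dbeta(8, 2)     # upper asymptotes pulled toward one
  }
}"

# e.g., run via rjags:
# m <- rjags::jags.model(textConnection(jags_4pm),
#                        data = list(y = y, N = nrow(y), J = ncol(y)))
# s <- rjags::coda.samples(m, c("a", "b", "c", "d"), n.iter = 5000)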

Problem Statement

The coexistence of multiple estimation approaches in IRT-based data modeling logically motivates measurement researchers to pose questions about how well the available methods compare to one another under various measurement conditions.

Extensive research comparing MML and MCMC estimations has been reported in the literature on a number of IRT model types, such as the Rasch model (Kim, 2001), two-parameter model (Baker, 1998), three-parameter model (Béguin & Glas, 2001), two-parameter testlet model (Luo & Wolf, 2019), graded response model (Kieftenbeld &

Natesan, 2012; Kuo & Sheng, 2016), nominal response model (Wollack et al., 2002), generalized partial credit model (Luo, 2018), and bifactor IRT model (Martin-Fernandez

& Revuelta, 2017).

Like other IRT models, the 4PM is useful only when person and item characteristics are well understood in the form of accurately estimated parameters. Given that the unidimensional 4PM has only recently emerged from its dormant phase, few studies have been conducted on the many important aspects of this highly parameterized model, including accurate recovery of its person and item parameters. Both the standard estimation procedure, MML, and newer methods via MCMC have been used in measurement practice with the 4PM (Barnard-Brak et al., 2018; Culpepper, 2016, 2017; Feuerstahler & Waller,

2014; Waller & Reise, 2010; Walstad & Rebeck, 2017), and their statistical properties have been investigated individually (Culpepper, 2016; Feuerstahler & Waller, 2014;

Loken & Rulison, 2010; Sheng, 2015), yet the merits of these estimation approaches relative to one another have not been examined. This literature gap waits to be addressed.

Research Objectives

The purpose of this research is to evaluate the quality of MML with EM algorithm and two MCMC sampling mechanisms, Gibbs and HMC, in parameter recovery under the unidimensional four-parameter binary logistic model for the item response function.

Specifically, the present study aims at understanding how well the aforementioned estimators can recover the person and item parameters (i.e., how accurately these methods can estimate the true parameter values) of the 4PM under various measurement conditions. Answers to the following questions are sought:

Research Question 1

How accurate are MML, Gibbs and HMC parameter estimations for the unidimensional four-parameter binary IRT model across latent trait distributions?

Research Question 2

How accurate are MML, Gibbs and HMC parameter estimations for the unidimensional four-parameter binary IRT model across sample size levels?

Significance of the Study

The present study will offer contributions both theoretically and practically. From a theoretical perspective, this research will enrich our knowledge of the 4PM and enhance the currently scarce literature on this model regarding quality of parameter recovery.

Practically, the results of this research would shed light on the merits of the three estimation approaches (MML, Gibbs, and HMC), and serve as estimator selection and sample size requirement guidelines for measurement practitioners as they employ four-parameter modeling in their unique contexts. In particular, this study incorporates the investigation of HMC sampling, an estimation method which possesses high potential utility but remains understudied in the IRT literature. Because the use of HMC for 4PM estimation has not been explored in previous research, new understanding of the properties of this estimator will open up a new research trajectory for researchers interested in the 4PM and, simultaneously, a new estimation possibility for IRT users.

Scope of the Study

Several features define the delimitations of this study. First, this study focuses on the comparison between MML with EM algorithm and two specific MCMC procedures

(Gibbs and HMC). Other MCMC methods such as Hastings within Gibbs (HwG; Cowles

& Carlin, 1996; Albert & Chib, 1993) and blocked Metropolis (BM; Patz & Junker,

1999a, 1999b), and the hybrid method Metropolis-Hastings Robbins-Monro (Cai, 2010a,

2010b) are beyond the scope of this study due to the lack of popular R software packages with the built-in estimator regarding the two former methods and the intended utility in multi-dimensional IRT models of the latter method. Second, expected a posteriori (EAP) will be used because it is the most popular choice for θ estimation subsequent to MML item parameter estimation. MML-EAP with the correct informative θ prior was known to consistently outperform maximum likelihood (ML) and maximum a posteriori (MAP) in person parameter estimation accuracy across different combinations of latent trait and item difficulty parameter distributions under the two-parameter item response function

(Sass et al., 2008). In this study, the assumption that the researchers have certain prior knowledge about plausible θ estimates thanks to calibration results with less parameterized unidimensional IRT models for dichotomous data (i.e., the 2PM and 3PM) is adopted, hence MML-EAP as an optimal choice. Other possible choices for latent trait estimation, ML and MML-MAP, will not be considered.

With reference to the manipulated measurement conditions in this study, latent distributions will be generated from a unit normal distribution and a negatively skewed distribution. The N(0,1) is a typical assumption of the θ distribution, and negative skewness is a common feature of latent trait distributions in actual educational measurement scenarios (Ho & Yu, 2015; Lord, 1955). Furthermore, previous research has indicated that the 4PM is suitable for large data projects and has provided rough guidelines about the necessary sample size for the 4PM estimation (Culpepper, 2016;

2017; Feuerstahler & Waller, 2014; Waller & Feuerstahler, 2017; Waller & Reise, 2010).

This study will adopt these guidelines and will only test three sample size levels (1,000,

2,500, and 5,000 cases) to further contribute to the state-of-the-art knowledge about the

4PM. Finally, the scale length is constrained to be fairly short with 20 items. Parameter recovery accuracy under the 4PM with larger samples and longer scales will be the target of future research.
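For illustration, one simple way to draw the two kinds of latent trait distributions in base R is sketched below; this is illustrative only (the study's actual generating code appears in Appendix G), and the gamma-based construction is just one common way to induce negative skew.

set.seed(1)
n <- 2500

theta_normal <- rnorm(n)               # N(0, 1) latent traits

# Reflect a standardized gamma draw to obtain a left-skewed distribution
g <- rgamma(n, shape = 4, rate = 2)    # right-skewed raw draws
theta_skewed <- -as.numeric(scale(g))  # mean 0, SD 1, negatively skewed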

Definition of Terms

Item response theory (IRT). A theory of mental measurement that places the latent trait of the scale respondents and characteristics of scale items on the same continuum (de Ayala, 2009; Hambleton & Swaminathan, 1985; Lord & Novick, 1968; van der Linden & Hambleton, 1997).

Item response function (IRF). The equation which expresses the probability of endorsing a particular response option as a function of the latent trait level of respondents and characteristics of scale items.

Item information. How much information a test/scale item provides for estimating person parameters.

Total information. The total amount of information provided by all scale/test items for estimating person parameters.

Unidimensional one-parameter binary IRT model (1PM). An IRT model for dichotomous data which characterizes items with only one parameter, the location or difficulty parameter, denoted as b. The ability of items to distinguish respondents of high and low latent trait, or the item discrimination parameter, is a constant.

Unidimensional two-parameter binary IRT model (2PM). An extension of the 1PM which allows the discrimination parameter, denoted as a, to vary across scale items.

Unidimensional three-parameter binary IRT model (3PM). An extension of the 2PM which also takes into account the probability of endorsing a response option in the positive directional relationship with the latent trait level as the latent trait reaches negative infinity. In cognitive measurement, this parameter represents the probability of having the correct answer among low-ability examinees. This third parameter is referred to as the pseudo-guessing, pseudo-chance, or lower asymptote parameter, denoted as c.

Unidimensional four-parameter binary IRT model (4PM). An extension of the 3PM which also considers the probability of endorsing a response option in the positive directional relationship with the latent trait among respondents with high latent trait level. In cognitive measurement, this parameter, denoted as d, conveys the notion that the probability of having the correct answer among proficient examinees is less than 1. This fourth parameter is referred to as the upper asymptote parameter or slipping parameter (i.e., the probability of failing an item among high-achieving examinees, 1 - d).

Marginal Maximum Likelihood with Expectation-Maximization algorithm (MML). The approach to IRT parameter estimation that assumes person parameters follow a distribution and integrates them out of the likelihood equation. Item parameters are estimated using "artificial data," i.e., the expected number of responses and the expected number of correct responses, in an iterative process (Bock & Aitkin, 1981).

Confidence interval (CI). The interval constructed around a sample statistic. Intervals constructed using the same rule are expected to contain the population parameter with a certain frequency-based probability (i.e., a 95% CI is believed to capture the true population parameter in 95 out of 100 samples). The CI is used in frequentist statistics (e.g., MML) as opposed to Bayesian inference.

Bayesian analysis. A computational approach based on Bayes’ rule, which combines the information in collected data (likelihood) and knowledge/belief about parameters (prior distribution) to construct the posterior distribution.

Bayesian inference. The statistical conclusion made in terms of probability statements about an unknown quantity conditional on the observed values.

Bayesian modal estimation (BME). An extension of MML into a fully Bayesian estimation framework by including prior distributions for item parameters.

Expected a posteriori (EAP). A common Bayesian method to quantify latent trait parameters in IRT models after item parameters have been calibrated with MML. EAP employs the mean of the posterior distribution as its point estimate.

Maximum a posteriori (MAP). Bayesian modal estimation of person parameters. It is another common Bayesian method to calculate person parameters in IRT models subsequent to MML item parameter estimation. Unlike EAP, MAP uses the mode of the posterior distribution as its point estimate.

Markov Chain Monte Carlo (MCMC). A group of algorithms to draw samples from a target probability distribution in such a way that each step is directly dependent on the step immediately before it. The draws in the sampling process form a Markov chain and, with sufficient length, the Markov chain reaches stationarity and can approximate the target distribution. MCMC methods are used in Bayesian inference to avoid the laborious, often impossible, analytical task of calculating the high-dimensional, intractable posterior distribution.

MCMC convergence. The state when the Markov chain has stabilized in the target (posterior) distribution and generates representative samples from it.

Burn-in. The initial part of the Markov chain before it reaches convergence. Draws in the burn-in segment are influenced by starting values, considered unrepresentative of the posterior distribution, and thus discarded.

Autocorrelation. The correlation between every kth draw in the Markov chain. Large autocorrelation between draws means steps in the Markov chain are highly dependent and do not provide much independent information about the posterior distribution.

Potential scale reduction factor (PSRF). The ratio of the variance between Markov chains to the variance within these chains. A PSRF close to 1 suggests the chains have mixed well and converged to the posterior distribution.

Trace plot. The plot of samples from the posterior distribution as a function of steps in the Markov chain. It is a visual means to evaluate chain convergence.

Highest density interval (HDI). The region in the posterior distribution in which parameter values have a higher probability than those outside it. For example, the 95% HDI contains 95% of the parameter candidates in the posterior probability distribution, and values within this interval are more probable than the remaining 5% of candidates outside the interval.

Gibbs sampling. An MCMC method which employs the random walk behavior and draws random samples of a parameter conditional on all other parameters in the model. Gibbs is used when the posterior distribution of each parameter can be expressed in closed form, or it can be combined with other algorithms.

JAGS. JAGS stands for “Just Another Gibbs Sampler”, a program for Bayesian computations using Gibbs sampling. JAGS can be run via its R interface packages rjags and runjags.

Hamiltonian Monte Carlo (HMC). An MCMC method which avoids random walk behavior by incorporating a momentum variable in its algorithm to increase the rate of proposal acceptance in areas of the distribution with high density.

No-U-turn sampler (NUTS). An extension of HMC which automatically stops when the trajectory starts to turn around and switches the sampling to another direction.

Stan. The software program which performs Bayesian computations with NUTS. Stan can be executed in R via the interface package rstan.

Bias. The signed difference between an estimated value and the corresponding true parameter, averaged over replications. It is a measure of systematic estimation error.

Root mean squared error (RMSE). The square root of the mean squared error (MSE), which is the average squared deviation of the estimates from their true parameter values. RMSE is a measure of total estimation error.
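In computational terms, both indices are simple to obtain. A minimal R sketch (the estimate vector and true value below are hypothetical placeholders) is:

# Hypothetical estimates of one item parameter across 100 replications
truth <- 1.20
est <- rnorm(100, mean = 1.25, sd = 0.15)

bias <- mean(est - truth)            # systematic (signed) error
rmse <- sqrt(mean((est - truth)^2))  # total error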

Structure of the Dissertation

This dissertation consists of five parts. Part one provides a general introduction to the IRT methodology, the four-parameter model and dominant estimation procedures. It also describes the literature gap regarding the current knowledge of the 4PM and two research questions on the parameter recovery properties of MML and two MCMC approaches (Gibbs and HMC). In part two, a more thorough review of the literature is presented, including the history of the 4PM, the nature of the upper-asymptote in psychometric applications, the conceptual explanation of each estimation method, results from previous studies comparing MML versus two MCMC procedures aforementioned, and what is known about the 4PM estimation thus far. Manipulations of sample size and latent distribution in the simulations are detailed in part three. Part four is devoted to the presentation of major simulation results and interpretations. Finally, discussions of findings and recommendations for 4PM estimation strategies are offered in part five.

Chapter Summary

Chapter 1 introduced IRT, a modern psychometric modeling approach which places both items and examinees on the same latent scale. The major assumptions (such as local independence and parameter invariance) underlying IRT and popular models in this framework were also described, followed by more detailed explanations of unidimensional IRT models for dichotomous data, with a stronger focus on the history of, revival of interest in, and form of the 4PM, in which items are characterized by four parameters (difficulty, discrimination, lower and upper asymptote). A brief discussion of the current estimation methods in the IRT literature, including JML, MML, and Bayesian inference with MCMC (particularly Gibbs and HMC), was included as a segue into the literature gap regarding parameter recovery for the 4PM. The problem statement, research questions on the parameter recovery quality of three estimation methods (MML, Gibbs and HMC) under the IRF for the 4PM, and delimitations of this research were then delineated. This chapter ended with definitions of key terms in the study.


Chapter 2: Literature Review

In this chapter, a more comprehensive review of the 4PM will be provided, including the initiation and discouragement of its use from its early start, how its value was reassessed, and its increasing applications across disciplines, in tandem with discussions surrounding the nature and interpretation of the fourth item parameter. The later part of chapter two will deal with MML and Bayesian modeling via Gibbs and HMC in greater detail. While a snapshot of the parameter estimation literature regarding these three methods in the IRT methodology will be provided, the majority of the final part will focus on the current 4PM estimation literature and the features of model configurations which can influence parameter estimation results and require researchers' careful attention.

History of the Four-parameter IRT Model

Barton And Lord’s (1981) Pioneering Study and Disappointing Results

Although the bulk of the literature germane to the 4PM came out very recently, the initial conceptualization of an upper asymptote different from one dates back to the 1960s. In his monograph Nonlinear Factor Analysis, one of the trailblazing studies to demonstrate IRT as a special case of nonlinear factor analysis, McDonald (1967) raised a concern over the lower and upper asymptotes of the normal ogive model. Essentially, the lower asymptote of zero fails to take into account guessing behavior in low-ability examinees, while the upper asymptote of one excludes the possibility of chance error among proficient respondents. McDonald went on to recommend replacing the lower and upper asymptote restrictions at zero and one, respectively, with data-based estimates for these parameters. While the attribution of the relaxation of the lower asymptote to McDonald (1967), as suggested by Feuerstahler and Waller (2017), is unwarranted because the c parameter was in use in biological assay (analytics commonly used in biotechnology, pharmaceutical and environmental research) long before it materialized in educational and psychological measurement (Baker & Kim, 2004), McDonald's (1967) observation of the inhibiting nature of the upper asymptote of one in mental appraisal was sharp and novel. It was not until 1981 that Barton and Lord published the first empirical investigation into the usefulness of the non-unity d parameter. To address the excessively punitive possibility of the upper asymptote of one among high-ability students with clerical mistakes, Barton and Lord (1981) calibrated four standardized test data sets from Educational Testing Service (ETS) with the 3PM and compared the results with those obtained from 4PM calibrations. The authors found negligible differences in the likelihood and a noticeable change of more than one standard error for the ability estimate in only 1 out of 4,000 cases. In addition, contrary to the common assumption that clerical errors are more likely present in speeded tests and adversely impact trait estimates, Barton and Lord (1981) reported that the 3PM was the better fitting model, thereby rejecting this hypothesis. It should be noted, however, that Barton and Lord (1981) made several poor methodological choices in their study. The most conspicuous problem is that these authors did not calibrate the data with the 4PM in the exact sense of the word, but instead constrained the upper asymptote at 1.00, .99 and .98 for comparative purposes. With reference to the estimated upper asymptotes in recent studies, such as Waller and Reise (2010), which ranged from .40 to .95, the constraint of the d parameter at values very close to 1 in Barton and Lord's (1981) study might not have truly reflected the true values of the fourth parameter and, hence, the actual degree of clerical errors or carelessness among test takers. Moreover, in light of the state-of-the-art knowledge of estimators for IRT modeling, Barton and Lord's (1981) choice of joint maximum likelihood does not appear to be an optimal option. These estimation issues in Barton and Lord's (1981) pioneering study are probably due to the constraints of their time, but the disappointing findings from their investigation, coupled with the computational inefficiency, led these influential IRT scholars to discourage the use of the 4PM.

Reise and Waller’s (2003) Findings and Renewed Interests

The 4PM received either no mention or only passing mention in popular IRT books (e.g., de Ayala, 2009; Hambleton & Swaminathan, 1985; van der Linden & Hambleton, 1997) and essentially remained dormant for more than two decades. Even when this model was included in the class of IRT models for binary data, it was recommended for theoretical rather than practical purposes (Hambleton & Swaminathan, 1985), echoing Barton and Lord's (1981) advice following their failure to find utility in adopting the fourth parameter. In other classic texts in the field, the 4PM was completely absent, such as in the popular IRT handbook edited by van der Linden and Hambleton (1997), where an extensive list of IRT models was discussed. The 4PM was referred to only in a humble appendix note in more recent texts (e.g., de Ayala, 2009). The suggestion to free the assumption that the upper asymptote must take on the value of one intermittently continued but received little attention in the measurement literature.

Osgood et al. (2002) recommended varying the fixed value of one for the upper asymptote to capture the factual phenomenon that adolescents with high scores on the delinquency scale would not always report certain criminal behavior, including less serious acts, although the authors obtained a well-fitting two-parameter graded response model for their youth delinquency data. This note resonated with McMorris's (1997) comment that lowering the upper asymptote could improve the model fit for certain dichotomous items in the National Youth Survey. The limited space dedicated to consideration of the 4PM, combined with the tendency to dismiss its utility due to Barton and Lord's (1981) previous findings, meant that the 4PM was effectively rendered hidden from sight.

What motivated measurement scholars to seriously reconsider the potential of an IRT model with a non-one upper asymptote is often attributed to a 2003 study by Reise and Waller (Rulison & Loken, 2009; Waller & Reise, 2010). Reise and Waller (2003) fitted the 3PM to data from the administration of 15 Minnesota Multiphasic Personality Inventory scales to adolescents (MMPI-A) across inpatient, outpatient, and school settings. While examining the necessity and meaning of the c parameter (i.e., why the 2PM was inadequate) to model personality and psychopathology data, Reise and Waller found that the calibration results for many items consistently exhibited a lower asymptote substantially larger than zero despite the reverse scoring direction. More specifically, when data were scored in the psychopathology direction (i.e., higher scores indicated psychopathology), 20 items had a c parameter larger than .10 for the female data sets, which means adolescents who did not have psychopathological problems had a probability larger than 10% of endorsing items which describe symptoms of mental health issues. However, when items were reverse keyed (i.e., reverse coded so that higher scores indicated nonpathology), up to 105 out of 316 items still had a lower asymptote larger than .10. These results suggested that many adolescents who scored low in the nonpathology direction (i.e., had psychopathological issues) still had a non-zero probability of rejecting items which describe contents germane to their mental health problems. Similar results were reported for the male data sets, although the actual number of items which displayed a substantial c parameter slightly differed. The unanticipated findings led Reise and Waller (2003) to surmise that an upper asymptote lower than one might be necessary to model psychopathology items, to account for the fact that even respondents with high scores (i.e., with a mental health disorder) do not always endorse items relevant to their psychiatric symptoms. Reise and Waller's (2003) speculation later found corroboration in their own follow-up study (Waller & Reise, 2010) and other research (Feuerstahler & Waller, 2014; Waller & Feuerstahler, 2017) using the same data. In congruence with the previous report by Reise and Waller (2003) that 17 out of 23 reverse-scored MMPI-A Low Self-Esteem (LSE) scale items had a lower asymptote greater than .10, Waller and Reise (2010) fitted the 4PM to the same data using a Gibbs sampler and found that up to 18 LSE items had upper asymptote estimates lower than .90. In Waller and Feuerstahler's (2017) study, most items in three MMPI-A scales had d parameters lower than one, many substantially so, when the 4PM was fitted to the female dataset using Bayesian modal estimation.

Increasing Applications of the 4PM across Disciplines

Since Reise and Waller’s (2003) conjecture about the potential utility of the fourth parameter in characterizing psychopathology and personality assessment data, useful applications of the 4PM have blossomed in diverse domains. In computerized adaptive testing (CAT), in which test item implementation for each respondent is tailored to the currently best estimate of their ability, an upper asymptote lower than one helped the estimation trajectory in CAT recover faster and more steadily than the 3PM when examinees make mistakes early in CAT, thereby lessening negative bias in ability estimation among examinees with early slipping and improving measurement efficiency.

These findings were consistently derived from simulation research (Liao et al., 2012;

Rulison & Loken, 2009; Yen, Ho, Liao, & Chen, 2012), empirical experiment (Yen, Ho,

Laio, et al., 2012), and analytical comparison (Cheng & Liu, 2015). In achievement testing, Walstad and Rebek (2017) found the 4PM provided good fit to almost all (44 out of 45) items in a newly developed financial literacy test, while the 3PM and 2PM demonstrated item-data fit for 40 and 38 items respectively, although the authors did not perform hierarchically nested model comparisons to examine if the 4PM was significantly better fitting than its less-parameterized predecessors overall. Culpepper

(2017) found that slipping was prevalent in the low-stakes Mathematics and Reading

Progress tests administered by National Assessment of Educational Progress (NAEP). In fact, approximately half the items had .05 or larger probability of slipping, and about one fifth of the items had over .10 probability of slipping (i.e., 1 - d > .10) for both

Mathematics and Reading tests (Culpepper, 2017). In a similar vein, data from a 51 standardized physics test for Saudi Arabian high-school students were better fitted with

4PM than 3PM and 2PM, and revealed an upper asymptote lower than one for all response style groups, even the test completers group to which high-achieving students were more likely to belong than low-ability students (Sideridis et al., 2016).

In the general educational sciences, Barnard-Brak et al. (2018) investigated the relationship between opportunity to learn and mathematics performance with a nationally representative sample of 15-year-old students who participated in the Program for International Student Assessment (PISA). These authors found high-achieving students who reported high opportunity to learn had an advantage over similarly high-achieving students with lower opportunity to learn. Specifically, when the PISA mathematics data were subjected to differential item functioning analysis under the 4PM, the former group tended to have higher c and d parameters (i.e., a higher chance of pseudo-guessing the correct answer and a lower chance of slipping), whereas the latter group had lower c and d parameters (i.e., a lower chance of pseudo-guessing the correct answer and a higher chance of slipping), despite their similar latent trait level. These interesting findings from applications of the 4PM have important implications for learning interventions for students from disadvantaged backgrounds. Other recent explorations of the 4PM include analysis of general mental ability in online assessment for recruitment purposes (Storme et al., 2019), human figure drawing as an indicator of cognitive development in children (Primi et al., 2018), and visual aesthetic ability (Myszkowski & Storme, 2017), although in these cases the 4PM was not recommended due to its highly parameterized nature, technical issues with parameter estimation, or substantive conclusions similar to those from simpler models.

In the field of juvenile delinquency research, Loken and Rulison (2010) adopted the fourth parameter in modeling the 2005 Monitoring the Future survey administered to 12th graders, per the recommendation made by Osgood et al. (2002) regarding the lowering of the upper asymptote, and found that even youth at the upper end of the latent scale would not report certain delinquent acts (the d parameters of all scale items ranged from .72 to .89). Culpepper (2016) studied bullying behavior among adolescents and reported significantly stronger fit indices for the 4PM over the 3PM and 2PM.

The 4PM has also established its utility in disciplines outside the psychological and educational sciences. In genetics research, Tavares et al. (2004) proposed the 4PM to measure the probability that individuals with a certain predisposition would have an active gene (i.e., the relationship between predisposition and a gene-related problem such as an illness). In this model, genes are treated as items, and the lower asymptote represents the probability that an individual with a low predisposition would have an activated gene, while the upper asymptote reflects the probability that individuals with a high propensity would have the gene deactivated. Gregory (2019) used the 4PM to model underreporting of food insecurity in the U.S. and predicted food insecurity prevalence to be one to three percentage points higher than the current estimates. While the value of modeling latent attributes with the 4PM varies across disciplines, applications of the 4PM are rapidly growing and diversifying, which renders accurate parameter recovery for the 4PM ever more important.

The 4PM applications in educational and psychological measurement, and in other fields, are not synonymous with the idea that adoption of the fourth parameter should fit all types of response data or serve our purposes better than other unidimensional models currently in wide use; quite the contrary, indeed, as more studies lend their voice to the discussion on modeling data with the 4PM. Swist (2015) did not find the 4PM to be superior to its less parameterized forerunners, the 2PM and 3PM, in detecting violations of test item writing principles. In Yalçın (2018), the 3PM provided better fit than the 4PM for data from a Science and Technology exam, although a mixture model was the best-fitting among various competing models. As many authors have noted, the need to model the upper asymptote seems to be more pronounced in low-stakes testing situations and minimal in high-stakes testing (Culpepper, 2017; Loken & Rulison, 2010; Sheng, 2015). Difficulty in estimation is another obstacle to applications of the 4PM (Ogasawara, 2012, 2017; Rupp, 2003). Fortunately, together with the increasing usefulness of the 4PM came updates in several popular software programs to handle 4PM estimation, such as SAS (see Cole & Paek, 2017), the R package mirt (Chalmers, 2012), MPlus (Muthén & Muthén, 2017) and jMetrik (Meyer, 2014), as well as completely new packages like fourPNO (Culpepper, 2016), also in R, to aid applied researchers in modeling with the 4PM. It is predicted that, with the facilitation brought about by the incorporation of 4PM estimation into commonly used psychometric programs, future researchers will be able to include the 4PM in their research agendas and explore the value of this IRT model more widely.

Theoretical Discussions of the 4PM

As the 4PM is gaining a strong foothold in IRT applications, theoretical discussions have also emerged. Sijtsma and Hemker (2000) showed that the 4PM shares the properties of stochastic ordering of the latent trait based on the unweighted sum of scores and non-invariant item ordering with the 2PM and 3PM, while invariant item ordering is unique to the 1PM/Rasch model. Ogasawara (2012) extended the previous work of Lord (1983) by deriving asymptotic approximations of the ability estimator, and Ogasawara (2017) discussed identified and unidentified cases of the fixed-effects 4PM. Tendeiro and Meijer (2012) recommended the 4PM to model test anxiety as a specific form of aberrant behavior in item responses, and Magis (2013) delineated the information function for the 4PM. And in a rather different direction, Hessen (2004, 2005) and Maris (2008) discussed a class of IRT models for dichotomous data called constant latent odds-ratios models, in which sum scores are sufficient statistics. However, while some studies included the 4PM for the sake of generality, examples and simulations still demonstrated a stronger focus on the more popular 3PM (Ogasawara, 2012; Tendeiro & Meijer, 2012). More recent developments in the literature have focused directly on examinations of 4PM identifiability and estimation. Culpepper (2016) presented the full conditionals for the four-parameter ogive model for the Gibbs procedure and developed a package to estimate this model in R (R Core Team, 2019). Kern and Culpepper (2020) introduced the Dyad-4PM, in which items are divided into groups of two (dyads) and load on one latent binary attribute, and showed that this model is identified. Zhang et al. (2020) developed a Gibbs-slice sampling algorithm for estimation of the 4PM with two steps, one to update the upper and lower asymptote parameters with truncated beta distribution conjugation, and the other to slice sample the item discrimination and difficulty parameters with different auxiliary variables. Meng et al. (2019) reformulated the 4PM as a mixture model and estimated it with marginalized maximum a posteriori via a newly developed E-M algorithm.

The Form, Nature and Utility of the Upper Asymptote Parameter

Several important points about the 4PM are worth noting. First, Barton and Lord (1981), probably due to technological constraints, proposed the 4PM in which the upper asymptote was fixed at the test level (i.e., all items had an identical d parameter), and this model was utilized in early 4PM investigations (Rulison & Loken, 2009; Yen, Ho, Liao, & Chen, 2012; Yen, Ho, Liao, et al., 2012). However, more recent empirical investigations, supported by advances in statistical computations to accommodate 4PM estimation, revealed that this fourth parameter varies substantially across items (e.g., Barnard-Brak et al., 2018; Culpepper, 2017; Meng et al., 2019; Sideridis et al., 2016; Waller & Reise, 2010; Walstad & Rebek, 2018). Thus, a more general form of the 4PM incorporating an item-specific upper asymptote to be estimated has emerged (Loken & Rulison, 2010) and is the target of examination in this study.

Second, while the mathematical meaning of an upper asymptote lower than one is clear (a probability less than 100% of endorsing the target response option, even for respondents at the far right end of the latent scale), its interpretation and utility in measurement practice are discipline-dependent and far from unanimous consensus. In psychopathology, Waller and Reise (2010) argued that the presence of the asymptote parameters might signify phenomena with very high or very low prevalence rates, or item-level multidimensionality, such as item content that is ambiguous for respondents at one end of the latent scale but not for respondents at the other end (see also Reise & Waller, 2003). In criminology, a d parameter lower than one might be used to express the uncertainty that youth who scored high on the delinquency scale committed all types of offenses, particularly very serious but infrequent crimes, within a year before completing a self-reported measure (Loken & Rulison, 2010; Osgood et al., 2002). In his health behavior study, Culpepper (2016) similarly suggested that the gap between the upper asymptote and one (i.e., 1 - d) represents the probability that students neglected to report bullying behavior toward their peers. Gregory (2019) used an upper asymptote less than one to model the underreporting of food insecurity in both adult and child food security modules. And in genetics research, Tavares et al. (2004) also proposed that a fourth parameter less than one allows the possibility that a gene is not active in persons with a high predisposition to a disease. In cognitive testing, discussions of anomalous response patterns attribute the reasons for highly proficient students' incorrect answers to items at or below their ability (captured by the value 1 - d) to stress, carelessness, inattention, slipping, demotivation, lack of effort, creative response style, unusual item wording or test speededness (Sideridis et al., 2016).

Apart from the pronounced difficulty in estimation (Rupp, 2003), the justification of the fourth parameter is another area of contention. Green (2011) maintained that adoption of an upper asymptote lower than one is questionable because early mistakes in CAT among highly proficient test-takers (Cheng & Liu, 2015; Liao et al., 2012; Rulison & Loken, 2009; Yen, Ho, Liao, & Chen, 2012; Yen, Ho, Liao, et al., 2012) are very unlikely phenomena, and rewarding high-ability blunderers at the expense of total test information (i.e., test information reduction) when data are characterized with the 4PM is an unreasonable trade-off. However, if the 4PM yields a significantly better fitting approximation to test data, the idea that one should choose the 3PM because it paints a positive picture of test properties is unwarranted. The choice of model must be guided by modeling purpose as well as fit statistics and the appropriateness of assumptions (Embretson & Reise, 2000). And contrary to the widely held belief, a non-one upper asymptote is genuinely prevalent in cognitive testing, not only for a few high-performing students but also for low-achieving examinees (Culpepper, 2017; Sideridis et al., 2016). Green (2011) also pointed out that the idea of adjusting θ estimates to account for mistakes committed by students, be they careless responses or failures to follow test instructions, is counter to the conventional wisdom that students should be rewarded only for correct answers rather than wrong ones. This argument is valid in its own right, especially given that 4PM estimates of θ tend to be slightly higher than their 2PM and 3PM counterparts with the same response data sets. However, the increase in θ estimates should be less of a concern with four-parameter modeling because, in general, θ estimates tend to be nearly identical to the corresponding results from the 3PM and 2PM for most examinees, and their ranks across models correlate almost perfectly (Culpepper, 2017; Loken & Rulison, 2010; Waller & Reise, 2010). Rather, with the 4PM, differences are observed in (1) items with higher discrimination and lower difficulty parameters, and (2) high-ability students (higher θ estimates, larger standard errors, and better coverage of the true parameters by the interval estimates). In addition, if penalizing students for pseudo-guessing an item more difficult than their ability level is acceptable (i.e., modeling with the 3PM), rewarding students with a high latent trait level who miss test items out of carelessness (i.e., modeling with the 4PM) should also be considered.

Overall, finding the best fitting model is always a worthwhile endeavor (Reise & Waller, 2003), and the potential influence of slipping, if present, on the quantification of item characteristics methodologically makes the case for this fourth parameter to be estimated once available data permit (Culpepper, 2016, 2017; Ogasawara, 2012; Raiche et al., 2013; Reise & Waller, 2003; Waller & Reise, 2010). Moreover, if the anomalous response is indeed due to test anxiety, the use of the 4PM for the purpose of detecting it might be more effective than other procedures built to deter aberrant score patterns generally (Tendeiro & Meijer, 2012).

Finally, Loken and Rulison (2010) pointed out that because Barton and Lord (1981) aimed to incorporate the d parameter to represent the tendency to respond incorrectly due to carelessness among high-ability students, their interpretation of this additional parameter concerned students' response behavior rather than test item characteristics. If the d parameter is better viewed as a test-taker attribute, it should be estimated as a person-specific parameter and, therefore, a different formulation of the 4PM would be needed. Similar discussions have been made about the nature of the lower asymptote parameter (de Ayala, 2009). This study adopts the view that the upper asymptote parameter, like the lower asymptote parameter, is reflective of the interaction between person and item characteristics, and is not a pure indicator of an item's quality, at least in educational achievement tests. However, to remain consistent with the 4PM literature to date, the fourth parameter will be estimated as an item parameter in this research.

Marginal Maximum Likelihood (MML) with Expectation-Maximization Algorithm

First, a brief description of maximum likelihood estimation (MLE) is needed. For convenience, suppose a set of dichotomously scored data is attained from a scale administration. Under the local independence assumption, which means responses to items are independent of one another after controlling for θ, the joint probability of an observed response pattern (response vector) u_i is the product of the probabilities of all responses in the pattern, given the model parameters. The probability of the manifest data, given the parameters, is termed the likelihood function, denoted as L in the following equation:

L(u_i | θ_i, a_j, b_j, c_j, d_j) = ∏_{j=1}^{k} p(x_{ij} | θ_i, a_j, b_j, c_j, d_j)     (5)

in which x_{ij} is the response of person i to item j, with x_{ij} = 1 being the target response in IRT modeling (e.g., the correct answer in cognitive testing), u_i is the response vector of all answers by person i, θ_i is the person parameter (e.g., latent trait or ability) for respondent i, and a_j, b_j, c_j, d_j are the discrimination, difficulty, pseudo-chance and slipping parameters of item j, respectively, in the four-parameter IRT model. Expanding on this line, we obtain the probability (likelihood) of the observed data matrix U by multiplying the probabilities of all observed response vectors u_i conditional on all the model parameters. To avoid working with very small probability values, we usually transform the likelihood function into a logarithm function for mathematical convenience, which results in the log-likelihood (Baker & Kim, 2004; de Ayala, 2009; Embretson & Reise, 2000; Hambleton & Swaminathan, 1985; Suen, 1990).
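To make equation (5) concrete, the following R sketch evaluates each item's response probability under the 4PM and accumulates the log-likelihood of one response vector. All parameter values, and the five-item scale itself, are hypothetical illustrations, not values from this study:

# Four-parameter logistic IRF: P(x = 1 | theta, a, b, c, d)
p4 <- function(theta, a, b, c, d) c + (d - c) * plogis(a * (theta - b))

a_j <- c(1.2, 0.8, 1.5, 1.0, 2.0)       # discrimination
b_j <- c(-1.0, 0.0, 0.5, 1.0, 1.5)      # difficulty
c_j <- c(0.20, 0.15, 0.10, 0.25, 0.20)  # lower asymptote
d_j <- c(0.95, 0.90, 0.98, 0.85, 0.92)  # upper asymptote
u <- c(1, 1, 0, 1, 0)                   # observed responses x_ij
theta_i <- 0.3                          # person parameter

p <- p4(theta_i, a_j, b_j, c_j, d_j)              # P(x_ij = 1) per item
loglik <- sum(u * log(p) + (1 - u) * log(1 - p))  # log of equation (5)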

IRT models the probability of responses as a function of the test-takers' abilities θ and the item parameters, and the values of person and item parameters that maximize the likelihood function are the maximum likelihood estimates. Since both person and item parameters are required, these two groups of parameters can be estimated jointly or separately, hence the different MLE procedures. As briefly discussed in chapter 1, JML derives the estimated values for both person and item parameters in a cyclical process. The inherent problems with JML necessitate a different approach that can maximize the likelihood of the observed data while producing stable parameter estimates in the presence of changing test-taker samples.

Bock and Lieberman’s (1970) Proposal with Gauss-Hermite Quadrature

Bock and Lieberman (1970) offered the first groundbreaking discussion of the MML approach to item parameter estimation in IRT models for binary data, using the 2PM as an example. Central to MML is the assumption that examinees are a random sample from a population of underlying abilities. Because θ is supposed to be continuous in IRT modeling, the likelihood of a response vector u_i mathematically equals the integration of all conditional probabilities of this vector across possible values of θ, as dictated by the characteristics of the θ population distribution:

L(u_i | a_j, b_j, c_j, d_j) = ∫_{-∞}^{+∞} p(u_i | θ, a_j, b_j, c_j, d_j) g(θ) dθ     (6)

in which g(θ) is the population distribution of θ, and other notations are as described above.

As such, the likelihood function of a response vector u_i no longer depends on a particular θ value of a respondent and turns into a marginal probability. In this manner, the estimation task is simplified to estimation of item parameters only, which is separated from estimation of person parameters. Values of θ are treated as nuisance or incidental parameters, and the integration over their underlying distribution removes them from the likelihood function. For binary data with k items, each response can take two values (right or wrong), hence a total of 2^k possible response vectors u_i. The likelihood function for the response matrix U then equals the product of the probabilities of the 2^k response vectors u_i, and it is termed the marginal likelihood function because it, again, is not conditional on individual θs. The item parameter values a_j, b_j, c_j, and d_j for each item j, which together maximize the likelihood of the response matrix, are then solved for.

It is important to mention that because θ is a continuous latent variable, in theory θ values can range from -∞ to +∞. Therefore, it is impossible to express the integration above in closed form (i.e., solvable in a finite number of analytical steps). Bock and Lieberman's (1970) solution was to approximate the θ distribution curve with Gauss-Hermite quadrature. Essentially, their idea was to employ the weighted sum of bars in a histogram to provide an approximation to the θ density curve. In their example, Bock and Lieberman (1970) adjusted the quadrature points and their weight coefficients taken from Stroud and Secrest (1966) to approximate the unit normal density of θ in the calibration of the Law School Admission Test and found that estimated values of item difficulty and item discrimination for the two-parameter model were close to the results derived from conventional analysis.
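As a rough illustration of this idea, the marginal likelihood in equation (6) can be approximated by a weighted sum over quadrature points. The R sketch below uses equally spaced points weighted by the normal density as a crude stand-in for the Gauss-Hermite nodes and weights, and reuses the hypothetical p4 function, item parameters, and response vector u from the earlier sketch:

# Fixed quadrature nodes spanning the plausible theta range
nodes <- seq(-4, 4, length.out = 21)
weights <- dnorm(nodes) / sum(dnorm(nodes))  # normalized N(0,1) weights

cond_lik <- sapply(nodes, function(q) {
  p <- p4(q, a_j, b_j, c_j, d_j)
  prod(p^u * (1 - p)^(1 - u))   # likelihood of u at this theta value
})
marginal_lik <- sum(weights * cond_lik)   # discrete stand-in for equation (6)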

However, only five test items were put to trial estimation in Bock and Lieberman's (1970) example, due to an important disadvantage of their method. Their approach to the computational task requires the inversion of a large item parameter matrix, with dimensions equal to the number of items multiplied by the number of parameters to estimate for each item. Therefore, as the number of items increases, the matrix grows in dimensions and estimation becomes increasingly complex. Due to the cumbersome nature of the mathematical task involved, Bock and Lieberman (1970) acknowledged that this estimation method is of theoretical interest and might prove practical only for short tests (not longer than 10-12 items). Some packages, such as mirt (Chalmers, 2012), still provide Bock and Lieberman's approach as an estimation option, but with a recommendation against its use for long tests.

Bock and Aitkin’s (1981) MML with EM Algorithm

The significant improvement to the MML estimation method came from Bock and Aitkin's (1981) extension of the EM algorithm (Dempster et al., 1977) to IRT parameter estimation. Although the EM algorithm had been reported in the statistical literature for special cases, Dempster et al. (1977) elevated the algorithm to a higher level of generality and made it applicable to models outside the exponential family. Bock and Aitkin (1981) applied this approach by Dempster and colleagues, treating person parameters as incomplete data, and proposed an iterative two-step estimation procedure. In the E (expectation) step, the expected number of examinees who respond to an item and the expected number of examinees who answer correctly are calculated at each quadrature middle point (called a quadrature node). Then, in the M (maximization) step, the interim estimates from the E step are used in the reformulated likelihood function to estimate item parameters. The cyclical process goes on, with refined results from each step being fed to the next step for estimation improvement, and stops when convergence criteria are satisfied (Baker & Kim, 2004; Bock & Aitkin, 1981; Harwell et al., 1988). Because item parameters are estimated on an individual item basis in MML with EM, the problems of dealing with large matrices and increased numbers of parameters to estimate encountered in JML and MML with Gauss-Hermite quadrature are avoided. The advantages of MML with the EM algorithm lie in its flexible applicability to a number of estimation situations, especially tests with small and large numbers of items, and multidimensional IRT models, which proved a challenging problem in the early 1980s (Bock & Aitkin, 1981). It is noteworthy that MML with EM only estimates item parameters, using expected information about the ability distribution. Person parameter estimation can proceed in the ensuing stage with many approaches, such as ML or Bayesian methods like EAP and MAP.
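The E step's "artificial data" can be illustrated with a short R sketch. Under stated assumptions (it reuses the hypothetical p4 function, item parameters, quadrature nodes, and weights from the sketches above, and simulates a hypothetical response matrix U), the expected counts at each node are computed from each person's posterior weights:

set.seed(1)
N <- 200; k <- 5
theta_true <- rnorm(N)   # hypothetical latent traits
U <- sapply(1:k, function(j)
  rbinom(N, 1, p4(theta_true, a_j[j], b_j[j], c_j[j], d_j[j])))

# E step: posterior weight of each quadrature node for each person
post <- sapply(nodes, function(q) {
  p <- p4(q, a_j, b_j, c_j, d_j)
  apply(U, 1, function(u) prod(p^u * (1 - p)^(1 - u)))
})
post <- sweep(post, 2, weights, "*")
post <- post / rowSums(post)

# "Artificial data": expected number of examinees at each node (n_bar)
# and expected number of correct responses per item at each node (r_bar),
# which the M step then treats as known in re-estimating item parameters
n_bar <- colSums(post)
r_bar <- t(post) %*% U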

Improvements on MML with EM

In the decades since its inception, MML with the EM algorithm has been considered the gold standard for IRT model parameter estimation (Albert, 1992; Patz & Junker, 1999a), yet it is not without weaknesses. Improvements on MML with EM have been made to address two fundamental issues with this estimation procedure. First, the integration over the latent trait distribution in the likelihood function is easily recognizable as a Bayes-oriented feature (Bayes' theorem is explained in the next section), but the solution to the marginal likelihood function is still situated within a frequentist framework. Thus, a natural extension of MML is to place it in a fully Bayesian framework by incorporating prior densities for item parameters, an estimation method typically known as Bayes modal estimation (BME; Mislevy, 1986). BME can be performed with EM to produce both person and item parameter estimates; more importantly, BME prevents item parameters from taking on infinite or inadmissible values, especially in small to moderate samples for complex models. A second contribution concerned the accuracy of the quadrature itself: Schilling and Bock (2005) replaced the fixed-point Gauss-Hermite quadrature with adaptive quadrature for better accuracy and faster convergence in multidimensional IRT models.

Bayesian Approach to Estimation

Bayes’ Rule

Suppose one has the observed item response data U and wants to estimate the parameter θ. According to probability theory, the joint probability of U and θ is p(U, θ) = p(U | θ)p(θ) = p(θ | U)p(U). With a simple flip, the probability of the parameter θ conditional on the data U is

p(θ | U) = p(U | θ)p(θ) / p(U)     (7)

The above equation represents one way to formulate Bayes' rule for the conditional probability of the parameter θ, given the observed data, once the likelihood p(U | θ), the prior probability of the parameter θ, p(θ), and the probability of the data U, p(U), are known. The denominator p(U) can be thought of as the sum of the probabilities of the data U across all possible θ values and is constant for a fixed data set. This cumulative sum equals the integral ∫ p(U | θ)p(θ)dθ in the case of continuous θ, and is referred to as the normalizing constant. Thus, we can remove this constant from the equation, and Bayes' rule can be rewritten as

p(θ | U) ∝ p(U | θ)p(θ)     (8)

in which p(θ | U), the posterior distribution, expresses the probability of a particular parameter value of θ conditional on the data set U. The probability of the observations given the parameter, p(U | θ), is the likelihood function, and p(θ) reflects our pre-existing knowledge or belief about the parameter before data collection. This relationship between the likelihood, prior and posterior distributions distinguishes the Bayesian from the traditional frequentist approach to estimation: while the latter aims at the point estimate of the parameter which maximizes the likelihood of the observed data, the ultimate task of Bayesian estimation is to reproduce the posterior probability (density) distribution of possible parameter values by combining both prior information and the collected data. In this manner, Bayesian estimation can provide richer and more nuanced information about a parameter of interest than a single point estimate.

Instead of capturing a single parameter, θ can represent a vector of parameters, in which case p(θ) becomes a multivariate distribution. Bayesian inference in the multivariate context thus turns into the task of determining the joint posterior distribution p(θ | U) and integrating this joint posterior over the nuisance parameters (i.e., parameters we are not interested in) to obtain the marginal posterior distribution of only the parameters of interest. Bayes' rule, therefore, remains straightforward and can accommodate models of varying degrees of complexity, at least theoretically. The posterior distribution is useful to know because it directly handles the question of how credible different possible parameter values are, given the data collected.

Bayesian analysis offers both theoretical and practical benefits for statistical modeling in general (König & van de Schoot, 2018; Kruschke, 2015; van de Schoot et al., 2014). Theoretically, Bayesian computations allow ease of interpretation, direct expression of uncertainty, and facilitation of knowledge updating via prior distribution specification. Practically, Bayesian estimation aids modeling with small sample sizes, does not assume a particular distributional form for parameters, and guards against unlikely or implausible results. Similar advantages of Bayesian applications have been acknowledged in the psychometric modeling literature (Levy & Mislevy, 2016). For instance, Bayesian estimation has the potential to avoid extreme and infinite parameters in IRT modeling, especially in small to moderate samples, while it does not require the assumption of a normal latent trait distribution, as is generally the case when using MML (Mislevy, 1986). Sheng (2013) demonstrated the value of Bayesian estimation in accurately estimating the 2PM and 3PM with only 100 cases when appropriate conjugate priors are set up in a hierarchical manner (i.e., parameters of the prior distribution also have priors of their own). Finch and French (2019) tried even smaller sample sizes and reported reasonable 2PM and GPCM item difficulty parameter recovery with Bayesian methods, even with samples as small as 25 cases. These results are remarkable given that general sample size guidelines for the 2PM and 3PM are roughly 500 and 1,000 cases, respectively (e.g., de Ayala, 2009).

Much as Bayesian modeling holds great potential for statistical inference, it is hindered by the fact that the integral ∫ p(U | θ)p(θ)dθ is not always easy, and indeed sometimes impossible, to calculate directly (Junker et al., 2016; van Ravenzwaaij et al., 2018). Methods to simplify the calculation of this normalizing constant have been proposed, such as grid approximation and the use of exact formal analysis via conjugate functions (see Gelman et al., 2013; Kruschke, 2015; and McElreath, 2016 for details of these methods). The grid method provides good approximate results when models are low in dimension, but computations become intractable when multiple parameters are involved. The conjugacy method saves computational effort if the conjugate function truly reflects the prior distribution. If not, the idea of constraining the prior knowledge of parameters to a certain conjugate distribution for computational convenience is hardly justified. Fortunately, MCMC methods have come to the rescue and made Bayesian computations tractable.
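For a single parameter, the grid method mentioned above is straightforward to sketch in R. The example is hypothetical: a Bernoulli success probability with a Beta(2, 2) prior and 7 successes in 10 trials:

grid <- seq(0.001, 0.999, length.out = 1000)   # candidate parameter values
prior <- dbeta(grid, 2, 2)                     # Beta(2, 2) prior density
lik <- dbinom(7, size = 10, prob = grid)       # likelihood of the data
post <- prior * lik
post <- post / sum(post)                       # normalize over the grid

post_mean <- sum(grid * post)                  # posterior mean estimate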

MCMC and Its Role in Bayesian Estimation

Random observations are said to form a Markov chain when the sampling of one observation is conditional on only the observation before it and is not dependent on all other previous observations. MCMC (Gelman et al., 2013; Geyer, 2011; Kim & Bolt, 2007; Kruschke, 2015; Patz & Junker, 1999a, 1999b; van Ravenzwaaij et al., 2018) is, as its name suggests, a Monte Carlo procedure to investigate the characteristics of a distribution by drawing random samples from it. MCMC, like other Monte Carlo methods, turns analytics into the task of drawing random samples to reconstruct or approximate a distribution, thus avoiding intractable computation. However, the distinguishing feature of MCMC is that its generation of random samples in a sequential process yields dependent draws that form a Markov chain, while traditional Monte Carlo samples are independent of each other.

The initiation of MCMC simulations is usually credited to Metropolis et al. (1953), with their study on liquid equilibrium, and to Hastings's (1970) generalization of the Metropolis algorithm (Geyer, 2011; Robert & Casella, 2011). MCMC algorithms were not initially devised for Bayesian calculations, but their popularity in Bayesian inference has been motivated by the fact that they allow the approximation of the joint posterior distribution with empirically sampled elements. Bayesian statistical computation of the posterior distribution is often intractable, especially for highly parameterized models, hence the difficulty in drawing conclusions about parameters of interest. The value of MCMC for Bayesian inference lies in the transition distribution (i.e., the conditional distribution of one state of the chain, given the most recent state), which can be designed so that the chain is continuously improved at each step and the target stationary distribution it converges to is the joint posterior distribution. The properties of the posterior distribution can later be acquired through summary statistics of the distribution reproduced by the sampled values in the Markov chain. MCMC thus represents a viable solution to the examination of posterior distribution characteristics when Bayesian analytics proves difficult to derive (Gelman et al., 2013; Kruschke, 2015).

MCMC methodology entered the literature around 1990 (Junker et al., 2016; Levy et al., 2011) and, for the past three decades, has become increasingly appealing for IRT modeling. The popularity of MCMC is in part due to the growing recognition of Bayesian reasoning and its methodologically flexible applications to a wide range of models, especially highly complex models where traditional maximum likelihood is not yet available, and in part thanks to the development of user-friendly software allowing MCMC applications (Kim & Bolt, 2007; Patz & Junker, 1999a, 1999b). Patz and Junker (1999a), who are usually credited with introducing MCMC methodological guidance in IRT, pointed out that the combination of Bayesian computation and MCMC methodology appeals to psychometricians not only as a framework which offers a straightforward method to fit diverse and complex IRT models, but also as a computational approach that allows the quantification of person parameter estimation uncertainty, a feature which still remains a challenge in MML. Also, many studies which have demonstrated the hallmark advantage of Bayesian estimation in IRT with small sample sizes were enabled by the use of MCMC methods, such as the studies by Sheng (2013) and Finch and French (2019).

As we construct the Markov chain to approximate the Bayesian posterior distribution, we must assess whether the chain possesses the desirable features of representativeness, accuracy, and efficiency (Kruschke, 2015). The potential scale reduction factor (PSRF), also known as the shrink factor (Gelman & Rubin, 1992; Brooks & Gelman, 1998), and the trace plot can be used to examine the representativeness of the Markov chain draws. The autocorrelation function can be graphed and the effective sample size calculated to determine how much independent information is available in the MCMC chains to provide an accurate and stable representation of the posterior distribution. The choice of sampling procedure in the MCMC family, which can affect how quickly the sampler moves through the parameter space and converges, is also of concern, especially for complex models. The following section provides an overview of Gibbs and HMC, two of the most important MCMC methods in modern Bayesian inference according to McElreath (2016).
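In R, these diagnostics are available in the coda package, to which both rjags and rstan output can be coerced. A minimal sketch follows; the two chains of standard normal draws are hypothetical stand-ins for actual sampler output:

library(coda)

# Two hypothetical chains for one parameter, wrapped as coda objects
mcmc_list <- mcmc.list(mcmc(rnorm(2000)), mcmc(rnorm(2000)))

gelman.diag(mcmc_list)     # PSRF: values near 1 suggest convergence
effectiveSize(mcmc_list)   # independent information in the draws
autocorr.plot(mcmc_list)   # autocorrelation at increasing lags
traceplot(mcmc_list)       # visual check of mixing and stationarity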

Gibbs Sampling. Gibbs sampling (Gelfand & Smith, 1990; Geman & Geman, 1984), the simplest MCMC procedure, has frequently been the subject of investigation in the IRT parameter estimation literature since Albert (1992) pioneered its examination for the two-parameter normal ogive model. Technical expositions of the Gibbs sampler can be found in Albert (1992) for the two-parameter IRT model, Béguin and Glas (2001) and Sahu (2002) for the three-parameter IRT model, and Sheng (2015) and Culpepper (2016) for the four-parameter IRT model. In the following section, the general sampling procedure with Gibbs is described, with the 4PM estimation as an illustration.

Let k be the number of binary items in a unidimensional data set, x be a random element in the response matrix U, θ be the vector of person parameters θ = (θ_1, θ_2, …, θ_i, θ_{i+1}, …, θ_{N-1}, θ_N) for N test takers, and ξ be the vector of item parameters ξ = (ξ_1, ξ_2, …, ξ_j, …, ξ_{4k-1}, ξ_{4k}) for k items in the four-parameter IRT model. Note that ξ (the Greek letter pronounced as xi) in the 4PM includes the a parameter, which measures an item's sensitivity in distinguishing low-ability from high-ability examinees; the b parameter, which refers to item difficulty; the c parameter, which denotes the probability that respondents with very low ability can provide the right answer through partial information or guessing or both; and the d parameter, which captures the probability that even proficient examinees with a latent trait location higher than the item difficulty can provide the wrong answer. The probability that scale respondent i answers item j correctly is

P(x_{ij} = 1 | θ_i, ξ_j) = c_j + (d_j - c_j) · e^{a_j(θ_i - b_j)} / (1 + e^{a_j(θ_i - b_j)})     (9)

Following Bayes’s theorem, the probability that θ and ξ take on certain values, given the data, is proportional to product of the likelihood function and the prior distributions of θ and ξ, which can be expressed with statistical notations as

p(θ, ξǀU) ∝ p(Uǀ θ, ξ)p(θ)p(ξ) (9)

Note that p(θ,ξ) = p(θ)p(ξ) because person and item parameters are assumed to be independent (see chapter 1). Due to the highly parameterized nature of IRT models, conjugate joint posterior distribution for all parameters might prove impossible or computationally cumbersome to calculate. In this case, one can employ MCMC procedures to approximate the posterior joint distribution p(θ, ξǀU). Gibbs algorithm samples from the joint posterior distribution of all parameters by alternating between parameters. More specifically, Gibbs sampler starts by drawing a random value from the posterior distribution of one parameter, and then proposes a new random value of that parameter conditional on all other parameters and the data (complete conditional distributions). Because the drawn value is from the posterior distribution of that particular parameter (i.e., the marginalized posterior distribution), this random walk proposal is always accepted (Gelman et al., 2013; Geyer, 2011; Krushchke, 2015). Gibbs sampler then picks one value for another parameter from its posterior, given the complete conditional with the latest updated draw for the previous parameter, and the process 72 continues. Gibbs sampler walks through every complete conditional distribution iteratively until the pre-determined number of iterations is reached.

In IRT modeling, Gibbs estimates both person and item parameters, with each value update in one step fed into the complete conditional distributions for the next step.

To translate this procedure to the 4PM estimation, suppose the Gibbs sampler starts with a random value θ_1^(T) of the first person parameter θ_1. Next, Gibbs samples another value θ_1^(T+1) conditional on the remaining θs, all ξ_j, and the observed data U. Once the value of θ_1 is updated, the Gibbs sampler continues by proposing the latent trait value for another respondent, θ_2^(T+1), conditional on the last update of θ_1 in the previous step, the remaining θs, all ξs, and the collected data U. Similarly, when the Gibbs transition proposal is accepted for an item parameter ξ_j, the chain proceeds by sampling a new value for ξ_{j+1} given the complete conditional now consisting of the observed data U, all the most recent θ values, and all ξ elements with ξ_{j+1} excluded. Letting T denote a random iteration in a Markov chain simulation to estimate the 4PM, we can express the workings of the Gibbs sampler in transitioning from T to T+1 as follows:

Draw θ_1^(T+1) ~ p(θ_1 | θ_i^(T), ξ^(T), U) with i > 1
Draw θ_2^(T+1) ~ p(θ_2 | θ_1^(T+1), θ_i^(T), ξ^(T), U) with i > 2
…
Draw θ_N^(T+1) ~ p(θ_N | θ_1^(T+1), θ_2^(T+1), …, θ_{N-1}^(T+1), ξ^(T), U)
Draw ξ_1^(T+1) ~ p(ξ_1 | θ^(T+1), ξ_j^(T), U) with j > 1
Draw ξ_2^(T+1) ~ p(ξ_2 | θ^(T+1), ξ_1^(T+1), ξ_j^(T), U) with j > 2
…
Draw ξ_{4k}^(T+1) ~ p(ξ_{4k} | θ^(T+1), ξ_1^(T+1), …, ξ_{4k-1}^(T+1), U)

When the complete conditional distributions are obtained in closed form and direct sampling from them is possible, the traditional Gibbs sampler can be employed. Otherwise, additional sampling methods can be used to supplement each step in Gibbs sampling, such as rejection sampling or adaptive rejection sampling (Junker et al., 2016; Patz & Junker, 1999a). If the number of iterations in this sampling process is sufficiently large, the distribution of draws will reach stationarity (Baker & Kim, 2004); the sequence of values produced in the Gibbs sampling process can then be used to construct the joint posterior distribution, and summary statistics can be calculated to derive point estimates of the parameters of interest.
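In practice, applied researchers rarely hand-code these full conditionals; a program such as JAGS derives a Gibbs-style sampling scheme from a declarative model specification. The following is a minimal sketch of a 4PM specification run from R via rjags, reusing the hypothetical response matrix U from the earlier sketch; the prior choices shown are illustrative assumptions, not the ones used in this study:

library(rjags)

model_string <- "
model {
  for (i in 1:N) {
    theta[i] ~ dnorm(0, 1)
    for (j in 1:k) {
      p[i, j] <- c[j] + (d[j] - c[j]) * ilogit(a[j] * (theta[i] - b[j]))
      U[i, j] ~ dbern(p[i, j])
    }
  }
  for (j in 1:k) {
    a[j] ~ dlnorm(0, 4)      # positive discrimination
    b[j] ~ dnorm(0, 0.25)    # difficulty (precision parameterization)
    c[j] ~ dbeta(2, 8)       # lower asymptote concentrated near 0
    d[j] ~ dbeta(8, 2)       # upper asymptote concentrated near 1
  }
}
"

jm <- jags.model(textConnection(model_string),
                 data = list(U = U, N = nrow(U), k = ncol(U)),
                 n.chains = 2)
update(jm, 1000)                                      # burn-in draws
draws <- coda.samples(jm, c("a", "b", "c", "d"), n.iter = 5000)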

Hamiltonian Monte Carlo. HMC (Neal, 2011) combines MCMC with molecular dynamics from physics. As opposed to Gibbs sampling or other MCMC methods like Metropolis-Hastings, HMC does not use random walk proposals to sample the parameter space. Instead, HMC pairs the variable of interest with an additional "momentum" variable as a way to increase sampling in regions with a high probability density and reduce the time spent in regions with a low probability density. HMC calculates the proposal distribution using the most recent position and modifies it at each step of the chain, rather than using a fixed proposal distribution (Gelman et al., 2013; Kruschke, 2015; McElreath, 2016; Neal, 2011). The common analogy for HMC sampling is that of a marble rolling down a flipped distribution curve, a bell-shaped normal curve for instance. It is easy to visualize how the marble will speed up when rolling on the steep part of the upside-down bell curve and move slowly when it reaches the trough. In this manner, the jump proposal is accepted with higher probability when the sampler reaches the high probability density portion of the posterior distribution (the trough) and explores this region more thoroughly.

In each HMC step, both the key variable and the momentum variable are updated simultaneously with the leapfrog method (Gelman et al., 2013). HMC starts by drawing a random value for the momentum variable and updating the momentum with a half step. In each jump that follows, the variable of interest and its momentum are updated with full steps until the last iteration, when the Markov chain finishes with a half step for the momentum. Again, letting U be the data matrix, θ be the parameter one is examining, and λ be its momentum, the probability that the proposed jump from state T to T+1 in the parameter space is accepted is

p = min( [p(θ^(T+1) | U) p(λ^(T+1))] / [p(θ^(T) | U) p(λ^(T))], 1 ) (10)

On the right side of the equation, the numerator features the proposed position in the posterior density and the proposed momentum, while the denominator represents the current position in the posterior density and the current momentum. If this ratio is larger than 1, the jump is accepted and the Markov chain proceeds with updates of both parameter and momentum values. In contrast, if the ratio between the two products is smaller than 1, the proposed jump is accepted only probabilistically. It is conspicuous from the equation that the momentum variable strongly influences the jumping distribution in the HMC chain, as a higher momentum density yields a higher acceptance probability for the jump. The inclusion of the momentum vector λ with an independent distribution can be viewed as a data augmentation technique to help the HMC sampler explore the parameter space faster (Gelman et al., 2013; Tanner & Wong, 1987). In the case of acceptance, the proposed values of both θ^(T+1) and λ^(T+1) are updated, but usually only the sampled parameter values are recorded for subsequent analysis, while the auxiliary momentum variable is not of interest and, thus, is not tracked (Gelman et al., 2013; Kruschke, 2015). Formal analysis reveals the optimal acceptance rate for HMC simulation is around 65% (Neal, 2011).
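To illustrate these mechanics, the following minimal R sketch implements one-dimensional HMC for a standard normal posterior; the step size eps and the number of leapfrog steps are arbitrary illustrative choices, and the acceptance test applies Equation 10 on the log scale:

    # Minimal one-dimensional HMC targeting a standard normal posterior.
    set.seed(123)
    U_pot <- function(q) 0.5 * q^2        # potential energy: -log density
    gradU <- function(q) q                # gradient of the potential
    eps <- 0.2; n_leap <- 20              # leapfrog step size and step count
    iter <- 5000
    q <- 0; out <- numeric(iter)
    for (t in 1:iter) {
      lambda <- rnorm(1)                  # fresh momentum draw
      q_new <- q
      l_new <- lambda - eps * gradU(q_new) / 2          # half step for momentum
      for (s in 1:n_leap) {
        q_new <- q_new + eps * l_new                    # full step for position
        if (s < n_leap) l_new <- l_new - eps * gradU(q_new)
      }
      l_new <- l_new - eps * gradU(q_new) / 2           # closing half step
      # Log of the Equation 10 ratio: joint (position, momentum) densities
      log_ratio <- (U_pot(q) + 0.5 * lambda^2) - (U_pot(q_new) + 0.5 * l_new^2)
      if (log(runif(1)) < log_ratio) q <- q_new         # probabilistic acceptance
      out[t] <- q
    }
    c(mean(out), sd(out))                 # approximately 0 and 1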

Studies on Comparisons of MML, Gibbs, and HMC in IRT

Unidimensional IRT Models for Dichotomous Data

Since MCMC methods were introduced in the psychometric literature around 1990, multiple studies have been conducted to explore the merits of the standard-practice MML and of newer MCMC techniques, especially the Gibbs sampler. The first study to explore the application of MCMC methods in IRT was undertaken by Albert (1992). In this pioneering article, Albert presented the Gibbs sampling algorithm to simulate from the joint posterior distribution of parameters in the two-parameter normal ogive IRT model and demonstrated EM and Gibbs estimations using real mathematics placement test data. In general, the two methods produced almost identical values for the b parameter, but discrimination parameter estimates with the Gibbs sampler were consistently larger than their counterparts with EM. One of the interesting findings from Albert’s exploration was that for highly discriminating items, EM estimates of the a parameter come with smaller standard errors than Bayesian estimation with Gibbs simulation, yet the reverse is true for items which poorly distinguish examinees. However, Albert’s (1992) use of data from 100 examinees, a very small sample for most, if not all, IRT applications, and the unknown true data generating mechanism meant it was impossible to determine which method offered more accurate estimates. More importantly, item parameters were estimated without imposing a prior distribution, which, as Baker and Kim (2004) commented, represents a non-Bayesian computation with the Gibbs sampler.

A more comprehensive investigation of item parameter recovery by MML and the Gibbs sampler under the two-parameter normal ogive model was conducted by Baker (1998) with manipulations of test length, sample size, and person and item latency. Both Albert’s (1992) Gibbs sampler and MML were found to shrink the original metric for small samples of examinees and to perform well for large samples. After item parameters were equated (i.e., placed on the same original metric), MML overall had an obvious advantage over the Gibbs sampler, since the former consistently produced smaller mean estimation bias across the manipulated conditions. Interestingly, the patterns of estimation bias differed for the two methods. While Gibbs tended to produce positive bias for item discrimination, MML estimation bias was more likely to be negative, and both estimators tended to produce smaller item difficulty estimates than their true values. MML and Gibbs improved their estimation accuracy with increasing sample sizes, a trend especially prominent for the latter estimator. At the largest tested sample size of 500 cases, the two estimators generally produced negligible mean bias and small standard deviations of bias for both item difficulty and item discrimination parameters. These patterns of findings were quite similar across the two groups with different θ generating distributions, and test length was reported to have little impact on estimation accuracy in general.

Albert’s (1992) non-Bayesian Gibbs sampler was extended into a fully Bayesian approach to estimation of the three-parameter normal ogive IRT model by Béguin and Glas (2001), and the findings somewhat mimicked what was found with the two-parameter model by Baker (1998). With 10 test items and a sample of 1,000 cases, the Gibbs sampler and MML within a Bayesian framework (that is, BME) produced comparable estimation bias and bias standard deviations, except for one item for which the Gibbs sampler had substantially larger bias. Again, the Gibbs sampler was not tried under many testing conditions and latent distributions, hence the limited generalizability of the findings. Kim (2001) used four authentic data sets to examine MML and Gibbs sampler estimations (in addition to JML and conditional maximum likelihood) in the Rasch model, and found that all estimators of interest produced markedly similar results, with MML giving smaller confidence intervals for estimates than the Gibbs procedure, findings which were echoed in Béguin and Glas (2001). However, because of the use of real-world data in Kim (2001), the accuracy of estimation was impossible to assess and compare.

Unidimensional IRT Models for Polytomous Data

Although dichotomous data tend to be more common in cognitive testing, they are by no means the only data type psychometricians encounter in educational measurement. Polytomous data are common in constructed-response formats, such as the paper-based Test of English as a Foreign Language (TOEFL), in which the writing section has a score range of 1–5. Polytomous data are also highly relevant in psychological measurement (Embretson & Reise, 2000) and other social science disciplines with their typical use of Likert-type instruments. Comparisons of estimation methods’ performances have been made for a variety of unidimensional models for non-binary data. Regarding the generalized partial credit model (GPCM), Luo (2018) compared MML and HMC across combined conditions of test length (5, 10, and 20 items), latent distribution (uniform, normal, and skewed), and sample size (500, 1,000, and 2,000 examinees), and reported interesting patterns of recovery accuracy. While the mean biases of MML and HMC in item location parameter estimation were similarly close to zero across latent distributions and sample sizes, their properties differed in estimating the discrimination parameter. Under the uniform θ distribution, item discrimination estimates by MML and HMC were positively biased, whereas when θ was drawn from a normal distribution, HMC estimates were negatively biased while MML yielded positive bias, with MML being less accurate in both circumstances (i.e., having larger absolute values for mean bias). In contrast, when the θ distribution was skewed, HMC and MML both underestimated the discrimination parameter, with MML being more accurate (having smaller absolute mean bias). Overall, both MML and HMC performed better under the normal latent trait distribution than under a uniform or skewed latent trait distribution, and both estimation methods improved the quality of person parameter recovery with increases in test length. Wollack et al. (2002) reported trivial differences between MML and Gibbs sampler recovery of IRF parameters under the nominal response model (NRM), although MML tended to offer slightly more accurate estimates, especially under conditions where both estimators performed poorly. Interestingly, the test length jump from 10 to 20 items assisted parameter recovery, yet the leap from 20 to 30 items did not offer noticeable improvement in estimation accuracy despite the increase in test information, arguably due to the set-back effect of the reduced ratio between sample size and the number of parameters to estimate (SSR; de Ayala & Sava-Bolesta, 1999; Wollack et al., 2002). Kieftenbeld and Natesan (2012) also found that for the graded response model (GRM), MML and Gibbs sampling differed very little in item parameter recovery when samples had 300 or more cases, and Gibbs had an advantage over MML in smaller samples (75 or 150 cases). With respect to person scoring, Gibbs sampling and MML-EAP produced comparable estimates, and both were superior to MML-MAP. For both MML and Gibbs, test length was more influential than sample size on person parameter recovery, whereas the reverse was the case for item parameter recovery under the GRM (Kieftenbeld & Natesan, 2012), results which somewhat parallel Luo’s (2018) exploration of HMC in the GPCM.

Multidimensional IRT Models

In many measurement situations, it is reasonable to assume that a single latent trait underlies manifest responses. A standardized mathematics test, for instance, is likely to test students’ mathematical knowledge and is not intended to measure their understanding of chemistry simultaneously. However, other situations might call for a multidimensional model to better capture the nature of response data. For example, a mathematics examination crafted in English but administered to speakers of other languages might actually tap two underlying domains, mathematical and English proficiency. The decision to model correlated data together, such as reading comprehension and vocabulary in a literacy test, can also necessitate consideration of a multidimensional structure. Thus, data in many measurement contexts require IRT models which satisfactorily elucidate the multiple underlying latent traits, and investigations into MML and MCMC estimations in multidimensional models have also been reported in the literature.

Testlet models represent an attempt to address the dependence of responses to several scale items (Wang & Wilson, 2005). For example, examinees’ answers to a group of comprehension questions (or testlet) based on one reading passage are likely to depend on both their general latent ability and their latent ability on that testlet, hence the violation of the local independence assumption. The magnitude of the additional testlet-specific ability parameter is represented by its variance. As with many other IRT models, parameter estimation methods, including MML and MCMC techniques, have been proposed for testlet models. In Jiao et al.’s (2013) study, the Rasch/one-parameter testlet model estimation was simulated using several computation methods, and MML was found to be more accurate in recovering the testlet ability variance, while the Gibbs sampler implemented in WinBUGS was more accurate in item parameter recovery. However, essentially negligible differences in parameter estimates by MML and Gibbs were observed in practical data modeling. For the two-parameter testlet model estimation investigated by Luo and Wolf (2019), Gibbs and MML similarly showed no significant accuracy differences in the recovery of testlet variance, item difficulty, and item discrimination parameters across the sample size and testlet variance conditions considered. Sample size increase, as expected, led to corresponding accuracy improvement in item parameter estimation, but reached a point of diminishing returns between medium (1,000 cases) and large (2,000 cases) samples for testlet variance parameter estimation by both methods. An evaluation of estimation methods for the bifactor model (Martin-Fernandez & Revuelta, 2017) showed that Gibbs and HMC, among other algorithms, estimated model parameters with essentially equal precision, irrespective of sample size conditions (500 and 1,000) and prior densities (informative vs. uninformative), but Gibbs was slower than its newer MCMC counterpart HMC. An interesting study undertaken by Chang (2017) spanned unidimensional and multi-unidimensional structures for the two-parameter IRT model with binary data and indicated that Gibbs and HMC were equally accurate in estimating both model types. These two sampling methods, however, differed in efficiency across models of different dimensionalities: while HMC was advantageous in the unidimensional case, the Gibbs sampler worked faster in the multi-unidimensional case.

Contrary to findings with testlet and bifactor models, Kuo and Sheng (2016) compared seven estimation methods for the multi-unidimensional graded response model, including MML and a Gibbs sampling procedure, and found that, overall, Gibbs sampling was less accurate than the other estimators. In addition, Gibbs tended to overestimate both item discrimination and item thresholds (i.e., location parameters for categories in an item with more than two ordered options), while other estimators, including MML (with both EM and adaptive quadrature), also overestimated item discrimination and the first threshold but underestimated the second threshold for three-category items.

Several points seem conspicuous from the above review. Among MCMC methods, Gibbs sampling is still the preferred method when an alternative to MML is sought, when an MML procedure has not yet been developed for a particular IRT model, or when authors aim to demonstrate how MCMC simulations work theoretically and practically. HMC has been the less frequently studied MCMC method, in part due to the statistical and IRT modeling traditions with MCMC brought about by trailblazing authors (Albert, 1992; Gelfand & Smith, 1990; Geman & Geman, 1984). The scarcity of research into HMC within the IRT framework also owes to the fact that some popular software packages designed to calibrate IRT models feature Gibbs sampling among their available MCMC methods, or at times as the only MCMC method, while HMC represents a relatively new MCMC approach and remains absent from most IRT software development considerations. The intensive programming required for IRT calibrations in software programs such as Stan (Carpenter et al., 2017), which is dedicated to Bayesian estimation in general rather than specializing in IRT parameter estimation, might preclude enthusiastic researchers from exploring this promising MCMC method to meet their IRT modeling needs. In addition, comparisons of estimator properties have been devoted more to item parameter than to person parameter recovery, probably in part because person scoring is not directly executed with MML. Regarding computational efficiency, MCMC conspicuously lags behind MML, with the remarkable exception of Luo and Wolf’s (2019) study, in which Gibbs sampling was found to be faster than MML. Though many factors might contribute to slow MCMC convergence, such as model complexity and the length of the Markov chain, the higher efficiency of Gibbs sampling in this particular case appeared to be brought about by the use of Mplus rather than the software commonly used in other studies, like BUGS and WinBUGS (Luo & Wolf, 2019). With respect to estimation accuracy, little discrepancy was observed between MML and MCMC methods overall, although MML appeared to be slightly advantageous over MCMC in more of the parameter recovery experiments.

Research on the Four-parameter IRT Model Estimation

The vast majority of research into the 4PM estimation to date has focused on

Bayesian estimation with both analytic and simulation procedures (Culpepper, 2016;

Loken & Rulison, 2010; Meng et al., 2019; Sheng, 2015; Waller & Feuerstahler, 2017;

Zhang et al., 2020), with the notable exception of Feuerstahler and Waller (2014). The first study to investigate the properties of Gibbs estimation in the 4PM was conducted by

Loken and Rulison (2010). For a sample of 600 cases, Gibbs recovered the difficulty parameter most successfully, with consistent but minor negative bias on average, the correlation between the true and estimated values at .97 or higher, and the 95% credible interval coverage at the minimum of .98 across three test length conditions. Although average estimation biases for other item parameters were also miniscule (average biases differed from zero only in the second or third decimal place), the correlations between the true and estimated values were much lower, ranging from over .30 to less than .70, which tended to increase with test length, except for the upper asymptote parameter. Overall, the coverage of the true parameters by the 95% credible intervals was better for the difficulty and slope parameters than for the lower and upper asymptotes, but generally remained quite high at 87% at the minimum. Person parameters also were recovered with little bias, and the correlations between true and estimated θs improved from .80 to .92 with the corresponding increase from 15 to 45 items. For all test lengths, the 95% credible interval 84 captured the true θ well with 95% of the time on average, and at least 92% for θ values outside the middle portion [-1, 1] of the latent continuum. Loken and Rulison’s (2010) simulation was limited to test length (15, 30, and 45 items) as the only manipulated design characteristic while adopting informative priors for the upper and lower asymptotes, a fixed sample size of 600 cases and unit normal distribution of θ.

Strong attention has been paid to prior specifications in Bayesian computations via Gibbs sampling when modeling with the 4PM. Sheng (2015) specifically examined the effect of prior density selection on the estimation of 4PM slope and threshold parameters with Gibbs, and found that although the slope and threshold estimation errors were proportional to the magnitude of the former and the extremity of the latter irrespective of prior choices, informative priors overall facilitated recovery accuracy when they were correctly specified. Sheng thus recommended informative priors for slope and intercept parameters whenever pre-existing knowledge about these parameters is available, partly for higher estimation accuracy and partly as an attempt to avoid convergence issues. In addition, Culpepper (2016) argued that information about the upper asymptote is not always available, given the only recently renewed interest in this model, and demonstrated satisfactory item parameter recovery with Gibbs using uninformative priors for both asymptote parameters when sample size is sufficiently large (2,500 examinees for educational measurement, and up to 10,000 cases for psychopathology data). Although the idea of having large data sets overwhelm the influence of priors is classic in Bayesian estimation, Culpepper (2016) particularly attributed accurate recovery of the upper asymptote to the presence of more cases at the higher end of the latent trait distribution, especially for highly difficult items. Zhang et al. (2020) developed a Gibbs-slice sampling algorithm to estimate the 4PM in two steps, one to update the upper and lower asymptote parameters with truncated beta distribution conjugation, and the other to slice-sample the item discrimination and difficulty parameters with different auxiliary variables. The authors demonstrated that Gibbs-slice sampling effectively recovered item and person parameters even with a sample of 500 cases, and was not sensitive to prior specifications as long as the prior distributions were not wildly unreasonable.

Also within the Bayesian framework, BME was reported to accurately recover item parameters with 5,000 cases, while accurate person scoring in the subsequent phase with EAP required a sample size of only 1,000 cases for simulated data which echoed Reise and Waller’s (2003) psychopathology data characteristics (Waller & Feuerstahler, 2017). Regarding estimation bias patterns, BME generally overestimated the discrimination and pseudo-chance parameters and consistently underestimated the upper asymptote (meaning the estimate of the slipping parameter 1 − d was larger than its actual value), while bias for the item location parameter was mostly positive and negligible. Further investigation revealed the d parameter was least accurately recovered for highly difficult items, in line with Culpepper’s (2016) conclusion with the Gibbs sampler. To improve the d parameter recovery accuracy, it is advisable to incorporate a large number of respondents with high trait values, a note by Waller and Feuerstahler (2017) which also corroborated Culpepper’s (2016) findings. Meng et al. (2019) reformulated the 4PM as a mixture model and estimated it with marginalized maximum a posteriori (MMAP) via a newly developed E-M algorithm. It was shown that Meng et al.’s (2019) estimation method had a superior convergence rate compared to the E-M algorithm in BME and also offered higher parameter recovery accuracy than BME with samples smaller than 10,000 cases.

The only investigation into 4PM parameter recovery with MML, by Feuerstahler and Waller (2014), indicated the necessity of a large sample of 10,000 cases to obtain accurate item parameter estimation in the same psychopathology data calibration. MML tended to overestimate the discrimination and pseudo-chance parameters while underestimating the upper asymptote parameter across all sample sizes tested (1,000, 5,000, and 10,000 cases). Unfortunately, little else is known about the study, such as the number of quadrature points and how the latent traits were generated.

A consistent finding in the studies by Culpepper (2016) and Loken and Rulison (2010) pointed to the merits of Gibbs sampling in recovering person parameters well, with correlations between the true and estimated θs verging on .90 or higher. Waller and Feuerstahler (2017) had a similar result with θ estimation using EAP following BME when the generated responses mimicked real-world psychopathology data characteristics. Zhang et al. (2020) reported correlations between true and estimated θs ranging from .86 to .92 with Gibbs-slice sampling, depending on the test length and sample size combination. However, the discrepancy in sample sizes across these studies is clear: Loken and Rulison (2010) reported satisfactory θ estimates with a sample of 600, Zhang et al. (2020) examined samples as small as 500 cases, while a sample of at least 1,000 cases was proposed by Waller and Feuerstahler (2017). Recommended sample sizes for accurate item calibration, too, varied across studies, ranging from several thousand to 10,000 cases (Culpepper, 2016; Feuerstahler & Waller, 2014; Meng et al., 2019; Waller & Feuerstahler, 2017). It is important to note that data characteristics affected the sample size requirement: simulated data in both Feuerstahler and Waller’s (2014) and Waller and Feuerstahler’s (2017) studies echoed the characteristics of psychopathology measurement, which features more items with high difficulty parameter values, hence the need for a very large sample size (synonymous with an increased number of cases with high latent trait values) to attain accurate item calibrations. Similarly, Culpepper (2016) recommended a smaller sample of 2,500 cases for cognitive testing data and a larger sample of 10,000 cases for psychopathology data to obtain accurate item parameter estimation when no prior knowledge is available. The test length factor also influenced parameter estimation; it was varied in some studies (Loken & Rulison, 2010; Zhang et al., 2020) but held constant in other simulations (Culpepper, 2016; Meng et al., 2019).

Certain other results are salient and recurrent across past studies examining the consequences of failing to incorporate a valid upper asymptote parameter. When data were generated under the 4PM but modeled with the 3PM or 2PM, comparisons revealed that the under-parameterized models underestimated item discrimination but overestimated item difficulty (Culpepper, 2016; Loken & Rulison, 2010). Estimates of the c parameter also tended to become less accurate, with higher root mean square error and lower correlation with the true pseudo-chance parameter values (Loken & Rulison, 2010). These consequences of ignoring a valid d parameter also found corroboration in studies that modeled real-world data for which the 4PM was reported to be the better fitting model (Culpepper, 2017; Waller & Reise, 2010). Person parameters were well estimated in general, with essentially zero bias from all three models and high correlations between the latent trait estimates by the 4PM and their counterparts by the 3PM and 2PM. However, the precision of θ estimates for respondents with high and low trait levels tended to be compromised as a result of fitting the wrong model. For example, the 3PM produced intervals that were too narrow, and its 95% credible intervals covered the true θ less than 95% of the time for high θ in most test length conditions, while wrongly modeling with the 2PM led to poor coverage at both ends of the latent continuum (Rulison & Loken, 2010). The impaired precision of person scoring can have dire consequences, in clinical practice as a case in point, if the true data generating mechanism is closer to the 4PM than to the 2PM and 3PM, because respondents with high θ levels are of interest to psychopathologists and are often those most in need of interventions (Waller & Reise, 2010).

The differences in item parameter estimates by the 4PM versus the 2PM and 3PM translate into gaps in total information, which is the sum of information from all items in a scale. Past studies found that, compared to the 3PM, modeling data with the 4PM provided more information in the middle portion and less information in the upper part of the θ scale (Culpepper, 2017; Waller & Reise, 2010). Figure 5 illustrates the difference in test information for a hypothetical data set generated under the 4PM and modeled with the 3PM and 4PM. The most notable differences between the two graphs are (1) the more peaked information for the middle range of the latent scale provided by the 4PM, and (2) the amount of information for test-takers at the upper end of the latent scale, with the 3PM results claiming to provide more information (and hence a lower standard error) for this group of examinees. Again, if the 4PM is the more fitting model than its predecessor the 3PM, modeling the slip parameter brings new understanding of the precision and utility of measurement scales for respondents at different levels of the latent trait, particularly a less rosy picture of what scale items allow us to know about high scoring examinees.

Figure 5

Test Information (Blue Line) and Standard Error (Purple Line) of the 3PM (Left) and the 4PM (Right)

Note. The hypothetical data set consists of 20 items and 2,500 cases.
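For readers who wish to reproduce such a comparison, the quantities behind Figure 5 can be computed directly: for a binary IRF, item information is I(θ) = P′(θ)² / [P(θ)(1 − P(θ))], test information is the sum of item information, and the standard error is the reciprocal square root of test information. A hedged R sketch with purely illustrative item parameters follows (setting c = 0 and d = 1 recovers the 2PM):

    # Item and test information under the 4PM.
    p4 <- function(theta, a, b, c, d) c + (d - c) / (1 + exp(-a * (theta - b)))
    info4 <- function(theta, a, b, c, d) {
      pstar  <- 1 / (1 + exp(-a * (theta - b)))     # 2PL kernel
      pprime <- a * (d - c) * pstar * (1 - pstar)   # derivative of the IRF
      p <- p4(theta, a, b, c, d)
      pprime^2 / (p * (1 - p))                      # I(theta) = P'^2 / (P(1 - P))
    }
    theta <- seq(-4, 4, by = 0.1)
    set.seed(1)                                     # hypothetical 20-item pool
    a <- runif(20, 0.8, 2); b <- rnorm(20)
    cc <- runif(20, 0, 0.25); dd <- runif(20, 0.75, 1)
    test_info <- rowSums(sapply(1:20, function(j)
      info4(theta, a[j], b[j], cc[j], dd[j])))
    se <- 1 / sqrt(test_info)                       # conditional standard error

Because an upper asymptote d < 1 flattens the IRF at the high end of the scale, P′(θ) and hence the information shrink there, which is consistent with the lower high-θ information reported for the 4PM.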

Interestingly, however, when the true model is the 2PM or 3PM, using the 4PM does not provide a better fitting model, since this more highly parameterized model only performs better when the actual upper asymptote is smaller than one. In Sheng’s (2015) simulation, the 2PM and 3PM outperformed the 4PM in both parameter recovery accuracy and model fit when the data generating mechanisms were in fact these less parameterized models. Similar findings were reported by Zhang et al. (2020): when item responses were simulated under the 4PM, modeling the data with simpler models (the 2PM and 3PM) led to underfitting, whereas modeling the data with the 4PM brought about overfitting when item responses were generated under the 2PM and 3PM. Therefore, modeling an additional parameter does not always bring additional value if the underlying structure of the data does not necessitate an upper asymptote.

Factors Manipulated in IRT Parameter Recovery Studies

A multitude of factors might affect the quality of parameter estimates for an IRT model, such as how many participants respond to a scale, how many items a scale has, whether parameters are fixed or freely estimated, whether the respondent sample is heterogeneous or homogeneous on the latent scale, and whether our assumption about the latent distribution matches its actual characteristics. Below is an overview of the two factors subject to manipulation in this simulation study.

Sample Size

Sample size is probably the most heavily studied design feature in modeling with item response theory. Because IRT is known as a large-data technique, the question of exactly how large data should be concerns methodologists, psychometricians, and applied researchers alike. Apart from its important role in accurate assessment of latent characteristics, sample size availability also affects other modeling issues like the choice of IRT models, estimation methods, algorithm convergence, and power to detect model-data misfit. Sample size guidelines, thus, are not intended to be followed to the letter, because varying authentic calibration objectives, such as the desired accuracy level and power, might call for different numbers of cases (de Ayala, 2009). Other things being equal, more complex, highly parameterized models generally require larger samples than less complex models do for similar estimation quality. For unidimensional dichotomous data, past studies suggested a sample of 50 cases might suffice for Rasch modeling, but more recent research pointed to parameter estimation instability in small sample calibrations and suggested a sample size of 400 at the minimum for acceptable precision (Chen et al., 2014; Custer, 2015). For the 2PM, de Ayala’s (2009) review suggested a sample of 500 respondents and a scale of 20 items should be sufficient for reasonable item parameter estimation accuracy when MML is used for calibration. For the 3PM, the recommended sample size appears far larger than that for the 1PM and 2PM. Thissen and Wainer’s (1982) study found that maximum likelihood estimation in the 3PM with a sample of 10,000 still yielded unacceptably large standard errors for item parameter estimates, particularly for easy items. De Ayala (2009) reviewed past studies and found that MML offers reasonable accuracy in recovering item parameters in the 3PM with 1,000 cases and a 20-item assessment tool, yet encouraged attainment of samples larger than 1,000 to accommodate possible convergence issues with this model. Similar to investigations into the 3PM estimation, most 4PM estimation studies thus far have also suggested large samples for successful recovery of person and item parameters. As discussed above, samples ranging from several thousand to 10,000 cases for accurate item calibration, depending on data characteristics, and a sample of at least 1,000 cases for stable person parameter estimates were recommended for the 4PM (Culpepper, 2016; Feuerstahler & Waller, 2014; Waller & Feuerstahler, 2017). In contrast, Loken and Rulison (2010) demonstrated good item parameter estimates under the four-parameter IRF with the Gibbs sampler for a sample of 600 cases, yet the relatively low correlations between the true and estimated values (ranging from .37 to .69) for the slope and two asymptote parameters might represent a worrying result for researchers.

Latent Trait Distribution

Latent trait distribution is also frequently varied in psychometric research designs. For unidimensional binary models, latent trait scores are typically generated from the unit normal density N(0,1), but the effect of other latent distributions, such as uniform and skewed distributions, has also been increasingly investigated (e.g., Kieftenbeld & Natesan, 2012; Luo, 2018; Svetina et al., 2017). Regarding the normal distribution, its mean and variance can also be manipulated so that the θ generating distribution differs from the θ prior density and the cluster of item difficulty parameters (e.g., Baker, 1998). For multidimensional models, another related factor often subject to manipulation is the inter-trait correlation (e.g., Chang, 2017; Kuo & Sheng, 2016). Simulation studies characterizing data with the 4PM thus far have employed the standard normal (Culpepper, 2017; Loken & Rulison, 2010; Rulison & Loken, 2009; Sheng, 2015; Waller & Feuerstahler, 2017) and uniform distributions (Liao et al., 2012; Yen, Ho, Liao, & Chen, 2012) to create sample person parameters, while skewed latent distributions have been absent from the data generation schemes. The impact of latent trait distribution type has not been evaluated for the 4PM estimation and will be an area of investigation in the present study.
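To illustrate how a negatively skewed condition can be constructed, one common device (an assumption for illustration here, not necessarily the generating scheme used in this study or any cited one) is to reflect and standardize a gamma variate so that θ has mean 0, variance 1, and negative skewness:

    # Negatively skewed theta via a standardized, reflected gamma variate.
    # A gamma with shape k has skewness 2 / sqrt(k); reflection flips the
    # sign, so shape = 4 yields skewness of about -1.
    set.seed(123)
    N <- 2500
    shape <- 4
    g <- rgamma(N, shape = shape, rate = 1)        # mean = shape, var = shape
    theta <- -(g - shape) / sqrt(shape)            # reflect, center, and scale
    round(c(mean(theta), sd(theta)), 2)            # approximately 0 and 1
    mean((theta - mean(theta))^3) / sd(theta)^3    # sample skewness, near -1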

Issues for Consideration with MML

When fitting an IRT model with MML, researchers have to make several decisions about the configuration of the calibration process. The key decisions usually concern the number of quadrature points, the choice of θ prior density, and the method used to estimate person parameters subsequent to item calibration.

Prior Distribution for Abilities

The correspondence between the θ prior distribution and the true underlying θ distribution brings about item calibration consistency, because additional test-takers do not change the θ distribution used in the likelihood function integration and thus do not require recalibration (Harwell et al., 1988). Seong (1990) examined the 2PM estimation with MML and found that agreement between the actual and the assumed θ distributions also has a particularly pronounced effect in improving person parameter estimation accuracy, even in small data sets. Typically, the prior density for the latent trait variable is set to the standard unit normal N(0,1) in IRT modeling, and this approach is recommended when knowledge about the θ probability density is lacking (Seong, 1990). Studies of the 4PM estimation have employed N(0,1) and the more diffuse distribution N(0,2) as θ priors (Liao et al., 2012; Loken & Rulison, 2010; Rulison & Loken, 2009; Waller & Feuerstahler, 2017).

Quadrature Points

Overall, when the specified θ prior matches the actual underlying θ distribution and sample size is large, increasing the number of quadrature points enhances item calibration accuracy. Person scoring, however, is affected by the number of quadrature points irrespective of sample size and the agreement between the actual and the assumed θ distributions (de Ayala, 2009; Kim & Lee, 2017; Seong, 1990). A recent study by Kim and Lee (2017) on 3PL model estimation with MML revealed that, all else (estimated versus fixed θ, numerical analysis approach, and reference frame for item calibration) being equal, at least 21 quadrature points are recommended to obtain low estimation bias for item parameters when the latent trait distribution is normal. In the 4PM literature, Waller and Feuerstahler (2017) employed 33 quadrature points to estimate latent trait parameters with EAP following BME item calibration.
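The role of the quadrature grid can be seen in a small R sketch that approximates the marginal probability of one response pattern under an assumed N(0,1) prior, using Q equally spaced nodes with normalized normal weights (an illustrative rectangular rule, not the exact scheme implemented in any particular program):

    # Marginal probability of response pattern u under the 4PM, integrating
    # theta out over an N(0,1) prior with Q quadrature points.
    p4 <- function(theta, a, b, c, d) c + (d - c) / (1 + exp(-a * (theta - b)))
    marginal_prob <- function(u, a, b, c, d, Q = 21) {
      nodes   <- seq(-4, 4, length.out = Q)
      weights <- dnorm(nodes); weights <- weights / sum(weights)
      lik <- sapply(nodes, function(th) {
        p <- p4(th, a, b, c, d)
        prod(p^u * (1 - p)^(1 - u))      # pattern likelihood at this node
      })
      sum(weights * lik)                 # weighted sum approximates the integral
    }
    # Illustrative five-item pattern and parameters
    u <- c(1, 1, 0, 1, 0)
    marginal_prob(u, a = rep(1.2, 5), b = c(-1, -0.5, 0, 0.5, 1),
                  c = rep(0.15, 5), d = rep(0.95, 5), Q = 21)

Increasing Q refines the grid over which θ is integrated out, which is one way to see why the grid density matters for person scoring even when large samples stabilize item calibration.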

Person Parameter Estimation Method

In addition to MLE, researchers also have Bayesian alternatives, including EAP (which uses the posterior distribution mean as the point estimate of the person parameter) and MAP (which, unlike EAP, employs the posterior distribution mode). Among the person scoring methods following MML item calibration, EAP is probably the most commonplace. Both EAP and MAP have an advantage over MLE in their capacity to estimate extreme score patterns (i.e., all right or all wrong answers). EAP computation is mathematically simpler than its Bayesian counterpart MAP, and past studies found EAP estimates to be more accurate and less affected by regression toward the mean of the θ prior distribution than MAP estimates (de Ayala, 2009).
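A minimal sketch of EAP scoring by quadrature, reusing the illustrative 4PM response function and node scheme above, takes the posterior mean of θ over the grid; MAP would instead take the grid point that maximizes the posterior:

    # EAP estimate of theta: posterior mean over a quadrature grid.
    eap <- function(u, a, b, c, d, Q = 33) {
      nodes <- seq(-4, 4, length.out = Q)
      prior <- dnorm(nodes)                         # N(0,1) prior density
      lik <- sapply(nodes, function(th) {
        p <- c + (d - c) / (1 + exp(-a * (th - b))) # 4PM response probabilities
        prod(p^u * (1 - p)^(1 - u))
      })
      post <- prior * lik                           # unnormalized posterior
      sum(nodes * post) / sum(post)                 # EAP; MAP: nodes[which.max(post)]
    }
    eap(u = c(1, 1, 0, 1, 0), a = rep(1.2, 5), b = c(-1, -0.5, 0, 0.5, 1),
        c = rep(0.15, 5), d = rep(0.95, 5))

Note that EAP remains defined for all-correct and all-incorrect patterns, since the prior keeps the posterior proper, which is the advantage over MLE noted above.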

Issues for Consideration with MCMC

Fitting a psychometric model in a fully Bayesian framework via MCMC requires the researcher’s attention to a number of issues, two of which are probably most important. One is the setting of prior distributions for both item and person parameters, the feature that distinguishes the Bayesian from the frequentist approach. The other is the method used to monitor Markov chain convergence to the target distribution, which sets Bayesian calculation via MCMC apart from its analytic counterpart.

Priors for Person and Item Parameters

Since the posterior distribution of a parameter represents a compromise between its prior distribution and the likelihood function, the influence of the prior on Bayesian inference is proportional to how much information it carries and is subdued by the likelihood in analyses conducted on large data sets (Gelman et al., 2013). Depending on the readily accessible information about parameter values, priors can be specified as noninformative (diffuse), weakly informative, or informative, or can be empirically determined from the sample (Matteucci et al., 2012). In the item response modeling framework, Lord (1986) conjectured that incorporating Bayesian priors, even noninformative ones, prevents item discrimination and pseudo-chance parameters, which tend to be difficult to estimate, from taking on implausible values, and allows placement of latent trait estimates within sound limits. However, despite the common expectation that the information contained in prior densities improves calibration accuracy, especially in studies based on small samples, recent research casts doubt on this belief because severe estimation bias has been observed in 2PM calibration with both informative and diffuse priors (Marcoulides, 2018). Thus, it has been recommended that priors be selected in light of the sample size and objective of an investigation (Culpepper, 2016; Marcoulides, 2018; Matteucci et al., 2012). In the Bayesian IRT literature, certain distributions have become the norm for person and item parameter priors: the standard normal density for ability, a normal density for the difficulty parameter, a log normal or truncated normal for the discrimination parameter to ensure the non-negativity expected of well-behaved scale items, and a beta distribution for the lower asymptote (Baker & Kim, 2004; Matteucci et al., 2012; Sahu, 2002). Regarding the upper asymptote, the beta distribution, the uniform distribution, and a logit reparameterization of the normal distribution have been used as priors (Culpepper, 2016; Loken & Rulison, 2010; Sheng, 2015; Waller & Feuerstahler, 2017).

Convergence Assessment

One prerequisite to using draws from a Markov chain to make inferences about the parameters of interest is to determine whether the chain has converged. It is important to emphasize that MCMC convergence means the Markov chain has reached a stationary distribution, which is the (joint) posterior distribution of the parameter(s) being estimated, whereas MML with EM converges to a single point (Sinharay, 2003, 2004). MCMC convergence evaluation is challenging because it is almost impossible to theoretically determine when a chain has converged to the posterior distribution in general (Brooks & Roberts, 1998; Cowles & Carlin, 1996), and the rate of convergence appears peculiar to a model and its specifications (Sinharay, 2004). Gelman et al. (2013) summarized the difficulty of monitoring MCMC convergence in three questions: (a) how to know when the Markov chain is long enough to terminate the run, (b) how to handle the initial part of the chain, which might remain heavily influenced by its starting value and unrepresentative of the target posterior distribution, and (c) how to accommodate the low information content due to the correlated nature of draws in a chain. Multiple approaches have been proposed to help MCMC users evaluate convergence. Accessible presentations of convergence diagnostics and graphical examination methods for psychometricians can be found in Junker et al. (2016) and Sinharay (2003, 2004). Cowles and Carlin (1996) provided a comprehensive treatment of MCMC algorithm convergence assessment techniques with discussions from both theoretical and applied perspectives, and a more technical exposition was written by Brooks and Roberts (1998). A review of MCMC convergence diagnostic tools with a discussion of more recent developments can be found in Roy (2020). Below is a description of common practices to monitor MCMC algorithm convergence.

Suppose one runs a pilot Markov chain with 2n iterations and divides the chain into two equal halves. Convergence diagnostics can then be performed on the second half of the chain, with n iterations. If convergence checks suggest the chain has reached a steady state, one can discard the first half of the chain (the burn-in segment) and utilize the output from the latter half for probabilistic inferences. However, if convergence diagnostic operations suggest the MCMC sampler still fluctuates considerably along its trajectory, more iterations are needed for the chain to reach convergence. This trial-and-error process is repeated until convergence diagnostics, preferably a multitude of them, suggest the chain has converged to an equilibrium distribution. Another oft-used approach is to run several parallel chains with well-separated starting points; approximate convergence is obtained only when these sequences overlap thoroughly and converge to the same region of the parameter space, which again can be examined with diagnostic tools. However, this multi-start approach with several shorter chains is disputable, and running a single but very long chain to ensure the MCMC algorithm in use has carefully traversed the parameter space and escaped from any pseudo-convergence state is recommended by Geyer (2011). Sinharay (2004) and Patz and Junker (1999a) tried both methods for the 3PM and 2PM estimations, respectively, and found that one single very long chain and several independent chains with various starting points yielded essentially the same estimates. The key issue here is to ensure that each chain in a group of parallel chains, however short, has sufficient iterations to explore the parameter space and produce a representative sample from the posterior. If the multi-chain approach is employed, one must ensure diagnostics for each chain suggest convergence before one can safely use the group of chains for inferences. For investigations of the 4PM estimation and its applications with the Gibbs sampler, chain lengths across studies vary substantially, from tens of thousands to over two hundred thousand iterations, and both single long runs and multiple runs have been used (Culpepper, 2016, 2017; Loken & Rulison, 2010; Sheng, 2015; Waller & Reise, 2010).

Cowles and Carlin (1996) included a useful summary of convergence diagnostics based on their theoretical underpinnings, applicability, and other criteria. Some diagnostic tools are numerical, such as the potential scale reduction factor (PSRF; Gelman & Rubin, 1992) or its alternative version (Brooks & Gelman, 1998); others are graphical (trace plots and density plots), or both. Some diagnostics, by Raftery and Lewis (1992) and Geweke (1992), can be performed on a single chain, whereas others are used in multi-chain contexts. According to Cowles and Carlin (1996), the most popular convergence checking techniques are probably those initiated by Raftery and Lewis (1992) and Gelman and Rubin (1992). The method by Raftery and Lewis (1992) allows a rough calculation of the chain length necessary to obtain a predetermined level of estimation accuracy for chain quantiles. Raftery and Lewis’s (1992) procedure might prove particularly useful in helping researchers navigate the trial-and-error phase to determine the needed number of iterations and burn-in portion, yet it is univariate in nature and can give differing results for different initial chains for the same model (Cowles & Carlin, 1996). The PSRF, denoted R̂ and based on a variance ratio (Gelman & Rubin, 1992), is criticized for its dependence on a normal approximation and its use of overdispersed starting points, which ultimately requires knowledge of the posterior distribution to assess (Cowles & Carlin, 1996). Brooks and Gelman’s (1998) alternative R̂ based on interval length is simpler to calculate and does not require a normality assumption. According to Brooks and Gelman (1998), R̂ < 1.1 or 1.2 suggests that the Markov chains have converged to the posterior distribution. The multivariate potential scale reduction factor (MPSRF), the generalized version of the original method, was also proposed to diagnose convergence in multi-parameter modeling contexts (Brooks & Gelman, 1998; Brooks & Roberts, 1998), and was found to be more powerful than the PSRF in detecting convergence failure in the 3PM estimation via a Metropolis-within-Gibbs algorithm (Sinharay, 2004). Waller and Reise (2010) also found that the use of the PSRF at the recommended threshold near 1.1 or 1.2 alone was insufficient for assessing Gibbs sampler convergence in estimating the 4PM fitted to psychopathology data.
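For concreteness, the variance-ratio PSRF can be computed from m parallel chains of length n as in the following minimal R sketch of Gelman and Rubin's R̂ (omitting the degrees-of-freedom correction of the original formulation):

    # Potential scale reduction factor (R-hat) for one parameter, given a
    # matrix of post-burn-in draws with one column per chain.
    psrf <- function(draws) {
      n <- nrow(draws); m <- ncol(draws)
      B <- n * var(colMeans(draws))        # between-chain variance
      W <- mean(apply(draws, 2, var))      # within-chain variance
      v_hat <- (n - 1) / n * W + B / n     # pooled variance estimate
      sqrt(v_hat / W)                      # values near 1 suggest convergence
    }
    # Illustration: four well-mixed chains sampling the same N(0,1) target
    set.seed(123)
    chains <- matrix(rnorm(4000), ncol = 4)
    psrf(chains)                           # approximately 1

With poorly mixed or disagreeing chains, B inflates relative to W and R̂ rises above the 1.1 to 1.2 thresholds cited above.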

A note on managing autocorrelation in MCMC sampling results is warranted. Because a Markov chain is designed so that each generated value depends directly on the draw immediately before it, MCMC results are characterized by within-sequence correlations. Because correlated values provide less accurate inferences about the parameters under examination than independently simulated draws, researchers can consider keeping only every kth simulated value, a technique referred to as thinning the chain. An autocorrelation function graph can be examined to reveal how autocorrelation gradually diminishes and at which lag draws can be considered independent of one another (i.e., correlation approaching zero). Moreover, MCMC users can also calculate the effective sample size (ESS) to gauge the amount of independent information provided by the dependent draws in a Markov chain.
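A rough R sketch of both devices, estimating the effective sample size from the leading positive autocorrelations of a single chain and thinning by retaining every kth draw, might look as follows; the autocorrelated AR(1) chain is purely illustrative:

    # Crude effective sample size: n / (1 + 2 * sum of positive autocorrelations).
    ess <- function(x, max_lag = 100) {
      ac  <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]  # drop lag 0
      pos <- ac[cumprod(ac > 0) == 1]      # leading run of positive lags
      length(x) / (1 + 2 * sum(pos))
    }
    set.seed(123)
    x <- arima.sim(list(ar = 0.9), n = 10000)   # strongly autocorrelated chain
    ess(x)                                      # far smaller than 10,000
    thinned <- x[seq(1, length(x), by = 10)]    # thinning: keep every 10th draw
    acf(thinned, plot = FALSE)$acf[2]           # much-reduced lag-1 autocorrelation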

There is a general consensus that no single diagnostic technique guarantees successful detection of MCMC sampling convergence. Because each procedure represents a different check on convergence, it is advisable to employ multiple diagnostics simultaneously (Brooks & Gelman, 1998; Cowles & Carlin, 1996; Junker et al., 2016; Sinharay, 2003, 2004). This recommendation is particularly relevant to psychometric modeling, since the use of a single convergence diagnostic, the PSRF, was found to be an unreliable convergence evaluation method for 3PM and 4PM estimations with MCMC (Sinharay, 2004; Waller & Reise, 2010). Sinharay (2004) noted that the high item parameter cross-correlations typical of many IRT models often necessitate a large number of Markov chain iterations, and that convergence might not need monitoring for person parameters because they tend to be accurately estimated.

Chapter Summary

This chapter began with a brief history of the 4PM, including the model’s initiation, its long period of neglect, and the recently rekindled interest in its potential usefulness, followed by its current applications across disciplines and continuing debates on the interpretation and utility of the upper asymptote parameter. The essential features of the three estimators (MML, Gibbs, and HMC) were then described, and a review of the relevant IRT parameter recovery literature across various model types was provided. An important part of this review concerned the state-of-the-art knowledge about parameter recovery under the IRF for the 4PM with MML, Gibbs, and BME. The chapter ended with a discussion of the factors often manipulated in IRT parameter recovery studies and model configuration issues in relation to the three estimation methods under investigation.


Chapter 3: Methodology

The overarching purpose of this dissertation is to compare the quality of person and item parameter recovery under the four-parameter item response function (Barton & Lord, 1981) by Marginal Maximum Likelihood with the Expectation Maximization algorithm (Bock & Aitkin, 1981) and by Bayesian statistical inference via two MCMC methods, Gibbs sampling (Gelfand & Smith, 1990; Geman & Geman, 1984) and Hamiltonian/hybrid Monte Carlo (Neal, 2011). Specifically, this research study seeks answers to two questions, each of which deals with a factor commonly encountered in educational measurement practice. The first question addresses the performances of the three estimation algorithms across three sample size levels (1,000, 2,500, and 5,000 cases). The second question evaluates the aforementioned estimation approaches under two distributional characteristics of the latent trait (normality and negative skewness).

Monte Carlo Research

While Monte Carlo simulation tends to be treated as a single concept, the term actually comprises two distinct notions, simulation and Monte Carlo. Simulation is the creation of a pseudo-reality which mimics a system, an environment, or a process with controlled characteristics in order to understand the impact of modifications and interventions, while Monte Carlo refers to the use of a repeated random process (Sawilowsky, 2003). In modern statistics, simulation studies are typically conducted with computers, and the properties of, say, a statistical estimator unfold after many repetitions, which explains why simulation and Monte Carlo methods are used interchangeably in the literature. Taken together, a Monte Carlo simulation is a statistical sampling experiment; viewed from the frequentist perspective, the purpose of a Monte Carlo experiment is to approximate the sampling distribution of a statistic and assess its behavior via repeatedly drawn random samples from pre-defined population distributions (Fan et al., 2002; Mooney, 1997). When applied to Bayesian statistics, the purpose of a Monte Carlo experiment is to construct or approximate the (joint) posterior distribution and evaluate the long-run frequency-based properties of Bayesian probabilistic statements about parameters (Morris et al., 2019) (more details about the contrasting frequentist and Bayesian conceptions of probability are presented in chapter 5). Although empirical research also employs random sampling, Monte Carlo simulation research distinguishes itself from empirical research in its use of data generated with computer software, so that the characteristics of the underlying data generating mechanism are controlled and known in advance (Harwell et al., 1996). In light of the purpose of this study, Monte Carlo simulation was the best fitting methodology, because answers to the research questions described above could not be theoretically derived, and because the true but unknown parameters in empirical data could not support the nature of this investigation (Fan et al., 2002; Feinberg & Rubright, 2016; Harwell et al., 1996; Mooney, 1997). According to Mooney (1997), comparison of estimators’ properties, and other research situations such as examination of assumption violations or inference with weak statistical theory, call for the use of simulation studies.

In the psychometric literature, Monte Carlo simulation studies are popular. Indeed, a small survey by Feinberg and Rubright (2018) of one-fifth of the articles published between 2008 and 2012 in Applied Psychological Measurement and the Journal of Educational Measurement found that 27 of 45 articles were simulation based. The benefits of Monte Carlo methods in measurement underlie their popularity. In addition to serving as an alternative to intractable analytic solutions, Monte Carlo research is advantageous over studies with empirical data in that it allows the collection of data under manipulated conditions at a very low cost and is not affected by real-life data issues such as nonrandom missingness and other possible confounders like test-taker cultural and linguistic backgrounds (Bulut & Sünbül, 2017). However, Harwell et al. (1996) offered a fair warning that the popularity of simulation studies should not be equated with a false belief that Monte Carlo methods offer a one-size-fits-all solution to measurement research questions. Quite the contrary: as several authors have pointed out, the validity of Monte Carlo research is tied to the plausibility of its assumptions, and its utility depends on how relevant and authentic the manipulated conditions are. In statistical simulation research, assumptions are important and must be reasonably met in practice so that the findings from simulations are applicable and generalizable to real-world data analyses. If the assumptions of a Monte Carlo study are usually unmet in practice, and the conditions under investigation do not align with reality, interpretations are hard to draw and the external validity of the research can become seriously questionable (Feinberg & Rubright, 2018; Harwell et al., 1996; Luecht & Ackerman, 2018; Mooney, 1997). The success of a Monte Carlo study, like research with other methodologies, thus largely depends on the thoughtfulness and planning of the researcher.

Harwell et al. (1996) delineated a general eight-step procedure for a Monte Carlo study in IRT: (a) formulate research questions, (b) describe the conditions under investigation, (c) select an experimental design, (d) generate data, (e) estimate model parameters, (f) compare the “true” and estimated parameters, (g) replicate the process in each cell, and (h) analyze criterion variables with both descriptive and inferential methods. The following sections offer a detailed description of the research design, data generation under the four-parameter IRF, the number of replications, analytic planning and outcome variables, and parameter estimation, with a focus on the choices associated with each estimation approach and statistical software; a brief sketch of the data generation step appears below. Due to the scarcity of studies into the 4PM estimation, this research prioritized the common decisions regarding issues such as prior densities for parameters estimated in the Bayesian framework and the person scoring method following MML item calibration. It is hoped that a better understanding of the popular scenarios when modeling with the 4PM will help lay the foundation for further research to explore other unexamined possibilities with 4PM estimation, such as modeling data with dimensionality contamination or more complex characteristics (Luecht & Ackerman, 2018).

Research Design

This simulation was designed as a mixed factorial study with two between-subjects factors and one within-subjects factor. The two between-subjects factors were sample size (with three levels: 1,000, 2,500, and 5,000 respondents) and latent distribution (with two conditions: normality and negative skewness). The only within-subjects factor was estimation method, with three types: MML, Gibbs, and HMC. This 3 × 2 × 3 mixed factorial design with two fully crossed factors and one repeated-measures factor resulted in 18 unique combinations of estimation features (18 cells). The treatment of estimation method as a within-subjects factor in this simulation produced more accurate results for the comparison of estimators, since all three estimation procedures were subject to identical random sampling variation in the data generation process.

Replications

As the role of replications in a simulation is similar to that of sample size in an empirical study, sufficient replications are necessary to ensure the validity and reliability of research findings (Bulut & Sünbül, 2017; Feinberg & Rubright, 2016; Harwell et al.,

1996). A larger number of replications brings more stable simulation results via the reduction of sampling variation effects, which is especially germane to this investigation's purpose of comparing the estimation properties of different algorithms. However, more replications come at the cost of increased time and computing resources to repeat the data modeling process. A minimum of 25 replications was recommended for IRT simulation studies (Harwell et al., 1996), but advances in computing power now allow the same statistical task to be performed far more efficiently than was possible two decades ago. In some Monte Carlo research into parameter recovery for the 4PM, replication counts ranging from 25 to 500 have been employed (Culpepper, 2016;

Feuerstahler & Waller, 2014; Sheng, 2015; Waller & Feuerstahler, 2017), whereas in other studies the authors neglected to report how many times model calibrations were repeated (Loken & Rulison, 2010). In light of the previous research into parameter recovery under the four-parameter IRF, the fairly large number of conditions tested in the current study, and the time-consuming, computer-resource-intensive nature of the

MCMC runs, 50 replications were set for each cell in this simulation. Regarding data generation, Waller and Feuerstahler's (2017) strategy was adopted: data were simulated for the estimation process until 50 successful calibrations in each cell were reached. This approach aimed to accommodate possible unidentified models and non-convergent model calibrations, based on past findings on convergence failure in

4PM parameter estimation (Waller & Feuerstahler, 2017). Specifically, in this study, only data sets which resulted in identified models and successful estimation convergence in

MML were fed to the Gibbs and HMC procedures. It should be noted that both technical and practical convergence in MML were required. Pilot studies indicated that MML frequently failed to converge practically (i.e., to produce sound IRT estimates) despite technical convergence, especially for small samples. Thus, additional criteria to ensure reasonable parameter estimates (a ≤ 3, -6 ≤ b ≤ 6, c < d) were imposed on selected data.

Only data sets that yielded successful model identification and estimation convergence (both technical and practical) in mirt (Chalmers, 2012) were eligible for calibration with the two MCMC methods.

Analytical Outcomes

Two common measures of the difference between a true parameter and its estimated value were used as outcome variables: bias, which represents the systematic discrepancy between the true and estimated parameter, and root mean square error

(RMSE), which denotes total estimation error. Average estimation bias and RMSE were calculated using the following equations:


For item parameters:

\[
\text{bias} = \frac{1}{RL}\sum_{r=1}^{R}\sum_{i=1}^{L}\left(\hat{x}_{ri} - x_{T_i}\right), \qquad
\text{RMSE} = \sqrt{\frac{1}{RL}\sum_{r=1}^{R}\sum_{i=1}^{L}\left(\hat{x}_{ri} - x_{T_i}\right)^{2}}
\tag{11}
\]

For person parameters:

\[
\text{bias} = \frac{1}{RN}\sum_{r=1}^{R}\sum_{j=1}^{N}\left(\hat{x}_{rj} - x_{T_j}\right), \qquad
\text{RMSE} = \sqrt{\frac{1}{RN}\sum_{r=1}^{R}\sum_{j=1}^{N}\left(\hat{x}_{rj} - x_{T_j}\right)^{2}}
\]

in which x is the parameter of interest (either person or item parameter), x_T is its true value, x̂ is its estimated value, R is the number of replications, L is scale length (i.e., 20 items), and N is the number of simulated examinees in each data set. The average bias will approach zero for an unbiased estimator because both positive and negative deviations from the true parameter values occur and cancel each other out. The magnitude of RMSE indicates how spread out estimates are around the true parameter values. To compare the total estimation errors produced by different methods, relative efficiency, or the ratio between RMSE by MCMC and RMSE by MML, was calculated and converted to a percentage (Mooney, 1997). In each cell, relative efficiency smaller than 100 indicates that MCMC estimates parameters with higher accuracy, and relative efficiency larger than 100 means MML offers more accurate estimates.
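In R, these outcome measures reduce to a few lines. The sketch below is hypothetical: true_b is an R × L matrix of generating values for one item parameter (here b), and est_b_mml and est_b_mcmc are the corresponding matrices of estimates collected across replications.

# Average bias, RMSE, and relative efficiency for one item parameter
bias_mml  <- mean(est_b_mml - true_b)              # signed deviations cancel
rmse_mml  <- sqrt(mean((est_b_mml - true_b)^2))    # total estimation error
rmse_mcmc <- sqrt(mean((est_b_mcmc - true_b)^2))
rel_eff   <- 100 * rmse_mcmc / rmse_mml            # < 100 favors MCMC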

Moreover, to capture the stability of estimates by the three approaches, the proportion of true parameter coverage by the 95% confidence interval (CI) in MML and the 95% posterior highest density interval (HDI) in Gibbs and HMC was computed. For the MCMC methods, the HDI represents the range of values with the highest probability density (i.e., values outside this range are less credible candidates for the parameter value). The HDI and the equal-tailed credible interval are identical if the posterior distribution is unimodal and symmetrical, but they differ if the posterior lacks either or both characteristics. HDIs were used in lieu of equal-tailed credible intervals to account for possibly asymmetrical and multimodal posterior distributions. In order to understand how well the three estimators preserved the rankings of simulated examinees, Spearman correlations between true and estimated θs were also reported.

RMSE, bias, the correlation between true and estimated parameter values, and coverage proportion were also gauged separately for extreme item and person parameter values. A better understanding of MML, Gibbs and HMC estimation performance at very high and very low levels of the latent trait might be of strong interest to many measurement professionals (Loken & Rulison, 2010; Reise & Waller, 2003; Waller & Feuerstahler,

2017; Waller & Reise, 2010). Another measure deserving attention was the estimation time incurred by each procedure, as an indicator of how efficient the three methods under investigation were relative to one another. This notion of efficiency, assessed by the amount of time required, should not be confused with the estimator relative efficiency ratio mentioned earlier.

Data Generation

The data generation procedure and modeling with the 4PM were informed by the dual intention to remain consistent with past 4PM research configurations and to address typical conditions in educational measurement practice. In the following section, details of data generation relevant to each research question are provided, with a particular focus on the motivation for the item characteristics, sample sizes, and latent distribution conditions selected for this simulation study.

Item Characteristics. Measurement scale length was constrained to 20 items in this study. Under the four-parameter IRT model, item properties are represented by four parameter types, namely discrimination, difficulty, and the lower and upper asymptotes. In this study, item parameters were generated under the assumption that the tests are generally well designed and few parameters fall beyond their reasonable range in reality.

While the discrimination parameter can theoretically assume any value from negative to positive infinity, items with negative values are discarded in measurement practice; under the logistic model, discrimination values between 1.35 and 1.69 are considered high, and values larger than 1.70 very high (Baker, 2001). Indeed, Masters (1988) cautioned against the facile interpretation of larger discrimination parameters as indicating better items, because high sensitivity to individual differences can actually be a form of item bias, and Hudson (1991) pointed out that items which distinguish examinees on a very narrow band, and consequently cause the item response curve to be flat in most other areas of the latent scale, are of little value. Therefore, in this study, item discrimination parameters were generated from the normal distribution N(1.2, 0.35²) truncated at a lower limit of zero, so that approximately 80% of randomly drawn values fell within the reasonable range [0.75, 1.65]. Item difficulty was randomly drawn from the standard normal distribution N(0, 1) to represent a typical scale which provides the most information about respondents in the middle range of the latent continuum (Sass et al.,

2008). For the pseudo-chance parameter, a normal distribution N(0.15, 0.04²) with a lower limit of zero was used to reflect representative values of this parameter under actual testing conditions, which tend to be lower than the probability of a correct answer resulting from purely random guessing on, say, a common multiple-choice question with four alternatives (Hambleton & Swaminathan, 1985). The upper asymptote parameter followed the truncated normal distribution N(0.88, 0.04²) with an upper limit of one, so that the generated d parameters were consistent with the value range frequently explored in 4PM estimation studies. The constraint 0 ≤ c < d ≤ 1 was placed on the generated asymptote parameters to further ensure item quality and appropriate representation of the 4PM item response function.
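A minimal R sketch of this generating scheme follows. It assumes the truncnorm package for the truncated normal draws and redraws any pair violating 0 ≤ c < d ≤ 1; the dissertation does not specify its exact truncation mechanism, so this is one plausible implementation.

library(truncnorm)
L <- 20  # scale length
a <- rtruncnorm(L, a = 0, mean = 1.20, sd = 0.35)      # discrimination, truncated below at 0
b <- rnorm(L, mean = 0, sd = 1)                        # difficulty
c_par <- rtruncnorm(L, a = 0, mean = 0.15, sd = 0.04)  # lower asymptote
d_par <- rtruncnorm(L, b = 1, mean = 0.88, sd = 0.04)  # upper asymptote, truncated above at 1
while (any(bad <- c_par >= d_par)) {                   # enforce 0 <= c < d <= 1
  c_par[bad] <- rtruncnorm(sum(bad), a = 0, mean = 0.15, sd = 0.04)
  d_par[bad] <- rtruncnorm(sum(bad), b = 1, mean = 0.88, sd = 0.04)
}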

Sample Size. Three calibration sample sizes (N = 1,000, 2,500, and 5,000 simulated examinees) were examined in this Monte Carlo study. The choice of calibration size was motivated by previous investigations into parameter recovery under the four-parameter IRF in educational and psychological measurement, which documented the sample size range necessary for accurate item and person parameter recovery (Culpepper,

2016; Feuerstahler & Waller, 2014; Loken & Rulison, 2010; Waller & Feuerstahler,

2017). More specifically, Loken and Rulison (2010) demonstrated satisfactory parameter recovery for the 4PM with Gibbs sampling using a sample of only 600 cases, while

Culpepper (2016) tried N = {2,500, 5,000, 10,000} and reported the need for large samples of 2,500 and 10,000 cases to obtain accurate item parameter estimates for educational and psychopathological assessment scenarios, respectively. A large sample of

10,000 cases was also found to bring reasonable item parameter recovery accuracy for data mimicking psychopathological measures with MML (Feuerstahler & Waller, 2014).

However, Waller and Feuerstahler (2017) reported that a sample size of 5,000 cases was sufficient for accurate item parameter recovery by BME with data echoing the characteristics of psychopathological measurement, while only 1,000 cases were necessary for accurate person parameter recovery using BME-EAP. This study purposefully incorporated the rather small sample of 1,000 cases because of its relevance in applied research, as samples of several thousand cases are not always attainable.

Samples of fewer than 1,000 cases were not considered due to the very low convergence rate (< 0.1%) with MML in pilot simulations.

Latent Trait Distribution. The person parameter was simulated from two types of distribution. First, θ was drawn from the standard normal density N(0, 1) under the typical assumption that the test-taker sample comes from a normally distributed population. Second, a negatively skewed θ distribution was constructed to represent latent trait scores when a relatively easy test is administered (Sass et al., 2008). In this study, the choice of the skewed θ distribution was based on Ho and Yu's (2015) survey of state-level educational measurement data. A beta distribution with two shape parameters

(8.2, 2.8) was employed to construct a negatively skewed distribution with a moderate skewness level of approximately -0.60. This skewness level was within the range of median skewness values for IRT scale scores across state-level tests reported by Ho and

Yu (2015). The excess kurtosis was 0.07, which indicated that the skewed θ distribution had a slightly heavier tail than a normal distribution (a value of zero indicates a mesokurtic distribution). This positive kurtosis index also fell within the range of median kurtosis values across IRT scale scores examined by Ho and Yu (2015). Random θ draws from beta(8.2, 2.8) were transformed to have a mean of zero and variance of one before being input to the item response matrix generation process.
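The transformation amounts to standardizing the beta draws by their population mean and standard deviation, as the following R sketch illustrates (the sample size 1,000 is one of the study's conditions):

N <- 1000
theta_raw <- rbeta(N, shape1 = 8.2, shape2 = 2.8)
mu <- 8.2 / (8.2 + 2.8)                                # population mean, ~ .745
v  <- (8.2 * 2.8) / ((8.2 + 2.8)^2 * (8.2 + 2.8 + 1))  # population variance, ~ .0158
theta <- (theta_raw - mu) / sqrt(v)                    # mean 0, variance 1
# The transformed scale is bounded at (0 - mu)/sqrt(v) ~ -5.93 and
# (1 - mu)/sqrt(v) ~ 2.02, the values reported in the next paragraph.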

The procedure used to construct a negatively skewed latent distribution in this study warrants further comment. In the psychometric literature, the traditional power method to generate random non-normal data (Fleishman, 1978; Vale & Maurelli, 1983) has been criticized for its lack of theoretical support for the type of distribution generated and for considerable bias in the actual level of kurtosis, even for samples as large as 1,000 cases

(Luo, 2011; Tadikamalla, 1980). The beta distribution served this purpose well for two reasons. First, with the smallest sample size of 1,000, pilot results showed that the shape of the beta distribution was well preserved, and sample skewness did not deviate substantially from the population skewness (a maximum skewness difference of 0.11 in absolute value). Second, the beta distribution is bounded by the interval [0, 1], which conveniently controls the range of θ draws. Theoretically, the latent ability parameter can extend to infinity, but the hypothesized θ scale resembles the z-score scale, making θ values outside the [-4, 4] interval very rare. Indeed, the default latent scale in the package mirt (Chalmers, 2012) covers the [-6, 6] range. When draws from beta(8.2, 2.8) are transformed to obtain a mean of zero and variance of one, the minimum and maximum possible values of the transformed distribution are -5.93 and 2.02, respectively. In a number of pilot runs (described below), none of the drawn θ values fell below -4.6, even when the largest sample size of 5,000 cases was tried. The bounded nature of the beta distribution, which offers plausible sampled values of θ, and its conservation of the desired level of skewness under random sampling made it a reasonable candidate for constructing the skewed latent distribution in this study.

Once person and item parameter values were generated from the distributions described above, they were used to simulate response data under the 4PM IRF with the simdata function in the mirt package (Chalmers, 2012). Six combinations of between-subjects factor levels, or six groups, were present, with 50 replications for each group, yielding a total of 300 simulated response data sets. To reduce sampling error and simultaneously ensure diverse response patterns, this study adopted Feinberg and Rubright's

(2016) strategy: in each group, only one θ sample was drawn and fixed for all replications, but item parameters differed across replications. A random seed was set for each group so that the generated person and item parameter values differed across groups.
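The generation call itself is brief, as sketched below. Note that mirt parameterizes items by intercepts rather than difficulties, so b is converted via d = -a·b; the seed value is a hypothetical placeholder, and the '4PL' itemtype label should be checked against the installed mirt version.

library(mirt)
set.seed(101)                      # hypothetical per-group seed
d_int <- -a * b                    # convert difficulty to mirt's intercept parameterization
resp <- simdata(a = matrix(a), d = matrix(d_int), N = length(theta),
                itemtype = '4PL', guess = c_par, upper = d_par,
                Theta = matrix(theta))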

Data Calibration

Subsequent to the data generation phase, item response matrices were calibrated with three different programs, one of which is an R package (R Core Team, 2019), while the remaining two are R interface packages that connect R with pre-existing Bayesian software. In the following section, a description of the R programming environment and the software packages utilized for calibration of the 4PM is provided.

The R Programming Environment. The entire simulation process in this study was conducted in the R computing environment. The past decade has witnessed the increasing popularity of statistical analysis in R across the social science disciplines, and several virtues of R explain its widespread embrace. One reason is that R offers an open-source, free tool for statistical programming and comes with an abundance of assistance and tutorial resources from the large R user community on the Internet, including respected statisticians and measurement experts. Moreover, R offers copious functions with a diverse range of utilities in its library and allows users to combine and extend these functions at will. Also, users of R can create their own functions and write specialized packages for their respective fields. Regarding psychometric modeling,

R is not only faster but also less demanding than commercial packages with respect to advanced programming skills (Bulut & Sünbül, 2017). As such, R users are permitted a great degree of flexibility to meet their diverse quantitative analytical needs at various levels of technical skill (Bulut & Sünbül, 2017; Carsey & Harden, 2014; Kabacoff,

2011; Matloff, 2011).

MML with the mirt Package. MML item calibration under the four-parameter

IRF was performed with the mirt package (Chalmers, 2012) in R. mirt has been used in multiple recent IRT simulation studies (see, e.g., Martin-Fernandez & Revuelta, 2017;

Svetina et al., 2017), including investigations into parameter recovery for the 4PM

(Feuerstahler & Waller, 2014; Waller & Feuerstahler, 2017). In light of research into

MML estimation, 41 quadrature points were deemed sufficient to ensure accurate θ recovery and little item parameter estimation bias (de Ayala, 2009; Kim & Lee, 2017;

Seong, 1990; van Rijn, 2014). In fact, 41 quadrature points are routinely used by the

NAEP in its large-scale assessment data calibrations (Sinharay & von Davier, 2005), and are popular in ETS research reports (e.g., Cao et al., 2014; Kim & Moses, 2016).

Subsequent to item parameter estimation with MML, the person scoring phase was executed with EAP. The prior N(0, 2²) for the θ density was used and recommended in several studies (Rulison & Loken, 2009; Waller & Feuerstahler, 2017). However, in this study, estimation results with the 2PM and 3PM are assumed to be accessible to measurement practitioners, and because θ estimates under less parameterized models produce little bias overall (Loken & Rulison, 2010), a less diffuse prior N(0, 1.2²) was employed for the person parameter. MML item parameter estimates are generally not sensitive to starting values (Nader et al., 2015), hence the default initial values in mirt were used in this investigation. A convergence threshold of .001 was set for all replications. It should be noted again that because convergence with MML is not always guaranteed for IRT models more highly parameterized than the 1PL/Rasch models

(Harwell et al., 1988), only calibration results from data sets that achieved both successful technical and practical convergence in mirt were eligible for comparison with the outcomes from the two MCMC methods.
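Under these choices, the calibration and scoring calls take roughly the following form; the way the θ prior is passed to fscores() here (via the mean and cov arguments) is illustrative of the mirt interface rather than the study's verbatim code.

library(mirt)
fit <- mirt(resp, model = 1, itemtype = '4PL',
            quadpts = 41,    # 41 quadrature points
            TOL = 1e-3)      # convergence threshold of .001
# EAP person scoring under the less diffuse N(0, 1.2^2) prior
theta_hat <- fscores(fit, method = 'EAP', mean = 0, cov = matrix(1.2^2))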

Gibbs with JAGS and HMC with Stan via R Interface Packages. Bayesian estimation of the 4PM with Gibbs sampling was performed with JAGS (Plummer, 2003) via the R interface package rjags version 4.8. JAGS stands for “just another Gibbs sampler” (Plummer, 2003, p. 2), and can be conveniently run from within R when both

JAGS and rjags are installed. Similarly, for ease of programming and output monitoring,

Bayesian computations with HMC were conducted using Stan (Carpenter et al., 2017; Stan

Development Team, 2019) via its R interface package rstan version 2.18.2. The built-in algorithm in Stan is No-U-Turn sampler (NUTS; Hoffman & Gelman, 2014), an adaptive form of HMC sampler which exempts users from the formidable task of hand-tuning the number of leapfrog steps and step size, thereby simplifying the use of HMC and making this MCMC method more accessible to applied researchers. In addition to the benefits of increased HMC efficiency brought about by NUTS (Hoffman & Gelman, 2014), sample codes for several IRT model types in Stan have been introduced to facilitate the learning process for IRT practitioners (Ames & Au, 2018; Luo & Jiao, 2018). 117
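In practice, running NUTS through rstan reduces to a single call once a Stan program is available; the file name irt_4pm.stan and the data list below are hypothetical placeholders, and the chain settings anticipate the configurations described under Markov Chain Configurations below.

library(rstan)
stan_data <- list(N = nrow(resp), L = ncol(resp), Y = resp)  # hypothetical data list
fit_hmc <- stan(file = "irt_4pm.stan",  # Stan program encoding the 4PM (not shown)
                data = stan_data,
                chains = 4, iter = 50000, warmup = 20000,
                seed = 123)             # re-seeding alters the random initial values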

Bayesian Priors for the 4PM Parameters. The defining feature of fully

Bayesian computations is the assignment of priors to all estimated parameters. The impact of different prior densities on estimation accuracy was not the focus of this investigation, hence the purposeful effort to stay consistent with past investigations into 4PM estimation within the Bayesian framework. Choices of priors for item and person parameters were also motivated by the assumption that researchers would have copious information about possible values of the latent trait, item difficulty, discrimination, and lower asymptote parameters based on previous modeling results with the less parameterized 2PM and 3PM. In this study, the b parameter was assumed to come from a normal distribution N(0, 1.3²), which is only slightly more diffuse than the prior θ distribution N(0, 1.2²). Because 4PM calibrations can reveal substantially higher values of item discrimination (Culpepper, 2016; Loken & Rulison, 2010; Waller & Reise, 2010), less information about the a parameter should be assumed. Therefore, the log-normal distribution with a mean of zero and standard deviation of 0.2, lognormal(0, 0.2), was imposed as the prior on the item slope. Note that in previous studies with Gibbs sampling, Loken and Rulison (2010) and Waller and Reise (2010) employed smaller variances for the item slope priors, lognormal(0, 0.125) and lognormal(0, 1/11), respectively

(i.e., more informative priors for the a parameter). Regarding the lower asymptote parameter, this study adopted Waller and Reise's (2010) selection of beta(2, 10) as the prior distribution to convey a strong judgment about the probable range of lower asymptote values by the measurement practitioner. With beta(2, 10), the mean and mode of the prior for the c parameter are .167 and .10, respectively. It is reasonable to assume the d parameter to be functionally similar to the c parameter (Loken & Rulison, 2010), hence the adoption of beta(10, 2) as the prior for the upper asymptote. Note also that Loken and

Rulison (2010) tried U(0.7, 1) as a prior for the d parameter and reported little difference in overall model estimation compared to the use of the more informative prior beta(17, 5).
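A JAGS rendering of these choices is sketched below; this is a plausible reconstruction rather than the study's verbatim code, and it omits any device for the c < d ordering. Recall that JAGS parameterizes dnorm and dlnorm by precision, so the standard deviations above enter as pow(sd, -2).

model_string <- "
model {
  for (i in 1:N) {
    for (j in 1:L) {
      # Four-parameter logistic item response function
      p[i, j] <- c[j] + (d[j] - c[j]) * ilogit(a[j] * (theta[i] - b[j]))
      Y[i, j] ~ dbern(p[i, j])
    }
    theta[i] ~ dnorm(0, pow(1.2, -2))  # person prior N(0, 1.2^2)
  }
  for (j in 1:L) {
    a[j] ~ dlnorm(0, pow(0.2, -2))     # lognormal(0, 0.2) slope prior
    b[j] ~ dnorm(0, pow(1.3, -2))      # N(0, 1.3^2) difficulty prior
    c[j] ~ dbeta(2, 10)                # lower asymptote prior
    d[j] ~ dbeta(10, 2)                # upper asymptote prior
  }
}"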

Markov Chain Configurations. One predicament facing researchers employing

MCMC simulations is that the number of iterations and the burn-in segment (i.e., the initial part of the Markov chain to be discarded) not only depend on specific model conditions but also vary across data sets, yet tailoring the chain length and burn-in portion to each model calibration in a simulation study proves impractical. A common chain length and burn-in are therefore desirable and remain the popular practice in MCMC simulation research. This study adopted the common strategies of IRT-based simulation studies (e.g., Culpepper,

2016; Wollack et al., 2012) and considered results from preliminary MCMC runs to inform the chain length and burn-in configurations. Figure 6 shows the potential scale reduction factor (PSRF) as a function of the number of iterations in the Markov chain for sample model parameters calibrated with Gibbs, and the corresponding relationships for HMC are presented in Figure 7. Both pilots were conducted with 1,000 cases and a normal θ density. It was evident that after the initial 15,000 to 20,000 iterations, the PSRF (R̂) fell below 1.1 for all item and person parameters with both Gibbs and HMC. Fluctuations of R̂ after 20,000 iterations were present but very small, and R̂ generally appeared to approach 1.0. The multivariate PSRF (MPSRF) for all item parameters was below 1.1 when MCMC chains were terminated at 50,000 iterations with 20,000 burn-in iterations in both rjags and rstan. Similar results for PSRF and

MPSRF were observed for other combinations of sample size and latent trait distribution.
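Under the configuration just described (four chains, 20,000 burn-in iterations, 50,000 total iterations), the rjags workflow reduces to a few calls, sketched below; jags_data is a hypothetical list holding Y, N, and L.

library(rjags)  # loads coda, which provides gelman.diag()
jm <- jags.model(textConnection(model_string), data = jags_data, n.chains = 4)
update(jm, n.iter = 20000)                       # burn-in: discard the first 20,000 draws
samp <- coda.samples(jm, variable.names = c("a", "b", "c", "d", "theta"),
                     n.iter = 30000)             # 50,000 iterations in total
gelman.diag(samp, multivariate = TRUE)           # univariate PSRF and MPSRF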

Figure 6

Sample PSRF as a Function of Chain Length and Burn-in with Gibbs

Note. From left to right, top to bottom: item discrimination, item difficulty, item lower asymptote, item upper asymptote, and person ability parameter.

Figure 7

Sample PSRF as a Function of Chain Length and Burn-in with HMC

Note. From left to right, top to bottom: item discrimination, item difficulty, item lower asymptote, item upper asymptote, and person ability parameter.

Trace plots, which show the sampled parameter values as a function of steps in the

Markov chain, are displayed for models calibrated with Gibbs in Figure 8 and with HMC in Figure 9. There appeared to be little variation in the θ draws (i.e., small posterior variance), while larger fluctuations were observed for the item parameter samples (i.e., larger posterior variances). However, despite the differences in posterior variance, the four

Markov chains seemed to stabilize and mix well in the same region of the parameter space, and no chain exhibited an obvious systematic upward or downward drift after the first 15,000 to 20,000 iterations for any parameter. Numerical checks of the shrink factors, both univariate and multivariate, and visual examination of the sampling trajectories indicated that chains of 50,000 iterations with a burn-in segment of 20,000 steps were sufficient for both MCMC methods to estimate person and item parameters under the four-parameter IRF.


Figure 8

Sample Trace Plots of Item and Person Parameters with Gibbs

Note. From left to right, top to bottom: item discrimination, item difficulty, item lower asymptote, item upper asymptote, and person ability parameter.

Figure 9

Sample Trace Plots of Item and Person Parameters with HMC

Note. From left to right, top to bottom: item discrimination, item difficulty, item lower asymptote, item upper asymptote, and person ability parameter.

In simulation research, as in empirical studies employing MCMC estimation, the thinning technique (i.e., keeping only every kth sampled parameter value) can be used to reduce autocorrelation between MCMC draws. However, this practice has invited criticism because it is often unnecessary and because unthinned Markov chains produce more precise results (Link & Eaton, 2012). In the IRT literature, the thinning technique is rarely, if ever, used. To remain consistent with the literature and minimize unintended effects on estimation accuracy, no chain thinning was performed in this study, and all results for Gibbs and HMC were based on summary statistics of posterior distributions approximated by four unthinned Markov chains.
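Had thinning been desired, it would have amounted to a single post-processing call on the coda samples from the earlier sketch; the factor of 10 below is purely illustrative.

samp_thinned <- window(samp, thin = 10)  # keep every 10th draw; not applied in this study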

Verification of Algorithm

Verification, or ensuring that the algorithms in the statistical software employed are bug-free and execute the models in the intended manner, is a challenging but necessary task for evidencing model credibility. Bratley et al. (1987) recommended five methods to verify a statistical program for simulation: (1) manual verification of logic, or comparison of a model output after a short run with manual calculations; (2) modular testing, or checking whether each part of the code, or subroutine, functions well for all possible inputs; (3) checking against known solutions, or comparing model output with known solutions for a particular input; (4) sensitivity testing, or changing one parameter while holding all others constant to examine whether the model demonstrates sensible behavior; and (5) stress testing, or trying implausible inputs (i.e., illogical parameter values) and checking whether the model offers warning/error messages accordingly. In the following section, descriptions of the procedures used to verify the simulation code are provided.

Manual Verification

Manual verification of the MCMC and MML algorithms was not practical because the multivariate nature of the 4PM entails complex and laborious computations that are typically not performed by hand. Therefore, only the operation of the R package SimDesign (Sigal

& Chalmers, 2016) was manually verified. This package was used to compute bias and RMSE once the simulation runs had been completed. Using results from three pilot runs with 1,000 cases and a normal θ distribution, it was verified that the average bias values computed by SimDesign were identical to hand-calculated results, but that RMSE was calculated with N rather than N - 1 in the denominator. However, across the three pilot simulations, the difference between the N and N - 1 versions was at most 0.15 for all parameter types. With 50 replications, the number of estimates was 1,000 for each type of item parameter and at least 50,000 for person parameters in each cell, so this N versus N - 1 difference would presumably become negligible.
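The check itself is straightforward to express; the sketch below compares SimDesign's bias() and RMSE() with hand calculations for a hypothetical vector of estimates of a single parameter.

library(SimDesign)
est  <- c(1.31, 1.18, 1.25)  # hypothetical estimates of one parameter
true <- 1.20                 # its generating value
bias(est, parameter = true)                     # equals mean(est - true)
RMSE(est, parameter = true)                     # equals sqrt(mean((est - true)^2)), i.e., N
sqrt(sum((est - true)^2) / (length(est) - 1))   # the N - 1 variant computed by hand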

Modular Testing

For this method, two pilot simulations were run to ensure that simulated item and person parameters displayed the characteristics of their true underlying distributions, even with the rather short test of 20 items and the smallest sample of 1,000 examinees. In the first pilot run, model parameters were simulated for the 4PM with 1,000 cases and the normal θ distribution N(0, 1). In the second pilot simulation, the sample size and number of items were identical to those in the first pilot run, but person parameters were drawn from the negatively skewed distribution beta(8.2, 2.8) and transformed to have a mean of zero and variance of one. To align with the actual procedure in this study, the person parameter sample was drawn only once for each pilot run, but there were 50 replications for item parameter generation. A random seed was set for each pilot run. The histograms below show the distributional shapes of θ when sampled from a normal and a negatively skewed distribution (Figure 10).

Figure 10

Normal (Left) and Negatively Skewed (Right) Latent Distributions with 1,000 Cases

Descriptive statistics of the item and person parameters are presented in Table 1 below. Results indicated that with a sample of size 1,000, the distributions of θ draws corresponded to their parent distributions. In both pilot runs, the deviation of sample skewness from population skewness was at most 0.11. The mean and standard deviation of all parameter types were very close to the corresponding measures of their generating distributions, especially for the upper and lower asymptote parameters. Overall, sample deviations from the population distributions were random (i.e., did not form a distinct pattern) and not large enough to be meaningful.

Table 1

Sample Descriptive Statistics of Item and Person Parameter Distributions with N = 1,000

                         Population                          Sample
Pilot     Parameter      M (SD)        Skewness  Kurtosis   M (SD)        Skewness  Kurtosis
Pilot 1   a              1.2 (0.35)      0.00      0.00     1.20 (0.35)    -0.07     -0.08
          b              0.00 (1.00)     0.00      0.00     -0.05 (1.00)    0.04      0.12
          c              .15 (.04)       0.00      0.00     .15 (.04)      -0.03     -0.02
          d              .88 (.04)       0.00      0.00     .88 (.04)       0.02     -0.12
          θ              0.00 (1.00)     0.00      0.00     -0.03 (1.04)    0.10      0.00
Pilot 2   a              1.2 (0.35)      0.00      0.00     1.21 (0.35)     0.11     -0.01
          b              0.00 (1.00)     0.00      0.00     -0.01 (1.00)    0.11     -0.13
          c              .15 (.04)       0.00      0.00     .15 (.04)      -0.01      0.07
          d              .88 (.04)       0.00      0.00     .88 (.04)      -0.01      0.11
          θ              0.00 (1.00)    -0.60      0.00     0.00 (1.00)    -0.66      0.08

Note. M = Mean; SD = Standard Deviation.

Checking against Known Solutions

In this part, two pilot simulations were compared. Both pilot simulations had five replications with a sample of 1,000 examinees, but latent trait scores were drawn from

N(0, 1) in the first pilot simulation and from beta(8.2, 2.8) after standardization in the second. Response data were then generated for both assessment conditions and calibrated with the three methods under investigation. Bias and RMSE are displayed in Table 2. In general, both MML and MCMC had their estimation errors for all parameters increase as the assumption of normal θ was violated, and this deleterious effect of the negatively skewed θ was most obvious for item difficulty estimation bias. This outcome disparity suggested that the R code executed the model calibrations in the expected manner.


Table 2

Comparison of Bias and RMSE under Normal and Negatively Skewed θ with N = 1,000

Pilot     Estimator   Outcome      a       b       c       d       θ
Pilot 1   MML         Bias       0.22   -0.03   -0.02    0.00    0.01
                      RMSE       0.67    0.53    0.13    0.11    0.57
          Gibbs       Bias      -0.20   -0.09    0.00   -0.02    0.00
                      RMSE       0.36    0.30    0.05    0.06    0.57
          HMC         Bias      -0.20   -0.09    0.00   -0.02    0.00
                      RMSE       0.36    0.30    0.05    0.06    0.57
Pilot 2   MML         Bias       0.26   -0.18   -0.04   -0.01    0.01
                      RMSE       0.72    0.54    0.12    0.12    0.59
          Gibbs       Bias      -0.18   -0.14    0.00   -0.03    0.00
                      RMSE       0.35    0.30    0.06    0.07    0.59
          HMC         Bias      -0.18   -0.14    0.00   -0.03    0.00
                      RMSE       0.35    0.30    0.06    0.07    0.59

Note. In pilot 1, simulated data with 1,000 test-takers under a normal latent trait were calibrated; pilot 2 involved 1,000 examinees under a negatively skewed latent trait.

Sensitivity Testing

In sensitivity testing, the upper asymptote parameter was fixed at one while the other parameters were simulated from the same distributions described above. Item and person parameters were estimated from a simulated data set with 1,000 respondents. For MML,

16 out of 20 upper asymptote parameters were estimated to be .99, and the mean estimate for this item parameter was .97, although in two cases the d parameters were estimated to be .81 and .83. The correlation between the true and estimated d parameters could not be calculated because the generated parameters were constant. HMC produced highly accurate estimates of the d parameter: all upper asymptote estimates were within the range

[.99, 1], and an error warning was given for the correlation between the true and estimated d parameters because the standard deviation of the true d parameters was zero. Finally,

Gibbs sampling seemed to be the most biased and inconsistent estimator in this pilot: the mean of the estimated d parameters was .91, with only seven estimates above .95 and six estimates below .90, including one value below .80. Evidently, when the d parameter equaled one, the three estimators were differentially sensitive to this special feature of the model.

Stress Testing

In stress testing, a random data set generated with 1,000 cases under a normal latent trait, which was known to have caused estimation failure with MML, was fed to Gibbs and

HMC to examine how these MCMC methods would behave when dealing with problematic data. When this data set was estimated with MML, mirt issued a warning about the lack of model identification and did not provide standard errors of the estimates due to its inability to invert the information matrix. Similarly, when calibration was performed on this data set with Gibbs and HMC, samples from the MCMC posteriors suggested that both MCMC methods failed to reach convergence for many parameters. Figures 11 and 12 show, as an example, the trace plot and the PSRF-versus-iterations plot for an item discrimination parameter in Gibbs and HMC, respectively.


Figure 11

Trace Plot (Left) and Plot of R̂ against Iterations (Right) of an Item Discrimination

Parameter in an Unidentified 4PM Estimated with Gibbs Sampling

Figure 12

Trace Plot (Left) and Plot of R̂ against Iterations (Right) of an Item Discrimination

Parameter in an Unidentified 4PM Estimated with HMC

Trace plots showed that, for many parameters, the four chains in neither MCMC procedure had stabilized and mixed well, and the PSRFs fluctuated but fell below 1.1 for only a few parameters. In the example shown in Figures 11 and 12, one chain in

Gibbs sampling appeared to stray away from the cluster of three other chains, while for

HMC, two chains seemed to be on a different trajectory than the other two. For both MCMC methods, the chains still explored the parameter space in their own directions even after 25,000 iterations, and R̂ generally decreased with more iterations but remained much larger than the conventional threshold of 1.1. When dealing with this same data set, which appeared inappropriate for modeling with the 4PM, the consistent behavior of the three estimation methods when parameters could not be uniquely estimated suggested that the

R code was set up correctly.

Chapter Summary

Chapter 3 delineated the methodology adopted in this dissertation. Specifically, it explained why Monte Carlo simulation was the best-fitting methodology given the purpose of this study. Details of the data generation and calibration process were then provided, with a particular focus on the choices of quadrature points and person scoring method in MML, and prior densities and convergence diagnostics in Gibbs and HMC, followed by a description of the R computing environment and the software packages used to accommodate MML, Gibbs and HMC estimation of the person and item parameters under the four-parameter IRF. The chapter ended with the results of five procedures (manual verification of logic, modular testing, checking against known solutions, sensitivity testing, and stress testing) used to verify that the R code was written correctly and executed the 4PM calibrations in the expected manner.


Chapter 4: Results

Model Calibration and Convergence

Data filtering and model estimation with MML were performed on five office computers running the Windows 10 operating system. The vast majority of MCMC model calibrations for the two smaller sample size levels (N = 1,000 and 2,500) were executed directly in R on the same computers, while several calibrations with N = 2,500 cases and all models with the largest sample size of 5,000 cases were conducted on the Ohio Linux-based supercomputer system via batch mode. In order to address MCMC convergence issues in particular data sets, a "divide and conquer" strategy was employed: instead of running all 50 replications sequentially with one set of commands, the R code was divided into smaller groups of 2-3 replications so that timely adjustments could be made to the Markov chain executions. This breaking down of the R code also allowed more efficient use of computing resources, especially on the supercomputer system of many separate computers working in parallel. All model estimations with Gibbs sampling performed well and no modifications were necessary, whereas several Markov chains in HMC did not progress well with the initially generated random starting values and therefore required a change of the random seed to produce more appropriate starting values.

As aforementioned, simulated data had to satisfy both technical and additional practical convergence criteria in MML with mirt to be selected for calibration. In general, the practical convergence rate was poor for MML, and examination of models which converged only technically indicated that a tendency to inflate the item discrimination parameter estimates to implausible levels (e.g., a > 10) was largely responsible for this low practical convergence rate. Table 3 below displays the percentage of simulated data sets that were calibrated by MML without unreasonable parameter values, given successful technical convergence. An obvious trend in Table 3 is that plausible estimation by MML improved with increasing sample size and deteriorated with violation of the θ distribution normality assumption. The same pattern was documented for technical convergence and practical convergence separately (i.e., both technical and practical convergence worsened with deviations from the normal θ distribution and improved with increases in N).

Table 3

Percentage of Practical Convergence Given Technical Convergence in MML

                        Sample size
θ distribution          1,000      2,500      5,000
Normal                  0.44%     11.09%     72.52%
Negatively skewed       0.19%      3.67%     38.93%

Subsequent to data filtering with MML, selected item response matrices were subjected to calibrations with Gibbs and HMC. For both MCMC methods, all PSRF and

MPSRF estimates were far below the conservative cut point of 1.05, which suggested

Markov chains reached a stationary state and mixed thoroughly. For instance, for N =

1,000 cases and normal θ distribution, the maximum PSRF for item parameters was

1.000307 and for θ estimates was 1.000121, while the largest MPSRF estimate for the joint distribution of item parameters across the 50 replications was 1.006914. Even in the scenario deemed most challenging for estimation, with the smallest sample size (N = 1,000) and negatively skewed θ, no PSRF exceeded 1.000308 for any model parameter, and all MPSRFs were below 1.006456.

Data Analytic Results

Mean estimation bias, the correlation between generated and estimated parameters, the 95% CI/HDI coverage of "true" parameter values, RMSE, and standard error (SE) for each combination of sample size and latent trait distribution are displayed in Tables 4–9 below. For MML, SE was the standard error of the item parameter estimates, whereas in MML-EAP estimation of the latent trait and in MCMC, SE represents the posterior standard deviation.


Table 4

Comparison of MML, Gibbs and HMC under Normal θ and N = 1,000

Estimator   Parameter      Bias     RMSE   Correlation   Coverage       SE
MML         a            0.2047   0.6323        .2330      .9510    1.0452
            b           -0.0573   0.5833        .8232      .8680    0.8023
            c           -0.0219   0.1204        .1740      .7810    0.1925
            d           -0.0022   0.1179        .1849      .7930    0.1801
            θ            0.0060   0.5784        .8232      .9425    0.5690
Gibbs       a           -0.1886   0.3389        .6952      .7000    0.1769
            b           -0.0757   0.2985        .9543      .9860    0.3888
            c            0.0007   0.0537        .3468      .9690    0.0646
            d           -0.0235   0.0642        .3433      .9510    0.0610
            θ            0.0025   0.5782        .8319      .9770    0.6746
HMC         a           -0.1887   0.3389        .6955      .7010    0.1769
            b           -0.0756   0.2983        .9544      .9880    0.3887
            c            0.0007   0.0537        .3465      .9700    0.0646
            d           -0.0235   0.0642        .3430      .9490    0.0610
            θ            0.0024   0.5782        .8319      .9771    0.6746

Note: Bias: average bias; RMSE: root mean square error; Correlation: average correlation between the generated and estimated parameter values; Coverage: average proportion coverage of the true parameter value by the 95% Confidence Interval (MML) and 95% Highest Density Interval (MCMC); SE: average standard error/posterior standard deviation.


Table 5

Comparison of MML, Gibbs and HMC under Normal θ and N = 2,500

Estimator   Parameter      Bias     RMSE   Correlation   Coverage       SE
MML         a            0.1971   0.5671        .4036      .9650    0.8610
            b            0.0110   0.4244        .8966      .9160    0.6501
            c           -0.0004   0.1013        .3072      .8810    0.1768
            d           -0.0006   0.1003        .2395      .8410    0.1583
            θ            0.0293   0.5770        .8276      .9521    0.5844
Gibbs       a           -0.1846   0.3150        .7674      .6770    0.1663
            b           -0.0116   0.2764        .9623      .9790    0.3134
            c            0.0034   0.0515        .4632      .9590    0.0558
            d           -0.0165   0.0589        .3782      .9430    0.0527
            θ            0.0323   0.5834        .8318      .9752    0.6672
HMC         a           -0.1846   0.3151        .7672      .6790    0.1663
            b           -0.0112   0.2765        .9623      .9820    0.3137
            c            0.0035   0.0515        .4626      .9580    0.0558
            d           -0.0164   0.0588        .3785      .9460    0.0528
            θ            0.0324   0.5834        .8318      .9752    0.6672

Note: Bias: average bias; RMSE: root mean square error; Correlation: average correlation between the generated and estimated parameter values; Coverage: average proportion coverage of the true parameter value by the 95% Confidence Interval (MML) and 95% Highest Density Interval (MCMC); SE: average standard error/posterior standard deviation.


Table 6

Comparison of MML, Gibbs and HMC under Normal θ and N = 5,000

Estimator   Parameter      Bias     RMSE   Correlation   Coverage       SE
MML         a            0.1282   0.4707        .4765      .9690    0.6000
            b           -0.0026   0.3597        .9288      .9060    0.5776
            c           -0.0033   0.0927        .3265      .8950    0.1368
            d            0.0012   0.0871        .2849      .8930    0.1496
            θ           -0.0192   0.5820        .8233      .9566    0.5974
Gibbs       a           -0.1837   0.2998        .7904      .6640    0.1554
            b           -0.0228   0.2704        .9678      .9340    0.2662
            c            0.0047   0.0545        .4115      .9370    0.0507
            d           -0.0116   0.0541        .4195      .9290    0.0470
            θ           -0.0185   0.5903        .8259      .9751    0.6729
HMC         a           -0.1836   0.2999        .7902      .6630    0.1554
            b           -0.0225   0.2691        .9681      .9310    0.2664
            c            0.0048   0.0544        .4120      .9390    0.0507
            d           -0.0116   0.0540        .4200      .9300    0.0470
            θ           -0.0185   0.5903        .8259      .9751    0.6730

Note: Bias: average bias; RMSE: root mean square error; Correlation: average correlation between the generated and estimated parameter values; Coverage: average proportion coverage of the true parameter value by the 95% Confidence Interval (MML) and 95% Highest Density Interval (MCMC); SE: average standard error/posterior standard deviation.


Table 7

Comparison of MML, Gibbs and HMC under Negatively Skewed θ and N = 1,000

Estimator   Parameter      Bias     RMSE   Correlation   Coverage       SE
MML         a            0.2503   0.6733        .2560      .9640    1.2684
            b           -0.2486   0.6288        .8109      .7920    0.8269
            c           -0.0502   0.1204        .2026      .7510    0.1780
            d           -0.0346   0.1358        .2050      .8030    0.2133
            θ            0.0175   0.5896        .8173      .9416    0.5796
Gibbs       a           -0.1783   0.3289        .6178      .7170    0.1777
            b           -0.1827   0.3421        .9524      .9740    0.3953
            c           -0.0105   0.0527        .3620      .9440    0.0625
            d           -0.0381   0.0728        .3337      .9470    0.0641
            θ            0.0034   0.5872        .8271      .9750    0.6788
HMC         a           -0.1782   0.3289        .6185      .7210    0.1777
            b           -0.1828   0.3423        .9523      .9720    0.3955
            c           -0.0105   0.0527        .3625      .9410    0.0625
            d           -0.0382   0.0729        .3329      .9470    0.0641
            θ            0.0033   0.5872        .8271      .9752    0.6789

Note: Bias: average bias; RMSE: root mean square error; Correlation: average correlation between the generated and estimated parameter values; Coverage: average proportion coverage of the true parameter value by the 95% Confidence Interval (MML) and 95% Highest Density Interval (MCMC); SE: average standard error/posterior standard deviation.


Table 8

Comparison of MML, Gibbs and HMC under Negatively Skewed θ and N = 2,500

Estimator   Parameter      Bias     RMSE   Correlation   Coverage       SE
MML         a            0.3172   0.6689        .3943      .9650    0.9167
            b           -0.2358   0.5614        .8350      .8020    0.5882
            c           -0.0250   0.1046        .3154      .8380    0.1417
            d           -0.0561   0.1267        .2922      .8060    0.1566
            θ            0.0229   0.5708        .8289      .9517    0.5999
Gibbs       a           -0.1688   0.3157        .6910      .7200    0.1695
            b           -0.2254   0.3747        .9502      .8860    0.3189
            c           -0.0173   0.0546        .4096      .9090    0.0535
            d           -0.0420   0.0775        .3693      .9060    0.0552
            θ            0.0020   0.5759        .8338      .9751    0.6734
HMC         a           -0.1687   0.3156        .6914      .7170    0.1696
            b           -0.2251   0.3746        .9500      .8780    0.3189
            c           -0.0172   0.0547        .4081      .9080    0.0535
            d           -0.0420   0.0775        .3692      .9070    0.0552
            θ            0.0019   0.5759        .8338      .9751    0.6734

Note: Bias: average bias; RMSE: root mean square error; Correlation: average correlation between the generated and estimated parameter values; Coverage: average proportion coverage of the true parameter value by the 95% Confidence Interval (MML) and 95% Highest Density Interval (MCMC); SE: average standard error/posterior standard deviation.


Table 9

Comparison of MML, Gibbs and HMC under Negatively Skewed θ and N = 5,000

Estimator   Parameter      Bias     RMSE   Correlation   Coverage       SE
MML         a            0.2844   0.5959        .4234      .9770    0.7056
            b           -0.2558   0.4954        .8909      .7720    0.4687
            c           -0.0282   0.0871        .3156      .9170    0.1391
            d           -0.0648   0.1190        .3201      .7610    0.1193
            θ            0.0255   0.5758        .8252      .9552    0.6162
Gibbs       a           -0.1516   0.2918        .6632      .7330    0.1623
            b           -0.2756   0.3963        .9560      .7760    0.2679
            c           -0.0225   0.0496        .4760      .9110    0.0492
            d           -0.0518   0.0843        .3783      .8320    0.0480
            θ            0.0013   0.5832        .8281      .9738    0.6816
HMC         a           -0.1516   0.2918        .6627      .7340    0.1623
            b           -0.2756   0.3964        .9559      .7720    0.2680
            c           -0.0225   0.0497        .4747      .9080    0.0492
            d           -0.0518   0.0844        .3764      .8310    0.0480
            θ            0.0014   0.5832        .8281      .9737    0.6815

Note: Bias: average bias; RMSE: root mean square error; Correlation: average correlation between the generated and estimated parameter values; Coverage: average proportion coverage of the true parameter value by the 95% Confidence Interval (MML) and 95% Highest Density Interval (MCMC); SE: average standard error/posterior standard deviation.

Some overall patterns emerged from the results in Tables 4–9. First, in general, no method was consistently superior to the others as judged by all the evaluative outcomes above, although with the levels of sample size and latent distribution types under examination, the MCMC methods more frequently held an advantage over MML in the form of smaller total estimation errors, while MML proved to be an unbiased estimator as sample size increased and the θ normality assumption held. Second, changes in estimation performance were more sharply visible for MML than for Gibbs and HMC. When person parameters were normally distributed, MML continuously improved its performance as sample size increased, whereas the MCMC methods exhibited signs of diminishing returns with the jump from N = 2,500 to N = 5,000 cases.

MML displayed a sharp decline in parameter recovery accuracy when the latent trait was negatively skewed, but variations in MCMC performance were generally less dramatic.

Third, neither MML nor MCMC estimates of the two asymptote parameters correlated strongly with their true values (all correlation coefficients < .50), regardless of latent trait and sample size conditions. Finally, it was obvious from the result tables that

Gibbs and HMC provided almost identical quality of parameter recovery as evaluated by all outcome measures. Across all measurement conditions, no systematic discrepancy between Gibbs and HMC was detected, and the two MCMC methods differed by practically trivial amounts, likely due to random fluctuations in the Markov chain sampling process.

Relative efficiency (i.e., the ratio of RMSEs, expressed as a percentage) was calculated to compare the total estimation errors produced by the different estimation approaches. Because Gibbs and HMC were almost identical on every measure of performance (Tables 4–9), the RMSE produced by Gibbs sampling was used as representative of both MCMC methods. The relative efficiency ratios between MCMC and MML across conditions are highlighted in Table 10. Values smaller than 100% indicate that the RMSE by Gibbs/HMC was smaller than that by MML, whereas relative efficiency values larger than 100% point to an advantage for MML.

Table 10

Relative Efficiency of MCMC versus MML across Measurement Conditions

θ distribution   N       Outcome           a        b        c        d        θ
Normal           1,000   RMSE (MCMC)    0.3389   0.2985   0.0537   0.0642   0.5782
                         RMSE (MML)     0.6323   0.5833   0.1204   0.1179   0.5784
                         RE (%)          53.60    51.17    44.60    54.45    99.97
                 2,500   RMSE (MCMC)    0.3150   0.2764   0.0515   0.0589   0.5834
                         RMSE (MML)     0.5671   0.4244   0.1013   0.1003   0.5770
                         RE (%)          55.55    65.13    50.84    58.72   101.11
                 5,000   RMSE (MCMC)    0.2998   0.2704   0.0545   0.0541   0.5903
                         RMSE (MML)     0.4707   0.3597   0.0927   0.0871   0.5820
                         RE (%)          63.69    75.17    58.79    62.11   101.43
Negatively       1,000   RMSE (MCMC)    0.3289   0.3421   0.0527   0.0728   0.5872
skewed                   RMSE (MML)     0.6733   0.6288   0.1204   0.1358   0.5896
                         RE (%)          48.85    54.41    43.77    53.61    99.59
                 2,500   RMSE (MCMC)    0.3157   0.3747   0.0546   0.0775   0.5759
                         RMSE (MML)     0.6689   0.5614   0.1046   0.1267   0.5708
                         RE (%)          47.20    66.74    52.20    61.17   100.89
                 5,000   RMSE (MCMC)    0.2918   0.3963   0.0496   0.0843   0.5832
                         RMSE (MML)     0.5959   0.4954   0.0871   0.1190   0.5758
                         RE (%)          48.97    80.00    56.95    70.84   101.29

Note. In each cell, the first line is the RMSE by the two MCMC methods, the second line is the RMSE by MML, and the third line is the relative efficiency ratio (RE, %).

With respect to item parameter recovery, Gibbs and HMC were clearly more advantageous than MML in terms of total estimation errors, as evidenced by relative efficiency ratios below 100% for all item parameters. However, the gap between the two

MCMC approaches and MML narrowed as the number of test-takers increased, and this trend was observed for both normal and negatively skewed θ distributions. When the latent trait distribution assumption was violated, MML estimates of item slopes were hit harder than those of MCMC, as the relative efficiency ratios decreased considerably when

θ transitioned from normality to negative skewness. In contrast, Gibbs and HMC were more heavily affected in estimation of the b and d parameters, as the relative efficiency ratios rose, pointing to improvement of MML over MCMC as the θ distributional shape changed from normal to negatively skewed, holding sample size constant. In fact, when θ was negatively skewed, a slight increase in RMSE was observed in MCMC estimation of the b and d parameters as sample size increased, whereas MML consistently demonstrated increasing accuracy with larger samples despite the deviation from the normal θ assumption. No clear pattern emerged for the c parameter, except the general trend that, compared to MCMC, MML became increasingly accurate as the number of examinees went up. Regarding person parameters, MML and the two MCMC methods exhibited little change in total errors regardless of variations in sample size and latent trait distribution shape. At N = 1,000,

Gibbs and HMC slightly outperformed MML, while at N = 2,500 and above, MML gained an advantage over the two MCMC methods. This RMSE pattern was observed under both normal and skewed θ, although, in fairness, the fluctuations in RMSE of the θ estimates across measurement conditions proved practically negligible for all methods.

More detailed examinations of the recovery of each parameter are offered below.

Because mean bias does not reflect the true extent of estimation errors when positive and negative values cancel each other out, and RMSE penalizes outliers (i.e., biases that are large in absolute value), it was useful to have an additional means of understanding both the magnitude and the direction of deviations from the true parameters. The boxplots in Figures 13–22 served this purpose graphically.

Item Discrimination Parameters

Figure 13

Item Discrimination Estimation Bias by MML, Gibbs and HMC under Normal θ

Note. Boxplot whiskers display the 2.5th to 97.5th percentiles.


Figure 14

Item Discrimination Estimation Bias by MML, Gibbs and HMC under Skewed θ

Note. Boxplot whiskers display the 2.5th to 97.5th percentiles.

Figures 13 and 14 showed that Gibbs and HMC had little observable difference in discrimination parameter estimation bias, and both MCMC methods produced bias primarily within [-0.5, 0.5] across all conditions. MML, by contrast, had a much wider bias range, and positive bias larger than 1.0 was not uncommon at N = 1,000 under normal θ and across all sample sizes under skewed θ. For MCMC, bias was more likely to be negative irrespective of the latent trait distributional shape, whereas MML tended to produce more positive than negative bias, and this pattern in MML was further consolidated under skewed θ. An interesting and unexpected finding was that latent trait skewness appeared to render Gibbs and HMC slightly less biased while not appearing to substantially affect their RMSE, whereas MML had both bias and RMSE worsen under skewed θ (Tables 4–9). Whether the person parameters were normal or negatively skewed, sample size increases consistently shrank the range of a parameter estimation bias for both

MML and MCMC, and the positive influence of sample size increase was stronger on

MML than on MCMC. Tables 4–9 indicated that larger samples under normal θ not only engendered an increasingly unbiased MML but also reduced its RMSE; however, as more examinees were simulated under skewed θ, the drop in RMSE was accompanied by an increase in the proportion of positive bias produced by MML. Specifically, the percentage of positive bias by MML under skewed θ followed an upward trend, jumping from 57% at

N = 1,000 to 67% at N = 5,000, whereas it dropped from 56% to 51% at the corresponding sample sizes under normal θ. Figures 13 and 14 appeared to corroborate this narrative about slope estimation properties in MML under a skewed latent trait: more positive bias accompanied by lower RMSE at every step of sample size increase.


Item Difficulty Parameters

Figure 15

Item Difficulty Estimation Bias by MML, Gibbs and HMC under Normal θ

Note. Boxplot whiskers display the 2.5th to 97.5th percentiles.


Figure 16

Item Difficulty Estimation Bias by MML, Gibbs and HMC under Skewed θ

Note. Boxplot whiskers display the 2.5th to 97.5th percentiles.

Like item discrimination, item difficulty was estimated with smaller errors by the two

MCMC methods than by MML across sample size and latent trait conditions overall, but the gap again narrowed quickly as sample size increased. Under the most favorable circumstance, with the largest sample (N = 5,000) and normal θ, the gap between

MML and MCMC was smallest (Table 6), but MML still produced more extreme biases

(bias > 0.5 in absolute value) than Gibbs and HMC (Figure 15). The departure from θ normality pushed both MML and MCMC toward increasingly negative bias, and the percentage of negatively biased estimates rose with sample size.

Specifically, when the person parameters were drawn from a negatively skewed distribution, the percentage of negatively biased estimates of the b parameters climbed from

68% at N = 1,000 to 79% at N = 5,000 for MML, and the corresponding figures for

MCMC were from 76% to 87%, respectively. In contrast, under normal θ, the percentages of positive and negative bias appeared to fluctuate around 50%, as evidenced by the fact that the median bias value was very close to zero for both MML and MCMC (Figure

15). Interestingly, although both MML and MCMC saw improvement in item difficulty estimation errors as sample size increased under a normal latent trait, this beneficial effect of sample size was available only to MML when the latent trait normality assumption was violated. Gibbs and HMC, in contrast, had larger RMSE in estimating item difficulty when θs were simulated from a skewed distribution (Tables 7–

9). Mean absolute bias also followed the same pattern as RMSE, increasing from 0.27 to 0.32 for MCMC but decreasing from 0.46 to 0.36 for MML when sample size jumped from 1,000 to 5,000 examinees under skewed θ. Careful examination of Figure 16 revealed that when the deviation from a normal latent trait was present, sample size increases led the bulk of MCMC estimation errors to drift further from zero. It should be noted, however, that despite these adverse trends for the two MCMC methods, RMSE of item difficulty parameter recovery was still larger for MML across all conditions under investigation.


Item Lower Asymptote Parameters

Figure 17

Item Lower Asymptote Estimation Bias by MML, Gibbs and HMC under Normal θ

Note. Boxplot whiskers display the 2.5th to 97.5th percentiles.


Figure 18

Item Lower Asymptote Estimation Bias by MML, Gibbs and HMC under Skewed θ

Note. Boxplot whiskers display the 2.5th to 97.5th percentiles.

Again, MML clearly had a wider range of bias values than the two MCMC methods in estimation of the c parameter across all conditions. The majority of bias values fell roughly within [-0.1, 0.1] for MCMC regardless of sample size and latent trait conditions, while most MML bias values spanned the wider range [-0.2, 0.2]. When interpreted against c parameter values generated from N(0.15, 0.04²) with the lower end truncated at zero, these biases are large, even those produced by MCMC. When controlling for the shape of the latent trait distribution, sample size increases were associated with smaller total estimation errors by MML, while MCMC did not exhibit a clear pattern of RMSE change. Interestingly, Tables 4–9 indicated that when the trait scores were drawn from a normal distribution, MML underestimated the lower asymptote

MML and MCMC were more negatively biased on average yet had lower RMSE under skewed θ than under normal θ (tables 4-9). As can also be seen in figures 17 and 18, when N was fixed at the largest level, all three estimators had a smaller range of bias for the majority of c parameters when the person parameters were drawn from a negatively skewed distribution.
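As a minimal illustration of that generating distribution, the following R sketch draws c parameters from N(.15, .04²) with the lower end truncated at zero via simple rejection sampling (an assumed implementation, not the study's script):

```r
# Draw n lower asymptotes from N(.15, .04^2), discarding negative values
# (assumed rejection-sampling approach; truncation rarely triggers here
# because zero lies nearly four standard deviations below the mean).
draw_c <- function(n) {
  x <- numeric(0)
  while (length(x) < n) {
    cand <- rnorm(n, mean = 0.15, sd = 0.04)
    x <- c(x, cand[cand >= 0])
  }
  x[1:n]
}
summary(draw_c(1000))   # values cluster tightly around .15
```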

Item Upper Asymptote Parameters

Figure 19

Item Upper Asymptote Estimation Bias by MML, Gibbs and HMC under Normal θ

Note. Boxplot whiskers display 2.5 to 97.5 quantiles.


Figure 20

Item Upper Asymptote Estimation Bias by MML, Gibbs and HMC under Skewed θ

Note. Boxplot whiskers display 2.5 to 97.5 quantiles.

Both familiar and unusual patterns were observed for MML and MCMC estimation of the d parameters in tables 4–9. Mean bias by all estimators was generally negative across all conditions, except at N = 5,000 under normal θ, when mean bias by MML became slightly positive. MML became increasingly negatively biased with sample size increase when the θ normality assumption was violated, although RMSE still consistently improved, as it did when the θ normality assumption was tenable.

Figures 19 and 20 also showed that the range of bias generally shrank as sample size increased for MML, regardless of the latent distributional shape, but as larger samples were drawn from a skewed θ distribution, the d parameters became more likely to be underestimated. Like the other item parameters, the d parameters were estimated with lower RMSE by Gibbs and HMC, but MCMC recovery of the upper asymptote improved at a lower rate with sample size increase than its MML counterpart.

When the latent trait was normal, RMSE reduced from 0.06 at N = 1,000 to 0.05 at N =

5,000 for MCMC, as opposed to the deeper fall from 0.12 to 0.09 at the corresponding sample sizes for MML. Under skewed θ, MCMC had increasingly negatively biased estimates, like MML. However, unlike MML, MCMC also had its RMSE exacerbated in tandem with sample size increase when θ was skewed. Specifically, with the negatively skewed latent trait, RMSE was 0.073 at N = 1,000 and rose to 0.078 and 0.084 at

N = 2,500 and 5,000, respectively. As can be seen in figures 19 and 20, in general, the

MCMC bias magnitude range was aggravated by latent trait skewness compared to normal θ, and larger samples appeared to produce larger negative bias for the bulk of the d parameters under skewed θ. MCMC behavior in the estimation of the upper asymptote parameters was very similar to its estimation of the difficulty parameters.


Person Parameters

Figure 21

Latent Trait Estimation Bias by MML, Gibbs and HMC under Normal θ

Note. Boxplot whiskers display 2.5 to 97.5 quantiles.


Figure 22

Latent Trait Estimation Bias by MML, Gibbs and HMC under Skewed θ

Note. Boxplot whiskers display 2.5 to 97.5 quantiles.

MML-EAP appeared to show no significant difference from MCMC in person parameter recovery accuracy across sample sizes and θ distributional shapes in tables 4–9, and figures 21–22 corroborated these findings. Mean bias was very close to zero and slightly positive for all estimators, except at N = 5,000 cases under normal θ, when mean bias was slightly negative. RMSE by MML-EAP and MCMC appeared rather stable across all sample size and latent trait conditions. MML-EAP held the lead over Gibbs and HMC in total errors at N = 2,500 or larger under both types of latent trait distributions, although this advantage appeared microscopic. Proportion coverage of the true person parameter by the 95%HDI in Gibbs and HMC always reached

.97 to .98, while the 95%CI in MML-EAP fell slightly short of its promise with .94 at N = 1,000 cases and delivered its promise at N = 2,500 and larger. The conservative (i.e., too wide) coverage of the 95%HDI is due to the large posterior standard deviation of θ estimates by MCMC compared to the MML-EAP posterior standard deviation. For both MML-EAP and MCMC, the true and estimated θs correlated highly, at .82 to .83, across sample size and latent trait conditions. Similarly, the rankings of examinees were equally well preserved, with Spearman's correlation ranging from .82 to .84 for all estimators.

Supplemental Analyses

Reports on parameter recovery for the 4PM have identified groups of parameter values that present challenges for estimation. Sheng (2015) found that larger item discrimination tends to induce larger RMSE, while more extreme item difficulty (in both directions) gives rise to increased bias and RMSE. Several studies have also drawn attention to the relationship between item difficulty and estimation of the asymptote parameters. Just as lower item difficulty rendered estimation of the lower asymptote trickier in the 3PM literature, higher item difficulty made estimation of the upper asymptote less accurate for the 4PM (Waller

& Feuerstahler, 2017). Similarly, Culpepper (2016) noted that the slipping parameter

(i.e., 1 - d) was likely to be overestimated when items were very difficult. Loken and

Rulison (2010) found that the 95%HDI proportion coverage of person parameter values outside the middle range [-1, 1] of the latent scale might be compromised in 4PM estimation, especially with shorter tests, and Waller and Feuerstahler (2017) uncovered the effect of regression to the prior's mean on person parameter estimates by BME-EAP in simulated data that mimicked the MMPI-A Depression Scale responses.

Based on these previous findings on the connection between parameter values and estimation errors, the following supplemental analysis examined how MML, Gibbs and HMC item parameter recovery accuracy compared in extreme scenarios, and the extent to which MML-EAP and MCMC were affected by regression to the prior's mean (Lord, 1986). In table 11 below, outcomes indicative of estimator performance are reported for the top 10% highest item slopes (i.e., a >= 1.64), item locations outside the [-

1.5, 1.5] range, item upper asymptotes when item difficulty is higher than 1.5, and item lower asymptotes when item difficulty is lower than -1.5.

Table 11

Comparison of MML, Gibbs and HMC Item Parameter Recovery in Extreme Situations

Estimator   Parameter      Bias     RMSE   Correlation   Coverage       SE
MML         a            0.0273   0.5265         .0779      .9183   0.8890
            b           -0.0455   0.7237         .9269      .9079   1.3491
            c           -0.0185   0.1578         .1131      .8758   0.3825
            d           -0.0586   0.2046         .0822      .8586   0.5533
Gibbs       a           -0.6062   0.6294         .1941      .0666   0.1725
            b           -0.0409   0.3684         .9837      .9758   0.4097
            c            0.0408   0.0699         .2448      .9938   0.0971
            d           -0.1068   0.1258         .1572      .9562   0.1133
HMC         a           -0.6063   0.6294         .1949      .0683   0.1725
            b           -0.0413   0.3676         .9837      .9774   0.4104
            c            0.0410   0.0700         .2459      .9969   0.0972
            d           -0.1070   0.1262         .1490      .9596   0.1133

Note: Bias: average bias; RMSE: root mean square error; Correlation: average correlation between the generated and estimated parameter values; Coverage: average proportion coverage of the true parameter value by the 95% Confidence Interval (MML) and 95% Highest Density Interval (MCMC); SE: average standard error/posterior standard deviation.

Overall, extreme values of item parameters presented a challenge to accurate recovery, as many outcome measures, especially RMSE, worsened for all three estimators compared to the overall results reported above. Similar to what was seen in tables 4–9, the two MCMC methods were almost identical on every measure, and both sustained their advantage of lower RMSE than MML, except in the recovery of large item slope parameters, where MML had lower RMSE than MCMC. Regarding bias, MML was less biased than Gibbs and HMC in the recovery of the item discrimination and the two asymptote parameters, while the two MCMC estimators were slightly better with item difficulty. Prominent deterioration in item parameter recovery was observed for MML in terms of the very poor correlation between the true and estimated item discriminations (Pearson's r < .10) and the increasing instability of item location estimation, with notably high SE. The 95%HDI by Gibbs and HMC became too liberal (i.e., too narrow) and captured the true item discrimination parameter less than 7% of the time, far below the nominal level of 95%.

To examine the effect of regression to the prior's mean on person scoring, the generated ability parameters were recoded into equally spaced intervals, with the exception of the two most extreme intervals. The most extreme θ interval at the lower end of the latent trait continuum included all θs smaller than -3.25, and all θs larger than 3.25 were grouped in the opposite interval at the upper end of the latent trait scale. The midpoint of each θ interval, together with bias and RMSE, is reported in table 12 below. The range of bias by MML-EAP and MCMC across generated θ intervals is portrayed in figure 23.
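The recoding can be sketched as follows in R (illustrative code, not the study's script; the 0.5 interval width is inferred from the midpoints in table 12):

```r
# Recode generated thetas into equally spaced intervals, pooling the
# extremes below -3.25 and above 3.25 as described above.
theta     <- rnorm(5000)                               # stand-in generated thetas
breaks    <- c(-Inf, seq(-3.25, 3.25, by = 0.5), Inf)  # 15 intervals in total
midpoints <- seq(-3.5, 3.5, by = 0.5)                  # interval labels in table 12
interval  <- cut(theta, breaks = breaks, labels = midpoints)
table(interval)                                        # respondents per interval
```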

Table 12

MML-EAP, Gibbs and HMC Person Parameter Recovery across True θ Ranges

True θ    MML Bias   MML RMSE   Gibbs Bias   Gibbs RMSE   HMC Bias   HMC RMSE
 -3.5       1.7083     1.7919       1.6034       1.6908     1.6034     1.6908
 -3.0       1.1531     1.2487       1.0407       1.1447     1.0408     1.1448
 -2.5       0.8199     0.9559       0.7108       0.8655     0.7109     0.8656
 -2.0       0.5282     0.7321       0.4262       0.6678     0.4262     0.6678
 -1.5       0.3047     0.6030       0.2148       0.5787     0.2149     0.5787
 -1.0       0.1520     0.5479       0.0828       0.5589     0.0829     0.5588
 -0.5       0.0616     0.5368       0.0217       0.5652     0.0217     0.5652
  0.0       0.0024     0.5350      -0.0066       0.5653    -0.0066     0.5653
  0.5      -0.0577     0.5374      -0.0377       0.5615    -0.0377     0.5615
  1.0      -0.1533     0.5458      -0.1108       0.5549    -0.1108     0.5549
  1.5      -0.3015     0.5873      -0.2415       0.5691    -0.2415     0.5691
  2.0      -0.5145     0.7135      -0.4384       0.6601    -0.4384     0.6601
  2.5      -0.8127     0.9384      -0.7335       0.8681    -0.7336     0.8682
  3.0      -1.1659     1.2502      -1.0898       1.1749    -1.0898     1.1749
  3.5      -1.6010     1.6692      -1.5279       1.5961    -1.5280     1.5962


Figure 23

Latent Trait Estimation Bias by MML-EAP, Gibbs and HMC across θ Intervals

Note. Boxplot whiskers display 2.5 to 97.5 quantiles.

Several patterns were conspicuous from the inspection of mean bias and RMSE across generated θ intervals. First, ability estimates by MML-EAP and MCMC were all regressed toward the prior's mean, in the sense that high-ability respondents were underestimated and low-ability examinees were overestimated. As true ability parameters were located further toward either end of the latent continuum, an increase in bias magnitude ensued. Second, as is obvious from table 12 and figure 23, MML-EAP yielded smaller total estimation errors in the middle range of the latent trait scale, whereas Gibbs and

HMC had more accurate recovery of latent trait scores outside the [-1, 1] range. Again,

Gibbs and HMC offered almost identical performances; differences, where present, appeared in the fourth decimal place and were essentially ignorable for practical purposes. In general, estimates of the person parameters were unbiased in the central interval [-1.5, 1.5] for all three methods. Spearman's correlation between the true and estimated θs outside the [-1.5, 1.5] range was approximately .81 for all three estimators, which indicated that MML-EAP and the two MCMC methods performed equally well in preserving the ranking order of very poor and very proficient test-takers. The above patterns were observed for MML-

EAP and MCMC when estimation errors were examined under normal and skewed θ separately.

The section below provides answers to two research questions in the present study based on the analyses conducted above.

Answer to Research Question 1

How accurate are MML, Gibbs and HMC parameter estimations for the unidimensional four-parameter binary IRT model across latent trait distributions?

When latent trait scores were generated from a normal distribution, MML, Gibbs and HMC estimates of the item difficulty, lower asymptote and upper asymptote parameters were essentially unbiased, while MML estimates of the item discrimination parameters were positively biased and MCMC estimates were negatively biased on average. When latent trait scores were skewed to the left, there was a concomitant deterioration in the quality of item parameter recovery by both MML and MCMC generally, with MML more substantially affected. With departure from normal θ, MCMC estimates of all item parameters were negatively biased, and the same pattern was observed for MML, except that MML still consistently overestimated the item discrimination parameters. Under skewed θ, larger sample size benefited MML recovery of item parameters by lowering RMSE but had the adverse effect of actually increasing total errors of estimating the item difficulty and upper asymptote parameters by MCMC. However, regardless of θ distributional characteristics, Gibbs and HMC still had lower RMSE than MML in item parameter recovery. MML-EAP and MCMC differed little in person parameter recovery, and their mean bias was minor even when the θs were drawn from a negatively skewed distribution. Spearman's correlation between the true and estimated θs remained consistent at slightly above .80 for all three methods irrespective of θ generating distributions, even for respondents with extreme latent trait levels, which suggested that MML-EAP and MCMC had comparably good performance in retaining the ranking order of examinees. Overall, the shift from a normal to a negatively skewed latent trait brought about little change in person parameter recovery by all three methods. Gibbs and HMC offered very similar item and person parameter recovery quality across conditions, and neither systematic nor meaningful difference between the two MCMC approaches was found.

Answer to Research Question 2

How accurate are MML, Gibbs and HMC parameter estimations for the unidimensional four-parameter binary IRT model across sample size levels?

In general, sample size increase benefited estimation of the four-parameter IRT model as a whole, but the specific effects of sample size on the recovery of each type of parameter appeared to depend on the estimation method and the underlying latent trait distribution. Overall, MML absorbed the momentum from sample size increase to improve its performance more strongly than Gibbs and HMC. When θs were generated from a normal distribution, MML and MCMC estimated the item difficulty, lower asymptote and upper asymptote parameters with little bias, even at N = 1,000 cases. Estimates of item slopes were positively biased for MML and negatively biased for MCMC, and mean bias by all methods was considerably large in absolute value (> 0.10) even at N = 5,000. Under normal θ, all methods consistently improved RMSE of item parameter recovery in conjunction with sample size increase, and MML item parameter recovery became less biased than Gibbs and HMC at N = 5,000 cases. When θs were generated from a negatively skewed distribution, sample size increase affected the performances of the three methods differentially. Under skewed θ, MML had total errors of item parameter recovery diminish as more examinees took a test, yet sample size increase did not appear to benefit mean bias. Indeed, MML became increasingly negatively biased in its estimation of the upper asymptote parameters as sample size increased, and biases of estimating the other item parameters remained considerably large even at N = 5,000 when the latent trait normality assumption was violated. For Gibbs and HMC, sample size increase under a skewed latent trait benefited only mean bias of item discrimination recovery while rendering their estimation of the other item parameters more negatively biased. In addition, unlike MML, the two MCMC methods also produced larger RMSE in item difficulty and upper asymptote parameter estimation as more cases were drawn from a skewed latent trait distribution. Overall, sample size increase brought a correspondingly narrower gap between MML and MCMC estimation errors, although MCMC retained its advantage of lower RMSE across the three sample size levels under investigation. Sample size had little observable effect on person parameter recovery on average. Both MML-EAP and MCMC were essentially unbiased and had similar RMSE in trait score estimation across sample size levels. Again, Gibbs and HMC were almost identical on every performance outcome, and no considerable difference between the two MCMC methods was detected.

Chapter Summary

Chapter 4 offered results from model calibrations to build an overall picture of the performance of each estimation method and provided detailed answers to each research question. First, modeling results with MML and the two MCMC methods were presented for each unique combination of sample size level and latent trait distribution, followed by the relative efficiency table comparing the total estimation errors of MML and MCMC. Subsequent to the presentation of summary statistics, close inspections of the recovery of each model parameter were conducted, with boxplot graphs to further aid understanding of the direction and magnitude of estimation errors. Supplemental analyses were also performed to provide additional information on how MML and MCMC recovered item and person parameters with extreme characteristics. Chapter 4 ended with answers to the two research questions based on the comprehensive analyses undertaken earlier. For question 1, results indicated that θ normality benefited model estimation accuracy, whereas negatively skewed θ brought increased RMSE and rendered recovery of item parameters negatively biased for both MML and MCMC, except MML estimation of item discrimination, which remained positively biased. For question 2, analyses revealed that sample size increase improved item parameter recovery under normal θ but represented only a partial solution to model estimation when θ was skewed. MML was more heavily affected by the departure from θ normality but was also more responsive to sample size increase in boosting its item parameter estimation accuracy than Gibbs and

HMC. Sample size and latent trait distribution were found to have little effect on person parameter recovery, and all estimators had comparable performance and minor mean bias of latent trait estimation overall. Across all conditions, Gibbs and HMC had almost identical outcome measures.


Chapter 5: Discussions

Summary of Findings

This study aimed to assess the parameter recovery accuracy of three estimation methods: MML, considered the gold standard in item response modeling, and two methods in the MCMC class, Gibbs and HMC, under the four-parameter unidimensional binary item response function. Data were simulated under a fully crossed design with three levels of sample size (1,000, 2,500 and 5,000 respondents) and two types of latent trait distribution (normal and negatively skewed). Only data which satisfied both technical and practical convergence criteria with MML were inputted into Gibbs and HMC estimation procedures. Results indicated that Gibbs and HMC offered nearly identical outcomes and produced smaller total item parameter estimation errors than

MML across all conditions. MCMC and MML did not substantially differ in terms of mean bias for any parameter, with the exception of item slopes. Both MML and MCMC estimated item discrimination with fairly large mean bias but in opposite directions:

MML overestimated item slopes whereas MCMC underestimated them regardless of sample size and latent trait distributional shape. The violation of the normal latent trait distribution assumption was found to have a wide-ranging impact on the accuracy of item parameter recovery, especially item difficulty. With departure from normally distributed person parameters, MCMC consistently produced estimates of all item parameters with negative mean bias, and the same trend was observed for MML except for item slopes, which MML overestimated on average. Sample size increase improved both item parameter estimation bias and total estimation errors for all methods when latent trait parameters were drawn from a normal distribution, but increasing the number of examinees appeared to remedy the harmful effect of skewed θ on model estimation only to some extent. Overall, MML was more extensively affected by latent trait skewness than MCMC, but MML also showed more considerable accuracy improvement in connection with sample size increase. The gap in total estimation errors between MML and MCMC continuously narrowed as the number of respondents went up, irrespective of latent trait distributional characteristics. At 5,000 respondents and normal θ, MML was less biased than MCMC in the estimation of every item parameter. Person parameters were estimated with minor mean bias even with the smallest sample size and latent trait skewness, and

MCMC and MML yielded comparable outcomes in person parameter recovery across all conditions overall. In scenarios with extreme item characteristics, MML was generally less biased while MCMC continued to hold the advantage of lower total errors, except for item slopes, which MML recovered with both smaller bias and lower total errors than

MCMC. Analysis of person parameter estimation bias as a function of θ magnitude and estimator revealed that MML was more accurate in the middle portion of the latent continuum [-1, 1], whereas Gibbs and HMC recovered trait scores toward both ends of the latent scale with slightly higher accuracy.

Discussions

Careful inspections of the estimation results by MML and the two MCMC methods across sample size levels and latent trait distributions gave rise to certain remarks.

First, Lord (1986) noted, with mathematical proof, that given a useful prior density and the use of the posterior mean as the point estimate of a parameter, Bayesian estimation would outperform MML in minimizing mean square error. This fact explains the consistent advantage of the two MCMC methods, with lower RMSE for item parameter recovery across all conditions. However, Bayesian estimation has an inherent trade-off between bias and RMSE in that securing minimal mean square error comes at the cost of inflated bias (Lord, 1986). Our findings indicated that when the θ normality assumption held, MML lagged behind Gibbs and HMC in terms of mean bias only at the lowest sample size of 1,000 cases.

At N = 2,500, MML was less biased than MCMC in estimation of all item parameters except item slopes, and at N = 5,000, MCMC trailed MML in unbiasedness of recovering all item parameters. It was also observed that as more respondents became available, MML made stronger progress than MCMC and narrowed the RMSE gap accordingly, regardless of the shape of the parent θ distribution. These findings could be explained by the different consistency properties possessed by MML and MCMC estimates (Patz & Junker, 1999a). It is well known that MML is based on asymptotic theory, and MML item parameter estimates are consistent (i.e., approach the true population parameters) as sample size increases when the number of items is constant (Ogasawara, 2012). MML enjoys the advantage of the separation between item calibration and person scoring, as more test-takers do not lead to an increase in the number of parameters to estimate. The consistency properties of MCMC, interestingly, depend on the way the Markov chain draws are used (Patz & Junker, 1999a). MCMC estimates of item parameters are expected to be consistent if inferences are made on item parameters by integrating over θ estimates (i.e., discarding θ draws in the Markov chain) as sample size increases, and MCMC estimates of person parameters are expected to be consistent if inferences on person parameters are made by integrating over item parameter estimates as test length increases. When MCMC output is used to estimate both item (structural) and person (incidental) parameters, the item parameter estimates do not have to be consistent as sample size increases, because a larger number of test-takers means a larger number of parameters to estimate in the joint posterior. In this manner, the consistency properties of MCMC estimates resemble Joint Maximum Likelihood (JML) consistency, which might be expected only when the numbers of item and person parameters are large and carefully tended (Ogasawara, 2012). As the number of test items was fixed while only the number of examinees was designed to increase, and both person and item parameters were estimated with the MCMC output in the present study, the Gibbs and HMC estimators were not necessarily consistent despite the large jump in sample size, as opposed to MML.

It has been acknowledged that, in general, item discrimination estimation is challenging (Baker & Kim, 2004). Previous simulation studies also found that MML item slope estimation errors tended to be larger than item location estimation errors for the

2PM (Sass et al., 2008), and item slope estimates correlated rather poorly with their true values for the 4PM (Loken & Rulison, 2010). The parameter recovery findings with the

4PM in this study lent support to Baker and Kim's (2004) observation and resonated with Sass et al.'s (2008) and Loken and Rulison's (2010) reports. On average, both MML and MCMC recovered item slopes with rather large mean bias in absolute value within the confines of this study's design. The two MCMC methods demonstrated superior performance, with lower mean bias in most scenarios and smaller RMSE across all conditions, yet the approximated posteriors had misleadingly small standard deviations, and the 95% HDI covered the true item discrimination only about 70% of the time. While the 95% CI by MML captured the true item slopes well, mean standard errors were unacceptably large, especially at smaller sample size levels. Moreover, the simulated data were heavily filtered, primarily because MML frequently produced inadmissible estimates of the item discrimination parameters despite technically successful model convergence. Consistent with Feuerstahler and Waller

(2014), the results of this study also pointed to a positive bias in item slope recovery by

MML across sample sizes and added that MML tended to overestimate item discrimination regardless of whether the latent trait distribution was normal or negatively skewed. In addition, even under the most favorable conditions examined (N = 5,000 and satisfaction of the latent trait normality assumption), MML still produced much larger errors in item slope recovery than Gibbs and HMC, and bias magnitudes larger than 0.5 by MML were not uncommon. Because MML offered unsatisfactory accuracy at N = 5,000 but showed signs of continuous improvement as more examinees participated in a test, a follow-up simulation was undertaken with N = 10,000 under the identical model configurations described in chapter 3. Outcomes were averaged across 50 replications and are reported in Appendix A. With 10,000 examinees and a normal latent trait, MML estimation accuracy became acceptable. Mean bias went down below 0.10 for item slopes, and notable reductions in total estimation errors and standard errors were observed for all item parameters, especially item discrimination and item difficulty.

Equally important, implausible item discrimination estimates occurred in only 8.4% of the generated data sets at N = 10,000, as opposed to 27% at N = 5,000.

When the underlying θ distribution matched the normal θ prior, MCMC obtained reasonable accuracy in item parameter recovery even at N = 1,000. MML always had larger total errors of item parameter estimation than MCMC, but the gap decreased rapidly at every step of sample size increase, and MML surpassed MCMC in terms of unbiasedness at N = 5,000. Overall, item parameter estimation quality of MML and

MCMC consistently improved, as expected, when larger samples became available.

However, ability samples from the skewed latent distribution painted a different picture of item parameter estimation for both MML and MCMC, with the former bearing a larger impact than the latter in general, as seen in chapter 4. When sample size was controlled for, almost all evaluative outcome measures for the three estimators deteriorated under non-normal θ when juxtaposed with their counterparts under normal θ, a finding which resonated with reports on the consequences of violating the normal latent trait assumption in IRT model estimation (e.g., Sass et al., 2008; Stone, 1992; Svetina et al.,

2017), although the severity of the increase in estimation errors varied considerably due to differences in study design features. Under the conditions investigated in the present study, the estimation of item difficulty came with non-ignorable negative bias by both

MCMC and MML when θ was skewed, which might lead to serious consequences if estimated values serve to inform decision making, such as item selection for computerized adaptive testing (Wainer & Mislevy, 2000). Unfortunately, sample size increase did not appear to bring a comprehensive remedy for item parameter recovery accuracy when the latent trait distribution assumption was violated. Results from additional calibrations under skewed θ (Appendix A) showed that the improvement in item parameter recovery accuracy at N = 10,000 under the negatively skewed latent trait was generally minor, and MML appeared to have reached a point of diminishing returns, where a very large sample alone no longer remedied the adverse impact of the non-normal latent trait on item parameter recovery to a desirable degree.

MML and MCMC behaved quite differently in several interesting ways as larger samples were drawn from a skewed θ distribution. Regarding item slopes, sample size increase under skewed θ appeared to benefit both mean bias and RMSE by Gibbs and HMC. On the contrary, MML had its RMSE improved, but its estimation simultaneously became more likely to be positively biased; hence there were no appreciable gains in mean item slope bias as larger calibration samples became available under skewed θ. Some counterintuitive results were found for Gibbs and HMC recovery of the item difficulty and upper asymptote parameters. While MML estimation of the b and d parameters became more negatively biased on average, RMSE improved with larger samples, meaning the overall magnitude of estimation errors diminished systematically with sample size increase even when the θ assumption did not hold. Conversely, the two MCMC methods not only had their estimates pulled further in the negative bias direction but also saw their RMSE deteriorate as sample size went up under skewed θ. In other words, larger sample sizes actually harmed MCMC estimation of the item difficulty and upper asymptote parameters when the normal prior of the latent trait did not match its underlying skewed distribution. It is important to note that compared to normal θ samples, skewed θ draws actually brought two disadvantages to data calibrations. One was the violation of the underlying latent distribution assumption for MML and, for MCMC, the mismatch between the normal θ prior and its skewed parent distribution. The other detriment was the lack of high-ability respondents at the upper end of the latent scale when θ samples were drawn from a skewed distribution. Specifically, the maximum possible θ value generated from the skewed distribution beta(8.2, 2.8) after z transformation was around 2.02. In fact, examinations of the simulated data indicated that the maximum θ drawn under the negatively skewed latent trait equaled 1.97 in the present study. The worsened performances in overall item calibrations by MML and MCMC under skewed θ could be attributed to the differences in these two factors. Within

MCMC under skewed θ, however, the counterintuitive trend of increased RMSE of item difficulty and upper asymptote parameter estimation as sample size increased occurred when only sample size was designed to vary at three levels and the mismatched prior was held constant (tables 7–9). A putative explanation for this phenomenon has to do with the interplay between the prior and the data in Bayesian posterior reconstruction.

Due to the nature of the skewed θ distribution described above, sample size increase did not bring more respondents with latent trait scores larger than the threshold of 2.02

(practically no larger than 1.97 due to the random sampling process in this study). The absence of larger θ values toward the right end of the latent continuum provided little information for accurate estimation of highly difficult items and particularly for the recovery of the upper asymptote parameters, which was shown to depend on the number of respondents with high latent trait scores (Culpepper, 2016). In the meantime, sample size increase led to the diminishing influence of the priors overall. Thus, even as larger samples brought more information for item parameter recovery in general, the useful information associated with sample size increase under skewed θ was not as strong as its counterpart under normal θ in countering the declining effect of the priors in the estimation of the b and d parameters. Note that the rate of RMSE improvement slowed down for the b and d parameter estimation by MML under skewed θ as well, possibly because the lack of individuals at the upper tail of the latent trait continuum offset the overall value of sample size increase to some extent. The significant role of adequate information from the data can also explain why little adverse impact of the negatively skewed θ distribution on estimation of the lower asymptote parameter was observed, presumably because the presence of more individuals in the lower tail of the latent scale (i.e., more respondents with low θ scores) considerably assisted accurate recovery of the c parameter.

Examinations of the generated data revealed that there were 128 individuals with θ scores below -2.0 at N = 5,000 under normal θ, whereas the corresponding number under negatively skewed θ was 190 examinees. At this largest sample size, N = 5,000, both

MCMC and MML yielded lower RMSE in estimating the c parameter under skewed θ than under normal θ.
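The 2.02 threshold mentioned above can be verified directly from the moments of beta(8.2, 2.8); the following R sketch (illustrative, not the study's generating script) computes the standardized maximum and draws a z-transformed sample:

```r
# Standardized upper bound of beta(8.2, 2.8) after the z transform.
a <- 8.2; b <- 2.8
mu  <- a / (a + b)                                 # beta mean
sdv <- sqrt(a * b / ((a + b)^2 * (a + b + 1)))     # beta standard deviation
(1 - mu) / sdv                                     # maximum possible z score: ~2.02

set.seed(1)
theta <- (rbeta(5000, a, b) - mu) / sdv            # z-transformed skewed draws
max(theta)                                         # in practice slightly below 2.02
```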

To test the credibility of this explanation, additional simulations were performed to examine whether adding more high-ability examinees to the data would terminate the upward trajectory of RMSE in item difficulty and item upper asymptote parameter recovery by MCMC. A random sample of latent trait scores was drawn from beta(8.4, 2.0) and fused with a random sample of θs larger than 1.8 from N(0,1) to form a negatively skewed distribution with skewness = -0.60. In this manner, the skewness parameter was held constant, and the resulting skewed distribution noticeably differed from a random sample from beta(8.2, 2.8) in its more peaked center and longer upper tail

(Appendix B). Note that when more high-achieving respondents were incorporated, the distribution became more leptokurtic (i.e., kurtosis increased). Therefore, the constraint kurtosis < 0.50 was imposed on selected samples in the follow-up simulation to avoid excessive deviations from the kurtosis level of the skewed samples in the main simulations. Draws from this hybrid distribution were then standardized to have a mean of zero and variance of one before being fed to the data simulation procedure in mirt.

Features of model estimation by MML and MCMC remained identical to those described in chapter 3. Results are presented in Appendices C and D. As predicted, RMSE of MCMC estimation of the b parameters dropped sharply as sample size increased, from 0.95 at N = 1,000 to 0.43 at N = 2,500, but RMSE of MCMC estimation of the d parameters increased slightly, from 0.080 to 0.085. A closer inspection revealed that the increase in RMSE of MCMC estimation of the d parameters was driven by the presence of extreme negative deviation values rather than an overall increase in bias, because mean absolute error did not go up with sample size increase (indeed, mean absolute error decreased slightly, from 0.061 at N = 1,000 to 0.060 at N = 2,500). The main simulation and follow-up outcomes taken together pointed to the challenges of improving upper asymptote parameter recovery by MCMC when a disagreement between the normal θ prior and the underlying skewed latent distribution is present, because the increased number of candidates in the upper tail of the latent distribution, at least as designed in the follow-up, appeared to compensate for the incorrectly specified θ prior to some extent but did not bring appreciable changes. Given that the 4PM literature to date has primarily addressed Bayesian parameter estimation under the normal latent trait, the empirical question of the number of examinees with high latent trait scores necessary to offset the negative impact of the mismatch between the normal prior and the actual negative skewness of θ, and to obtain the desired precision in recovering the d parameter, remains to be explored.
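As a rough illustration of the hybrid construction just described, the following R sketch fuses standardized beta(8.4, 2.0) draws with high-ability donors from N(0,1); the mixing proportion and sample sizes here are assumptions for illustration, not the study's tuned values:

```r
# Hybrid skewed-theta construction (hedged sketch; the 2% mixing rate
# is an illustrative assumption, not the study's value).
set.seed(2)
n <- 5000
body <- rbeta(n, 8.4, 2.0)                        # negatively skewed bulk
body <- (body - mean(body)) / sd(body)            # put on the z scale
donors <- rnorm(20 * n)
donors <- donors[donors > 1.8]                    # high-ability draws from N(0,1)
theta <- c(body, sample(donors, round(0.02 * n))) # fuse to lengthen the upper tail
theta <- (theta - mean(theta)) / sd(theta)        # restandardize: mean 0, variance 1

library(moments)                                  # sample skewness and kurtosis
skewness(theta)                                   # target was about -0.60
kurtosis(theta) - 3                               # excess kurtosis, constrained < 0.50
```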

The ability parameters were recovered with little mean bias in general by all estimators, even in the worst-case scenario with the smallest sample size and skewed θ distribution. The little observed effect of sample size on MML-EAP estimation of the ability parameters was an expected finding, because the underlying latent score of each respondent is calibrated independently of other respondents' scores (de Ayala, 2009).

MML possessed a slight advantage over the two MCMC methods in the middle range of the latent continuum (i.e., average-ability examinees), whereas Gibbs and HMC were more accurate in the recovery of latent scores for high- and low-ability respondents. While this difference appeared to place MML-EAP and MCMC on an equal footing, it implied rather different merits depending on the measurement context and purpose. Lord (1986) reminded us that from a practical educational measurement perspective, greater importance should be attached to the identification of small trait score differences in the middle of the latent scale than to large differences in the extreme ability scores. If extreme latent scores are of greater concern (e.g., to pick the top 10% of test-takers with the highest scores for scholarship awards or placement in gifted programs, and the bottom 10% with the lowest scores for individualized instruction), adding very easy or very difficult items would provide more relevant data for accurate recovery of these θ values. However, from a psychological measurement standpoint, accurate estimates of large θ values might be more important in clinical settings and represent cases of greater interest to psychopathologists (Reise & Waller, 2003; Waller & Reise, 2010; Waller & Feuerstahler,

Several features of this study’s design warrant further discussions. In this inquiry, the logistic function rather than normal ogive model to link the probability of response to the underlying trait of respondents and items was employed. While other item parameters

(difficulty, pseudo-chance and slip) have the same interpretation regardless of the link function, item discrimination parameters must be interpreted against the benchmark for the logistic function. The generated a parameters subjected to calibration in this study consisted of 4.18% low, 61.25% moderate, and 33.47% high and very high values per Baker's (2001) benchmark. To put these values into perspective, for a scale with 20 items this means one item discriminates poorly among test-takers, 12 items are moderately discriminating, and the remaining seven items are highly or very highly discriminating. The small number of low discrimination parameters (a < .65) was not discarded, to reflect the fact that items are at times kept in a scale for content validity purposes despite their poor discriminating capacity. In my view, the item statistics in the present study represent quite reasonable test configurations which practitioners usually encounter in educational measurement.
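For reference, the item response function at issue is the standard four-parameter logistic form; the following minimal R sketch (with purely illustrative parameter values, not items from this study) computes the response probability without the D = 1.7 scaling constant, consistent with a logistic rather than normal-ogive parameterization:

```r
# Standard four-parameter logistic item response function
# (illustrative values; see chapter 2 for the exact parameterization used).
p_4pm <- function(theta, a, b, c, d) {
  c + (d - c) / (1 + exp(-a * (theta - b)))   # probability of a correct response
}
p_4pm(theta = 0, a = 1.2, b = -0.5, c = 0.15, d = 0.95)
```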

The low practical convergence rate of MML was another issue in this study. As can be seen in chapter 4, approximately one in every four samples failed to converge practically with reasonable estimates in MML under the most favorable conditions of N = 5,000 and a normal latent trait distribution. At N = 1,000, less than 1% of the generated data sets were successfully calibrated with plausible MML estimates. This aggressive data selection process resembled a tireless search for ideal data and gave MML an unearned advantage, yet it was justifiable because any IRT practitioner who comes across implausible estimates such as a = 10 would at best give the model calibration results the benefit of the doubt. Examinations of MML estimates revealed that out-of-bounds discrimination parameter estimates were the major culprit behind practical convergence failure in MML. Infinite estimates of item difficulty also appeared, albeit very rarely. A natural question at this point is how Gibbs and

HMC perform when MML offers severely inflated or deflated item parameter estimates. To answer this question, additional simulations for N = 1,000 under both normal and negatively skewed latent trait distributions were conducted. Simulated data sets were selected if they allowed technical convergence in mirt, and no additional constraints were imposed on MML estimates. Model configuration details for MML and MCMC were identical to those described in chapter 3. Results from 50 replications for each condition (Appendices E and F) indicated that when MML technically converged but failed at

5.0), there was once case of anomalous item difficulty parameter estimate of -84.00 by

MML when the generated b value was -3.25, which severely worsened mean bias and

RMSE for its item difficulty estimation. Due to the infinite parameter estimation by

MMl, suggestions to incorporate item priors (i.e., turn MML into BME) have been made and adopted in popular IRT software programs like BILOG and MULTILOG (Rupp,

2003). However, Waller and Feuerstahler’s (2017) investigation of the 4PM estimation with BME indicated that the lack of practical model convergence was still present with

BME, especially at N = 1,000. Our supplemental calibrations with Gibbs and HMC indicated that Bayesian estimation via MCMC simulation can serve as a more viable alternative for IRT practitioners when MML and Bayesian analytic algorithm do not work well for item parameter estimation.

A characteristic of the Markov chain simulation in this study was the employment of informative priors for all item parameters, which was found to be vital for Markov chain convergence. In our pilot simulations, rstan gave warnings about a large number of divergent transitions, which at times went up to 10,000, even when only the d parameter had the uniform prior U(0.60, 1) and informative priors were used for the other item parameters. Essentially, divergent transitions mean that rstan is having difficulty sampling thoroughly from the posterior region (Stan Development Team, 2018b). When divergence happens in rstan, model calibration results from other MCMC mechanisms should be questioned as well, because the Markov chains are unlikely to converge to the posterior distribution (B. Goodrich, personal communication, November 19, 2019). When informative priors were imposed on all item parameters, the vast majority of calibrations completed well, and rstan gave warnings for only a small number of replications. The number of divergent transitions in those calibrations with warnings ranged from several to a couple dozen, which is small compared to 120,000 iterations after burn-in (or warm-up, as it is referred to in rstan). Given the significant role of informative priors for the item parameters of the 4PM, a relevant question is how to obtain them ahead of actual model estimation. Lord (1986) commented that repeated administrations of parallel test forms to similar test-taker groups allow us to infer appropriate item and person parameter prior distributions. While parallel test forms and frequent administrations are not feasible in many situations, reasonable expectations of an IRT parameter's value range and of the types of distributions which effectively capture it are possible (Baker & Kim, 2004). Note also that when the d parameter was very close to 1, little change in estimates of the a, b, and c parameters was observed in the 4PM compared to the 3PM (see, e.g., Swist, 2015). Therefore, modeling results with traditional IRT models can serve as useful starting points for 4PM calibrations. Of course, the challenge remains for specifying the d parameter prior distribution, because the 4PM has found widespread psychometric application only recently and few preexisting research results are available to inform the choice of its prior. However, recent 4PM applications in educational measurement (e.g., Culpepper, 2016; Sideridis et al., 2016;

Walstad & Rebeck, 2017) appeared to support a common-sense approach of specifying an upper asymptote prior which allows most values to cluster around .80 to 1.00 and makes lower values less and less probable. For psychopathology data modeling as reported in Waller and Feuerstahler (2017) and Waller and Reise (2010), the d parameters in the

Depression and Cynicism data fitted this general anticipation well, whereas the Low Self-

Esteem data had many d parameters below .70 and might require close collaboration between methodologists, psychometricians and content experts, as suggested by König and van de Schoot (2018), to determine the upper asymptote prior. Substantive issues aside, the use of the uniform prior U(0,1) for the d parameter is actually unreasonable, as it endorses an implicit assumption that upper asymptote values of .10 and .90 are equally probable. While the uniform prior appears to impose no constraint on the d parameter when no prior information is present, our study showed that it is unnecessary and indeed a statistical luxury researchers cannot afford, given the Markov chain convergence concerns.
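To make the contrast concrete, the following R sketch compares the flat U(0,1) prior with a hypothetical beta prior that concentrates mass near .80 to 1.00; beta(17, 3) is an illustrative choice, not the prior used in this study:

```r
# Contrast a flat U(0,1) prior for d with an informative beta prior
# (beta(17, 3), mean .85, is a hypothetical illustration).
curve(dbeta(x, 17, 3), from = 0, to = 1, xlab = "d", ylab = "density")
curve(dunif(x, 0, 1), add = TRUE, lty = 2)  # uniform: .10 and .90 equally probable
qbeta(c(.025, .975), 17, 3)                 # bulk of the beta prior mass at high d
```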

The use of the multivariate potential scale reduction factor, MPSRF, was found to be highly useful in evaluating the Markov chain convergence state in this study. In the pilot studies with the uninformative prior U(0, 1) for the d parameter, the PSRF quickly went down below 1.1 for all parameters in both HMC and Gibbs with about 25,000 iterations, whereas its multivariate counterpart, MPSRF, remained well above the chosen threshold even when chain length was increased up to 100,000 iterations. The message conveyed by MPSRF was consistent with the warnings of divergent transitions in rstan: the MCMC algorithm was struggling to sample from the joint posterior distribution, and the

Markov chains did not stabilize or mix well in a stationary state. Only when a more informative prior for the d parameter was employed did MPSRF reduce to close to 1.0, which again resonated with the fact that no divergent transition warning was issued by rstan. These findings about the power of MPSRF corroborated Sinharay’s (2004) report on its superiority to PSRF in monitoring Markov chain convergence to the posterior.

Given that MCMC simulates random samples from the multivariate (joint) posterior distribution of all IRT parameters, the necessity of diagnosing Markov chain convergence to the multivariate posterior distribution is self-evident. As there have been complaints about the performance of PSRF in detecting the lack of MCMC convergence with the 4PM (Waller & Reise, 2010), researchers wishing to explore modeling data with the 4PM via MCMC might want to add MPSRF to their convergence diagnostic toolbox.
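For readers wishing to adopt this practice, both diagnostics are available from the coda package; the sketch below uses toy stand-in chains rather than this study's output:

```r
# PSRF and MPSRF via coda::gelman.diag (generic sketch; the two toy
# chains stand in for parallel Markov chains from an actual calibration).
library(coda)
chain1 <- mcmc(matrix(rnorm(2000), ncol = 2, dimnames = list(NULL, c("a1", "b1"))))
chain2 <- mcmc(matrix(rnorm(2000), ncol = 2, dimnames = list(NULL, c("a1", "b1"))))
gd <- gelman.diag(mcmc.list(chain1, chain2), multivariate = TRUE)
gd$psrf    # univariate PSRF, one row per parameter
gd$mpsrf   # multivariate PSRF for the joint posterior
```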

The present study offers several contributions to the IRT modeling literature.

Theoretically, the results of this study enrich our understanding of the 4PM, a model whose literature is still in its infancy, and bring to light the relative merits of three estimation methods, one from the frequentist approach (MML) and two from the fully Bayesian framework (Gibbs and HMC), in characterizing data with the 4PM under several typical assessment conditions. Practically, the findings of the study can be taken together with the results of other research to serve as estimator selection and sample size requirement guidelines for measurement practitioners as they employ four-parameter modeling in their unique contexts. In particular, this study incorporates the investigation of HMC sampling, an estimation method which possesses high potential

4PM estimation has not been explored in previous research, new understanding of the properties of this estimator will open up a new research trajectory for researchers interested in the 4PM and a new estimation possibility for IRT users simultaneously. In addition, it is well-known in the IRT literature that when MML produces inadmissible estimates, Bayesian methods could become alternative modeling techniques, yet no empirical investigations have looked into the quality of parameter recovery by Bayesian methods when date pose a challenge to MML. The follow-up findings in this research revealed Bayesian computations via Gibbs and HMC continued to perform well and stably when MML failed, thereby providing evidence for the appropriate use of two

MCMC methods in difficult modeling situations. These findings were particularly important and highly applicable to studies such as Barnard-Brak et al.’s (2018) analysis of the PISA mathematics test items under the 4PM with the U.S student sample, which reported 24 out of 76 items with the discrimination parameters larger than 5.0 for two student groups based on their opportunities to learn (though not always the same item for two groups). Some of the item discrimination estimates in Barnard-Brak et al.’s (2018) study even exceeded 20, an unreasonable value for the a parameters. The same data could be reexamined with Gibbs and HMC to fix these implausible estimates and verify the substantive conclusions. Finally, the curious RMSE increase together with larger sample size by MCMC under the skewed latent distribution was further explored and results revealed that sampling adequacy in the upper tail of the latent scale to guide the 4PM parameter estimations is just as important as the amount of data for calibrations. These 183 findings suggest useful strategies to combat the harmful effects of the mismatch between prior distributions set by IRT modelers and the actual underlying data generating distributions.

Calibration Time

Needless to say, MML was the clear winner over the two MCMC methods regarding efficiency. Even with a sample of up to 10,000 cases, MML completed the estimation task within a matter of seconds and did not require large computer memory for its execution. However, because Gibbs and HMC both offered smaller total errors in item parameter estimation than MML across all conditions under investigation and continued to work well when MML produced out-of-bounds estimates, these two MCMC mechanisms can serve as go-to methods for IRT users when MML is unfit. As Gibbs and

HMC yielded similarly accurate parameter recovery for the 4PM, what matters to practitioners in making a sound choice of MCMC methods is how efficient these two methods are comparatively. In this study, the MCMC simulations with N = 1,000 and

2,500 were conducted on four office computers with differing computing power, whereas all model estimations with N = 5,000 and some calibrations with N = 2,500 were performed on a supercomputer system using batch mode; hence the difficulty in making a generalizable estimate of calibration time even within the model features investigated here. Nevertheless, it was easy to recognize that under optimal conditions,

HMC would execute the 4PM estimation much faster than Gibbs. For example, at N =

1,000 and normal θ, a Core i7 desktop computer with 16 GB of memory handled a data set within three and a half hours with HMC but had to spend up to seven hours and 25 minutes with Gibbs sampling on average. The superior computational speed of HMC was not always available, unfortunately, because the compiled Stan program would not execute at maximum speed and instead produced an error message when sample size increased past a certain level. At N = 2,500, the optimal function in rstan had to be turned off on some office Windows desktop computers so HMC could progress, albeit more slowly. In such situations, rstan still maintained its gain over runjags, although its advantage appeared to shrink: rstan calibrated one data set of 2,500 cases under normal θ in 13.5 hours, whereas runjags handled the same data set in 18.8 hours on the same workstation. Before the reader embraces HMC without question in anticipation of future applications of the 4PM, it is worth pointing out that rstan installation was more complicated than runjags, and rstan execution was also more complex: at times, rstan unexpectedly terminated midway due to computer memory limits, and at other times it required random seed adjustments to improve the poor progress of some chains.

The use of supercomputer resources proved to be an effective solution to the computing power problem in this study when sample size increased to 5,000 cases. At this largest sample size under investigation, none of the available office computers managed to complete one pilot calibration within a week with either HMC or Gibbs. In the supercomputer system with a Dell Intel Xeon E5-2680 v4 machine, a node with 28 cores and 128 GB of memory was requested to execute the 4PM estimation, which took approximately 15 hours on average in both Gibbs and HMC to calibrate one data set with 5,000 cases. Unfortunately, the HMC optimal function could not be activated in the Linux operating system of the supercomputer due to certain technical constraints beyond users' control.

Practical Implications

The analytical results in the present inquiry prompted several recommendations for applied researchers utilizing the four-parameter IRT model. Like Waller and

Feuerstahler’s (2017) report on BME-EAP, this study suggested that if the overarching purpose of model calibration is recovery of trait scores, both MCMC and MML-EAP could be employed for similarly accurate θ estimates with samples of as few as 1,000 respondents. If accurate recovery of item parameters is also targeted, which many studies undoubtedly aim for, issues of latent trait distribution and sample size must be taken into account to inform choices of estimation methods. Based on the follow-up simulation results when the θ normal assumption was reasonably satisfied, it is recommended that

MML should be employed when sample size reaches 10,000 cases to ensure acceptable accuracy in item parameter recovery. While the obvious advantages of MML include high-speed calibration and no need to specify item priors, which is convenient given the rather novel applications of the 4PM in many settings, our simulations showed that even at N = 10,000 cases, practical model convergence with MML still failed about 8% of the time. In such circumstances, MCMC appears to be a viable alternative for 4PM estimation. Either Gibbs or HMC can be selected to estimate the 4PM item parameters with as few as 1,000 examinees, provided that useful information is available in the form of informative priors. To reiterate the discussion above, even when 4PM applications are relatively new, repeated administrations of parallel test forms to similar respondent populations, estimation results from the traditional IRT models (Rasch/1PM, 2PM and 3PM) which the 4PM builds upon, expert judgement, and reasonable expectations of a parameter's value range can offer valuable prior information to aid MCMC approximation of the posterior. HMC is preferred of the two MCMC approaches for small to moderate samples due to its similarly accurate model parameter recovery but higher estimation speed under optimal conditions and its superior built-in mechanism in rstan to detect non-convergence.

As already explained in chapter 3, a common Markov chain length and burn-in are necessary in simulation studies because the total number of data sets and parameters to handle is large and it is impractical to adapt the number of iterations to each particular estimation scenario. In practice, applied researchers typically examine fewer data sets and parameters; therefore, it is recommended that the Markov chain length and burn-in segment be tailored to individual response data sets and measurement conditions.
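A minimal runjags sketch of this recommendation follows; the burn-in and sample lengths are placeholders to be tuned against the diagnostics discussed next, and `y` is an assumed N × J binary response matrix.

```r
# A sketch of tailoring chain length to one data set; values are illustrative.
library(runjags)
fit <- run.jags(model    = jags_4pm,   # model string from the sketch above
                monitor  = c("a", "b", "c", "d"),
                data     = list(y = y, N = nrow(y), J = ncol(y)),
                n.chains = 4,
                adapt    = 1000,
                burnin   = 5000,       # lengthen if chains have not settled
                sample   = 10000)
# If the diagnostics below are unsatisfactory, draw additional samples:
fit <- extend.jags(fit, sample = 10000)
```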

Moreover, results from this study suggested that the MPSRF for all item parameters should be used in tandem with the PSRF for each model parameter in monitoring MCMC convergence to the joint posterior distribution. In addition to numerical means such as the MPSRF and PSRF, visual means such as trace plots, which show the stability and mixing of parallel chains, are convenient in actual measurement practice and are highly recommended. After all, MCMC convergence to the posterior region can never be verified with complete certainty; regarding techniques to evaluate convergence, one should therefore be willing to use more evidence, not less.
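As a concrete illustration, the coda package computes both diagnostics from the same output, assuming `fit` is the runjags object sketched earlier; the parameter names passed to the trace plot are illustrative.

```r
# PSRF, MPSRF, and trace plots for the runjags fit sketched above.
library(coda)
chains <- as.mcmc.list(fit)              # convert runjags output to coda
gd <- gelman.diag(chains, multivariate = TRUE)
gd$psrf                                  # per-parameter PSRF (flag values > 1.1)
gd$mpsrf                                 # multivariate PSRF for the joint posterior
traceplot(chains[, c("a[1]", "b[1]")])   # visual check of stability and mixing
```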

When there are reasons to believe that departure from θ normality is present, for example with test scores obtained from academically at-risk or gifted students, or samples drawn from a non-normal latent trait population, neither MML nor MCMC offered an optimal solution. In this study, only a negatively skewed latent trait was explored, and MCMC remained the better option, with smaller total estimation error of item parameter recovery than MML for small to moderate samples. For MML, the increase from 5,000 to 10,000 cases under negatively skewed θ brought little noticeable improvement in estimation accuracy, suggesting that sample size increase alone does not remedy estimation errors when the normal latent trait assumption is violated. In a like fashion, the worrying trend of increased RMSE in difficulty and upper asymptote estimation, as larger amounts of data failed to compensate for the declining effect of the prior distributions, implies that Gibbs and HMC should not be construed as a panacea under a skewed latent trait: simply adding more data of the same type is no silver bullet for improved accuracy. Further analyses revealed that a negatively skewed θ distribution lacking very high ability respondents harmed not only MCMC recovery of very large b parameters but also MCMC estimation of b parameters in the middle range of the latent scale, which could have severe consequences in measurement practice such as adaptive item delivery in computerized adaptive testing. When data provide little information to accurately estimate the upper asymptote and difficulty parameters of the 4PM, one strategy to consider is to collect more data to fill the empty locations in the upper tail of the latent trait spectrum to aid accurate recovery of these parameters.

Because accurate item parameter recovery in IRT depends on a sample that is both large and heterogeneous (Hambleton & Jones, 1993), both the quantity and the quality of data matter. After all, our model is only as good as the data we feed into it. Unfortunately, even when more response data toward the right end of the latent trait scale were available, the follow-up simulation revealed that a negatively skewed θ distribution still introduced considerably more item parameter estimation error than a normal θ distribution, given the same normal prior specification. A sample larger than those investigated in this study might be needed for accurate recovery of the d parameter by MCMC under a negatively skewed latent trait.

Recommendations for Future Research

Some promising avenues for future exploration suggested themselves from the design and results of this study. Scale length, an important factor for accurate IRT model parameter recovery, was held fixed at 20 items while sample size and latent trait distribution were varied. In addition, the parameter values selected for data generation reflect a typical educational cognitive assessment more than psychological measurement, which often involves higher item difficulty parameters, as in psychopathology (Feuerstahler & Waller, 2014; Waller & Feuerstahler, 2017; Waller & Reise, 2010). Another deviation from latent trait normality, positive θ skewness, is entirely possible in educational and psychological measurement but was not investigated in this study. The findings therefore hold for the measurement conditions under examination and should not be overgeneralized to scenarios outside the confines of the study design, such as longer tests or administrations of clinical psychology scales. Future inquiries into how test length variations, different underlying item parameters, and latent distributional characteristics influence the accuracy of parameter recovery are warranted.

Second, it is important to understand how the specification of MCMC priors affects 4PM parameter estimation. In the present study, Gibbs and HMC simulations were configured with informative and useful priors. Even when the true underlying θ distribution was skewed, the normal θ prior was still helpful because it was centered at the same value as the parent θ distribution; better estimates by MCMC were therefore expected. Of course, prior informativeness and usefulness admit of degrees and can be adjusted by changing the parameters of the prior distributions. A sensitivity analysis examining whether different levels of useful knowledge incorporated in the priors for the 4PM parameters have a considerable impact on the substantive conclusions regarding the merits of these two MCMC methods would be an interesting line of research.

Third, the unsatisfactory performance of both MML and MCMC when the θ normality assumption was violated, and the θ prior thus mismatched the true distribution, necessitates further research into more robust methods for 4PM estimation under such circumstances. The non-parametric Bayesian estimation approach offers one such alternative for parameter recovery in the context of non-normal θ. The literature on non-parametric Bayesian methods is still in its infancy, but pioneering research has demonstrated more accurate item and person parameter recovery for the binary Rasch model than MML and MCMC, among other methods, in the presence of a skewed latent trait (Finch & Edwards, 2016). The non-parametric Bayesian approach is an intriguing development in Bayesian inference and a promising method for 4PM estimation upon departure from a normal latent trait distribution. Future research could examine this newer member of the Bayesian framework for modeling response data with the 4PM when latent trait normality cannot reasonably be assumed.

Finally, it is imperative to provide a formal proof of identifiability for the 4PM, as Culpepper (2016) also pointed out. Identifiability is necessary for meaningful interpretation of an IRT model's parameters, and future interest in and useful applications of the 4PM may be hampered until such a significant piece of research arrives.

Frequentist versus Bayesian Philosophies and Applications: A Commentary

This section consists of two parts. Part one summarizes the philosophical and practical differences between the frequentist and Bayesian approaches to data analysis, presents justifications for the comparison between Bayesian and frequentist estimates, and discusses the usefulness of both Bayesian and frequentist conceptions of probability despite their differences. In part two, applications of Bayesian estimation in relation to the test fairness issue are discussed, and recommendations on critical considerations of modeling purpose and ethics when employing Bayesian methodology are offered.

First, it is well known, even among novice learners of Bayesian methods, that Bayesian and frequentist approaches differ philosophically, and the core difference concerns how the concept of probability is defined (Gelman et al., 2013; O'Hagan, 2008; VanderPlas, 2014). In the frequentist view, probability denotes how often a certain value appears in repeated sampling. For example, to address the question of the probability of heads when a coin is tossed, the frequentist runs a long sequence of tosses and measures the ratio of the number of times heads appears to the number of times the coin is tossed. Provided that the coin is fair and each toss is performed identically and independently of the others, if the coin is tossed 10,000 times, heads will likely appear about 5,000 times, hence the .50 probability of heads. In the frequentist sense, probability is tied to the frequency of the outcome of interest in repeated measurements. By contrast, in Bayesian analysis, probability expresses the degree of uncertainty about an event. For questions such as the number of coronavirus infections in the U.S. by the end of 2020, Bayesian analysts can make probability statements such as "there is a 50% probability that the number of confirmed coronavirus cases in the U.S. by the end of 2020 is between fourteen million and fifteen million." In this manner, the probability is a statement of the Bayesian analysts' knowledge of the event and how certain or uncertain the analysts are about the true value, and this 50% value is not necessarily tied to a frequency-based construction of probability.
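The frequentist reading can be illustrated with a toy simulation of the coin example above:

```r
# Long-run relative frequency of heads over independent, identical tosses.
set.seed(1)
tosses <- rbinom(10000, size = 1, prob = 0.5)  # 1 = heads, 0 = tails
mean(tosses)  # approaches .50 as the number of tosses grows
```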

This philosophical difference between the frequentist and Bayesian treatments of probability translates into a fundamental divergence in how the two approaches handle parameter estimation. In the frequentist framework, data are random but parameters are constant. It is impossible in the frequentist sense to make a probability statement about true parameter values, because they are fixed attributes that govern the data generation mechanism and are not repeatable outcomes in a series of experiments. Regarding parameter estimation, frequentists using maximum likelihood (ML) methods ask: what are the parameter values that maximize the probability of obtaining the observed data? Frequentist analysts solve this problem by constructing the likelihood function, P(data|parameters), the conditional probability of the data given the parameters, and seeking the parameter values that maximize the likelihood of the data that have actually been observed. In the frequentist sense, the best parameter estimates are the point estimates that give the highest probability of obtaining the data.

In the Bayesian framework, however, both data and parameters are treated as random (i.e., they come from a probability distribution), which allows analysts to derive inferences about their values with some degree of uncertainty expressed in terms of probability. Bayesians ask a different question in parameter estimation: what is the (joint) probability of the parameter values, given the data? Bayesian analysts solve this puzzle by constructing the probability distribution P(parameters|data), known as the (joint) posterior distribution, by combining information from both collected data (in the form of likelihood function) and prior substantive knowledge (in the form of prior distributions).
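A toy grid approximation makes this mechanic concrete: the posterior for a coin's heads probability after observing 7 heads in 10 tosses is the normalized product of a prior and the likelihood.

```r
# Posterior = prior x likelihood, normalized; a toy coin example.
theta_grid <- seq(0, 1, length.out = 1000)
prior      <- dbeta(theta_grid, 2, 2)                  # mild prior belief
likelihood <- dbinom(7, size = 10, prob = theta_grid)  # P(data | parameter)
posterior  <- prior * likelihood
posterior  <- posterior / sum(posterior)               # normalize on the grid
theta_grid[which.max(posterior)]                       # posterior mode
```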

The ultimate goal in Bayesian estimation is to build the posterior distribution and examine the probabilistic relationship between data and parameters, rather than to arrive at the "best" point estimates of parameter values. Levy and Mislevy (2016) cited an interesting analogy: while frequentist analysts aim to find the highest peak in a mountain range with ML, Bayesians aim to develop a panorama of the entire mountain range via the posterior distribution.

Point and interval estimates of parameter values are usually required in practice, and they are typically extracted from the Bayesian posterior distribution summaries.

However, it should be made clear that point estimates from Bayesian and frequentist analyses give rise to different interpretations, even though they appear similar. For example, the Bayesian posterior mean is interpreted as the posterior expectation of the parameter, and the posterior mode represents the single most probable value given the data (Gelman et al., 2013), whereas an ML point estimate is the value that maximizes the likelihood of observing the data, as explained earlier. The interval estimates from frequentist and Bayesian approaches are also fundamentally different. The frequentist 95% confidence interval (FCI) should be interpreted as "if the data collection procedure is conducted repeatedly many times, the calculated confidence interval will capture the true parameter 95% of the time." However, the 95% Bayesian posterior credible interval (BCI) means that, given the data that have been collected, there is a 95% probability that the true parameter value lies within the range provided by the BCI. The 95% in the FCI refers to a property of the rule used to construct the interval, whereas the BCI allows direct probabilistic statements about the parameter values. Thus, even though the FCI and BCI are both probability statements, the probability refers to different aspects of the interval (VanderPlas, 2014).
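The frequentist reading of the 95% can be verified by simulation: over many repetitions of the same data collection rule, about 95% of the computed intervals capture the fixed true value.

```r
# Coverage of a textbook 95% t-interval for a known true mean.
set.seed(42)
true_mu <- 0
covered <- replicate(5000, {
  x  <- rnorm(50, mean = true_mu, sd = 1)
  ci <- mean(x) + c(-1, 1) * qt(.975, df = 49) * sd(x) / sqrt(50)
  ci[1] <= true_mu && true_mu <= ci[2]
})
mean(covered)  # close to .95
```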

Given that frequentist and Bayesian analyses seek to answer different questions and allow different interpretations, a question suggests itself: are we comparing apples and oranges when juxtaposing ML estimates with Bayesian estimates? In other words, is it legitimate to compare parameter estimates provided by these two approaches, with the field of educational measurement as a case in point? The answer is negative theoretically, but positive practically. Theoretically, it is not possible to compare a single value (frequentist) to a distribution (Bayesian), and even when a single value is picked from the Bayesian posterior distribution, it does not carry the same meaning as the frequentist point estimate. Similarly, the FCI and BCI do not have the same theoretical interpretation and should not be placed on the same metric for comparison.

However, in measurement practice, frequentist and Bayesian estimates are treated as if they were comparable in meaning and use. ML estimates support deductive inferences via the likelihood P(data|parameters), which means analysts reason from the general (e.g., examinees' latent trait) to the particular (e.g., test scores). In contrast, the Bayesian mechanism allows analysts to reverse this direction by constructing the posterior distribution P(parameters|data) and deriving inductive inferences from particular test scores to the general trait. In assessment, psychometricians are interested not in the particular test scores of individuals per se, but in the attribute underlying their performance more broadly conceived (Levy & Mislevy, 2016), and only Bayesian estimation allows this inference. Nevertheless, ML point estimates of latent trait tend to be interpreted as representations of examinees' underlying abilities conditional on their item responses, even though these estimates derive from the frequentist framework. Similarly, the FCI tends to be misinterpreted in the Bayesian manner (e.g., the 95% FCI is read as a 95% probability that the true parameter lies within the range), a tendency which implies that Bayesian inference is what researchers would actually like to derive (Wagenmakers et al., 2008).

Bayesian analytical results, in turn, tend to be treated as frequentist point estimates while the range of information they offer is often ignored. For example, item parameters estimated with Bayesian methods tend to be treated as quantities known with certainty in the person scoring phase. Before we dispute this impoverishment of Bayesian inference to single point estimates, it is important to note that point estimates are not only necessary for concise communication of results but also facilitate common psychometric work such as item/task selection, construction of parallel test forms, and test score equating (Levy & Mislevy, 2016). The practice of treating item parameters as fixed in person scoring is not without consequences, such as underestimation of posterior uncertainty in person parameter estimates, yet it is typically considered acceptable within the resource constraints and demands of operational assessment work (Levy & Mislevy, 2016). Some scholars (e.g., Kim, 2007) support the examination of item calibration uncertainty to avoid overconfidence in person scoring. Fortunately, the fully Bayesian framework allows us to compare the results of including versus ignoring item parameter estimation uncertainty to avoid dire consequences in the person scoring phase (Levy & Mislevy, 2016).

While Bayesian and frequentist conceptions of probability are fundamentally different, which leads to differences in parameter estimation and interpretation, many scholars agree that both approaches offer useful properties in statistical modeling. Rubin (1983) argued that strong data analyses must have both Bayesian and frequentist characteristics: they are based on sound logic to bring fair summaries of data under explicitly formulated assumptions (Bayesian), and the inferences must stand the test of long-run frequency under similar measurement conditions (frequentist). Lukan (2020) made a similar point that both Bayesian and frequentist interpretations of probability have value because they cover different facets of the same epistemological process. The present examination of the quality of Bayesian estimates across replications, like many studies in the Bayesian literature, follows this line of reasoning: we seek to understand whether Bayesian probabilistic statements about parameters have the desirable frequentist properties in the long run. In this sense, our research navigates the opposing conceptual frameworks that inform frequentist and Bayesian estimation and embraces their contradictions to illuminate the dual features of Bayesian statistical inference. Take the statement "the 95% HDI captures the true θ parameter 95% of the time" from this study as an example. This statement employs two different conceptions of probability. The 95% in "95% HDI" reflects the view of probability as a degree of belief about the true parameter given the data, whereas the 95% in "95% of the time" denotes probability as the relative frequency of the phenomenon of interest over repeated trials. The juxtaposition of these two views makes the nature of my simulation research clear: this study tests the accuracy of subjective probability against objectively verifiable probability. The idea is not to blur the conceptual difference between the two views of probability, but to tease out a useful aspect of each.

Because Bayesian methods offer psychometricians the inductive inference they seek, and provide many advantages over the frequentist approach, such as improved accuracy in small sample calibrations and the use of knowledge from previous test administrations to inform current calibrations (see the review in chapter 3), should educational measurement professionals employ Bayesian methods in all aspects of measurement work? The answer is that the use of the Bayesian approach in operational psychometric work can be beneficial but is not problem-free. The following part reviews the issues associated with Bayesian analysis, especially regarding test fairness, and offers recommendations for critical applications of Bayesian methods in educational assessment.

First, the application of Bayesian methods has helped analysts tackle issues that have traditionally eluded effective quantitative analysis because of small sample sizes. For example, Rubin (1983) demonstrated the utility of Bayesian methods in providing evidence for different equations to predict first-year average in business schools for black and white students, avoiding overprediction for black students caused by data sparsity. In psychometrics, differential item functioning (DIF) is usually examined to identify test items that do not work in the same manner for comparable groups of test-takers and require revision or removal. Detection of items that exhibit DIF is a necessary condition for test validity and fairness evaluation (AERA, APA, & NCME, 2014; Santelices & Wilson, 2010). Research has found that Bayesian methods flag DIF items less frequently, have a higher percentage of correct identification, and yield smaller RMSE than traditional DIF detection methods, especially in small samples (Zwick et al., 1999; Sinharay et al., 2009). Because social justice is primarily framed in terms of test fairness in educational measurement (Mislevy, 2018), modelers contribute to the social justice project through their analysis of statistical bias against examinees with socio-cultural and personal characteristics unrelated to the construct measured.

Interestingly, the very use of Bayesian inference in the detection of DIF has raised concerns among researchers (Camilli, 2006; Camilli et al., 2013). Given that shrinkage estimation, of which Bayesian estimation is an example, reduces the absolute level of DIF in an item and leads to fewer DIF cases being identified, Camilli et al. (2013) questioned whether the use of Bayesian methods aligns with the purpose of DIF analysis as the first step in detecting potential item bias. Here, Bayesian inference inflates type II errors (false negatives, or failing to detect DIF when DIF is indeed present) by minimizing type I errors (false positives, or flagging an item as having DIF when DIF is not present). Researchers usually attach more importance to avoiding type I errors, but the situation is more complex in DIF research. When placed in the broader context of test fairness, the cost of type II errors would be high for examinees when bias is present but not identified. Type I errors, however, would likely place a burden on test developers and psychometricians, because test development and scoring are affected when items are removed (Camilli, 2006). Thus, the trade-off between decreasing type I errors and increasing type II errors in DIF analysis brought about by Bayesian estimation might benefit test development work, yet it does not appear to establish the fairness argument very well.

Another hypothetical example, the nature of which was recognized in the earliest days of Bayesian inference in psychometrics (Novick, 1970), makes the issue of fairness even more conspicuous. Assume it is possible to divide the examinees who take the Test of English as a Foreign Language (TOEFL) into two groups based on their targeted educational level: examinees who transfer from a high school in a non-English speaking country to a high school in the U.S., and examinees who use the English proficiency certificate to apply for an undergraduate or graduate program. Suppose previous TOEFL administrations reveal that the first group tended to have a lower mean latent score than the second group; psychometricians therefore model the test scores of the two groups with different prior distributions. In this manner, for two candidates with the same TOEFL raw score but different group memberships, a Bayesian analysis might offer two different posterior distributions of the latent trait score, because the college/graduate school applicant is associated with the population that has the higher mean latent score and might have their ability estimate pulled closer to that group mean. The same scenario is possible whenever other background variables of test-takers carry information about their latent trait and are taken into consideration in person scoring. This modeling practice adheres to the Bayesian principle of incorporating prior knowledge in parameter estimation, yet the results do not appear to uphold the educational and psychological testing fairness principles (AERA, APA, & NCME, 2014) and can raise ethical and legal questions (Phillips & Camara, 2006). Interestingly, the same shrinkage property of Bayesian estimation can benefit measurement accuracy in other contexts. For example, few people would dispute that a woman in her twenties has a lower risk of breast cancer than a woman in her fifties, even when they have the same mammogram result.

How should we read situations in which measurement modelers employ Bayesian methods yet ignore the range of information that Bayesian analysis can offer by using only point estimates, or neglect useful prior information in estimation? The Bayesian approach offers a powerful analytical instrument and a way to organize our thinking about the world. As such, like any other analytical and epistemological tool, it should be placed in the broader context of the measurement field and employed with practical and ethical considerations. Camilli et al. (2013) rightly commented that the statistical properties of an estimation method alone do not suffice to warrant its application. Levy and Mislevy (2016) pointed out that it is necessary for our model to accommodate our purpose in addition to reflecting our belief, because tests serve different purposes in different contexts. The purpose of medical testing tends to be accuracy rather than contest; in this sense, it is necessary to consider patient-specific variables such as gender and family history to arrive at an accurate diagnosis. In contrast, when testing is viewed as a contest, which is typical in education, one examinee's results have consequences for other examinees, because high-stakes decision making, such as award offers or program admission, tends to be based on the ordering of examinees. When the purpose of educational testing is a contest among test-takers, the need to observe the test fairness principle overrides the need to include the psychometrician's prior knowledge in modeling the test data. With the increasing popularity of the Bayesian approach in psychometrics, the complex dynamics between what Bayesian analysis can offer and how analysts use the Bayesian advantage for their modeling purposes, within the contextual characteristics of their professional operation, will continue to unfold and need to be assessed critically.

Ethical Reflections

It was not necessary to obtain research ethics approval for the current study because its data were simulated with computer software rather than collected from human participants. However, the absence of a formal ethical review by no means equals the absence of ethical considerations from conception to completion of this study. Ören (2000) argued that ethical questions are relevant in simulation studies because simulation research has consequences and implications for human judgment and decision making. This argument is particularly apt for IRT as a dominant analytical framework in educational and psychological assessment, since its modeling results will likely form the basis of decision making with respect to human health and well-being, interventions and treatment, opportunities and access to higher education, and educational policies, to name just a few possible high-stakes uses of IRT analytical outcomes. Drawing on the Ethical Guidelines for Statistical Practice of the American Statistical Association (2018), the Code of Professional Ethics for Simulationists (Ören et al., 2002), and Gelman's (2018) recommendations for ethical practice and communication in statistics, I present my ethical reflections on my dissertation research process in the following section, with a focus on explicit assumptions, transparency of research design and reporting, reproducibility of results, and handling of readers' feedback.

As transparent assumptions constitute an important basis of sound statistical practice (American Statistical Association, 2018), the need to be clear about what information enters a statistical analysis is even more pronounced in Bayesian computation, because background knowledge in the form of prior distributions is directly involved in deriving the posterior inferences (Gelman, 2018). In this study, the reasons for using informative prior densities for the item parameters were made explicit, especially for the two asymptote parameters, including (1) the soundness of the assumptions about the range of possible values of these parameters and (2) their important roles in Markov chain convergence. While the substantive assumptions are included in the model estimation, it must be acknowledged that determining the extent to which previous research can inform a current investigation, and choosing the statistical form to quantify prior knowledge, are challenging and to some extent subjective (König & van de Schoot, 2018). The strong performance of the two MCMC methods in this study can in part be attributed to the useful, informative priors. Therefore, I discourage overgeneralization of the findings of this study to circumstances in which little useful information is available to the modeler. I am aware that sensitivity studies examining the effects of less informative priors, or other ways to express prior knowledge, are clearly needed.

The research design and reporting were also made as transparent as possible. Clear descriptions of the mixed factorial design, with two between-subjects factors (sample size and latent distribution) and one within-subjects factor (estimation method), were provided, together with details of data generation and parameter estimation. One modification to the study design warrants clarification. When the study was initiated, test length was included as a third between-subjects factor with three levels (20, 40, and 60 items). However, it was later realized that the available computing power was insufficient to examine the effect of test length; for example, an increase in test length from 20 to 40 items would consume three times more computational resources than an increase in sample size from 1,000 to 2,500 cases using HMC. In order for the study findings to serve as guidelines for applied researchers, priority was given to investigating sample size and latent trait distribution, because (1) measurement professionals typically have more control over sample size than test length, (2) previous studies have pointed out the important effects of latent trait normality on estimation accuracy (e.g., Sass et al., 2008), and (3) sample size requirements can vary depending on whether the normal latent trait assumption is violated. I acknowledge the delimitations of my study and support the ethical practice of limiting generalization of the findings to rather short tests of 20 items and avoiding overclaimed inferences outside the conditions under examination. In addition, results were presented in both numerical and graphical forms with clear labels denoting the simulation conditions under which they were summarized. While mean bias and RMSE were the main outcomes of interest, other analytical results, including the correlation between generated and estimated parameters, standard errors, and coverage proportions of the confidence intervals/highest density intervals, were also provided to aid readers' judgment of the validity of inferences.

As the author of this study, I made every attempt to ensure reproducibility of the results. The R code used in this study, including the main simulation and the follow-up examination, is available in Appendices G–M. The necessary information about the R software version, R packages, and random seeds used to generate data is provided so that interested readers can reproduce the findings at their convenience. Examinations of the replicability of the study are also highly encouraged: the reader can try different random seeds to investigate whether substantive conclusions change with the randomness in the data generation and Markov chain sampling process. Given the nature of statistical simulations, a small programming mistake can have serious consequences for results and interpretations. Therefore, the reader's close scrutiny to identify any possible problems in the simulation code is appreciated, and questions should be addressed directly to the researcher.
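In that spirit, a reader wishing to reproduce or probe the findings would, at minimum, record the computing environment and control the seed before generating data; the seed value shown is only an example.

```r
# Record the computing environment and fix the seed for reproducibility.
sessionInfo()       # R version and loaded package versions
set.seed(20210401)  # example seed; vary it to probe replicability
```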

Finally, ethical practice is more than adherence to guidelines. Because ethical considerations continue to evolve as the field of statistical simulation develops, conversations surrounding simulation research are necessary to push the frontiers of ethical practice forward. While conducting this study, I had to balance desirable best practices against resource constraints, such as in determining the number of replications and the sample size levels to investigate. Criticisms, especially those concerning the rigor, transparency, and methodological choices made in this study, which likely have ethical implications, are therefore highly welcome and will be addressed in a timely manner.


References

Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using

Gibbs sampling. Journal of Educational Statistics, 17(3), 251–269.

https://doi.org/10.3102/10769986017003251

Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response

data. Journal of the American Statistical Association, 88, 669–679.

https://doi.org/10.2307/2290350

American Educational Research Association, American Psychological Association,

National Council on Measurement in Education, Joint Committee on Standards

for Educational and Psychological Testing (U.S.). (2014). Standards for

educational and psychological testing. AERA.

American Statistical Association. (2018). Ethical guidelines for statistical practice.

https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-

Practice.aspx

Ames, A., & Au, C. H. (2018). Using Stan for item response theory models.

Measurement: Interdisciplinary Research and Perspectives, 16(2), 129–134.

https://doi.org/10.1080/15366367.2018.1437304

Baker, F. B. (1998). An investigation of the item parameter recovery characteristics of a

Gibbs sampling procedure. Applied Psychological Measurement, 22(2), 153–169.

https://doi.org/10.1177/01466216980222005

Baker, F. B. (2001). The basics of item response theory (2nd ed.). ERIC Clearinghouse

on Assessment and Evaluation. 206

Baker, F. B., & Kim, S-H. (2004). Item response theory: Parameter estimation

techniques (2nd ed.). Dekker.

Barnard-Brak, L., Lan, W. Y., & Yang, Z. (2018). Differences in mathematics

achievement according to opportunity to learn: A 4PL item response theory

examination. Studies in Educational Evaluation, 56, 1–7.

https://doi.org/10.1016/j.stueduc.2017.11.002

Barton, M., & Lord, F. (1981). An upper asymptote for the three-parameter logistic

model. (Research Report No. 81-20). Educational Testing Service.

Béguin, A. A., & Glas, C. A. W. (2001). MCMC estimation and some model-fit analysis

of multidimensional IRT models. Psychometrika, 66(4), 541–561.

https://doi.org/10.1007/BF02296195

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s

ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test

scores (pp. 395–479). Addison-Wesley.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are

scored in two or more nominal categories. Psychometrika, 37(1), 29–51.

https://doi.org/10.1007/BF02291411

Bock, R. D. (1997). The nominal categories model. In W. J. van der Linden & R. K.

Hambleton (Eds.), Handbook of modern item response theory (pp. 33–49).

Springer. 207

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item

parameters: An application of an EM algorithm. Psychometrika, 46(4), 443–459.

https://doi.org/10.1007/BF02293801

Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously

scored items. Psychometrika, 35(2), 179–198.

https://doi.org/10.1007/BF02291262

Bratley, P., Fox, B. L., & Schrage, L. E. (1987). A guide to simulation (2nd ed.).

Springer-Verlag.

Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of

iterative simulations. Journal of Computational and Graphical Statistics, 7(4),

434–455.

Brooks, S. P., & Roberts. G. O. (1998). Convergence assessment techniques for Markov

chain Monte Carlo. Statistics and Computing, 8, 319–335.

https://doi.org/10.1023/A:1008820505350

Bulut, O., & Sünbül, Ö. (2017). Monte Carlo simulation studies in item response theory

with the R programming language. Journal of Measurement and Evaluation in

Education and Psychology, 8(3), 266–287. https://doi.org/10.21031/epod.305821

Cai, L. (2010a). High-dimensional exploratory item factor analysis by a Metropolis-

Hastings Robbins-Monro algorithm. Psychometrika, 75(1), 33–57.

https://doi.org/10.1007/s11336-009-9136-x 208

Cai, L. (2010b). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item

factor analysis. Journal of Educational and Behavioral Statistics, 35(3), 307–335.

https://doi.org/10.3102/1076998609353115

Camilli, G. (2006). Test fairness. In R. Brennan (Ed.), Educational measurement (4th ed.,

pp. 220–256). Praeger.

Camilli, G., Briggs, D. C., Sloane, F. C., & Chiu, T.-W. (2013). Psychometric

perspectives on test fairness: Shrinkage estimation. In K. F. Geisinger, B. A.

Bracken, J. F. Carlson, J.-I. C. Hansen, N. R. Kuncel, S. P. Reise, & M. C.

Rodriguez (Eds.), APA handbooks in psychology®. APA handbook of testing and

assessment in psychology: Vol. 3. Testing and assessment in school psychology

and education (pp. 571–589). American Psychological Association.

https://doi.org/10.1037/14049-027

Cao, Y., Lu, R., & Tao, W. (2014). Effect of item response theory (IRT) model selection

on testlet-based test equating. (Research Report No. 14–19). Educational Testing

Service.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M.,

Brubaker, M. A., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic

programming language. Journal of Statistical Software, 76(1), 1–32.

http://dx.doi.org/10.18637/jss.v076.i01

Carsey, T. M., & Harden, J. J. (2014). Monte Carlo simulation and resampling methods

for social science. Sage. 209

Chalmers, R., P. (2012). mirt: A multidimensional item response theory package for the

R environment. Journal of Statistical Software, 48(6), 1–29.

http://dx.doi.org/10.18637/jss.v048.i06

Chang, M-I. (2017). A comparison of two MCMC algorithms for estimating the 2PL IRT

models [Doctoral dissertation, Southern Illinois University Carbondale].

OpenSIUC.https://opensiuc.lib.siu.edu/cgi/viewcontent.cgi?article=2450&context

=dissertations

Chen, W-H., Lenderking, W., Jin, Y., Wyrwich, K. W., Gelhorn, H., & Revicki, D. A.

(2014). Is Rasch model analysis applicable in small sample size pilot studies for

assessing item characteristics? An example using PROMIS pain behavior item

bank data. Quality of Life Research, 23(2), 485–493.

https://doi.org/10.1007/s11136-013-0487-5

Cheng, Y., & Liu, C. (2015). The effect of upper and lower asymptotes of IRT models on

computerized adaptive testing. Applied Psychological Measurement, 39(7), 551–

565. https://doi.org/10.1177/0146621615585850

Cole, K. M., & Paek, I. (2017). PROC IRT: A SAS procedure for item response theory.

Applied Psychological Measurement, 41(4), 311–320.

https://doi.org/10.1177/0146621616685062

Cowles, M. K., & Carlin, B. P. (1996). Markov chain Monte Carlo convergence

diagnostics: A comparative review. Journal of The American Statistical

Association, 91, 883–904. 210

Culpepper, S. A. (2016). Revisiting the 4-parameter item response model: Bayesian

estimation and application. Psychometrika, 81(4), 1142–1163.

https://doi.org/10.1007/s11336-015-9477-6

Culpepper, S. A. (2017). The prevalence and implications of slipping on low-stakes,

large-scale assessments. Journal of Educational and Behavioral Statistics, 42(6),

706–725. https://doi.org/10.3102/1076998617705653

Custer, M. (2015, October 21–24). Sample size and item parameter estimation precision

when utilizing the one-parameter “Rasch” model [Paper presentation]. The 2015

Mid-Western Educational Research Association Annual Meeting, Evanston,

Illinois, United States. de Ayala, R. J. (2009). The theory and practice of item response theory. Guilford. de Ayala, R. J., & Sava-Bolesta, M. (1999). Item parameter recovery for the nominal

response model. Applied Psychological Measurement, 23(1), 3–19.

https://doi.org/10.1177/01466219922031130

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood for

incomplete data via the EM algorithm (with discussion). Journal of the Royal

Statistical Society, Series B, 39(1), 1–38. https://doi.org/10.1111/j.2517-

6161.1977.tb01600.x

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists.

Lawrence Erlbaum Associates.

Fan, X., Felsovályi, Á., Sivo, S. A., & Keenan, S. C. (2002). SAS® for Monte Carlo

studies: A guide for quantitative researchers. SAS Institute Inc. 211

Feinberg, R. A., & Rubright, J. D. (2016). Conducting simulation studies in

psychometrics. Educational Measurement: Issues and Practice, 35(2), 36–49.

https://doi.org/10.1111/emip.12111

Feuerstahler, L. M., & Waller, N. G. (2014). Abstract: Estimation of the 4-parameter

model with marginal maximum likelihood. Multivariate Behavioral Research,

49(3), 285–285. https://doi.org/10.1080/00273171.2014.912889

Finch, H., & Edwards, J. M. (2016). Rasch model parameter estimation in the presence of

a nonnormal latent trait using a nonparametric Bayesian approach. Educational

and Psychological Measurement, 76(4), 662–684.

https://doi.org/10.1177/0013164415608418

Finch, H., & French, B. F. (2019). A comparison of estimation techniques for IRT

models with small samples. Applied Measurement in Education, 32(2), 77–96.

https://doi.org/10.1080/08957347.2019.1577243

Fleishman, A. L. (1978). A method for simulating non-normal distributions.

Psychometrika, 43(4), 521–532. https://doi.org/10.1007/BF02293811

Gelfand, E. A., & Smith, A. F. M. (1990). Sampling-based approaches to calculating

marginal densities. Journal of the American Statistical Association, 85, 398–409.

Gelman, A. (2018). Ethics in statistical practice and communication: Five

recommendations. Significance, 15(5), 40–43. https://doi.org/10.1111/j.1740-

9713.2018.01193.x

Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013).

Bayesian data analysis (3rd ed.). Chapman and Hall/CRC. 212

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple

sequences (with discussion). Statistical Science, 7(4), 457–472.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the

Bayesian restoration of images. IEEE Transactions on Pattern Analysis and

Machine Intelligence, 6, 721–741. https://doi.org/10.1109/TPAMI.1984.4767596

Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches in calculating

posterior moments. In J. M. Bernardo, A. F. M. Smith, A. P. Dawid, & J. O.

Berger (Eds.), Bayesian Statistics 4 (pp. 169–193). Oxford University Press.

Geyer, C. J. (2011). Introduction to Markov chain Monte Carlo. In S. Brooks, A. Gelman,

G. L. Jones, & X-L. Meng (Eds.), Handbook of Markov chain Monte Carlo, (pp.

3–48). CRC Press.

Green, B. F. (2011). A comment on early student blunders on computerized-adaptive

tests. Applied Psychological Measurement, 35(2), 165–174.

https://doi.org/10.1177/0146621610377080

Gregory, C. A. (2019). Are we underestimating food insecurity? Partial identification

with a Bayesian 4-parameter IRT model. Journal of Classification.

https://doi.org/10.1007/s00357-019-09344-2

Hambleton, R. K., & Cook, L. L. (1977). Latent trait models and their use in the analysis

of educational test data. Journal of Educational Measurement, 14(2), 75–96.

https://doi.org/10.1111/j.1745-3984.1977.tb00030.x

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item

response theory and their applications to test development. Educational 213

Measurement: Issues and Practice, 12(3), 38–47. https://doi.org/10.1111/j.1745-

3992.1993.tb00543.x

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and

applications. Kluwer Academic.

Harris, D. (1989). Comparison of 1-, 2-, and 3-parameter IRT models. Educational

Measurement: Issues and Practice, 8(1), 35–41. https://doi.org/10.1111/j.1745-

3992.1989.tb00313.x

Harwell, M., Baker, F. B., & Zwarts, M. (1988). Item parameter estimation via marginal

maximum likelihood and an EM algorithm: A didactic. Journal of Educational

Statistics, 13(3), 243-271. https://doi.org/10.3102/10769986013003243

Harwell, M., Stone, C. A., Hsu, T-C., & Kirisci, L. (1996). Monte Carlo studies in item

response theory. Applied Psychological Measurement, 20(2), 101–125.

https://doi.org/10.1177/014662169602000201

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their

applications. Biometrika, 57(1), 97–109.

Hessen, D.J. (2004). A new class of parametric IRT models for dichotomous item scores.

Journal of Applied Measurement, 5(4), 385–397.

Hessen, D. J. (2005). Constant latent odds-ratios models and the Mantel-Haenszel

null hypothesis. Psychometrika, 70(3), 497–516. https://doi.org/10.1007/s11336-

002-1040-6

Ho, A. D., & Yu, C. C. (2015). Descriptive statistics for modern test score distributions:

Skewness, kurtosis, discreteness, and ceiling effects. Educational and 214

Psychological Measurement, 75(3), 365–388.

https://doi.org/10.1177/0013164414548576

Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn sampler: Adaptively setting path

lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research,

15(1), 1593–1623.

Hudson, T. (1991). Relationships among IRT item discrimination and item fit indices in

criterion-referenced language testing. Language Testing, 8(2), 160–181.

https://doi.org/10.1177/026553229100800205

Huggins-Manley, A. C., & Algina, J. (2015). The partial credit model and generalized

partial credit model as constrained nominal response models, with applications in

Mplus. Structural Equation Modeling: A Multidisciplinary Journal, 22(2), 308–

318. https://doi.org/10.1080/10705511.2014.937374

Hughes, H. H., & Trimble, W. E. (1965). The use of complex alternatives in multiple

choice items. Educational and Psychological Measurement, 25(1), 117–126.

https://doi.org/10.1177/001316446502500112

Jiao, H., Wang, S., & He, W. (2013). Estimation methods for one-parameter testlet

models. Journal of Educational Measurement, 50(2), 186–203.

https://doi.org/10.1111/jedm.12010

Junker, B. W., Patz, R. J., & VanHoudnos, N. M. (2016). Markov chain Monte Carlo for

item response models. In W. J. van der Linden (Ed.), Handbook of item response

theory (Vol. 2, pp. 271–312). CRC Press.

Kabacoff, R. (2011). R in action: Data analysis and graphics with R. Manning. 215

Kern, J. L., & Culpepper, S. A. (2020). A restricted four-parameter IRT model: The dyad

four-parameter normal ogive (DYAD-4PNO) model. Psychometrika, 85(3), 575–

599. https://doi.org/10.1007/s11336-020-09716-3

Kieftenbeld, V., & Natesan, P. (2012). Recovery of graded response model parameters: A

comparison of marginal maximum likelihood and Markov chain Monte Carlo.

Applied Psychological Measurement, 36(5), 399–419.

https://doi.org/10.1177/0146621612446170

Kim, S-H. (2001). An evaluation of a Markov chain Monte Carlo method for the Rasch

model. Applied Psychological Measurement, 25(2), 163–176.

https://doi.org/10.1177/01466210122031984

Kim, S-H. (2007). Some posterior standard deviations in item response theory.

Educational and Psychological Measurement, 67(2), 258–279.

https://doi.org/10.1177/00131644070670020501

Kim, J-S., & Bolt, D. M. (2007). Estimating item response theory models using Markov

chain Monte Carlo methods. Educational Measurement: Issues and Practice,

26(4), 38–51. https://doi.org/10.1111/j.1745-3992.2007.00107.x

Kim, K. Y., & Lee, W. C. (2017). The impact of three factors on the recovery of item

parameters for the three-parameter logistic model. Applied Measurement in

Education, 30(3), 228–242. https://doi.org/10.1080/08957347.2017.1316274

Kim, S., & Moses, T. (2016). Investigating robustness of item response theory

proficiency estimators to atypical response behaviors under two-stage multistage

testing. (Research Report No. 16–22). Educational Testing Service. 216

König, C., & van de Schoot, R. (2018). Bayesian statistics in educational research: A

look at the current state of affairs. Educational Review, 70(4), 486–509.

https://doi.org/10.1080/00131911.2017.1350636

Kuo, T.-C., & Sheng, Y. (2016). A comparison of estimation methods for a

multimimensional graded response IRT model. Frontiers in Psychology, 7,

Article 880. https://doi.org/10.3389/fpsyg.2016.00880

Kruschke, J. K. (2015). Doing Bayesian data analysis: A tutorial with R, Jags, and Stan

(2nd ed.). Academic Press.

Levy, R. (2009). The rise of Markov chain Monte Carlo estimation for psychometric

modeling. Journal of Probability and Statistics, 2009, 1–18.

https://doi.org/10.1155/2009/537139

Levy, R., & Mislevy, R. J. (2016). Bayesian psychometric modeling. CRC Press.

Levy, R., Mislevy, R. J., & Behrens, J. T. (2011). MCMC in educational research. In S.

Brooks, A. Gelman, G. L. Jones, & X-L. Meng (Eds), Handbook of Markov chain

Monte Carlo (pp. 531–546). CRC Press.

Liao, W.-W, Ho, R.-G., & Yen, Y.-C. (2012). The four-parameter logistic item response

theory model as a robust method of estimating ability despite aberrant responses.

Social Behavior and Personality, 40(10), 1679–1694.

https://doi.org/10.2224/sbp.2012.40.10.1679

Linacre, J. M. (2004). Discrimination, guessing and carelessness: Estimating IRT

parameters with Rasch. Rasch Measurement Transactions, 18, 959–960.

https://www.rasch.org/rmt/rmt181b.htm 217

Little, E. B. (1962). Overcorrection for guessing in multiple-choice test scoring. The

Journal of Educational Research, 55(6), 245–252.

https://doi.org/10.1080/00220671.1962.10882801

Loken, E., & Rulison, K. L. (2010). Estimation of a four-parameter item response theory

model. British Journal of Mathematical and Statistical Psychology, 63, 509–525.

https://doi.org/10.1348/000711009X474502

Lord, F. M. (1953). The relation of test score to the trait underlying the test. Educational

and Psychological Measurement, 13, 517–548.

https://doi.org/10.1177/001316445301300401

Lord, F. M. (1955). A survey of observed test-score distributions with respect to

skewness and kurtosis. Educational and Psychological Measurement, 15(4), 383–

389. https://doi.org/10.1177/001316445501500406

Lord, F. M. (1974). Estimation of latent ability and item parameters when there are

omitted responses. Psychometrika, 39, 247–264.

https://doi.org/10.1007/BF02291471

Lord, F. M. (1986). Maximum likelihood and Bayesian parameter estimation in item

response theory. Journal of Educational Measurement, 23, 157–162.

Luecht, R., & Ackerman, T. A. (2018). A technical note on IRT simulation studies:

Dealing with truth, estimates, observed data, and residuals. Educational

Measurement: Issues and Practice, 37(3), 65–76.

https://doi.org/10.1111/emip.12185 218

Lukan, P. (2020). Interpretations of probability and Bayesian inference—An overview.

Acta Analytica, 35, 129–146. https://doi.org/10.1007/s12136-019-00390-4

Luo, H. (2011). Some aspects on confirmatory factor analysis of ordinal variables and

generating non-normal data [Doctoral dissertation, Uppsala University, Uppsala,

Sweden]. Digital Comprehensive Summaries of Uppsala Dissertations from the

Faculty of Social Sciences.

http://www.diva-portal.org/smash/get/diva2:405108/FULLTEXT01.pdf

Luo, Y. (2018). Parameter recovery with marginal maximum likelihood and Markov

chain Monte Carlo estimation for the generalized partial credit model

[Manuscript submitted for publication]. National Center for Assessment in Saudi

Arabia. https://arxiv.org/ftp/arxiv/papers/1809/1809.07359.pdf

Luo, Y., & Jiao, H. (2018). Using the Stan program for Bayesian item response theory.

Educational and Psychological Measurement, 78(3), 384–408.

https://doi.org/10.1177/0013164417693666

Luo, Y., & Wolf, M. G. (2019). Item parameter recovery for the two-parameter testlet

model with different estimation methods. Psychological Test and Assessment

Modeling, 61(1), 65–89.

Magis, D. (2013). A note on the item information function of the four-parameter logistic

model. Applied Psychological Measurement, 37(4), 304–315.

https://doi.org/10.1177/0146621613475471

Marcoulides, K. M. (2018). Careful with those priors: A note on Bayesian estimation in

two-parameter logistic item response theory models. Measurement: 219

Interdisciplinary Research and Perspectives, 16(2), 92–99.

https://doi.org/10.1080/15366367.2018.1437305

Maris, G. (2008). A note on ‘‘Constant latent odds-ratios models and the Mantel-

Haenszel null hypothesis’’ Hessen, 2005. Psychometrika, 73(1), 153–157.

https://doi.org/10.1007/s11336-007-9033-0

Maris, G., & Bechger, T. (2009). On interpreting the model parameters for the three

parameter logistic model. Measurement: Interdisciplinary Research and

Perspectives, 7(2), 75–88. https://doi.org/10.1080/15366360903070385

Martin-Fernandez, M., & Revuelta, J. (2017). Bayesian estimation of multidimensional

item response models: A comparison of analytic and simulation algorithms.

Psicológica, 38(1), 25–55.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–

174. https://doi.org/10.1007/BF02296272

Masters, G. N. (1988). Item discrimination: When more is worse. Journal of Educational

Measurement, 25(1), 15–29. https://doi.org/10.1111/j.1745-3984.1988.tb00288.x

Matloff, N. (2011). The art of R programming: A tour of statistical software design.

William Pollock.

Matteucci, M., Mignani, S., & Veldkamp, B. P. (2012). Prior distributions for item

parameters in IRT models. Communications in Statistics - Theory and Methods,

41(16-17), 2944–2958. https://doi.org/10.1080/03610926.2011.639973

McDonald, R. P. (1967). Nonlinear factor analysis. Psychometric Monographs, No. 15. 220

McElreath, R. (2016). Statistical rethinking: A Bayesian course with examples in R and

Stan. CRC Press.

McMorris, B. J. (1997). Using latent structural analysis to examine the distribution of

problem behavior: Discrete types vs. propensity explanations of criminality

(Publication No. 9819701) [Doctoral dissertation, University of Nebraska-

Lincoln]. ProQuest Dissertations & Theses.

Meng, X., Xu, G., Zhang, J., & Tao, J. (2019). Marginalized maximum a posteriori

estimation for the four-parameter logistic model under a mixture modelling

framework. British Journal of Mathematical and Statistical Psychology.

http://dx.doi.org.proxy.library.ohio.edu/10.1111/bmsp.12185

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953).

Equation of state calculations of fast computing machines. The Journal of

Chemical Physics, 21(6), 1087–1092. https://doi.org/10.1063/1.1699114

Meyer, J. P. (2014). Applied measurement with jMetrik. Routledge.

Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika,

51(2), 177–195. https://doi.org/10.1007/BF02293979

Mislevy, R. J. (2018). Sociocognitive foundations of educational measurement.

Routledge.

Mooney, C. Z. (1997). Monte Carlo simulation. Sage.

Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to

evaluate statistical methods. Statistics in Medicine, 38(11), 2074–2102.

https://doi.org/10.1002/sim.8086 221

Muraki, E. (1990). Fitting a polytomous item response model to Likert-type data. Applied

Psychological Measurement, 14(1), 59–71.

https://doi.org/10.1177/014662169001400106

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm.

Applied Psychological Measurement, 16(2), 159–176.

https://doi.org/10.1177/014662169201600206

Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus user’s guide (8th ed.). Muthén &

Muthén.

Myszkowski, N., & Storme, M. (2017). Measuring “good taste” with the Visual Aesthetic

Sensitivity Test-Revised (VAST-R). Personality and Individual Differences, 117,

91–100. https://doi.org/10.1016/j.paid.2017.05.041

Nader, I. W., Tran, U. S., & Voracek, M. (2015). Effects of initial values and

convergence criterion in the two-parameter logistic model when estimating the

latent distribution in BILOG-MG 3. PLoS ONE, 10(10), Article e0140163.

https://doi.org/10.1371/journal.pone.0140163

Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. L.

Jones, & X-L. Meng (Eds.), Handbook of Markov chain Monte Carlo (pp. 113–

162). CRC Press.

Novick, M. R. (1970). Bayesian considerations in educational information systems. ACT

Research Reports, 38.

O’Hagan, A. (2008). The Bayesian approach to statistics. In T. Rudas (Ed.), Handbook of

probability: Theory and applications (pp. 85–100). Sage.

https://doi.org/10.4135/9781452226620

Ören, T. I. (2000). Responsibility, ethics, and simulation. Transactions of the Society for

Computer Simulation International, 17(4), 165–170.

Ören, T. I., Elzas, M. S., Smit, I., & Birta, L. G. (2002). A code of professional ethics for

simulationists. In J. Wallace & J. Celano (Eds.), Proceedings of the 2002 summer

computer simulation conference (pp. 434–435). Society for Computer Simulation

International.

Ogasawara, H. (2012). Asymptotic expansions for the ability estimator in item response

theory. Computational Statistics, 27, 661–683. https://doi.org/10.1007/s00180-

011-0282-0

Ogasawara, H. (2017). Identified and unidentified cases of the fixed-effects 3- and 4-

parameter models in item response theory. Behaviormetrika, 44, 405–423.

https://doi.org/10.1007/s41237-017-0032-x

Osgood, D. W., McMorris, B. J., & Potenza, M. T. (2002). Analyzing multiple-item

measures of crime and deviance I: Item response theory scaling. Journal of

Quantitative Criminology, 18, 267–296.

https://doi.org/10.1023/A:1016008004010

Osterlind, S. J. (2010). Modern measurement: Theory, principles, and applications of

mental appraisal (2nd ed.). Pearson.

Patz, R. J., & Junker, B. W. (1999a). A straightforward approach to Markov chain Monte

Carlo methods for item response models. Journal of Educational and Behavioral

Statistics, 24(2), 146–178. https://doi.org/10.3102/10769986024002146

Patz, R. J., & Junker, B. W. (1999b). Applications and extensions of MCMC in IRT:

Multiple item types, missing data, and rated responses. Journal of Educational

and Behavioral Statistics, 24(4), 342–366.

https://doi.org/10.3102/10769986024004342

Phillips, S. E., & Camara, W. J. (2006). Legal and ethical issues. In R. Brennan (Ed.),

Educational measurement (4th ed., pp. 734–755). Praeger.

Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using

Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the

3rd international workshop on distributed statistical computing (DSC 2003).

Vienna, Austria. https://www.r-project.org/conferences/DSC-

2003/Proceedings/Plummer.pdf

Primi, R., Nakano, T. D. C., & Wechsler, S. M. (2018). Using four-parameter item

response theory to model human figure drawings. Avaliação Psicológica, 17(4),

473–483. https://doi.org/10.15689/ap.2018.1704.7.07

R Core Team. (2019). R: A language and environment for statistical computing

[Computer software]. Vienna, Austria. http://www.R-project.org/

Raftery, A. E., & Lewis, S. M. (1992). How many iterations in the Gibbs sampler? In J.

M. Bernardo, A. F. M. Smith, A. P. Dawid, & J. O. Berger (Eds.), Bayesian

Statistics 4 (pp. 763–773). Oxford University Press.

Raiche, G., Magis, D., Blais, J.-G., & Brochu, P. (2013). Taking atypical response

patterns into account: A multidimensional measurement model from item

response theory. In M. Simon, K. Ercikan, & M. Rousseau (Eds.), Improving

large-scale assessment in education: Theory, issues and practice (pp. 238–259).

Routledge.

Reckase, M. D. (2009). Multidimensional item response theory. Springer.

Reise, S., & Waller, N. (2003). How many IRT parameters does it take to model

psychopathology items? Psychological Methods, 8(2), 164–184.

https://doi.org/10.1037/1082-989X.8.2.164

Robert, C., & Casella, G. (2011). A short history of MCMC: Subjective recollections

from incomplete data. In S. Brooks, A. Gelman, G. L. Jones, & X-L. Meng (Eds.),

Handbook of Markov chain Monte Carlo (pp. 49–66). CRC Press.

Roy, V. (2020). Convergence diagnostics for Markov chain Monte Carlo. Annual Review

of Statistics and Its Application, 7, 387–412. https://doi.org/10.1146/annurev-

statistics-031219-041300

Rubin, D. B. (1983). Some applications of Bayesian statistics to educational data. The

Statistician, 32, 55–68. https://doi.org/10.2307/2987592

Rulison, K. L., & Loken, E. (2009). I’ve fallen and I can’t get up: Can high-ability

students recover from early mistakes in CAT? Applied Psychological

Measurement, 33(2), 83–101. https://doi.org/10.1177/0146621608324023

Rupp, A. A. (2003). Item response modeling with BILOG-MG and MULTILOG for

Windows. International Journal of Testing, 3(4), 365–384.

https://doi.org/10.1207/S15327574IJT0304_5

Sahu, S. K. (2002). Bayesian estimation and model choice in item response models.

Journal of Statistical Computation and Simulation, 72(3), 217–232.

https://doi.org/10.1080/00949650212387

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded

scores. Psychometrika Monograph Supplement, No. 17.

Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K. Hambleton

(Eds.), Handbook of modern item response theory (pp. 85–100). Springer.

San Martín, E., González, J., & Tuerlinckx, F. (2015). On the unidentifiability of the

fixed effect 3PL model. Psychometrika, 80(2), 450–467.

https://doi.org/10.1007/s11336-014-9404-2

Santelices, M. V., & Wilson, M. (2010). Unfair treatment? The case of Freedle, the SAT,

and the standardization approach to differential item functioning. Harvard

Educational Review, 80(1), 106–133.

https://doi.org/10.17763/haer.80.1.j94675w001329270

Sass, D. A., Schmitt, T. A., & Walker, C. M. (2008). Estimating non-normal latent trait

distributions within item response theory using true and estimated item

parameters. Applied Measurement in Education, 21(1), 65–88.

https://doi.org/10.1080/08957340701796415

Sawilowsky, S. S. (2003). You think you’ve got trivials? Journal of Modern Applied

Statistical Methods, 2(1), 218–225.

Schilling, S., & Bock, R. D. (2005). High-dimensional maximum marginal likelihood

item factor analysis by adaptive quadrature. Psychometrika, 70(3), 533–555.

https://doi.org/10.1007/s11336-003-1141-x

Seong, T-J. (1990). Sensitivity of marginal maximum likelihood estimation of item and

ability parameters to the characteristics of the prior ability distributions. Applied

Psychological Measurement, 14(3), 299–311.

https://doi.org/10.1177/014662169001400307

Sheng, Y. (2015). Bayesian estimation of the four-parameter IRT model using Gibbs

sampling. International Journal of Quantitative Research in Education, 2(3/4),

194–212. https://doi.org/10.1504/IJQRE.2015.071736

Shults, F. R., Wildman, W. J., & Dignum, V. (2018). The ethics of computer modeling

and simulation. In M. Rabe, A.A. Juan, N. Mustafee, A. Skoogh, S. Jain, & B.

Johansson (Eds.), Proceedings of the 2018 winter simulation conference (pp.

4069–4083). IEEE.

Sideridis, G. D., Tsaousis, I., & Al Harbi, K. (2016). The impact of non-attempted and

dually-attempted items on person abilities using item response theory. Frontiers

in Psychology, 7, Article 1572. https://doi.org/10.3389/fpsyg.2016.01572

Sigal, M. J., & Chalmers, R. P. (2016). Play it again: Teaching statistics with Monte

Carlo simulation. Journal of Statistics Education, 24(3), 136–156.

https://doi.org/10.1080/10691898.2016.1246953

Sijtsma, K., & Hemker, B. T. (2000). A taxonomy of IRT models for ordering persons

and items using simple sum scores. Journal of Educational and Behavioral

Statistics, 25(4), 391–415. https://doi.org/10.3102/10769986025004391

Sinharay, S. (2003). Assessing convergence of the Markov chain Monte Carlo

algorithms: A review (Research Report No. 03-07). Educational Testing Service.

Sinharay, S. (2004). Experiences with Markov chain Monte Carlo convergence

assessment in two psychometric examples. Journal of Educational and

Behavioral Statistics, 29(4), 461–488.

https://doi.org/10.3102/10769986029004461

Sinharay, S., Dorans, N. J., Grant, M. C., & Blew, E. O. (2009). Using past data to

enhance small sample DIF estimation: A Bayesian approach. Journal of

Educational and Behavioral Statistics, 34(1), 74–96.

https://doi.org/10.3102/1076998607309021

Sinharay, S., & von Davier, M. (2005). Extension of the NAEP BGGROUP program to

higher dimensions (Research Report No. 05-27). Educational Testing Service.

Stan Development Team. (2018a). Stan user’s guide (Version 2.18). http://mc-

stan.org/documentation/

Stan Development Team. (2018b). Stan reference manual (Version 2.18). http://mc-

stan.org/documentation/

Stone, C. A. (1992). Recovery of marginal maximum likelihood estimates in the two-

parameter logistic response model: An evaluation of MULTILOG. Applied

Psychological Measurement, 16(1), 1–16.

https://doi.org/10.1177/014662169201600101

Storme, M., Myszkowski, N., Baron, S., & Bernard, D. (2019). Same test, better scores:

Boosting the reliability of short online intelligence recruitment tests with nested

logit item response theory models. Journal of Intelligence, 7(3), Article 17.

https://doi.org/10.3390/jintelligence7030017

Suen, H. (1990). Principles of test theories. Lawrence Erlbaum.

Svetina, D., Valdivia, A., Underhill, S., Dai, S., & Wang, X. (2017). Parameter recovery

in multidimensional item response theory models under complexity and

nonnormality. Applied Psychological Measurement, 41(7), 530–544.

https://doi.org/10.1177/0146621617707507

Swaminathan, H., & Gifford, J. (1982). Bayesian estimation in the Rasch model. Journal

of Educational Statistics, 7(3), 175–191.

https://doi.org/10.3102/10769986007003175

Swist, K. (2015). Item analysis and evaluation using a four-parameter logistic model.

Edukacja, 3, 77–97.

Tadikamalla, P. R. (1980). On simulating non-normal distributions. Psychometrika,

45(2), 273–279. https://doi.org/10.1007/BF02294081

Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data

augmentation (with discussion). Journal of the American Statistical Association,

82, 528–540.

Tavares, H. R., de Andrade, D. F., & Pereira, C. A. (2004). Detection of determinant

genes and diagnostic via item response theory. Genetics and Molecular Biology,

27(4), 679–685. https://doi.org/10.1590/S1415-47572004000400033

Tendeiro, J. N., & Meijer, R. R. (2012). A CUSUM to detect person misfit: A discussion

and some alternatives for existing procedures. Applied Psychological

Measurement, 36(5), 420–442. https://doi.org/10.1177/0146621612446305

Thissen, D. (1982). Marginal maximum likelihood estimation for the one-parameter

logistic model. Psychometrika, 47(2), 175–186.

https://doi.org/10.1007/BF02296273

Thissen, D., & Wainer, H. (1982). Some standard errors in item response theory.

Psychometrika, 47(4), 397–412. https://doi.org/10.1007/BF02293705

Vale, C. D., & Maurelli, V. A. (1983). Simulating multivariate nonnormal distributions.

Psychometrika, 48(3), 465–471. https://doi.org/10.1007/BF02293687

van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item

response theory. Springer.

VanderPlas, J. (2014). Frequentism and Bayesianism: A Python-driven primer. In S. van

der Walt & J. Bergstra (Eds.), Proceedings of the 13th Python in Science

conference (pp. 85–93). https://doi.org/10.25080/Majora-14bd3278-011

van de Schoot, R., Kaplan, D., Denissen, J., Asendorpf, J. B., Neyer, F. J., & van Aken,

M. A. G. (2014). A gentle introduction to Bayesian analysis: Applications to

developmental research. Child Development, 85(3), 842–860.

https://doi.org/10.1111/cdev.12169

van Ravenzwaaij, D., Cassey, P., & Brown, S. D. (2018). A simple introduction to

Markov chain Monte Carlo sampling. Psychonomic Bulletin & Review, 25, 143–

154. https://doi.org/10.3758/s13423-016-1015-8

van Rijn, P. W. (2014). Reliability of multistage tests using item response theory. In D.

Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing:

Theory and applications (pp. 251–264). CRC Press.

Wagenmakers, E-J., Lee, M., Lodewyckx, T., & Iverson, G. J. (2008). Bayesian versus

frequentist inference. In H. Hoijtink, I. Klugkist, & P. A. Boelen (Eds.), Bayesian

evaluation of informative hypotheses (pp. 181–207). Springer.

Wainer, H. (1983). Are we correcting for guessing in the wrong direction? In D. J. Weiss

(Ed.), New horizons in testing: Latent trait test theory and computerized adaptive

testing (pp. 63–80). Lawrence Erlbaum.

Wainer, H., & Mislevy, R. J. (2000). Item response theory, item calibration, and

proficiency estimation. In H. Wainer (Ed.), Computerized adaptive testing: A

primer (2nd ed., pp. 61–100). Lawrence Erlbaum.

Waller, N. G., & Feuerstahler, L. (2017). Bayesian modal estimation of the four-

parameter item response model in real, realistic, and idealized data sets.

Multivariate Behavioral Research, 52(3), 350–370.

https://doi.org/10.1080/00273171.2017.1292893

Waller, N. G., & Reise, S. P. (2010). Measuring psychopathology with non-standard IRT

models: Fitting the four-parameter model to the MMPI. In S. Embretson & J. S.

Roberts (Eds.), Measuring psychological constructs: Advances in model-based

approaches (pp. 147–173). American Psychological Association.

Walstad, W. B., & Rebeck, K. (2017). The Test of Financial Literacy: Development and

measurement characteristics. The Journal of Economic Education, 48(2), 113–

122. https://doi.org/10.1080/00220485.2017.1285739

Wang, W-C., & Wilson, M. (2005). The Rasch testlet model. Applied Psychological

Measurement, 29(2), 126–149. https://doi.org/10.1177/0146621604271053

Wollack, J. A., Bolt, D. M., Cohen, A. S., & Lee, Y.-S. (2002). Recovery of item

parameters in the nominal response model: A comparison of marginal maximum

likelihood estimation and Markov Chain Monte Carlo estimation. Applied

Psychological Measurement, 26(3), 339–352.

https://doi.org/10.1177/0146621602026003007

Yalçın, S. (2018). Data fit comparison of mixture item response theory models and

traditional models. International Journal of Assessment Tools in Education, 5(2),

301–313. https://doi.org/10.21449/ijate.402806

Yen, W. M. (1987). A comparison of the efficiency and accuracy of BILOG and

LOGIST. Psychometrika, 52(2), 275–291. https://doi.org/10.1007/BF02294241

Yen, Y.-C., Ho, R.-G., Liao, W.-W., & Chen, L.-J. (2012). Reducing the impact of

inappropriate items on renewable computerized adaptive testing. Educational

Technology and Society, 15(2), 231–243.

Yen, Y.-C., Ho, R.-G., Liao, W.-W., Chen, L.-J., & Kuo, C.-C. (2012). An empirical

evaluation of the slip correction in the four parameter logistic models with

computerized adaptive testing. Applied Psychological Measurement, 36(2), 75–

87. https://doi.org/10.1177/0146621611432862

Zhang, J., Lu, J., Du, H., & Zhang, Z. (2020). Gibbs-slice sampling algorithm for

estimating the four-parameter logistic model. Frontiers in Psychology, 11, Article

2121. https://doi.org/10.3389/fpsyg.2020.02121

Zwick, R., Thayer, D. T., & Lewis, C. (1999). An empirical Bayes approach to Mantel-

Haenszel DIF analysis. Journal of Educational Measurement, 36(1), 1–28.

https://doi.org/10.1111/j.1745-3984.1999.tb00543.x


Appendix A: Calibrations by MML with N = 10,000

θ           Parameter     Bias      RMSE     Correlation   Coverage       SE
Normal      a             0.0956    0.3531     .6264         .9750      0.4318
            b             0.0388    0.3082     .9509         .9260      0.3887
            c             0.0113    0.0833     .3892         .9080      0.1061
            d            -0.0046    0.0742     .3896         .9280      0.0998
            θ             0.0087    0.5786     .8230         .9602      0.6076
Negatively  a             0.3173    0.5511     .4863         .9840      0.5581
skewed      b            -0.2632    0.4718     .9087         .6880      0.3769
            c            -0.0076    0.0752     .4080         .9270      0.1013
            d            -0.0842    0.1287     .3126         .6730      0.0966
            θ             0.0259    0.5726     .8272         .9580      0.6252

Note: Bias = average bias; RMSE = root mean square error; Correlation = average correlation between the generated and estimated parameter values; Coverage = average proportion coverage of the true parameter value by the 95% confidence interval (MML) or the 95% highest density interval (MCMC); SE = average standard error (MML) or posterior standard deviation (MCMC).
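The outcome measures reported in this and the following appendix tables can be reproduced from the stored simulation output. The lines below are a minimal illustrative sketch (not part of the original scripts) for the a parameter, assuming the a.out array (3 × Nitem × reps: estimates, SEs, coverage indicators) and the a.true matrix created by the script in Appendix H; the same pattern applies to the other parameters.

a.err   <- a.out[1, , ] - a.true               # estimation errors (Nitem x reps)
a.bias  <- mean(a.err)                         # average bias
a.rmse  <- sqrt(mean(a.err^2))                 # root mean square error
a.corr  <- mean(sapply(1:reps, function(r)     # average true-estimate correlation
  cor(a.true[, r], a.out[1, , r])))
a.cover <- mean(a.out[3, , ])                  # average 95% interval coverage
a.se    <- mean(a.out[2, , ])                  # average standard error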


Appendix B: Latent Trait Distributions in the Main Simulation versus the Follow-up Simulation (N = 2,500)

[Figure: overlaid density plots of the standardized latent trait distributions used in the main simulation and in the follow-up simulation.]
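The figure can be regenerated along the following lines. This is an illustrative sketch rather than the original plotting code: the exact follow-up mixture is constructed in the script in Appendix M, and the sample sizes and seed here are arbitrary.

# Overlay the two standardized latent trait densities
set.seed(1)                                    # illustrative seed
theta.main <- scale(rbeta(100000, 8.2, 2.8))   # main simulation (Appendix G)
base <- scale(rbeta(100000, 8.4, 2))           # follow-up base distribution (Appendix M)
tail <- rnorm(100000)
tail <- tail[tail > 1.5]                       # extra high-ability respondents
theta.follow <- scale(c(base, tail))
plot(density(theta.follow), main = "Latent trait distributions", xlab = "theta")
lines(density(theta.main), lty = 2)
legend("topleft", legend = c("Follow-up", "Main simulation"), lty = c(1, 2))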


Appendix C: Calibrations by MML and MCMC under Negatively Skewed Latent Trait with More High-Ability Examinees (N = 1,000)

Estimator   Parameter     Bias      RMSE     Correlation   Coverage       SE
MML         a             0.2036    0.6291    0.2981        0.9460      1.0507
            b            -0.2497    0.6920    0.7829        0.8230      1.0300
            c            -0.0532    0.1217    0.2306        0.7280      0.1694
            d            -0.0333    0.1336    0.1961        0.7980      0.1925
            θ             0.0150    0.5956    0.8144        0.9377      0.5802
HMC         a            -0.1901    0.3738    0.3060        0.6570      0.1784
            b            -0.2052    0.9499    0.5318        0.7030      0.3993
            c            -0.0125    0.0583    0.2092        0.9090      0.0623
            d            -0.0398    0.0804    0.1570        0.9120      0.0647
            θ             0.0008    1.0759    0.4219        0.8185      0.6805

Note: Bias = average bias; RMSE = root mean square error; Correlation = average correlation between the generated and estimated parameter values; Coverage = average proportion coverage of the true parameter value by the 95% confidence interval (MML) or the 95% highest density interval (MCMC); SE = average standard error (MML) or posterior standard deviation (MCMC).


Appendix D: Calibrations by MML and MCMC under Negatively Skewed Latent Trait with More High-Ability Examinees (N = 2,500)

Estimator   Parameter     Bias      RMSE     Correlation   Coverage       SE
MML         a             0.2269    0.6235    0.3175        0.9550      0.8712
            b            -0.2614    0.5900    0.8353        0.7990      0.6180
            c            -0.0378    0.1031    0.2452        0.8400      0.1458
            d            -0.0518    0.1265    0.2743        0.8060      0.1608
            θ             0.0192    0.5799    0.8218        0.9511      0.5998
HMC         a            -0.1789    0.3388    0.6555        0.6730      0.1693
            b            -0.2486    0.4269    0.9351        0.8580      0.3274
            c            -0.0182    0.0553    0.3731        0.9170      0.0529
            d            -0.0490    0.0848    0.3462        0.8770      0.0560
            θ            -0.0001    0.5857    0.8265        0.9745      0.6794

Note: Bias = average bias; RMSE = root mean square error; Correlation = average correlation between the generated and estimated parameter values; Coverage = average proportion coverage of the true parameter value by the 95% confidence interval (MML) or the 95% highest density interval (MCMC); SE = average standard error (MML) or posterior standard deviation (MCMC).


Appendix E: Calibrations by MML and MCMC at N = 1,000 under Normal Latent Trait when MML Converged Technically

Estimator   Parameter     Bias      RMSE     Correlation   Coverage       SE
MML         a             1.3126    2.8162     .0693         .9530      2.6248
            b            -0.0757    2.6412     .3679         .7620      1.5584
            c             0.0176    0.1476     .1831         .6310      0.1488
            d            -0.0266    0.1413     .1729         .6520      0.1368
            θ             0.0203    0.6136     .8053         .9288      0.5735
Gibbs       a            -0.1747    0.3341     .6964         .7080      0.1805
            b            -0.0458    0.3064     .9562         .9950      0.4069
            c             0.0066    0.0575     .2960         .9670      0.0661
            d            -0.0288    0.0659     .3119         .9670      0.0628
            θ             0.0215    0.5989     .8212         .9768      0.6879
HMC         a            -0.1748    0.3341     .6969         .7070      0.1805
            b            -0.0460    0.3064     .9563         .9940      0.4070
            c             0.0066    0.0574     .2968         .9690      0.0661
            d            -0.0288    0.0659     .3118         .9660      0.0628
            θ             0.0214    0.5989     .8212         .9769      0.6879

Note: Bias = average bias; RMSE = root mean square error; Correlation = average correlation between the generated and estimated parameter values; Coverage = average proportion coverage of the true parameter value by the 95% confidence interval (MML) or the 95% highest density interval (MCMC); SE = average standard error (MML) or posterior standard deviation (MCMC).


Appendix F: Calibrations by MML and MCMC at N = 1,000 under Negatively Skewed Latent Trait when MML Converged Technically

Estimator   Parameter     Bias      RMSE     Correlation   Coverage       SE
MML         a             1.9651    3.9274     .0035         .9650      3.9149
            b            -0.2419    0.7321     .6988         .7060      0.6177
            c             0.0058    0.1402     .1968         .6290      0.1413
            d            -0.0772    0.1728     .1734         .6030      0.1480
            θ             0.0198    0.6021     .8102         .9343      0.5844
Gibbs       a            -0.1790    0.3352     .6348         .7120      0.1812
            b            -0.1900    0.3564     .9493         .9750      0.4069
            c            -0.0043    0.0528     .3785         .9680      0.0644
            d            -0.0381    0.0775     .2870         .9480      0.0634
            θ             0.0039    0.5875     .8268         .9753      0.6829
HMC         a            -0.1789    0.3352     .6348         .7120      0.1813
            b            -0.1902    0.3561     .9494         .9750      0.4072
            c            -0.0043    0.0529     .3785         .9680      0.0644
            d            -0.0382    0.0776     .2866         .9510      0.0635
            θ             0.0038    0.5875     .8269         .9753      0.6829

Note: Bias = average bias; RMSE = root mean square error; Correlation = average correlation between the generated and estimated parameter values; Coverage = average proportion coverage of the true parameter value by the 95% confidence interval (MML) or the 95% highest density interval (MCMC); SE = average standard error (MML) or posterior standard deviation (MCMC).


Appendix G: R Script for the Main Simulation Data Generation

rm(list = ls())

# Install and load necessary packages
packages.needed <- c("mirt", "truncnorm")
new.packages <- packages.needed[!(packages.needed %in% installed.packages()[,"Package"])]
if (length(new.packages)) install.packages(new.packages, dependencies = TRUE)
library(mirt); library(truncnorm)

# Set the simulation conditions
reps <- 50          # actual number of replications for converged models
Ntheta <- 1000
Nitem <- 20
quad.points <- 41
tolerance <- .001

# Generate seeds for the cell. Don't change this 1804 seed.
totalconditions <- 18
k <- 1
set.seed(1804)
seed1 <- trunc(runif(totalconditions, 100000, 500000))

# Generate theta and initial values
set.seed(seed1[k])
theta.true <- matrix(rnorm(Ntheta, 0, 1), nrow = Ntheta, ncol = 1)

# Prepare empty objects to store data and model results
data.temp <- list()      # temp = temporary
FourPL.temp <- list()
a.hat.temp <- vector(length = Nitem)
b.hat.temp <- vector(length = Nitem)
c.hat.temp <- vector(length = Nitem)
d.hat.temp <- vector(length = Nitem)
data <- list()           # store data good for convergence
FourPL <- list()         # store converged model results

# Matrices to store true item parameters
a.true.temp <- vector(length = Nitem)
b.true.temp <- vector(length = Nitem)
c.true.temp <- vector(length = Nitem)
d.true.temp <- vector(length = Nitem)
a.true <- matrix(nrow = Nitem, ncol = reps)   # discrimination parameter
b.true <- matrix(nrow = Nitem, ncol = reps)   # difficulty parameter
c.true <- matrix(nrow = Nitem, ncol = reps)   # lower asymptote
d.true <- matrix(nrow = Nitem, ncol = reps)   # upper asymptote

# Vectors to store information about convergence
tech.convergence <- vector()   # TRUE = converged, FALSE = not converged technically in mirt
identification <- vector()
drift <- vector()
drift.vector <- vector()
convergence <- vector()        # practically converged

# Now collect results for 50 successfully calibrated models
count <- 0   # count how many models were identified and converged
r <- 1
set.seed(seed1[k])
while (count < reps) {
  a.true.temp <- rtruncnorm(Nitem, a = 0, b = Inf, mean = 1.2, sd = .35)   # discrimination parameter
  b.true.temp <- rnorm(Nitem, 0, 1)                                        # difficulty parameter
  c.true.temp <- rtruncnorm(Nitem, a = 0, b = Inf, mean = .15, sd = .04)   # pseudo-guessing parameter (lower asymptote)
  d.true.temp <- rtruncnorm(Nitem, a = 0, b = 1, mean = .88, sd = .04)     # slipping parameter (upper asymptote)

  data.temp <- simdata(a = a.true.temp,
                       d = (-a.true.temp * b.true.temp),   # d here is the intercept parameter, equal to -a*b
                       itemtype = '4PL',
                       guess = c.true.temp,
                       upper = d.true.temp,
                       Theta = theta.true)
  IRTmodel <- 'F1 = Item_1-Item_20'
  model <- mirt.model(IRTmodel, itemnames = colnames(data.temp))
  FourPL.temp <- mirt(data = data.temp, model = model, dentype = 'Gaussian',
                      method = "EM", itemtype = '4PL', SE = TRUE,
                      technical = list(NCYCLES = 10000),
                      TOL = tolerance, quadpts = quad.points)

  tech.convergence[r] <- extract.mirt(FourPL.temp, "converged")
  identification[r] <- length(coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE,
                                   simplify = FALSE, printSE = TRUE)[[1]][,1])
  for (j in 1:Nitem) {
    a.hat.temp[j] <- coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE, simplify = FALSE, printSE = TRUE)[[j]][1,1]
    b.hat.temp[j] <- coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE, simplify = FALSE, printSE = TRUE)[[j]][1,2]
    c.hat.temp[j] <- coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE, simplify = FALSE, printSE = TRUE)[[j]][1,3]
    d.hat.temp[j] <- coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE, simplify = FALSE, printSE = TRUE)[[j]][1,4]
    # Flag implausible estimates
    ifelse(a.hat.temp[j] <= 3 & abs(b.hat.temp[j]) <= 6 & c.hat.temp[j] < d.hat.temp[j],
           drift[j] <- 0, drift[j] <- 1)
  }
  drift.vector[r] <- mean(drift)
  ifelse(tech.convergence[r] == "TRUE" & identification[r] == 2 & drift.vector[r] == 0,
         convergence[r] <- 1, convergence[r] <- 0)
  if (convergence[r] == 1) {
    count <- count + 1
    data[[count]] <- data.temp
    FourPL[[count]] <- FourPL.temp
    a.true[,count] <- a.true.temp
    b.true[,count] <- b.true.temp
    c.true[,count] <- c.true.temp
    d.true[,count] <- d.true.temp
  }
  r <- r + 1
} # end of while loop

# Check convergence, identification and drift rates
(tech.convergence.rate <- length(tech.convergence[tech.convergence == "TRUE"]) / length(tech.convergence))
(identification.rate <- length(identification[identification == 2]) / length(identification))
(drift.rate <- length(drift.vector[drift.vector > 0]) / length(drift.vector))
(convergence.rate <- length(convergence[convergence == 1]) / length(convergence))

# Save data and true parameters for later use (selectively save objects)
save(list = c("FourPL", "a.true", "b.true", "c.true", "d.true", "theta.true", "data",
              "tech.convergence.rate", "convergence.rate", "identification.rate", "drift.rate"),
     file = "Cell1.RData")

############################################################################
# Cell 2: 2,500 cases and normal theta.
# Copy the code for Cell 1 and change the following lines:
# Ntheta <- 2500
# k <- 2
# file = "Cell2.RData"
############################################################################
# Cell 3: 5,000 cases and normal theta.
# Copy the code for Cell 1 and change the following lines:
# Ntheta <- 5000
# k <- 3
# file = "Cell3.RData"
############################################################################
# Cell 10: 1,000 cases and negatively skewed theta.
# Copy the code for Cell 1 and change the following lines:
# k <- 10
# theta.true <- matrix(scale(rbeta(Ntheta, 8.2, 2.8)), nrow = Ntheta, ncol = 1)
# file = "Cell10.RData"
############################################################################
# Cell 11: 2,500 cases and negatively skewed theta.
# Copy the code for Cell 1 and change the following lines:
# Ntheta <- 2500
# k <- 11
# theta.true <- matrix(scale(rbeta(Ntheta, 8.2, 2.8)), nrow = Ntheta, ncol = 1)
# file = "Cell11.RData"
############################################################################
# Cell 12: 5,000 cases and negatively skewed theta.
# Copy the code for Cell 1 and change the following lines:
# Ntheta <- 5000
# k <- 12
# theta.true <- matrix(scale(rbeta(Ntheta, 8.2, 2.8)), nrow = Ntheta, ncol = 1)
# file = "Cell12.RData"
############################################################################


Appendix H: R Script for MML Estimations

rm(list = ls())
# setwd("D:/Dissertation/Chapter 3/Official calibration")  # change this to the folder to save results
memory.limit(1000000)  # increase memory limit if needed
start.time <- Sys.time()

# Install and load necessary packages
packages.needed <- c("mirt", "SimDesign", "truncnorm")
new.packages <- packages.needed[!(packages.needed %in% installed.packages()[,"Package"])]
if (length(new.packages)) install.packages(new.packages, dependencies = TRUE)
library(mirt); library(truncnorm); library(SimDesign)

# Set the simulation conditions
reps <- 50   # actual number of replications for converged models
Ntheta <- 1000
Nitem <- 20
quad.points <- 41
tolerance <- .001
theta.prior.mean <- 0
theta.prior.var <- 1.44

# Load true parameters and data matrices
load("Cell1.RData")

# Prepare empty objects to store model results.
# Arrays hold estimates and SEs: matrix 1 = estimates, matrix 2 = SE, matrix 3 = coverage.
item.out <- list()
person.out <- list()
item.CI <- list()
a.out <- array(dim = c(3, Nitem, reps))
b.out <- array(dim = c(3, Nitem, reps))
c.out <- array(dim = c(3, Nitem, reps))
d.out <- array(dim = c(3, Nitem, reps))
theta.out <- array(dim = c(3, Ntheta, reps))

# Generate seeds for the loop. Don't change this "1804" seed.
totalconditions <- 18
k <- 1
set.seed(1804)
seed1 <- trunc(runif(totalconditions, 100000, 500000))
set.seed(seed1[k])

for (r in 1:reps) {
  item.out[[r]] <- coef(FourPL[[r]], IRTpars = TRUE, rawug = FALSE, simplify = FALSE, printSE = TRUE)
  item.CI[[r]] <- coef(FourPL[[r]], IRTpars = TRUE, rawug = FALSE, simplify = FALSE, CI = .95)
  person.out[[r]] <- fscores(FourPL[[r]], method = "EAP", quadpts = quad.points,
                             mean = theta.prior.mean, cov = theta.prior.var,   # prior theta = N(0, 1.2^2)
                             full.scores = TRUE, full.scores.SE = TRUE)

  # Allocate results to the respective arrays for each parameter type.
  # Matrix 1 = estimates, matrix 2 = SE.
  for (j in 1:Nitem) {
    a.out[1:2, j, r] <- item.out[[r]][[j]][1:2, 1]
    b.out[1:2, j, r] <- item.out[[r]][[j]][1:2, 2]
    c.out[1:2, j, r] <- item.out[[r]][[j]][1:2, 3]
    d.out[1:2, j, r] <- item.out[[r]][[j]][1:2, 4]

    # Compute item parameter coverage in matrix 3. Cover = 1, otherwise = 0.
    ifelse(a.true[j,r] >= item.CI[[r]][[j]][2,1] & a.true[j,r] <= item.CI[[r]][[j]][3,1],
           a.out[3,j,r] <- 1, a.out[3,j,r] <- 0)
    ifelse(b.true[j,r] >= item.CI[[r]][[j]][2,2] & b.true[j,r] <= item.CI[[r]][[j]][3,2],
           b.out[3,j,r] <- 1, b.out[3,j,r] <- 0)
    ifelse(c.true[j,r] >= item.CI[[r]][[j]][2,3] & c.true[j,r] <= item.CI[[r]][[j]][3,3],
           c.out[3,j,r] <- 1, c.out[3,j,r] <- 0)
    ifelse(d.true[j,r] >= item.CI[[r]][[j]][2,4] & d.true[j,r] <= item.CI[[r]][[j]][3,4],
           d.out[3,j,r] <- 1, d.out[3,j,r] <- 0)
  }

  # Matrix 3 of theta.out: check whether the CIs contain true theta. Cover = 1, otherwise = 0.
  for (i in 1:Ntheta) {
    theta.out[1,i,r] <- person.out[[r]][i,1]   # estimates
    theta.out[2,i,r] <- person.out[[r]][i,2]   # SE
    ifelse(theta.true[i,1] >= (theta.out[1,i,r] - 1.96*theta.out[2,i,r]) &
           theta.true[i,1] <= (theta.out[1,i,r] + 1.96*theta.out[2,i,r]),
           theta.out[3,i,r] <- 1, theta.out[3,i,r] <- 0)
  }
} # end of for loop

# Check model run time
end.time <- Sys.time()
(model.run.time <- end.time - start.time)
############################################################################
# For other cells, change the lines indicated above with the corresponding sample sizes
# and seeds. See Appendix G.
############################################################################


Appendix I: R Script for HMC Estimations

rm(list = ls())
# setwd("D:/Dissertation/Chapter 3/Official calibration")  # change this to the folder to save results
start.time <- Sys.time()
memory.limit(1000000)

# Install and load necessary packages
# rstan 2.18.2 is complicated and has to be installed separately.
packages.needed <- c("SimDesign", "truncnorm", "HDInterval", "coda", "tidyr")
new.packages <- packages.needed[!(packages.needed %in% installed.packages()[,"Package"])]
if (length(new.packages)) install.packages(new.packages, dependencies = TRUE)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
# Sys.setenv(LOCAL_CPPFLAGS = '-march=native')
# The line above has to be commented out when sample size is large so that R does not
# crash, but commenting it out will make HMC slower.
library(truncnorm); library(SimDesign); library(coda); library(HDInterval); library(tidyr)

# Set the simulation conditions
reps <- 50
Ntheta <- 1000
Nitem <- 20
n.chains <- 4           # number of chains
thin <- 1               # thinning option in HMC
burn.in <- 20000
chain.length <- 50000   # HMC chain length includes warm-up

# Prepare empty objects to store HMC draws and model outcomes
StanFit <- list()
model.out <- list()
model.hdi <- list()

# Vectors to store multivariate Rhat
item.mRhat <- vector(length = reps)
theta.mRhat <- vector(length = reps)
model.mRhat <- vector(length = reps)

# Load true parameters and data matrices
load("Cell1.RData")

# Generate seeds for the loop. Don't change this "1804" seed.
totalconditions <- 18
k <- 1
set.seed(1804)
seed1 <- trunc(runif(totalconditions, 100000, 500000))

.RNG.name <- "base::Mersenne-Twister"
set.seed(seed1[k])

# Initial values for the chains
StanInit <- list(list(theta = rnorm(Ntheta, 0, 1.2), b = rnorm(Nitem, 0, 1.3),
                      a = rlnorm(Nitem, 0, .2), c = rbeta(Nitem, 2, 10), d = rbeta(Nitem, 10, 2)),
                 list(theta = rnorm(Ntheta, 0, 1.2), b = rnorm(Nitem, 0, 1.3),
                      a = rlnorm(Nitem, 0, .2), c = rbeta(Nitem, 2, 10), d = rbeta(Nitem, 10, 2)),
                 list(theta = rnorm(Ntheta, 0, 1.2), b = rnorm(Nitem, 0, 1.3),
                      a = rlnorm(Nitem, 0, .2), c = rbeta(Nitem, 2, 10), d = rbeta(Nitem, 10, 2)),
                 list(theta = rnorm(Ntheta, 0, 1.2), b = rnorm(Nitem, 0, 1.3),
                      a = rlnorm(Nitem, 0, .2), c = rbeta(Nitem, 2, 10), d = rbeta(Nitem, 10, 2)))

# Specify the model string and compile it once, outside the loop
# (recompiling inside the loop would repeat the same compilation every replication)
stanString <- "
data{
  int<lower=1> Ntheta;
  int<lower=1> Nitem;
  int<lower=0, upper=1> y[Ntheta, Nitem];   // item responses
}
parameters{
  vector[Ntheta] theta;
  vector<lower=0>[Nitem] a;                 // discrimination is non-negative
  vector[Nitem] b;
  vector<lower=0, upper=1>[Nitem] c;        // pseudo-guessing parameter/lower asymptote
  vector<lower=0, upper=1>[Nitem] d;        // slipping parameter/upper asymptote
}
model{
  theta ~ normal(0, 1.2);                   // prior theta
  a ~ lognormal(0, .2);                     // prior discrimination parameter
  b ~ normal(0, 1.3);                       // prior difficulty parameter
  c ~ beta(2, 10);                          // prior lower asymptote
  d ~ beta(10, 2);                          // prior upper asymptote
  for (i in 1:Ntheta){
    for (j in 1:Nitem){
      y[i,j] ~ bernoulli(c[j] + ((d[j] - c[j]) * inv_logit(a[j] * (theta[i] - b[j]))));  // likelihood
    }
  }
}
"
StanModel <- stan_model(model_code = stanString)

set.seed(seed1[k])
for (r in 1:reps) {
  # Stan needs the dimensions Ntheta and Nitem in the data list along with the responses
  data[[r]] <- list(y = data[[r]], Ntheta = Ntheta, Nitem = Nitem)

  # Sample from the posterior
  StanFit <- sampling(object = StanModel,    # compiled model
                      data = data[[r]],      # data
                      init = StanInit,       # initial parameter values
                      chains = n.chains,     # number of chains
                      warmup = burn.in,      # number of burn-in draws
                      iter = chain.length,   # number of draws per chain (includes warm-up)
                      thin = thin)           # keep every thin-th draw

  # Store estimates
  model.out[[r]] <- summary(StanFit, pars = c("a", "b", "c", "d", "theta"))$summary[, c(1, 2, 3, 9, 10)]

  # Multivariate Rhat with the coda package
  item.mRhat[r] <- gelman.diag(As.mcmc.list(StanFit, pars = c("a", "b", "c", "d")),
                               confidence = 0.95, transform = FALSE, autoburnin = FALSE,
                               multivariate = TRUE)$mpsrf
  theta.mRhat[r] <- gelman.diag(As.mcmc.list(StanFit, pars = "theta"),
                                confidence = 0.95, transform = FALSE, autoburnin = FALSE,
                                multivariate = TRUE)$mpsrf
  model.mRhat[r] <- gelman.diag(As.mcmc.list(StanFit, pars = c("a", "b", "c", "d", "theta")),
                                confidence = 0.95, transform = FALSE, autoburnin = FALSE,
                                multivariate = TRUE)$mpsrf

  # Save HDIs
  model.hdi[[r]] <- hdi(rstan::extract(StanFit, pars = c("a", "b", "c", "d", "theta")))
} # end of for loop

end.time <- Sys.time()
(model.run.time <- end.time - start.time)
############################################################################
# For other cells, change the lines indicated above with the corresponding sample sizes
# and seeds. See Appendix G.
############################################################################


Appendix J: Gibbs.txt

model {
  for (i in 1:Ntotal) {
    y[i] ~ dbern(c[itemID[i]] + ((d[itemID[i]] - c[itemID[i]]) /
                 (1 + exp(-(a[itemID[i]] * (theta[personID[i]] - b[itemID[i]]))))))
  }
  for (j in 1:Nitem) { a[j] ~ dlnorm(0, 25) }   # 1/.04 = 25 is the precision
  for (j in 1:Nitem) { b[j] ~ dnorm(0, 1/1.69) }
  for (j in 1:Nitem) { c[j] ~ dbeta(2, 10) }
  for (j in 1:Nitem) { d[j] ~ dbeta(10, 2) }
  for (p in 1:Ntheta) { theta[p] ~ dnorm(0, 1/1.44) }
}
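The dbern argument above, like the bernoulli statement in the Stan program of Appendix I, encodes the four-parameter logistic item response function. For reference, a standalone R version (an illustrative helper, not used by the estimation scripts):

# P(y = 1 | theta) = c + (d - c) / (1 + exp(-a * (theta - b)))
p4pl <- function(theta, a, b, c, d) {
  c + (d - c) / (1 + exp(-a * (theta - b)))
}
p4pl(theta = 0, a = 1.2, b = 0, c = .15, d = .88)   # = .515 when theta equals b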


Appendix K: R Script for Gibbs Estimations

rm(list = ls())
# setwd("D:/Dissertation/Chapter 3/Official calibration")  # change this to the folder to save results
memory.limit(1000000)  # increase memory limit if needed
start.time <- Sys.time()

# Install and load necessary packages
packages.needed <- c("SimDesign", "truncnorm", "HDInterval", "coda", "tidyr")
new.packages <- packages.needed[!(packages.needed %in% installed.packages()[,"Package"])]
if (length(new.packages)) install.packages(new.packages, dependencies = TRUE)
library(SimDesign); library(coda); library(HDInterval); library(rjags)
library(tidyr); library(runjags)

# Set the simulation conditions
reps <- 50
Ntheta <- 1000
Nitem <- 20
n.chains <- 4                # number of chains
n.adapt <- 1000              # tuning iterations
thin <- 1                    # thinning option in MCMC
burn.in <- 20000             # burn-in segment
chain.length <- 30000/thin   # number of iterations for each chain after the burn-in
ID <- seq(from = 1, to = Ntheta, by = 1)   # each respondent has an ID

# Prepare empty objects to store data and model results
data <- list()       # response data in long format
dataList <- list()   # data list for JAGS
JFit <- list()       # JAGS posterior after burn-in
model.out <- list()

# Vectors to store multivariate Rhat
item.mRhat <- vector(length = reps)
theta.mRhat <- vector(length = reps)
model.mRhat <- vector(length = reps)

# Load true parameters and data matrices
load("Cell1.RData")

# Generate seeds for the loop. Don't change this "1804" seed.
totalconditions <- 18
k <- 1
set.seed(1804)
seed1 <- trunc(runif(totalconditions, 100000, 500000))

# Generate initial values
# There are 4 chains, so there are 4 lists of initial values (n.chains = 4)
set.seed(seed1[k])
jInit <- list(list(theta = rnorm(Ntheta, 0, 1.2), b = rnorm(Nitem, 0, 1.3),
                   a = rlnorm(Nitem, 0, .2), c = rbeta(Nitem, 2, 10), d = rbeta(Nitem, 10, 2),
                   .RNG.name = "base::Mersenne-Twister"),
              list(theta = rnorm(Ntheta, 0, 1.2), b = rnorm(Nitem, 0, 1.3),
                   a = rlnorm(Nitem, 0, .2), c = rbeta(Nitem, 2, 10), d = rbeta(Nitem, 10, 2),
                   .RNG.name = "base::Mersenne-Twister"),
              list(theta = rnorm(Ntheta, 0, 1.2), b = rnorm(Nitem, 0, 1.3),
                   a = rlnorm(Nitem, 0, .2), c = rbeta(Nitem, 2, 10), d = rbeta(Nitem, 10, 2),
                   .RNG.name = "base::Mersenne-Twister"),
              list(theta = rnorm(Ntheta, 0, 1.2), b = rnorm(Nitem, 0, 1.3),
                   a = rlnorm(Nitem, 0, .2), c = rbeta(Nitem, 2, 10), d = rbeta(Nitem, 10, 2),
                   .RNG.name = "base::Mersenne-Twister"))

# The outer loop begins here
set.seed(seed1[k])
for (r in 1:reps) {
  # Reshape each response matrix to long format
  data[[r]] <- tidyr::gather(as.data.frame(cbind(ID, data[[r]])), key = Item,
                             value = Response, Item_1:Item_20, factor_key = TRUE)
  data[[r]] <- data[[r]][order(data[[r]]$ID), ]

  # Assemble data into a list for JAGS
  personID <- as.numeric(factor(data[[r]]$ID))
  itemID <- as.numeric(factor(data[[r]]$Item))
  Ntotal <- nrow(data[[r]])
  dataList[[r]] <- list(y = data[[r]][,3], itemID = itemID, personID = personID,
                        Nitem = Nitem, Ntheta = Ntheta, Ntotal = Ntotal)

  # Compile and run the model
  JFit <- run.jags(method = "parallel",
                   model = "Gibbs.txt",
                   monitor = c("a", "b", "c", "d", "theta"),
                   data = dataList[[r]],
                   inits = jInit,
                   n.chains = n.chains,
                   adapt = n.adapt,
                   burnin = burn.in,
                   sample = chain.length,
                   thin = thin,
                   summarise = FALSE,
                   plots = FALSE)

  # Save results from the posterior probability distribution
  model.out[[r]] <- summary(JFit)[, c(1, 3, 4, 5, 7, 9, 11)]
  item.mRhat[r] <- gelman.diag(as.mcmc.list(JFit)[, 1:(4*Nitem)], transform = FALSE,
                               autoburnin = FALSE, multivariate = TRUE)$mpsrf
  theta.mRhat[r] <- gelman.diag(as.mcmc.list(JFit)[, (4*Nitem + 1):(4*Nitem + Ntheta)],
                                transform = FALSE, autoburnin = FALSE, multivariate = TRUE)$mpsrf
  model.mRhat[r] <- gelman.diag(as.mcmc.list(JFit), transform = FALSE, autoburnin = FALSE,
                                multivariate = TRUE)$mpsrf
} # end of for loop

end.time <- Sys.time()
(model.run.time <- end.time - start.time)
############################################################################
# For other cells, change the lines indicated above with the corresponding sample sizes
# and seeds. See Appendix G.
############################################################################


Appendix L: R Script for Follow-up (MML Converged Only Technically)

rm(list = ls())
# setwd("D:/Dissertation/Chapter 3/Official calibration")  # change this to the folder to save results
memory.limit(1000000)  # increase memory limit if needed

# Install and load necessary packages
packages.needed <- c("mirt", "truncnorm")
new.packages <- packages.needed[!(packages.needed %in% installed.packages()[,"Package"])]
if (length(new.packages)) install.packages(new.packages, dependencies = TRUE)
library(mirt); library(truncnorm)

# Set the simulation conditions
reps <- 50   # actual number of replications for converged models
Ntheta <- 1000
Nitem <- 20
quad.points <- 41
tolerance <- .001

# Generate seeds for the cell. Don't change this 1804 seed.
totalconditions <- 18
k <- 1
set.seed(1804)
seed1 <- trunc(runif(totalconditions, 500000, 600000))

# Generate theta and initial values
set.seed(seed1[k])
theta.true <- matrix(rnorm(Ntheta, 0, 1), nrow = Ntheta, ncol = 1)

# Prepare empty objects to store data and model results
data.temp <- list()   # temp = temporary
FourPL.temp <- list()
a.hat.temp <- vector(length = Nitem)
b.hat.temp <- vector(length = Nitem)
c.hat.temp <- vector(length = Nitem)
d.hat.temp <- vector(length = Nitem)
data <- list()        # store data good for convergence
FourPL <- list()      # store converged model results

# Matrices to store true item parameters
a.true.temp <- vector(length = Nitem)
b.true.temp <- vector(length = Nitem)
c.true.temp <- vector(length = Nitem)
d.true.temp <- vector(length = Nitem)
a.true <- matrix(nrow = Nitem, ncol = reps)   # discrimination parameter
b.true <- matrix(nrow = Nitem, ncol = reps)   # difficulty parameter
c.true <- matrix(nrow = Nitem, ncol = reps)   # lower asymptote
d.true <- matrix(nrow = Nitem, ncol = reps)   # upper asymptote

# Vectors to store information about convergence
tech.convergence <- vector()   # TRUE = converged, FALSE = not converged technically in mirt
identification <- vector()
# drift <- vector()
# drift.vector <- vector()
convergence <- vector()        # practically converged

# Now collect results for 50 successfully calibrated models
count <- 0   # count how many models were identified and converged
r <- 1
set.seed(seed1[k])
while (count < reps) {
  a.true.temp <- rtruncnorm(Nitem, a = 0, b = Inf, mean = 1.2, sd = .35)   # discrimination parameter
  b.true.temp <- rnorm(Nitem, 0, 1)                                        # difficulty parameter
  c.true.temp <- rtruncnorm(Nitem, a = 0, b = Inf, mean = .15, sd = .04)   # pseudo-guessing parameter (lower asymptote)
  d.true.temp <- rtruncnorm(Nitem, a = 0, b = 1, mean = .88, sd = .04)     # slipping parameter (upper asymptote)

  data.temp <- simdata(a = a.true.temp,
                       d = (-a.true.temp * b.true.temp),   # d here is the intercept parameter, equal to -a*b
                       itemtype = '4PL',
                       guess = c.true.temp,
                       upper = d.true.temp,
                       Theta = theta.true)
  IRTmodel <- 'F1 = Item_1-Item_20'
  model <- mirt.model(IRTmodel, itemnames = colnames(data.temp))
  FourPL.temp <- mirt(data = data.temp, model = model, dentype = 'Gaussian',
                      method = "EM", itemtype = '4PL', SE = TRUE,
                      technical = list(NCYCLES = 10000),
                      TOL = tolerance, quadpts = quad.points)

  # Store model convergence and identification information.
  # The first condition tells us whether the mirt model converged technically.
  # The second condition makes sure the model is identified.
  # The third condition (removal of implausible estimates) is disabled in this follow-up.
  tech.convergence[r] <- extract.mirt(FourPL.temp, "converged")
  identification[r] <- length(coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE,
                                   simplify = FALSE, printSE = TRUE)[[1]][,1])
  # for (j in 1:Nitem) {
  #   a.hat.temp[j] <- coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE, simplify = FALSE, printSE = TRUE)[[j]][1,1]
  #   b.hat.temp[j] <- coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE, simplify = FALSE, printSE = TRUE)[[j]][1,2]
  #   c.hat.temp[j] <- coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE, simplify = FALSE, printSE = TRUE)[[j]][1,3]
  #   d.hat.temp[j] <- coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE, simplify = FALSE, printSE = TRUE)[[j]][1,4]
  #   ifelse(a.hat.temp[j] <= 3 & abs(b.hat.temp[j]) <= 6 & c.hat.temp[j] < d.hat.temp[j],
  #          drift[j] <- 0, drift[j] <- 1)
  # }
  # drift.vector[r] <- mean(drift)
  ifelse(tech.convergence[r] == "TRUE" & identification[r] == 2,
         convergence[r] <- 1, convergence[r] <- 0)
  if (convergence[r] == 1) {
    count <- count + 1
    data[[count]] <- data.temp
    FourPL[[count]] <- FourPL.temp
    a.true[,count] <- a.true.temp
    b.true[,count] <- b.true.temp
    c.true[,count] <- c.true.temp
    d.true[,count] <- d.true.temp
  }
  r <- r + 1
} # end of while loop
############################################################################
# For other cells, change the lines indicated above with the corresponding sample sizes,
# theta distribution, and random seeds. See Appendix G.
############################################################################


Appendix M: R Script for Follow-up (Skewed θ Distribution with More High-Ability Respondents)

rm(list = ls())
# setwd("D:/Dissertation/Chapter 3/Official calibration")  # change this to the folder to save results
memory.limit(1000000)  # increase memory limit if needed

# Install and load necessary packages
packages.needed <- c("mirt", "truncnorm")
new.packages <- packages.needed[!(packages.needed %in% installed.packages()[,"Package"])]
if (length(new.packages)) install.packages(new.packages, dependencies = TRUE)
library(mirt); library(truncnorm); library(psych)

# Set the simulation conditions
reps <- 50   # actual number of replications for converged models
Ntheta <- 1000
Nitem <- 20
quad.points <- 41
tolerance <- .001

# Generate seeds for the cell. Don't change this 1804 seed.
totalconditions <- 18
k <- 10
set.seed(1804)
seed1 <- trunc(runif(totalconditions, 100000, 500000))

# Generate theta: negatively skewed with additional high-ability respondents.
# Keep drawing until the mixture meets the target skewness and kurtosis.
count <- 0
set.seed(seed1[k])
while (count < 1) {
  theta.skewed <- matrix(scale(rbeta(100000, 8.4, 2)),   # negative skewness
                         nrow = Ntheta, ncol = 1)
  theta.norm <- matrix(rnorm(100000, 0, 1), nrow = Ntheta, ncol = 1)
  upper.tail <- theta.norm[which(theta.norm[,1] > 1.5), ]
  theta.true <- scale(c(theta.skewed, upper.tail))
  shape <- describe(theta.true)
  if (shape$skew < -.595 & shape$skew > -.605 & shape$kurtosis < .5) {count <- count + 1}
} # end of while loop
describe(theta.true)
plot(density(theta.true))
theta.old <- scale(rbeta(100000, 8.2, 2.8))   # main-simulation distribution for comparison
lines(density(theta.old))
theta.true <- as.matrix(sample(theta.true, size = Ntheta, replace = F),
                        nrow = Ntheta, ncol = 1)   # this is the matrix to feed data generation
theta.true[which(theta.true[,1] > 2), ]            # inspect the upper tail

# Prepare empty objects to store data and model results
data.temp <- list()   # temp = temporary
FourPL.temp <- list()
a.hat.temp <- vector(length = Nitem)
b.hat.temp <- vector(length = Nitem)
c.hat.temp <- vector(length = Nitem)
d.hat.temp <- vector(length = Nitem)
data <- list()        # store data good for convergence
FourPL <- list()      # store converged model results

# Matrices to store true item parameters
a.true.temp <- vector(length = Nitem)
b.true.temp <- vector(length = Nitem)
c.true.temp <- vector(length = Nitem)
d.true.temp <- vector(length = Nitem)
a.true <- matrix(nrow = Nitem, ncol = reps)   # discrimination parameter
b.true <- matrix(nrow = Nitem, ncol = reps)   # difficulty parameter
c.true <- matrix(nrow = Nitem, ncol = reps)   # lower asymptote
d.true <- matrix(nrow = Nitem, ncol = reps)   # upper asymptote

# Vectors to store information about convergence
tech.convergence <- vector()   # TRUE = converged, FALSE = not converged technically in mirt
identification <- vector()
drift <- vector()
drift.vector <- vector()
convergence <- vector()        # practically converged

# Now collect results for 50 successfully calibrated models
count <- 0   # count how many models were identified and converged
r <- 1
set.seed(seed1[k])
while (count < reps) {
  a.true.temp <- rtruncnorm(Nitem, a = 0, b = Inf, mean = 1.2, sd = .35)   # discrimination parameter
  b.true.temp <- rnorm(Nitem, 0, 1)                                        # difficulty parameter
  c.true.temp <- rtruncnorm(Nitem, a = 0, b = Inf, mean = .15, sd = .04)   # pseudo-guessing parameter (lower asymptote)
  d.true.temp <- rtruncnorm(Nitem, a = 0, b = 1, mean = .88, sd = .04)     # slipping parameter (upper asymptote)

  data.temp <- simdata(a = a.true.temp,
                       d = (-a.true.temp * b.true.temp),   # d here is the intercept parameter, equal to -a*b
                       itemtype = '4PL',
                       guess = c.true.temp,
                       upper = d.true.temp,
                       Theta = theta.true)
  IRTmodel <- 'F1 = Item_1-Item_20'
  model <- mirt.model(IRTmodel, itemnames = colnames(data.temp))
  FourPL.temp <- mirt(data = data.temp, model = model, dentype = 'Gaussian',
                      method = "EM", itemtype = '4PL', SE = TRUE,
                      technical = list(NCYCLES = 10000),
                      TOL = tolerance, quadpts = quad.points)

  # Store model convergence and identification information.
  # The first condition tells us whether the mirt model converged technically.
  # The second condition makes sure the model is identified.
  # The third condition removes implausible estimates.
  tech.convergence[r] <- extract.mirt(FourPL.temp, "converged")
  identification[r] <- length(coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE,
                                   simplify = FALSE, printSE = TRUE)[[1]][,1])
  for (j in 1:Nitem) {
    a.hat.temp[j] <- coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE, simplify = FALSE, printSE = TRUE)[[j]][1,1]
    b.hat.temp[j] <- coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE, simplify = FALSE, printSE = TRUE)[[j]][1,2]
    c.hat.temp[j] <- coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE, simplify = FALSE, printSE = TRUE)[[j]][1,3]
    d.hat.temp[j] <- coef(FourPL.temp, IRTpars = TRUE, rawug = FALSE, simplify = FALSE, printSE = TRUE)[[j]][1,4]
    ifelse(a.hat.temp[j] <= 3 & abs(b.hat.temp[j]) <= 6 & c.hat.temp[j] < d.hat.temp[j],
           drift[j] <- 0, drift[j] <- 1)
  }
  drift.vector[r] <- mean(drift)
  ifelse(tech.convergence[r] == "TRUE" & identification[r] == 2 & drift.vector[r] == 0,
         convergence[r] <- 1, convergence[r] <- 0)
  if (convergence[r] == 1) {
    count <- count + 1
    data[[count]] <- data.temp
    FourPL[[count]] <- FourPL.temp
    a.true[,count] <- a.true.temp
    b.true[,count] <- b.true.temp
    c.true[,count] <- c.true.temp
    d.true[,count] <- d.true.temp
  }
  r <- r + 1
} # end of while loop
############################################################################
# For other cells, change the lines indicated above with the corresponding sample sizes
# and random seeds. See Appendix G.
############################################################################
