BAYESIAN MODEL CHECKING STRATEGIES FOR

DICHOTOMOUS ITEM RESPONSE THEORY MODELS

Sherwin G. Toribio

A Dissertation

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

August 2006

Committee:

James H. Albert, Advisor

William H. Redmond, Graduate Faculty Representative

John T. Chen

Craig L. Zirbel

ABSTRACT

James H. Albert, Advisor

Item Response Theory (IRT) models are commonly used in educational and psychological testing. These models are mainly used to assess the latent abilities of examinees and the effectiveness of the test items in measuring this underlying trait. However, model checking in Item Response Theory is still an underdeveloped area. In this dissertation, various model checking strategies from a Bayesian perspective for different Item Response models are presented. In particular, three methods are employed to assess the goodness-of-fit of different IRT models. First, Bayesian residuals and different residual plots are introduced to serve as graphical procedures to check for model fit and to detect outlying items and examinees. Second, the idea of predictive distributions is used to construct reference distributions for different test quantities and discrepancy measures, including the standard deviation of point bi-serial correlations, Bock's Pearson-type chi-square index, Yen's Q1 index, the Hosmer-Lemeshow index, McKinley and Mills' G² index, Orlando and Thissen's S−G² and S−X² indices, Wright and Stone's W-statistic, and the log-likelihood statistic. The prior, posterior, and partial posterior predictive distributions are discussed and employed. Finally, Bayes factors are used to compare different IRT models in model selection and detection of outlying discrimination parameters. In this topic, different numerical procedures to estimate the Bayes factors for these models are discussed. All of these proposed methods are illustrated using simulated data and Mathematics placement exam data from BGSU.


ACKNOWLEDGMENTS

First of all, I would like to thank Dr. Jim Albert, my advisor, for his constant support and many suggestions throughout this research. I also wish to thank him for the friendship and all the advice that he shared about life in general. I also want to extend my gratitude to the other members of my committee, Dr. John Chen, Dr. Craig Zirbel, and Dr. William

Redmond, for their time and advice.

I am grateful to the Department of Mathematics and Statistics for all the support and for providing a wonderful research environment. I especially wish to thank Marcia Seubert,

Cyndi Patterson, and Mary Busdeker for all their help. The dissertation fellowship for the period 2005-2006 was crucial to the completion of this work.

I wish to thank my colleagues and friends from BG, Joel, Vhie, Merly, Florence,

Dhanuja, Kevin, Mike, Khairul and Shapla, and all the other Pinoys for all the fun and interesting discussions.

Finally, I thank my beloved wife, Alie, for all her support, love, and patience, and

Simone for bringing all the joy and happiness in our lives during our stay in Bowling Green.

Without them this work could never have come to existence.

Sherwin G. Toribio

Bowling Green, Ohio, August 2006


TABLE OF CONTENTS

CHAPTER 1: ITEM RESPONSE THEORY MODELS
1.1 Introduction
1.2 Item Response Curve
1.3 Common IRT Models
1.3.1 One-Parameter Model
1.3.2 Two-Parameter Model
1.3.3 Three-Parameter Model
1.3.4 Exchangeable IRT Model
1.4 Parameter Estimation
1.4.1 Likelihood Function
1.4.2 Joint Maximum Likelihood Estimation
1.4.3 Bayesian Estimation
1.4.4 Albert's Gibbs Sampler
1.5 An Example - BGSU Mathematics Placement Exam
1.6 Advantages of the Bayesian Approach

CHAPTER 2: MODEL CHECKING METHODS FOR BINARY AND IRT MODELS
2.1 Introduction
2.2 Residuals
2.2.1 Classical Residuals
2.2.2 Bayesian Residuals
2.3 Chi-squared Tests for Goodness-of-fit of IRT Models
2.3.1 Wright and Panchapakesan Index (WP)
2.3.2 Bock's Index (B)
2.3.3 Yen's Index (Q1)
2.3.4 Hosmer and Lemeshow Index (HL)
2.3.5 McKinley and Mills Index (G²)
2.3.6 Orlando and Thissen Indices (S − χ² and S − G²)
2.4 Discrepancy Measures and Test Quantities
2.5 Predictive Distributions
2.5.1 Prior Predictive Distribution
2.5.2 Posterior Predictive Distribution
2.5.3 Conditional Predictive Distribution
2.5.4 Partial Posterior Predictive Distribution
2.6 Bayes Factor

CHAPTER 3: OUTLIER DETECTION IN IRT MODELS USING BAYESIAN RESIDUALS
3.1 Introduction
3.2 Detecting Misfitted Items Using IRC Interval Band
3.3 Detecting Guessers
3.3.1 Examinee Bayesian Residual Plots
3.3.2 Examinee Bayesian Latent Residual Plots
3.4 Detecting Misfitted Examinees
3.5 Application To Real Data Set

CHAPTER 4: ASSESSING THE GOODNESS-OF-FIT OF IRT MODELS USING PREDICTIVE DISTRIBUTIONS
4.1 Introduction
4.2 Checking the Appropriateness of the One-parameter Probit IRT Model
4.2.1 Point Biserial Correlation
4.2.2 Using Prior Predictive
4.2.3 Using Posterior Predictive
4.3 Item Fit Analysis
4.3.1 Using Prior Predictive
4.3.2 Using Posterior Predictive
4.3.3 Using Partial Posterior Predictive
4.4 Examinee Fit Analysis
4.4.1 Discrepancy Measures for Person Fit
4.4.2 Detecting Guessers using Posterior Predictive
4.5 Application To Real Data Set

CHAPTER 5: BAYESIAN METHODS FOR IRT MODEL SELECTION
5.1 Introduction
5.2 Checking the Beta-Binomial Model using Bayes Factors
5.2.1 Beta-Binomial Model
5.2.2 Bayes Factor
5.2.3 Laplace Method for Integration
5.2.4 Estimating the Bayes Factor
5.2.5 Application to Real Data
5.2.6 Approximating the Denominator of the Bayes Factor
5.2.7 Using Importance Sampling
5.3 Exchangeable IRT Model
5.3.1 Approximating the One-parameter Model
5.3.2 Approximating the Two-parameter Model
5.4 IRT Model Comparisons and Model Selection
5.4.1 Computing the Bayes Factor for IRT Models
5.4.2 IRT Model Comparison
5.5 Finding Outlying Discrimination Parameters
5.5.1 Using Bayes Factor
5.5.2 Using Mixture Prior Density
5.6 Application To Real Data Set

CHAPTER 6: SUMMARY AND CONCLUSIONS

Appendix A: NUMERICAL METHODS
A.1 Newton-Raphson for IRT Models
A.2 Markov Chain Monte Carlo (MCMC)
A.2.1 Metropolis-Hastings
A.2.2 Gibbs Sampling
A.2.3 Importance Sampling

Appendix B: MATLAB PROGRAMS
B.1 Chapter 1 codes
B.2 Chapter 3 codes
B.3 Chapter 4 codes
B.4 Chapter 5 codes

REFERENCES

LIST OF FIGURES

1.1 A typical item response curve.
1.2 Item response curves for 3 different difficulty values.
1.3 Item response curves for 3 different discrimination values.
1.4 Items with high discrimination power have higher chances of distinguishing two examinees with different ability scores than items with low discrimination power.
1.5 Scatterplots of 35 actual item parameters versus their corresponding estimates.
1.6 Scatterplot of 1000 actual ability scores versus their corresponding estimates.
1.7 Scatterplots of 35 actual item parameters versus their corresponding Bayesian estimates.
1.8 Scatterplot of 1000 actual ability scores versus their corresponding Bayesian estimates.
1.9 Summary plot of the JML estimates of the parameters of the 35 items in the BGSU Math placement exam.
1.10 Scatterplot of the JML estimates of the ability scores versus their corresponding exam raw scores.
1.11 Summary plot of the Bayesian estimates of the parameters of the 35 items in the BGSU Math placement exam.
1.12 Scatterplot of the Bayesian estimates of the ability scores versus their corresponding exam raw scores.
1.13 Scatterplots that compare the Bayesian estimates with the JMLE estimates of the item parameters.
1.14 A scatterplot that depicts a strong correlation between the Bayesian and JMLE estimates of the ability scores.
2.1 Classical Residual Plot.
3.1 A 90% interval band for the fitted item response curves of items 15 and 30 using the Two-parameter IRT model.
3.2 A 90% interval band for the item response curves of items 10 (above) and 26 (below) fitted with the (left) One-parameter IRT model and (right) Two-parameter IRT model.
3.3 Posterior residual plots of items 10 (above) and 26 (below) fitted with the (left) One-parameter IRT model and (right) Two-parameter IRT model.
3.4 Examinee residual plot of someone with ability score θ = −0.09.
3.5 Examinee residual plots of examinees with ability scores of θ = −1.15 (left) and θ = −2.19 (right).
3.6 Examinee residual plots of examinees with ability scores of θ = 1.22 (left) and θ = 2.2 (right).
3.7 Examinee residual plots of two guessers.
3.8 Examinee latent residual plot of an examinee with ability score θ = −0.09.
3.9 Examinee latent residual plots of examinees with ability scores of θ = −1.15 (left) and θ = −2.19 (right).
3.10 Examinee latent residual plots of examinees with ability scores of θ = 1.22 (left) and θ = 2.2 (right).
3.11 Examinee latent residual plots of two guessers.
3.12 Histograms of the number of examinees (out of 1000) who scored (left) much too high and (right) much too low.
3.13 Examinee residual and latent residual plots of examinee no. 185.
3.14 Residual and latent residual plots of examinees no. 802 (above) and 854 (below).
3.15 Examinee residual and latent residual plots of examinee no. 500.
3.16 IRC band and posterior residual plot of item 15.
3.17 Item response curves of item 21 (above) and item 30 (below).
4.1 Histogram of 500 simulated values of std(r-pbis) using the prior predictive distribution.
4.2 This histogram of the 100 simulated prior predictive p-values illustrates that the distribution of the prior p-value of std(rpbis) is close to uniform[0,1].
4.3 Histograms of 100 observed std(rpbis) when data sets were generated using the (left) two-parameter and (right) one-parameter model.
4.4 Histogram of 500 simulated values of std(r-pbis).
4.5 Histogram of 100 posterior predictive p-values.
4.6 Residual plots of the two guessers, examinees 236 and 777.
4.7 Residual plots of examinee 529.
4.8 Histogram of the 995 non-guessers.
4.9 Histogram of 500 simulated values of std(r-pbis).
4.10 The 90% interval bands for the item response curves of items 11 (upper left), 30 (upper right), 33 (lower left), and 34 (lower right) fitted with the one-parameter IRT model.
4.11 The 90% interval bands for the item response curves of items 14 (left) and 15 (right) fitted with the one-parameter IRT model.
4.12 Latent residual plots of six students marked as potential guessers by the W and L statistics using the posterior predictive distribution.
5.1 Scatterplots of the exact values versus the approximate values of the log-denominator of the Bayes factor.
5.2 Parameter estimates obtained using the exchangeable model compared with the actual values: (left) item difficulty, and (right) ability scores.
5.3 Item parameter and ability score estimates obtained using the exchangeable model compared with the observed data: (left) item difficulty vs. number of correct students, and (right) ability scores vs. students' raw scores.
5.4 Scatterplot of the discrimination estimates obtained using the Exchangeable model and the Two-parameter model.
5.5 Estimates obtained using the two exchangeable models (one with random sa and one with fixed sa = 0.25) are compared: (left) item difficulty; (right) ability scores.
5.6 Estimates obtained using the One-parameter model and the exchangeable model with fixed sa = 0.01 are compared: (left) item difficulty; (right) item discrimination.
5.7 Estimates obtained using the One-parameter model and the exchangeable model with fixed sa = 10 are compared: (left) item difficulty; (right) ability scores.
5.8 Scatterplot of estimates of ability scores obtained using the One-parameter model and the exchangeable model with fixed sa = 10.
5.9 Histogram of 100 log10 BF of the Exchangeable model (sa = 0.25) vs. (left) the Two-parameter model and (right) the One-parameter model.
5.10 Values of log10 BF of exchangeable models with varying standard deviations compared to the approximate Two-parameter model. The right plot is a closer look at the peak of the graph.
5.11 Values of log10 BF of exchangeable models with varying standard deviations compared to the approximate One-parameter model. The right plot is a closer look at the peak of the graph.
5.12 (left) Scatterplot of the actual vs. estimated item discrimination parameters. (right) Estimated probability of each item having an outlying discrimination parameter. Note that items 10, 20, and 30 have much bigger probabilities than the rest.
5.13 Values of log10 BF of exchangeable models with varying standard deviations compared to the two-parameter model using the BGSU Math placement data set. The right plot is a closer look at the peak of the graph.
5.14 Histogram of the 1000 posterior sample values of µa for the BGSU Math placement data using the exchangeable model with sa = 0.148.
5.15 Histogram of the 1000 posterior sample values of µa for the BGSU Math placement data using the exchangeable model with sa = 0.148.

LIST OF TABLES

1.1 First and Second Derivatives of Item and Ability Parameters for the Two-Parameter Logistic Model.
1.2 Two extreme questions in the exam.
2.1 Levels of evidence by log10 BF.
4.1 Orlando and Thissen (2000) simulation results: proportion of significant p-values (< 0.05).
4.2 Percentage of p-values < 0.05 out of 100.
4.3 Percentage of p-values < 0.05 out of 100.
4.4 Percentage of significant p-values when the one-parameter probit model is used on items with no guessing parameter (c = 0).
4.5 Percentage of significant p-values when the one-parameter probit model is used on items with guessing parameter value of c = 0.25.
4.6 Percentage of significant p-values when the two-parameter probit model is used on items with no guessing parameter (c = 0).
4.7 Percentage of significant p-values when the two-parameter probit model is used on items with guessing parameter value of c = 0.25.
4.8 Percentage of p-values < 0.05 out of 100 using G² (pp1 and pp2 represent the one-parameter and two-parameter probit model).
4.9 The 17 misfitted examinees with PW < 0.05 (* signifies a guesser).
4.10 The 16 misfitted examinees with PL < 0.05 (* signifies a guesser).
4.11 The percentage of PL and PW < 0.05 (* signifies a guesser).
5.1 Twenty simulated observations from the Beta-binomial model.
5.2 Twenty generated binomial observations.
5.3 Values of log10 BF.
5.4 Levels of evidence by log10 BF.
5.5 Barry Bonds' hitting data from 1986 to 2005.
5.6 The log10 BF(M_l-out/M) for each item in the artificial data. Note that the values of log10 BF(M_l-out/M) for items 10, 20, and 30 are all bigger than 3, marking them as items with outlying discrimination parameters.
5.7 The γ̂ for each item represents the likelihood that its discrimination parameter is outlying. Note that the values of γ̂ for items 10, 20, and 30 are all much bigger than the rest, marking them as items with outlying discrimination parameters.
5.8 Bayesian estimates of â_j, log10 BF, and γ for the BGSU Math placement exam.

OVERVIEW

The focus of this dissertation is to discuss the available model diagnostic procedures for Item Response Theory (IRT) models and to propose new methodologies to assess the goodness-of-fit of these models. The first two chapters cover the material needed to understand the different IRT models and some Bayesian ideas which will be utilized later.

Chapters 3, 4, and 5 cover the proposed Bayesian methodologies to assess the goodness-of-fit of the IRT models.

In Chapter 1, the different IRT models used in this work are introduced. Classical and

Bayesian methods to estimate the parameters in the IRT models are also discussed. This includes discussions of some numerical methods like Newton-Raphson and Gibbs Sampling.

These methods are illustrated using a Mathematics placement data set from BGSU. The chapter ends with a discussion on the advantages of the Bayesian estimation method over the classical estimation method.

Chapter 2 covers the ideas of classical and Bayesian residuals. The concept of residuals is used to construct different chi-squared indices which are currently being used to check model

fit of IRT models. These different indices are used later within a Bayesian framework as discrepancy measures. The idea of predictive distributions and measures of surprise are also discussed in this chapter. These standard Bayesian ideas are useful to construct reference distributions for different test quantities and discrepancy measures. Another important

Bayesian concept that will be employed later is the Bayes factor, which is introduced in the last section of this chapter.


Chapter 3 deals mostly with graphical procedures that can be used to assess the fit of the IRT models. These visual diagnostic plots are constructed based on Bayesian residuals.

The item response curve probability interval band proposed by Albert (1999) is a simple but very useful plot to check for item fit. This plot is described in the first section. Two other diagnostic plots, the examinee Bayesian residual and latent residual plots, are proposed in the second section. These two plots will be utilized to check how a particular examinee performed in the test. They may also help detect examinees who were simply guessing their responses. In the third section, a Bayesian procedure to detect examinees who scored much too low or much too high in the exam is proposed based on another Bayesian residual. These

Bayesian methods and plots were applied to a real data set in the last section.

In Chapter 4, new quantitative methods are proposed to give objective assessments of the fit of IRT models. In particular, the prior, posterior, and partial posterior predictive distributions are used to construct reference distributions for the standard deviation of the item point-biserial correlations, and for eight different discrepancy measures − the six χ²-indices described in Chapter 2 and two more discrepancy measures for person fit. A simulation study is performed to illustrate and compare the effectiveness of these different discrepancy measures and different predictive distributions in detecting misfitted items and examinees, as well as the overall model misfit. The chapter ends with the application of these predictive methods to a real data set.

In Chapter 5, the Bayes factor is used to illustrate a quantitative method for comparing goodness-of-fit of different IRT models and for model selection. The first section of this chapter covers different numerical methods that could be used to calculate the Bayes factor.

These methods are then modified and applied to estimate the Bayes factor for IRT models in later sections. This Bayes factor is used to choose between competing IRT models and for the detection of outlying discrimination parameters. The effectiveness of this method will be illustrated using simulated data. Again, these methods are applied to a real data set in the last section.

Finally, the last chapter gives a summary of all the proposed Bayesian methods along with discussions regarding their performances in the assessment of goodness-of-fit of different

IRT models.

CHAPTER 1 ITEM RESPONSE THEORY MODELS

1.1 Introduction

Item Response Theory (IRT) models are commonly used in Educational and Psychological testing. In these fields of study, researchers are usually interested in measuring the underlying ability of examinees such as intelligence, mathematical abilities, or scholastic abilities. However, these kinds of quantities cannot be measured directly as one measures physical attributes like weight or height. In this sense, these underlying abilities are latent traits. One of the main objectives of IRT is to measure the amount of (latent) ability that an examinee possesses. This is usually done using a questionnaire or an examination. It is important that the items used in the questionnaire or test are appropriate to accurately and effectively measure the underlying trait. Consequently, the second main objective of IRT is to study the effectiveness of different test items in measuring a particular underlying trait.

Although the idea of IRT has been around for almost a century now, it only became popular in the last two decades. This is mainly due to the extensive computational requirements of the IRT methods. Up until the 1980's, the Classical Test Theory (CTT) had been the mainstay of psychological and educational test development and test score analysis. The classic book of Gulliksen (1950) is often cited as the defining volume for CTT. Today there are countless numbers of achievement, aptitude, and personality tests that were constructed using CTT models and procedures.

However, there are many well-documented shortcomings of the ways in which educational and psychological tests are usually constructed, evaluated, and used within the CTT (Hambleton & van der Linden, 1982). For one, the values of commonly used item statistics in test development, such as item difficulty, depend on the particular sample of examinees from which they were obtained. That is, one particular item can be labeled as easy when given to a group of well-prepared students and as difficult when given to a group of unprepared students. For more information about the shortcomings of CTT, see the book by Hambleton and Swaminathan (1985).

By the late 1980's, the power of computers had developed to a point where it allowed people working in measurement theory to employ the more computationally intensive methods of IRT. This "new" theory is conceptually more powerful than CTT [11]. Based upon items rather than test scores, IRT addresses most of the shortcomings of the CTT. In other words, IRT can do all the things that CTT can do and more. An extensive comparison between these two theories is discussed in the book by Embretson and Reise (2000).

1.2 Item Response Curve

In this dissertation, the latent ability of examinees (usually denoted by θ) is assumed to be continuous and one-dimensional. That means that the performance of an examinee on a particular item of an exam depends only on this one characteristic. Theoretically, the range of this latent variable is from negative infinity to positive infinity. But for most practical purposes, it is sufficient to limit this range between −3 and 3. An examinee with a higher ability score is expected to perform better in answering a particular item in the test compared to an examinee with a lower ability score.

In the case where items in the test can only be answered either correctly or incorrectly, let y denote the examinee's response to a particular item, and take y = 1 if the response is correct and y = 0 if incorrect. This is a Bernoulli random variable with success probability p that depends on the latent ability of the examinee. That is,

p = Pr(y = 1) = F(θ), where F represents a known function, called the link function.

Because p should be an increasing function of θ and is supposed to take on values between 0 and 1, a natural class for the function F is provided by the class of cumulative distribution functions, or cdf's. The two most commonly used link functions in IRT models are:

1. Probit link (standard normal cdf):
\[ F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}t^2}\, dt, \qquad x \in \mathbb{R} \]

2. Logistic link (standard logistic distribution function):
\[ F(x) = \frac{e^x}{1 + e^x}, \qquad x \in \mathbb{R} \]
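As a small numerical aside, both link functions are easy to evaluate in Matlab (the language of the programs in Appendix B). The sketch below is only an illustration and is not part of the Appendix programs; the probit link is written through the base function erfc so that no additional toolbox is required.

    % Probit and logistic link functions evaluated on a grid of ability values.
    probit   = @(x) 0.5*erfc(-x/sqrt(2));   % standard normal cdf Phi(x)
    logistic = @(x) 1./(1 + exp(-x));       % equals e^x/(1+e^x), in a stable form

    x = -3:0.1:3;                           % the practical range of latent ability
    plot(x, probit(x), '-', x, logistic(x), '--');
    xlabel('LATENT ABILITY');
    ylabel('PROBABILITY OF CORRECT RESPONSE');
    legend('probit link', 'logistic link', 'Location', 'northwest');

The two curves are nearly indistinguishable after a rescaling of the ability axis, which is the visual counterpart of the remark below that either link leads to essentially the same inferences.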

Inferences obtained using either link function are essentially the same. Previously, people working with IRT models preferred the logistic link because of its nice properties that simplify the mathematical calculations in parameter estimation. However, with the advancement of computing power and the introduction of Bayesian methods in parameter estimation, the probit link has gained more popularity as it is more natural and easier to implement numerically.

In the IRT model, an examinee with a certain ability level will have a certain probability of answering a particular item correctly. Plotting these probabilities against their corresponding ability scores will yield a plot like the one shown in Figure 1.1. This curve is called an Item Response Curve (IRC).

[Figure 1.1: A typical item response curve. Axes: latent ability vs. probability of correct response.]

1.3 Common IRT Models

1.3.1 One-Parameter Model

The probability that an examinee will answer a particular item in a test correctly should

also depend on the characteristics of the item. For example, if item 2 is more difficult than

item 1, then the probability that a particular examinee will get item 2 correctly should be

lower than the probability that he/she gets item 1 correctly. Under the assumption that

each item in the test can be described using this single difficulty parameter, one could model

the probability of correctly answering a particular item in the test by

\[ \Pr(y = 1) = F(\theta - b) \tag{1.3.1} \]

where b represents the difficulty parameter of the item. To see the effect of b in the Item

Response Curve (IRC), consider the three different plots given in Figure 1.2 with varying

difficulty values.

[Figure 1.2: Item response curves for 3 different difficulty values (a = 1 with b = −1 easier, b = 0, and b = 1 harder). Axes: latent ability vs. probability of correct response.]

Note that b serves as a location (shift) parameter. When b takes negative values, the IRC is shifted to the left and the probability that a particular examinee, with a certain ability score

θ, correctly answers the item increases. Hence, lower b values correspond to easier items and higher b values correspond to more difficult items.

When the link function F is taken to be the cumulative distribution function of the standard (denoted by Φ), this model is known as the One-parameter

probit model. But when this link function F is taken as the logistic cumulative distribution

function, this model becomes the famous Rasch Model (Rasch, 1966).

\[ \Pr(Y = 1) = \frac{e^{(\theta - b)}}{1 + e^{(\theta - b)}}. \tag{1.3.2} \]

1.3.2 Two-Parameter Model

Suppose that each item in the exam can be described by two parameters − a discrimination parameter a_j and a difficulty parameter b_j. Then the probability that a particular examinee with latent ability score of θ_i correctly answers item j is modeled as

\[ \Pr(Y_{ij} = 1 \mid \theta_i) = F(a_j\theta_i - b_j). \tag{1.3.3} \]

Again, when the link function F is taken to be the cumulative distribution function of

the standard normal distribution, this model is known as the Two-parameter probit model.

But when this link function F is taken as the logistic cumulative distribution function, this

model is called the Two-parameter logit model.

To see the effect of the discrimination parameter a in the item response curve, consider

the three different plots shown in Figure 1.3 with varying discrimination values. Note that a serves as a scale parameter that represents the slope of the item response curve. It indicates how well a particular item discriminates between students with different abilities.

Take for example two examinees − one with ability score 0 and another with ability

score 1. If an item has a discrimination parameter value of 0.5, then the difference in the

probabilities of getting the correct answer to this item by these 2 examinees will be about

0.19 (see Figure 1.4). On the other hand, if an item has a discrimination parameter value of

2, then this difference in probabilities will be about 0.48.

[Figure 1.3: Item response curves for 3 different discrimination values (b = 0 with a = 1, a = 0.5 low, and a = 2 high).]

[Figure 1.4: Items with high discrimination power have higher chances of distinguishing two examinees with different ability scores than items with low discrimination power (P1 − P0 = .48 for a = 2 versus P1 − P0 = .19 for a = 0.5).]

Hence, the item with the higher discrimination parameter value has a better chance of identifying the examinee with the higher ability score.

1.3.3 Three-Parameter Model

Sometimes, especially on multiple choice items, examinees can get the correct answer purely by guessing. To include this guessing parameter in the model, one could model the success probability as

\[ \Pr(y_{ij} = 1 \mid \theta_i) = c_j + (1 - c_j)F(a_j\theta_i - b_j), \tag{1.3.4} \]

where c_j represents the probability that any examinee will get item j correct by pure guessing.

This model is known as the Three-parameter probit model when the standard normal cumulative distribution is used as the link function. But when the logistic link function is used, this model is called the Three-parameter logit model. The latter model was introduced by Birnbaum in 1968.
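To make the three models concrete, the short Matlab sketch below evaluates the success probability of a single hypothetical item under equations (1.3.1), (1.3.3), and (1.3.4) using the probit link; the parameter values a, b, and c are arbitrary choices for illustration, not values from any data set in this dissertation.

    % Item response curves of one item under the three common IRT models.
    F = @(x) 0.5*erfc(-x/sqrt(2));      % probit link (standard normal cdf)

    theta = -3:0.1:3;                   % grid of latent ability values
    a = 1.2; b = 0.5; c = 0.2;          % illustrative item parameter values

    p1 = F(theta - b);                  % one-parameter model (1.3.1)
    p2 = F(a*theta - b);                % two-parameter model (1.3.3)
    p3 = c + (1 - c)*F(a*theta - b);    % three-parameter model (1.3.4)

    plot(theta, p1, theta, p2, theta, p3);
    legend('one-parameter', 'two-parameter', 'three-parameter', 'Location', 'northwest');

Note how the guessing parameter c lifts the lower asymptote of the curve from 0 to c, so that even examinees with very low ability answer the item correctly with probability about c.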

1.3.4 Exchangeable IRT Model

The one-parameter IRT model assumes that all items in the exam have the same discrimination parameters (usually all equal to one), while the two-parameter IRT model assumes that each item can have a different discrimination parameter value. Some people think that the one-parameter model is too restrictive, while others think that the two-parameter model is already over-parameterized. In the Bayesian framework, there is a way to get a compromise between these two models. This is achieved by considering an exchangeable

IRT model in which the item discrimination parameter values are shrunk toward a common value. More details about this model will be discussed in Chapter 5, where this model will be used extensively.

1.4 Parameter Estimation

There are two main methods of obtaining estimates for the parameters in the above-mentioned models: the classical Joint Maximum Likelihood Estimation (JMLE) and

Bayesian Estimation. In either case, one has to work with the likelihood function. To facilitate the discussion of these two estimation methods, they will be presented using only the two-parameter IRT model. These two estimation procedures can be easily modified to work for the other IRT models.

1.4.1 Likelihood Function

Let y_{i1}, y_{i2}, . . . , y_{ik} denote the binary responses of the ith individual to the k test items, and let a = (a_1, . . . , a_k) and b = (b_1, . . . , b_k) be the vectors of item discrimination and difficulty parameters, respectively. Assuming that an individual taking the test answers each item independently (the local independence assumption), the probability of observing the entire sequence of responses of the ith individual is given by

\[ \Pr(Y_{i1} = y_{i1}, \ldots, Y_{ik} = y_{ik} \mid \theta_i, a, b) = \prod_{j=1}^{k} \Pr(Y_{ij} = y_{ij} \mid \theta_i, a, b) = \prod_{j=1}^{k} F(a_j\theta_i - b_j)^{y_{ij}} \left[ 1 - F(a_j\theta_i - b_j) \right]^{(1-y_{ij})}. \]

Finally, if the responses of each of the n individuals to the test items are assumed to be independent, then the likelihood function for all responses of all individuals will be

\[ L(\theta, a, b) = \prod_{i=1}^{n} \prod_{j=1}^{k} F(a_j\theta_i - b_j)^{y_{ij}} \left[ 1 - F(a_j\theta_i - b_j) \right]^{(1-y_{ij})}. \tag{1.4.1} \]

This function represents the likelihood of obtaining the observed data as a function of

the model parameters. Therefore, it is logical to estimate these model parameters using those

values that maximize this likelihood function. This is what Maximum Likelihood Estimation

(MLE) or Joint Maximum Likelihood Estimation (JMLE) is all about.

1.4.2 Joint Maximum Likelihood Estimation

One of the most common ways of maximizing a likelihood function is to take its partial derivatives with respect to each parameter in the model and set them to zero. Actually, because likelihood functions are most often expressed as the product of several density functions, it is often more convenient to maximize the natural logarithm of the likelihood, ln(L). Since logarithmic functions are increasing on ℝ, the maximum of the likelihood function will occur at the same point as the maximum of the log-likelihood. In the case of the two-parameter IRT model, the log-likelihood is

\[ \ln L = \sum_{i=1}^{n} \sum_{j=1}^{k} \left\{ y_{ij}\ln(p_{ij}) + (1 - y_{ij})\ln(1 - p_{ij}) \right\} \tag{1.4.2} \]

where p_{ij} = F(a_jθ_i − b_j).

Taking its partial derivatives with respect to each parameter and setting them to zero

will yield a system of n + 2k equations with the same number of unknowns. The solutions

of this system of equations are the potential maximum likelihood estimates of the model parameters. For this reason, people working with IRT models preferred to use the logistic link because it simplifies the derivative expressions nicely and greatly facilitates the required calculations.
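Before turning to the derivative formulas, note that the log-likelihood itself is straightforward to evaluate numerically. The Matlab function below is a minimal sketch (the name loglik2p is illustrative and is not one of the programs in Appendix B) that computes ln L of equation (1.4.2) for given parameter values and either link function.

    % Log-likelihood (1.4.2) of the two-parameter IRT model.
    % y:     n-by-k matrix of 0/1 responses
    % theta: n-by-1 ability scores;  a, b: k-by-1 item parameter vectors
    % F:     link function handle, e.g. F = @(x) 1./(1 + exp(-x))
    function ll = loglik2p(y, theta, a, b, F)
      eta = theta*a' - ones(size(theta))*b';            % n-by-k matrix of a_j*theta_i - b_j
      p   = F(eta);                                     % success probabilities p_ij
      ll  = sum(sum(y.*log(p) + (1 - y).*log(1 - p)));  % double sum in (1.4.2)
    end

Working on the log scale also avoids the numerical underflow that would result from forming the raw product in (1.4.1) for large n and k.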

For the two-parameter logistic IRT model, where \( p_{ij} = \frac{e^{(a_j\theta_i - b_j)}}{1 + e^{(a_j\theta_i - b_j)}} \), the first partial derivatives are given by

\[ \frac{\partial p_{ij}}{\partial a_j} = p_{ij}q_{ij}\theta_i, \qquad \frac{\partial p_{ij}}{\partial b_j} = -p_{ij}q_{ij}, \qquad \frac{\partial p_{ij}}{\partial \theta_i} = p_{ij}q_{ij}a_j, \]

where q_{ij} = 1 − p_{ij}. Using these partial derivatives, the first and second partial derivatives of the log-likelihood (1.4.2) under the logistic link can be obtained easily and are summarized in Table 1.1, shown below.

\[
\begin{array}{ll}
\text{Derivative} & \text{Expression} \\[4pt]
\dfrac{\partial \ln(L)}{\partial a_j} & \sum_{i=1}^{n} \theta_i (y_{ij} - p_{ij}) \\[8pt]
\dfrac{\partial \ln(L)}{\partial b_j} & -\sum_{i=1}^{n} (y_{ij} - p_{ij}) \\[8pt]
\dfrac{\partial \ln(L)}{\partial \theta_i} & \sum_{j=1}^{k} a_j (y_{ij} - p_{ij}) \\[8pt]
\dfrac{\partial^2 \ln(L)}{\partial a_j^2} & -\sum_{i=1}^{n} p_{ij} q_{ij} \theta_i^2 \\[8pt]
\dfrac{\partial^2 \ln(L)}{\partial b_j \partial a_j} & \sum_{i=1}^{n} p_{ij} q_{ij} \theta_i \\[8pt]
\dfrac{\partial^2 \ln(L)}{\partial b_j^2} & -\sum_{i=1}^{n} p_{ij} q_{ij}
\end{array}
\]

Table 1.1: First and Second Derivatives of Item and Ability Parameters for the Two-Parameter Logistic Model.

However, even with the logistic link, the resulting equations are not linear. Thus, to get the maximum likelihood estimates, one needs to solve these equations numerically. Two

popular numerical methods used for this purpose are the Newton-Raphson and Fisher’s

Method of Scoring (see Appendix).
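To illustrate the inner step of such an iterative scheme, the sketch below performs a single Newton-Raphson update for one item's parameters, holding the ability scores fixed and using the logistic-link derivatives of Table 1.1. It is only a sketch under these simplifying assumptions, not the full estimation program described next.

    % One Newton-Raphson update for the parameters (a, b) of a single item.
    % y: n-by-1 vector of 0/1 responses to the item; theta: n-by-1 abilities.
    function [a, b] = newton_item(y, theta, a, b)
      p = 1./(1 + exp(-(a*theta - b)));               % p_ij under the logistic link
      q = 1 - p;
      g = [ sum(theta.*(y - p));                      % d ln(L) / da_j (Table 1.1)
           -sum(y - p)];                              % d ln(L) / db_j
      H = [-sum(p.*q.*theta.^2)  sum(p.*q.*theta);    % second derivatives from
            sum(p.*q.*theta)    -sum(p.*q)];          % Table 1.1 (symmetric)
      step = H \ g;                                   % solve H*step = g
      a = a - step(1);                                % Newton update: x <- x - H^(-1)g
      b = b - step(2);
    end

In a full JMLE cycle, updates of this kind for all item parameters alternate with analogous updates for the ability parameters until convergence.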

Using the mathematical software Matlab, the author has written programs to implement

the Newton-Raphson algorithm to estimate the parameters of the two-parameter logistic

model. This program is described in full in the Appendix under the name pl2_mle.

To show how close the JMLE estimates are to the actual parameters, a simple simulation

was performed where a data set of 0’s and 1’s were generated using 1000 simulated ability

scores and 35 test items each with 2 parameters. The 1000 simulated ability scores and the 35

item difficulty parameter values were generated from N(0,1), while the 35 item discrimination

parameters were randomly selected from the possible values of {0.2, 0.4, 0.6, 0.8, 1.0, 1.2,

1.4, 1.6}. Once the parameter values were specified, the probability of answering a particular

item correctly by a certain simulated student was computed using the logistic link to obtain

a 1000 × 35 matrix of probabilities. Finally, this matrix was converted into a matrix of 0’s

and 1’s to simulate a particular exam result. Using the 1000 × 35 data matrix of simulated

responses, the JMLE estimates were obtained using the program pl2_mle.
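For concreteness, the simulation just described can be reproduced in a few lines of Matlab; the sketch below (with illustrative variable names) generates the 1000 × 35 matrix of binary responses under the logistic link.

    % Simulate the 1000-by-35 response matrix described above.
    n = 1000; k = 35;
    theta = randn(n, 1);                           % ability scores ~ N(0,1)
    b     = randn(k, 1);                           % item difficulties ~ N(0,1)
    vals  = [0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6];     % possible discrimination values
    a     = vals(ceil(rand(k, 1)*numel(vals)));    % random draws from that list
    a     = a(:);                                  % force a column vector
    P = 1./(1 + exp(-(theta*a' - ones(n, 1)*b'))); % n-by-k success probabilities
    y = double(rand(n, k) < P);                    % matrix of simulated 0's and 1's

Each entry of y is a Bernoulli draw with the success probability prescribed by the two-parameter logistic model, so the matrix mimics a particular exam result.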

The two scatterplots shown in Figure 1.5 display the relationship between the actual

item parameters and their JMLE estimates. The left plot of Figure 1.5 shows a linear

pattern of dots that were very close to the line y = x, which illustrates the accuracy of the estimates of the 35 difficulty parameters obtained. The right plot of Figure 1.5 also shows a linear trend, but the dots in this plot are more scattered, revealing a lower precision for the estimates of the discrimination parameters of the 35 items. Also, notice that the linear pattern is slightly above the line y = x, suggesting a positive bias in the estimation of the discrimination parameters by the JMLE. This positive bias was previously noted by Lord (1983).

[Figure 1.5: Scatterplots of the 35 actual item parameters versus their corresponding estimates. Left: item difficulty (r = 0.9965); right: item discrimination (r = 0.9801).]

[Figure 1.6: Scatterplot of the 1000 actual ability scores versus their corresponding estimates (r = 0.8964).]

The scatterplot of the actual ability scores versus their estimated values is given in

Figure 1.6. Again, notice the linear pattern of the dots that cluster around the line y = x.

The variability of the estimates around this line depends on the number of items in the exam.

That is, if there were more items in the exam, these dots would be closer to the line y = x.

This plot also revealed the increased variability of the estimates for extreme ability scores.

1.4.3 Bayesian Estimation

In the classical (or frequentist) framework, parameters in a model are considered as

fixed quantities. On the other hand, in the Bayesian framework, these parameters, ξ =

(ξ_1, . . . , ξ_N), are considered as random variables that follow a certain distribution, π(ξ).

Bayesian methodology requires the specification of a prior distribution, π_0(ξ), for each parameter ξ in the model. This will represent the prior belief regarding the parameters in the model. After observing the data through the likelihood function, L(data; ξ), the belief about the parameters is modified (or updated) by computing their posterior distributions,

π(ξ|data). This is done with the use of the Bayes Rule formula:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A)P(A)}{P(B)} \propto P(B \mid A)P(A). \]

Or in our terms,

\[ \pi(\xi \mid \text{data}) \propto L(\text{data}; \xi)\, \pi_0(\xi). \]

Once the posterior distributions of the parameters are obtained, all inferences pertaining to these parameters are based on their respective posterior distributions.

For the Bayesian method of estimation, using the probit link is more natural and easier to implement, as will be seen later. For this reason, the Bayesian estimation method is discussed using the two-parameter probit model. Using Bayes' rule, the joint posterior distribution will be proportional to the product of the likelihood function obtained in Section 1.4.1 and the joint prior density of the parameters. That is,

\[ \pi(\theta, a, b \mid \text{data}) \propto \prod_{i=1}^{n} \prod_{j=1}^{k} \Phi(a_j\theta_i - b_j)^{y_{ij}} \left[ 1 - \Phi(a_j\theta_i - b_j) \right]^{(1-y_{ij})} \pi_0(\theta, a, b), \tag{1.4.3} \]

where Φ is the standard normal cumulative distribution and π_0(θ, a, b) is the joint prior density of the parameters in the model.

It is a standard practice to use values for θ_i mostly between −3 and 3. For this reason, a N(0, 1) prior is assigned for θ_i, i = 1, . . . , n. This also solves the problem of nonidentifiability of the parameters in the model. To avoid the problem of getting unbounded estimates for the item difficulty parameters, b_j is assigned a N(0, s_b) prior, j = 1, . . . , k, where s_b < 5. Finally, a N(0, s_a) prior is also assigned for a_j, j = 1, . . . , k, where s_a is fixed. For simplicity, s_b and s_a were both set to 1 in the actual computations.

Combining these prior densities with the likelihood function, the posterior density of

the two-parameter IRT model is proportional to

\[ \pi(\theta, a, b \mid \text{data}) \propto L(\theta, a, b) \prod_{i=1}^{n} \phi(\theta_i; 0, 1) \prod_{j=1}^{k} \phi(b_j; 0, s_b)\, \phi(a_j; 0, s_a). \tag{1.4.4} \]

As mentioned before, all Bayesian inferences about a parameter will be based on its

posterior distribution. Consequently, Bayesian analysis will require the study of the important parameters based on this joint posterior distribution or their corresponding marginal

posterior distributions. However, it is quite difficult to study this complicated posterior

distribution or to derive the marginal posterior distributions analytically.

An alternative method is to simulate values of these parameters from the joint posterior

distribution. Inferences about a parameter can then be made using this sample. For example,

one could take the average of the sample to serve as an estimate of the mean of the parameter,

or construct an approximate 95% probability interval for the parameter. However, drawing

a sample from a high-dimensional posterior distribution is not an easy task. Fortunately,

there is Gibbs Sampling (Geman and Geman, 1984). 16

Gibbs sampling, as discussed in Gelfand and Smith (1990), is a special type of Markov

Chain Monte Carlo (MCMC) that makes use of the full conditional distribution of a set

of parameters. The idea is, to simulate from f(x, y, z) (the joint distribution of X, Y, and

Z), one iteratively draws from the full conditional distributions. That is, from initial values

x_0, y_0, and z_0, one draws x_1 from g(x|y_0, z_0), then y_1 from g(y|x_1, z_0), and then z_1 from g(z|x_1, y_1). This will constitute a single iteration of the Gibbs Sampling. To simulate m points from f(x, y, z), simply repeat this cycle m + l times, where l is the number of

cycles it takes to converge to the desired distribution (also called the burn-in period). The

points from the last m cycles can be considered as a (dependent) sample drawn from the joint distribution f(x, y, z). Some other people would repeat the cycle km + l times and select every kth point among the last km points in order to reduce the dependency of the sample points.
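As a toy illustration of this cycle, the sketch below runs a Gibbs sampler with two variables instead of three, drawing from a bivariate normal distribution with correlation ρ; in this stylized example each full conditional is itself a normal distribution, so every draw is direct.

    % Gibbs sampling from a bivariate normal with correlation rho.
    % Full conditionals: x|y ~ N(rho*y, 1-rho^2) and y|x ~ N(rho*x, 1-rho^2).
    rho = 0.8; m = 5000; l = 500;            % desired sample size m, burn-in l
    x = 0; y = 0;                            % initial values x0 and y0
    draws = zeros(m + l, 2);
    for t = 1:(m + l)
      x = rho*y + sqrt(1 - rho^2)*randn;     % draw x from g(x | y)
      y = rho*x + sqrt(1 - rho^2)*randn;     % draw y from g(y | x)
      draws(t, :) = [x y];
    end
    sample = draws(l+1:end, :);              % keep the last m (dependent) draws
    corrcoef(sample)                         % sample correlation, close to rho

The pattern extends directly to three or more variables: each iteration simply cycles through all the full conditional distributions in turn.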

However, Gibbs sampling assumes that it is possible to simulate from the full conditional distributions. If each of the full conditional distributions turned out to be a common

standard distribution that can be directly or easily simulated from, then there would be no

problem. But if some of these distributions are nonstandard density functions, then one

may need to employ a more general MCMC algorithm, like the Metropolis-Hastings (MH) algorithm, to obtain a sample from them (see the Appendix for details on the MH algorithm).

Sometimes, finding the full conditional distribution from a joint distribution may be the

problem. Especially in complicated distributions, like our joint posterior distribution given

in equation 1.4.4, it can be very challenging. 17

1.4.4 Albert’s Gibbs Sampler

To facilitate the computation of these full conditional distributions, Albert (1992) introduced a latent variable Z_{ij} that has a normal distribution with mean m_{ij} = a_jθ_i − b_j and variance 1. This continuous variable serves as the underlying mechanism that generates the

responses. We say that the response is positive (yij = 1) when zij > 0 and negative (yij = 0)

when zij < 0. This ingenious idea greatly simplified the simulation of samples from the

conditional posterior distributions, as they turned out to be just variations of the normal

distribution.

With the introduction of these continuous latent data Z = (Z_{11}, . . . , Z_{nk}), the joint

posterior density of all model parameters is given by

\[ \pi(\theta, Z, a, b \mid \text{data}) \propto \prod_{i=1}^{n} \prod_{j=1}^{k} \left[ \phi(Z_{ij}; m_{ij}, 1)\, I(Z_{ij}, y_{ij}) \right] \times \prod_{i=1}^{n} \phi(\theta_i; 0, 1) \prod_{j=1}^{k} \phi(b_j; 0, s_b)\, \phi(a_j; 0, s_a), \tag{1.4.5} \]

where I(z, y) is equal to 1 when {z > 0, y = 1} or {z < 0, y = 0}, and equal to 0 otherwise.

To simulate from the joint posterior (1.4.5), the Gibbs sampling procedure can iteratively draw from three sets of conditional probability distributions: g(Z|θ, (a, b), data), g(θ|Z, a, b, data), and g((a, b)|Z, θ, data). The conditional posterior distribution of Z_{ij} given (θ_i, a_j, b_j, data) is simply a truncated normal distribution with mean m_{ij} = a_jθ_i − b_j and variance 1. The truncation is from the left of 0 if the corresponding response is correct (y_{ij} = 1), and from the right of 0 if it is incorrect (y_{ij} = 0). The conditional posterior distribution of θ_i given (Z_{ij}, a_j, b_j, data) is a normal distribution with mean and variance

\[ m_{\theta_i} = \frac{\sum_{j=1}^{k} a_j (Z_{ij} + b_j)}{\sum_{j=1}^{k} a_j^2 + 1} \qquad \text{and} \qquad \nu_{\theta_i} = \frac{1}{\sum_{j=1}^{k} a_j^2 + 1}. \]

Finally, the conditional posterior distribution of (a_j, b_j) given (Z_{ij}, θ_i, data) is the multivariate normal distribution with mean

\[ M_j = \left( X'X + \Sigma_0^{-1} \right)^{-1} \left( X'Z_j + \Sigma_0^{-1} [\mu_a \;\; 0]' \right) \]

and covariance matrix

\[ \nu_j = \left( X'X + \Sigma_0^{-1} \right)^{-1}, \]

where \( \Sigma_0 = \begin{pmatrix} s_a^2 & 0 \\ 0 & s_b^2 \end{pmatrix} \) is the prior covariance matrix of (a_j, b_j), μ_a is the prior mean of the discrimination parameters (equal to 0 under the priors above), Z_j is the vector of latent data for item j, and X is the known covariate matrix with rows (θ_i, −1). For more details on these conditional posterior distributions, see Albert and Johnson (1999).
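Putting the three full conditionals together, one iteration of the sampler can be sketched in Matlab as follows. This is only an illustration of the update cycle under the simple settings of Section 1.4.3 (s_a = s_b = 1 and prior mean μ_a = 0, so that Σ_0 is the identity matrix), not the pp2_bay program itself; the truncated normal draws are obtained by inverting the normal cdf on the allowed range.

    % One Gibbs iteration for the two-parameter probit model.
    % y: n-by-k 0/1 responses; theta: n-by-1; a, b: k-by-1 (current values).
    Phi    = @(x) 0.5*erfc(-x/sqrt(2));          % standard normal cdf
    Phiinv = @(u) -sqrt(2)*erfcinv(2*u);         % its inverse

    [n, k] = size(y);
    M = theta*a' - ones(n, 1)*b';                % means m_ij = a_j*theta_i - b_j

    % 1. Z_ij | rest: N(m_ij, 1) truncated to z > 0 if y_ij = 1, z < 0 if y_ij = 0.
    u  = rand(n, k);
    lo = Phi(-M).*(y == 1);                      % cdf lower bound of allowed range
    hi = Phi(-M).*(y == 0) + (y == 1);           % cdf upper bound of allowed range
    Z  = M + Phiinv(lo + u.*(hi - lo));          % inverse-cdf truncated draws

    % 2. theta_i | rest: normal with the mean and variance displayed above.
    v     = 1/(sum(a.^2) + 1);
    theta = (Z + ones(n, 1)*b')*a*v + sqrt(v)*randn(n, 1);

    % 3. (a_j, b_j) | rest: bivariate normal with Sigma0 = identity, mu_a = 0.
    X = [theta, -ones(n, 1)];
    V = inv(X'*X + eye(2));                      % common posterior covariance
    R = chol(V);                                 % so that V = R'*R
    for j = 1:k
      ab   = V*(X'*Z(:, j)) + R'*randn(2, 1);    % draw (a_j, b_j) jointly
      a(j) = ab(1);  b(j) = ab(2);
    end

Repeating this cycle, after a suitable burn-in period, yields a dependent sample from the joint posterior (1.4.5).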

To implement Albert’s Gibbs Sampler on the two-parameter probit IRT model with

burn-in period, the author modified Albert's Matlab program to get the program pp2_bay

(see Appendix). To see how close the estimates are to the actual parameters, the same

simulated parameter values that were used in the previous section were used to generate a

data set of 0’s and 1’s using the probit link. Using the generated 1000 × 35 data matrix of

responses, the Bayesian estimates were obtained using the program pp2_bay.

The two scatterplots shown in Figure 1.7 display the relationship between the actual

item parameters and their Bayesian estimates. The left plot in Figure 1.7 shows a linear

pattern of dots which very closely resembles the corresponding plot obtained earlier using

the classical approach and is shown in Figure 1.5. The correlation coefficient between the

actual difficulty values and their Bayesian estimates was 0.9948. This indicates the accuracy

of the Bayesian estimates. The plot on the right of Figure 1.7 also shows a linear pattern

of dots that looks similar to the one obtained using the JMLE method, except that the

Bayesian item discrimination estimates are better since they are centered around values close to the actual discrimination values.

[Figure 1.7: Scatterplots of the 35 actual item parameters versus their corresponding Bayesian estimates. Left: item difficulty (r = 0.9948); right: item discrimination (r = 0.9794).]

[Figure 1.8: Scatterplot of the 1000 actual ability scores versus their corresponding Bayesian estimates (r = 0.9557).]

Figure 1.8 shows a very nice linear pattern around the line y = x, which illustrates the accuracy of the Bayesian estimates of the ability scores. In addition, the problem of higher variability at extreme values that was observed earlier, when the JMLE method was used, no longer exists.

1.5 An Example - BGSU Mathematics Placement Exam

Every year, the Mathematics and Statistics Department of BGSU administers a placement exam to determine the proficiency of the incoming freshmen students. In 2004, there were three different exams (A, B, and C) given to a total of 5570 students. Exam A was composed of 35 questions and was given to a total of 1286 students. Data set A contains the results of these 1286 students on each of the 35 items. It is a 1286 × 35 table of 0's and 1's with a_{ij} = 1 when the ith student answered the jth item correctly and 0 otherwise.

To illustrate the kind of results that one gets from the estimation procedures discussed in the previous two sections, those methods were applied to the responses of the 1286 students who took exam A.

A. Using Joint Maximum Likelihood Estimation

Before the JMLE procedure can be applied to the data set for exam A, students who got either a score of zero or a perfect score, as well as items that were answered correctly or incorrectly by all students had to be removed to avoid getting unreasonable results (This issue will be discussed later in the last section of this chapter). After checking, 11 students

(2 zero scores and 9 perfect scores) were removed from the data set. The JMLE procedure was then applied to this slightly smaller data set and the resulting item parameter estimates are summarized in the left scatterplot shown in Figure 1.9. The estimates were obtained using the Matlab program pl2_mle.

Based on Figure 1.9, one observes that item 30 was tagged as a difficult question by this method. This result agrees with the data as only 175 out of the 1286 students got this item correctly. This item was also tagged as an item with very low discrimination power.

On the other hand, item 21 was tagged as an easy question with high discrimination power which, again, agrees with the data as 1196 out of 1286 students answered it correctly.

[Figure 1.9: Summary plot of the JML estimates of the parameters of the 35 items in the BGSU Math placement exam. Left: discrimination vs. difficulty; right: number of examinees with correct responses vs. JML estimates of item difficulty.]

Finally, in Figure 1.10, one could see that the ability estimates of students are very consistent with the observed data. The total number of correct responses of students increases as the estimated ability scores increase.

[Figure 1.10: Scatterplot of the JML estimates of the ability scores versus their corresponding exam raw scores (correlation = 0.9673).]

B. Using Bayesian Estimation

When the Bayesian estimation method was applied to the original (complete) BGSU

Math placement exam A result, the Bayes estimates for the discrimination and difficulty

parameters of the 35 items were obtained using the Matlab program pp2_bay. A summary of the Bayesian estimates is shown in the left scatterplot of Figure 1.11. Note how closely this plot resembles the left plot of Figure 1.9, which was obtained using JMLE.

Based on Figure 1.11, it was observed that item 30 was again tagged as a difficult

question with low discrimination power and item 21 was also tagged as an easy question

with high discrimination power. As pointed out earlier, these results are consistent with the observed results. Figure 1.12 also exhibits a relationship between the raw scores of students and their corresponding (Bayes) ability score estimates that is very similar to what was obtained using the JMLE procedure (shown in Figure 1.10). All these show that the classical and Bayesian approaches yield very similar results and conclusions.

[Figure 1.11: Summary plot of the Bayesian estimates of the parameters of the 35 items in the BGSU Math placement exam. Left: discrimination vs. difficulty; right: number of examinees with correct responses vs. Bayesian estimates of item difficulty.]

[Figure 1.12: Scatterplot of the Bayesian estimates of the ability scores versus their corresponding exam raw scores (correlation = 0.9882).]

However, the Bayesian approach has some advantages over the classical approach, as will be discussed in the next section.

As a final check on these results, one could go back and check the actual questions in the placement exam. The two extreme questions (Q21 and Q30) are given in Table 1.2, shown below. Looking at these questions, most teachers would agree that the results are consistent with expectation. Incoming freshmen college students usually have the most difficult time simplifying composite rational expressions, and that was exactly what question 30 was about.

Q21. If 2x − 5 = 7, then x =
(A) −6 (B) −1 (C) 1 (D) 6 (E) 12

Q30. \( \dfrac{1}{1 + \frac{6}{x}} = \)
(A) \( \frac{x}{1+6} \) (B) \( \frac{x}{x+6} \) (C) \( \frac{x+6}{x} \) (D) \( \frac{1}{x+6} \) (E) \( \frac{x}{6} \)

Table 1.2: Two extreme questions in the exam.

1.6 Advantages of the Bayesian Approach

Although the estimates that were obtained from the two estimation methods were quite similar (see the three plots below), the Bayesian method does have some advantages over the classical joint maximum likelihood estimation method.

First of all, as was seen earlier, the JMLE method cannot handle extreme results. An examinee who answered all items in the exam correctly would get a positive infinity ability estimate when the JMLE method is used. Also, an examinee who did not get any correct item would get a negative infinity ability estimate. Similarly, an item that was correctly answered by all the students would get a difficulty estimate of negative infinity, while an item that was not answered correctly by any examinee would get a difficulty estimate of positive infinity. These problems do not arise with the Bayesian estimation method with the use of a proper prior.

[Figure 1.13: Scatterplots that compare the Bayesian estimates with the JMLE estimates of the item parameters. Left: difficulty (correlation = 0.9903); right: discrimination (correlation = 0.9674, fitted line y = 1.0256x − 0.1437).]

[Figure 1.14: A scatterplot that depicts a strong correlation (0.8608) between the Bayesian and JMLE estimates of the ability scores (fitted line y = 0.8139x + 0.0113).]

Secondly, the estimates from the JMLE method may not be the absolute maximum estimates but rather only some local maximum. This is due to the fact that the likelihood function may have multiple modes.

Thirdly, there is a problem of identifiability of the parameters in the IRT models. Take for example the two-parameter IRT model, where the probability of correctly answering item j by examinee i is modeled by p_{ij} = F(a_jθ_i − b_j). If one considers another set of parameters like a_j* = 2a_j and θ_i* = (1/2)θ_i, then p_{ij} = F(a_jθ_i − b_j) = F(a_j*θ_i* − b_j). This implies that the parameter values are not unique and cannot be identified without putting some restrictions on their possible values. Some people address this problem by working with the marginal likelihood function. This marginal likelihood is obtained by integrating the joint likelihood function with respect to all the ability parameters. However, this marginal maximum likelihood method (MMLE) is not useful when one is also interested in the ability score estimates. In the Bayesian framework, this problem of identifiability is conveniently solved by the introduction of the prior distributions.

Finally, the Bayesian estimation method provides additional benefits: (1) This method allows the incorporation of prior beliefs about item parameters easily; (2) The fitting procedure is simple; (3) Once simulated values from the posterior distributions are obtained, it is very easy to do inference about any function of the parameters that may be of interest; and most importantly, (4) it offers a great deal of potential for model checking.

CHAPTER 2 MODEL CHECKING METHODS FOR BINARY AND IRT MODELS

2.1 Introduction

Model checking, or assessing the fit of a model, is an important part of any data modeling process. Before using the model to make inferences regarding the data, it is crucial to establish that the model fits the data well enough according to some criteria. In particular, the model should explain important aspects of the data that are of practical interest. Otherwise, the conclusions obtained using the model might not be relevant.

In IRT, model checking is still an underdeveloped area. Well-established statistical techniques for model checking do not exist except for the simplest models (Van der Linden and Hambleton, 1997). Most of the classical methods involve plotting residuals and computing some χ2 indices. However, the exact or asymptotic distributions of all of these χ2 statistics are either unknown or questionable.

Recently, building on the groundbreaking work of Albert (1992), papers by C. Glas (2003, 2005) and S. Sahu (2002) have employed Bayesian methodologies to assess the goodness-of-fit of IRT models. Glas employed the idea of posterior predictive checks to assess the person fit of IRT models, while Sahu proposed the idea of pseudo Bayes factors for IRT model selection. However, Sahu's discussion covers general Bayesian ideas without a specific methodology for this task, while Glas et al.'s discussion concentrated only on the overall fit of the three-parameter model using discrepancy measures for person fit. Nothing was mentioned regarding item fit or any method other than the posterior predictive distribution.

Model checking in IRT can be addressed in three general ways: (1) Check if the data satisfy the dimensionality and local independence assumptions; (2) Check the expected advantages derived from the use of the item response model; (3) Check the closeness of the fit between the predicted and observed outcomes. In this dissertation, most of the author's effort is concentrated on the last case.

In this chapter, the idea of residuals and their uses will be discussed in Section 2. Section 3 will cover the idea of discrepancy measures and test quantities, followed by a discussion of the different predictive distributions in Section 4. Finally, Section 5 will cover the idea of Bayes factors.

2.2 Residuals

A residual is a measurement of agreement between an observed response and its corresponding fitted value under a certain model. If the model is appropriate, then one expects the fitted values to be "close" to the observed responses. These residuals are a good source of information regarding the goodness-of-fit of the model to the data (or, more appropriately, about the lack-of-fit of the model). They can be utilized in two different ways. First, residuals can be used to determine which observations are unusual under the model assumptions. Identifying these outlying values (or outliers) is important as they may signal deficiencies in the model. Second, summary statistics can be derived from these residuals which can provide much information about the overall adequacy of the fitted model.

In the first case, where residuals are used to identify outliers, residuals are most often defined as the difference between the observed responses and their corresponding fitted values.

Take for example a data set composed of $n$ binomial observations. That is,
$$Y_i \sim \text{bin}(n_i, p_i), \quad i = 1, \ldots, n.$$

A commonly used model for this kind of data is the linear logistic model, where the probability of success $p_i$ is modeled as
$$\text{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = \alpha x_i + \beta.$$

Note that if we let $p_{ij} = \Pr(Y_{ij} = 1)$, $\alpha = a_j$, $x_i = \theta_i$, and $\beta = b_j$, then this is exactly the two-parameter IRT model.

Using a standard classical maximum likelihood iterative procedure, similar to the JMLE described in the previous chapter, or some Bayesian methodology like the one described in Chapter 1, one could get estimates for $\alpha$ and $\beta$. With these estimates, $\hat{\alpha}$ and $\hat{\beta}$, the predicted success probability is computed as
$$\hat{p}_i = \frac{e^{\hat{\alpha} x_i + \hat{\beta}}}{1 + e^{\hat{\alpha} x_i + \hat{\beta}}}. \qquad (2.2.1)$$
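For concreteness, this computation can be sketched in a few lines of Matlab. This is only an illustrative sketch: it assumes the Statistics Toolbox function glmfit is available, and it is not one of the programs referenced in this dissertation.

% A minimal sketch of computing the fitted success probabilities (2.2.1)
% for binomial data using Matlab's glmfit (an assumed tool, not one of
% this dissertation's programs). x is an n-by-1 covariate, y the success
% counts, and n the binomial sample sizes.
coef = glmfit(x, [y n], 'binomial');            % returns [intercept; slope]
phat = 1 ./ (1 + exp(-(coef(1) + coef(2)*x)));  % fitted probabilities (2.2.1)
yhat = n .* phat;                               % fitted counts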

The raw residual of the $i$th data point is given by
$$r_i = y_i - n_i\hat{p}_i = y_i - \hat{y}_i. \qquad (2.2.2)$$

Large values of $r_i$ suggest a potential outlier. However, what constitutes "large"? To answer this simple question, one has to know the distribution of the residual $r_i$. However, the distribution of $Y_i$ depends on the different values of $n_i$ and $p_i$, which makes these raw residuals difficult to interpret. One way to address this problem is to standardize these raw residuals by dividing them by the standard deviation of $Y_i$.

2.2.1 Classical Residuals

1. Pearson Residuals

Because the variance of $Y_i$ is given by $n_i p_i(1 - p_i)$, the Pearson residuals are defined as
$$r_{i,P} = \frac{y_i - \hat{y}_i}{\sqrt{n_i\hat{p}_i(1 - \hat{p}_i)}}, \qquad (2.2.3)$$
where $\hat{y}_i = n_i\hat{p}_i$ and $\hat{p}_i = F(x_i'\hat{\beta})$.

These residuals are approximately standard normal when $n_i p_i$ and $n_i(1 - p_i)$ are both greater than or equal to 3 (Albert and Johnson, 1999). In such cases, residuals with absolute values of at least 2 or 3 can be considered as potential outliers.

2. Deviance Residuals

Another type of residual can be constructed using the deviance function. In the linear logistic model for binomial data, the deviance function is defined as twice the difference of the log-likelihood function evaluated at the observed values $y_i$ and at the model's fitted values $\hat{y}_i$; that is,
$$D = 2\sum_{i=1}^{n}\left\{ y_i \log\left(\frac{y_i}{\hat{y}_i}\right) + (n_i - y_i)\log\left(\frac{n_i - y_i}{n_i - \hat{y}_i}\right)\right\}. \qquad (2.2.4)$$

The signed square root of the $i$th observation's contribution to this overall deviance is called the deviance residual; that is,
$$r_{i,D} = \text{sign}(y_i - \hat{y}_i)\left\{2\left[y_i\log\left(\frac{y_i}{\hat{y}_i}\right) + (n_i - y_i)\log\left(\frac{n_i - y_i}{n_i - \hat{y}_i}\right)\right]\right\}^{1/2}, \qquad (2.2.5)$$
where $\text{sign}(y_i - \hat{y}_i)$ is 1 when $y_i - \hat{y}_i > 0$ and $-1$ when $y_i - \hat{y}_i < 0$. Like the Pearson residual, this residual is approximately standard normal when both $n_i p_i$ and $n_i(1 - p_i)$ are greater than or equal to 3.

3. Adjusted Deviance Residuals

Adjusted deviance residuals (Pierce and Schafer, 1986) are obtained from deviance residuals by making a first-order correction for the mean bias. For the linear logistic model applied to binomial data, this leads to adjusted residuals of the form
$$r_{i,AD} = r_{i,D} + \frac{1 - 2\hat{p}_i}{6\sqrt{n_i\hat{p}_i(1 - \hat{p}_i)}}. \qquad (2.2.6)$$

Of the three residuals described, the distribution of the adjusted deviance residuals is often the most nearly normal and yields surprisingly accurate approximations even for small $n_i$, provided that the fitted probabilities are not too close to 0 or 1 (Albert and Johnson, 1999).

4. Likelihood Residuals

Yet another way to construct a residual is to compare the deviance obtained on fitting a linear logistic model to the complete set of $n$ binomial observations with the deviance obtained when the same model is fitted to the same data set with the $i$th observation excluded. This quantity will measure the change in the deviance when each observation is omitted from the data set. Computing the exact values of these residuals can be computationally intensive. However, these residuals can be approximated by using the result that the change in deviance in omitting the $i$th observation from the fit is well approximated by

$$r_L^2 = h_i\, r_{i,P}^2 + (1 - h_i)\, r_{i,D}^2, \qquad (2.2.7)$$
where $h_i$ is the $i$th diagonal element of the $n \times n$ matrix $H = W^{1/2}X(X'WX)^{-1}X'W^{1/2}$. In this expression, $X$ is the $n \times p$ design matrix, where $p$ is the number of unknown parameters in the model, and $W$ is the $n \times n$ diagonal matrix with $n_i\hat{p}_i(1 - \hat{p}_i)$ as the $i$th diagonal element under the linear logistic model (Collett, 2003).

The signed square roots of these values, given by
$$r_{i,L} = \text{sign}(y_i - \hat{y}_i)\sqrt{h_i\, r_{i,P}^2 + (1 - h_i)\, r_{i,D}^2}, \qquad (2.2.8)$$
are known as the likelihood residuals.
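As a concrete illustration, the following Matlab sketch (hypothetical inputs; not one of the programs referenced in this dissertation) computes all four classical residuals for binomial data from formulas (2.2.3)-(2.2.8), given the observed counts, the design matrix, and the fitted probabilities.

% A minimal sketch of the four classical residuals for binomial data.
% y holds the observed success counts, n the binomial sample sizes,
% X the n-by-p design matrix, and phat the fitted success probabilities.
yhat = n .* phat;                                    % fitted counts
r_P = (y - yhat) ./ sqrt(n .* phat .* (1 - phat));   % Pearson residuals (2.2.3)
% Deviance residuals (2.2.5); max(.,eps) makes 0*log(0) evaluate to 0
t1 = y .* log(max(y, eps) ./ yhat);
t2 = (n - y) .* log(max(n - y, eps) ./ (n - yhat));
r_D = sign(y - yhat) .* sqrt(2 * (t1 + t2));
% Adjusted deviance residuals (2.2.6)
r_AD = r_D + (1 - 2*phat) ./ (6 * sqrt(n .* phat .* (1 - phat)));
% Likelihood residuals (2.2.8), using H = W^(1/2) X (X'WX)^(-1) X' W^(1/2)
W = diag(n .* phat .* (1 - phat));
Ws = sqrt(W);
h = diag(Ws * X * ((X' * W * X) \ (X' * Ws)));        % leverages h_i
r_L = sign(y - yhat) .* sqrt(h .* r_P.^2 + (1 - h) .* r_D.^2);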

However, the exact distributions of all four of these residuals, under the assumption that the fitted model is correct, are intractable. The asymptotic distributions are not accurate when the binomial sample sizes ($n_i$'s) are small. In particular, in the case of binary responses, when all $n_i$'s are equal to one, the distribution of the residuals obtained from modeling binary data will not be approximately normal, even when the model is correct. This situation arises when a continuous covariate is included in the model. In IRT models, the latent variable is usually considered a continuous variable. Hence, it will be difficult to evaluate the size of the residuals obtained from these IRT models.

Figure 2.1: Classical residual plot of deviance residuals against fitted probabilities.

Even residual plots for binary responses are not very informative. Because of the nature of the binary response, their residual plots will always exhibit a pattern similar to that shown above in Figure 2.1.

This residual plot seems to indicate three possible outliers: two at the lower right section of the plot and one at the upper left section. However, because the distributions of these residuals are unknown, it is difficult to say whether these three values are actually outliers.

2.2.2 Bayesian Residuals

In the binomial linear model, Bayesian residuals can be defined in a similar manner as
$$r_{i,B} = \frac{y_i}{n_i} - F(x_i'\beta).$$
However, unlike the classical residual, where there is only one value at a particular point $x_i$, these Bayesian residuals have a distribution of values. Since the parameter $\beta$ is the only random quantity in this expression, its posterior distribution will determine the posterior distribution of the residuals. With these posterior distributions, it is now possible to evaluate whether the corresponding observed residual values are outlying or not. In the case of the binary regression model, Bayesian residuals appear to present the only viable method for identifying outliers (Albert and Johnson, 1999).

How these residuals can be used to check for outliers and to assess the goodness-of-fit of the model will be discussed in greater detail in Chapter 3.

2.3 Chi-squared tests for Goodness-of-fit of IRT Models

2.3.1 Wright and Panchapakesan Index (WP)

To measure the goodness-of-fit of the Rasch Model, Wright and Panchapakesan (1969) used the standard score
$$z_{js} = \frac{f_{js} - E(f_{js})}{\sqrt{\text{Var}(f_{js})}},$$
where $f_{js}$ is the observed number of examinees with raw score $s$ that correctly answered item $j$. In their paper, they assumed that $z_{js}$ is asymptotically normally distributed with mean zero and unit variance. This is a consequence of their assumption that $f_{js}$ has a binomial distribution with parameters $p_{js}$ and $f_s$, where $f_s$ denotes the number of examinees with a raw score $s$ and
$$p_{js} = \frac{e^{(\theta_s - b_j)}}{1 + e^{(\theta_s - b_j)}}.$$
The parameter $\theta_s$ is the ability score of examinees with raw score $s$, and $b_j$ is the difficulty parameter value of item $j$. Under the binomial distribution, $E(f_{js}) = f_s p_{js}$ and $\text{Var}(f_{js}) = f_s p_{js}(1 - p_{js})$. Since the value of $p_{js}$ depends on the unknown quantities $\theta_s$ and $b_j$, it is calculated using their estimates, $\hat{\theta}_s$ and $\hat{b}_j$. Consequently, the estimated standard residual is given by
$$\hat{z}_{js} = \frac{f_{js} - f_s\hat{p}_{js}}{\sqrt{f_s\hat{p}_{js}(1 - \hat{p}_{js})}}.$$

Summing the squares of these residuals over the $S$ different raw score values for a given item $j$ yields the statistic
$$WP_j = \sum_{s=1}^{S} \hat{z}_{js}^2,$$
where $WP_j$ is assumed to be distributed as chi-squared with $(S - 2)$ degrees of freedom.

Wright and Panchapakesan used this as a goodness-of-fit measure for individual items. To get an overall goodness-of-fit measure of the Rasch model to the data, they summed the previous quantity over the $K$ items to yield the statistic
$$WP = \sum_{j=1}^{K}\sum_{s=1}^{S} \hat{z}_{js}^2,$$
where $WP$ is assumed to be distributed as chi-squared with $(K-1)(S-2)$ degrees of freedom.

Note that this chi-square measure can easily be adapted for the two-parameter and three-parameter logistic models.

However, as pointed out by Hambleton and Swaminathan (1985), there are several problems associated with these $\chi^2$ tests of fit. First, the asymptotic $\chi^2$ test has dubious validity when any one of the $E(f_{js})$ terms is less than one, because in such cases the deviates $\hat{z}_{js}$ are not normally distributed, and a chi-squared distribution is obtained only by summing the squares of normal deviates. Second, this $\chi^2$ test is sensitive to sample size: if enough observations are taken, the null hypothesis that the model fits the data can always be rejected using this $\chi^2$ test. Finally, Divgi (1981) and Wollenberg (1980) have demonstrated that the Wright-Panchapakesan goodness-of-fit statistic is not distributed as a chi-square variable, and that the associated degrees of freedom have been assumed to be higher than they actually are.

2.3.2 Bock’s Index (B)

Another commonly used chi-square index was proposed by Bock (1972), which is very similar in spirit to that of Wright and Panchapakesan. The main difference is that Bock proposed to divide the examinees into equal-size subgroups according to their estimated latent ability. Bock's chi-square index is given by
$$B_j = \sum_{s=1}^{S} \frac{(f_{js} - f_s\hat{p}_{js})^2}{f_s\hat{p}_{js}(1 - \hat{p}_{js})},$$
where $f_s$ is the number of examinees in group $s$, $f_{js}$ is the observed number of examinees in group $s$ that correctly answered item $j$, and $\hat{p}_{js} = \Pr(Y_{ij} = 1 \mid \hat{\theta}_s)$, where $\hat{\theta}_s$ is the median of all ability estimates of examinees belonging to group $s$. In practice, this statistic is assumed to have an approximate chi-square distribution with $S - m$ degrees of freedom, where $m$ is the number of item parameters in the assumed model.

2.3.3 Yen’s Index (Q1)

Similarly, Yen (1981) proposed to group the examinees into 10 equal-size subgroups to come up with the following index ($Q_1$):
$$Q_1 = \sum_{s=1}^{10} \frac{(f_{js} - f_s\hat{p}_{js})^2}{f_s\hat{p}_{js}(1 - \hat{p}_{js})},$$
where $f_s$ is the number of examinees in group $s$, $f_{js}$ is the observed number of examinees in group $s$ that correctly answered item $j$, and $\hat{p}_{js} = \Pr(Y_{ij} = 1 \mid \bar{\theta}_s)$, where $\bar{\theta}_s$ is the mean of all ability estimates of examinees belonging to group $s$. In practice, this statistic is assumed to have an approximate chi-square distribution with $10 - m$ degrees of freedom, where $m$ is the number of item parameters in the assumed model.
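To make the computation concrete, the following Matlab sketch (with assumed inputs Y, theta_hat, a_hat, b_hat; not a program from this dissertation) computes Yen's Q1 for a single item j under the two-parameter probit model. The same skeleton adapts to Bock's index above and the grouped indices that follow by changing the grouping variable and the statistic.

% A minimal sketch of Yen's Q1 index for item j. Y is the n-by-k binary
% response matrix, theta_hat the estimated abilities, and a_hat, b_hat
% the estimated item parameters (all assumed inputs).
n = length(theta_hat);
[~, idx] = sort(theta_hat);            % rank examinees by estimated ability
edges = round(linspace(0, n, 11));     % boundaries of 10 equal-size groups
Q1 = 0;
for s = 1:10
    g   = idx(edges(s)+1 : edges(s+1));                      % group s examinees
    fs  = length(g);
    fjs = sum(Y(g, j));                                      % observed correct
    pjs = normcdf(a_hat(j) * mean(theta_hat(g)) - b_hat(j)); % expected prob.
    Q1  = Q1 + (fjs - fs*pjs)^2 / (fs * pjs * (1 - pjs));
end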

2.3.4 Hosmer and Lemeshow Index (HL)

Yen’s Q1 index is actually very similar to one version of the Hosmer-Lemeshow statistic

(1980). The Hosmer-Lemeshow statistic (HL) is a standard statistic used in measuring the goodness-of-fit of ordinary logistic regression model. In this particular version, Hosmer and 38

Lemeshow proposed to rank and partitioned the data into S equal-size intervals according

to the fitted probability of success, to form the statistic

XS (f − f pˆ )2 HL = js s js f pˆ (1 − pˆ ) s=1 s js js

where fs is the number of values in group s, fjs is the number of “successes” in group s,

andp ˆjs is the average estimated success probability of examinees in group s. Hosmer and

Lemeshow, using extensive simulation, claimed that the distribution of this HL statistic is very close to a chi-square distribution with (S − 2) degrees of freedom, and this is what is used in practice. The reason why this statistic is not directly applicable to the dichotomous

IRT models is because of the unknown abilities of the examinees. Unlike in ordinary logistic regression, where the covariate X is generally known, in IRT models, the ability scores which serve as the covariates are unknown and need to be estimated together with the item parameters.

2.3.5 McKinley and Mills Index (G2)

McKinley and Mills (1985) proposed a likelihood ratio $G^2$ statistic based on Yen's grouping style. Examinees were rank-ordered and partitioned into 10 equal-size intervals according to their ability estimates. The $G^2$ statistic is defined as
$$G^2 = 2\sum_{s=1}^{10}\left[ f_{js}\ln\left(\frac{f_{js}}{f_s\hat{p}_{js}}\right) + (f_s - f_{js})\ln\left(\frac{f_s - f_{js}}{f_s - f_s\hat{p}_{js}}\right)\right],$$
where $f_s$, $f_{js}$, and $\hat{p}_{js}$ are exactly the same quantities as in Yen's index. Again, this statistic is assumed to have an approximate chi-square distribution with $10 - m$ degrees of freedom, where $m$ is the number of item parameters in the assumed model.

2.3.6 Orlando and Thissen Indices (S − χ2 and S − G2)

Finally, Orlando and Thissen (2000) proposed two more measures similar to Yen's index and the previous $G^2$ statistic. The only difference is in the grouping of the examinees. They adopted the way Wright and Panchapakesan grouped the examinees, according to the observed raw scores, and came up with the following measures:
$$S\!-\!\chi_j^2 = \sum_{s=1}^{k-1} \frac{(f_{js} - f_s\hat{p}_{js})^2}{f_s\hat{p}_{js}(1 - \hat{p}_{js})},$$
$$S\!-\!G_j^2 = 2\sum_{s=1}^{k-1}\left[ f_{js}\ln\left(\frac{f_{js}}{f_s\hat{p}_{js}}\right) + (f_s - f_{js})\ln\left(\frac{f_s - f_{js}}{f_s - f_s\hat{p}_{js}}\right)\right],$$
where $f_s$ is the number of examinees in score group $s$, $f_{js}$ is the observed number of examinees in score group $s$ that correctly answered item $j$, and $\hat{p}_{js}$ is the expected proportion of examinees in score group $s$ who correctly answered item $j$. They computed these expected proportions using the method of predicting the likelihood distribution for each possible score group. This method was briefly described by Lord and Wingersky (1984) and further developed by Thissen, Pommerich, Billeaud, and Williams (1995).

All of the above-mentioned item goodness-of-fit tests suffer from two major problems. First, their exact distributions are unknown, and the asymptotic chi-square distribution may not always hold. Second, these tests are all sensitive to sample size. It has been said that failure to reject an IRT model is simply a sign that the sample size was too small (McDonald, 1989). But in spite of all these shortcomings, these tests are still being used regularly today.

Later, in Chapter 4, all six of these indices are used within the Bayesian framework as discrepancy measures. Their respective distributions are obtained using the idea of predictive distributions and are approximated by MCMC.

2.4 Discrepancy Measures and Test Quantities

Discrepancy measures are functions of the data $Y$ and the parameters $\xi$ in the model. It is important that the discrepancy measure $T(Y, \xi)$ can effectively measure how much the data deviate from the model. Using the predictive distribution $m(y)$, the distribution of $T$ can be obtained, which can be used to see how "unusual" the observed data are compared to similar replicated data.

Using the appropriate discrepancy measure for the right objective is very crucial. However, finding this appropriate discrepancy measure is not an easy task. In fact, this task is equivalent to finding the desired test statistic in the classical framework.

Sometimes, it is useful to consider functions of just the data, $T(y)$, without worrying about the parameters. These functions are called test quantities. Again, it is very crucial to be able to find the appropriate test quantities that will give the desired results about the goodness-of-fit of the model.

2.5 Predictive Distributions

2.5.1 Prior Predictive Distribution

In Bayesian methodology, inferences about the parameters are all based on their posterior distributions. That is, if a sample $y_1, y_2, \ldots, y_n$ comes from $f(y|\xi)$ and $\xi$ has a prior density $\pi_0(\xi)$, then the posterior distribution of $\xi$ is given by $\pi(\xi|y) \propto f(y|\xi)\pi_0(\xi)$. To check how well the model fits the observed data, one can compute the predictive distribution $m(y)$ by integrating with respect to $\xi$. That is,
$$m(y) = \int f(y|\xi)\,\pi_0(\xi)\, d\xi. \qquad (2.5.1)$$

This is called the prior predictive distribution since the integration was done with respect to the prior density $\pi_0(\xi)$. It is also called the marginal likelihood of the observed data under a certain model. In other words, this quantity measures the likelihood that the observed data came from a certain model. This idea will be useful later in comparing different models.

The predictive distribution $m(y)$ also serves as the marginal density of $Y$. Hence, for any departure statistic $T(Y)$, its distribution $h(t)$ can be obtained from the prior predictive distribution. Once the distribution $h(t)$ is available, it is possible to evaluate how "unusual" the observed departure statistic value $t(y_{obs})$ is by referring to this reference distribution.

Box (1980) proposed to measure this "unusualness" by computing the prior predictive p-value defined as
$$p_{prior} = \Pr^{h(t)}\left[t(Y) \geq t(y_{obs})\right]. \qquad (2.5.2)$$

Another popular measure of surprise is the Relative Predictive Surprise proposed by Berger (1985), defined as
$$RPS = \frac{h(t(y_{obs}))}{\sup_t h(t)}. \qquad (2.5.3)$$
However, there are problems associated with using the prior predictive distribution. For one, when a lack of fit is detected, it is difficult to say whether the lack of fit is due to the wrong specification of the model or of the prior. Another problem arises when the prior density is improper, since in this case the prior predictive $m(y)$ will also be improper.

2.5.2 Posterior Predictive Distribution

To address the two problems of the prior predictive distribution, Rubin (1984) proposed using the posterior distribution instead of the prior density in computing the predictive distribution. That is,
$$m(y^{rep}) = \int f(y^{rep}|\xi)\,\pi(\xi|y_{obs})\,d\xi, \qquad (2.5.4)$$
where $\pi(\xi|y_{obs})$ is the posterior distribution of $\xi$. Using the posterior distribution allows the use of improper priors, as long as the posterior is proper. At the same time, this reduces the effect of the prior density on the predictive distribution.

Rubin proposed to compute posterior predictive p-values using this posterior predictive distribution. Of course, the RPS can also be computed using this posterior predictive distribution.

In practice, the posterior predictive p-value is difficult to compute analytically. Most often, simulation techniques are used to estimate this p-value. This is done by simulating $\xi$ from its posterior distribution $\pi(\xi|y_{obs})$, then simulating replicated data $y^{rep}$ from $f(y|\xi)$. This is repeated $m$ times to get a sample $y^{rep} = (y^{rep}_1, \ldots, y^{rep}_m)$ from the posterior predictive distribution $m(y)$. This sample can be used to compute the values of some test quantities or discrepancy measures $t(y)$ to form an approximate distribution of $h(t)$. The observed discrepancy value $t(y_{obs})$ can then be computed and compared against the distribution of $t(y)$.
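This simulation scheme is easy to sketch in Matlab. The names xi_draws, sim_data, t_fun, and y_obs below are assumptions for illustration, not part of any program referenced in this dissertation.

% A minimal sketch of estimating a posterior predictive p-value by
% simulation. xi_draws holds m posterior draws of xi (one per row),
% sim_data(xi) simulates a replicated data set from f(y|xi), and
% t_fun(y) computes the chosen test quantity.
m = size(xi_draws, 1);
t_rep = zeros(m, 1);
for l = 1:m
    y_rep = sim_data(xi_draws(l, :));   % replicated data for draw l
    t_rep(l) = t_fun(y_rep);            % test quantity of the replicate
end
p_post = mean(t_rep >= t_fun(y_obs));   % posterior predictive p-value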

But this new predictive distribution also comes with some problems, as pointed out by Bayarri and Berger (1997). It appears that the observed data $y_{obs}$ is being used twice: first for training the improper prior $\pi_0(\xi)$ into the proper posterior distribution $\pi(\xi|y_{obs})$, and then for measuring the surprise when locating $y_{obs}$ in the predictive $m(y|y_{obs})$, or $t(y_{obs})$ in $h(t)$. This yields a more conservative test. Another effect of this double use of the data is the inadequate behavior of posterior p-values, which fail to converge to 0 as the observation becomes infinitely extreme.

In an effort to combine the benefits of the prior predictive and posterior predictive distributions, Bayarri and Berger (1997) proposed two new methods to obtain a predictive distribution: the conditional predictive distribution and the partial predictive distribution.

2.5.3 Conditional Predictive Distribution

The conditional predictive distribution is defined as
$$m(y) = \int f(y|u, \xi)\,\pi(\xi|u)\,d\xi, \qquad (2.5.5)$$
where $u(y)$ is an appropriate statistic that gives a proper $\pi(\xi|u)$, which in turn results in a proper $m(y|u)$. Note that there is no double use of the data here, since only part of the data ($u_{obs}$) is used to produce the posterior that eliminates $\xi$, while another part ($t_{obs}$) is used when computing the measure of surprise.

The only problem with the conditional posterior predictive method is that it is very complicated to work with. As a simpler alternative method, Bayarri and Berger proposed to use the partial predictive distribution.

2.5.4 Partial Posterior Predictive Distribution

The partial predictive distribution is defined as
$$m(y\,|\,y_{obs}\backslash t_{obs}) = \int f(y|u, \xi)\,\pi(\xi\,|\,y_{obs}\backslash t_{obs})\,d\xi. \qquad (2.5.6)$$

The idea is to use the information in the data that is not in $t_{obs}$ to train the improper prior into a proper one. One easy way to do this is to divide the data into two parts: one for getting the posterior distribution of the parameters, and the other for measuring the surprise.

2.6 Bayes Factor

In Section 2.5.1, it was pointed out that the predictive distribution $m(y)$ can also be interpreted as the marginal likelihood of getting the observed data under a certain model. If there were two competing models, say $M_1$ and $M_2$, then $m(y|M_1)$ and $m(y|M_2)$ would represent the corresponding likelihoods of observing what was observed under the respective models. The Bayes factor is defined as the ratio of these two likelihoods. That is,
$$BF_{12} = \frac{m(y|M_1)}{m(y|M_2)}. \qquad (2.6.1)$$

For example, when $BF_{12} = 5$, this implies that the observed data is five times more likely to have come from model 1 than from model 2. In many instances, it is more convenient to use the log base 10 unit of measurement. Kass and Raftery (1995) provided the following categories, for values of Bayes factors or log base 10 of Bayes factors, that can serve as a rough descriptive statement about the standards of evidence in scientific investigation, as shown in the table below.

log10 BF10    BF10         Evidence against H0
0 to 0.5      1 to 3.2     Not worth more than a bare mention
0.5 to 1      3.2 to 10    Substantial
1 to 2        10 to 100    Strong
> 2           > 100        Decisive

Table 2.1: Levels of evidence by log10 BF.

CHAPTER 3 OUTLIER DETECTION IN IRT MODELS USING BAYESIAN RESIDUALS

3.1 Introduction

In IRT, there are three types of outliers that can be considered: outlying test items, examinees, and parameters. Because parameters are unknown, the discussion of outlying parameters will be postponed until Chapter 5, where an outlier exchangeable IRT model and the Bayes factor will be utilized. In this chapter, most of the discussion will be based on Bayesian residuals and residual plots for item and examinee fit.

Finding these outliers is important because the presence of a large number of them is an indication that the model being used is not really appropriate for the data. If only a few are present, then these few test items or examinees do not conform to the assumptions of the model, unlike most of the population of items or examinees. In such cases, it may be more beneficial to the overall fit of the model if they are excluded from the model fitting process.

As was discussed in Section 2.2, classical residual plots for dichotomous IRT data sets are not very informative due to the binary nature of the responses. In Section 3.2, the construction of probability interval bands for item response curves is discussed. This simple but very informative plot is used to assess item fit. Two more posterior residual plots are proposed in Section 3.3 to help detect examinees who are purely guessing on the test. In Section 3.4, another Bayesian residual method is proposed to find examinees who scored much higher or much lower than expected. Finally, these different plots and procedures are applied to the BGSU Mathematics placement exam in the last section.

3.2 Detecting Misfitted Items Using IRC Interval Band

One way to check how well a particular fitted item response curve, say of item $j$, fits the data is to construct a probability interval band, similar to a regression confidence band. The idea is to create a region that contains the probability of getting this item correct for different ability scores, with a given probability of, say, 90%. This would serve as a reference region for the item response curve under the assumed model and would visually indicate whether the observed item response curve is within this expected region.

Note that for a given ability score $\theta_i$, the probability of correctly answering item $j$, $p_{ij} = F(a_j\theta_i - b_j)$, depends only on its parameters $(a_j, b_j)$. Therefore, using the posterior sample of $(a_j, b_j)$, a 90% probability interval for $p_{ij}$ can be constructed at each value of the ability score. Plotting these probability intervals for a range of values of the ability score will yield the desired probability interval band. Figure 3.1 displays the 90% probability interval bands for items 15 and 30 of the BGSU Mathematics placement exam.
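A minimal Matlab sketch of this construction, assuming M posterior draws a_draws and b_draws of $(a_j, b_j)$ are available from the MCMC output (the variable names are hypothetical and this is not the dissertation's plotting code):

% A minimal sketch of a 90% probability interval band for the item
% response curve of item j under the two-parameter probit model.
% a_draws and b_draws are M-by-1 posterior draws of (a_j, b_j).
theta = -3:0.1:3;                                            % ability grid
P = normcdf(a_draws * theta - b_draws * ones(size(theta)));  % M-by-61 curves
band = quantile(P, [0.05, 0.95]);                            % pointwise 90% band
plot(theta, band(1,:), '--', theta, band(2,:), '--');
xlabel('LATENT TRAIT'); ylabel('PROBABILITY');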

Plotting the observed item response curve is problematic due to the binary nature of the responses in IRT. To address this problem, Albert (1999) proposed to group examinees to make it less discrete. To group the examinees, the latent ability scale is divided into $S$ intervals and examinees are grouped according to their estimated ability score. For the $s$th group, denote $\theta_s^{(c)}$ to be the center of the $s$th interval, $n_s$ to be the number of examinees belonging to this group, $\hat{p}_{js}$ to be the proportion of examinees in this group who correctly answered item $j$, and $p_{js}^{(c)}$ to be the probability of a correct response for item $j$ in this group. That is, $p_{js}^{(c)} = F(a_j\theta_s^{(c)} - b_j)$, where $a_j$ and $b_j$ are the item parameters.


Figure 3.1: A 90% interval band for the fitted item response curves of items 15 and 30 using the Two-parameter IRT model.

Plotting the observed proportions of correct responses $\hat{p}_{js}$ with the 90% probability interval band estimate of $p_{ij}$ will give a good visual assessment of how well the estimated IRC fits the observed data. The Matlab program plot_obspr_fitpr will construct such a plot for a specified test item.

To illustrate the use of this plot, an artificial 1000 × 35 dichotomous IRT data set was generated. The ability scores and item difficulty values were both simulated from the standard normal density, while the item discrimination values were simulated from Uniform(0, 3) to simulate a data set from the two-parameter model. In the generated data set, the values of the discrimination parameters range from 0.33 to 2.99. Items 10 and 26, which have discrimination parameter values of 2.99 and 0.39, respectively, were used to illustrate the use of this interval band plot. The approximate 90% interval band plots for these items are given below in Figure 3.2. From the plots on the left side of Figure 3.2, it is clear that the one-parameter model is not appropriate for these items, as many of the observed proportions are outside the interval bands. However, the plots on the right indicate that the two-parameter model fits quite well.


Figure 3.2: A 90% interval band for the item response curves of items 10 (above) and 26 (below) fitted with the (left) One-parameter IRT model and (right) Two-parameter IRT model.

Alternatively, one could also plot the posterior residuals, $r_{js} = \hat{p}_{js} - p_{js}^{(c)}$. Again, since a posterior sample of $p_{js}^{(c)}$ is available, a posterior sample of $r_{js}$ is also available. This allows the construction of a probability interval for each residual. Plotting these probability intervals together will yield the plots shown in Figure 3.3. If many of the values of $r_{js}$ are far from zero (that is, their probability intervals do not include 0), this would indicate that the model does not fit the data, as can be seen in the left plots of Figure 3.3. These are the posterior residual plots of items 10 and 26 when the one-parameter model was fitted. The posterior residual plots of these items when the two-parameter model was fitted are shown on the right side of Figure 3.3. Notice how much closer the residuals are to zero and that almost all of their probability intervals include the value zero, indicating that the two-parameter model provides a good fit.


Figure 3.3: Posterior residual plots of items 10 (above) and 26 (below) fitted with the (left) One-parameter IRT model and (right) Two-parameter IRT model.

This posterior residual plot essentially offers the same information as the probability interval band plot. It is just an alternative way of looking at the deviations of the observed data from the assumed model.

3.3 Detecting Guessers

The ability to detect examinees who were simply guessing the answers to the items in the exam is important for two reasons. First, it allows those who are interested in the examinees to detect the students who were not ready for the exam and those who did not take the exam seriously. Second, it allows test developers to isolate and remove those examinees who do not follow the IRT model assumptions. This will result in a better fit of the model to the data and hence will give more accurate estimates of the item parameters and ability scores of examinees.

To detect potential guessers, a simple idea will be utilized: guessers will tend to answer some difficult items correctly and, at the same time, answer some very easy items incorrectly. This pattern of responses can be detected using a plot of the posterior distributions of all item (Bayesian) residuals for a specific individual. This plot will be called the Examinee Bayesian Residual plot.

3.3.1 Examinee Bayesian Residual Plots

Recall that in the two-parameter IRT model, the Bayesian residual was defined in Section 2.2.2 as
$$r_{ij} = y_{ij} - \Phi(a_j\theta_i - b_j). \qquad (3.3.1)$$

As pointed out earlier, these Bayesian residuals have distributions that depend on the distributions of the parameters in the model. Although, in most cases, their posterior distributions are not available analytically, they can be estimated using samples drawn from the posterior distributions of the parameters using MCMC, as was done in the previous sections and in Chapter 1. Having a sample from the posterior distribution of these residuals, one can construct the probability interval plots of these residuals for each item-examinee combination. The Examinee Bayesian residual plot is obtained by plotting these probability interval plots for all the items in the test, according to their estimated difficulty level, for a specified examinee. The Matlab program plot_resid_post_exa (see appendix) creates such a plot for a specified examinee.
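A minimal Matlab sketch of the underlying computation (with assumed draw arrays; this is not the program plot_resid_post_exa itself):

% A minimal sketch of the posterior sample of the Bayesian residuals
% (3.3.1) for one examinee i. theta_draws is M-by-1 (draws of theta_i);
% a_draws and b_draws are M-by-k draws of the item parameters; Y is the
% observed n-by-k binary response matrix (all assumed inputs).
M = length(theta_draws);
k = size(a_draws, 2);
p = normcdf((theta_draws * ones(1, k)) .* a_draws - b_draws); % M-by-k probs
r = repmat(Y(i, :), M, 1) - p;          % M draws of r_ij for each item j
ci = quantile(r, [0.05, 0.95]);         % 90% probability interval per item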

To illustrate how this plot can help spot guessers in a group of examinees, a 1000 × 35 artificial IRT data set of dichotomous responses with a specified number of guessers was generated. Again, both the latent ability scores and item difficulty values were generated from the standard normal distribution. However, the item discrimination values were randomly selected from the possible values {.2, .4, .6, .8, 1, 1.2, 1.4, 1.6}. In addition, a few randomly selected examinees were made to purely guess their responses with a success probability of 0.20. The Matlab program gen_irt_norm_out2 generates this kind of artificial dichotomous data set.

Using the program gen_irt_norm_out2, a data set of 1000 examinees with 35 items was generated with 5 guessers. Then the Examinee Bayesian residual plots were constructed using the program plot_resid_post_exa, with 90% probability intervals of item residuals, for examinees with low, average, and high ability scores. These plots are shown in Figures 3.4-3.7.

Figure 3.4 depicts a typical examinee residual plot of someone with an average ability score (θ close to zero). Note that for easy items, this average examinee is expected to answer them correctly, which resulted in the positive residuals shown on the left side of Figure 3.4.

Figure 3.4: Examinee residual plot of someone with ability score θ = −0.09.

Figure 3.5: Examinee residual plots of examinees with ability scores of θ = −1.15 (left) and θ = −2.19 (right).

Figure 3.6: Examinee residual plots of examinees with ability scores of θ = 1.22 (left) and θ = 2.2 (right).

Figure 3.7: Examinee residual plots of two guessers.

However, as items get more difficult (that is, as one moves from left to right in the residual plot), this examinee is expected to get fewer of these harder items correct. This results in the negative residuals towards the right side of Figure 3.4.

As expected, students with low ability scores have most of their correct responses on easy items (see Figure 3.5). There were very few positive residuals for difficult items. On the other hand, students with high ability scores have more positive residuals (see Figure 3.6) and few negative residuals for easy items. As the ability score increases, more and more of the points are found above the zero line, implying that the examinees are getting more items correct.

However, students who were just guessing showed large positive residuals for difficult items and also large negative residuals for very easy items (see Figure 3.7). In other words, these examinees were able to answer some very difficult items correctly but, at the same time, failed to give the correct responses to some very easy items.

3.3.2 Examinee Bayesian Latent Residual Plots

Although the previous plots provide useful information, as just discussed, it is still difficult to decide when a residual can be considered extreme. Another difficulty with the previous residual plots is the inherent increasing pattern of the residuals as one moves from easy to difficult items. To address these kinds of problems, Albert and Chib (1995) proposed to work with latent residual posterior densities when working with binary linear models.

Latent variables are used to model the underlying characteristic or mechanism that determines the corresponding probability of a correct response. In the case of the two-parameter probit IRT model, where the probability that examinee $i$ answers item $j$ correctly is modeled as
$$p_{ij} = \Pr(Y_{ij} = 1|\theta) = \Phi(a_j\theta_i - b_j),$$
the latent variable $Z_{ij}$, which has a prior normal density with mean $m_{ij} = a_j\theta_i - b_j$ and variance 1, was introduced in Section 1.4.4. As was discussed in that section, this continuous latent variable serves as the underlying mechanism that generates the binary response $y_{ij}$. The posterior distribution of this latent variable is a truncated normal distribution with mean $m_{ij} = a_j\theta_i - b_j$ and variance 1. The truncation is done such that $Z_{ij} > 0$ when $y_{ij} = 1$ and $Z_{ij} \leq 0$ when $y_{ij} = 0$.

The latent residual is defined as
$$r_{ij}^{*} = Z_{ij} - m_{ij}. \qquad (3.3.2)$$
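Within a Gibbs sampler, one draw of this latent residual can be obtained by inverse-cdf sampling of the truncated normal. The following Matlab sketch uses assumed scalar inputs y_ij and m_ij and is only an illustration of the mechanics, not code from this dissertation.

% A minimal sketch of one draw of the latent residual r* = Z - m, where
% Z ~ N(m,1) truncated to Z > 0 if y = 1 and to Z <= 0 if y = 0.
u = rand;
if y_ij == 1                                   % Z truncated to (0, Inf)
    z = m_ij + norminv(normcdf(-m_ij) + u * (1 - normcdf(-m_ij)));
else                                           % Z truncated to (-Inf, 0]
    z = m_ij + norminv(u * normcdf(-m_ij));
end
r_star = z - m_ij;                             % one posterior draw of r*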

Similar to the other Bayesian residuals, this latent residual also has a distribution. A priori, from the definition of $Z_{ij}$, the distribution of this latent residual $r_{ij}^{*}$ is standard normal. However, due to the truncated posterior distribution of $Z_{ij}$, its posterior distribution is a truncated standard normal. If the observed response $y_{ij} = 0$, the truncation is done to the right of $m_{ij}$ when $m_{ij} < 0$, and to the right of $-m_{ij}$ when $m_{ij} > 0$. On the other hand, if $y_{ij} = 1$, the truncation is done to the left of $m_{ij}$ when $m_{ij} < 0$, and to the left of $-m_{ij}$ when $m_{ij} > 0$. Therefore, the posterior distribution of $r_{ij}^{*}$ will not include the value zero when the observation is of the opposite sign to $m_{ij}$, that is, when $y_{ij} = 0$ and $m_{ij} > 0$, or when $y_{ij} = 1$ and $m_{ij} < 0$. Thus, latent residual values that are far from 0 are considered outlying.

As before, the exact posterior distributions can be estimated using samples drawn from the posterior distributions of the model parameters using MCMC. A simple modification of the Matlab program pp2_bay would yield the desired simulated values of these latent variables, along with those of the other model parameters, from their respective posterior distributions. Using these posterior samples, the 90% probability interval plot for each item-examinee combination can be constructed. The Examinee Bayesian latent residual plot is obtained by plotting these 90% probability interval plots for all the items in the test, according to their estimated difficulty level, for a specified examinee. The Matlab program plot_latresid_post_exa (see appendix) creates such a plot for a specified examinee.

Using the program plot_latresid_post_exa on the same set of examinees from the artificially generated data set used in the previous section yields the latent residual plots shown in Figures 3.8-3.11.


−2 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 Estimated Item Difficulty Figure 3.8: Examinee latent residual plot of an examinee with ability score θ = −0.09.

Figure 3.9: Examinee latent residual plots of examinees with ability scores of θ = −1.15 (left) and θ = −2.19 (right).

Figure 3.10: Examinee latent residual plots of examinees with ability scores of θ = 1.22 (left) and θ = 2.2 (right).

Figure 3.11: Examinee latent residual plots of two guessers.

The residual plot on the left side of Figure 3.7 indicated a total of six potentially large residuals (two at the lower left and four at the upper right). But because there was really no exact way to say whether a residual is extreme in that kind of plot, it was difficult to make such definite conclusions. However, from the corresponding latent residual plot in Figure 3.11, it is easy to judge that all six of these residuals are indeed extreme, as none of their posterior distributions include 0. The same can be said about the plot on the right side.

All the other latent residual plots, in Figures 3.8, 3.9, and 3.10, show most of the latent residual posterior distributions covering the value 0, with most of their centers very close to 0. These plots indicate that these examinees were performing according to their latent ability levels and according to the characteristics of the items in the exam, which is true in those cases.

3.4 Detecting Misfitted Examinees

Another kind of residual that one can consider is the difference between the observed raw score $\hat{t}_i$ and the expected raw score $\tau_i$ of an examinee. For a specified examinee with latent ability $\theta_i$, this new residual is defined as
$$r_i = \hat{t}_i - \tau_i, \qquad (3.4.1)$$
where $\hat{t}_i = \sum_{j=1}^{k} y_{ij}$ and $\tau_i = \sum_{j=1}^{k} \Phi(a_j\theta_i - b_j)$.

Because the observed raw score $\hat{t}_i$ is assumed to be fixed for a given examinee, this residual is just a function of the item parameters. Hence, its posterior distribution can again be estimated using the posterior samples of these parameters. With the availability of the posterior sample of this residual, the 95% probability interval can, once again, be constructed. Examinees with probability intervals that do not include the value zero can be considered as outlying. In such cases, when the probability interval is completely above (or completely below) the value zero, the observed raw score of that student is significantly higher (or lower) than what was expected based on the characteristics of the items in the test and his/her estimated ability score. Having many of these kinds of outlying examinees in the data set may be an indication that the model is not appropriate for the data, or that examinees are able to guess the correct answers to items that they normally would not be able to get based on their ability score.

The Matlab program out_post_examinee calculates the approximate 95% probability interval for each examinee's expected raw score and also the approximate 95% probability interval of the residual. This program computes the estimated probabilities of correctly answering each item in the test using the 500 simulated values of the item parameters drawn from their posterior distributions. The sums of these probabilities over all items serve as the fitted raw score of each student. These fitted raw scores represent 500 simulated values from the posterior distribution of each student's expected raw score. If the observed raw score of a student is found in the extreme left or extreme right of this distribution, then the student is considered an outlier.
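A minimal Matlab sketch of this computation for a single examinee i (assumed draw arrays; not the program out_post_examinee itself):

% A minimal sketch of the expected-raw-score residual (3.4.1) for
% examinee i. a_draws and b_draws are M-by-k posterior draws of the item
% parameters; theta_hat(i) is the estimated ability and t_obs(i) the
% observed raw score (all assumed inputs).
tau = sum(normcdf(theta_hat(i) * a_draws - b_draws), 2);  % M draws of tau_i
r   = t_obs(i) - tau;                                     % draws of r_i
ci  = quantile(r, [0.025, 0.975]);                        % 95% interval
outlying = (ci(1) > 0) || (ci(2) < 0);                    % zero excluded?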

To see the kind of results that one can expect from this procedure, a data set with 1000 examinees and 35 items was generated, using ability scores and item difficulty values drawn from the standard normal distribution and discrimination values drawn from the possible values {.2, .4, .6, .8, 1, 1.2, 1.4, 1.6}. Then the correct two-parameter probit model was used to fit the data using the Matlab program pp2_bay. Finally, out_post_examinee was used to compute the number of outlying examinees in the group.

The result showed that 10 out of the 1000 examinees scored significantly higher than what was expected of them based on their estimated ability scores, while 9 out of 1000 examinees scored significantly lower than expected. This gives a total of 19 misfitted examinees, which is considerably lower than 50, the number of examinees that would be expected to fall outside of the 95% probability interval.

To check if the 95% probability interval is really wider than expected, this process was repeated 100 times and the numbers of outlying students were recorded. Summary histograms of these numbers are given below (see Figure 3.12).

These histograms indicate that out of 1000 examinees, the average numbers of examinees who scored much too high and much too low were both somewhere close to 15. This result indicates that the probability interval is indeed wider than expected.

This is a little surprising because the residual defined in (3.4.1) does not even consider the variability of $\hat{t}_i$ (when this variability is included, the distribution of the residual becomes a predictive distribution, which is discussed in Chapter 4). Hence, one would expect its probability interval to be narrower. A possible explanation for the presence of this higher variability is the fact that the value of $\tau_i$ depends on three quantities, $(\theta_i, a_j, b_j)$, which all contribute some sampling variability to the residual.


Figure 3.12: Histograms of the number of examinees (out of 1000) who scored (left) much too high and (right) much too low.

3.5 Application To Real Data Set

1. Assessing examinee fit

When the previous method to detect examinees who scored much too high or much too low was applied to the BGSU Mathematics placement data set, out of the 1286 students who took this exam, only six scored much lower than expected while 26 scored much higher than expected. This unbalanced result may be an indication that students were able to improve their scores by guessing the correct answers to items that they normally would not have gotten correct. All of the 26 students who scored much too high got scores of at least 33 correct out of the 35 questions. One of them is student number 185, who got a perfect score. His/her examinee residual plot is shown in Figure 3.13 below. Based on his/her estimated ability score of 2.32, he/she was not expected by the model to correctly answer the most difficult item (Q30).

Figure 3.13: Examinee residual and latent residual plots of examinee no. 185.

To check for students who were purely guessing their responses, the examinee residual plots were constructed for students with low estimated abilities. Two such students were examinees number 802 and 854, who got only 2 and 5 correct answers, respectively. Their examinee residual plots are given in Figure 3.14.

The examinee residual plot of examinee 854 seems to indicate a potential guesser because of the negative residuals found on the lower left portion of the plot and some positive residuals found on the upper middle portion of the plot. Looking at the corresponding latent residual plot, the negative residuals on the lower left portion of the plot are well within the expected range of the 90% probability interval, but the four correct responses were not expected by the model. These may indeed be a result of guessing. The plots of examinee 802 clearly indicate that this student was simply unprepared for the exam, as he/she got only 2 items correct, and these are two easy items of the test. The latent residual plot indicates that all responses of this student are well within the expected probability interval.


Figure 3.14: Residual and latent residual plots of examinee no. 802 (above) and 854 (below).

Figure 3.15: Examinee residual and latent residual plots of examinee no. 500.

One interesting examinee is student number 500. His/her residual plots in Figure 3.15 show that he/she (unexpectedly) got the most difficult item correct but (unexpectedly) failed to give the right response for the easiest item.

2. Assessing item fit

To examine the fit of some of the items in the BGSU Mathematics placement exam when the two-parameter IRT model was fitted, the 90% probability interval band estimates of the IRC for items 15, 21, and 30 and their residual plots were constructed. These plots are given below in Figures 3.16 and 3.17. Both of the plots in Figure 3.16 indicate a good fit for item 15.


Figure 3.16: IRC band and Posterior Residual Plot of Item 15.

However, the 90% probability interval band IRC plots for the two extreme items (Q21 and Q30), shown in Figure 3.17, show some strong deviations of the observed data from the predicted item response curve. This is especially pronounced for item 30, where most of the posterior residual intervals do not include the value 0.


Figure 3.17: Item response curves of item 21 (above) and item 30 (below).

Although these residual plots provide useful information about the fit of the model, they involve some subjectivity in their interpretation. In the next chapter, numerical procedures will be discussed that provide a more objective assessment of the goodness-of-fit of IRT models.

CHAPTER 4 ASSESSING THE GOODNESS-OF-FIT OF IRT MODELS USING PREDICTIVE DISTRIBUTIONS

4.1 Introduction

In the previous chapter, all the Bayesian methodologies that were presented were visual in nature and involved subjectivity in their assessment. In this chapter, quantitative Bayesian methods are presented to give an objective way to assess the fit of the different IRT models.

It was discussed in Chapter 2 that the available classical $\chi^2$-type procedures currently being used for model checking are not really satisfactory because their exact sampling distributions are unknown. The methods discussed in this chapter are proposed to serve as better alternatives to the existing classical test procedures.

In Section 4.2, a method to check the overall fit of the one-parameter probit model is presented using the idea of predictive distributions. In particular, the reference distribution of the sample standard deviation of the item point biserial correlations is obtained using both the prior and the posterior predictive distributions. Simulation results have shown that this method has a low type I error rate and a much higher detection rate compared to the results reported by Orlando and Thissen (2000) using their proposed classical χ2 test.

In Section 4.3, the fit of individual items in the test will be assessed using the six $\chi^2$ indices described in Section 2.3 as different discrepancy measures. The main problem faced in the classical approach, determining the distributions of these test statistics, is conveniently solved by the availability of the posterior distributions of the parameters in the model, which also determine the reference distribution of any discrepancy measure involving these parameters. In particular, the three different predictive distributions (prior, posterior, and partial posterior predictive) will be utilized and compared in this section.

In Section 4.4, two more discrepancy measures are introduced to measure person fit to the IRT models. The posterior predictive distribution is used to obtain their reference distributions. These two discrepancy measures are used to detect misfitted examinees, specifically those who were simply guessing on the test.

Finally, all the methods discussed in this chapter are applied to the BGSU Math placement data.

4.2 Checking the Appropriateness of the One-parameter Probit IRT Model

Checking whether the one-parameter model is appropriate for the data is equivalent to testing the null hypothesis, H0 : a1 = ··· = ak, versus the alternative hypothesis, H1 : not all aj are equal. To do this, the standard deviation of the point biserial correlations of the items is used as the test quantity.

4.2.1 Point Biserial Correlation

Lord and Novick (1968) showed that, under the normal ability distribution, the discrimination parameter can be represented as aj = ρj/√(1 − ρj²), where ρj is the point biserial correlation between the item response and the ability trait. However, this ability trait is not observable. As a substitute, the raw score t̂i of the examinees is used in the computation of the point biserial correlation.

The sample point biserial correlation (rpbis) is a measure of association between a continuous variable X and a binary variable Y. The formula for computing rpbis is given by

rpbis = (X̄1 − X̄0)√(p(1 − p)) / SX,   (4.2.1)

where X̄0 is the mean of X when Y = 0, X̄1 is the mean of X when Y = 1, SX is the standard deviation of X, and p is the proportion of values where Y = 1.

This quantity is mathematically equivalent to the traditional correlation formula, and the interpretation is similar. The point biserial correlation is positive when large values of X are associated with Y = 1 and small values of X are associated with Y = 0.

In the IRT case, rpbis is computed for each item as the sample point biserial correlation of the examinees' raw scores and their binary responses to this item. This yields a total of k (the number of test items) rpbis values per data set.

Note that the discrimination parameter is a monotone function of the biserial correlation. Hence, a good way to check whether all the discrimination parameter values are equal is to measure the standard deviation of the k sample point biserial correlations. A large standard deviation would indicate major differences in the point biserial correlations, which in turn would indicate major differences in the discrimination parameter values. Therefore, a large standard deviation of these point biserial correlations would support the two-parameter IRT model over the one-parameter IRT model.

The Matlab program rpbis computes this point biserial correlation for each item, along with their standard deviation. Using the same artificially generated two-parameter data set used in Section 3.4, where the discrimination parameters aj were randomly selected from {0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6}, the observed standard deviation of the k point biserial correlations was 0.1512. To determine whether this value is large enough to conclude that the discrimination parameters are not all equal, a reference distribution for std(rpbis) is needed. As was discussed in Chapter 2, this reference distribution can be obtained using predictive distributions.
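The computation behind this test quantity is simple enough to sketch directly. The following is a minimal illustration, not the author's rpbis program; the function name rpbis_sd is ours.

    % Minimal sketch: point biserial correlation of each item with the raw
    % score, and the standard deviation of these k correlations.
    function s = rpbis_sd(Y)          % Y: n-by-k matrix of 0/1 responses
      t = sum(Y, 2);                  % raw scores of the n examinees
      k = size(Y, 2);
      r = zeros(1, k);
      for j = 1:k
        y  = Y(:, j);
        p  = mean(y);                 % proportion answering item j correctly
        x1 = mean(t(y == 1));         % mean raw score of correct responders
        x0 = mean(t(y == 0));         % mean raw score of incorrect responders
        r(j) = (x1 - x0) * sqrt(p * (1 - p)) / std(t, 1);   % equation (4.2.1)
      end
      s = std(r);                     % test quantity T(y) = std(rpbis)
    end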

4.2.2 Using Prior Predictive

In Chapter 2, equation (2.5.1), the prior predictive distribution was defined as

m(y) = ∫ f(y|ξ) π0(ξ) dξ,

where f(y|ξ) is the likelihood function and π0(ξ) is the prior density of the parameter ξ. In the IRT case, ξ = (a1, ..., ak, b1, ..., bk, θ1, ..., θn), and the large number of parameters in the model makes the actual integration with respect to these parameters problematic.

As an alternative, one could approximate this distribution by simulating values of y from m(y) using MCMC methods. In particular, this can be done by simulating m values of ξ from its prior distribution, π0(ξ), to get {ξ(l)}, l = 1, ..., m. Then, for each ξ(l), simulate y(l) from f(y|ξ(l)). The set of values T(y(l)), l = 1, ..., m, will serve as a sample from the marginal distribution h(t) of the test quantity or discrepancy measure T.

In the previous section, the standard deviation of the k point biserial correlations was used as the test quantity. That is, T(y) = std(rpbis).

Hence, its distribution under the one-parameter IRT model can be approximated by drawing {b(l), θ(l)} from their respective prior distributions, then simulating {y(l)} from the one-parameter model, f(y|b(l), θ(l)), l = 1, ..., m. For each replicated n × k response matrix y(l), the standard deviation of the point biserial correlations was computed to form a sample from its marginal distribution. In particular, 500 replicated response data sets were simulated, and the resulting standard deviations of the biserial correlations are summarized in the histogram shown in Figure 4.1. This histogram is produced by the program prior_pred_rpbis.
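The whole simulation can be sketched in a few lines. This is a minimal illustration, not the author's prior_pred_rpbis program; it assumes standard normal priors on the difficulties and abilities, evaluates the probit link via Φ(x) = erfc(−x/√2)/2, relies on Matlab's implicit array expansion, and calls the rpbis_sd helper defined above.

    % Minimal sketch: prior predictive sample of T(y) = std(rpbis) under
    % the one-parameter probit model.
    n = 1000; k = 35; m = 500;
    T = zeros(m, 1);
    for l = 1:m
      b     = randn(1, k);                         % difficulties ~ N(0,1)
      theta = randn(n, 1);                         % abilities    ~ N(0,1)
      P     = 0.5 * erfc(-(theta - b) / sqrt(2));  % Phi(theta_i - b_j), n-by-k
      Y     = rand(n, k) < P;                      % replicated responses y(l)
      T(l)  = rpbis_sd(Y);
    end
    p_prior = mean(T > 0.1512)   % share of replicates exceeding the observed value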

Figure 4.1: Histogram of 500 simulated values of std(r-pbis) using the prior predictive distribution.

Recall that the standard deviation of the observed rpbis is 0.1512, which is located at the extreme right of this predictive distribution. This constitutes strong evidence that the data are not from the one-parameter model. To get a quantitative measure of the "unusualness" of the observed data, one can use the prior predictive p-value given in equation (2.5.2). In this particular case, since the test quantity is a standard deviation, large observed values are considered extreme. Hence, the prior p-value for this test quantity is estimated by

Pprior = (1/m) Σ_{l=1}^{m} I[T(y(l)) > T(yobs)].   (4.2.2)

The prior p-value of this artificial data set is 0. Therefore, there is enough evidence to reject the one-parameter probit IRT model.

One desirable property of the classical p-value is that its distribution is uniform on the interval [0, 1] when all the parameters are specified in the null hypothesis or when T is an ancillary statistic (Bayarri and Morales, 2003). To see the kind of values that one gets for this particular Bayesian prior p-value, and to get an idea of its distribution, a simple simulation was performed. One hundred data sets were generated from the one-parameter probit IRT model. Each data set represented the results of a test with 35 items given to 1000 examinees. Both the examinees' latent abilities and the item difficulty parameter values were randomly generated from the standard normal distribution. For each data set, the correct one-parameter probit model was fitted and the prior p-value was computed using 500 item difficulty parameter values drawn from their prior distribution. A summary of the 100 prior p-values is given in the histogram in Figure 4.2. The values range from 0.004 to 0.994, with only 3 values less than 0.05. Hence, the approximate type I error rate is 0.03 when a threshold value of 0.05 is used. The histogram also portrays a distribution that resembles the desired uniform[0, 1] distribution.

Now, to see the power of this procedure in detecting a wrong model, another 100 data sets were generated, this time from the two-parameter probit IRT model. Again, each data set had 1000 examinees and 35 items. The item difficulty parameter values were randomly selected from the standard normal distribution, while the item discrimination parameter values were randomly selected from {0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6}.

Figure 4.2: This histogram of the 100 simulated prior predictive p-values illustrates that the distribution of the prior p-value of std(rpbis) is close to uniform[0,1].

For each data set, the (wrong) one-parameter probit model was fitted and the prior p-value was computed using 500 item difficulty parameter values drawn from their prior distribution. A summary of the 100 observed values of std(rpbis) is given in the histogram in Figure 4.3. The values range from 0.1187 to 0.2203.

The results were strikingly accurate: all 100 observed values of std(rpbis) were larger than most of the replicated values of std(rpbis). This resulted in almost all of the 100 prior p-values being very close to 0. At a 0.05 threshold level, all p-values were significant. In other words, the procedure was able to detect the wrong model 100% of the time.

Figure 4.3: Histograms of 100 observed std(rpbis) when data sets were generated using the two-parameter (left) and one-parameter (right) models.

4.2.3 Using Posterior Predictive

Another popular predictive distribution that can be used to obtain a reference distribution for the test quantity std(rpbis) is the posterior predictive distribution. In equation (2.5.4), the posterior predictive distribution was defined as

m(yrep) = ∫ f(yrep|ξ) π*(ξ|Yobs) dξ,

where f(yrep|ξ) is the likelihood function of future responses and π*(ξ|Yobs) is the posterior density of the parameter ξ.

Again, in the IRT case, ξ = (a1, ..., ak, b1, ..., bk, θ1, ..., θn). To get a sample from the posterior predictive distribution of a certain test quantity or discrepancy measure, draw m values of ξ from its posterior distribution, π*(ξ|Yobs), to get {ξ(l)}, l = 1, ..., m. Then, for each ξ(l), simulate yrep(l) from f(yrep|ξ(l)). The set of values T(yrep(l)) would serve as a sample from the posterior distribution h*(t) of the test quantity T(y) or discrepancy measure T(y, ξ).

For the test quantity T(y) = std(rpbis), the standard deviation of the k point biserial correlations of the item responses and examinee raw scores, 500 future data sets were replicated using 500 parameter values drawn from the posterior distribution. For each replicated data set, T(yrep) was computed to form a sample of 500 values from the posterior distribution of T(y). A summary histogram of these 500 sample points is shown in Figure 4.4. This histogram is produced by the program post_pred_rpbis.
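Operationally, this differs from the prior predictive sketch only in where the parameter values come from. A minimal sketch, assuming the MCMC fit of the one-parameter model has already stored its m posterior draws in matrices B (m × k difficulties) and TH (m × n abilities); these array names are ours, not from the author's programs:

    % Minimal sketch: posterior predictive sample of T(y) = std(rpbis)
    % under the one-parameter probit model, given stored posterior draws.
    T = zeros(m, 1);
    for l = 1:m
      P    = 0.5 * erfc(-(TH(l,:)' - B(l,:)) / sqrt(2));  % Phi(theta_i - b_j)
      Y    = rand(n, k) < P;                 % replicated data yrep(l)
      T(l) = rpbis_sd(Y);
    end
    p_post = mean(T > 0.1512)                % observed T(yobs) = 0.1512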

Figure 4.4: Histogram of 500 simulated values of std(r-pbis) using the posterior predictive distribution.

Recall that the standard deviation of the observed rpbis was 0.1512, which lies far to the right of this posterior predictive distribution. To get a quantitative measure of the "unusualness" of this value, one can compute the posterior predictive p-value in a manner similar to the prior predictive p-value of equation (4.2.2). That is,

Ppost = (1/m) Σ_{l=1}^{m} I[T(yrep(l)) > T(yobs)].   (4.2.3)

This gave a posterior p-value of 0, since the observed value of the test quantity std(rpbis) was bigger than all the replicated values of this test quantity. This implies that there is enough variability in the discrimination parameters to warrant the use of the two-parameter IRT model. Note that the posterior predictive distribution of std(rpbis) (see Figure 4.4) is narrower than its prior predictive distribution (see Figure 4.1). This result seems to indicate that the posterior predictive distribution of std(rpbis) has higher precision and, therefore, higher power than the prior predictive distribution in detecting a data set that does not come from the one-parameter IRT model.

To get an idea of the distribution of this posterior predictive p-value, a similar simulation study was performed using 100 data sets generated from the one-parameter probit IRT model. Again, each data set represented the results of a test with 35 items given to 1000 examinees. Both the examinees' latent abilities and the item difficulty parameter values were randomly generated from the standard normal distribution. For each data set, the correct one-parameter probit model was fitted and the posterior predictive p-value was computed using 500 item difficulty parameter values drawn from the posterior distribution. A summary of the 100 posterior p-values is given in the histogram in Figure 4.5. The values range from 0.016 to 0.924, with only 3 values less than 0.05. Hence, the approximate type I error rate is 0.03 when a threshold value of 0.05 is used. However, Figure 4.5 indicates that the distribution of the posterior p-value of std(rpbis) is not uniform (for more discussion of desirable properties of p-values from both frequentist and Bayesian points of view, see Bayarri and Berger, 2000).

Now, to see the power of this procedure, another 100 data sets were generated from the two-parameter probit IRT model. Again, each data set had 1000 examinees and 35 items.

Figure 4.5: Histogram of 100 posterior predictive p-values.

The item difficulty parameter values were randomly selected from the standard normal distribution, while the item discrimination parameter values were randomly selected from {0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6}. For each data set, the (wrong) one-parameter probit model was fitted and a posterior predictive p-value was computed using 500 item difficulty parameter values drawn from the posterior distribution. Again, the results obtained were strikingly accurate: almost all of the 100 posterior p-values were 0. At a 0.05 threshold level, all p-values were significant. In other words, the procedure was able to detect the wrong model 100% of the time. However, it is important to remember that this detection rate is a function of how far apart the actual discrimination parameter values are.

Orlando and Thissen (2000) proposed two new indices, S − χ2 and S − G2, to check the fit of the one-parameter model on data sets generated from the one-parameter and two-parameter models. The simulation results for their two new indices, as well as those of two other χ2-type indices, Q1 − χ2 and Q1 − G2, are given in Table 4.1.

Index      Test length   Type I rate   Detection rate
S − χ2         10            0.04           0.57
               40            0.06           0.52
               80            0.05           0.46
S − G2         10            0.05           0.59
               40            0.08           0.60
               80            0.12           0.60
Q1 − χ2        10            0.95           0.97
               40            0.14           0.68
               80            0.06           0.66
Q1 − G2        10            0.97           0.97
               40            0.21           0.69
               80            0.10           0.67

Table 4.1: Orlando and Thissen (2000) simulation results: proportion of significant p-values (< 0.05).

Note that their proposed indices performed better than the other two χ2 indices in terms of type I error rate. However, their power to detect the two-parameter data was not as high as that of the other two indices. In any case, the results for all four indices pale in comparison to the type I error and detection rates observed above using the predictive distributions.

4.3 Item Fit Analysis

The procedures described in the previous section are useful for detecting the overall fit of the model. What if one is interested in detecting misfit per item in the test? To accomplish this, test quantities or discrepancy measures are needed that measure, for each item, the closeness of the observed outcomes to those predicted by the model. A simple test quantity that might be useful is the proportion of students who correctly answered a particular item, which will be denoted by PC.

One could also use more complicated measures, such as the six indices described earlier in Section 2.3. These six discrepancy measures are expected to be more effective than PC since they measure the deviations of the observed proportion of students who correctly answered the item from the corresponding predicted success probability at multiple points along the ability scale. This might be crucial in detecting a wrong specification of the discrimination parameter or the guessing parameter. The only difficulty with these indices is that they are functions not just of the observed data but of the parameters as well. These measures will be used later in Section 4.3.2.

4.3.1 Using Prior Predictive

Using the idea of the prior predictive distribution discussed in Chapter 2 and used in the previous section, one could approximate the reference distribution of the test quantity PC. Once this reference distribution is obtained, the observed value, PCobs, can be located within this distribution and the prior predictive p-value can be computed.

To see how effective this test quantity and the prior predictive p-value are in detecting item misfit, 100 artificially altered data sets were generated. Each data set contains 1000 examinees and 35 items. The examinees' latent abilities and item difficulty parameter values were randomly drawn from the standard normal distribution. The item discrimination parameter values were all 1, except that items 1, 5, and 10 had values of 2, while items 15, 20, and 25 had values of 0.5. Also, items 1, 10, 15, 25, 30, and 35 had a guessing probability of 0.25. In other words, students who do not know the correct answer to these questions can guess the right answer with probability 0.25. For each data set, the one-parameter probit model was fitted and the prior predictive p-value was computed using 500 item difficulty parameter values drawn from their prior distribution.

The results obtained were not encouraging. There was no difference in the number of significant p-values between misfitted items and well-fitted items. In other words, this particular procedure, using the test quantity PC, was not effective in detecting the misfitted items. This is an example of an inappropriate test quantity.

4.3.2 Using Posterior Predictive

When the posterior predictive distribution was used to obtain a reference distribution for the test quantity PC, the same discouraging results were obtained. This is an indication that a more complicated test quantity or discrepancy measure is needed for this task.

Gelman, Meng, and Stern (1996) generalized the posterior predictive idea by considering realized discrepancy measures, which are functions of both the observed data and the parameters, T(y, ξ). Using the simulated values of ξ = (θ, a, b) from its joint posterior distribution, a model-conforming data set yrep is generated, and the posterior predictive p-value is defined as

Ppost = Pr(T(yrep, ξ) ≥ T(yobs, ξ) | yobs),   (4.3.1)

when larger values of T(yobs, ξ) represent more extreme values, and

Ppost = Pr(T(yrep, ξ) ≤ T(yobs, ξ) | yobs),   (4.3.2)

when smaller values of T(yobs, ξ) represent more extreme values.

With this generalization, it is now possible to use any one of the six indices described earlier in Section 2.3 as a discrepancy measure. Using the concept of the posterior predictive distribution, one can estimate the posterior distributions of these six discrepancy measures and check how they perform in detecting misfit for each individual item in the test.
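The mechanics can be sketched as follows for a single item j. This is a minimal illustration only: the Pearson-type discrepancy below is a simple stand-in, not one of the six indices of Section 2.3, and the array names TH and B for the stored posterior draws are ours.

    % Minimal sketch: realized-discrepancy posterior predictive p-value
    % for item j under the one-parameter probit fit.
    ge = zeros(m, 1);                 % indicator T(yrep,xi) >= T(yobs,xi)
    for l = 1:m
      p    = 0.5 * erfc(-(TH(l,:)' - B(l,j)) / sqrt(2));   % p_ij, i = 1..n
      yrep = rand(n, 1) < p;          % replicated responses to item j
      Trep = sum((yrep - p).^2 ./ (p .* (1 - p)));
      Tobs = sum((Yobs(:,j) - p).^2 ./ (p .* (1 - p)));
      ge(l) = Trep >= Tobs;           % compared at the same parameter draw
    end
    p_post_j = mean(ge)               % equation (4.3.1)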

To see how good they are at detecting misfitted items, another 100 artificially altered data sets were generated, in exactly the same way as the 100 data sets used in the previous section. For each data set, the one-parameter probit model was fitted and a posterior predictive p-value was computed, using 200 item difficulty parameter values drawn from the posterior distribution, for each of the six indices described earlier. A summary of the results is given in Table 4.2. These values were obtained using the Matlab programs gof_item_B, gof_item_hl, gof_item_Q1, gof_item_G, and gof_item_sq_wp. These programs are very similar, and so only gof_item_G is included in the appendix (see Appendix B.3).

From this table one can see that the discrepancy measures B, Q1, HL, and G2 show very low false positive rates (type I error). This is very desirable because it implies that the procedure is safe to use: it rarely tags a well-fitting item as a misfit. WP has a respectable type I error rate, while S − G seems to have a higher type I error rate, which makes it a little undesirable. As for the power to detect the misfitted items, all six indicate very good detection rates. Among the six discrepancy measures, G2 seems to have the best combination of a low type I error rate and a high detection rate.

Next, the performance of these six discrepancy measures was studied when the two-parameter probit IRT model was fitted to the 100 artificially generated data sets. Here, the main interest was to see whether these discrepancy measures were effective in detecting items with a positive guessing parameter value. A summary of the results is given in Table 4.3.

Item                  B    Q1   HL   G2   S−G   WP
1  (a=2, c=0.25)     45    82   81   91    92   77
2                     1     0    0    0    12    6
4                     1     0    0    1    11    8
5  (a=2, c=0)        95   100  100  100   100  100
6                     1     0    0    0    14    2
9                     2     0    0    0     8    5
10 (a=2, c=0.25)     51    88   86   96    92   86
11                    1     0    0    1     8    4
14                    1     0    0    0    12    5
15 (a=0.5, c=0.25)   93    99   99   99    94   98
16                    1     0    0    0    11    5
19                    1     0    2    2    12    4
20 (a=0.5, c=0)      80    98   99   98    76   97
21                    1     0    0    0     8    0
24                    1     0    0    0     8    8
25 (a=0.5, c=0.25)   94    98   98   98    93   97
26                    1     1    1    3    16    1
29                    1     0    0    0     9    5
30 (a=1, c=0.25)     33    57   56   55    33   40
31                    1     0    0    0    12    9
34                    1     0    0    0    10    5
35 (a=1, c=0.25)     36    49   49   48    33   44

Table 4.2: Percentage of p-values < 0.05 out of 100.

Again, the results show very low type I error rates for all items that fit the two-parameter model, including items 5 and 20. However, the power of these discrepancy measures to detect items 1, 10, 15, 25, 30, and 35 as misfitted was not as high as before. In fact, only items 1 and 10 had a respectable detection rate, of about 60%, for the G2 discrepancy measure. This seems to indicate that the procedure is able to detect positive guessing only when the discrimination parameter value is large.

To see how the value of the discrimination parameter affects the detection rates of these discrepancy measures, a second simulation study was performed where the data sets were now generated from fixed item parameter values. In particular, the item difficulty parameter

Item                  B    Q1   HL   G2   S−G   WP
1  (a=2, c=0.25)     30    55   55   59    59   59
2                     0     0    0    0     3    2
4                     0     0    0    0     5    6
5  (a=2, c=0)         1     0    0    0     5    5
6                     0     0    0    0     3    2
9                     0     0    0    0     1    1
10 (a=2, c=0.25)     35    55   59   61    56   58
11                    0     0    0    0     3    6
14                    0     0    0    0     1    1
15 (a=0.5, c=0.25)    6     0    0    0     5    8
16                    0     0    0    0     4    5
19                    0     0    0    0     7    6
20 (a=0.5, c=0)       3     0    0    0     4    4
21                    1     0    0    0     1    1
24                    0     0    0    0     2    4
25 (a=0.5, c=0.25)    5     0    0    0     2    3
26                    0     0    0    0     2    2
29                    0     0    0    0     4    2
30 (a=1, c=0.25)     11     4    3    4    12   12
31                    0     0    0    0     3    3
34                    0     0    0    0     3    3
35 (a=1, c=0.25)     11     3    4    4     7   13

Table 4.3: Percentage of p-values < 0.05 out of 100 (two-parameter probit model fitted).

values, b, were set at −1.5, 0, or 1.5, while the item discrimination parameter values, a, were set at 0.5, 0.8, 1, 1.2, and 1.5 for each possible value of b. In addition, two possible item guessing parameter values, 0 and 0.25, were used, for a total of 30 different combinations. Accordingly, 100 data sets were generated with 1000 examinees and 30 items (one item for each possible combination of parameter values), where the latent traits were drawn from the standard normal distribution. For each data set, the one-parameter probit model and the two-parameter probit model were fitted, and the corresponding posterior predictive p-values were computed using 200 simulated values of the parameters from their posterior distributions. Summaries of the results are given in Tables 4.4 through 4.7.
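For concreteness, the 30-item design and one generated data set can be sketched as follows (a minimal illustration assuming the three-parameter probit form pij = cj + (1 − cj)Φ(ajθi − bj); the variable names are ours):

    % Minimal sketch: the 3 x 5 x 2 design grid and one simulated data set.
    [bv, av, cv] = ndgrid([-1.5 0 1.5], [0.5 0.8 1 1.2 1.5], [0 0.25]);
    b = bv(:)';  a = av(:)';  c = cv(:)';     % 1-by-30 item parameters
    theta = randn(1000, 1);                   % latent traits ~ N(0,1)
    P = c + (1 - c) .* (0.5 * erfc(-(theta * a - b) / sqrt(2)));
    Y = rand(1000, 30) < P;                   % 1000-by-30 response matrix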

  b     a    Item    B    HL   Q1   G2   S−G   WP
 -1.5  0.5    1      6    41   43   28    2    20
       0.8    7      0     0    0    0    6     6
       1     13      8     5    5   11   67    46
       1.2   19     39    46   45   75   98    91
       1.5   25     74   100  100  100  100   100
  0    0.5   11     11    90   86   81    0    13
       0.8   17      0     0    0    0    3     3
       1     23     28    20   16   28   85    75
       1.2   29     84    96   95   97  100   100
       1.5    5    100   100  100  100  100   100
  1.5  0.5   21     18    57   57   50   10    25
       0.8   27      0     0    0    0    4     1
       1      3      4     3    1    4   33    11
       1.2    9     52    34   30   62   95    76
       1.5   15     76    98   99  100  100   100

Table 4.4: Percentage of significant p-values when the one-parameter probit model is used on items with no guessing parameter (c = 0).

  b     a    Item    B    HL   Q1   G2   S−G   WP
 -1.5  0.5   16      3    41   37   24    6    20
       0.8   22      0     0    0    0    2     7
       1     28      0     0    0    0   12     3
       1.2    4      6     0    1    7   50    28
       1.5   10     35    37   42   74   97    84
  0    0.5   26     76   100  100  100   26    74
       0.8    2      1    37   33   29    0     5
       1      8      1     2    2    2    1     6
       1.2   14      4     0    1    3   21    25
       1.5   20     24    15   15   31   75    71
  1.5  0.5    6    100   100  100  100  100   100
       0.8   12    100   100  100  100   99   100
       1     18     98   100  100  100   93   100
       1.2   24     97   100  100  100   69    98
       1.5   30     84   100   99   98   68    90

Table 4.5: Percentage of significant p-values when the one-parameter probit model is used on items with guessing parameter value of c = 0.25.

  b     a    Item    B    HL   Q1   G2   S−G   WP
 -1.5  0.5    1      0     0    0    0    2     1
       0.8    7      1     0    0    0    4     8
       1     13      0     0    0    0    3     5
       1.2   19      0     0    0    0    0     8
       1.5   25      0     0    0    0    2     3
  0    0.5   11      0     0    0    0    2     2
       0.8   17      0     0    0    0    0     1
       1     23      0     0    0    0    0     2
       1.2   29      0     0    0    0    1     3
       1.5    5      0     0    0    0    0     2
  1.5  0.5   21      3     0    0    0    4     2
       0.8   27      0     0    0    0    3     2
       1      3      0     0    0    0    4     1
       1.2    9      0     0    0    0    2     2
       1.5   15      0     0    0    0    2     3

Table 4.6: Percentage of significant p-values when the two-parameter probit model is used on items with no guessing parameter (c = 0).

  b     a    Item    B    HL   Q1   G2   S−G   WP
 -1.5  0.5   16      0     0    0    0    2     2
       0.8   22      0     0    0    0    3     6
       1     28      0     0    0    0    0     1
       1.2    4      0     0    0    0    1     1
       1.5   10      0     0    0    0    1     1
  0    0.5   26      0     0    0    0    2     2
       0.8    2      0     0    0    0    2     3
       1      8      1     0    0    0    2     4
       1.2   14      3     1    1    2    7    16
       1.5   20      3     7    6    6   20    29
  1.5  0.5    6     12     0    0    0    2     1
       0.8   12     29     3    2    2   12    16
       1     18     51    15   17   12   24    28
       1.2   24     67    30   28   28   27    36
       1.5   30     90    69   67   68   73    77

Table 4.7: Percentage of significant p-values when the two-parameter probit model is used on items with guessing parameter value of c = 0.25.

When the one-parameter probit IRT model was fitted to the data sets, the results for items with no guessing effect (c = 0) and difficulty parameter values of −1.5 and 1.5 were quite similar: a respectable detection rate when the discrimination value is 0.5 or 1.2, and a high detection rate when the discrimination value is 1.5 (see Table 4.4). However, the results for items with a guessing effect (c = 0.25) were very different for items with difficulty value −1.5 compared to those with difficulty value 1.5. The detection rates for items with difficulty value 1.5 were extremely high (see Table 4.5). This was not unexpected, since the ability of students to guess the correct answer to difficult questions would inadvertently indicate that these items were easier than they actually were. In the case of easy items, the guessing effect was not as pronounced, as most students were able to get the right answer without resorting to guessing.

When the two-parameter probit IRT model was fitted to the data sets, the items with no guessing effect were expected to fit the model correctly. For this reason, small values in Table 4.6 were expected. Consistent with this expectation, the observed type I error rates were all very low (see Table 4.6), especially for the first four discrepancy measures (B, HL, Q1, and G2). For items with a guessing effect, only the items with a high difficulty value (b = 1.5) and large discrimination values (a = 1.2 or 1.5) had acceptable detection rates (see Table 4.7).

4.3.3 Using Partial Posterior Predictive

Bayarri and Berger ([13],[14],[15]) have argued that the posterior predictive distribution is somewhat conservative because it uses the data twice: first to estimate the reference predictive distribution of the discrepancy measure, and second to calculate the observed discrepancy value. This causes the observed discrepancy value to be closer to the reference posterior predictive distribution and yields higher posterior p-values. As an alternative, they proposed the partial posterior predictive distribution. In Section 2.5.4, this predictive distribution was defined as

m(y|yobs\tobs) = ∫ f(y|ξ) π(ξ|yobs\tobs) dξ.

The idea is to use the information in the data that is not in tobs to obtain the reference distribution h(t). One easy way to do this is to divide the data in two: one part for obtaining the reference distribution and the other for computing the p-value.

To see its performance, this partial predictive distribution was used to compute the p-value with the G2 index as the discrepancy measure. These calculations were performed on another 100 artificially altered data sets, similar to those used in Section 4.3.1. Each generated data set was divided into two equal-size data sets: one was used to obtain the posterior sample of G2, while the other was used to compute G2obs and its p-value. The actual calculations were done using the Matlab program gof_item_G_ppp, and a summary of the results is shown in Table 4.8.

From Table 4.8, one can see that this method has an acceptable type I error rate and respectable power in detecting the misfitted items. However, these values are not as good as the type I error rate and power of the posterior predictive distribution shown in Table 4.2. The same can be said about the results of a second simulation done using the two-parameter model (see the third column of Table 4.8). These results are probably due to the reduced sample size used in computing the reference distribution. Using a smaller sample size would increase the variability of the posterior distributions, which would, in turn, increase the variability of the posterior sample of the discrepancy measure.

Item                 G2 (pp1)   G2 (pp2)
1  (a=2, c=0.25)        66         32
2                        7          1
4                        6          1
5  (a=2, c=0)          100          0
6                       10          1
9                        4          0
10 (a=2, c=0.25)        67         25
11                       8          1
14                       6          2
15 (a=0.5, c=0.25)      83          1
16                       1          1
19                       3          0
20 (a=0.5, c=0)         52          0
21                       4          1
24                       7          1
25 (a=0.5, c=0.25)      89          2
26                       6          1
29                       6          1
30 (a=1, c=0.25)        35          1
31                       4          0
34                       5          0
35 (a=1, c=0.25)        37          3

Table 4.8: Percentage of p-values < 0.05 out of 100 using G2 (pp1 and pp2 represent the one-parameter and two-parameter probit models).

Similarly, the observed discrepancy value would also be subject to higher variability. All of this would affect the p-values and would result in an increased type I error rate and a decreased detection rate, as was observed in the simulation study.

4.4 Examinee Fit Analysis

In Chapter 3, the examinee residual and latent residual plots were introduced to detect potential guessers among the examinees. However, searching for potential guessers in a large group of examinees can be very tedious and time consuming. In this section, a method is proposed to narrow down the search and, at the same time, provide a quantitative way to detect these outlying examinees. The idea of the posterior predictive distribution is used to obtain reference distributions for two discrepancy measures designed to capture model misfit in terms of examinee fit.

4.4.1 Discrepancy Measures for Person Fit

In the classical IRT literature, several test statistics have been proposed to check the person fit of IRT models. The best known are those proposed by Wright and Stone (1979), Levine and Rubin (1979), Tatsuoka (1984), and Smith (1985, 1986). In this section, only the W-statistic of Wright and Stone and the log-likelihood statistic of Levine and Rubin are used.

1. W-statistic by Wright and Stone (1979)

The W-statistic is defined as

Wi = Σ_{j=1}^{k} [Yij − pij]² / Σ_{j=1}^{k} pij[1 − pij],   (4.4.1)

where Yij is the response of examinee i to item j, and pij is the probability that examinee i correctly answers item j. In the two-parameter model, pij = Φ(ajθi − bj).

2. Log-likelihood statistic by Levine and Rubin (1979)

Another well-known person-fit statistic is the log-likelihood statistic, defined as

Li = Σ_{j=1}^{k} {Yij log pij + (1 − Yij) log[1 − pij]},   (4.4.2)

where Yij and pij are the same as in the W-statistic.
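Given the fitted success probabilities, both statistics are one line each. A minimal sketch (not the author's exa_w and exa_l programs), where Y is the n × k response matrix and P is an n × k matrix of the pij evaluated at one parameter draw:

    % Minimal sketch: person-fit statistics for all n examinees.
    R = Y - P;                                         % residuals y_ij - p_ij
    W = sum(R.^2, 2) ./ sum(P .* (1 - P), 2);          % equation (4.4.1)
    L = sum(Y .* log(P) + (1 - Y) .* log(1 - P), 2);   % equation (4.4.2)
    % With replicated statistics Wrep and Lrep (m-by-n) computed from m
    % posterior predictive data sets, the p-values of equations (4.4.3)
    % and (4.4.4) below are
    %   pW = mean(Wrep > W', 1);   pL = mean(Lrep < L', 1);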

For a given n × k IRT data set, the Matlab programs exa_w and exa_l compute the values of these two statistics for each examinee. To determine whether the assumed model fits the observed data, the reference distributions of these statistics are needed. However, just like the previous six χ2 indices, the exact distributions of these two statistics are unknown. As before, in the Bayesian framework, this problem is conveniently solved by the availability of the posterior sample of the parameters in the model. In this section, only the posterior predictive distribution is utilized, as the previous sections showed it to have the most desirable properties among the three predictive distributions.

4.4.2 Detecting Guessers using Posterior Predictive

As before, to get a sample from the posterior predictive distribution of these two discrepancy measures, draw m values of ξ = (a1, ..., ak, b1, ..., bk, θ1, ..., θn) from its posterior distribution, π*(ξ|Yobs), to get {ξ(l)}, l = 1, ..., m. Then, for each ξ(l), simulate yrep(l) from f(yrep|ξ(l)). The sets of values W(yrep(l)) and L(yrep(l)) serve as posterior samples from their respective distributions.

To measure the surprise of the observed data for each examinee, the posterior predictive p-value given in equation (4.3.2) can be used. Because the W-statistic measures the deviation of the data from the assumed model through the squared residuals [Yij − pij]², the observed data are considered surprising if W(yobs) is larger than W(yrep(l)). Consequently, its posterior predictive p-value is defined as

PW = (1/m) Σ_{l=1}^{m} I[W(yrep(l)) > W(yobs)].   (4.4.3)

On the other hand, the log-likelihood statistic measures how likely it is that the data came from the assumed model. Hence, the observed data are considered surprising if L(yobs) is smaller than L(yrep(l)). Consequently, its posterior predictive p-value is defined as

PL = (1/m) Σ_{l=1}^{m} I[L(yrep(l)) < L(yobs)].   (4.4.4)

These two posterior predictive p-values are computed by the Matlab programs got_exa_w and got_exa_l. The output of each program is an n × 1 vector of p-values, one for each examinee.

To illustrate this method, an artificial data set of 1000 × 35 responses was generated with 3 guessers. As in Section 3.3.1, the latent ability scores and item difficulty values were drawn from the standard normal distribution, while the item discrimination values were randomly selected from {0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6}. The three randomly selected guessers were made to guess the correct answer to each item in the test with success probability 0.20. In the generated data, examinees 236, 529, and 777 were chosen as the guessers.

This artificial data set was fitted with the two-parameter probit model, and 500 parameter values were drawn from their posterior distributions. These posterior samples were used to produce 500 replicated future data sets. For each replicated data set, the W and L statistics were computed to form approximate distributions of the two discrepancy measures, and the p-value for each examinee was computed using both discrepancy measures.

Using a threshold value of 0.05, the W discrepancy measure detected a total of 17 misfitted examinees, including examinees 236 and 777, with p-values of 0.024 and 0.002, respectively. The estimated ability scores, test raw scores, and posterior p-values (PW) of these 17 examinees are given in Table 4.9.

The examinee residual plots for examinees 236 and 777 are shown in Figure 4.6.

Examinee     θ̂        Raw Score   P-value
68         0.4739        23        0.048
85         0.8146        25        0.034
96         0.1387        20        0.040
236       -0.9155         6        0.024*
241       -0.8261         8        0.038
272        0.6625        24        0.038
484       -0.2868        17        0.012
503        0.0360        16        0.032
639        0.6582        24        0.012
680        0.0132        17        0.006
728        0.0123        18        0.026
777       -0.6283         7        0.002*
869        0.3618        22        0.042
912        0.4180        23        0.016
926        0.9557        27        0.048
940        0.7579        25        0.048
993       -0.5009        12        0.028

Table 4.9: The 17 misfitted examinees with PW < 0.05 (* signifies a guesser).

Figure 4.6: Residual plots of the two guessers, examinees 236 and 777.

Both of their residual plots exhibit large positive residuals for difficult items and large negative residuals for very easy items. As was discussed in Chapter 3, these are the marks of somebody who is guessing on the test. However, the residual plot of examinee 529, shown in Figure 4.7, does not exhibit any positive residuals for difficult items. As such, this examinee was labeled as simply someone with a low ability trait. The estimated ability score of this examinee was

−1.45 and PW = 0.336.

Figure 4.7: Residual plot of examinee 529.

When this procedure was repeated using the log-likelihood discrepancy measure, the results were very similar. Using a threshold value of 0.05, the L discrepancy measure detected a total of 16 misfitted examinees, including examinees 236 and 777, with p-values of 0.002 and 0, respectively. Table 4.10 gives the estimated ability scores, test raw scores, and posterior p-values (PL) of these 16 examinees. Again, examinee 529 was not labeled as a guesser (PL = 0.31).

To see the power of this method in detecting guessers, a simulation study was conducted.

Examinee     θ̂        Raw Score   P-value
85         0.8146        25        0.022
236       -0.9155         6        0.002*
241       -0.8261         8        0.02
272        0.6625        24        0.04
484       -0.2868        17        0.012
639        0.6582        24        0.01
680        0.0132        17        0.022
693        0.4262        22        0.038
728        0.0123        18        0.044
734       -0.4529        15        0.044
777       -0.6283         7        0.000*
829        0.4455        25        0.038
869        0.3618        22        0.034
905        0.6203        25        0.044
912        0.4180        23        0.024
993       -0.5009        12        0.022

Table 4.10: The 16 misfitted examinees with PL < 0.05 (* signifies a guesser).

Examinee    PW    PL
99           6     5
100*        82    85
101          0     0
299          5     3
300*        79    82
301          0     0
599          2     1
600*        82    87
601          4     4
699          4     4
700*        81    82
701          4     4
899          0     1
900*        83    88
901          4     5

Table 4.11: The percentage of PW and PL values < 0.05 (* signifies a guesser).

The process described earlier was repeated 100 times using 100 artificially altered data sets. Each generated data set had 5 guessers: examinees 100, 300, 600, 700, and 900. A summary of the results is given in Table 4.11. From this table, one can see that these procedures have good detection rates of about 80% using the W-statistic and around 85% using the L-statistic. They also have very low false positive rates, as can be seen from the small number of significant p-values for examinees other than the 5 guessers. In fact, the maximum percentage of significant p-values for the 995 non-guessers is 8% for both PW and PL. Figure 4.8 shows the histograms of the percentages of significant PW and PL values for the 995 non-guessers.

Figure 4.8: Histograms of the percentages of significant PW (left) and PL (right) values among the 995 non-guessers.

4.5 Application To Real Data Set

1. Checking for the appropriateness of the one-parameter model.

For the BGSU Mathematics placement exam, the observed standard deviation of the point biserial correlations was 0.0731. The prior predictive distribution remains the same as that shown in Figure 4.1. In this case, the observed std(rpbis) of 0.0731 is well within the prior predictive distribution; the prior p-value of this data set is 0.4426. Therefore, not enough evidence was found to reject the one-parameter probit IRT model using this method.

However, as pointed out in Section 4.2, the posterior predictive distribution has higher precision and better power to detect deviations from the one-parameter model. The histogram of the posterior predictive sample of std(rpbis) is shown in Figure 4.9. This time, the observed std(rpbis) of 0.0731 is located at the extreme right of the posterior predictive distribution, which results in a posterior predictive p-value of 0. Therefore, using this method, enough evidence was found to reject the one-parameter probit IRT model and warrant the use of the two-parameter IRT model.

Figure 4.9: Histogram of 500 simulated values of std(r-pbis).

2. Checking for item fit

To check the fit of the 35 items in this placement exam, the G2 index was used as the discrepancy measure; this was the index that showed the best results in the simulation study of Section 4.3. Using the posterior predictive distribution, posterior predictive p-values were computed for the goodness-of-fit of each item when the one-parameter and two-parameter IRT models were used.

Figure 4.10: The 90% interval bands for the item response curves of items 11 (upper left), 30 (upper right), 33 (lower left), and 34 (lower right) fitted with the one-parameter IRT model.

When the one-parameter model was used, items 11, 30, 33, and 34 had significant p-values. Their 90% probability interval band IRC plots are given in Figure 4.10. Notice the large deviations of the observed proportions of correct responses for each of the four items. To contrast these plots with those of items that the model did not identify as misfits, consider the corresponding plots of items 14 and 15 shown in Figure 4.11. In these plots, the observed proportions are quite close to the expected region, making their fit acceptable.

Figure 4.11: The 90% interval bands for the item response curves of items 14 (left) and 15 (right) fitted with the one-parameter IRT model.

When the two-parameter model was fitted to the BGSU data set, the posterior p-values of the G2 discrepancy measure were not significant for any of the 35 items. This suggests that the two-parameter probit IRT model is acceptable for this particular data set. The Bayesian estimates of the discrimination parameters for the 35 items were mostly between 0.50 and 0.60; items 11, 30, 33, and 34 all have discrimination estimates less than 0.34. This explains the large deviations found in the probability interval band plots in Figure 4.10.

3. Checking for guessers.

To determine whether there were potential guessers among the examinees who took the Math placement test in 2004, the method of Section 4.4 was used. Both discrepancy measures, the W-statistic and the log-likelihood statistic, were employed, and their results were very similar. The W-statistic tagged a total of 27 potential guessers, while the log-likelihood statistic tagged 26 potential guessers. Twenty of these students were found in both groups.

To see the kind of responses these students gave, the latent residual plots of six of the twenty students are given in Figure 4.12. Note that the responses of these six students were very similar. They all got the correct response to the most difficult item (Q30), marked by the right-most positive residual, and to some other hard items, but failed to get the correct responses to some easy items (marked by the negative residuals toward the left side of the plots). The latent residual plots of the other 16 students were very similar to these six plots.

These results illustrate the consistency of the numerical method (using posterior predictive) with the graphical method (using latent residual plots).

However, before any final decision is made about these students, it is best to study their latent residual plots carefully. For example, the latent residual plot of student 991 seems to indicate that this is a good student, as he/she got 29 correct answers out of 35. With this kind of score, it seems more plausible that his/her mistakes were due to carelessness rather than guessing. On the other hand, the latent residual plot of student 18 shows only 12 correct responses out of 35, six of which were not even predicted by the model. In other words, this student was lucky to get the correct answers to these six items, which could be the result of successful guesses.

As a final remark, it is recommended that the posterior predictive method be used in conjunction with the latent residual plots. The numerical method helps narrow down the number of potential guessers, and the latent residual plots can then be used to study the response patterns of those examinees marked as potential guessers.

Figure 4.12: Latent residual plots of six students marked as potential guessers by the W and L statistics using the posterior predictive distribution.

CHAPTER 5 BAYESIAN METHODS FOR IRT MODEL SELECTION

5.1 Introduction

Within the Bayesian camp, some people prefer to work with predictive distributions for model assessment, while others prefer to do model comparison using Bayes factors. One advantage of the predictive distribution over the Bayes factor in assessing model fit is that the assessment can be done without specifying an alternative model. However, people advocating the use of predictive distributions also recognize that a model should not be rejected in the absence of a better alternative. That is why many of those working with predictive distributions recommend their use only for preliminary studies. If strong evidence of model misfit is found using predictive distributions, they recommend a full Bayesian analysis to look for a more appropriate model, and this usually entails the use of Bayes factors in a model selection process.

In Section 5.2, the Bayes factor is reintroduced in greater detail. Because IRT models are quite complex, computing the Bayes factor of two competing IRT models is not straightforward and requires numerical methods to estimate its value. Several of these numerical methods are discussed in this section and are illustrated using the simpler Beta-binomial model. In addition to these numerical methods, the exchangeable IRT model is also used in estimating the Bayes factor of two IRT models. Section 5.3 covers the details of the exchangeable model and how it is used to approximate the one-parameter and two-parameter IRT models. The actual estimation of the Bayes factor that compares the model fit of two IRT models is discussed and performed in Section 5.4. The effectiveness of the Bayes factor in determining the appropriate model for data sets coming from different IRT models is illustrated using simulated data.

In Section 5.5, two methods to detect outlying discrimination parameters are discussed. One uses the Bayes factor to compare an outlier exchangeable model to the ordinary exchangeable model, while the other uses a mixture prior density for the discrimination parameters. The effectiveness of these two methods is illustrated using artificially altered data.

Finally, all these methods are employed on the BGSU Mathematics placement data set in the last section.

5.2 Checking the Beta-Binomial Model using Bayes Factors

To facilitate the introduction of the Bayesian concepts that will be discussed in this chapter, these ideas will first be explained in the context of a simpler model, the Beta-binomial model. To motivate its use, consider the following problem from the game of baseball.

In baseball, people are usually interested in the hitting average of a player. This hitting average is supposed to tell people the likelihood that a particular player will successfully hit the ball. This statistic is basically the proportion of successful hits over the total number of at-bat opportunities.

Let Yi denote the number of successful hits of a particular player out of a total of ni opportunities during the ith season. Assuming the success probability of this player remains constant throughout the ith season, Yi is a random variable that follows a binomial distribution with success probability pi, or Yi ∼ Bin(ni, pi). That is,

Pr(Yi = yi) = [ni! / (yi!(ni − yi)!)] pi^{yi} (1 − pi)^{ni−yi},   yi = 0, 1, ..., ni.   (5.2.1)

The hitting average is essentially an estimate of the success probability pi. If a player plays in many baseball seasons, as they usually do, then one might be interested in learning whether the success probability of the player is pretty much the same over different seasons, or whether it varies greatly from season to season. This question is equivalent to testing the null hypothesis that all the pi's are equal against an alternative hypothesis that assumes that the pi's change over different seasons according to some distribution, such as the beta distribution.

In this section, a Bayesian procedure using the Bayes factor to test the null hypothesis, H0 : p1 = ··· = pn, versus the alternative hypothesis, H1 : pi ∼ Beta(µk, k(1 − µ)), for i = 1, ..., n, where µ is the mean of the Beta distribution, will be discussed. Through a simple simulation study, the effectiveness of this method in checking whether the simple binomial model is sufficient to model the observed data, or whether it is really necessary to use the Beta-binomial model, will also be illustrated.

5.2.1 Beta Binomial Model

The Beta-binomial model essentially assumes that the data, {y1, ..., yn}, come from the binomial distribution, Bin(ni, pi), with each pi coming from a beta distribution, Beta(α, β). That is,

f(pi) = [1/B(α, β)] pi^{α−1} (1 − pi)^{β−1},   0 ≤ pi ≤ 1, α, β > 0,   (5.2.2)

where B(α, β) = Γ(α)Γ(β)/Γ(α + β).

This model corresponds to the alternative hypothesis mentioned earlier. To work within the Bayesian framework, one needs to specify a prior distribution for the hyperparameter µ; the other hyperparameter, k, will be considered a known quantity. Using a simple uniform(0,1) prior for µ, the joint posterior distribution of µ and the success probabilities (under the alternative hypothesis), π*(p1, ..., pn, µ|data, k), is proportional to the product of the likelihood function and the prior density. That is,

π*(p1, ..., pn, µ|data, k) ∝ π(data|p1, ..., pn, µ, k) π0(p1, ..., pn|µ, k) π0(µ|k)
  = ∏_{i=1}^{n} [ni!/(yi!(ni − yi)!)] pi^{yi} (1 − pi)^{ni−yi} × ∏_{i=1}^{n} [1/B(µk, k(1 − µ))] pi^{µk−1} (1 − pi)^{k(1−µ)−1}
  = C × ∏_{i=1}^{n} pi^{αi−1} (1 − pi)^{βi−1} / B(µk, k(1 − µ)),   (5.2.3)

where C = ∏_{i=1}^{n} ni!/(yi!(ni − yi)!), αi = µk + yi, and βi = k(1 − µ) + (ni − yi). Therefore, the normalizing constant is given by

Pr(µ, data, k) = C ∫ ∏_{i=1}^{n} [pi^{αi−1} (1 − pi)^{βi−1} / B(µk, k(1 − µ))] dp = C ∏_{i=1}^{n} B(αi, βi)/B(µk, k(1 − µ)).   (5.2.4)

Dividing (5.2.3) by (5.2.4) yields the posterior distribution

π*(p1, ..., pn|µ, data, k) = ∏_{i=1}^{n} Beta(αi, βi).   (5.2.5)

To find the posterior marginal distribution π*(µ|data, k), note that the joint posterior π*(p1, ..., pn, µ|data, k) can be expressed as

π*(p1, ..., pn, µ|data, k) = π*(p1, ..., pn|µ, data, k) × π*(µ|data, k),   (5.2.6)

where

π*(p1, ..., pn, µ|data, k) ∝ ∏_{i=1}^{n} pi^{αi−1} (1 − pi)^{βi−1} / B(µk, k(1 − µ)).   (5.2.7)

Therefore, from (5.2.5), (5.2.6), and (5.2.7),

π*(µ|data, k) ∝ ∏_{i=1}^{n} B(αi, βi)/B(µk, k(1 − µ)).   (5.2.8)

To make inferences about the success probabilities p1, ..., pn, one could simulate values from their joint posterior distribution by simulating a value of µ from π*(µ|data, k) to get µ(l), and then simulating each pi(l) from Beta(µ(l)k + yi, k(1 − µ(l)) + (ni − yi)). To get a sample of size m, simply repeat this process m more times after the convergence of the MCMC algorithm. Any inference about these parameters can be made using the obtained sample.

Similarly, under the null hypothesis of equal success probabilities, p, and using a vague uniform(0,1) prior density for this parameter, the posterior distribution π*(p|data, H0) can be shown to follow the beta distribution Beta(Σyi + 1, Σ(ni − yi) + 1).

5.2.2 Bayes Factor

Given two competing models, H0 and H1, the Bayes factor is defined as the ratio of the marginal likelihoods of the two models. That is,

BF10 = Pr(Data|H1) / Pr(Data|H0),   (5.2.9)

where

Pr(Data|Hj) = ∫ Pr(Data|ξ, Hj) Pr(ξ|Hj) dξ.

In our case, ξ = (p1, ..., pn, µ) and

Pr(Data|H1) = ∫∫ Pr(Data|p, µ) π0(p, µ) dp dµ,

where

Pr(Data|p, µ) = C ∏_{i=1}^{n} pi^{yi} (1 − pi)^{ni−yi},   C = ∏_{i=1}^{n} ni!/(yi!(ni − yi)!),

and

π0(p, µ) = π0(p|µ) π0(µ) = ∏_{i=1}^{n} [1/B(µk, k(1 − µ))] pi^{µk−1} (1 − pi)^{k(1−µ)−1}.

Hence,

Pr(Data|H1) = C ∫ [∏_{i=1}^{n} ∫ pi^{αi−1} (1 − pi)^{βi−1} dpi / B(µk, k(1 − µ))] dµ = C ∫ ∏_{i=1}^{n} B(αi, βi)/B(µk, k(1 − µ)) dµ,   (5.2.10)

where αi = µk + yi and βi = k(1 − µ) + (ni − yi).

Under the null hypothesis, where p1 = ··· = pn = p, the model is much simpler and the denominator of the Bayes factor becomes

Pr(Data|H0) = C ∫ ∏_{i=1}^{n} p^{yi} (1 − p)^{ni−yi} dp = C ∫ p^{Σyi} (1 − p)^{Σ(ni−yi)} dp = C × B(Σyi + 1, Σ(ni − yi) + 1).   (5.2.11)

Dividing (5.2.10) by (5.2.11) yields the Bayes factor,

BF10 = [1/B(Σyi + 1, Σ(ni − yi) + 1)] ∫_0^1 ∏_{i=1}^{n} [B(αi, βi)/B(µk, k(1 − µ))] dµ.   (5.2.12)

5.2.3 Laplace Method for Integration

To compute the Bayes factor (5.2.12) obtained in the previous section, one needs to evaluate the integral in its numerator. Doing this analytically is a challenging undertaking, as the integrand is a complicated expression involving products of gamma functions. In this section, a method to approximate this integral, the Laplace method, is presented.

Consider the problem of integrating

m = ∫ f(θ) dθ.

The idea of the Laplace method is to express the integrand as

f(θ) = exp(log(f(θ))) = exp(l(θ)),

and then, using a Taylor series expansion about the mode θ̂ (where l′(θ̂) = 0, so the linear term vanishes), approximate

log(f(θ)) ≈ log(f(θ̂)) + (1/2) l″(θ̂)(θ − θ̂)².

Consequently,

f(θ) ≈ f(θ̂) exp[(1/2) l″(θ̂)(θ − θ̂)²].   (5.2.13)

Thus,

∫ f(θ) dθ ≈ f(θ̂) ∫ exp[(1/2) l″(θ̂)(θ − θ̂)²] dθ = f(θ̂) √(2π/(−l″(θ̂))).   (5.2.14)

In the context of Bayesian inference, it is often necessary to find the marginal likelihood,

m(D) = ∫ L(D|θ) π0(θ) dθ,   (5.2.15)

as was seen in the computation of the Bayes factor. This marginal likelihood is also the normalizing constant in the denominator of the posterior distribution,

π*(θ|D) = L(D|θ) π0(θ) / m(D).   (5.2.16)

In an effort to approximate this marginal likelihood, DiCiccio, Kass, Raftery, and Wasserman (1997) obtained the Laplace approximation of the marginal likelihood by using the idea that the posterior distribution converges to the normal distribution as the sample size increases. Replacing the posterior distribution by the normal distribution, they obtained

m(D) ≈ L(D|θ̂) π0(θ̂) / φ(θ̂; θ̂, Σ̂) = (2π)^{p/2} |Σ̂|^{1/2} L(D|θ̂) π0(θ̂),   (5.2.17)

where −Σ̂ is the inverse of the Hessian matrix of the log-posterior evaluated at θ̂. Note that expression (5.2.17) is the multivariate version of expression (5.2.14), with f(θ̂) = L(D|θ̂) π0(θ̂).

Albert's Matlab program laplace.m (see Appendix B.4) calculates an approximate value of log(m(D)) using the Laplace method just described. To use this program, one needs to define the log of the integrand, l(θ) = log(f(θ)), as a Matlab function.

5.2.4 Estimating the Bayes Factor

Using the Laplace method, the value of the integral in the numerator of the Bayes factor can now be approximated. To facilitate the actual numerical calculations, log(BF10) was used. That is,

log(BF10) = log(∫_0^1 ∏_{i=1}^{n} [B(αi, βi)/B(µk, k(1 − µ))] dµ) − log(B(Σyi + 1, Σ(ni − yi) + 1)).

To estimate the first term, one can use Albert's laplace.m in conjunction with the Matlab function of the log-integrand, logpost_beta.m (see appendix). In our case, this log-integrand is given by

l(θ) = Σ_{i=1}^{n} {log B(αi, βi) − log B(µk, k(1 − µ))}.

To make it compatible with the program laplace.m, the parameter µ had to be transformed to θ = log(µ/(1 − µ)), so that the integral runs from −∞ to ∞.
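To make these steps concrete, here is a minimal one-dimensional version of the computation (our own sketch, not Albert's laplace.m or logpost_beta.m; the second derivative is approximated numerically, and the Jacobian of µ = e^θ/(1 + e^θ) is folded into the log-integrand):

    % Minimal sketch: Laplace approximation, eq. (5.2.14), to the log of
    % the integral in the numerator of (5.2.12).
    function logm = laplace_betabin(y, n, k)
      logf  = @(th) logint(th, y, n, k);
      thhat = fminsearch(@(th) -logf(th), 0);          % mode of l(theta)
      h     = 1e-4;                                    % numerical l''(thhat)
      d2    = (logf(thhat+h) - 2*logf(thhat) + logf(thhat-h)) / h^2;
      logm  = logf(thhat) + 0.5*log(2*pi) - 0.5*log(-d2);
    end

    function l = logint(th, y, n, k)                   % l(theta) + log-Jacobian
      mu = 1 / (1 + exp(-th));
      l  = sum(betaln(mu*k + y, k*(1 - mu) + (n - y))) ...
           - numel(y)*betaln(mu*k, k*(1 - mu)) + log(mu) + log(1 - mu);
    end
    % Since the constant C cancels in (5.2.12), the log Bayes factor is
    %   logBF10 = laplace_betabin(y, n, k) - betaln(sum(y)+1, sum(n-y)+1);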

To illustrate this computation of the Bayes factor, a sample of 20 binomial observations were generated from the Beta-binomial model. In particular, 20 success probabilities, pi, were generated from the Beta(5,2) distribution and 20 sample sizes, ni, were generated from

N(30,5) and rounded up to the next integer. Then, for each pair (ni, pi), i = 1,..., 20, yi was generated from the Bin(ni, pi). One such sample is given in Table 5.1.

obs 1 2 3 4 5 6 7 8 9 10

yi 18 33 23 20 17 15 19 22 24 29

ni 36 36 34 35 31 33 38 32 32 32

pi 0.54 0.84 0.74 0.60 0.47 0.50 0.47 0.76 0.69 0.92 obs 11 12 13 14 15 16 17 18 19 20

yi 22 24 21 15 24 21 25 33 30 30

ni 32 33 36 39 40 31 34 35 40 42

pi 0.72 0.77 0.69 0.35 0.56 0.79 0.58 0.94 0.62 0.72 Table 5.1: Twenty simulated observations from Beta-binomial model.

Applying all the methods that were just discussed in this section on the data set shown in Table 5.1, we obtained

log(BF10) ≈ (−435.39) − (−450.90) ≈ 15.5

(15.5) Hence, BF10 ≈ e , which is pretty large. This is a very strong evidence that the observed data is so much more likely to have come from the alternative model (Beta-binomial model) than from the null model (Simple binomial model). This is what we were expecting since the data set that we used was really generated from the Beta-binomial model.

To see how this procedure performs to observations coming from the simple binomial model, another data set of 20 binomial observations were generated. But this time, all the 110

20 observations were generated from Bin(ni, pi = 0.7). One such sample is shown in Table

5.2.

obs 1 2 3 4 5 6 7 8 9 10

yi 31 26 20 27 24 20 24 25 23 26

ni 36 36 34 35 31 33 38 32 32 32 obs 11 12 13 14 15 16 17 18 19 20

yi 22 26 30 28 36 21 22 21 30 28

ni 32 33 36 39 40 31 34 35 40 42 Table 5.2: Twenty generated binomial observations.

Performing the same procedures to this new data set, we obtained

log(BF10) ≈ (−417.60) − (−413.74) = −3.86.

(−3.86) Hence, BF10 ≈ e ≈ 0.0211, which is close to zero. This is a clear evidence that the observed data is so much more likely to have come from the null model (Simple binomial model) than from the Beta-binomial model. Again, this is what we were expecting since the data set that we used was really generated from the simple binomial model.

To see how consistent the results of this procedure, these calculations were repeated to a total of 100 simulated samples from each of the two models - simple binomial model and the

Beta-binomial model. Then, this simulation was repeated three times for 3 different sample sizes, N = 10, 20, 30. This simulation was carried out by the Matlab program sim betabin.m

(See Appendix B.4). A summary of the results is shown in Table 5.3.

Kass and Raftery (1995) provided the following categories, for values of Bayes factors or log base 10 of Bayes factors, that can serve as a rough descriptive statement about the standards of evidence in scientific investigation as shown in Table 5.4. 111

Sample Sizes N = 10 N = 20 N =30 Beta-binomial (-1.52, 19.18) (-0.42, 25.90) (1.18, 33.56) Simple binomial (-3.54, 0.89) (-6.31, -0.87) (-9.06, -0.93)

Table 5.3: Range of values of log10 BF .

log10 BF10 BF10 Evidence against H0 0 to 0.5 1 to 3.2 Not worth more than a bare mention 0.5 to 1 3.2 to 10 Substantial 1 to 2 10 to 100 Strong 2 100 Decisive

Table 5.4: Levels of evidence by log10 BF .

For data sets coming from the Beta-binomial model, even with the small sample size of

N = 10, only six negative values for log10 BF were observed, and only four of which were less than −0.5. These negative values of log10 BF were mainly due to similar simulated success probabilities, which causes the Beta-binomial model to behave similar to a simple binomial model. When the sample size is increased to N = 20, only one negative value was observed and when a sample size of N = 30 was used, the values of log10 BF were all bigger than one. These results showed that this method picked the right model for the Beta-binomial observed data at a very high rate.

Similarly, for data sets coming from the simple binomial model, only three positive

values was observed for log10 BF and only two of which were greater than 0.5 when the

sample size was N = 10. For sample sizes of N = 20 and N = 30, all the values of log10 BF

were less than −0.5. Again, these results showed that the method has the ability to pick the right model for the simple binomial observed data at a very high rate. 112

5.2.5 Application to Real Data

To illustrate this method on a real data set, consider the hitting data of Barry Bonds

from 1986 to 2005 (shown in Table 5.5).

Year 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995

Hits (yi) 92 144 152 144 156 149 147 181 122 149

At Bats (ni) 413 551 538 580 519 510 473 539 391 506 Year 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005

Hits (yi) 159 155 167 93 147 156 149 133 135 12

At Bats (ni) 517 532 552 355 480 476 403 390 373 42 Table 5.5: Barry Bonds’ hitting data from 1986 to 2005.

Performing all the procedures described in this section, we obtained

log(BF10) ≈ −12.48.

(−12.48) −6 Hence, BF10 ≈ e ≈ 3.8019 × 10 , (or log10 BF10 ≈ −5.42), which is very close to zero. This result indicates that Barry Bonds’ hitting average was pretty consistent through- out his 20 years of professional career in baseball.

We have seen earlier that under the assumption that p1 = ··· = p20 = p, p ∼ Beta(Σyi +

1, Σ(ni −yi)+1). Therefore, an estimate of p is given by the mean of this distribution, which Σ(y ) + 1 isp ˆ = i = 0.30. Σ(ni) + 2

5.2.6 Approximating the Denominator of the Bayes Factor

In this particular case, we were fortunate that the denominator was analytically tractable

and the integration can be done analytically to get (5.2.11). However, in many other more

complicated models, this is not the case. In those instances one needs to approximate the 113

denominator in a much similar way as one approximates the numerator. This is where the

other parameter, k, of the Beta(µk, k(1 − µ)) distribution, can be useful.

Recall that the variance of p ∼ Beta(µk, k(1 − µ)) is,

(µk)k(1 − µ) µ(1 − µ) V ar(p) = = . (5.2.18) k2(k + 1) k + 1

Note that as k → ∞, this variance goes to 0. In other words, as k → ∞, the Beta-binomial

model converges to the simple Binomial model. Therefore, for large values of k, the Beta-

binomial model will be very close to the simple Binomial model. Thus, to get an approximate

value of the denominator of the Bayes factor, one could use the same procedure as was used

to approximate the numerator of the Bayes factor. The only difference is the value of k.

To see how well this procedure approximates the value of the denominator of log BF ,

log (B (Σyi, Σ(ni − yi))), 100 simulated data sets from each of the two models − Beta- binomial model and simple binomial model − were generated. For each data set, the exact Z 1 Yn B(α , β ) value of log(B (Σy , Σ(n − y ))) and its approximate value, log i i dµ, i i i B(µk, k(1 − µ)) 0 i=1 with k = 1000, were computed . The results are summarized in the scatterplots shown below in Figure 5.1.

Note how close the approximate values were to the exact values. The left scatterplot of

Figure 5.1 used data sets coming from the Beta-binomial model. As such, the approximate

value was expected to be slightly higher than the exact value because the approximation was

done through the assumption of a Beta-binomial model (albeit a very narrow Beta-binomial

model). This explains why the approximate values were consistently slightly higher than the

exact value. This behavior is not found in the right scatterplot because the data sets came

from the simple Binomial model. 114

Data from Beta−Binomial Model Data From Simple Binomial Model −340 −50

−100 −360

−150

−380 −200

−400 −250

−300 −420

−350 Approximate log denominator of BF Approximate log denominator of BF −440

−400

−460 −450

−480 −500 −480 −460 −440 −420 −400 −380 −360 −340 −500 −450 −400 −350 −300 −250 −200 −150 −100 −50 Exact log denominator of BF Exact log denominator of BF

Figure 5.1: Scatterplots of the exact values versus the approximate values of the log- denominator of the Bayes factor.

Using this method on the Barry bonds data set, we obtained

log(BF10) ≈ −14.08.

This value is close to the one obtained earlier of −12.48, indicating again the accuracy of the approximation.

This idea of approximating a simpler model by fixing the value of a parameter in a more complicated model will be utilized later.

5.2.7 Using Importance Sampling

One more effective method to approximate the Bayes factor is through Importance Sam- pling. This is a Monte Carlo method used to estimate an integral. Consider the following integration problem, Z I = f(x)dx.

The idea of importance sampling is to introduce a new density function g(x) into the 115 integral to get Z · ¸ f(x) f(x) I = g(x)dx = IE . (5.2.19) g(x) g g(x)

Preferably, g(x) is a density that can be easily simulated from, with similar support set as f(x), and for which f(x)/g(x) is bounded. Consequently, using a sample {x1, . . . , xm} drawn from the density g(x), the in (5.2.19) can be approximated by

1 Xm f(x ) Iˆ = i . (5.2.20) m g(x ) i=1 i

This estimator enjoys many good frequentist properties such as unbiasedness and strong consistency [29]. That is, Iˆ → I as m → ∞ with probability 1. Since m is the simulation size, it can be made as large as desired to yield an almost error-free estimate.

In Section 5.1.5, it was shown that the simple binomial model can be approximated by the Beta-binomial model using a large k value. In this case, the Bayes factor (5.2.12) can be approximated by R Q n B(µk1+yi,k1(1−µ)+(ni−yi)) dµ ˆ i=1 B(µk1,k1(1−µ)) BF 10 = R Q . n B(µk0+yi,k0(1−µ)+(ni−yi)) dµ i=1 B(µk0,k0(1−µ)) 12 ˆ with a large value for k0. In the actual calculations, k0 was set to e . To compute this BF 10 using importance sampling, first note that Q Z n B(µk1+yi,k1(1−µ)+(ni−yi)) ˆ i=1 B(µk1,k1(1−µ)) ∗ BF 10 = Q × π (µ|data, k0)dµ. n B(µk0+yi,k0(1−µ)+(ni−yi)) i=1 B(µk0,k0(1−µ)) where, Q n B(µk0+yi,k0(1−µ)+(ni−yi)) ∗ i=1 B(µk0,k0(1−µ)) π (µ|data, k0) = R Q , (5.2.21) n B(µk0+yi,k0(1−µ)+(ni−yi)) dµ i=1 B(µk0,k0(1−µ)) is the marginal posterior distribution of µ given in (5.2.8).

ˆ Hence, to obtain the importance sampling estimate of BF 10, one would need to simulate values of the parameter µ from its marginal posterior distribution. But from (5.2.21), one 116

could see that this is not one of the common standard distributions that could be simulated

directly. To get the needed simulated sample, the Metropolis-Hastings MCMC method was

employed (See Appendix A.2.1 for more details on the M-H algorithm). The Matlab program

post sim.m (See Appendix B.4) performs this M-H algorithm to simulate any desired sample

size from the posterior distribution of µ.

Once the simulated sample {µ1, . . . , µm} is obtained, the importance sampling estimate

ˆ of BF 10 is given by Q m n B(µj k1+yi,k1(1−µj )+(ni−yi)) 1 X i=1 ˆ B(µj k1,k1(1−µj )) BF 10 ≈ Q . (5.2.22) m n B(µj k0+yi,k0(1−µj )+(ni−yi)) j=1 i=1 B(µj k0,k0(1−µj ))

Finally, note that BF01 = 1/BF10. This implies that one could also use a simulated

∗ sample from π (µ|data, k1) and get Q m n B(µj k0+yi,k0(1−µj )+(ni−yi)) 1 X i=1 ˆ B(µj k0,k0(1−µj )) BF 01 ≈ Q , (5.2.23) m n B(µj k1+yi,k1(1−µj )+(ni−yi)) j=1 i=1 B(µj k1,k1(1−µj )) ˆ ˆ and the value of log(BF 10) = −log(BF 01).

∗ Simulating from π (µ|data, k1) has the advantage of it being a more diffused distribu-

∗ tion compared to π (µ|data, k0) since k1 < k0. This can be seen from (5.2.18), where the

variance of the beta distribution decreases as k increases. Simulating from the more diffused

distribution will guarantee the boundedness of the integrand term, f(x)/g(x) in (5.2.19).

Using this method on the Barry bonds data set, we obtained

log(BF10) ≈ −14.05.

This value is very close to the previous estimate of −14.08 and close to the first estimate of

−12.48, which was obtained using the Laplace method with exact denominator. Again, this indicates the accuracy of the approximation. 117

5.3 Exchangeable IRT Model

As mentioned in Section 1.3.4, the Exchangeable IRT model is a compromise between the

one-parameter IRT model and the two-parameter model. It assumes that the item discrim-

ination parameters {a1, . . . , ak} comes from a single distribution, say a normal distribution with mean µa and standard deviation sa. Assigning a uniform prior on µa and an inverse-

2 gamma density with parameters ν1 and ν2 on sa, the joint prior density for the discrimination

2 parameters {a1, . . . , ak} and hyperparameters (µa, sa) can be written

Yk 2 2 −ν1−1 2 2 π0(a, µa, sa) = (sa) exp{−ν2/sa} φ(aj; µa, sa). (5.3.1) j=1

Combining this hierarchical prior structure with the previous IRT assumptions described

in Chapter 1, the joint posterior distribution of (θ, a, b, µa, sa) is now proportional to

Yn Yk ∗ 2 π (θ, a, b, µa, sa|data) ∝ L(θ, a, b) φ(θi; 0, 1) φ(bj; 0, sb) × π0(a, µa, sa), i=1 j=1

where L(θ, a, b) is the same as (1.4.1).

As before, the same latent data Z = (Z11,...,Znk) are introduced to give a joint poste-

rior distribution

Yn Yk ∗ π (θ, Z, a, b, µa, sa|data) ∝ [φ(Zij; mij, 1)I(Zij, yij)] i=1 j=1 Yn Yk 2 × φ(θi; 0, 1) φ(bj; 0, sb)π0(a, µa, sa), (5.3.2) i=1 j=1

where I(z, y) is equal to 1 when {z > 0, y = 1} or {z < 0, y = 0}, and equal to 0 otherwise.

To reflect the changes made by this hierarchical structure in the Gibbs sampler, Albert

modified his algorithm to include two more steps. The first one is to update µa using a normal density with mean equal to the sample mean of all the item discrimination parameters aj 118

2 2 and variance sa/k. The second is to update sa using an inverse-gamma distribution with

P 2 shape parameter k/2 + ν1 and scale parameter ν2 + (aj − µa) /2. Also, in the original step of updating the item parameters, the fixed prior mean and standard deviation of aj were

2 replaced by µa and sa.

The Matlab program pphn bay.m (See appendix B.4) simulates values of the different parameters in this exchangeable model from their respective posterior distributions. Again, the means of these simulated values can serve as the estimates of the model parameters.

To see how well this program estimates the parameters, a data set of 1000 × 35 re- sponses was generated. The parameters used to generate the data set were in turn generated according to the exchangeable model. That is, θi, for i = 1, . . . , n, and bj, for j = 1, . . . , k,

2 were all drawn from N(0, 1), while aj, for j = 1, . . . , k, were drawn from N(1, 0.25 ). Then, the parameters were estimated using the samples drawn from their posterior distributions obtained with the help of pphn bay.m.

Using Exchangeable Model Using Exchangeable Model 2.5 3

2

2 1.5

1 1

0.5

0 0

−0.5 Estimated Item Difficulty Estimated Ability Score −1 −1

−1.5 −2 −2

−2.5 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −3 −3 −2 −1 0 1 2 3 4 Actual Item Difficulty Actual Ability Score

Figure 5.2: Parameter estimates obtained using the exchangeable model is compared with the actual values: (left) Item difficulty, and (right) Ability scores. 119

Figure 5.2 illustrates the accuracy of the estimates. The correlation coefficient of the

actual item difficulty and their corresponding estimated item difficulty was 0.9969, while

the correlation coefficient of the actual ability scores and their corresponding estimates was

0.9659.

The Bayesian estimate for µa was 1.0626. Recall that the actual value of µa was 1. The slightly higher value of the estimate was due to the fact that the average value of the actual item discriminations was 1.0466, which in turn resulted an average of 1.0625 for all the item discrimination estimates.

Using this exchangeable model to the BGSU Math placement data set, the results

obtained were very similar to the ones obtained earlier using the Two-parameter IRT model.

Figure 5.3 displays some plots that look very similar to plots in Figure 1.11 and Figure 1.12.

Using Exchangeable Model Using Exchangeable Model 35

1200 21 2 3 30 14 9 2031 128 1000 2418 136 25 5 1 11 727 800 2229 20 19 4 3217 351510 1623 15 600 33 25

26 Number of Correct Items 3428 10 400 No. of Examinees With Correct Responses

5 200 30

0 −2 −1.5 −1 −0.5 0 0.5 1 −4 −3 −2 −1 0 1 2 3 Bayesian Estimates of Item Difficulty Bayesian Ability Score Estimates

Figure 5.3: Item Parameter and Ability scores estimates obtained using the exchangeable model is compared with the observed data: (left) Item difficulty vs. No. of correct students, and (right) Ability scores vs. Students raw scores.

To see the effect of the exchangeable prior on the posterior estimates of the discrimina-

tion parameters, the scatterplot in Figure 5.4 compares these estimates with the estimates 120

1

0.9

0.8

0.7

0.6

0.5

Exchangeable Discrimination Estimates 0.4

0.3

0.2 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Two−parameter Discrimination Estimates

Figure 5.4: Scatterplot of the discrimination estimates obtained using the Exchangeable model and the Two-parameter model.

obtained using the Two-parameter model. Note the slight shrinkage of the estimates ob-

tained by the exchangeable model. The smallness of the shrinkage is due to the fact that

the estimates obtained using the Two-parameter model were not very varied.

5.3.1 Approximating the One-parameter model

As a simpler version of the above exchangeable model, one could also fix the hyper-

2 parameter sa just like the k was fixed in the Beta-binomial model. Using a uniform prior

for µa, the Gibbs sampler can be easily modified to run this simpler exchangeable model

2 by removing the step that updates sa and keep on using its fixed value. The Matlab pro- gram pphn2 bay.m simulates values of the parameters in this model from their posterior

distributions.

Applying this simpler exchangeable model with sa = 0.25 to the BGSU Math placement data set, we obtained very similar results as those obtained using the exchangeable model 121 with random sa. Figure 5.5 shows how close the estimates are. The correlation coefficients of the estimates obtained from the two exchangeable models for item difficulty and ability scores are 0.9999 and 0.9988, respectively.

Correlation Coefficient = 0.9988 Correlation Coefficient = 0.9999 3 1.5

2 1

1 0.5 =0.25 =0.25 a a

0 0

−0.5 −1

−1 Ability estimates using fixed s −2 Difficulty estimates using fixed s

−1.5 −3

−2 −4 −2 −1.5 −1 −0.5 0 0.5 1 1.5 −4 −3 −2 −1 0 1 2 3 Difficutly estimates using random s Ability estimates using random s a a

Figure 5.5: Estimates obtained using the two exchangeable models (one with random sa and one with fixed sa = 0.25) are compared: (left) Item difficulty; (right) Ability scores.

2 Because the discrimination parameters aj are assumed to have come from N(µa, sa) in

2 the exchangeable model, then when sa takes on a very small value, this exchangeable model will essentially work as a One-parameter model. In other words, one could approximate the One-parameter model with the exchangeable model using a small fixed value of sa, say sa = 0.01.

To check the accuracy of the approximation, a 1000 × 35 data set of responses coming from the One-parameter model was generated. The One-parameter model and the exchange- able model (with sa = 0.01) were used to get estimates for the item difficulty parameters and for the ability scores of examinees. Figure 5.6 displays the closeness of their estimates.

The correlation coefficients of the estimates obtained using the One-parameter model and 122

Correlation Coefficient = 0.9992 Correlation Coefficient = 1 3 2.5

2 2 =0.01) 1.5 a =0.01) a

1 1

0.5

0 0

−0.5 −1 −1

−1.5 −2 Difficulty estimates using Exchangeable model (s Ability score estimates using Exchangeable model (s −2

−2.5 −3 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −3 −2 −1 0 1 2 3 Difficutly estimates using One−parameter model Ability score estimates using One−parameter model

Figure 5.6: Estimates obtained using the One-parameter model and the exchangeable model with fixed sa = 0.01 are compared: (left) Item difficulty; (right) Item discrimination.

the exchangeable model with fixed sa = 0.01 for item difficulty and ability scores were 1 and

0.9992, respectively. This very high correlation signifies the accuracy of the approximation of the exchangeable model with a small fixed value for sa for the One-parameter model.

5.3.2 Approximating the Two-parameter model

On the other hand, when sa takes on a large value, the exchangeable model behaves very

similar to the Two-parameter model. In other words, it is also possible to approximate the

Two-parameter model using the exchangeable model by using a large fixed value for sa, say sa = 10.

To assess the accuracy of the approximation, a 1000 × 35 data set of responses coming

from the Two-parameter model was generated. The Two-parameter model and the exchange-

able model (with sa = 10) were used to get estimates for the item difficulty parameters and

for the ability scores of examinees. Figure 5.7 and 5.8 display the closeness of their estimates. 123

Correlation Coefficient = 0.9999 Correlation Coefficient = 0.9991 2.5 1.8

2 1.6 =10) a

=10) 1.5

a 1.4

1 1.2

0.5 1

0 0.8 −0.5

0.6 −1

0.4 −1.5 Difficulty estimates using Exchangeable model (s Discrimination estimates using Exchangeable model (s −2 0.2

−2.5 0 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 Difficulty estimates using Two−parameter model Discrimination estimates using Two−parameter model

Figure 5.7: Estimates obtained using the One-parameter model and the exchangeable model with fixed sa = 10 are compared: (left) Item difficulty; (right) Ability scores.

Correlation Coefficient = 0.9988 3

2 =10) a

1

0

−1

−2 Ability score estimates using Exchangeable model (s

−3 −3 −2 −1 0 1 2 3 Ability score estimates using Two−parameter model

Figure 5.8: Scatterplot of estimates of ability scores obtained using the One-parameter model and the exchangeable model with fixed sa = 10. 124

The correlation coefficients of the estimates obtained using the Two-parameter model and the exchangeable model with fixed sa = 10 for ability score, item difficulty and discrimi- nation parameters, were 0.9988, 0.9999 and 0.9991, respectively. Again, these illustrate the accuracy of the approximation of the exchangeable model with a large fixed value for sa for the Two-parameter model.

The importance of this relationships between the exchangeable model and the One- parameter and Two-parameter model is that it allows the comparison of these two latter models using Bayes factor and allows the estimation of the Bayes factor using importance sampling.

5.4 IRT Model Comparisons and Model Selection

In Chapter 4, Bayesian methods to determine whether a certain IRT model fits the data well enough were discussed using predictive distributions. However, these methods are not useful in model comparisons. If there are several alternative models available, these predictive methods do not have the capacity to compare the fit of the competing models. Also, as mentioned in the introduction of this chapter, those working with predictive distributions recommend the use of a full Bayesian analysis whenever the predictive method finds strong evidence of substantial model misfit. This full Bayesian analysis requires the specification of alternative models and a comparative study between these models using Bayes factor. In this section, Bayes factor will be utilized to determine which among the different IRT models is the best one to use for the data. 125

5.4.1 Computing the Bayes Factor for IRT models

In general terms, the Bayes factor for model H1 versus model H0 can be expressed as R Pr(data|H ) Pr(data|ξ,H )Pr(ξ|H )dξ BF = 1 = R 1 1 10 Pr(data|H ) Pr(data|ξ,H )Pr(ξ|H )dξ Z 0 0 0 Pr(data|ξ,H )Pr(ξ|H ) = 1 1 × π∗(ξ|data, H )dξ Pr(data|ξ,H )Pr(ξ|H ) 0 · 0 0 ¸ Pr(data|ξ,H1)Pr(ξ|H1) = IEπ∗ (5.4.1) Pr(data|ξ,H0)Pr(ξ|H0) where,

∗ Pr(data|ξ,H0)Pr(ξ|H0) π (ξ|data, H0) = R , (5.4.2) Pr(data|ξ,H0)Pr(ξ|H0)dξ is the posterior distribution of ξ under the model H0. Hence, using a sample {ξ1,..., ξm}

∗ from the posterior distribution π (ξ|data, H0), the importance sampling estimate of BF10 is given by 1 Xm Pr(data|ξ,H )Pr(ξ|H ) BFˆ = 1 1 . (5.4.3) 10 m Pr(data|ξ,H )Pr(ξ|H ) i=1 0 0

Note that for (5.4.1) to be true, it is necessary for the parameter space of model H1 and

H0 to be the same. This is the main reason for the need to approximate the Two-parameter and One-parameter model with the exchangeable model.

In the IRT case, when the exchangeable model with a fixed sa is used, the Bayes factor is given by

R Qn Qk L(θ, a, b) i=1 φ(θi; 0, 1) j=1 φ(bj; 0, sb)φ(aj; µa, s1)dξ BF10 = R Qn Qk (5.4.4) L(θ, a, b) φ(θi; 0, 1) φ(bj; 0, sb)φ(aj; µa, s0)dξ " i=1 j=1 # Qn Qk L(θ, a, b) i=1 φ(θi; 0, 1) j=1 φ(bj; 0, sb)φ(aj; µa, s1) = IE ∗ , (5.4.5) π0 Qn Qk L(θ, a, b) i=1 φ(θi; 0, 1) j=1 φ(bj; 0, sb)φ(aj; µa, s0) where the expectation is taken with respect to the joint posterior distribution of the para-

∗ meters of the model in the denominator π0. Therefore, using the samples obtained during 126

the estimation procedure of the IRT model (in the denominator) of its parameters, the

importance sampling estimate of BF10 can be computed as Q Q Xm L(θ, a, b) n φ(θ ; 0, 1) k φ(b ; 0, s )φ(a ; µ , s ) ˆ 1 i=1 i j=1 j b j a 1 BF 10 = Qn Qk m L(θ, a, b) φ(θi; 0, 1) φ(bj; 0, sb)φ(aj; µa, s0) l=1 ( i=1 j=1 ) 1 Xm Xk Xk = exp log φ(a ; µ , s ) − log φ(a ; µ , s ) (5.4.6) m j a 1 j a 0 l=1 j=1 j=1

5.4.2 IRT Model Comparison

Because the Two-parameter and the One-parameter models can be approximated using

the exchangeable model with an appropriate fixed value of sa, then it is possible to compare

any two of these three IRT models using (5.3.6). To see how well this method works with

different IRT data, a simulation study is conducted for data sets coming from the three IRT

models.

1. Using Exchangeable Data

To generate data from the exchangeable model, the discrimination parameters aj were

drawn from N(µa = 1, sa = 0.25). The latent ability scores θi and the difficulty parameters

bj were drawn from the standard normal distribution.

Knowing the actual model that generated the data set will allow us to see whether

the Bayes factor (5.3.5) correctly picked the right model. To compare the fit of the Two-

parameter model versus that of the exchangeable model to this data set, set s0 = 10 and s1 = 0.25. In other words, the sample drawn from the posterior distribution of the para- meters under the (approximate) Two-parameter model is used to calculate the importance sampling estimate (5.4.6). The (approximate) Two-parameter model was chosen to be in the denominator of the Bayes factor because it is a more diffused distribution compared to 127

the exchangeable model, and therefore, would yield a more stable result.

The Matlab program comp bf one exch.m calculates this estimate using the posterior

sample that was obtained when the (approximate) Two-parameter model were fitted (using

pphn2 bay.m) to the data set. The value of the log10BF that was obtained for the generated

exchangeable model was 47.13. This is consistent to expectation since the model in the

numerator is the correct exchangeable model. To see the variability of this estimate, this

process was repeated 100 times. A summary histogram of the 100 values of the log10BF is

given in Figure 5.9. The values range from 39 to 51, with a mean of 45.9.

Exchangeable Data Sets Exchangeable Data Sets 25 25

20 20

15 15

10 10

5 5

0 38 40 42 44 46 48 50 52 0 −250 −200 −150 −100 −50 0 Log BF (Exchangeable vs Two−parameter) log BF (Exchangeable vs One−parameter) 10 10

Figure 5.9: Histogram of 100 log10BF of Exchangeable model (sa = 0.25) vs. (left) Two- parameter model and (right) One-parameter model.

To compare the fit of the One-parameter model versus that of the exchangeable model

to this exchangeable data set, set s0 = 0.25 and s1 = 0.01. Again, it is preferred to use the more diffused distribution for the . However, due to numerical limitation (Matlab assigns a zero value to the standard normal density when |z| ≥ 39) and the fact that s1 is very small, the value obtained for log10BF was −∞. To circumvent this 128

problem, s1 was set at 0.05. In this case, the value obtained for log10BF was −110.77. This

is a very strong evidence in favor of the model in the denominator of the Bayes factor, which

is the correct exchangeable model. Again, to see the variability of the values of log10BF, this

process was repeated 100 times. A summary of the 100 values of log10BF is shown in the

right plot of Figure 5.9. The values range from −223.4 to −37.2 with an average value of

−108, all of which are consistent with the kind of data used.

Exchangeable Model Data (a = 0.25) Exchangeable Model Data (s = 0.25) s a 50 47

46

0 45 BF BF

4410 10 log log

−50 43

42

−100 41 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0.1 0.15 0.2 0.25 0.3 0.35 Standard deviations of item discrimination Standard deviations of item discrimination

Figure 5.10: Values of log10BF of exchangeable models with varying standard deviations compared to the approximate Two-parameter model. The right plot is just a close up look at the peak of the graph.

Another way to use the Bayes factor to determine the most appropriate model for the

data is to consider a range of value of s1, the standard deviation of the exchangeable model in

the numerator of the Bayes factor. This does not require much more additional calculations

as the sampling distribution − the model in the denominator − remains the same. Hence,

once the sample from the posterior distribution of its parameter (µ) is obtained, it can be used to calculate the Bayes factor for different values of s1. A summary of the values of log10BF obtained for different values of s1 is given in the graph shown in Figure 5.10. Notice 129

that the maximum value of the graph is right at the value s1 = 0.25, which is the actual

standard deviation that was used to generate the data set.

2. Using Two-parameter Data

To generate a data set from the Two-parameter model, the discrimination parameters

aj were drawn from Uniform(0, 4). The latent ability scores θi and the difficulty para- meters bj were again drawn from the standard normal distribution. To compare the fit of the Two-parameter model versus that of the exchangeable model to this data set, the stan- dard deviations were set at s0 = 10 and s1 = 0.25. The value obtained for log10BF was

−17.41, implying the that data set is much more likely to have come from the model in the denominator which is the correct Two-parameter model.

3. Using One-parameter Data

To generate a data set from the One-parameter model, the discrimination parameters

aj were all set to one. The latent ability scores θi and the difficulty parameters bj were again

drawn from the standard normal distribution. To compute the Bayes factor to compare the

fit of the (approximate) One-parameter model versus that of the exchangeable model to this

data set, the standard deviations were set at s0 = 0.25 and s1 = 0.05. The value of log10BF

obtained was 6.03. This implies that the model in the numerator, which is the approximate

One-parameter model, is more appropriate for the data than the exchangeable model with

sa = 0.25. Again, this is consistent with the data that was used.

To see how the values of the Bayes factor behave for different values of s1, the previous process was repeated for a range of values of s1. A summary graph is given in Figure 5.11.

It is interesting to note from the graph that this procedure actually gave a maximum Bayes factor at s1 = 0.06. One reason that could explain this unexpected result is that some of 130

the variability arising from the generation of the data was wrongly attributed to a minor

variability in the item discrimination parameters. This would result in a data set that is

closer to the approximated One-parameter model with s1 close to 0.06 compared to a model

with s1 close to 0.01, thus, creating the observed graph in Figure 5.11.

Using One−parameter Data Using One−parameter Data 12 50

10 0

8 −50 BF BF

10 6

−10010 log log

−150 4

−200 2

−250 0 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 Standard deviations of item discrimination Standard deviations of item discrimination

Figure 5.11: Values of log10BF of exchangeable models with varying standard deviations compared to the approximate One-parameter model. The right plot is a closer look at the peak of the graph.

5.5 Finding Outlying Discrimination Parameters

In the previous sections, it was illustrated that the family of exchangeable models includes

the one-parameter and two-parameter IRT probit models. When the item discrimination

parameters are suspected to be similar and come from a certain distribution, an exchangeable

model is the most appropriate model to use. However, there might be tests that contain a

few items that have discrimination values that are far from most of the item discrimination

values. In such cases, it might be useful to identify these items with outlying discrimination

parameters as they would mess up the fit of the exchangeable model. 131

5.5.1 Using Bayes Factor

To detect outlying discrimination parameters using Bayes factor, an outlier exchangeable

model is considered. The ordinary exchangeable model used in the previous sections assumed

2 that all the discrimination parameters aj comes from N(µa, sa). The outlier exchangeable

model assumes that those discrimination parameters that are suspected to be outlying come

2 2 from a more diffused distribution N(µa,K sa), where K > 1.

Recall from Section 5.3, that the joint posterior distribution of (θ, a, b, µa, sa) under the

exchangeable model (denote this by model M) is proportional to

Yn Yk ∗ π (θ, a, b, µa, sa|data) ∝ L(θ, a, b) φ(θi; 0, 1) φ(bj; 0, sb) i=1 j=1 Yk 2 −ν1−1 2 2 × (sa) exp{−ν2/sa} φ(aj; µa, sa), j=1

where L(θ, a, b) is the same as (1.4.1). When al is suspected to be an outlier, the joint

posterior distribution of (θ, a, b, µa, sa) under the outlier exchangeable model (denoted by

Ml−out) is proportional to

Yn Yk ∗ π (θ, a, b, µa, sa|data) ∝ L(θ, a, b) φ(θi; 0, 1) φ(bj; 0, sb) i=1 j=1 2 Y 2 −ν1−1 {−ν2/sa} 2 2 2 × (sa) e φ(al; µa,K sa) φ(aj; µa, sa). j6=l

To determine whether the data is more likely to have come from the outlier exchange-

able model (Ml−out) than from the ordinary exchangeable model (M), the Bayes factor

BF(Ml−out/M) is computed. From equation (5.4.6), the importance sampling estimate of this Bayes factor is reduced to " # m (l) 2 2 1 X φ(a ; µa,K s ) BFˆ (M /M) = j a , (5.5.1) j−out m (l) l=1 φ(aj ; µa, sa) 132

where the sample used comes from the posterior distribution of the base model (M).

The Matlab program bf exch out a computes the Bayes factor for each item assuming

it is the only item in the test that has an outlying discrimination value. Hence, this program

will give k Bayes factor values, one for each item and each represents the likelihood that the

item has an outlying discrimination parameter.

To illustrate its effectiveness in detecting outlying discrimination parameters, a data set

of 1000 × 35 responses was generated using parameters with three outlying item discrimina- tion. In particular, all of the item discrimination parameters were drawn from N(1, 0.252), except for items 10, 20, and 30 which were assigned discrimination values of 4, −0.5, and 2.5, respectively. The other parameters were drawn in the same manner as before. The program bf exch out a was applied to this artificial data using K = 3.

A summary of the results is given in Table 5.6. From the last column of this table,

one could see that only items 10, 20, 30 have log10BF values that are large enough to tag

them as items having outlying discrimination parameters. This illustrates the effectiveness

of the procedure in detecting the outlying items. Also, except for item 22 which has an

unusually small discrimination parameter value (a22 = .24), the rest of the log10BF values

are all insignificant. This illustrates that this procedure is safe to use as it has a low false

positive rate.

5.5.2 Using Mixture Prior Density

Alternatively, one could also detect outlying discrimination parameters by using a mixture

prior density for the item discrimination parameters. For this mixture prior exchangeable

model, instead of the simple normal density, the distribution of aj is assumed to be a mixture 133

Item Actual aj Estimatea ˆj log10BF 1 1.40 1.24 -0.11 2 1.00 0.90 -0.46 3 0.78 0.66 -0.22 4 1.12 0.92 -0.46 5 1.06 1.01 -0.43 6 0.77 0.66 -0.21 7 1.16 1.20 -0.21 8 0.78 0.70 -0.28 9 1.46 1.35 0.19 10 4.00 2.45 8.86 11 0.91 0.75 -0.35 12 0.72 0.71 -0.30 13 0.61 0.68 -0.23 14 1.18 1.01 -0.43 15 1.07 1.09 -0.35 16 1.04 0.98 -0.45 17 0.95 0.96 -0.45 18 0.89 0.93 -0.46 19 1.48 1.30 0.01 20 -0.50 -0.43 5.70 21 1.10 1.10 -0.35 22 0.24 0.27 1.06 23 0.84 0.68 -0.26 24 1.09 0.97 -0.46 25 0.53 0.48 0.19 26 0.88 0.78 -0.39 27 0.98 0.85 -0.44 28 1.02 0.96 -0.46 29 1.04 0.88 -0.45 30 2.50 1.88 3.15 31 0.97 0.98 -0.45 32 0.76 0.78 -0.38 33 0.74 0.70 -0.28 34 0.87 0.85 -0.44 35 1.62 1.32 0.11

Table 5.6: The log10BF(Ml−out/M) for each item in the artificial data. Note that the value of log10BF(Ml−out/M) for items 10, 20, and 30 are all bigger than 3, marking them as items with outlying discrimination parameter. 134

of normal densities. That is,

aj ∼ pφ(µa, sa) + (1 − p)φ(µa, Ksa), (5.5.2)

where K is a known quantity bigger than one and p is the probability that aj is not outlying.

Using a uniform prior on µa and an inverse-gamma density with parameters ν1 and ν2

2 on sa, the joint prior density for the discrimination parameters {a1, . . . , ak} and hyperpara-

2 meters (µa, sa) can be written

Yk 0 2 2 −ν1−1 2 2 2 2 π0(a, µa, sa) = (sa) exp{−ν2/sa} [pφ(aj; µa, sa) + (1 − p)φ(aj; µa,K sa)]. (5.5.3) j=1

Combining this new prior density of aj with the previous IRT assumptions described in

Chapter 1, the joint posterior distribution of (θ, Z, a, b, µa, sa) is now proportional to

Yn Yk ∗ π (θ, Z, a, b, µa, sa|data) ∝ [φ(Zij; mij, 1)I(Zij, yij)] i=1 j=1 Yn Yk 0 2 × φ(θi; 0, 1) φ(bj; 0, sb)π0(a, µa, sa), (5.5.4) i=1 j=1

where I(z, y) is equal to 1 when {z > 0, y = 1} or {z < 0, y = 0}, and equal to 0 otherwise.

To simulate samples from the joint posterior distribution (5.5.3), additional latent vari-

ables are introduced in the model. For each discrimination parameter aj, a Bernoulli latent

variable γj is introduced such that γj = 1 when aj is outlying and γj = 0 otherwise. From

(5.5.2), it is clear that Pr(γj = 1) = 1 − p. Using Uniform prior on the γj, the joint prior

2 density for {a1, . . . , ak, γ1, . . . , γk} and hyperparameters (µa, sa) can be written

Yk 2 2 −ν1−1 2 2 π0(a, γ, µa, sa) = (sa) exp{−ν2/sa} [pφ(aj; µa, sa)I(γj = 0) j=1 2 2 + (1 − p)φ(aj; µa,K sa)I(γj = 1)]. (5.5.5) 135

The joint posterior distribution of (θ, Z, a, b, γ, µa, sa) is now proportional to

Yn Yk ∗ π (θ, Z, a, b, µa, sa|data) ∝ [φ(Zij; mij, 1)I(Zij, yij)] i=1 j=1 Yn Yk 2 × φ(θi; 0, 1) φ(bj; 0, sb)π0(a, γ, µa, sa), (5.5.6) i=1 j=1

To reflect the changes made by this new prior distribution for the item discrimination and the introduction of the additional latent variables γj in the Gibbs sampler for the exchangeable model (pphn bay), two more steps were included and some modifications were

th made in the simulation process of µa and sa. The first one is to simulate the t iteration of

(t) (t) γj from the Bernoulli distribution with success probability (1 − p ), where

(t−1) (t) p φ(aj; µa, sa) p = (t−1) (t−1) . (5.5.7) p φ(aj; µa, sa) + (1 − p )φ(aj; µa, Ksa)

(t+1) (t) The second is to simulate p from Beta(n1 + 1, n2 + 1), where n1 is the number of γj = 0

(t) and n2 is the number of γj = 1, for j = 1, . . . , k. In addition, µa is now updated using a normal density with mean, P P n1 k (t) n2 k (t) 2 j=1 aj I(γj = 0) + 2 2 j=1 aj I(γj = 1) µ = sa saK (5.5.8) n1 n2 2 + 2 2 sa saK and standard deviation, µ ¶ 1 − 2 n1 n2 σ = 2 + 2 2 . (5.5.9) sa saK

2 Finally, sa is updated using an inverse-gamma distribution with shape parameter,

∗ ν1 = ν1 + k/2, (5.5.10) and scale parameter,

1 X 1 X ν∗ = ν + (a − µ )2 + (a − µ )2. (5.5.11) 2 2 2 j a 2K2 j a j:(γj =0) j:(γj =1) 136

The Matlab program pphn out bay.m (See appendix B.4) simulates values of the dif- ferent parameters in this mixture prior exchangeable model from their respective posterior distributions. Again, the means of these simulated values can serve as the estimates of the model parameters. In particular,γ ˆj represents the probability that aj is an outlying item

discrimination parameter.

To see how effective this alternative procedure in detecting outlying discrimination

parameters, the program pphn out bay.m was applied to the artificially altered data set used in the previous Section 5.5.1. Posterior samples were obtained and the Bayesian estimates were computed by taking the averages of these samples.

Exchangeable Data With Three Outlying Discrimination Parameters 0.7 2.5

a 10 0.6 2

0.5 a 1.5 30

0.4

1

0.3

0.5 Estimated Item Discrimination 0.2

0 Estimated Probability of Having Outlying Discrimination 0.1

a 20 −0.5 0 −0.5 0 0.5 1 1.5 2 2.5 3 3.5 4 0 5 10 15 20 25 30 35 Actual Item Discrimination Item

Figure 5.12: (left) Scatterplot of the actual vs. estimated item discrimination parameters. (right) Estimated probability of each item having an outlying discrimination parameter. Note that items 10, 20, and 30 have much bigger probabilities than the rest.

A summary of the results is given in Figure 5.12 and Table 5.7. From the bar graph on

the right of Figure 5.12, note that the three items 10, 20, and 30 have much bigger proba-

bilities of having outlying discrimination parameters than the rest of the items. From Table

5.7, these probabilities are 0.502, 0.411, and 0.326, respectively, for items 10, 20, and 30. 137

The rest of the items have probabilities around 0.20. This illustrates the effectiveness and

safety of this procedure in detecting outlying discrimination parameters.

Item Actual aj Estimatea ˆj log10BFγ ˆ 1 1.40 1.27 -0.11 0.190 2 1.00 0.93 -0.46 0.180 3 0.78 0.67 -0.22 0.154 5 1.06 1.05 -0.43 0.181 7 1.16 1.25 -0.21 0.195 8 0.78 0.69 -0.28 0.174 9 1.46 1.40 0.19 0.171 10 4.00 3.12 8.86 0.502 11 0.91 0.75 -0.35 0.195 12 0.72 0.71 -0.30 0.167 13 0.61 0.68 -0.23 0.210 14 1.18 1.01 -0.43 0.172 18 0.89 0.95 -0.46 0.167 19 1.48 1.37 0.01 0.203 20 -0.50 -0.49 5.70 0.411 21 1.10 1.06 -0.35 0.152 22 0.24 0.23 1.06 0.217 24 1.09 0.99 -0.46 0.175 25 0.53 0.49 0.19 0.182 27 0.98 0.88 -0.44 0.165 28 1.02 1.02 -0.46 0.159 29 1.04 0.90 -0.45 0.196 30 2.50 2.09 3.15 0.326 31 0.97 1.02 -0.45 0.192 32 0.76 0.80 -0.38 0.172 33 0.74 0.70 -0.28 0.177 34 0.87 0.86 -0.44 0.191 35 1.62 1.39 0.11 0.182 Table 5.7: Theγ ˆ for each item represents the likelihood that its discrimination parameter is outlying. Note that the value ofγ ˆ for items 10, 20, and 30 are all much bigger than the rest, marking them as items with outlying discrimination parameter. 138

5.6 Application To Real Data Set

1. Model Selection

To determine the most appropriate model for the BGSU Mathematics placement

data, different exchangeable IRT models were compared with the two-parameter model using

Bayes factor. The resulting Bayes factor values is summarized in the graphs shown in Figure

5.13. The values of log10BF in the graphs were computed using a sample obtained from the

posterior distribution of µ under the approximate Two-parameter model (s0 = 10).

The graph in Figure 5.13 indicates that the maximum value of the Bayes factor is at

s1 = 0.148. This implies that the most appropriate model for this data set is the exchangeable

model with sa = 0.148. A summary histogram of the posterior distribution of µa under this

particular model is given in Figure 5.14. The sample mean and sample standard deviation

were 0.56 and 0.029, respectively.

BGSU Math Placement Data Set BGSU Math Placement Data Set 55 55

50 50

45 45

40 40

35 35 BF BF 10 10 log 30 log 30

25 25

20 20

15 15 0.148

10 10 0 0.5 1 1.5 2 2.5 3 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 Standard deviations of of item discrimination Standard deviations of item discrimination

Figure 5.13: Values of log10BF of exchangeable models with varying standard deviations compared to the two-parameter model using the BGSU Math placement data set. The right plot is just a close up look at the peak of the graph. 139

BGSU Math Placement Data (s = 0.148) a 300 Posterior Sample of m a 0.7

250

0.65

200

0.6 150

100 0.55

50 0.5

0 0.45 0.5 0.55 0.6 0.65 0.7 Histogram of 1000 posterior sample values of m 0.45 a 0 100 200 300 400 500 600 700 800 900 1000

Figure 5.14: Histogram of the 1000 posterior sample values of µa for the BGSU Math placement data using the exchangeable model with sa = 0.148.

2. Checking For Outlying Discrimination

To check for any potential outlying discrimination parameter, the Bayes factors of the outlier exchangeable model for each item versus the ordinary exchangeable model (with sa = 0.148) were computed for this data set. The left bar graph of Figure 5.15 and the third column of Table 5.8 show that the log10BF values for all the 35 items were all between

−0.47 to −0.21. None of which indicate any evidence to suggest any outlying discrimination parameter. Therefore, the exchangeable model is indeed an appropriate model to use.

When the mixture prior exchangeable model was used for this data set, the results obtained were very similar. The bar graph to the right of Figure 5.15 and the last column of Table 5.8 give the estimated probability of each item discrimination of being an outlier.

Notice from the graph and table that all items have similar probabilities, indicating that there is no evidence to suggest any outlying discrimination parameter. This result confirms the previous conclusion that the exchangeable model (with sa = 0.148) is indeed an appropriate 140

BGSU Math Placement Data BGSU Math Placement Data 0 0.25

−0.1 0.2

−0.2

0.15

−0.3 BF 10 log

0.1 −0.4

−0.5 0.05 Estimated probability of having an outlying discrimination

−0.6

0 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35 Items Items

Figure 5.15: Histogram of the 1000 posterior sample values of µa for the BGSU Math placement data using the exchangeable model with sa = 0.148.

model to use. 141

Itema ˆj log10BFγ ˆ 1 0.60 -0.46 0.174 2 0.45 -0.42 0.189 3 0.49 -0.45 0.192 4 0.51 -0.46 0.160 5 0.68 -0.42 0.193 6 0.61 -0.46 0.168 7 0.51 -0.46 0.204 8 0.42 -0.40 0.171 9 0.60 -0.46 0.213 10 0.69 -0.42 0.165 11 0.36 -0.33 0.194 12 0.58 -0.47 0.171 13 0.68 -0.42 0.181 14 0.52 -0.46 0.173 15 0.58 -0.47 0.182 16 0.50 -0.46 0.173 17 0.53 -0.46 0.187 18 0.56 -0.47 0.193 19 0.68 -0.43 0.169 20 0.71 -0.40 0.187 21 0.73 -0.38 0.190 22 0.73 -0.37 0.180 23 0.49 -0.45 0.188 24 0.61 -0.46 0.174 25 0.56 -0.47 0.175 26 0.75 -0.36 0.184 27 0.53 -0.46 0.178 28 0.43 -0.41 0.172 29 0.73 -0.38 0.198 30 0.28 -0.21 0.187 31 0.79 -0.30 0.194 32 0.65 -0.45 0.175 33 0.34 -0.30 0.204 34 0.37 -0.34 0.213 35 0.50 -0.46 0.176

Table 5.8: Bayesian estimates ofa ˆj, log10BF, and γ for the BGSU Math placement exam. 142

CHAPTER 6 SUMMARY AND CONCLUSIONS

This dissertation started with the introduction of the different IRT models. It was followed by a discussion on how to estimate the item parameters and the latent ability of examinees using the classical joint maximum likelihood estimation and Bayesian estimation.

In the last section of the first chapter, the advantages of the Bayesian method over the classical method were discussed. This includes, (1) its ability to work with extreme exam results (zero and perfect scores), (2) its ability to incorporate prior information, (3) its simple fitting procedure, and most importantly, (4) it offers a great deal of potential for model checking using the posterior sample of the parameters obtained during the estimation procedure.

Using this posterior sample of the model parameters, the posterior distribution of dif- ferent residuals defined in Chapter 3 were easy to estimate. This allowed the construction of probability intervals that were used in the residual plots. These different residual plots were used to identify outlying items and examinees, and to detect misfitted items and ex- aminees. In particular, the IRC probability interval band plot was useful for checking the fit of the model for each item, while the latent residual plot was useful for detecting outlying examinees.

In Chapter 2, six of the χ2−indices currently being used to test the goodness-of-fit of the IRT models were discussed. The main problem with these test statistics is that their distributions are all unknown. When these χ2−indices were used as discrepancy measures

142 143 in the Bayesian procedure, samples from their marginal distributions were easily obtained using samples of the model parameters drawn from the prior, posterior, or partial posterior distributions. In the case were the posterior sample of the model parameters were used, a posterior sample for each of the six discrepancy measures were obtained. These resulting posterior predictive samples were utilized to measure the surprise of the observed data. In particular, these six discrepancy measures were used to assess the goodness-of-fit of each item in the test. Simulation studies have shown that the G2 index had the best combination of low type I error rate and high detection rate among the six indices.

To test the null hypothesis that the data comes from the one-parameter model versus the alternative hypothesis that it comes from the two-parameter model, the standard devia- tion of the point-biserial correlations of the items in the test was used as a test quantity. The prior and posterior predictive distributions of this test quantity was obtained to measure the surprise of the observed data. Simulation results have shown that both predictive distrib- utions yielded low type I error rates and very high power. However, the simulation results have also shown that the posterior predictive distribution has higher precision compared to the prior predictive distribution.

To detect misfitted examinees, two more discrepancy measures for person fit were intro- duced. The posterior predictive distribution was used to obtain the reference distributions of these two discrepancy measures. The results of the simulation study have shown that both discrepancy measures have very low false positive rates and good detection rates. In the detection of guessers, the W −statistic has an average of 82% detection rate, while the

L−statistic has an average of 85% detection rate.

Finally, the posterior sample of the model parameters were also used to estimate the 144

Bayes factor, which is used for model comparisons. To accomplish the comparisons of IRT

models using Bayes factor, the Exchangeable IRT model was introduced and utilized. The

first few sections of Chapter 5 covered the ideas used to estimate the Bayes factor for IRT

models, which utilized the exchangeable IRT model, the posterior sample of item parameters,

and the idea of importance sampling. Simulation results in the later sections have illustrated

the effectiveness of the Bayes factor in determining the most appropriate IRT model for

simulated data sets. Using an outlier exchangeable IRT model, the Bayes factor was also

shown to be effective in detecting items with outlying discrimination parameters. This final

result was confirmed by another proposed method which utilized an IRT model with mixture

prior density for the discrimination parameters.

As a final remark, it is recommended that combinations of the methods that were

proposed in this work should be used in assessing the goodness-of-fit of any given IRT

model. If one is testing for the fit of test items, then the posterior predictive method using

the G2 discrepancy measure should be used together with the IRC probability interval band plots. If one is testing for examinee fit, then the posterior predictive method using the log- likelihood discrepancy measure (L) should be used together with the latent residual plots.

Finally, if one is trying to decide whether to use the one-parameter, two-parameter, or an exchangeable model, the posterior predictive method using the standard deviation of the item point-biserial correlations as test quantity should be used together with the Bayes factor method. 145

Appendix A NUMERICAL METHODS

A.1 Newton Raphson for IRT Models

The Newton Raphson procedure is a simple but very effective way for estimating a

solution of an equation f(x) = 0. The idea is to produce a sequence of estimates that

converges to a solution. Using a good starting estimate x0, a more accurate estimate x1 is

obtained by

f(x0) x1 = x0 − 0 , f (x0)

0 assuming that f (x0) 6= 0. Once x1 is obtained, an improvement on x1 may be obtained in the same manner. In general, if xm is the approximate solution at the mth stage, then an improved solution xm+1 is obtained by

f(xm) xm+1 = xm − 0 . (A.1.1) f (xm)

This process is repeated until (xm+1 − xm) is below a pre-established small value.

In section 1.4.2, it was desired to obtain the joint maximum likelihood estimates of the

parameters in the two-parameter logistic IRT model. That is, it was desired to maximize

the log-likelihood function

Xn Xk ln L = {yij ln(pij) + (1 − yij) ln(1 − pij)}, (A.1.2) i=1 j=1

exp(ajθi − bj) where pij = . 1 + exp(ajθi − bj) To address the problem of non-identifiability of the parameters in the two-parameter

logistic IRT model, it was necessary to fix the ability scores (θ) such that their mean is zero

145 146

and variance is one. Under these two restrictions, the maximization of ln L was done by

solving the likelihood equations

∂ ln L = 0, i = 1, . . . , n. ∂θi ∂ ln L = 0, j = 1, . . . , k. (A.1.3) ∂aj ∂ ln L = 0, j = 1, . . . , k. ∂bj

Because these equations are nonlinear and the system of equations (A.1.3) is not sepa-

rable, a multivariate version of the Newton Raphson procedure is required to approximate

the solution of this system of equations. In the multivariate version, the improved estimate

of a p dimensional vector ξ is obtained by

ξ(m+1) = ξ(m) − [f 00(ξ(m))]−1f 0(ξ(m)), (A.1.4)

where, f 00 is the (p × p) matrix of second derivatives, and f 0 is the (p × 1) vector of first

derivatives evaluated at ξ(m).

In the case of the two-parameter logistic model, ξ = (θ1, . . . , θn, a1, . . . , ak, b1, . . . , bj). To obtain the maximum likelihood estimate of these parameters, a two-stage iterative procedure was carried out: Starting with initial values (a(0), b(0)) and treating these item parameters as known, θi (i = 1, . . . , n) was estimated using the simple Newton Raphson procedure. Using the estimates of θ obtained in the first stage and treating these ability scores as known, the item parameters (aj, bj)(j = 1, . . . , k) were estimated using the bivariate version of Newton

(m) (m) (m) Raphson. That is, the improved estimate of xj =(aj , bj ) was obtained by

(m+1) (m) (m) −1 0 (m) xj = xj − {H[xj ]} f [xj ], (A.1.5)

(m) where, H[xj ] is a (2 × 2) symmetric matrix of second derivations. After obtaining the 147

Newton Raphson estimates of (a, b) in this second stage, the first stage was repeated using

the new item parameter estimates as known quantities to get new estimates of θ. This

two-stage procedure was repeated until the ability and item parameters converge.

The additional restriction on the ability parameters were incorporated into the two-

stage procedure by scaling the mean of θ to zero and its variance to one at each iteration.

Then item parameter estimates were scaled accordingly and estimated. The Matlab program

pl2 mle performs all of these procedures to give the joint maximum likelihood estimates of

the ability and item parameters for the two-parameter logistic model.

A.2 Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) is useful in simulating from complex posterior

distributions. MCMC algorithms draw correlated sequence of random variables in which

the nth value in the sequence, say θ(n), is sampled from a probability distribution that

depends only on the previous value, θ(n−1). Under conditions that are generally satisfied, the

distribution of θ(n) converges to the desired posterior distribution. Hence, after the burn-in

period of k iterations, the next m sampled values, {θ(k+1), . . . , θ(k+m)}, will approximate a

(dependent) sample from the posterior distribution (see [10], [20], and [29] for more details).

A.2.1 Metropolis-Hasting

The Metropolis-Hastings algorithm provides a general way to obtain successive iterates in an MCMC run. When the parameter $\theta$ is scalar, a simple way to obtain a candidate $\theta^{(c)}$ for the next iterate $\theta^{(n)}$ is to draw it from a normal density centered at the previous value, $\theta^{(n-1)}$, with standard deviation $s$. This is called the proposal density and is denoted by $\alpha(\theta^{(c)}|\theta^{(n-1)})$.

This candidate value $\theta^{(c)}$ is accepted as the next simulated value in the sequence, $\theta^{(n)}$, with probability $p_c$,

$$p_c = \min\left(1, \frac{g(\theta^{(c)}|\text{data})\,\alpha(\theta^{(n-1)}|\theta^{(c)})}{g(\theta^{(n-1)}|\text{data})\,\alpha(\theta^{(c)}|\theta^{(n-1)})}\right), \qquad (A.2.1)$$

where $\alpha(\theta^{(n-1)}|\theta^{(c)})$ is the probability of moving from the candidate value back to the previous value of $\theta$. When a symmetric proposal density is used, such as the normal density, this probability reduces to

$$p_c = \min\left(1, \frac{g(\theta^{(c)}|\text{data})}{g(\theta^{(n-1)}|\text{data})}\right). \qquad (A.2.2)$$

Whenever the candidate value $\theta^{(c)}$ is not accepted, $\theta^{(n)}$ takes on the previous value, $\theta^{(n-1)}$. The value of $s$, the standard deviation of the proposal density, is tuned according to the acceptance rate of candidate values. In practice, $s$ is chosen so that the acceptance rate falls between 25% and 40%.
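A minimal MATLAB sketch of this random-walk algorithm for a generic scalar parameter is given below; the log posterior here is taken, purely for illustration, to be that of a standard normal, and working on the log scale avoids numerical underflow in the ratio (A.2.2):

    logpost = @(t) -.5*t.^2;           % toy log posterior, log g(theta|data)
    m = 5000;  s = 1;                  % chain length and proposal std. dev.
    theta = zeros(m,1);  accept = 0;
    for i = 2:m
        cand = theta(i-1) + s*randn;   % candidate from N(theta^(i-1), s^2)
        pc = min(1, exp(logpost(cand)-logpost(theta(i-1))));   % as in (A.2.2)
        if rand < pc
            theta(i) = cand;  accept = accept+1;   % accept the candidate
        else
            theta(i) = theta(i-1);                 % keep the previous value
        end
    end
    accept/(m-1)                       % tune s so this falls near 25%-40%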

In section 5.2.7, it was desired to obtain a sample of $\mu$ from its marginal posterior distribution given in equation (5.2.21),

$$\pi^*(\mu|\text{data}, k_0) = \frac{\prod_{i=1}^{n} \frac{B(\mu k_0 + y_i,\; k_0(1-\mu) + (n_i - y_i))}{B(\mu k_0,\; k_0(1-\mu))}}{\int \prod_{i=1}^{n} \frac{B(\mu k_0 + y_i,\; k_0(1-\mu) + (n_i - y_i))}{B(\mu k_0,\; k_0(1-\mu))}\, d\mu}.$$

Using the normal proposal density, the probability of accepting a candidate value $\mu^{(c)}$ was given by

$$p_c = \min\left(1, \frac{\pi^*(\mu^{(c)}|\text{data}, k_0)}{\pi^*(\mu^{(m-1)}|\text{data}, k_0)}\right), \qquad (A.2.3)$$

where

$$\frac{\pi^*(\mu^{(c)}|\text{data}, k_0)}{\pi^*(\mu^{(m-1)}|\text{data}, k_0)} = \frac{\prod_{i=1}^{n} \frac{B(\mu^{(c)} k_0 + y_i,\; k_0(1-\mu^{(c)}) + (n_i - y_i))}{B(\mu^{(c)} k_0,\; k_0(1-\mu^{(c)}))}}{\prod_{i=1}^{n} \frac{B(\mu^{(m-1)} k_0 + y_i,\; k_0(1-\mu^{(m-1)}) + (n_i - y_i))}{B(\mu^{(m-1)} k_0,\; k_0(1-\mu^{(m-1)}))}}, \qquad (A.2.4)$$

and $B(\alpha, \beta) = \dfrac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$.

The MATLAB program post_sim uses this algorithm to simulate a posterior sample of $\mu$ from its marginal posterior distribution given in equation (5.2.21).

A.2.2 Gibbs Sampling

Gibbs sampling is a special type of Metropolis-Hastings procedure that eliminates the necessity of finding a good proposal density. This sampler draws the next iterate of the MCMC run using the full-conditional posterior distributions.

Consider the problem of simulating a posterior sample for the parameter vector $\xi = (\xi_1, \dots, \xi_p)$ from its joint posterior distribution, $\pi^*(\xi|\text{data})$. Let $\pi_1^*(\xi_1|\xi_2, \dots, \xi_p, \text{data})$, $\pi_2^*(\xi_2|\xi_1, \xi_3, \dots, \xi_p, \text{data})$, $\dots$, $\pi_p^*(\xi_p|\xi_1, \dots, \xi_{p-1}, \text{data})$ be the full-conditional posterior distributions of $\xi_1, \dots, \xi_p$ given the rest of the parameters, respectively. Assuming that all of these conditional distributions are available and can be simulated from, the Gibbs sampling procedure iteratively draws one parameter value at a time from its corresponding full-conditional posterior distribution. That is, starting from an initial vector $\xi^{(0)} = (\xi_1^{(0)}, \dots, \xi_p^{(0)})$, draw

1. $\xi_1^{(1)}$ from $\pi_1^*(\xi_1|\xi_2^{(0)}, \dots, \xi_p^{(0)}, \text{data})$,

2. $\xi_2^{(1)}$ from $\pi_2^*(\xi_2|\xi_1^{(1)}, \xi_3^{(0)}, \dots, \xi_p^{(0)}, \text{data})$,

$\;\;\vdots$

p. $\xi_p^{(1)}$ from $\pi_p^*(\xi_p|\xi_1^{(1)}, \dots, \xi_{p-1}^{(1)}, \text{data})$.

This completes the first iteration, and the process is repeated using the new vector $\xi^{(1)}$ to get $\xi^{(2)}$, and so on. After a burn-in period of $k$ iterations, the next $m$ simulated values, $\{\xi^{(k+1)}, \dots, \xi^{(k+m)}\}$, will serve as an approximate (dependent) posterior sample of $\xi$ from its joint posterior distribution.

More details about Gibbs sampling are given in [10], [20], [29], and [56].
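As a toy illustration of the scheme (deliberately simpler than the IRT samplers of Appendix B), the following MATLAB fragment uses Gibbs sampling to draw from a bivariate normal distribution with correlation rho, for which both full-conditional distributions are normal:

    m = 5000;  rho = .8;
    xi = zeros(m,2);                      % storage for (xi_1, xi_2)
    sd = sqrt(1-rho^2);                   % conditional standard deviation
    for i = 2:m
        xi(i,1) = rho*xi(i-1,2) + sd*randn;  % draw xi_1 given the current xi_2
        xi(i,2) = rho*xi(i,1) + sd*randn;    % draw xi_2 given the new xi_1
    end
    xi = xi(501:m,:);                     % discard a burn-in of 500 iterations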

A.2.3 Importance Sampling

Importance sampling is a Monte Carlo method that is useful for estimating integrals. Consider the following integration problem,

$$I = \int f(\theta)\,\pi(\theta)\, d\theta. \qquad (A.2.5)$$

Let $g(\theta)$ be another density for $\theta$ with the same support as $f$. Then

$$I = \int \frac{f(\theta)\pi(\theta)}{g(\theta)}\, g(\theta)\, d\theta = \mathrm{E}_g\left[\frac{f(\theta)\pi(\theta)}{g(\theta)}\right], \qquad (A.2.6)$$

where $\mathrm{E}_g$ denotes the expectation with respect to $g$. Hence, if a sample $\theta_1, \dots, \theta_n$ from $g$ is available, then (A.2.5) can be estimated by

$$\hat{I} = \frac{1}{n}\sum_{i=1}^{n} \frac{f(\theta_i)\pi(\theta_i)}{g(\theta_i)}. \qquad (A.2.7)$$

This estimator enjoys good frequentist properties (Gamerman, 1997), such as:

1. It is an unbiased estimator, in that $\mathrm{E}_g(\hat{I}) = I$.

2. Its variance is of the form $V_g(\hat{I}) = \sigma^2/n$, where $\sigma^2 = \int \frac{f^2(\theta)\pi^2(\theta)}{g(\theta)}\, d\theta - I^2$.

3. It satisfies a central limit theorem stating that

$$\sqrt{n}\,\frac{\hat{I} - I}{\sigma} \to N(0, 1) \quad \text{as } n \to \infty \qquad (A.2.8)$$

in distribution.

4. It is a strongly consistent estimator of $I$, in that $\hat{I} \to I$ as $n \to \infty$ with probability 1.
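The following MATLAB fragment is a minimal sketch of the estimator (A.2.7) for a toy problem in which the answer is known: with $f(\theta) = \theta^2$ and $\pi$ the standard normal density, $I = 1$, and $g$ is taken to be the heavier-tailed $N(0, 4)$ density:

    n = 100000;
    th = 2*randn(n,1);                        % sample theta_1,...,theta_n from g
    w = (exp(-th.^2/2)/sqrt(2*pi)) ...        % pi(theta_i)
        ./ (exp(-th.^2/8)/(2*sqrt(2*pi)));    % divided by g(theta_i)
    I_hat = mean(th.^2 .* w)                  % estimate of I; close to 1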

Appendix B MATLAB PROGRAMS

B.1 Chapter 1 codes

1. pl2_mle

function [theta,av,bv]=pl2_mle(data)

% This function will estimate the item difficulty and discrimination
% parameters and the ability scores using Joint Maximum Likelihood Estimation.
% Sherwin Toribio (Bowling Green State University), April 2005.
%
% command: [theta2,av2,bv2]=pl2_mle(data);
%
% Input:  data  = n by k matrix of 0's and 1's.
% Output: bv    = 1 by k vector of estimated difficulty parameters.
%         av    = 1 by k vector of estimated discrimination parameters.
%         theta = n by 1 vector of estimated ability scores.
k=size(data,2);                  % no. of items
n=size(data,1);                  % no. of examinees
av=ones(1,k);                    % initial estimates of the discrimination parameters
pro=sum(data)/n;                 % proportion of correct responses per item
bv=pro-mean(pro);                % initial estimates of the item difficulty
% bv=zeros(1,k);                 % alternative initial b estimates
theta=sum(data')';               % initial estimates of the ability scores
theta=(theta-mean(theta))/std(theta);
par=[av'; bv'; theta];           % (n+2k) column vector of parameters
par_before=par+1;

% loop until convergence - Newton-Raphson method
while (max(abs(par-par_before))>0.001)

    par_before=par;

    % assume abilities are known and maximize w.r.t. the item parameters
    H=ones(2,k);
    while (max(max(abs(H)))>0.001)
        M=(theta*ones(1,k)).*(ones(n,1)*av)-ones(n,1)*bv;
        P=exp(M)./(1+exp(M));
        H_before=H;
        H1=sum((theta*ones(1,k)).*(data-P));         % H1 is 1 by k
        H2=sum(P-data);
        H11=sum(-1*(theta*ones(1,k)).^2.*(1-P).*P);  % H11 is 1 by k
        H12=sum((theta*ones(1,k)).*(1-P).*P);
        H22=sum(P.*(P-1));
        for i=1:k
            H(:,i)=inv([H11(i) H12(i);H12(i) H22(i)])*[H1(i);H2(i)];
            if max(abs(H(:,i)))>5
                H(:,i)=0;                            % controlling extreme values
            end
        end
        av=av-H(1,:);
        bv=bv-H(2,:);
    end

    % assume item parameters are known and maximize w.r.t. the ability scores
    h=ones(n,1);
    while (max(abs(h))>0.001)
        M=((theta*ones(1,k)).*(ones(n,1)*av))-(ones(n,1)*bv);
        P=exp(M)./(1+exp(M));
        h_before=h;
        h=(-1*sum(data'-P')./sum(P'.*(1-P)'))';
        theta=theta-h;
    end

    % rescale theta to mean zero and variance one, saving the mean and std
    % first, and adjust the item parameters to preserve a*theta-b
    mt=mean(theta); st=std(theta);
    theta=(theta-mt)/st;
    bv=bv-av*mt;                 % since a*theta-b=(a*st)*theta_new-(b-a*mt)
    av=av*st;
    par=[av'; bv'; theta];
end

2. pp2_bay

function [av,bv,thv,th_m,th_s]=pp2_bay(y,s_a,s_b,m)

% This function will simulate from the posterior distributions of the item
% difficulty and discrimination parameters, and the ability scores,
% using the two-parameter probit IRT model.
%
% command: [av_bay,bv_bay,th_bay,th_m,th_s]=pp2_bay(data,1,1,500);
%
% input:  y        - n by k matrix of 0's and 1's.
%         s_a, s_b - prior standard deviations
%         m        - number of iterations (default is 500)
% output: av       - m by k matrix of simulated values of a_j
%         bv       - m by k matrix of simulated values of b_j
%         thv      - n by m matrix of simulated values of theta_i
%         th_m, th_s - vectors of means and std of the theta_i
% Written by Dr. James Albert (BGSU)
% Modified by Sherwin Toribio (2005)
if nargin==3, m=500; end      % default is 500 Gibbs cycles
s=size(y); n=s(1); k=s(2);
burn=100;                     % discard the first 100 iterations
mu=0; var=1;                  % hyperparameters of theta prior
a=2*ones(1,k);                %
phat=(sum(y)+.5)/(n+1);       % initial estimates
b=-phiinv(phat)*sqrt(5);      %
th=zeros(n,1);                %
av=zeros((m+burn),k);         %
bv=av;                        % set up storage
thv=zeros(n,(m+burn));        %
for kk=1:(m+burn)             % MAIN ITERATION LOOP

    lp=th*a-ones(n,1)*b;                   % lp is n by k
    bb=phi(-lp);                           % simulate latent
    u=rand(n,k);                           % data z
    tt=(bb.*(1-y)+(1-bb).*y).*u+bb.*y;     % tt is n by k
    z=phiinv(tt)+lp;                       % z is n by k

    v=1/sum(a.^2);                         %
    pvar=1/(1/v+1/var);                    % simulate theta
    mn=sum(((ones(n,1)*a).*(z+ones(n,1)*b))')';  % N(mu,var) prior
    pmean=(mn+mu/var)*pvar;                %
    th=randn(n,1)*sqrt(pvar)+pmean;        % th is n by 1

    x=[th -ones(n,1)];                     %
    pp=[1/s_a^2 0;0 1/s_b^2];              % prior precision matrix
    amat=chol(inv(x'*x+pp));               %
    bz=(x'*x+pp)\(x'*z);                   % simulate (a, b)
    beta=amat'*randn(2,k)+bz;              %
    a=beta(1,:); b=beta(2,:);

    av(kk,:)=a;                            %
    bv(kk,:)=b;                            % store simulated values
    thv(:,kk)=th;                          %
end
av=av((burn+1):(burn+m),:);
bv=bv((burn+1):(burn+m),:);
thv=thv(:,(burn+1):(burn+m));
th_m=mean(thv')';             % compute mean and standard
th_s=std(thv')';              % deviation of the theta_i

function val=phi(x)
val=.5*(1+erf(x/sqrt(2)));

function val=phiinv(x)
val=sqrt(2)*erfinv(2*x-1);

B.2 Chapter 3 codes

1. plot_res

function plot_res(data,av,bv,th_m,nb,t)

% This function constructs the (grouped) residual plots per item.
%
% Command: plot_res(data,av_bay,bv_bay,th_m,16,15);
%
% Input: data
%        av   - simulated values of the discrimination parameter
%        bv   - simulated values of the difficulty parameter
%        nb   - no. of groups
%        t    - item number
theta=linspace(-2,2,nb);
[pr,lo,hi]=irtpost(av,bv,theta);   % computes 5,50,95 prob. percentiles
bins=zeros(nb,2);
for i=1:nb
    bins(i,1)=theta(i)-.5*(theta(2)-theta(1));
    bins(i,2)=theta(i)+.5*(theta(2)-theta(1));
end

[p,mids,n]=irtobs(data,th_m,bins);
% p is the observed proportion of students who got the item correct.
figure(t)
Rlo=p-lo; Rme=p-pr; Rhi=p-hi;
errorbar(mids,Rme(:,t),Rme(:,t)-Rlo(:,t),Rhi(:,t)-Rme(:,t),'o')
xlabel('Latent Trait')
ylabel('Bayesian Residuals')

2. plot_obspr_fitpr

function plot_obspr_fitpr(data,av,bv,th_m,nb,t)

% Command: plot_obspr_fitpr(data,av_bay,bv_bay,th_m,16,15);
% nb=16                            % no. of bins
theta=linspace(-3,3,nb);

[pr,lo,hi]=irtpost(av,bv,theta);   % computes 5,50,95 prob. percentiles
bins=zeros(nb,2);
for i=1:nb
    bins(i,1)=theta(i)-.5*(theta(2)-theta(1));
    bins(i,2)=theta(i)+.5*(theta(2)-theta(1));
end

[p,mids,n]=irtobs(data,th_m,bins);
% p is an nb by k matrix of observed proportions
% of students who got the item correct.
figure(t)
plot(theta,pr(:,t),'-',theta,lo(:,t),':',theta,hi(:,t),':');
hold on; plot(mids,p(:,t),'o')
xlabel('LATENT TRAIT'); ylabel('PROBABILITY')

% The following fragment, which plots Bayesian residuals against fitted
% probabilities, refers to summ_resid and summ_fitted, which are not
% defined in this function:
% figure()
% fm=summ_resid;
% errorbar(summ_fitted(:,2),fm(:,2),fm(:,2)-fm(:,1),fm(:,3)-fm(:,2),'o')
% xlabel('Fitted Probability')
% ylabel('Bayesian Residual')
% hold on; plot([0 1],[0 0],':'), axis([0 1 -1 1]), hold off
% title('BAYESIAN RESIDUAL POSTERIOR DISTRIBUTIONS')

3. gen_irt_norm_out2

function [Y1,Y2,Y3,THETA,A,B,G,guessers]=gen_irt_norm_out2(n,k,link,ng)

% Command: [Y1,Y2,Y3,THETA,A,B,G,guessers]=gen_irt_norm_out2(1000,30,'p',2);
%
% This program will generate 3 n by k data matrices for IRT (with ng guessers).
%
% Input:  n  = no. of examinees
%         k  = no. of items
%         ng = no. of guessers
% Output: Y1 = n by k data matrix of 1's & 0's from the Rasch model.
%         Y2 = n by k data matrix of 1's & 0's from the 2-parameter logit model.
%         Y3 = n by k data matrix of 1's & 0's from the 3-parameter logit model.
%         THETA = n by 1 vector of ability scores.
%         A  = 1 by k vector of item discrimination values.
%         B  = 1 by k vector of item difficulty values.
%         G  = 1 by k vector of item guessing values.
%         guessers = the index specifying the guessers

THETA=normrnd(0,1,n,1);             % generating n ability scores from N(0,1)
B=normrnd(0,1,1,k);                 % generating k difficulty parameter values
discri=[.2 .4 .6 .8 1 1.2 1.4 1.6]; % 8 possible item discrimination values
indexa=unidrnd(8,k,1);
A=discri(indexa);                   % generating k discrimination parameter values
guess=[0 .2 .25 .333 .5];           % 5 different possible correct guessing rates
indexc=unidrnd(5,k,1);
G=guess(indexc);                    % generating k guessing parameter values
guessers=unidrnd(n,ng,1);

M1=(THETA*ones(1,k))-(ones(n,1)*B);
P1=g(M1,link);
P1(guessers,:)=.20;

% Data coming from the Rasch model
Y1=(P1>rand(n,k));                  % n by k data matrix of 0's and 1's

M2=((THETA*ones(1,k)).*(ones(n,1)*A))-(ones(n,1)*B);
P2=g(M2,link);
P2(guessers,:)=.20;

% Data coming from a 2-parameter logit model
Y2=(P2>rand(n,k));                  % n by k data matrix of 0's and 1's

% Data coming from a 3-parameter logit model
Y3=Y2+(Y2==0).*((ones(n,1)*G)>rand(n,k));
Y3(guessers,:)=Y2(guessers,:);

function p=g(eta,link)
if link=='l'
    p=exp(eta)./(1+exp(eta));
elseif link=='p'
    p=phi(eta);
elseif link=='c'
    p=1-exp(-exp(eta));
end

4. plot_resid_post_exa

function plot_resid_post_exa(data,av,bv,th,examinee,link)

% This function computes the 5%, 50%, & 95% points of the fitted prob. and
% residuals for a specified examinee, then plots the posterior
% distribution of the residuals vs. the estimated item difficulty.
%
% Command: plot_resid_post_exa(data,av_bay,bv_bay,th_bay,1,'p')
%
% Input : data     - n by k response matrix of 0's and 1's.
%         av       - m by k matrix of simulated values of a_j
%         bv       - m by k matrix of simulated values of b_j
%         th       - n by m matrix of simulated values of theta_i
%         examinee - examinee of interest
%         link     - 'p'=probit; 'l'=logit
y=data(examinee,:);
k=size(data,2); m=size(av,1);
probs=[.05 .5 .95]; le=length(probs);   % probability points for the summaries
summ_fitted=zeros(k,le);
summ_resid=zeros(k,le);
cp=((1:m)-.5)/m;
lp=(th(examinee,:)'*ones(1,k)).*av-bv;  % lp is an m by k matrix
for j=1:k
    p=g(lp(:,j),link);
    summ_fitted(j,:)=interp1(cp,sort(p),probs);   % summ_fitted is k by 3
    residual=y(j)-summ_fitted(j,:);
    summ_resid(j,:)=residual([3 2 1]);
end
figure(examinee)
fm=summ_resid;
errorbar(mean(bv),fm(:,2),fm(:,2)-fm(:,1),fm(:,3)-fm(:,2),'o')
xlabel('Estimated Item Difficulty')
ylabel('Bayesian Residual')
title('BAYESIAN RESIDUAL POSTERIOR DISTRIBUTIONS')

5. plot_latresid_post_examinee

function plot_latresid_post_examinee(data,av,bv,th,z_col,exa,link)

% This function computes the 5%, 50%, & 95% points of the latent residuals
% for a specified examinee, then plots the posterior
% distribution of the residuals vs. the estimated item difficulty.
%
% command: plot_latresid_post_examinee(data,av_bay,bv_bay,th_m,z_col,1,'p')
n=size(data,1); k=size(data,2); m=size(av,1);
z=zeros(m,k);
for j=1:k
    z(:,j)=z_col(exa+(j-1)*n,:)';      % z is an m by k matrix
end
probs=[.05 .5 .95]; le=length(probs);
summ_latent=zeros(k,le);
summ_resid=zeros(k,le);
cp=((1:m)-.5)/m;
lp=th(exa)*mean(av)'-mean(bv)';        % lp is a k by 1 column vector
for j=1:k
    summ_latent(j,:)=interp1(cp,sort(z(:,j)),probs);
    residual=summ_latent(j,:)-lp(j);
    summ_resid(j,:)=residual([3 2 1]);
end
figure(exa)
fm=summ_resid;
errorbar(mean(bv),fm(:,2),fm(:,2)-fm(:,1),fm(:,3)-fm(:,2),'o')
xlabel('Estimated Item Difficulty')
ylabel('Bayesian Residual')
title('LATENT RESIDUAL POSTERIOR DISTRIBUTIONS')

6. out_post_exa

function [post_rs,post_rs_resid]=out_post_exa(data,av,gv,th,link)

% This function will compute the [2.5% 50% 97.5%] simulated raw scores
% and simulated residuals to determine whether a particular examinee got
% a score that is not compatible with his/her estimated ability score.
%
% Command: [post_rs,post_rs_resid]=out_post_exa(data,av_bay,bv_bay,th_bay,'p');
% Output: post_rs       = [2.5% 50% 97.5%] simulated raw scores.
%         post_rs_resid = [2.5% 50% 97.5%] simulated residuals.

N=size(data,1);                % no. of examinees
k=size(data,2);                % no. of items in the test
m=size(av,1);
sa=sum(data')';                % total score of each examinee
probs=[.025 .5 .975]; l=length(probs);
post_rs=zeros(N,l);            % post_rs is an N by 3 matrix
post_rs_resid=zeros(N,l);
cp=((1:m)-.5)/m;
for i=1:N
    lp=(th(i,:)'*ones(1,k)).*av-gv;   % lp is an m by k matrix
    p=g(lp,link);
    rs=sum(p')';               % m by 1 vector of expected raw scores
    post_rs(i,:)=interp1(cp,sort(rs),probs);
    residual=sa(i)-post_rs(i,:);
    post_rs_resid(i,:)=residual([3 2 1]);
end

B.3 Chapter 4 codes

1. rpbis

function [r_pb,std_rpb]=rpbis(data)

% This function computes the point-biserial correlations of an IRT data set.
% Command: [r_pb,std_rpb]=rpbis(data);
n=size(data,1); k=size(data,2);
X=sum(data')';                                     % n by 1 column vector
X0_bar=sum((X*ones(1,k)).*(1-data))./sum(1-data);  % 1 by k row vector
X1_bar=sum((X*ones(1,k)).*(data))./sum(data);      % 1 by k row vector
prop=mean(data);                                   % 1 by k row vector
r_pb=(X1_bar-X0_bar).*sqrt(prop.*(1-prop))/std(X); % 1 by k row vector
dis=r_pb./sqrt(1-r_pb.^2);
std_rpb=std(r_pb);

2. prior_pred_rpbis

function [std_rpb,std_rpb_obs]=prior_pred_rpbis(data,m)

% This function generates a prior predictive distribution for rpbis.
% Command: [std_rpb_pri,std_rpb_pri_obs]=prior_pred_rpbis(data,500);
n=size(data,1); k=size(data,2);
th=randn(n,m);
bv=randn(m,k);
std_rpb=zeros(m,1);
for i=1:m
    lp=(th(:,i)*ones(1,k))-(ones(n,1)*bv(i,:));  % lp is an n by k matrix
    prob=phi(lp);
    y_rep=(prob>rand(n,k));                      % replicating the data

    % Computing the point-biserial correlations of the replicated data
    [r_pb,stdrpb]=rpbis(y_rep);
    std_rpb(i)=stdrpb;
end

% Computing the observed point-biserial correlations
[r_pb,stdrpb_obs]=rpbis(data);
std_rpb_obs=stdrpb_obs;
hist(std_rpb)

3. post_pred_rpbis

function [std_rpb,std_rpb_obs]=post_pred_rpbis(data,bv,th)

% This function generates a posterior predictive distribution for rpbis.
% Command: [std_rpb_post,std_rpb_post_obs]=post_pred_rpbis(data,bv_bay,th_bay);
n=size(data,1); k=size(data,2); m=size(th,2);
std_rpb=zeros(m,1);
for i=1:m
    lp=(th(:,i)*ones(1,k))-(ones(n,1)*bv(i,:));  % lp is an n by k matrix
    prob=phi(lp);
    % prob=g(lp,link);
    y_rep=(prob>rand(n,k));                      % replicating the data

    % Computing the point-biserial correlations of the replicated data
    [r_pb,stdrpb]=rpbis(y_rep);
    std_rpb(i)=stdrpb;

    if ceil(i/100)==i/100
        i                       % display progress every 100 iterations
    end
end

% Computing the observed point-biserial correlations
[r_pb,stdrpb_obs]=rpbis(data);
std_rpb_obs=stdrpb_obs;
hist(std_rpb)

4. gof_item_G

function [x2L_G,x2L_obs_G,pval_G,pval_glo_G]=gof_item_G(data,av,bv,th,nb,link)

% This function will compute the posterior predictive p-values
% using the G^2 index to measure the goodness of fit of the model
% for each item in the test.
%
% Command: [x2L_G,x2L_obs_G,pval_G,pval_glo_G]=gof_item_G(data,av,bv,th,16,'p');

N=size(data,1);                % no. of examinees
k=size(data,2);                % no. of items
m=size(th,2);                  % no. of simulated values of theta
x2L_G=zeros(m,k);
x2L_obs_G=zeros(m,k);
ma=mean(av); mb=mean(bv); th_m=mean(th')';
ng=nb;
for j=1:m
    lp=(th(:,j)*av(j,:))-(ones(N,1)*bv(j,:));
    prob=g(lp,link);
    y_rep=(prob>rand(N,k));    % replicating the data

    for item=1:k
        output=[y_rep(:,item) prob(:,item) th(:,j)];
        so=sortrows(output,3);                  % so = sorted output

        nperg=round(size(output,1)/ng);         % nperg = no. of points per group
        gs=zeros(ng,3);
        for i=0:(ng-2)
            gs(i+1,1)=sum(so((i*nperg+1):(i+1)*nperg,1));    % obs. values
            gs(i+1,2)=nperg;
            gs(i+1,3)=mean(so((i*nperg+1):(i+1)*nperg,2));   % average prob.
        end
        gs(ng,1)=sum(so((ng-1)*nperg+1:size(output,1),1));
        gs(ng,2)=size(output,1)-(ng-1)*nperg;
        gs(ng,3)=mean(so((ng-1)*nperg+1:size(output,1),2));

        gs2(:,1)=gs(:,1)+(gs(:,1)==0);
        gs3(:,1)=gs(:,1)-(gs(:,1)==gs(:,2));
        x2L_G(j,item)=2*sum(gs(:,1).*log(gs2(:,1)./((gs(:,2).*gs(:,3))))+...
            (gs(:,2)-gs(:,1)).*log((gs(:,2)-gs3(:,1))./(gs(:,2).*(1-gs(:,3)))));

        output=[data(:,item) prob(:,item) th(:,j)];
        so=sortrows(output,3);                  % so = sorted output

        % ng=10;                                % no. of groups
        nperg=round(size(output,1)/ng);         % nperg = no. of points per group
        gs=zeros(ng,3);
        for i=0:(ng-2)
            gs(i+1,1)=sum(so((i*nperg+1):(i+1)*nperg,1));    % obs. values
            gs(i+1,2)=nperg;
            gs(i+1,3)=mean(so((i*nperg+1):(i+1)*nperg,2));   % average prob.
        end
        gs(ng,1)=sum(so((ng-1)*nperg+1:size(output,1),1));
        gs(ng,2)=size(output,1)-(ng-1)*nperg;
        gs(ng,3)=mean(so((ng-1)*nperg+1:size(output,1),2));

        gs2(:,1)=gs(:,1)+(gs(:,1)==0);
        gs3(:,1)=gs(:,1)-(gs(:,1)==gs(:,2));
        x2L_obs_G(j,item)=2*sum(gs(:,1).*log(gs2(:,1)./((gs(:,2).*gs(:,3))))+...
            (gs(:,2)-gs(:,1)).*log((gs(:,2)-gs3(:,1))./(gs(:,2).*(1-gs(:,3)))));

    end
end
pval_G=sum(x2L_G>x2L_obs_G)/m;       % 1 by k vector of Bayesian p-values
x2L_glo_G=sum(x2L_G')';
x2L_glo_obs_G=sum(x2L_obs_G')';
pval_glo_G=sum(x2L_glo_G>x2L_glo_obs_G)/m;

5. gof_item_G_ppp

function [x2L_G,x2L_G_obs,pval_G]=gof_item_G_ppp(data1,data2,nb,m)

% This function will compute the partial posterior predictive p-values
% using the G^2 index to measure the goodness of fit of the model
% for each item in the test.
%
% Command: [x2L_G,x2L_G_obs,pval_G]=gof_item_G_ppp(data1,data2,10,500);

N=size(data1,1); k=size(data1,2);
x2L_G=zeros(m,k);

[bv_bay1,th_bay1,th_m1,th_s1]=pp1_bay(data1,1,m);
av_bay1=ones(m,k);
for iter=1:m
    lp=(th_m1*av_bay1(iter,:))-(ones(N,1)*bv_bay1(iter,:));
    prob=phi(lp);
    y_rep=(prob>rand(N,k));        % replicating the data
    x2L_G(iter,:)=item_G(y_rep,av_bay1,bv_bay1,th_m1,10,'p');
end

[bv_bay1o,th_bay1o,th_m1o,th_s1o]=pp1_bay(data2,1,m);
av_bay1o=ones(m,k);
x2L_G_obs=item_G(data2,av_bay1o,bv_bay1o,th_m1o,10,'p');
pval_G=sum(x2L_G>(ones(m,1)*x2L_G_obs))/m;

6. item_G

function x2L_G=item_G(data,av,bv,th,nb,link)

% This function will compute the G^2 chi-square index to measure the
% goodness of fit of the model to each item in the test for a data set.
%
% Command: x2L_G=item_G(data,av_bay,bv_bay,th_m,10,'p');

N=size(data,1);                % no. of examinees
k=size(data,2);                % no. of items
m=size(th,2);                  % no. of simulated values of theta
x2L_G=zeros(1,k);
ma=mean(av); mb=mean(bv);
ng=nb;
lp=(th*ma)-(ones(N,1)*mb);     % lp is an N by k matrix
prob=g(lp,link);
for item=1:k
    output=[data(:,item) prob(:,item) th];
    so=sortrows(output,3);

    % ng=10;                             % no. of groups
    nperg=round(size(output,1)/ng);      % nperg = no. of points per group
    gs=zeros(ng,3);
    for i=0:(ng-2)
        gs(i+1,1)=sum(so((i*nperg+1):(i+1)*nperg,1));    % observed values
        gs(i+1,2)=nperg;
        gs(i+1,3)=mean(so((i*nperg+1):(i+1)*nperg,2));   % average prob.
    end
    gs(ng,1)=sum(so((ng-1)*nperg+1:size(output,1),1));
    gs(ng,2)=size(output,1)-(ng-1)*nperg;
    gs(ng,3)=mean(so((ng-1)*nperg+1:size(output,1),2));

    gs2(:,1)=gs(:,1)+(gs(:,1)==0);
    gs3(:,1)=gs(:,1)-(gs(:,1)==gs(:,2));
    x2L_G(item)=2*sum(gs(:,1).*log(gs2(:,1)./((gs(:,2).*gs(:,3))))+...
        (gs(:,2)-gs(:,1)).*log((gs(:,2)-gs3(:,1))./(gs(:,2).*(1-gs(:,3)))));
end

7. exa_w

function w=exa_w(data,av_m,bv_m,th_m)

% This function will calculate the W-statistic of Wright and Stone.
% Command: w=exa_w(data,av_bay,bv_bay,th_bay);
n=size(data,1);
p=phi(th_m*av_m-ones(n,1)*bv_m);   % n by k matrix
diff2=(data-p).^2;
prod=p.*(1-p);
num=sum(diff2')';                  % n by 1 vector
den=sum(prod')';                   % n by 1 vector
w=num./den;                        % n by 1 vector

8. exa_l

function l=exa_l(data,av_m,bv_m,th_m)

% This function will calculate the log-likelihood statistic for each examinee.
% Command: l=exa_l(data,mean(av_bay2),mean(bv_bay2),mean(th_bay2));
n=size(data,1);
p=phi(th_m*av_m-ones(n,1)*bv_m);         % n by k matrix
inside=data.*log(p)+(1-data).*log(1-p);  % n by k matrix
l=sum(inside')';                         % n by 1 vector

9. gof_exa_l

function [x2L_l,x2L_obs_l,pval_l,pval_glo_l]=gof_exa_l(data,av,bv,th)

% Command: [x2L_l,x2L_obs_l,pval_l,pval_glo_l]=gof_exa_l(data,av_b,bv_b,th_b);
n=size(data,1);                % no. of examinees
k=size(data,2);                % no. of items
m=size(th,2);                  % no. of simulated values from the post. dist. of theta
x2L_l=zeros(n,m);
x2L_obs_l=zeros(n,m);
ma=mean(av); mb=mean(bv); th_m=mean(th')';
for kk=1:m
    lp=(th_m*av(kk,:))-(ones(n,1)*bv(kk,:));   % lp is an n by k matrix
    prob=phi(lp);
    y_rep=(prob>rand(n,k));                    % replicating the data
    x2L_l(:,kk)=exa_l(y_rep,av(kk,:),bv(kk,:),th_m);
    x2L_obs_l(:,kk)=exa_l(data,av(kk,:),bv(kk,:),th_m);
end
% p-values (the tail of this listing is reconstructed by analogy with gof_item_G)
pval_l=(sum((x2L_l<x2L_obs_l)')')/m;           % n by 1 vector of Bayesian p-values
x2L_glo_l=sum(x2L_l);
x2L_glo_obs_l=sum(x2L_obs_l);
pval_glo_l=sum(x2L_glo_l<x2L_glo_obs_l)/m;

B.4 Chapter 5 codes

1. laplace

function [mode,var,log_int]=laplace(logpost,start,iter,par)

% LAPLACE summarizes a posterior density by the Laplace method.
% [MODE,VAR,LOG_INT]=LAPLACE(LOGPOST,START,ITER,PAR) returns the mode
% MODE, the variance-covariance matrix VAR, and estimate LOG_INT of log of
% integral of posterior density, where LOGPOST is the function containing
% the definition of the log posterior, START is the initial guess at the mode,
% ITER is the number of iterations in the Newton-Raphson algorithm, and PAR
% is the vector of associated parameters in the function definition.
% Command: [mode,var,log_int]=laplace('logpost1',-.5,5,par);
mode=start;
for i=1:iter
    [mode,var,log_int]=nr(logpost,mode,par); mode   % display the current mode
end

function [new,h,int]=nr(f,old,par)

% Newton-Raphson step
% [new,h,int]=nr(f,old,par)
% where f is the definition of the log density defined at a vector of values,
% old is the guess at the mode, and par is the vector of associated parameters;
% output is the new guess, variance estimate, and estimate of the log-integral
p=length(old);
h=-inv(f2(f,old,par));
new=old+f1(f,old,par)*h;
int=p/2*log(2*pi)+.5*log(det(h))+feval(f,new,par);

function val=f1(f,x,par)
% first derivative by central differences
h=.0001; val=[]; s=[-h/2;h/2];
x2=ones(2,1)*x;
for i=1:length(x)
    y=x2; y(:,i)=y(:,i)+s;
    v=diff(feval(f,y,par))/h;
    val=[val v];
end

function val=f2(f,x,par)
% second derivative matrix by central differences
h=.0001; n=length(x);
val=zeros(n); s=[-h;0;h];
x2=ones(3,1)*x;
for i=1:n
    y=x2; y(:,i)=y(:,i)+s;
    t=feval(f,y,par);
    val(i,i)=(t(1)-2*t(2)+t(3))/h^2;
end
s=[h/2 h/2; -h/2 h/2; h/2 -h/2; -h/2 -h/2];
x2=ones(4,1)*x;
for i=1:n
    for j=(i+1):n
        y=x2; y(:,[i j])=y(:,[i j])+s;
        t=feval(f,y,par);
        v=(t(1)-t(2)-t(3)+t(4))/h^2;
        val(i,j)=v; val(j,i)=v;
    end
end

2. logpost_beta

function val=logpost_beta(theta,par)

% Computes the log posterior of the beta-binomial density
% beta(theta,k), where theta=log(mu/(1-mu)), mu=mean;
% the prior on mu is uniform(0,1); k is known.
d=length(par);
y=par(:,1); n=par(:,2);
v=7;
mu=exp(theta)./(1+exp(theta));
val2=theta(:,1)*zeros(1,d);
for i=1:d
    val2(:,i)=betaln(mu*v+y(i),v-v*mu+n(i)-y(i))-betaln(mu*v,v-mu*v);
end
val=sum(val2')';

3. sim_betabin

function [BF,log_BF,bf,log_bf,P_s,p_s]=sim_betabin(N,m)

% This program will generate two data sets: (1) from a simple binomial model
% and (2) from the beta-binomial model. Both data sets will have N observations.
%
% Command: [BF,log_BF,bf,log_bf,P_s,p_s]=sim_betabin(20,100);
%
% Input:  N  = no. of observations.
%         m  = no. of iterations.
% Output: BF = m Bayes factors of beta-binomial data.
%         bf = m Bayes factors of binomial data.

BF=zeros(m,1); log_BF=BF;
bf=zeros(m,1); log_bf=bf;
P_s=zeros(N,m);                % matrix of success prob. for all iterations
p_s=zeros(1,m);                % vector of success prob. for all iterations
for i=1:m
    P=betarnd(5,2,1,N);        % values of the parameters in the beta-binomial
    p=betarnd(5,2);            % value of the parameter in the binomial model
    n=ceil(30+5.*abs(randn(1,N)));

    Y=binornd(n,P);            % observations from the beta-binomial model
    y=binornd(n,p);            % observations from the simple binomial model

    Par=[Y;n]'; par=[y;n]';
    P_s(:,i)=P'; p_s(i)=p';

    [Mode,Var,Log_Int]=laplace('logpost_beta',0,5,Par);
    Log_Den=betaln(sum(Y),sum(n-Y));

    [mode,var,log_int]=laplace('logpost_beta',0,5,par);
    log_den=betaln(sum(y),sum(n-y));

    BF(i)=exp(Log_Int - Log_Den);
    log_BF(i)=log(BF(i))/log(10);

    bf(i)=exp(log_int - log_den);
    log_bf(i)=log(bf(i))/log(10);
end

4. post_sim

function [mu, accept]=post_sim(data,m,burn,s)

% Simulates mu from its marginal posterior distribution
% using the Metropolis-Hastings algorithm.
%
% Command: [mu, accept]=post_sim(par,2000,500,.35);
k=7;
% k=exp(12);
y=data(:,1); n=data(:,2);
theta=zeros(m+burn,1);
mu=.5; i=2;
theta(1)=log(mu/(1-mu));
theta(2)=log(mu/(1-mu));
accept=0;
while i <= (m+burn)

    theta_cand=theta(i)+s*randn;            % candidate on the logit scale
    mu=exp(theta_cand)/(1+exp(theta_cand));
    a=mu*k+y; b=k*(1-mu)+(n-y);
    prob_num=(sum(betaln(a,b)-betaln(mu*k,k*(1-mu))));

    mu2=exp(theta(i))/(1+exp(theta(i)));    % current value of mu
    a2=mu2*k+y; b2=k*(1-mu2)+(n-y);
    prob_den=(sum(betaln(a2,b2)-betaln(mu2*k,k*(1-mu2))));

    prob=exp(prob_num-prob_den);

    if rand < prob
        theta(i+1)=theta_cand;
        accept=accept+1;
    else
        theta(i+1)=theta(i);
    end
    i=i+1;
end
theta=theta((burn+1):(burn+m));
mu=exp(theta)./(1+exp(theta));
accept=accept/m;

5. pphn_bay

function [av,mu_a,tau,bv,thv,th_m,th_s]=pphn_bay(y,nu,s_b,m)

% pphn_bay - fits an exchangeable probit item response model of the form
%
%     p_ij = phi(a_j t_i - b_j)
%
% where the a_j are iid N(mu_a, tau), with a hyperprior on (mu_a, tau),
% and the b_j are iid N(0, s_b).
%
% command: [av_bay,mu_a,tau,bv_bay,th_bay,th_m,th_s]=pphn_bay(data,[1,1],1,500);
%
% input:  y   - binary data matrix where rows are subjects
%               and columns are items
%         nu  - hyperparameters [nu1,nu2] of the tau prior
%         s_b - prior standard deviation of the b_j
%         m   - number of iterations (default is 500)
%
% output: av  - m by k matrix of simulated values of a_j
%               (each row is a simulated vector)
%         bv  - m by k matrix of simulated values of b_j
%         thv - n by m matrix of simulated values of t_i
%         th_m, th_s - vectors of means and standard deviations of the t_i
if nargin==3, m=500; end      % default is 500 Gibbs cycles
s=size(y); n=s(1); k=s(2);
burn=100;                     % discard the first 100 iterations
mu=0; var=1;                  % hyperparameters of theta prior
nu1=nu(1); nu2=nu(2);         % hyperparameters of tau prior
a=2*ones(1,k);                %
phat=(sum(y)+.5)/(n+1);       % initial estimates
b=-phiinv(phat)*sqrt(5);      %
th=zeros(n,1);                %
t=1; m_a=1;
av=zeros((m+burn),k); bv=av;  %
thv=zeros(n,(m+burn));        % set up storage
mu_a=zeros(m+burn,1); tau=zeros(m+burn,1);
for kk=1:(m+burn)             % MAIN ITERATION LOOP

    lp=th*a-ones(n,1)*b;                   % lp is n by k
    bb=phi(-lp);                           % simulate latent
    u=rand(n,k);                           % data z
    tt=(bb.*(1-y)+(1-bb).*y).*u+bb.*y;     % tt is n by k
    z=phiinv(tt)+lp;                       % z is n by k

    v=1/sum(a.^2);                         %
    pvar=1/(1/v+1/var);                    % simulate theta
    mn=sum(((ones(n,1)*a).*(z+ones(n,1)*b))')';  % assuming N(mu,var) prior
    pmean=(mn+mu/var)*pvar;                %
    th=randn(n,1)*sqrt(pvar)+pmean;        % th is n by 1

    x=[th -ones(n,1)];                     %
    pm=[m_a 0]';
    pp=[1/t^2 0;0 1/s_b^2];                % prior precision matrix
    amat=chol(inv(x'*x+pp));               % (serves like the std)
    bz=(x'*x+pp)\(x'*z+pp*pm*ones(1,k));   % simulate (a, b)
    beta=amat'*randn(2,k)+bz;              %
    a=beta(1,:); b=beta(2,:);

    m_a=randn*(t/sqrt(k))+mean(a);
    b1=nu2+.5*sum((a-m_a).^2);
    a1=nu1+k/2;
    t=sqrt(b1/gamrnd(a1,1));

    mu_a(kk)=m_a; tau(kk)=t;
    av(kk,:)=a;                            %
    bv(kk,:)=b;                            % store simulated values
    thv(:,kk)=th;                          %
end
mu_a=mu_a((burn+1):(burn+m));
tau=tau((burn+1):(burn+m));
av=av((burn+1):(burn+m),:);
bv=bv((burn+1):(burn+m),:);
thv=thv(:,(burn+1):(burn+m));
th_m=mean(thv')';             % compute mean and standard
th_s=std(thv')';              % deviation of the t_i

6. pphn2_bay

function [av,mu_a,bv,thv,th_m,th_s]=pphn2_bay(y,m_a,s_b,m)

% pphn2_bay - fits an exchangeable probit item response model of the form
%
%     p_ij = phi(a_j t_i - b_j)
%
% where the a_j are iid N(mu_a, t), the b_j are iid N(0, s_b).
%
% command: [av_bay,mu_a,bv_bay,th_bay,th_m,th_s]=pphn2_bay(data,1,1,500);
%
% input:  as in pphn_bay, except that m_a is the prior mean of the a_j
% output: av  - m by k matrix of simulated values of a_j
%               (each row is a simulated vector)
%         bv  - m by k matrix of simulated values of b_j
%         thv - n by m matrix of simulated values of t_i
%         th_m, th_s - vectors of means and standard deviations of the t_i
%
% Note : the second hyperparameter t (playing the role of s_a) is fixed.
if nargin==3, m=500; end      % default is 500 Gibbs cycles
s=size(y); n=s(1); k=s(2);
burn=100;                     % discard the first 100 iterations
mu=0; var=1;                  % hyperparameters of theta prior
a=2*ones(1,k);                %
phat=(sum(y)+.5)/(n+1);       % initial estimates
b=-phiinv(phat)*sqrt(5);      %
th=zeros(n,1);                %
% t=0.01;                     % One-parameter
% t=10;                       % Two-parameter
t=0.25;                       % Exchangeable
av=zeros((m+burn),k); bv=av;  % set up storage
thv=zeros(n,(m+burn));
mu_a=zeros(m+burn,1);         %
for kk=1:(m+burn)             % MAIN ITERATION LOOP

    lp=th*a-ones(n,1)*b;                   % lp is n by k
    bb=phi(-lp);                           % simulate latent
    u=rand(n,k);                           % data z
    tt=(bb.*(1-y)+(1-bb).*y).*u+bb.*y;     % tt is n by k
    z=phiinv(tt)+lp;                       % z is n by k

    v=1/sum(a.^2);                         %
    pvar=1/(1/v+1/var);                    % simulate theta
    mn=sum(((ones(n,1)*a).*(z+ones(n,1)*b))')';  % assuming N(mu,var) prior
    pmean=(mn+mu/var)*pvar;                %
    th=randn(n,1)*sqrt(pvar)+pmean;        % th is n by 1

    x=[th -ones(n,1)];                     %
    pm=[m_a 0]';
    pp=[1/t^2 0;0 1/s_b^2];                % prior precision matrix
    amat=chol(inv(x'*x+pp));               % (serves like the std)
    bz=(x'*x+pp)\(x'*z+pp*pm*ones(1,k));   % simulate (a, b)
    beta=amat'*randn(2,k)+bz;              %
    a=beta(1,:); b=beta(2,:);

    m_a=randn*(t/sqrt(k))+mean(a);

    mu_a(kk)=m_a;
    av(kk,:)=a;                            %
    bv(kk,:)=b;                            % store simulated values
    thv(:,kk)=th;                          %
end
mu_a=mu_a((burn+1):(burn+m));
av=av((burn+1):(burn+m),:);
bv=bv((burn+1):(burn+m),:);
thv=thv(:,(burn+1):(burn+m));
th_m=mean(thv')';             % compute mean and standard
th_s=std(thv')';              % deviation of the t_i

7. comp_bf_one_exch

function BF=comp_bf_one_exch(data,av,bv,th,mu_a)

% This function computes the Bayes factor of the one-parameter IRT probit model
% versus the exchangeable IRT probit model using MCMC.
%
% Command: BF=comp_bf_one_exch(data,av_bay,bv_bay,th_bay,mu_a);
t1=.01; t2=.25;
y=data;
k=size(y,2); n=size(y,1); m=size(av,1);
log_num=zeros(m,1); log_den=zeros(m,1);
for kk=1:m
    log_num(kk)=sum(log(normpdf(av(kk,:),mu_a(kk),t1)));
    log_den(kk)=sum(log(normpdf(av(kk,:),mu_a(kk),t2)));
end

BF=mean(exp(log_den-log_num));

8. bf_exch_out_a

function BF=bf_exch_out_a(data,av,mu_a)

% This function computes the Bayes factor of the mixture-prior exchangeable
% model versus the ordinary exchangeable model using MCMC.
%
% Command: BF=bf_exch_out_a(data,av_bay,mu_a);
t2=0.25; K=3;
k=size(data,2); n=size(data,1); m=size(av,1);
log_num=zeros(m,1); log_den=zeros(m,1);
BF=zeros(k,1);
for item=1:k
    for kk=1:m
        log_num(kk)=(log(normpdf(av(kk,item),mu_a(kk),K*t2)));
        log_den(kk)=(log(normpdf(av(kk,item),mu_a(kk),t2)));
    end
    BF(item)=mean(exp(log_num-log_den));
end

9. pphn_out_bay

function [av,mu_a,tau,bv,thv,th_m,th_s,gamv]=pphn_out_bay(y,nu,s_b,m)

% command: [av_bay,mu_a,tau,bv_bay,th_bay,th_m,th_s,gamv]=pphn_out_bay(data,[1,1],1,1000);
if nargin==3, m=500; end      % default is 500 Gibbs cycles
s=size(y); n=s(1); k=s(2);
burn=100;                     % discard the first 100 iterations
mu=0; var=1;                  % hyperparameters of theta prior
nu1=nu(1); nu2=nu(2);         % hyperparameters of tau prior
a=2*ones(1,k);                %
phat=(sum(y)+.5)/(n+1);       % initial estimates
b=-phiinv(phat)*sqrt(5);      %
th=zeros(n,1);                %
t=1; m_a=1; K=3; p=.9;
gamv=zeros((m+burn),k);       %
av=zeros((m+burn),k); bv=av;  % set up storage
thv=zeros(n,(m+burn));        %
mu_a=zeros(m+burn,1); tau=zeros(m+burn,1);
for kk=1:(m+burn)             % MAIN ITERATION LOOP

    lp=th*a-ones(n,1)*b;                   % lp is n by k
    bb=phi(-lp);                           % simulate latent
    u=rand(n,k);                           % data z
    tt=(bb.*(1-y)+(1-bb).*y).*u+bb.*y;     % tt is n by k
    z=phiinv(tt)+lp;                       % z is n by k

    v=1/sum(a.^2);                         %
    pvar=1/(1/v+1/var);                    % simulate theta
    mn=sum(((ones(n,1)*a).*(z+ones(n,1)*b))')';  % assuming N(mu,var) prior
    pmean=(mn+mu/var)*pvar;                %
    th=randn(n,1)*sqrt(pvar)+pmean;        % th is n by 1

    x=[th -ones(n,1)];                     %
    pm=[m_a 0]';
    pp=[1/t^2 0;0 1/s_b^2];                % prior precision matrix
    amat=chol(inv(x'*x+pp));               % (serves like the std)
    bz=(x'*x+pp)\(x'*z+pp*pm*ones(1,k));   % simulate (a, b)
    beta=amat'*randn(2,k)+bz;              %
    a=beta(1,:); b=beta(2,:);

    d1=normpdf(a,m_a,t);
    d2=normpdf(a,m_a,t*K);
    probs=p*d1./(p*d1+(1-p)*d2);
    gam=(rand(1,k)>probs);     % 1 by k vector; gam(j)=1 assigns item j to the
                               % inflated-variance component

    n1=sum(gam==0); n2=sum(gam==1);
    temp1=find(gam==0); temp2=find(gam==1);
    a1=a(temp1); a2=a(temp2);
    p=betarnd(n1+1,n2+1);

    s=sum(a1/t^2)+sum(a2/(K^2*t^2));
    f=(n1+n2/K^2)/(t^2);
    m_a=normrnd(s/f,sqrt(1/f));

    b1=nu2+.5*sum((a1-m_a).^2)+(sum((a2-m_a).^2))/(2*K^2);
    a1=nu1+k/2;
    t=sqrt(b1/gamrnd(a1,1));

    gamv(kk,:)=gam;
    mu_a(kk)=m_a; tau(kk)=t;
    av(kk,:)=a;                            %
    bv(kk,:)=b;                            % store simulated values
    thv(:,kk)=th;                          %
end
gamv=gamv((burn+1):(burn+m),:);
mu_a=mu_a((burn+1):(burn+m));
tau=tau((burn+1):(burn+m));
av=av((burn+1):(burn+m),:);
bv=bv((burn+1):(burn+m),:);
thv=thv(:,(burn+1):(burn+m));
th_m=mean(thv')';             % compute mean and standard
th_s=std(thv')';              % deviation of the t_i

REFERENCES

[1] Agresti, A. (2002), Categorical Data Analysis, New York: Wiley.

[2] Aitkin, M. (1991), “Posterior Bayes Factors,” Journal of the Royal Statistical Society,

Series B, 53, 111 − 142.

[3] Albert, J.H. (1992), “Bayesian Estimation of Normal Ogive Item Response Curves Using

Gibbs Sampling,” Journal of Educational Statistics, 17, 261 − 269.

[4] Albert, J.H. (1992), “A Bayesian Analysis of a Poisson Random Effects Model for Home

Run Hitters,” The American Statistician, 46, 246 − 253.

[5] Albert, J.H. (1996), “Bayesian selection of log-linear models,” The Canadian Journal

of Statistics, 24, 327 − 347.

[6] Albert, J.H. (1996), “Criticism of a Hierarchical Model Using Bayes Factors,” Statistics

in Medicine, 18, 287 − 305.

[7] Albert, J.H. and Chib, S. (1993), “Bayesian Analysis of Binary and Polychotomous

Response Data,” Journal of the American Statistical Association, 88, 669 − 679.

[8] Albert, J.H. and Chib, S. (1995), “Bayesian Residual Analysis for Binary Response

Regression Models,” Biometrika, 82, 747 − 759.

[9] Albert, J.H. and Chib, S. (1997), “Bayesian Tests and Model Diagnostics in Conditionally

Independent Hierarchical Models,” Journal of the American Statistical Association,

92, 916 − 925.

[10] Albert, J.H. and Johnson, V. E. (1999), Ordinal Data Modeling, New York: Springer-

Verlag.

[11] Baker, F.B. (2001), The Basics of Item Response Theory, 2nd ed., New York: ERIC

Clearinghouse on Assessment and Evaluation.

[12] Baker, F.B. and Kim, S.-H. (2004), Item Response Theory: Parameter Estimation Techniques, 2nd ed., New York: Marcel Dekker.

[13] Bayarri, M.J. and Berger, J. (1997), “Measures of Surprise in Bayesian Analysis,” ISDS

Discussion Paper 97-46, Duke University.

[14] Bayarri, M.J. and Berger, J. (1998), “Robust Bayesian Analysis of Selection Models,”

The Annals of Statistics, 26, 645 − 659.

[15] Bayarri, M.J. and Berger, J. (1999), “Quantifying surprise in the data and model verification,” in Bayesian Statistics 6, eds. J.M. Bernardo, J.O. Berger, A.P. Dawid and

A.F.M. Smith, London: Oxford University Press, pp. 53-82.

[16] Bayarri, M.J. and Morales, J. (2003), “Bayesian measures of surprise for outlier detection,” Journal of Statistical Planning and Inference, 111, 3 − 22.

[17] Bedrick, E.J., Christensen, R., and Johnson, W. (1996), “A New Perspective on Priors

for Generalized Linear Models,” Journal of the American Statistical Association, 91,

1450 − 1460.

[18] Berger, J.O. and Pericchi, L.R. (1996), “The Intrinsic Bayes Factor for Model Selection

and Prediction,” Journal of the American Statistical Association, 91, 109 − 122.

[19] Box, G.E.P. (1980), “Sampling and Bayes’ Inference in Scientific Modelling and Robustness,” Journal of the Royal Statistical Society, Series A, 143, 383 − 430.

[20] Chen, M.-H., Shao, Q.-M., and Ibrahim, J.G. (2000), Monte Carlo Methods in Bayesian

Computation, New York: Springer-Verlag.

[21] Cheng, K.F. and Chen, L.C. (2004), “Testing Goodness-of-fit of a Logistic Regression

Model with Case-Control Data,” Journal of Statistical Planning and Inference, 124,

409 − 422.

[22] Collett, D. (2003), Modelling Binary Data, 2nd ed., London: Chapman & Hall.

[23] Dey, D.K., Ghosh, S.K., and Mallick, B.K. editors (2000), Generalized Linear Models:

A Bayesian Perspective, New York: Marcel Dekker.

[24] DiCiccio, T.J., Kass, R.E., Raftery, A., and Wasserman, L. (1997), “Computing Bayes

Factors by Combining Simulation and Asymptotic Approximation,” Journal of the

American Statistical Association, 92, 903 − 915.

[25] Dobson, A.J. (1990), An Introduction to Generalized Linear Models, New York: Chapman & Hall.

[26] Embretson, S.E., and Reise, S.P. (2000), Item Response Theory for Psychologists, Mahwah, NJ: Lawrence Erlbaum Associates.

[27] Ferrando, P.J. and Lorenzo-Seva, U. (2001), “Checking the Appropriateness of IRT

Models by Predicting the Distribution of Observed Scores: The Program EO-Fit,”

Educational and Psychological Measurement, 61, 895 − 902.

[28] Fischer, G.H. and Molenaar, I.W., editors (1995), Rasch Models: Foundations, Recent

Developments, and Applications, New York: Springer-Verlag.

[29] Gamerman, D. (1997), Markov Chain Monte Carlo: Stochastic Simulation for

Bayesian Inference, New York: Chapman & Hall.

[30] Gelfand, A.E. and Smith, A.F.M. (1990), “Sampling-Based Approaches to Calculating Marginal Densities,” Journal of the American Statistical Association, 85, 398 − 409.

[31] Gelman, A., Meng, X.L., and Stern, H. (1996), “Posterior Predictive Assessment of

Model Fitness Via Realized Discrepancies,” Statistica Sinica, 6, 733 − 807.

[32] Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (2004), Bayesian Data Analysis,

London: Chapman & Hall.

[33] Glas, C.A.W. and Falcon, J.C.S. (2003), “A Comparison of Item-Fit Statistics for the

Three-Parameter Logistic Model,” Applied Psychological Measurement, 27, 87 − 106.

[34] Glas, C.A.W. and Meijer, R.R. (2003), “A Bayesian Approach to Person Fit Analysis

in Item Response Theory Models,” Applied Psychological Measurement, 27, 217 − 233.

[35] Ghosal, S. (1997), “Normal Approximation to the Posterior Distribution in Generalized

Linear Models with many Covariates,” Mathematical Methods of Statistics, 6, 332 − 348.

[36] Ghosh, M., Ghosh A., Chen, M-H., and Agresti, A. (2000), “Bayesian Estimation for

Item Response Models,” Journal of Statistical Planning and Inference, 88, 99 − 115.

[37] Hambleton, R.K., and Swaminathan, H. (1985), Item Response Theory - Principles and

Applications, Hingham, MA: Kluwer, Nijhoff.

[38] Hendrawan, I., Glas, C.A.W., and Meijer, R.R. (2005), “The Effect of Person Misfit on Classification Decisions,” Applied Psychological Measurement, 29, 26 − 44.

[39] Hulin, C.L., Drasgow, F., and Parsons, C.K. (1983), Item Response Theory - Applications

to Psychological Measurement, Homewood, IL: Dow Jones-Irwin.

[40] Kass, R.E., and Raftery, A.E. (1995), “Bayes Factors,” Journal of the American Statistical Association, 90, 773 − 795.

[41] Lord, F.M. (1980), Application of Item Response Theory to Practical Testing Problems,

Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

[42] Lynch, S.M. and Western, B. (2004), “Bayesian Posterior Predictive Checks for Complex

Models,” Sociological Methods and Research, 32(3), 301 − 335.

[43] McCullagh, P., and Nelder, J.A. (1989), Generalized Linear Models, London: Chapman

& Hall.

[44] Meijer, R.R. and Sijtsma, K. (2001), “Methodology Review: Evaluating Person Fit,”

Applied Psychological Measurement, 25, 107 − 135.

[45] Orlando, M. and Thissen, D. (2000), “Likelihood-Based Item-Fit Indices for Dichotomous Item Response Theory Models,” Applied Psychological Measurement, 24, 50 − 64.

[46] Pettit, L.I. (1986), “Diagnostics in Bayesian Model Choice,” The Statistician, 35, 183 − 190.

[47] Pettit, L.I. (1988), “Bayes Methods for Outliers in Exponential Samples,” Journal of the Royal Statistical Society, Series B, 50, 371 − 380.

[48] Pettit, L.I. (1992), “Bayes Factors for Outlier Models Using the Device of Imaginary

Observations,” Journal of the American Statistical Association, 87, 541 − 545.

[49] Press, S.J. (2003), Subjective and Objective Bayesian Statistics: Principles, Models, and

Applications, 2nd ed., New Jersey: John Wiley and Sons.

[50] Raftery, A.E. (1998), “Bayes Factors and BIC: Comment on Weakliem,” Technical Report no. 347, Department of Statistics, University of Washington.

[51] Rasch, G. (1966), “An Item Analysis which takes Individual Differences into Account,”

British Journal of Mathematical and Statistical Psychology, 19, 49 − 57.

[52] Rubin, D.B. (1984), “Bayesianly Justifiable and Relevant Frequency Calculations for

the Applied Statistician,” Annals of Statistics, 12, 1151 − 1172.

[53] Sahu, S.K. (2002), “Bayesian Estimation and Model Choice in Item Response Models,”

Journal of Statistical Computation and Simulation, 72(3), 217 − 232.

[54] Stone, C.A. (2004), “IRTFIT RESAMPLE: A Computer Program for Assessing Goodness of Fit of Item Response Theory Models Based on Posterior Expectations,” Applied

Psychological Measurement, 28, 143 − 144.

[55] Swaminathan, H., Hambleton, R.K., Sireci, S.G., Xing, D., and Rizavi, S.M. (2003),

“Small Sample Estimation in Dichotomous Item Response Models: Effect of Priors

Based on Judgmental Information on the Accuracy of Item Parameter Estimates,” Applied Psychological Measurement, 27, 27 − 51.

[56] Van der Linden, W.J., and Hambleton, R.K., editors (1997), Handbook of Modern Item

Response Theory, New York: Springer-Verlag.

[57] Wasserman, L. and Verdinelli, I. (1991), “Bayesian Analysis of Outlier Problems Using

the Gibbs Sampler,” Statistics and Computing, 1, 105 − 117.

[58] Zhang, B. (1999), “A Chi-squared Goodness-of-fit Test for Logistic Regression Models

Based on Case-Control Data,” Biometrika, 86, 531 − 539.