RATER DRIFT IN CONSTRUCTED RESPONSE SCORING VIA LATENT CLASS SIGNAL DETECTION THEORY AND ITEM RESPONSE THEORY

Yoon Soo Park

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy under the Executive Committee of the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
2011

© 2011 Yoon Soo Park
All Rights Reserved

ABSTRACT

RATER DRIFT IN CONSTRUCTED RESPONSE SCORING VIA LATENT CLASS SIGNAL DETECTION THEORY AND ITEM RESPONSE THEORY

Yoon Soo Park

The use of constructed response (CR) items or performance tasks to assess test takers’ ability has grown tremendously over the past decade. Examples of CR items in psychological and educational measurement range from essays and works of art to admissions interviews. However, unlike multiple-choice (MC) items, which have predetermined options, CR items require test takers to construct their own answer. Scoring them therefore requires the judgment of multiple raters, who are subject to differences in perception and in prior knowledge of the material being evaluated. As with any scoring procedure, the scores assigned by raters must be comparable over time and across different test administrations and forms; in other words, scores must be reliable and valid for all test takers, regardless of when an individual takes the test.

This study examines how longitudinal patterns or changes in rater behavior affect model-based classification accuracy. Rater drift refers to changes in rater behavior across test administrations, and prior research has found evidence of such drift. Rater behavior in CR scoring is examined using two measurement models: the latent class signal detection theory (SDT) model and item response theory (IRT) models. Rater effects (e.g., leniency and strictness) are examined in part through simulations that study how well different models capture changes in rater behavior. Drift is also examined in two real-world large-scale tests, a teacher certification test and a high school writing test. These tests use the same set of raters over long periods of time, so each rater’s scoring can be examined on a monthly basis.

Results from the empirical analysis showed that rater models were effective in detecting changes in rater behavior across test administrations in real-world data. However, the latent class SDT and IRT models differed in their estimates of rater discrimination. Simulations were used to examine the effect of rater drift on classification accuracy and on differences between the latent class SDT and IRT models. Changes in rater severity had only a minimal effect on classification accuracy, whereas changes in rater discrimination had a greater effect. This study also found that IRT models detected changes in rater severity and in rater discrimination even when data were generated from the latent class SDT model. However, when data were non-normal, IRT models underestimated rater discrimination, which may lead to incorrect inferences about the precision of raters. These findings provide new and important insights into CR scoring and the issues that emerge in practice, including methods for improving rater training.
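To make the abstract's setup concrete, the following is a minimal, hypothetical sketch (in Python) of the latent class SDT view of rating: a rater perceives a noisy signal centered at the discrimination d times the latent class and cuts it with ordered criteria, and shifting all criteria downward mimics leniency drift. The parameter values, class proportions, and the agreement measure below are illustrative assumptions, not estimates or procedures from the study.

```python
# Illustrative latent class SDT rater (a DeCarlo-style parameterization is
# assumed); every numeric value below is made up for demonstration only.
import numpy as np

rng = np.random.default_rng(0)

K = 4                                          # scoring categories / latent classes
class_sizes = np.array([0.2, 0.3, 0.3, 0.2])   # hypothetical class proportions
d = 1.5                                        # rater discrimination
criteria = np.array([2.0, 3.5, 5.0])           # ordered response criteria

def rate(latent_class: int, d: float, criteria: np.ndarray) -> int:
    """Score 1..K: a noisy perception of the latent class, cut by the criteria."""
    perception = d * latent_class + rng.standard_normal()
    return int(np.searchsorted(criteria, perception)) + 1

# One administration, then a "drifted" one in which the rater becomes more
# lenient (all criteria shift down), analogous to a severity-drift condition.
true_class = rng.choice(np.arange(1, K + 1), size=5000, p=class_sizes)
scores_t1 = np.array([rate(c, d, criteria) for c in true_class])
scores_t2 = np.array([rate(c, d, criteria - 0.5) for c in true_class])

# Exact agreement with the true class, a crude stand-in for the model-based
# classification accuracy examined in the study.
print("agreement, time 1:", (scores_t1 == true_class).mean())
print("agreement, time 2 (lenient drift):", (scores_t2 == true_class).mean())
```

Under this parameterization, lowering d blurs the perception distributions of adjacent classes into one another, which illustrates why drift in discrimination can degrade classification more than a uniform shift in severity.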
TABLE OF CONTENTS

Chapter I: INTRODUCTION .......... 1
    1.1 Statement of the Problem .......... 2
        Rater Drift .......... 3
        Models Used for CR Scoring .......... 4
    1.2 Purpose of the Study .......... 6

Chapter II: LITERATURE REVIEW .......... 10
    2.1 Rater Drift .......... 10
        Studies on Rater Drift .......... 11
        Studies on Rater Drift with Efforts to Reduce Rater Effects .......... 14
    2.2 Incomplete Designs .......... 17
    2.3 Latent Class Signal Detection Theory (SDT) Model .......... 19
    2.4 Item Response Theory (IRT) Models .......... 23
        Graded Response Model .......... 23
        Partial Credit Model and Generalized Partial Credit Model .......... 24
        FACETS Model .......... 25

Chapter III: METHODS .......... 28
    3.1 Empirical Study .......... 28
    3.2 Simulation Study .......... 31
        Study 1: Examining Changes in Classification Accuracy due to Rater Drift .......... 32
        Study 2: Detecting Drift using Rater Models .......... 35

Chapter IV: RESULTS .......... 40
    4.1 Empirical Study: Teacher Certification Test .......... 40
        Rater Effects .......... 41
        Rater Discrimination .......... 47
        Latent Class Sizes and Classification Accuracy .......... 52
    4.2 Empirical Study: High School Writing Test .......... 55
        Rater Effects .......... 55
        Rater Discrimination .......... 61
        Latent Class Sizes and Classification Accuracy .......... 66
    4.3 Simulation Study 1: Examining Changes in Classification Accuracy due to Rater Drift .......... 69
        Classification Accuracy: Rater Effects .......... 69
        Classification Accuracy: Rater Discrimination .......... 71
    4.4 Simulation Study 2: Detecting Drift using Rater Models .......... 71
        Detecting Drift using the GR Model .......... 72
        Effect on IRT Parameters for Normal and Non-Normal Class Sizes .......... 75
    4.5 Parameter Recovery: Rater Parameters, Latent Class Sizes, and Standard Errors from the Latent Class SDT Model .......... 81
        Rater Parameters and Latent Class Sizes .......... 82
        Standard Errors .......... 85

Chapter V: SUMMARY AND DISCUSSION .......... 87
    5.1 Summary .......... 87
    5.2 Discussion .......... 90
    5.3 Limitations and Future Research .......... 94

REFERENCES .......... 96

Appendices
    Appendix A: Mean Parameter Estimates and Standard Errors of Study II .......... 102
    Appendix B: Parameter Estimates, Bias, Percent Bias, and MSE .......... 111
    Appendix C: Evaluation of the Estimated Standard Errors for d and the Latent Class Sizes .......... 141

LIST OF TABLES

Table 1. Unbalanced Incomplete Design, 10 Rater Pairs .......... 18
Table 2. Conditions for Study of Rater Drift .......... 33
Table 3. Conditions for Drift in Rater Discrimination .......... 34
Table 4. Conditions for Differences in Latent Class Sizes over Two Scoring Occasions .......... 36
Table 5. Regression Results to Summarize Parameter Estimates in Rater Effects .......... 47
Table 6. Mean Rater Discrimination for each Administration .......... 48
Table 7. Regression Results to Summarize Parameter Estimates in Rater Discrimination .......... 52
Table 8. Regression Results to Summarize Parameter Estimates in Rater Effects .......... 60
Table 9. Mean Rater Discrimination for each Month .......... 61
Table 10. Regression Results to Summarize Parameter Estimates in Rater Discrimination .......... 66
Table 11. Classification Accuracy due to Drift .......... 70

LIST OF FIGURES

Figure 1. A Representation of SDT for Scoring Categories 1 to 4 .......... 20
Figure 2. Teacher Certification Test: Plot of Individual Rater’s Relative Criteria (LC-SDT) .......... 42
Figure 3. Teacher Certification Test: Plot of Individual Rater’s Location (GR) .......... 43
Figure 4. Teacher Certification Test: Plot of Individual Rater’s Location (GPC) .......... 44
Figure 5. Teacher Certification Test: Plot of Individual Rater’s Discrimination (LC-SDT) .......... 49
Figure 6. Teacher Certification Test: Plot of Individual Rater’s Discrimination (GR) .......... 50
Figure 7. Teacher Certification Test: Plot of Individual Rater’s Discrimination (GPC) .......... 51
Figure 8. Teacher Certification Test: Histogram of Latent Class Sizes .......... 53
Figure 9. Teacher Certification Test: Classification Statistics .......... 54
Figure 10. High School Writing Test: Plot of Individual Rater’s Relative Criteria (LC-SDT) .......... 57
Figure 11. High School Writing Test: Plot of Individual Rater’s Relative Criteria (GR) .......... 58
Figure 12. High School Writing Test: Plot of Individual Rater’s Relative Criteria (GPC) .......... 59
Figure 13. High School Writing Test: Plot of Individual Rater’s Discrimination (LC-SDT) .......... 63
Figure 14. High School Writing Test: Plot of Individual Rater’s Discrimination (GR) .......... 64
Figure 15. High School Writing Test: Plot of Individual Rater’s Discrimination (GPC) .......... 65
Figure 16. High School Writing Test: Histogram of Latent Class Sizes .......... 67