Rater Expertise in a Second Language Speaking Assessment
RATER EXPERTISE IN A SECOND LANGUAGE SPEAKING ASSESSMENT: THE INFLUENCE OF TRAINING AND EXPERIENCE

A DISSERTATION SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAIʻI AT MĀNOA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN SECOND LANGUAGE STUDIES

DECEMBER 2012

By Lawrence Edward Davis

Dissertation Committee:
John Norris, Chairperson
James Dean Brown
Thom Hudson
Luca Onnis
Thomas Hilgers

© Copyright 2012 by Lawrence Edward Davis
All Rights Reserved

For Donna Seto Davis

ACKNOWLEDGMENTS

Many individuals made important contributions to my dissertation project. First, I thank my dissertation committee for the many useful comments and suggestions made during the process. I particularly thank Professor Luca Onnis, who introduced me to statistical learning and behavioral economics (domains which informed the final section of the study), and Professor J.D. Brown, for providing publication opportunities outside of the dissertation process and for his gentle corrections in the face of my many transgressions against APA style. In addition, John Davis, Hanbyul Jung, Aleksandra Malicka, Kyae-Sung Park, John Norris, and Veronika Timpe assisted in piloting the instruments used in the study; their comments and observations did much to improve the data collection procedures. I am grateful to Xiaoming Xi at Educational Testing Service (ETS) for her help in obtaining access to the TOEFL iBT Public Use Dataset, and to Pam Mollaun at ETS for her help in recruiting TOEFL scoring leaders to provide reference scores. I also thank the eleven ETS scoring leaders who provided additional scores for responses from the TOEFL iBT Public Use Dataset. A number of individuals assisted in recruiting participants for the study, including Barbara Davidson, Jennifer Hickman, Aurora Tsai, and Mark Wolfersberger. Although they must remain anonymous, I extend my heartfelt thanks to the individuals who participated as raters in the study. It goes without saying that there would be no dissertation without them, and I appreciate their efforts in completing a long list of tedious tasks with minimal compensation.

My dissertation advisor, Professor John Norris, played a key role in all aspects of the project. This dissertation has benefited greatly from his broad knowledge and sage advice. Even more, John has been an immensely patient and attentive mentor, and he provided me with opportunities to learn, work, and publish that have been crucial to my development as a professional.

The study would not have been possible without financial support. I thank the Department of Second Language Studies and the College of Languages, Linguistics, and Literatures at the University of Hawaiʻi for providing graduate assistantships throughout my Ph.D. program. Data collection was directly supported by a TOEFL Small Dissertation Grant from Educational Testing Service and by a dissertation grant from The International Research Foundation for English Language Education (TIRF). Collection of data and writing of the dissertation were also greatly facilitated by a dissertation completion fellowship from the Bilinski Educational Foundation. I thank ETS, TIRF, and the Bilinski Foundation for their very generous support.

Finally, I thank the many friends, colleagues, and family members who have supported me throughout the Ph.D. program.
It is not possible to mention everyone, but special thanks go to John Davis: a colleague, friend, and fellow traveler, always willing to lend a keen intellect or a sympathetic ear as the situation required. Finally, completion of this dissertation would have been impossible without the support of my wife, Donna Seto Davis; last in mention, but first in my heart.

ABSTRACT

Speaking performance tests typically employ raters to produce scores; accordingly, variability in raters' scoring decisions has important consequences for test reliability and validity. One source of this variability is the rater's level of expertise in scoring. It is therefore important to understand how raters' performance is influenced by training and experience, as well as the features that distinguish more proficient raters from their less proficient counterparts. This dissertation examined the nature of rater expertise within a speaking test, and how training and increasing experience influenced raters' scoring patterns, cognition, and behavior. Experienced teachers of English (N = 20) scored recorded examinee responses from the TOEFL iBT speaking test prior to training and in three sessions following training (100 responses per session). For an additional 20 responses, raters verbally reported (via stimulated recall) what they were thinking as they listened to the examinee response and made a scoring decision, with the resulting data coded for the language features mentioned. Scores were analyzed using many-facet Rasch analysis, and scoring phenomena including consistency, severity, and use of the rating scale were compared across dates. Various aspects of raters' interaction with the scoring instrument were also recorded to determine whether certain behaviors, such as the time taken to reach a scoring decision, were associated with the reliability and accuracy of scores.

Prior to training, rater severity and internal consistency (measured via Rasch analysis) were already of a standard typical for operational language performance tests. Training resulted in increased inter-rater correlation and agreement and in improved correlation and agreement with established reference scores, although little change was seen in rater severity. Additional experience gained after training appeared to have little effect on rater scoring patterns, although agreement with reference scores continued to increase. More proficient raters reviewed benchmark responses more often and took longer to make scoring decisions, suggesting that rater behavior while scoring may influence the accuracy and reliability of scores. On the other hand, no obvious relationship was seen between raters' comments and their scoring patterns, with considerable individual variation in the frequency with which raters mentioned various language features.
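As a point of reference for the analysis the abstract describes, a minimal sketch of the many-facet Rasch model in its common rating scale form (after Linacre) is given below. The notation is the conventional one for this model and is an assumption here; it is not drawn from the dissertation itself.

% Many-facet Rasch (MFRM) rating scale model, in its common form after
% Linacre; symbol names are conventional, not taken from the dissertation.
\[
\log \frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k
\]
% where P_{nijk} is the probability that examinee n, responding to task i,
% is awarded category k by rater j, and P_{nij(k-1)} is the probability of
% the adjacent lower category k-1; B_n is the ability of examinee n, D_i the
% difficulty of task i, C_j the severity of rater j, and F_k the difficulty
% of the step from category k-1 to category k.

Under a model of this form, each rater's severity (C_j) and fit statistics reflecting internal consistency can be estimated on a common scale and compared across scoring sessions, which is the kind of cross-date comparison the abstract refers to.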
TABLE OF CONTENTS

Acknowledgments
Abstract
List of tables
List of figures
Chapter 1: Introduction
  1.1 Effects of training and experience on raters
  1.2 Potential approaches to rater behavior from the domain of psychophysics
  1.3 Purpose of the study and research questions
  1.4 Definitions
    1.4.1 Rater expertise/proficiency
    1.4.2 Rater training
    1.4.3 Experience
    1.4.4 Rater cognition
    1.4.5 Rater behavior
  1.5 Organization of the dissertation
Chapter 2: Variability in rater judgments and effects of rater background
  2.1 Variability in scoring patterns
    2.1.1 Consistency
    2.1.2 Severity
    2.1.3 Bias
    2.1.4 Use of the rating scale
    2.1.5 Halo
    2.1.6 Sequence effect
    2.1.7 Impact of rater variability on scores
  2.2 Variability in rater decision-making
    2.2.1 Writing assessment
    2.2.2 Speaking assessment
    2.2.3 Summary
  2.3 Effect of rater professional background on scores and decision making
    2.3.1 Writing assessment
    2.3.2 Speaking assessment