
Integration of a web-based rating system with an oral proficiency interview test: Argument-based approach to validation

by

Hye Jin Yang

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Applied Linguistics and Technology

Program of Study Committee:
Carol A. Chapelle, Co-Major Professor
Elena Cotos, Co-Major Professor
Volker Hegelheimer
Gary Ockey
Frederick O. Lorenz
Jo Mackiewicz

Iowa State University
Ames, Iowa
2016

Copyright © Hye Jin Yang, 2016. All rights reserved.

I dedicate this dissertation to my mom and dad

Ok Chun Hwang & Seung Woo Yang


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
ABSTRACT

CHAPTER 1 INTRODUCTION
   1.1 Background
   1.2 Problems in the Context of This Dissertation
   1.3 Purpose of This Dissertation
   1.4 Significance of the Study
   1.5 Dissertation Overview

CHAPTER 2 THEORETICAL AND EMPIRICAL FOUNDATION
   2.1 Issues in Computer-Assisted Language Testing
      2.1.1 History and key attributes of CALT
      2.1.2 CALT types
      2.1.3 Construct validation studies in CALT
         2.1.3.1 Correlation studies
         2.1.3.2 Comparison of examinees' performance between CBT and PBT
         2.1.3.3 Sources of construct-irrelevant variance relevant to computers
      2.1.4 Gaps in research
   2.2 Assessing Speaking Ability in Language Performance Tests
      2.2.1 Conceptualization of the speaking performance test process
      2.2.2 Rater variability
         2.2.2.1 Rater severity
         2.2.2.2 Rating scale use
      2.2.3 Task variability
         2.2.3.1 Task difficulty factors
         2.2.3.2 Empirical methods of estimating task difficulty
      2.2.4 Gaps in research

CHAPTER 3 CONTEXT AND ARGUMENT-BASED APPROACH TO VALIDATION
   3.1 Context
      3.1.1 Components of the OECT
      3.1.2 Rating procedure
   3.2 Rater-Platform (R-Plat)
   3.3 Interpretive Argument for the OPI Scores
   3.4 Research Questions


CHAPTER 4 METHODOLOGY
   4.1 Research Design
   4.2 Participants
   4.3 Materials
      4.3.1 Questionnaire
      4.3.2 Focus group and individual interviews
      4.3.3 OPI prompts
      4.3.4 Scoring rubric
      4.3.5 Diagnostic descriptors
   4.4 Procedure
      4.4.1 Questionnaire
      4.4.2 Focus group and individual interviews
      4.4.3 OPI prompt rotation
      4.4.4 Rating results
   4.5 Data Analysis
      4.5.1 Raters' perceptions towards R-Plat (RQ1)
      4.5.2 Comparisons of diagnostic descriptor markings (RQ2)
      4.5.3 Raters' comments as indicators of speaking ability levels (RQ3)
      4.5.4 Descriptive statistics (RQ4)
      4.5.5 OPI ratings as reliable indicators of different speaking ability levels (RQ5)
      4.5.6 Consistency of OPI prompts at different difficulty levels (RQ6)
      4.5.7 Consistency of raters within each administration (RQ7)

CHAPTER 5 RESULTS
   5.1 Raters' Perceptions towards R-Plat
      5.1.1 Clarity of R-Plat
      5.1.2 Level of raters' comfort with R-Plat
      5.1.3 Effectiveness of R-Plat and raters' satisfaction
   5.2 Use of Diagnostic Descriptors to Support Proficiency Level Ratings
      5.2.1 Proficiency level comparisons of diagnostic descriptor markings
      5.2.2 Seven categories of diagnostic descriptors
      5.2.3 Raters' reasons for selecting diagnostic descriptors
   5.3 Use of Raters' Comments to Support Proficiency Level Ratings
      5.3.1 Inter-coder reliability
      5.3.2 Comparison of positive and negative comments across proficiency levels
      5.3.3 Comparison of positive and negative evaluative units grouped by the OPI scoring criteria across proficiency levels
   5.4 Descriptive Statistics
   5.5 Dependability of OPI Ratings
      5.5.1 Descriptive statistics for OPI ratings
      5.5.2 Unidimensionality assumption check
      5.5.3 Dependability of the OPI ratings
   5.6 Comparison of Intended Prompt Level and Observed Difficulty
      5.6.1 Prompt difficulty at the advanced level


      5.6.2 Prompt difficulty at the intermediate-high level
      5.6.3 Prompt difficulty at the intermediate-mid level
      5.6.4 Prompt difficulty at the intermediate-low level
   5.7 Rater Consistency Within Each Test Administration
      5.7.1 Administration 1
      5.7.2 Administration 2
      5.7.3 Administration 3
      5.7.4 Administration 4
      5.7.5 Administration 5
      5.7.6 Administration 6

CHAPTER 6 DISCUSSION AND CONCLUSIONS
   6.1 Validity Argument for OPI Scores with R-Plat Web-based Rating System
      6.1.1 Evaluation inference
      6.1.2 Generalization inference
   6.2 Conclusions
      6.2.1 Limitations and recommendations for future research
      6.2.2 Implications

REFERENCES
APPENDIX A SCREENSHOT OF THE TEACH RATING PAGE IN R-PLAT
APPENDIX B SCREENSHOT OF THE FINAL SCORE CONFIRMATION PAGE IN R-PLAT
APPENDIX C SAMPLE CONSENT FORM
APPENDIX D QUESTIONNAIRE FOR NEW AND EXPERIENCED RATERS
APPENDIX E PROTOCOL FOR THE FOCUS GROUP INTERVIEWS
APPENDIX F QUESTIONS FOR THE FOCUS GROUP AND INDIVIDUAL INTERVIEWS
APPENDIX G SCORING RUBRIC (OECT Rater Manual, 2014)


LIST OF FIGURES

Figure 2.1 Bachman's model for oral test performance
Figure 3.1 The development process of R-Plat
Figure 3.2 Examples of rating schedule pages in R-Plat
Figure 3.3 Examinee's information in the OPI rating page
Figure 3.4 The OPI rating page in R-Plat
Figure 3.5 Example of rating page for 'Fluency' in seven diagnostic categories in R-Plat
Figure 4.1 A sequential embedded mixed methods design of the current study
Figure 4.2 Evaluation of diagnostic descriptors for comprehensibility based on five-point scales
Figure 4.3 Data collection procedure and timeline
Figure 4.4 Partially crossed rating design
Figure 4.5 Schematic diagram of procedures for analyzing raters' comments
Figure 4.6 Procedures for examining consistency in intended prompt difficulty levels
Figure 5.1 Distribution of markings on thirty diagnostic descriptors at each scale across three proficiency levels
Figure 5.2 Distribution of markings on diagnostic descriptors of comprehensibility at each scale across three proficiency levels
Figure 5.3 Distribution of markings on diagnostic descriptors of pronunciation at each scale across three proficiency levels
Figure 5.4 Distribution of markings on diagnostic descriptors of fluency at each scale across three proficiency levels
Figure 5.5 Distribution of markings on diagnostic descriptors of vocabulary at each scale across three proficiency levels
Figure 5.6 Distribution of markings on diagnostic descriptors of grammar at each scale across three proficiency levels
Figure 5.7 Distribution of markings on diagnostic descriptors of pragmatics at each scale across three proficiency levels
Figure 5.8 Distribution of markings on diagnostic descriptors of listening at each scale across three proficiency levels
Figure 5.9 Procedures for analyzing raters' comments


Figure 5.10 Overall distribution of positive and negative evaluative units across proficiency levels
Figure 5.11 Distribution of positive and negative evaluative units across proficiency levels
Figure 5.12 Distribution of positive and negative evaluative units across proficiency levels
Figure 5.13 Distribution of positive and negative evaluative units across proficiency levels for pronunciation
Figure 5.14 Distribution of positive and negative evaluative units across proficiency levels for fluency
Figure 5.15 Distribution of positive and negative evaluative units across proficiency levels for vocabulary
Figure 5.16 Distribution of positive and negative evaluative units across proficiency levels for grammar
Figure 5.17 Histograms of the OPI scores for each test administration
Figure 5.18 Histograms of the OPI ratings for each administration
Figure 5.19 A scree plot with the imputed data from the first iteration
Figure 5.20 Vertical ruler with all data from all test administrations
Figure 5.21 The vertical ruler for prompts at the advanced level
Figure 5.22 Vertical ruler for prompts at intermediate-high level
Figure 5.23 Vertical ruler for prompts at intermediate-mid level
Figure 5.24 Vertical ruler for intermediate-low level
Figure 6.1 Evaluation inference with three assumptions and backing
Figure 6.2 Generalization inference with three assumptions and backing


LIST OF TABLES

Table 2.1 Types of CALT
Table 3.1 Components of the OECT
Table 3.2 Score bands for each proficiency level
Table 3.3 Score range of OPI and TEACH for final placement decision
Table 3.4 Inferences, warrants, assumptions, and backing in the interpretive argument for the OPI in support of R-Plat
Table 4.1 Number of new and experienced raters participating in the official rating sessions for each test administration, the questionnaire, and the interviews
Table 4.2 Examples of OPI prompts
Table 4.3 Thirty diagnostic descriptors grouped by seven features of speaking ability
Table 4.4 Prompts for impromptu question tasks
Table 4.5 Prompts of role-play tasks
Table 4.6 Rotation of impromptu questions' prompts and role-plays at administration 3
Table 4.7 Summary of research questions, analytic methods, and data types
Table 4.8 Examples of raters' comments and their corresponding evaluative units selected from the advanced-level rating
Table 4.9 An example template for coding evaluative units
Table 5.1 Descriptive statistics for experienced and new raters, and total group responses to statements about clarity of R-Plat
Table 5.2 Results of ANOVA tests for clarity of R-Plat
Table 5.3 Descriptive statistics for experienced raters and new raters, and total group responses to statements about the level of comfort with R-Plat
Table 5.4 Results of ANOVA tests for raters' comfort with R-Plat
Table 5.5 Descriptive statistics for experienced and new raters, and total group responses to statements about the effectiveness of R-Plat
Table 5.6 Descriptive statistics for experienced raters and new raters, and total group responses to statements about the level of raters' satisfaction with R-Plat
Table 5.7 Frequencies and percentages of diagnostic descriptors at each scale point divided by proficiency level
Table 5.8 Frequencies and percentages of diagnostic descriptors at each scale point for comprehensibility across proficiency levels


Table 5.9 Frequencies and percentages of diagnostic descriptors at each scale point for pronunciation across proficiency levels
Table 5.10 Frequencies and percentages of diagnostic descriptors at each scale point for fluency across proficiency levels
Table 5.11 Frequencies and percentages of diagnostic descriptors at each scale point for vocabulary across proficiency levels
Table 5.12 Frequencies and percentages of diagnostic descriptors at each scale point for grammar across proficiency levels
Table 5.13 Frequencies and percentages of diagnostic descriptors at each scale point for pragmatics across proficiency levels
Table 5.14 Frequencies and percentages of diagnostic descriptors at each scale point for listening across proficiency levels
Table 5.15 Examples of raters' comments and their corresponding evaluative units selected from the advanced-level rating
Table 5.16 Positive and negative evaluative units in raters' comments divided by proficiency levels
Table 5.17 Frequencies of positive and negative evaluative units for functional competency grouped by proficiency levels
Table 5.18 Frequencies of positive and negative evaluative units for comprehensibility grouped by proficiency levels
Table 5.19 Frequencies of positive and negative evaluative units for pronunciation grouped by proficiency levels
Table 5.20 Frequencies of positive and negative evaluative units for fluency grouped by proficiency levels
Table 5.21 Frequencies and percentages of positive and negative evaluative units for vocabulary grouped by proficiency levels
Table 5.22 Frequencies of positive and negative evaluative units for grammar grouped by proficiency levels
Table 5.23 Descriptive statistics of the OPI scores for each test administration and all test administrations pooled
Table 5.24 Descriptive statistics of the OPI ratings for each test administration and all test administrations pooled
Table 5.25 Results of the principal component analysis


Table 5.26 Prompt difficulty for the advanced level
Table 5.27 Prompt difficulty for intermediate-high level
Table 5.28 Prompt difficulty for intermediate-mid level
Table 5.29 Prompt difficulty for intermediate-low level
Table 5.30 Rater severity and rating scale use for administration 1
Table 5.31 Rater severity and rating scale use for administration 2
Table 5.32 Rater severity and rating scale use for administration 3
Table 5.33 Rater severity and rating scale use for administration 4
Table 5.34 Rater severity and rating scale use for administration 5
Table 5.35 Rater severity and rating scale use for administration 6
Table 6.1 Validity argument for the OPI scores assigned from raters using R-Plat


ACKNOWLEDGMENTS

First and foremost, I thank God who helped me go through this long journey and fulfilled my life with His love and words. Thank you for staying with me to finish my dissertation, and for opening a new chapter in my life.

I would like to show my sincere and deepest gratitude to my advisors, Dr. Carol A. Chapelle and Dr. Elena Cotos. Thanks to Dr. Carol Chapelle for providing me with excellent and insightful guidance. Through the dissertation writing process, you always challenged me to think critically and to present ideas in a logical way. I have learned from you how to become an independent researcher. Also thanks to Dr. Elena Cotos, who gave me tremendous support and encouragement. Without your help, I would not have been able to pursue this research topic for my dissertation, nor would I have developed R-Plat and collected data to conduct my research. I am truly fortunate and honored to have both of you as my major advisors.

My gratitude extends to the committee members for their advice at different stages of my dissertation. Thanks to Dr. Volker Hegelheimer for his constant support. Ever since I became a PhD student in ALT, you have shared positive energy with me when I needed it. I also appreciate your help in the revision of my pilot study paper. I also thank Dr. Gary Ockey, who asked challenging but important and insightful questions that pushed me to think further. I am also grateful that you shared your knowledge whenever I was unsure of the application of statistical methods in my research. Thank you, Dr. Frederick Lorenz, for teaching me the fundamental knowledge of statistics that I needed for my own research. I also thank Dr. Jo Mackiewicz, who showed her excitement about my research and shared her insights for analyzing qualitative data.


My gratitude extends to the people who helped me develop R-Plat. I am very grateful to Karl Schindel, Justin Bentz, and Ryan Wilson for their tremendous work on the development of R-Plat. I would like to especially thank Minho Lim for his guidance and support in programming R-Plat. I also thank the raters of the Oral English Certification Test, who have used R-Plat since its first implementation in practice and kindly shared their constructive feedback with me. I thank Yoo Ree Chung for helping me collect data over several test administrations and for her continuous encouragement and kindness. I thank my colleagues, faculty, and staff in the ALT program, the Online Learning Team, and my friends. I am a lucky person because I was able to go through this long journey surrounded by all these precious people.

I thank my parents-in-law for their love and prayers for me. Your messages and warm hearts have always encouraged me. Thanks to my dearest husband, Jong Kwon Choe. You stayed with me throughout my graduate study. Thank you for your endless love, advice, and support. I also thank my younger brother, Won Mo, for taking care of our family while I was studying in the U.S. and for his continuous encouragement.

Finally, I thank my dearest Mom and Dad. You have always believed in me. Without your endless, unconditional love, support, and prayers, I could not have come this far.


ABSTRACT

This dissertation focuses on the validation of the Oral Proficiency Interview (OPI), a component of the Oral English Certification Test for international teaching assistants. The rating of oral responses was implemented through an innovative computer technology: a web-based rating system called Rater-Platform (R-Plat). The main purpose of the dissertation was to investigate the validity of interpretations and uses of the OPI scores derived from raters' assessment of examinees' performance during the web-based rating process. Following the argument-based validation approach (Kane, 2006), an interpretive argument for the OPI was constructed. The interpretive argument specifies a series of inferences, warrants for each inference, as well as underlying assumptions and specific types of backing necessary to support the assumptions. Of seven inferences (domain description, evaluation, generalization, extrapolation, explanation, utilization, and impact), this study focuses on two. Specifically, it aims to obtain validity evidence for three assumptions underlying the evaluation inference and for three assumptions underlying the generalization inference. The research questions addressed: (1) raters' perceptions towards R-Plat in terms of clarity, effectiveness, satisfaction, and comfort level; (2) quality of raters' diagnostic descriptor markings; (3) quality of raters' comments; (4) quality of OPI scores; (5) quality of individual raters' OPI ratings; (6) prompt difficulty; and (7) raters' rating practices.

A mixed-methods design was employed to collect and analyze qualitative and quantitative data. Qualitative data consisted of: (a) 14 raters' responses to open-ended questions about their perceptions towards R-Plat, (b) 5 recordings of individual/focus group interviews eliciting raters' perceptions, and (c) 1,900 evaluative units extracted from raters' comments about examinees' speaking performance. Quantitative data included: (a) 14 raters' responses to six-point scale statements about their perceptions, (b) 2,524 diagnostic descriptor markings of examinees' speaking ability, (c) OPI scores for 279 examinees, (d) 803 individual raters' ratings, (e) individual prompt ratings divided by each intended prompt level, given by each rater, and (f) individual raters' ratings on the given prompts, grouped by test administration.

The results showed that the assumptions for the evaluation inference were supported. Raters' responses to the questionnaire and individual/focus group interviews revealed positive attitudes towards R-Plat. Diagnostic descriptor markings and raters' comments, analyzed by chi-square tests, indicated different speaking ability levels. OPI scores were distributed across different proficiency levels throughout different test administrations. For the generalization inference, both positive and negative evidence was obtained. MFRM analyses showed that OPI scores reliably separated examinees into different speaking ability levels. Observed prompt difficulty matched intended prompt levels, although several problematic prompts were identified. Finally, while raters used the rating scales consistently within the same test administration, they were not consistent in their severity. Overall, the foundational parts of the validity argument were successfully established.

The findings of this study allow for moving forward with the investigation of the subsequent inferences in order to construct a complete OPI validity argument. They also suggest important implications for argument-based validation research, for the study of raters and task variability, and for future applications of web-based rating systems for speaking assessment.


CHAPTER 1

INTRODUCTION

1.1 Background

Language is a critical resource that human beings use to present their ideas and communicate with others as they perform different roles in diverse contexts of society such as business, politics, and education. To reflect the situational communicative needs of language users, the roles of language testing must also vary. Language testing researchers have addressed the close relationship between language testing and public protection; for example, the language of air traffic controllers (e.g., Moder & Halleck, 2012; R. Yan, 2014) and welfare situations involving communication disorders (e.g., Oller, 2012). Other uses of language testing are inseparable from governmental policy, as in testing for immigration and citizenship (e.g., Kunnan, 2012) or court translators and interpreters (e.g., Armstrong, 2014). Language testing also serves different purposes in higher education, such as university admissions (e.g., Xi, Bridgeman, & Wendler, 2014) and speaking assessments of international teaching assistants (e.g., Farnsworth, 2014). Overall, language testing is important because it is intended to represent test takers' ability to communicate in particular social settings.

Language tests are used to measure one or more dimensions of language ability (e.g., speaking, listening, writing, reading, grammar, and vocabulary), and they can be utilized for a number of different purposes (e.g., achievement tests, performance tests, proficiency tests, and diagnostic tests). Regardless of the many forms and objectives of language testing, a successful language test, in general, is expected to measure the target language construct. Test scores derived from the test are indicators of the level of one's language ability on "a trajectory of language development" (Chapelle, 2012, p. 28). Therefore, the interpretations of the test scores need to be justifiable and comprehensible to test users. Concerns about the testing purpose and the interpretations of the test scores become the baseline for developing an effective language test.

However, in real testing settings, diverse sources of variability can threaten the validity of language test scores. For instance, in a performance-based speaking assessment, an examinee's score is a function of multifaceted aspects associated with the testing context, including raters, scale criteria, speech samples, speaking performance, examinees' ability, and test prompts (Bachman, 2001; Ockey, 2009). Although test scores are largely indicators of an examinee's language ability, the examinee's performance can be evaluated differently depending on the raters who rate the examinee's performance and on the particular tasks given to the examinee. It is, therefore, necessary to make appropriate inferences about the examinee's language ability by identifying potential sources of variability in the testing environment and by collecting multiple pieces of evidence to support the meanings of the scores.

Of the possible types of variability, raters and tasks are perceived as key sources influencing test scores (McNamara, 1996). In a language performance assessment, raters assign scores to examinees while interacting with scoring rubrics, prompts, and, in face-to-face testing contexts, with human interlocutors. To justify the meanings of test scores, it is essential to uncover any possible rater variability threatening the accuracy of the test scores. Existing studies have found that rater variability can be attributed to several factors such as L1 background, age, gender, rating experience, and a number of other variables: 1) raters' L1 familiarity (e.g., Gass & Varonis, 1984; Munro, Derwing, & Morton, 2006; Tauroza & Luk, 1997); 2) interlocutor effects (e.g., Brown, 2003; Iwashita, 1996); and 3) rating strategies (e.g., Brown, 2000; Meiron, 1998; Pollitt & Murray, 1996). Rater variability has also been examined in terms of rater severity (e.g., Engelhard, 1994; Winke, Gass, & Myford, 2011; Yan, 2014) and raters' use of rating scales (e.g., Barkaoui, 2010; Cumming, 1990). These previous studies have yielded mixed results about consistency in rater severity and rating scale use, which calls for further investigation into rater variability pertinent to rater severity and rating scale use in different testing settings.
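Many of the severity studies cited above, like the analyses reported later in this dissertation, rely on many-facet Rasch measurement (MFRM). As a point of reference, a common three-facet rating scale formulation (a standard textbook form, not a formula quoted from this dissertation) models the log-odds of an examinee receiving category k rather than category k-1 as

\[
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k ,
\]

where \(\theta_n\) is the ability of examinee n, \(\delta_i\) is the difficulty of task or prompt i, \(\alpha_j\) is the severity of rater j, and \(\tau_k\) is the threshold of rating category k. Questions about rater severity and rating scale use are then addressed through the estimated \(\alpha_j\) values and category thresholds.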

Tasks, another potential source of variability, are a crucial component of the testing process because they are intended to stimulate and elicit language samples from examinees. Luoma (2004) regarded a task as a medium that leads examinees to produce target language samples in the process of completing the given assignment. Thus, it is necessary to examine whether operational tasks are properly created to elicit the appropriate language samples. Previous research has concentrated on the complexity of tasks by examining what constitutes task difficulty (e.g., Brown & Yule, 1983; Brown, 1989). Other studies found that task difficulty is determined by interactions between tasks and other factors related to the testing procedures, such as raters and examinees (e.g., Lim, 2009; Park, 1998; Reed & Halleck, 1997). Task difficulty can also be determined by testing conditions such as time pressure or input types (e.g., Crookes & Rulon, 1988; Iwashita, 1998). Additionally, Bachman claimed that task difficulty could be determined by the interactions of tasks with other types of variability rather than by inherent task characteristics (Bachman, 2002). Despite these previous attempts to examine issues of task difficulty in language testing, researchers have paid less attention to investigating task difficulty when it is pre-determined by test developers.

The emergence of computer-assisted language testing (CALT) opened a new chapter in conceptualizing the nature of testing procedures because computers are perceived as possible threats to the validity of scores (Canale, 1986). Chapelle and Douglas (2006) raised several possible concerns with CALT. First, examinees' performance on CALT could differ from their performance on other testing forms, which may fail to represent the same ability. Second, test items for CALT could be different from those in other test types. Third, presenting items through the algorithm of an adaptive test may not provide proper item samples to examinees, subsequently resulting in an increase in examinees' test anxiety. Fourth, automated scoring may fail to assign adequate scores to examinees' responses relevant to the target construct to be measured. Fifth, test security could pose additional concerns. In high-stakes tests, security is maintained through test administration in testing centers and verification of examinees' identification; in the case of a computer-adaptive test (CAT), test developers protect test items by creating large item pools. The last potential threat is that CALT may have negative impacts on learners, classes, and society due to its high cost or stakeholders' limited access to technology (p. 41).

Despite the potential caveats discussed above, however, Chapelle and Douglas (2006) claim that CALT has brought numerous benefits to language testing and that future exploration of CALT is worthwhile. CALT creates convenient physical and temporal testing circumstances and enhances fairness through consistent presentation of tasks, instructions, and input. CALT also allows for implementing diverse input and response types, and advances in CAT and natural language processing (NLP) technology facilitate more individualized and interactive testing environments for examinees (p. 23). Jamieson (2005) has also highlighted several advantages of a computerized language test: it requires less time for testing, offers faster and more timely score reporting, and can be administered at times suited to individual test takers' schedules. To make appropriate inferences about test scores from CALT, Chapelle and Douglas (2006) assert that the potential threats to validity need to be exhaustively investigated through validation research.

Validation research is guided by the argument that test developers want to make about the interpretation and use of the test scores. Current approaches to argument-based validation have been proposed by Kane (1992, 2001, 2002, 2004, 2006; Kane et al., 1999). The argument-based validation approach is characterized by an interpretive argument and a validity argument. The interpretive argument first specifies inferences and assumptions relevant to the proposed score's interpretation and use, laying out the types of empirical research aimed at collecting relevant validity evidence. The validity argument is based on empirical evidence and theoretical justifications that support the inferences of the interpretive argument (Chapelle, Enright, & Jamieson, 2008).

Kane's approach has been widely adopted by researchers in language testing (Bachman, 2005; Bachman & Palmer, 2010; Chapelle et al., 2008) because it provides "a transparent working framework to guide practitioners in three areas: prioritizing different lines of evidence, synthesizing them to evaluate the strengths of a validity argument, and gauging the progress of the validation efforts" (Xi, 2008, p. 4). For example, Bachman's validation framework (2010) was developed drawing upon Kane's argument approach, although it focuses more on test usefulness. Chapelle et al. (2008) drew upon the same approach in a validation study for the Test of English as a Foreign Language (TOEFL).

In this study, an argument-based approach was used to guide research investigating the adequacy of a rating process and sources of variability in the scores on the OPI when raters used a web-based rating system during the rating process.

1.2 Problems in the Context of This Dissertation

Acknowledging issues regarding variability (e.g., raters, tasks, and computers) and the essential role of the validation process, the initial idea for this dissertation grew from a concern about the operational rating procedure of an Oral Proficiency Interview (OPI), a component of the Oral English Certification Test developed for prospective international teaching assistants (ITAs) at Iowa State University (ISU). Previously, the OPI raters used a paper-based rating form. However, a needs analysis administered to raters who were experienced with the paper-based rating forms revealed issues with raters' rating practices and with the interpretation of rating results, issues that were likely to affect the reliability and validity of test scores. In addition, it had not been fully investigated how raters approach the rating procedure and how the prompts are utilized during operational OPI test administrations. The lack of empirical research on the OPI calls for validation research to support the use and interpretation of the OPI scores for the purpose of screening potential ITAs. As part of a validation study, I led the development of a web-based rating system called Rater-Platform (R-Plat). The system was designed to facilitate the rating procedure by addressing the existing problems with the paper-based rating form and by integrating new rating features into the system. With the help of R-Plat, I aimed to enhance raters' rating practices, which would in turn generate more reliable rating results and subsequently enhance the validity of scores.

1.3 Purpose of This Dissertation

The main purpose of this dissertation was to investigate the validity of interpretations of scores on the OPI performance-based speaking assessment. In the OPI, raters' scores were derived from their observation of examinees' performance on given tasks during a web-based rating process. I chose Kane's argument-based approach to validation to frame the types of empirical studies and to collect ample validity evidence to inform the interpretation and use of the test scores. The chain of inferences in the interpretive argument includes: domain description, evaluation, generalization, extrapolation, explanation, utilization, and impact (described in Chapter 3). The current study focused on the evaluation and generalization inferences. The evaluation inference involves the adequacy of rating procedures supported by the rating system to provide evidence of speaking ability in the OPI speaking assessment at a university in the U.S. The generalization inference is associated with consistency in raters' ratings within the same testing cycle, as well as the match between observed task difficulty and the intended difficulty level of tasks. This dissertation aimed to collect validity evidence by examining: a) raters' perceptions towards R-Plat in terms of clarity, effectiveness, satisfaction, and comfort level; b) quality of raters' ratings derived from diagnostic descriptor markings and rater comments; c) quality of OPI scores; d) quality of OPI ratings; e) prompt difficulty; and f) raters' rating practices.
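The mapping from the two targeted inferences to the evidence just listed can be summarized schematically. The sketch below is a condensed paraphrase of the study's design (the warrant wording and the grouping of assumptions are mine, not quoted from the dissertation):

```python
# Schematic sketch of the two inferences examined in this study and the backing
# gathered for each assumption; wording is paraphrased from the study design.
interpretive_argument = {
    "evaluation": {
        "warrant": "Ratings produced through R-Plat provide adequate evidence of speaking ability.",
        "assumptions_and_backing": {
            "Raters can work with R-Plat effectively": "questionnaire and interviews (RQ1)",
            "Descriptor markings and comments reflect ability levels": "chi-square analyses (RQ2, RQ3)",
            "OPI scores spread across proficiency levels": "descriptive statistics (RQ4)",
        },
    },
    "generalization": {
        "warrant": "Scores are consistent across raters, prompts, and administrations.",
        "assumptions_and_backing": {
            "Ratings reliably separate ability levels": "MFRM dependability analysis (RQ5)",
            "Observed prompt difficulty matches intended level": "MFRM prompt analysis (RQ6)",
            "Raters rate consistently within an administration": "MFRM rater severity and scale use (RQ7)",
        },
    },
}

for inference, detail in interpretive_argument.items():
    print(inference.upper(), "-", detail["warrant"])
```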


This dissertation employed a mixed-methods design in which a sequential embedded mixed design and a convergent parallel design were intertwined to collect qualitative and quantitative data. Raters' perceptions towards R-Plat were examined through analyses of both new and experienced raters' responses to six-point Likert-scale statements and open-ended questions in the questionnaire, as well as of data from individual interviews and focus group interviews. The quality of raters' ratings on diagnostic descriptors was examined through the distributions of diagnostic descriptor markings across different ability levels and through chi-square tests. The quality of raters' written comments, which specified examples of language produced by individual test takers in response to prompts, was first analyzed drawing on grounded theory in order to find emerging themes. The identified themes reflected positive and negative aspects of raters' comments, which were then analyzed using the metric of "evaluative units," defined as segments of a rater's comment that state an evaluation of the examinee's language. Next, the distributions of evaluative units were examined in relation to different proficiency levels, and chi-square tests were conducted to explore whether the relationships between the types of evaluative units and proficiency levels were statistically significant. To investigate the quality of the OPI scores, descriptive statistics were used to examine how the scores were distributed across different ability levels. Many-Facet Rasch Measurement (MFRM) was used to analyze the dependability of the OPI ratings, task difficulty, and raters' rating behaviors in terms of rater severity and rating scale use. Finally, findings were interpreted and synthesized to assess the extent to which they justified the rating procedure and the OPI test scores, situated in an argument-based approach to validation.
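As an illustration of the chi-square analyses just described, the sketch below tests whether the distribution of positive versus negative evaluative units is independent of proficiency level. The level labels and counts are hypothetical, not data from this dissertation, and scipy is assumed to be available.

```python
# Minimal sketch of a chi-square test of independence on a contingency table of
# evaluative-unit polarity (rows) by proficiency level (columns). Counts are illustrative.
from scipy.stats import chi2_contingency

observed = [
    [220, 140, 60],   # positive evaluative units across three proficiency levels
    [80, 160, 240],   # negative evaluative units across the same levels
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
```

A significant result of this kind would indicate that the balance of positive and negative evaluative units differs by proficiency level, which is the pattern the dissertation treats as evidence that raters' comments distinguish ability levels.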


The findings were also used to suggest potential implications, as described in the following section.

1.4 Significance of the Study

The findings of this study add to the empirical knowledge in the areas of validation research and CALT. This dissertation deepens validation research, as it examines the quality of test scores by taking into account rater and task variability, specifically when a computer-based rating system is utilized. Prior studies on rater and task variability have examined common testing contexts in which a web-based rating system was not involved. In this dissertation, however, multiple sources of validity evidence were gathered systematically to support not only the adequacy of the web-based rating procedure for a speaking test, but also the adequacy of test scores in light of rater and task variability.

The study also contributes to disciplinary scholarship by demonstrating an innovative application of technology for rating purposes in a speaking assessment. Although computers have been widely used for writing assessment and speech recognition systems are employed for speaking assessments, the use of computers for rating purposes has been less explored in interview-format speaking tests. The current study broadens the possibilities for computer-assisted assessment of oral language proficiency.

Finally, the results of this study can benefit raters, test administrators, examinees, and instructors of speaking classes for prospective ITAs. For raters and test administrators, the results concerning raters' perceptions and experiences using R-Plat suggest directions for future improvements of rating procedures and systems in such contexts. The findings about raters' rating practices and observed task difficulty will guide raters and test developers in improving the quality of rater training and in composing new tasks. Furthermore, examinees and instructors of speaking classes for prospective ITAs can benefit from the analyses of diagnostic descriptor ratings and raters' written comments. Examinees will be better able to understand their current language skills and make use of the scores to further enhance their speaking skills. Instructors of speaking classes for ITAs can improve their curricula and class activities by more accurately diagnosing their students' language proficiency levels.

1.5 Dissertation Overview

This chapter opened by addressing issues in language testing, beginning with the roles and significance of language testing and the types of variability relevant to raters, tasks, and computers. It then explained how the argument-based approach to validation has been applied in other validation research and how it has guided the areas of research focused on in this dissertation. Finally, the context and the main purpose of this dissertation were presented, and an overview of the methodology was provided.

The remainder of the dissertation consists of five chapters. Chapter 2 provides a review of previous literature that forms the theoretical and empirical foundation of this study. It covers how CALT has developed in language testing and the nature of rating procedures in language performance tests, focusing on rater and task variability. Chapter 3 elaborates on the OPI devised for measuring the speaking ability of prospective international teaching assistants (ITAs) and describes the Rater-Platform (R-Plat). The latter part of Chapter 3 addresses the interpretive argument for the OPI and presents the seven research questions guiding this dissertation. Chapter 4 provides detailed descriptions of the methods used to collect and analyze different types of evidence pertaining to the interpretive argument. This chapter specifies the participants, instruments, data collection procedures, and data analysis procedures. Chapter 5 presents the results of this dissertation with respect to each of the seven research questions. Chapter 6 concludes with the validity argument constructed based upon the findings and interpretations. This is followed by an acknowledgement of study limitations, implications, and suggestions for future research.


CHAPTER 2

THEORETICAL AND EMPIRICAL FOUNDATION

The purpose of this chapter is to establish a theoretical and empirical foundation for the current study by addressing issues in computer-assisted language testing (CALT) and speaking assessment in the previous literature. The first section discusses CALT issues under three categories: the history and key attributes of CALT, CALT types, and construct validation studies in CALT. The second section addresses speaking performance tests, covering a conceptualization of the speaking test process, rater variability, and task (prompt) variability. Each section of this chapter ends by highlighting gaps in current scholarship and calling for further investigation as it relates to this study.

2.1 Issues in Computer-Assisted Language Testing

The widespread advances in computer technology have led to innovation in language testing, allowing for convenient test delivery, automated scoring of written and spoken discourse samples, and analysis of linguistic features, to name a few. However, emerging CALT practices have posed new challenges for validation because computers add a new type of variability to language tests. This section covers the history and key attributes of CALT, CALT types, and existing validation research in CALT.

2.1.1 History and key attributes of CALT

Computers were first introduced to language testing in the 1930s in the form of IBM scoring machines that accelerated the scoring process for multiple-choice questions (Chapelle & Douglas, 2006). In the 1980s, computers were used to transfer paper-based tests to computer-based tests. In this period, the main goal of computerized tests was to enhance the efficiency of test administration and "automate an existing process" (Bennett, 2000, p. 3) by converting items, test instructions, and input from paper to computers. The 1990s gave way to further advances in CALT toward more sophisticated assessment forms such as computer-adaptive testing (CAT), the integration of new item types for measuring integrated language skills, and automated evaluation (Suvorov & Hegelheimer, 2014). In CAT, computer algorithms automatically assign different items to examinees based on their responses to each item. This selective item presentation, which adjusts to examinees' responses, appeared to produce a more precise estimation of examinees' true ability (e.g., Larson & Madsen, 1985; Brown & Iwashita, 1996; Young, Shermis, Brutten, & Perkins, 1996). The new multimedia capabilities in computerized language testing further allowed for measuring integrated language skills through visual, audio, and video input. For example, a listening test can emulate a real-life academic lecture by integrating visual input (e.g., pictures, graphs, or videos) with audio input. As for positive aspects of the use of multimedia, Ginther (2002) found that the variety of media facilitated examinees' improved performance on L2 language tests. However, other researchers (e.g., Fulcher, 2003b; Suvorov, 2009; Wagner, 2007) cast doubt on the positive influence of multimedia in testing situations because video or images could distract, rather than assist, examinees. Despite the mixed reviews of multimedia, it is evident that multimedia has allowed for creating new types of tasks in language testing. Lastly, automated evaluation systems (AESs) enable automatic assessment of productive language skills (writing and speaking) and even provide individualized feedback and scoring reports.

CALT has produced numerous advantages for language assessment. Chapelle and Douglas (2006) provided a clear conceptualization of the advantages of CALT through their use of a test method framework, claiming that this form of testing helps describe: physical and temporal circumstances, rubric/instructions, input and expected response, interaction between the input and response, and characteristics of the assessment (p. 23). For physical and temporal circumstances, CALT creates more accessible test environments. In contrast to paper-based tests, examinees can take tests at their own convenience, without restrictions of time or location, wherever the Internet is accessible. With regard to the presentation of scoring rubrics and test instructions, CALT contributes to enhancing fairness in test-taking practices because all examinees receive tasks, instructions, and test-related input in a consistent way. Liu, Moore, Graham, and Lee (2002) highlighted that CALT can generate individualized tests, randomization of items through item banks, and more secure testing environments.

Integration of multimedia changes the nature of test input and of examinees' expected responses. For example, current CALT input often includes diverse visual, audio, or video stimuli to measure multifaceted aspects of examinees' speaking ability. Input that combines these various forms creates more authentic and contextualized task types, which in turn lead examinees to be more engaged in the test situation. In line with this view, Ockey (2009) asserted that CALT can incorporate more authentic tests and tasks in comparison with the conventional paper-based test, which restricts item types.


CALT also makes it possible to provide individualized test input in response to examinees' performance, as well as individualized feedback to examinees. The computer-adaptive test (CAT) provides a good example: CAT algorithms automatically assign different items to an examinee by gauging the student's current proficiency level from responses to the previous items. Furthermore, the evolution of natural language processing (NLP), which automatically processes learners' language, has led to dramatic advances in automated essay scoring (AES) for writing assessment and automated speech recognition (ASR) technologies for spoken discourse evaluation. The rise of automated scoring technologies has made a considerable impact on language assessment in that it speeds up the scoring process and offers individualized feedback on examinees' writing and speaking performance. The discussion above shows that CALT has considerably changed the fundamental characteristics of language assessment. The next part addresses the types of CALT available in academia and industry.

2.1.2 CALT types

Computer-assisted language tests (CALTs) are used to evaluate various languages, for different purposes (e.g., placement, achievement, diagnosis, and proficiency), and in different contexts (e.g., academic, business, and aviation). Table 2.1 provides examples summarizing CALT types that have been reported in previous studies and in technical reports from testing companies. In this table, I divide the CALT types into three categories. The first category is CALT, which I use to indicate common applications of computers in language testing. The second revolves around computer-adaptive tests (CATs). The third is the integration of automated evaluation systems (AESs) for scoring purposes. The tests are described in the following categories: type, test name, developer, purpose, language, skills, and relevant resource.

Table 2.1
Types of CALT

Type | Test name | Developer | Purpose | Language | Skills | Resource
CALT | Computer-based International Language Testing System | Cambridge English Language Assessment | Proficiency | English | Speaking, Listening, Reading, Writing | http://www.ielts.org/
CALT | Oral English Proficiency Test | Purdue University | Proficiency | English | Speaking | http://www.purdue.edu/oepp/oept/index.html
CALT | College Oral English Test System | University of Science and Technology of China | Achievement | English | Speaking | Yu & Lowe (2007)
CALT | DIALANG | 14 European universities with assistance from Europe | Diagnostic | 14 European languages | Grammar, Listening, Reading, Writing, Vocabulary | http://www.lancaster.ac.uk/researchenterprise/dialang/about.htm
CAT | Computer-based TOEFL (CBT TOEFL) | Educational Testing Service (ETS) | Proficiency | English | Grammar, Speaking, Listening, Reading, Writing | http://www.ets.org/toefl
CAT | Business Language Testing Service (BULATS) | Cambridge ESOL | Proficiency | English, French, German, Spanish | Speaking, Listening, Reading, Writing | http://www.bulats.org/
CAT | Basic English Skills Test (BEST) Plus | Center for Applied Linguistics | Proficiency | English | Speaking, Listening | http://www.cal.org/aea/bestplus/
CAT | Compass ESL Placement Test | ACT | Placement | English | Speaking, Writing, Listening, Reading | https://www.act.org/compass/tests/esl.html
CAT | Web Computerized Adaptive Placement Exam (Web-CAPE) | Perpetual Technology Group (PTG) | Proficiency, Placement | Spanish, French, German, Russian, English, Chinese, Italian | Listening, Reading, Writing | http://www.perpetualworks.com
CALT with AES | Versant series of speaking assessments | Pearson Education, Inc. | Proficiency, Placement | Arabic, English, Aviation English, Chinese, Dutch, French, Spanish | Speaking | https://www.versanttest.com/
CALT with AES | Internet-Based TOEFL (iBT TOEFL) | Educational Testing Service (ETS) | Proficiency | English | Speaking, Writing, Listening, Reading | http://www.ets.org/toefl
CALT with AES | Pearson Test of English (PTE) Academic | Pearson Education, Inc. | Proficiency | English | Speaking, Writing, Listening, Reading | http://pearsonpte.com/

CALTs are widely developed to measure different components of language ability for different purposes. For example, the Computer-based International Language Testing System, by Cambridge English Language Assessment, was developed to assess English speaking, listening, reading, and writing skills. CALT is also devised to measure a single attribute of language ability, as in institutional applications of CALT. For example, the Oral English Proficiency Test (OEPT) is used to assess the oral English skills of prospective teaching assistants via a computerized test format at Purdue University in the U.S. Examinees take the test in a computer lab, and their spoken responses to the given prompts are recorded and delivered to raters through computers. Another institutional computer-based speaking test is the Collect English Oral Test System (CEOTS), a speaking test created at the University of Science and Technology of China (Yu & Lowe, 2007) that assesses Chinese college students' English speaking ability.

CALT can also be designed for multiple world languages. DIALANG is well known for its broad use in measuring 14 different European languages: Danish, Dutch, English, Finnish, French, German, Greek, Icelandic, Irish-Gaelic, Italian, Norwegian, Portuguese, Spanish, and Swedish. The broad range of computer-assisted language testing situations around the world demonstrates the growing potential of CALT as a successful medium for measuring language ability.

Computer-adaptive tests broaden the dimensions of CALT because a CAT adapts item difficulty levels in response to examinees' responses to the given items, a feature which enhances measurement precision. CAT allows for shorter tests, examinee self-pacing, individualized item selection that adjusts to examinees' ability level, and secure testing conditions, despite potential new-item exposure during pilot testing (Brown, 2012; Dunkel, 1999; Larson, 1999). Large testing companies frequently use CATs for proficiency assessment. The grammar and listening sections of the computer-based TOEFL adopted adaptive item delivery, but this approach was discontinued after the inclusion of new integrated tasks (Jamieson, Eignor, Grabe, & Kunnan, 2008). In the Business Language Testing Service (BULATS), the CAT approach was employed for the reading and listening sections of the test. A further example is the Basic English Skills Test (BEST) Plus, which offers examinees both a computer-adaptive test and a semi-adaptive, print-based test to measure English listening and speaking skills.
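To make the adaptive logic described above concrete, the following is a minimal sketch of item selection under a Rasch (1PL) item response model with maximum-information selection. The item bank, difficulties, test length, and update rule are illustrative assumptions; they do not describe the inner workings of any of the tests named here.

```python
# Minimal, self-contained sketch of a computer-adaptive test loop:
# pick the most informative unused item, score a simulated response,
# and re-estimate the examinee's ability after each item.
import math
import random

def rasch_p(theta, b):
    """Probability of a correct response given ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses, difficulties, theta=0.0, iterations=20):
    """Newton-Raphson maximum-likelihood estimate of ability from 0/1 responses."""
    for _ in range(iterations):
        grad = sum(x - rasch_p(theta, b) for x, b in zip(responses, difficulties))
        hess = -sum(rasch_p(theta, b) * (1.0 - rasch_p(theta, b)) for b in difficulties)
        if abs(hess) < 1e-9:
            break
        theta = max(-4.0, min(4.0, theta - grad / hess))  # keep the estimate bounded
    return theta

def run_cat(item_bank, true_theta, test_length=10):
    """Administer items adaptively from item_bank (item_id -> difficulty)."""
    theta, responses, used_difficulties, administered = 0.0, [], [], []
    remaining = dict(item_bank)
    for _ in range(test_length):
        # Fisher information of a Rasch item is p * (1 - p); select the maximum.
        def info(item_id):
            p = rasch_p(theta, remaining[item_id])
            return p * (1.0 - p)
        item_id = max(remaining, key=info)
        b = remaining.pop(item_id)
        response = 1 if random.random() < rasch_p(true_theta, b) else 0
        administered.append(item_id)
        responses.append(response)
        used_difficulties.append(b)
        if all(responses) or not any(responses):
            # No finite MLE exists for all-correct or all-incorrect patterns; nudge instead.
            theta += 0.5 if response == 1 else -0.5
        else:
            theta = estimate_theta(responses, used_difficulties, theta)
    return theta, administered

if __name__ == "__main__":
    bank = {f"item{i:02d}": -2.0 + 0.25 * i for i in range(17)}  # difficulties -2.0 to +2.0
    estimate, items_given = run_cat(bank, true_theta=0.7)
    print(f"estimated ability: {estimate:.2f}; items administered: {items_given}")
```

The key design idea, reflected in the selection step, is that each new item is chosen where measurement information is greatest for the current ability estimate, which is why adaptive tests can be shorter without losing precision.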

CATs can similarly provide effective testing methods for placement purposes. The Compass ESL Placement Test, offered by American College Testing (ACT), has implemented the adaptive approach to assess English listening, speaking, writing, and reading skills. Another example is the Web Computerized Adaptive Placement Exam (Web-CAPE), a placement and proficiency test used for assessing the listening, reading, and writing ability of seven languages (Spanish, French, German, Russian, English, Chinese, and Italian). At an institutional level, Web-CAPE is used to place students in the first two years of college language courses. The test can also be used to assess prospective employees' language proficiency in business settings. These examples show the flexibility of CATs for different purposes, needs, and contexts.

CALT continues to grow rapidly with the advancement of automated evaluation systems (AESs) and the constant refinement of NLP techniques. The Versant series of speaking assessments enables measurement of a wide range of languages such as Arabic, English, Aviation English, Chinese, Dutch, French, and Spanish (Pearson Education, Inc., 2009a). Examinees take the test over the phone, and their responses are rated using an automated speech recognition algorithm that analyzes the pronunciation, fluency, vocabulary, and sentence mastery of the speech sample (Pearson Education, Inc., 2009b).

In the Internet-based TOEFL (iBT TOEFL), the speaking and writing sections are assessed using automated scoring engines. The speaking section is evaluated by the SpeechRater engine, which analyzes discourse markers and lexical features, and the writing section is assessed by another engine, called e-rater, which analyzes the essay's syntactic variety and lexical features. In a similar vein, the Pearson Test of English (PTE) Academic is equipped with two types of automated scoring systems to measure international students' English proficiency. Specifically, the Intelligent Essay Assessor (IEA) technology measures writing products, and Pearson's Ordinate technology assesses speaking products.

This section has introduced how computer applications have advanced in the area of language assessment and has surveyed the operational CALT types in various fields. Building on this discussion of the increasing needs for and uses of CALT, the following section discusses how computer applications have been studied in language testing from a validation perspective.

2.1.3 Construct validation studies in CALT

Language testing perceives validity as a central concern. However, the immense development of CALT has resulted in reconceptualizing the construct measured in CALT settings. Acknowledging that validity is a central to testing, the effects of CALT can be supported by investigating the extent to which CALT measures the same construct as a paper-based test or that which human raters measure. Dooey (2008) also asserted that construct validity in CALT can be achieved if a test measures examinees’ target language skills rather than their computer skills. The following sections address how construct validity in CALT has been investigated in previous literature, focusing on (a) the correlation between PBT and CBT scores, (b) a comparison of examinees’ performance

between PBT and CBT, and (c) potential sources of construct-irrelevant variance associated with computers.

2.1.3.1 Correlation studies

Examining correlations between CBT scores and PBT scores is one approach to confirming that a CALT measures the same construct as a PBT. For example, Bugbee

(1996) asserted that the use of a CBT as an alternative for a PBT can be supported when high correlation and nearly equal means and variances between the modes are present.

Wolfe and Manalo (2005) investigated the extent to which word-processed

TOEFL essays and handwritten TOEFL essays had similar or different impacts on test takers' scores. Findings showed that the scores assigned to word-processed essays correlated slightly more strongly with scores on the TOEFL multiple-choice sections

(r = .78) than did the scores assigned to handwritten essays (r = .70).

Choi, Kim, and Boo’s (2003) correlation studies of CBTs and PBTs examined the scores of TOEFL listening, grammar, vocabulary, and reading sections that mainly consisted of multiple-choice questions. The results exhibited high disattenuated correlations, all close to 1.00, for the subtests (listening r = .94, grammar r

= .99, vocabulary r = .96, and reading r = .94). These findings suggested the comparability of the CBT and PBT.
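For reference, a disattenuated correlation of the kind reported above is conventionally obtained by correcting the observed correlation for the unreliability of the two measures (the specific reliability estimates used by Choi, Kim, and Boo are not reproduced here):

$$r_{x'y'} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}$$

where $r_{xy}$ is the observed correlation between the CBT and PBT scores on a given subtest, and $r_{xx}$ and $r_{yy}$ are the reliability estimates of the two test forms.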

In addition, confirmatory factor analyses of the two testing modes showed that the CBT and PBT shared the same construct under two different hypothetical models: one assuming equality of the factor loadings of the subtests, and the other assuming inequality of the factor loadings. Taken together, Choi, Kim, and Boo's results supported the conclusion that the CBT and PBT tapped the same constructs.


Coniam (2006) examined the efficacy and reliability of a computer-based listening test by comparing the listening ability of secondary school students (Grade 11 versus

Grade 12 classes) in Hong Kong as assessed when taking a paper-based listening test and an adaptive computer-based test. Results showed that correlations of the two test modes were high (r = 0.76), suggesting the potential use of CAT for a low-stakes test in a local institution, but not for a high-stakes test. To summarize, high correlations between CBT and PBT scores offer evidence for the comparability of CBTs as valid alternatives to

PBTs.

2.1.3.2 Comparison of examinees’ performance between CBT and PBT

Additional validation research on CALT can be conducted by comparing examinees' achievement on CBTs and PBTs. Existing studies have yielded mixed results regarding students' performance across the two testing modes. Barrera, Rule, and Diemart (2001) compared 18 first-grade students' English writing performance in a computerized (word-processing) mode and a paper-based mode. The writing samples came from the students' handwritten and computer assignments over two semesters. The students' performance was evaluated as part of a classroom performance test. Results indicated that the students favored computer assignments and even displayed better performance on them, as they produced more words and longer sentences using the computer. Although the students' enhanced performance on computer assignments was not attributed to computer use itself, the researchers acknowledged that computers provided a more supportive setting for student writing. Additionally, Coniam's study

(2006) revealed that the secondary school students, both in Grade 11 and Grade 12, performed better on an adaptive computer-based listening test than on its paper-based

version. As for the reasons for the differences between the two testing modes, the author did not draw an explicit conclusion, but asserted that the item types presented to students may have contributed to the difference in performance.

By contrast, other studies have generated different results. For example, Choi et al. (2003) reported that a group of test takers who completed multiple-choice reading items presented in a paper-based reading test outperformed those who completed the same items in a computer-based test. As another example, a comparative study by Hosseini, Abidin, and Baghdarnia (2014) examined Iranian first-year English students' reading comprehension in two testing modes (PBT versus CBT). They found that the students' computer familiarity and attitudes towards computers did not generate differences in the results of the computerized tests. As these mixed results of the comparability studies show, it remains contentious whether the CBT and PBT formats of the same language test can be used concurrently for the same assessment purpose. These findings call for further investigation of examinees' performance across CBT and PBT for validation purposes.

2.1.3.3 Sources of construct-irrelevant variance relevant to computers

Another strand of validation research in CALT centers on identifying the sources of construct-irrelevant variance linked to computers and their impact on examinees' performance. During the early use of computer applications for testing purposes, many researchers raised concerns about the possible influence of construct-irrelevant variance that hindered accurate estimation of students' language ability in

CALT. Among diverse sources of construct-irrelevant variance in CALT settings,

computer familiarity was perceived as a factor possibly preventing measurement of the target construct of CALT. For example, Taylor, Jamieson, Eignor, and Kirsch (1998) compared CBT TOEFL scores between computer-familiar and computer-unfamiliar groups. The results indicated that the average score of the computer-familiar group was statistically significantly higher than that of the other group, but the difference was too small to be practically meaningful.

However, other studies reported different results. Hosseini et al. (2014) investigated the influence of computer familiarity and attitudes towards computers on testing performance by employing a questionnaire. They adopted the Computer Attitude

Scale (CAS) (Loyd & Gressard, 1984) to construct the questionnaire items. The results showed that computer familiarity and attitude towards computers did not have any significant effect on students’ performance in computer-based tests. Another study (Trites

& McGroarty, 2005) showed that the examinees’ computer familiarity did not have a significant influence on the scores of two types of TOEFL iBT reading tasks.

Surprisingly, with the increasing availability of computers for language learners worldwide, computer familiarity has become less of a concern regarding construct-irrelevant variance in CALT (Wall & Horák, 2006). Despite the initial pessimistic views of CALT, evidence collected through various empirical studies suggests that computer familiarity is no longer a factor that influences test scores.

In an effort to alleviate the impact of potential sources of construct-irrelevant variance, usability testing can be employed in CALT development stages. Fulcher

(2003a) described three phases of CALT development, namely (a) planning and initial design, (b) usability testing, and (c) field testing and fine-tuning. From a language testing perspective, in particular, usability testing is an iterative process focusing on the

interactions between test-takers and the interface by examining whether test takers can navigate and use items on computers. Examinees’ successful interactions with the system are significant for obtaining precise estimations of language ability, because

“usability problems may constitute a threat to construct validity” (Fulcher, 2003a, p.

384). Therefore, any possible threats need to be identified through usability testing and removed in the developmental process.

In the area of language testing, usability testing has been conducted through questionnaires or surveys. Results of the usability tests have been used to enhance and support the effectiveness of the computer systems in measuring the target construct. For example, Kim (2006) employed usability testing as part of a validation study to develop a web-based speaking test for international teaching assistants in the U.S. The purpose of the study was primarily to investigate the effectiveness of test takers' participation in the development of the web-based speaking test. The researcher collected test takers' responses about the effectiveness of the tool through surveys and interviews conducted during tool development. Kim concluded that the test takers' participation brought benefits not only to the creation of the speaking assessment, but also to the measurement of the target language ability.

Kenyon and Malabonga’s (2001) usability test was conducted to support the effectiveness of a new CALT as an instrument for measuring speaking ability. In the study, the researchers investigated test takers’ attitudes related to three types of speaking assessments—the tape-mediated Simulated Oral Proficiency Interview (SOPI), a new

Computerized Oral Proficiency Instrument (COPI), and the face-to-face American

Council on the Teaching of Foreign Languages (ACTFL) Oral Proficiency Interview


(OPI)—across three languages (Spanish, Arabic, and Chinese). After the test takers took the three types of tests, they shared their perceptions about the three test types in terms of strengths and weaknesses, difficulty, fairness, nervousness, clarity, and accuracy through

Likert scale statements in the questionnaire. It was found that the test takers favored the

COPI over the SOPI because it gave them control over the choice of tasks, difficulty levels, the language of the directions, and thinking and response time. The adjustability of the test input that the COPI offered had a more positive impact on lower-level test takers, because the input could be adjusted to their proficiency level. The test takers' positive reactions to the new type of CALT provided evidence that computers no longer function as negative factors influencing performance during a test. In addition, the findings established grounds for further development of the COPI as a promising testing instrument.

2.1.4 Gaps in research

This section has dealt with how CALT has developed to create effective testing environments for measuring target language ability, and with the benefits and caveats it has brought to the language testing field. Building upon the history and characteristics of

CALT, I reviewed how CALT has been implemented to measure diverse languages for different purposes. Then, different types of CALT research were examined to demonstrate how researchers have tackled construct validity in CALT.

As addressed in this section, the most common CALT applications are largely derived from the needs of examinees and test administrators. However, there were a few cases in which raters were among the central stakeholders of the language assessment.

Along with examinees, raters are significant stakeholders in a language assessment, as they assign scores or provide comments on examinees’ performance. Their observation

and judgment of examinees' language ability contribute to the final decision regarding test takers' language proficiency levels. Many studies examining typical

CALT types reported that raters view or listen to examinees' responses on computer screens and enter the scores into the computer. In the Best PlusTM test, for example, raters view examinees' responses to the given tasks on the computer screen and enter the score into the computer (Brown, 2012). In addition, the rating process for the OEPT is operationalized on the test's own testing website (Oral English Proficiency Program,

2013). Using the web-based rating page, raters can rate anywhere with Internet access. Raters assign scores and write comments on the electronic rating page while viewing recordings of examinees' performance. What is missing, however, is empirical research on situations where raters rate on computers during a live, face-to-face speaking assessment. Considering the critical role of raters in language testing, it is worth developing a computerized rating system for raters and investigating how raters perceive and use the system as a part of the validation process.

2.2 Assessing Speaking Ability in Language Performance Tests

A second theoretical foundation for this study concerns the factors influencing scores on speaking performance tests. This section begins by considering the nature of language performance tests and how different factors are intertwined with the testing process. The next part focuses on two prevalent types of variability (raters and prompts) and how they have been studied in the previous literature.


2.2.1 Conceptualization of the speaking performance test process

In language testing, scores are seen as indicators of an examinee’s true language ability. However, in real testing situations, there exist multiple factors affecting scores other than one’s language ability, and these factors are major sources of variability in test scores. Deville and Chalhoub-Deville (2006) classified variability as good and bad.

Good variability (i.e., differences in individual examinees’ true abilities) is associated with elements that differentiate examinees’ language ability, whereas bad variability indicates all the elements influencing scores other than true ability.

To obtain a precise estimation of scores that reflect one's true language ability, it is essential to identify the types and magnitude of bad variability. The sources of bad variability are characterized as unsystematic or systematic, depending on how consistently they operate. Unsystematic variability involves random factors impacting scores, such as fatigue and health or emotional conditions, which are difficult to predict. Systematic variability is more predictable and has more obvious influences on scores; examples include the test method or examinees' attributes, such as gender or ethnic background. Testers have extensively focused on identifying the types and effects of systematic variability to reduce its impact on scores (Bachman, 1990).

In order to identify types of systematic variability, it is indispensable to examine the nature of the rating process of a language test. Bachman (2001) proposed an interactive model for oral test performance based upon the theory of communicative competence (Hymes, 1972) and previous models for speaking assessment (Kenyon, 1992;

McNamara, 1996; Skehan, 1998). Bachman’s model conceptualizes multiple dimensions of the rating procedure, accounting for more interactive relationships among different

facets of an oral test performance, as shown in Figure 2.1. The model for oral test performance (Bachman, 2001) describes different types of variability (facets) and the dynamic interactions among them. In this model, facets of oral test performance are represented either as squares or circles, and the relationships among facets are represented with single-headed or two-headed arrows. The solid arrows indicate relations among facets originally proposed in Bachman's model.

Figure 2.1 Bachman’s model for oral test performance (Ockey, 2009, p. 163)

A candidate’s score is a function of diverse aspects of the testing context including raters, scale criteria, speech samples, speaking performance, a candidate’s ability for language use, the candidate’s underlying competence, characteristics and qualities of the task, and the interactants. By evaluating a candidate’s speech sample and speaking performance,

based upon scale criteria, a rater assigns a score. The candidate’s speaking performance interacts with (a) qualities and characteristics of tasks, (b) the candidate’s ability for language use and underlying competencies, and (c) the interactants. The candidate’s ability interacts with tasks and interactants. The candidate’s speech sample is a function of the candidate’s speaking performance and the ability for language use and underlying competence. Interactants include characteristics of examiners and other participants, both of which interact with the candidate’s speaking performance, tasks, and the candidate’s ability for language use and underlying competence.

A recent application of Bachman's model to a language performance test can be seen in a study of group oral assessment (Ockey, 2009). Ockey conceptualized the influence of group members' personalities on scores of L2 group oral discussion tests using Bachman's model. The goal was to investigate the extent to which candidates' speaking performances in the group oral discussion could be influenced by the assertiveness of their group members. Bachman's model for oral test performance was applied to predict the potential relationships among candidates and between candidates and raters. For example, Ockey proposed that a candidate's speaking performance on the group oral test is influenced by the personality of the interactants in the group oral test. The double-headed arrow between interactants and speaking performance in Bachman's model helps describe this relationship. In addition, Ockey assumed that the personality of interactants would affect raters' perceptions of candidates' speaking performances. This association is indicated by adding a single-headed, dotted arrow stretching from interactants to raters in Figure 2.1 above.


Findings yielded positive evidence for these assumptions. The assertive candidates obtained higher scores than expected when working with non-assertive group members, yet received lower scores than expected when interacting with assertive group members. This relationship was consistent with the arrow between interactants and speaking performance predicted in the model. In addition, findings supported the influence of candidates' interactions with other group members on raters' perceptions, as indicated by the dotted arrow in the model. It turned out that the assertive test takers who were grouped with assertive group members received significantly lower scores than those working with non-assertive group members. Ockey interpreted this to mean that raters may perceive a candidate's assertiveness positively when an assertive candidate works with non-assertive group members, but negatively when the candidate competes with all assertive group members. Through this application, we can see that the interactive model provides a clear picture of the multi-dimensional aspects of the rating process in a language performance test.

2.2.2 Rater variability

To maintain score validity, the impact of measurement error should be minimized. Extensive research on performance tests has sought to minimize the impact of rater variability on test scores by investigating possible sources of variability associated with raters. McNamara (1996) identified examinees' language abilities, raters, and tasks as the central sources of variability among all potential sources in performance-based language testing. In particular, McNamara viewed raters as critical factors affecting test scores, as they assign scores to examinees' performance while also interacting with

examinees, scoring rubrics, and prompts. Dunbar, Koretz, and Hoover (1991) asserted that “fallible raters can wreak havoc on the trustworthiness of scores and add a term to the reliability equation that does not exist in the tests that can be scored objectively” (p.

291). Hoyt and Kerns (1999) reported that an average of 37% of variance in ratings was attributable to rater main effects and rater-examinee interactions in language testing.

Discussions about the significant impact of raters on test scores lead us to explore how rater variability has been studied. This section reviews research on rater variability and its relation to rating practices, focusing on rater severity and raters' rating scale use.

2.2.2.1 Rater severity

In an effort to investigate rater variability, a number of researchers have examined rater severity. To guarantee quality of the ratings in language performance tests,

Engelhard (1994) argued that raters’ ratings need to be evaluated on a continuum of severity or leniency. Rater severity has received much attention, with related studies producing mixed results. Much research has revealed that rater severity changes based on interaction with a multitude of other variables (e.g., raters’ L1 background, rating experiences, and time).

Scholarship has shown that raters exhibit varying degrees of severity depending on their L1 background, which shapes their perceptions of pronunciation, accent, and intelligibility and, in turn, has a considerable impact on their decision making. For example, Yan’s

(X. Yan, 2014) investigation of eleven raters’ severity and leniency in a local, oral

English proficiency test revealed a significant difference in severity among the raters, with no rater identified as extremely severe or lenient toward examinees. Interestingly, Chinese raters tended to be more lenient toward Chinese examinees, but more severe toward Indian

examinees. By contrast, raters who were native English speakers were more lenient towards

Indian examinees. Similar results were found in other studies (Carey, Mannell & Dunn,

2010; Winke, Gass & Myford, 2013). Carey, Mannell, and Dunn (2010) found that IELTS raters appeared to assign higher scores to examinees from China, Korea, and

India when they were familiar with the accent of one of these language backgrounds. Another study of iBT TOEFL raters (Winke, Gass, & Myford, 2011) produced the same results as the aforementioned studies.

Different levels of rater severity have also been observed between novice and experienced raters. Weigle (1998) compared the degree of rater severity between experienced and novice raters in an L2 writing assessment. She observed that novice raters were more severe and less consistent in their ratings than experienced raters. Lim (2011) compared changes in severity and consistency between new and experienced raters of the writing section of the Michigan English Language Assessment

Battery (MELAB). The experienced raters maintained acceptable severity levels over three time periods of 12 to 21 months. The novice raters showed inconsistent severity levels at the beginning of the grading periods, being either too harsh or lenient compared to the average severity of the raters. However, it was noted that the novice raters’ severity levels fell into the average severity levels and converged with the experienced raters’ severity levels over time.

Raters also changed in their severity within the same testing period. Lunz and

Stahl (1990) found that raters were not consistent in their severity even over a half-day rating period. Although the researchers acknowledged the potential influence of other

variables, such as raters' fatigue, raters' interpretation of the rating scales, or the examinees themselves, this study revealed that raters might not maintain consistent severity within a given day.

Rater severity has also been investigated over time periods longer than a single rating session. Bonk and Ockey (2003) employed many-facet Rasch measurement (MFRM) to examine changes in rater severity over time in a second language oral assessment that took the form of a peer group discussion task. They found that individual raters were not consistent in their ratings between two consecutive administrations of the test. For example, four of the 13 raters showed inconsistent severity in the first year, but became more consistent in the following year. In a similar study using MFRM, Congdon and McQueen (2000) examined the stability of rater severity over a period of seven rating days by analyzing the ratings that 16 raters assigned to elementary school students' essays. Findings indicated that rater severity changed from day to day and over the rating period. Ten raters' severity estimates on the first rating day turned out to be significantly different from their estimates on the last rating day. These findings were in line with a study by Lumley and

McNamara (1995), which suggested that raters of a spoken English test changed in their severity over a 20-month period.

However, other studies came to different conclusions about changes in rater severity over time. For example, Lim (2011) investigated rater severity and consistency over three time periods in a span of 12 to 21 months in the writing section of the MELAB.

Except for one rater, who was inconsistent for a two-month period, the raters showed consistent rater severity and held consistent rating practices over time. Despite the mixed findings for rater severity, many have observed that raters often appear to change in their

severity during the rating process. Further investigation of rater severity in diverse testing contexts will help us better understand rater variability.

2.2.2.2 Rating scale use

Raters’ interaction with scoring criteria plays another role in rater variability, as raters tend to approach scoring criteria in different ways during the rating process.

Cumming’s study (1990) revealed different rating behaviors between experienced teacher raters and novice raters in an ESL writing evaluation. The experienced raters used a wide range of scoring criteria whereas the novice raters approached the task with a limited scope of rating criteria to evaluate the ESL essays. Differences between expert and novice raters were also observed in Barkaoui (2010). He found that, in an assessment of

ESL writing, novice raters appeared to rely more on the rating scales to make decisions than experienced raters did. The scope of the novice raters' ratings was restricted to local aspects of writing such as linguistic accuracy. By contrast, the experienced raters attended to rhetorical features and the overall presentation of ideas during the rating procedures.

In addition, raters appear to interact with scoring criteria inconsistently, depending on examinees’ language proficiency (Barkaoui, 2010; Meiron, 1998; Pollitt &

Murray, 1996). Pollitt and Murray (1996) observed that raters attended to grammar when assessing lower-level examinees, whereas they weighed content more heavily for higher-level examinees in a Cambridge oral interview assessment of spoken English. Raters also appeared to focus on aspects of speaking performance not explicitly included in the rating scales. For example, Meiron (1998) reported that raters of a speaking assessment unconsciously attended to how well examinees maintained communicative skills during the test, although communicative skills were not clearly defined in the rating scales. In

line with Meiron’s finding, Barkaoui (2010) noticed that experienced raters referred to external criteria, such as lengths of texts or a writer’s situation, to assign scores to examinees in a writing assessment. In short, studies on raters’ rating behaviors, in terms of severity and rating scale use, present mixed findings and remain inconclusive, demonstrating the need for further investigation of rater variability in various testing environments.

2.2.3 Task variability

Tasks are key to stimulating and eliciting language samples from examinees in performance-based language assessment. Bachman et al. (1995) argued that test scores derived from tasks and assigned by raters should be reliable in order to validate the inferences made from the scores regarding examinees’ language ability. In speaking assessments, tasks are described as “activities that involve speakers in using language for the purpose of achieving a particular goal or objective in a particular speaking situation”

(Luoma, 2004, p. 31). It is evident that tasks should be carefully designed to elicit adequate and rich speech samples from examinees.

However, researchers have found it to be a great challenge to define task difficulty because “tasks do not lend themselves readily to categorization for test purposes” (Iwashita, McNamara, & Elder, 2001, p. 404). In addition, the traditional approach to understanding task difficulty mainly considered the scope of task types and their appropriateness for eliciting ratable language output (Fulcher & Reiter, 2003).

Pollitt (1991) claimed that all tasks in a non-CAT format are of equal difficulty and that tasks in a performance test are presented to examinees in a sequence from easiest to most difficult, which would not be true in many operational testing formats

like CAT. Noticing the lack of research on task difficulty, Fulcher and Reiter (2003) addressed the need for discussion of the relative difficulty of tasks. Drawing on work in the

Second Language Acquisition (SLA) field, current language testing research has dealt with several topics associated with task difficulty. This section reviews different factors influencing task difficulty and the empirical methods used to estimate it.

2.2.3.1 Task difficulty factors

Empirical research has suggested that task difficulty can be determined by a mixture of different task properties, such as the number of objects, events, or individuals embedded in the task itself. From the SLA perspective, Skehan (1998) suggested a number of features that impact task difficulty, such as the number of participants or task components, the abstractness of information, familiarity of task information, types of task information, and structure of task information (p. 174). Brown and Yule's study (1983) revealed that examinees in a speaking test faced several task-related challenges, including the number of components in the test environment, situational or contextual stimuli, and the participants or interlocutors they had to communicate with.

Speaking test prompts were also found to affect task difficulty. Brown (1989) compared two prompts for a storytelling task in a speaking assessment. The first group of examinees was asked to describe scenes depicting three women delivering letters to an office, with one of the women stealing money from the letters. In contrast, the second group was asked to describe the same scenes, but was given a more detailed prompt explaining that one of the women was a thief. Compared to the second group, which provided more complex and rich descriptions of the story, the first group produced short and simple responses and found the task more burdensome.


Task difficulty is not an inherent characteristic of the task itself (Luoma, 2004); rather, it is determined by various components of the testing situation, such as interlocutors

(raters), and task delivery modes. Empirical studies have found that task difficulty was attributable to examinees’ diverse characteristics such as personal knowledge or academic background, proficiency level, and language background. For example, it was observed that examinees’ topic familiarity played a role in determining perceived task difficulty. In the case of writing assessments, some studies (e.g., Park, 1998) have investigated effects of two different writing tasks on examinees’ writing products. Park

(1998) compared two writing tasks: one, a traditional essay task and the other, a data commentary task with charts or graphs. Findings from the study suggested that examinees majoring in sciences or engineering performed better on the data commentary tasks than did examinees from social science majors.

Furthermore, task difficulty can also be determined by raters and not the task itself. In a study of prompt and rater effects in the MELAB, Lim (2009) examined the extent to which different writing prompts would influence raters’ rating behaviors.

Results showed that raters did not adjust their rating behaviors depending on the perceived prompt difficulty. Another aspect of task difficulty is relevant to raters’ selection of task types. In their examination of Interagency Language Roundtable (ILR) oral proficiency interviews (OPIs), Reed and Halleck (1997) found that examinees who received tasks from Rater 1 systematically received higher scores than examinees who received tasks from Rater 2. It turned out that Rater 1 selected lower-level role-play tasks motivating examinees to produce intermediate level responses. On the other hand, Rater

2, who used higher-level role-play prompts, asked intermediate to advanced level

questions to elicit the corresponding level of responses from the examinees. This shows that tasks for different levels tend to elicit different speech samples from the examinees as they interact with raters. Other studies showed that task difficulty was determined by testing situations involving aspects like time pressure or input types (e.g., Crookes &

Rulon, 1988; Iwashita, 1998). For example, researchers observed that greater time pressure on examinees resulted in an increase in task difficulty. As for input types, the sufficient provision of visual support led to decreases in task difficulty.

The discussion above has shown how previous researchers have investigated the diverse sources of variability potentially affecting task difficulty. The following part expands on this discussion by introducing the common empirical methods used to estimate task difficulty.

2.2.3.2 Empirical methods of estimating task difficulty

In language testing, task difficulty can be measured with advanced statistical methods using many-facet Rasch measurement (MFRM). Specifically, the advent of MFRM allowed researchers to estimate the difficulty of individual tasks in language testing (e.g., Bachman, Lynch & Mason, 1995; Brindley & Slatyer, 2002; Lynch &

McNamara, 1998; Norris, Brown, Hudson, & Bonk, 2002).
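For reference, a common formulation of the many-facet Rasch model with examinee, task, and rater facets is sketched below; the exact facets included vary across the studies cited.

$$\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k$$

Here $P_{nijk}$ is the probability of examinee $n$ receiving a rating in category $k$ from rater $j$ on task $i$, $\theta_n$ is the examinee's ability, $\delta_i$ is the difficulty of task $i$, $\alpha_j$ is the severity of rater $j$, and $\tau_k$ is the difficulty of the step from category $k-1$ to category $k$.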

Researchers have treated “difficulty” as a facet in MFRM analyses, which generate empirical measures of task difficulty. Fulcher (1993, 1996a) compared task difficulty across a picture description task, an interview based on a text, and a group discussion. Results of the statistical analyses revealed significant but extremely small differences in task difficulty in terms of the score variance accounted for. Another study (Bachman et al., 1995) yielded the same pattern of results as Fulcher's study, with the researchers observing

significant, but small differences in task difficulty between a task involving summarizing an academic lecture and a task on relating a theme from the lecture to the examinees’ experience.

Eckes (2005) investigated the quality of tasks in the speaking section of the Test of German as a Foreign Language (TestDaF) using MFRM. The speaking section consists of four parts, including a warm-up task (one task), situation-related communication tasks (four tasks), description tasks (two tasks), and presentation of arguments tasks (three tasks). MFRM was used to estimate individual item difficulty.

Findings from the MFRM analysis revealed that each speaking task discriminated equally between high- and low-proficiency examinees, as the thresholds for each speaking task were separated along the overall examinee proficiency scale.

Similar uses of the Rasch measurement were found not only in speaking assessments, but also in listening assessments (e.g., Brindley & Slatyer, 2002) and performance-based language assessment (Norris, Hudson, & Bonk, 2002). For example,

Brindley and Slatyer (2002) adopted the Rasch measurement to investigate key task characteristics and conditions that exerted strong influences on the difficulty of listening assessment tasks. The researchers compared the individual item difficulty of three tasks that were employed under different testing conditions. In a performance-based language assessment, Norris, Hudson, and Bonk (2002) explored the extent to which combinations of several cognitive factors (e.g., code complexity, cognitive complexity, and communicative demand) affected task difficulty based on the Rasch measurement.

Likewise, the Rasch approach has been widely utilized to estimate individual item difficulty across different types of language assessments.


2.2.4 Gaps in research

The second section of this chapter described the nature of language performance tests based upon Bachman's interactive model. The model conceptualizes the multifaceted factors in testing environments where diverse sources of variability affect the process of assigning scores to examinees. Among the noted issues, this section identified two major types of variability affecting test scores—rater variability and task variability. Studies on rater variability have attended to examining rater severity and raters' use of rating scales. The mixed results regarding rater variability call for further investigation of raters' rating practices. In particular, it has not been fully examined how rater variability plays a role when computers are integrated into the simultaneous rating process of a speaking assessment. With regard to task variability, much research has centered on factors that influence test scores and on the estimation of task difficulty using advanced statistical methods like the Rasch measurement. However, less attention has been paid to investigating the quality of task difficulty, especially when levels of task difficulty are pre-determined by the test specification.

In response to the need for research on rater and task variability, the current study aimed to investigate whether the rating procedure of a speaking assessment into which a web-based rating system was integrated was appropriate for generating accurate scores, by examining rater variability (rater severity and scale use) and task variability in this context. The following chapter details the context of this dissertation and the web-based rating system.


CHAPTER 3

CONTEXT AND ARGUMENT-BASED APPROACH TO VALIDATION

The main purpose of this study was to validate the interpretation and the uses of

OPI scores. Prior to the description of this study, the context of the study and the theoretical basis that informs the research questions are presented. This chapter begins with the Oral English Certification Test (OECT), of which the OPI is a part. Next, the development process and a description of the web-based rating system, Rater-Platform

(R-Plat), are presented. Finally, the argument-based approach to validation is introduced to map out the specific types of studies needed to support the validity of the score meanings and interpretations. This chapter concludes with the seven research questions drawn from the validation argument framework.

3.1 Context

The Oral English Certification Test (OECT) is intended to evaluate how effectively prospective international teaching assistants who are non-native speakers of English can communicate in English in academic and classroom settings at Iowa State University

(ISU). International graduate students who plan to work as teaching assistants are required to reach certain levels specified by their departments. These results serve as a basis for the departments or programs to assign teaching duties. The stakeholders of the test are divided into four groups. The first group, which has direct contact with the test, comprises the OECT administrators, raters, and test developers. The second

group consists of the examinees, who are prospective international teaching assistants. The third group comprises the undergraduate students who may be taught by these prospective international teaching assistants. The fourth group is the Graduate College, which funds the academic communication program in charge of test administration. The test is administered at the beginning and the end of each semester—a total of four times per academic year.

3.1.1 Components of the OECT

The OECT consists of three sections—Warm-Up, Oral Proficiency Interview

(OPI), and the TEACH simulation. In speaking assessment, an OPI is a face-to-face speaking test where examinees interact with interlocutors in real time. Such face-to-face test types allow an evaluation of examinees' ability to engage in interactive communication and, thus, are useful in language testing for many purposes (Lazaraton,

2014).

In the context of this dissertation, the OECT begins with a 2-minute, unscored warm-up question. The question aims to put examinees at ease by asking informal questions about daily life, summer/winter break, and so on. The OPI is composed of three impromptu speaking questions and a role-play question. Examinees are allowed no preparation time for the impromptu speaking questions. The prompts for the impromptu speaking questions are grouped into four intended difficulty levels—advanced, intermediate-high, intermediate-mid, and intermediate-low. The topics for each prompt level differ because they are intended to elicit different speech samples.

During the impromptu speaking questions, raters select the first prompt’s level based on the examinee’s performance during the warm-up session. Next, raters select the levels of subsequent prompts by adjusting to the examinee’s responses to prior questions. The

role-play prompts ask an examinee to act out a daily-life or academic situation and resolve a real-life task. Examinees are allowed one minute of preparation time, followed by a 2-minute conversation about the given situation with one rater, who acts as the interviewer.

The second section of the OECT is the TEACH simulation task, which takes approximately 10 minutes. One hour prior to the beginning of the OECT, examinees are given an hour of preparation time for the TEACH section. They can choose a topic from a list of possible topics in their discipline. During the test, they have a 2-minute preparation period to write notes or to draw graphs on the whiteboard. They teach their chosen topic for five minutes and answer questions from raters for three minutes. Table 3.1 presents the components of the OECT and their durations. All test sessions are both video- and audio-recorded.

Table 3.1 Components of the OECT

Section      Component                  Duration
Warm-up                                 2 minutes
OPI          3 Impromptu Questions      6 minutes
             Role-play                  2 minutes
TEACH        Preparation                2 minutes
             Lecture                    5 minutes
             Question – Answer          3 minutes

3.1.2 Rating procedure

A group of three raters evaluates an examinee’s speaking performance simultaneously while one of the raters in the group interviews the examinee during the

test. That is, one of these three raters becomes the interviewer for this examinee.

Sometimes, only two raters are grouped together to assess one examinee because of limited rater availability. In this case, one of the two serves as the interviewer.

Using the scoring criteria, raters assign a holistic score to an examinee's performance on each task, selected from eight score bands ranging from 0 to 300. Table 3.2 presents the score bands for each proficiency level in the OECT. An advanced-ability level ranges from a score of 240 to 300. An intermediate-high proficiency level ranges from 210 to 220. An intermediate-mid level ranges from 170 to 200. An intermediate-low level ranges from 0 to 160.

Table 3.2 Score bands for each proficiency level

Level               Score Bands     Scores
Advanced            Excellent       280 – 300
                    Very Strong     250 – 270
                    Strong          230 – 240
Intermediate-high   Adequate        210 – 220
Intermediate-mid    Limited         190 – 200
                    Very Limited    170 – 180
Intermediate-low    Poor            120 – 160
                    Not Competent   0 – 110

Raters assign holistic scores for each question given to an examinee, focusing on speech comprehensibility and on how appropriately and accurately the language is spoken during the test, rather than on the content the examinees deliver.

In the OPI, a rater's scores on the three impromptu questions and the role-play question are averaged to indicate the overall comprehensibility and effectiveness of the examinee's English speaking ability. In the TEACH section, a rater assigns a holistic score for the overall TEACH performance. Along with the holistic scores, raters may also write comments

and mark diagnostic descriptors to describe examinees' speaking ability during the rating sessions. These comments and diagnostic descriptor markings are then shared with examinees who request feedback on their performance after the test. In addition, instructors of the English speaking classes receive the information at the beginning of the semester from the OECT administrators and use it for diagnosing diverse dimensions of speaking ability.
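As a purely hypothetical illustration of the OPI averaging described above (the scores here are invented), a rater who assigns 210, 220, and 200 to the three impromptu questions and 210 to the role-play question would produce an OPI score of

$$\frac{210 + 220 + 200 + 210}{4} = 210.$$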

After the test, the OECT administrators review the scores assigned by the group of raters and determine the final level based on the average of their ratings. For example, if an examinee receives 200 on the OPI and 210 on TEACH, he/she is placed into the intermediate-mid level. The final decision is made based upon the raters' agreement on the final level assigned. If the three raters disagree about the level of an examinee's ability, the decision is passed to a fourth rater, who makes a final decision by reviewing the video recording of the examinee's test performance. Depending upon their final levels, examinees are required to take different speaking classes and complete pronunciation lab practice. An exception is made for examinees at the advanced level. Examinees at the intermediate-high level are conditionally certified, which requires them to take one semester of English speaking classes with two hours of weekly independent work in the pronunciation lab. Intermediate-mid level examinees are certified with restriction.

They need to take one to two semesters of English classes with three to six weekly practice sessions in the pronunciation lab. Intermediate-low examinees are required to take two to three semesters of English classes and to complete at least seven hours of pronunciation lab practice weekly. After completing these additional classes, all examinees are allowed to retake the test for certification as teaching assistants.


Table 3.3 presents the score ranges for OPI and TEACH for final placement decision.

Table 3.3 Score range of OPI and TEACH for final placement decision

                    Scores
Level               OPI          TEACH
Advanced            Above 230    Above 230
Intermediate-high   220          230
                    230          220
                    220          220
                    220          210
                    210          220
Intermediate-mid    200          210
                    210          200
                    200          200
                    200          190
                    190          200
                    190          190
                    180          190
                    190          180
                    180          180
                    180          170
                    170          180
                    170          170
Intermediate-low    160          170
                    170          160
                    Below 160    Below 160

The current study focused only on the OPI section for the following three reasons.

First, OPI rating data need to be connected because this is a requirement for the many-facet

Rasch measurement (MFRM) analysis—a key statistical technique for investigating the dependability of

OPI ratings, rater behaviors, and task difficulty. For Rasch measurement, data must be fully connected because all measures should be directly comparable in one frame of reference. However, it is difficult in practice to obtain fully linked data under complex rating practices and experimental designs. To diagnose data connectedness, FACETS, the software used for MFRM, scans the data to identify possible subsets of the data and offers a method to link the subsets (Linacre, 2012). Although FACETS guides toward linking the data

subsets, it is still challenging to sustain connectedness for ratings drawn from questions in the TEACH section. This is because TEACH topics cover discipline-specific knowledge and content, whereas OPI prompts contain more general content related to community, societal, and worldwide issues.

Second, the format of the TEACH section is inconsistent with that of the OPI, since it allows examinees an hour of preparation time for the lecture. In addition, an examinee is given two minutes of preparation time prior to a five-minute lecture, followed by a three-minute period to answer raters' questions. In contrast, the OPI prompts elicit examinees' impromptu speech without any preparation time. These different test conditions between the OPI and TEACH make it challenging to investigate scores from both sections together.

Third, the rating criteria for TEACH are inconsistent with those for the OPI. Although examinees' performances in both sections are evaluated holistically, the criteria for the

TEACH involve not only linguistic ability, but also skills in teaching, handling questions, and communication related to cultural aspects of classroom settings in the

U.S.

3.2 Rater-Platform (R-Plat)

A web-based rating system, Rater-Platform (R-Plat), was devised based upon the practical need to resolve issues in the conventional paper-based rating format and theoretical concerns about the validity of OPI score interpretations and uses. Practical needs for improvement in the rating format were identified through this investigator's observations as a certified rater of the OECT. The specific needs were further conceptualized through the diverse OECT stakeholders' responses to the needs analysis

conducted in Fall 2012 (Yang, 2012). The needs analysis aimed to investigate the opinions and needs of three OECT stakeholder groups, including five raters, two instructors of the speaking classes, and 22 students in the speaking classes. Focus group interviews and an online questionnaire were conducted to collect their opinions about raters' comments and diagnostic descriptor markings recorded in the paper-based rating formats, and their needs for a prospective web-based rating system. Several issues and needs for improvement in the rating process were identified.

Above all, rating with a paper rating form was time consuming and placed a burden on raters. Before the test began, raters completed the rating form with an individual examinee's information (e.g., name, department, test number, topics, and testing dates), which is not directly related to score assignment. After completing each rating, raters had to manually calculate the average of their ratings on the four OPI questions using calculators. This additional step placed an extra time burden on raters, which, consequently, prevented them from attending solely to evaluating the examinees. Additionally, it was challenging for the other stakeholders to read and interpret raters' handwritten comments and diagnostic descriptor ratings presented in the paper-based rating formats, due to their ambiguity, although the stakeholders recognized the significance of such information. The results from the needs analysis revealed the instructors' difficulty in interpreting raters' poorly handwritten comments on examinees' speaking abilities. The rating scales for the diagnostic descriptors, based on a four-point plus-and-minus scale, also prevented the raters and the instructors from making adequate use of the diagnostic descriptor markings.


Consideration of the validity of score interpretations and uses was another motivation for the development of R-Plat. In language testing, a score is an index of language ability or of the trajectory of linguistic development (Chapelle, 2012). The validity of interpretations and uses of test scores can be supported by gathering diverse pieces of validity evidence linked to test scores. In the setting of the OPI, for example, pieces of evidence to support score interpretation and uses are available, such as raters' diagnostic descriptor ratings and comments on examinees' performances. However, such information has not been fully scrutinized in connection with the scores, although it contains rich descriptions of an examinee's speaking performance during the test.

Moreover, the stakeholders' difficulty in interpreting the comments and the diagnostic descriptor markings minimized the utility of the information as an indicator of speaking ability. If this additional information were collected and utilized systematically to mirror different speaking ability levels, it could strengthen the justification of test score interpretation and use. Furthermore, scoring performance is as central to validity and reliability as the test design itself (Weir, 2005). However, the rating practice with the paper-based rating form had multiple limitations that hindered raters' rating performance, as addressed previously. Acknowledging the practical needs and in consideration of validity, R-Plat was devised to facilitate raters' rating procedures with user-friendly functionality and features, and to thoroughly collect different pieces of evidence indicative of different speaking abilities. R-Plat was devised through multiple iterations to reach the version used in this research, as depicted in Figure 3.1.


Figure 3.1 The development process of R-Plat

A needs analysis was conducted during Fall 2012. This analysis aimed to investigate the needs of diverse stakeholders of the OPI, to identify issues in the paper-based rating system, and to enhance rating practices. The stakeholders included

ITAs, raters, and instructors of the English speaking classes for prospective ITAs. The instruments used in this mixed-methods study included (a) two sets of focus group interviews with five OECT raters, (b) one focus group interview with two instructors of the speaking classes for ITAs, and (c) one questionnaire with 22 prospective ITAs who were preparing for the test. Findings revealed several issues with the paper-based rating procedure, such as ambiguous jargon and diagnostic descriptor categories in the rating sheets. The participants also shared their concerns about potential challenges they could encounter when using a new web-based rating system. They cast doubt on whether a web-based rating system could facilitate the rating procedure, because some older raters were less comfortable typing quickly and manipulating computers. An inability to type phonetic symbols, or possible noise from typing, could also distract raters and examinees.

However, the participants also expressed interest in a new web-based rating system and provided insightful suggestions for its future development.

R-Plat was developed during Spring 2013 based upon the results of the needs analysis and in consideration of the issues identified in the conventional paper-based rating forms.

R-Plat was developed using two programming languages, PHP and JavaScript,

together with a MySQL database. This database captures the following information, categorized into three areas: examinees' data, raters' data, and rating data. Examinees' data include

(a) first and last names, (b) school IDs, (c) disciplines, (d) email addresses, (e) testing date and time, (f) testing number, and (g) names of the raters assigned to examinees. Raters' data include (a) first and last names, (b) rater IDs, and (c) email addresses. Rating data involve (a) ratings on individual prompts, (b) total OPI and TEACH scores, (c) prompt information assigned to examinees, (d) diagnostic descriptor ratings, and (e) raters' comments. All data captured in the database can be extracted and saved as a CSV data file.
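As an illustration only (the actual R-Plat schema is not documented here, and all table and column names below are hypothetical), the three areas of the database might be organized along the following lines in MySQL:

    -- Hypothetical sketch of the three data areas described above; not the actual R-Plat schema.
    CREATE TABLE examinees (
      examinee_id   INT AUTO_INCREMENT PRIMARY KEY,
      first_name    VARCHAR(50),
      last_name     VARCHAR(50),
      school_id     VARCHAR(20),
      discipline    VARCHAR(100),
      email         VARCHAR(100),
      test_datetime DATETIME,        -- testing date and time
      test_number   VARCHAR(20)
    );

    CREATE TABLE raters (
      rater_id   INT AUTO_INCREMENT PRIMARY KEY,
      first_name VARCHAR(50),
      last_name  VARCHAR(50),
      email      VARCHAR(100)
    );

    CREATE TABLE ratings (
      rating_id    INT AUTO_INCREMENT PRIMARY KEY,
      examinee_id  INT,
      rater_id     INT,
      prompt_id    VARCHAR(20),      -- prompt assigned to the examinee
      prompt_score INT,              -- rating on an individual prompt (0-300)
      opi_total    INT,              -- averaged OPI score
      teach_total  INT,              -- TEACH score
      descriptors  TEXT,             -- diagnostic descriptor ratings
      comments     TEXT,             -- rater's comments
      FOREIGN KEY (examinee_id) REFERENCES examinees(examinee_id),
      FOREIGN KEY (rater_id) REFERENCES raters(rater_id)
    );

An export to a CSV file, as described above, could then be produced with a query (e.g., SELECT ... INTO OUTFILE in MySQL) or through the application layer.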

The first version of R-Plat was officially implemented during the operational testing periods in August 2013. R-Plat went through updates and modifications during

Fall 2013. The current version of R-Plat has been operationalized since Spring 2014.

The interface of R-Plat resembles the paper-based rating form in an effort to minimize the learning curve for raters becoming accustomed to R-Plat. R-Plat is also equipped with new features that facilitate rating procedures. The first of these features is the automatic presentation of examinee information and test materials.

Compared to the paper-based rating format, which provides the information on a separate sheet of paper, R-Plat presents all necessary information on a web page. Once raters log into R-Plat with their ID, they can check the number of examinees they are assigned to rate on a certain date in a calendar-like scheduler, as shown in Figure 3.2. After clicking on the number of examinees, raters go to the next page, which displays the list of examinees with relevant information, such as testing time, location, examinees' numbers, their first and last names, types of tests, rating status, and scores from the other raters. Since R-Plat

automatically displays all necessary examinee information, raters no longer write this information on the rating sheet. Instead, they can spend that time preparing for the rating itself.

Figure 3.2 Examples of rating schedule pages in R-Plat

To see the main rating page, raters click on the test number (a blue clickable button). An examinee's information is presented at the top of the page, as shown in

Figure 3.3—testing date, time, the rater's name, test number, examinee's name, department, and a dropdown menu to select the interviewer's name. Raters are supposed to double-check this information to ensure they evaluate the correct person. Raters can also view the scoring rubric by simply clicking on the button named “View Scoring

Rubric”. In the paper-based rating format, raters had a separate paper sheet for the rating rubric, and switched back and forth between rubric and rating sheets, which was cumbersome. In R-Plat, raters do not deal with several pieces of papers during the rating sessions, since the scoring rubric is available on the computer screen with the rating form.


Figure 3.3 Examinee’s information in the OPI rating page

Under the examinee’s information section, the main rating page is presented as shown in Figure 3.4. First, raters simply choose topics given to examinees from the dropdown menu instead of writing it in a paper-rating format. Second, the rating scales presented in stars-shape points are located next to each topic. Raters assign scores ranging from 0 to 300. The numbers placed above the scale refer to the score band—4 with intermediate-low level, 3 with intermediate-mid, 2 with intermediate-high, and 1 with advanced level. The raters’ scores automatically display next to the scale. For example, 24 means the rater assigns 240 on an examinee’s performance on the first question as shown in Figure 3.4. Third, raters opt to type comments about the examinee’s speaking performance for each question in the comment box. In the case of the OPI rating page, raters can write comments about the examinee’s performance for each prompt and the overall OPI section. Fourth, the scores given to each question are automatically averaged and displayed under the overall comment box. The automatic score calculation intends to lessen raters’ burden and save their time for the main rating exercises. Finally, raters add impression scores, which are independent from the total

OPI scores. This option allows raters to provide estimated scores on the examinee’s performance, based on their observations and impressions.


Figure 3.4 The OPI rating page in R-Plat
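As a simple worked example of the automatic score calculation, the sketch below averages a set of hypothetical prompt scores into a total OPI score; the numbers are invented, and the snippet does not reproduce R-Plat's actual code.

    # Illustrative calculation of the total OPI score as the average of the
    # per-prompt scores (each on the 0-300 scale); the values are made up.
    prompt_scores = [240, 220, 230, 210]   # one rating per OPI prompt

    total_opi = sum(prompt_scores) / len(prompt_scores)
    print(total_opi)   # 225.0

    # The impression score is recorded separately and does not feed into this average.
    impression_score = 230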

Another feature embedded in R-Plat is the diagnostic descriptor ratings located at the bottom of the rating page. In conjunction with the comment boxes, raters may use this feature to provide further descriptions of an examinee's speaking ability, if necessary. The diagnostic descriptors represent multifaceted aspects of speaking ability and are grouped into seven categories: comprehensibility, pronunciation, fluency, vocabulary, grammar, pragmatics, and listening. Each descriptor is evaluated on a five-point scale. Figure 3.5 depicts the seven diagnostic descriptors for 'Fluency' (phrasing, choppiness, halting, false starts, pauses, incompleteness, and pace) and the corresponding scales. The rightmost scale point refers to the poorest performance, whereas the leftmost scale point indicates the strongest performance.

Figure 3.5 Example of rating page for ‘Fluency’ in seven diagnostic categories in R-Plat

The rating page for the TEACH section has a similar interface with the same features (see Appendix A). After raters complete the OPI and TEACH rating pages, they are taken to the final score confirmation page, which displays a summary of the scores for both sections. At the end, raters may leave final comments on an examinee's overall performance (see Appendix B). In R-Plat, raters thus provide three types of input to indicate an examinee's speaking ability: scores for each prompt, diagnostic descriptor ratings, and comments. All inputs are automatically saved in the R-Plat database. Administrators of the OECT can view the raters' inputs immediately after raters submit them in R-Plat, which accelerates the score reporting process.


3.3 Interpretive Argument for the OPI scores

Situated in the argument-based validation approach, this dissertation aimed to collect validity evidence to support the interpretations of the OPI scores when raters utilized the web-based rating system for rating. This section first presents the interpretive argument, which specifies the inferences and assumptions relevant to the proposed score interpretation, and then introduces the specific research questions derived from the interpretive argument.

Central to the interpretive argument for the OPI is the construct the test is intended to measure: the ability to communicate in English in typical everyday, academic, and classroom situations. The interpretive argument was formulated to justify the interpretation and uses of the OPI score and comprises a chain of seven inferences: domain description, evaluation, generalization, explanation, extrapolation, utilization, and impact. Among the seven inferences, this study aimed to find validity evidence to support the evaluation and generalization inferences, because these two inferences are closely tied to the rating procedure and to the rating results produced when raters use R-Plat. In addition, these two inferences are foundational parts of the interpretive argument, linking to the subsequent inferences of the argument.

The following interpretive argument introduces the evaluation and generalization inferences that I investigated in this study. The other five inferences are presented with types of research that will be needed.


Domain description

The domain description inference links the target language use domain to the observation of performance on the OPI. This inference is supported by the warrant that observation of performance on the OPI reveals the relevant knowledge and strategies required in English-medium classrooms of higher education. This warrant is based on two assumptions.

The first assumption is that critical English language skills, knowledge, and processes needed for English-medium class instruction in higher education can be identified. Domain analysis is required to support this assumption. As part of domain analysis, experts' knowledge and judgment about required classroom language skills should be elicited through surveys or interviews. The potential experts consist of researchers in applied linguistics, instructors of English speaking classes for prospective ITAs, and instructors of content courses from different disciplines. Empirical investigations of instructors' classroom discourse are also needed to help determine the crucial English language skills and knowledge required for classroom instruction in higher education.

The second assumption is that OPI prompts requiring the speaking skills used for class instruction and daily communication can be simulated. For this assumption, OPI prompts need to be analyzed in relation to classroom instruction settings. Experts and test developers are asked to analyze the extent to which OPI prompts reflect authentic language skills and tasks required in university classes.


Evaluation inference

The evaluation inference connects the observed performance with raters’ outcomes, such as observed scores and observed performance descriptors. This inference is legitimized by a warrant that observed performance on the OPI tasks recorded in R-Plat is evaluated to provide observed scores and observed performance descriptors reflective of targeted speaking ability. Three assumptions underlying this warrant have been identified.

The first assumption is that the rating procedures used in R-Plat are appropriate for raters to assess the targeted speaking abilities. This assumption was investigated by exploring raters' perceptions towards R-Plat and their use of it during the rating process. Raters were asked about their opinions regarding the clarity and effectiveness of R-Plat, their degree of comfort with it, and their satisfaction with using it during the rating process; their opinions were gathered through surveys and interviews.

The second assumption is that the test administration conditions under which R-Plat is implemented are appropriate for providing evidence of the targeted speaking abilities. Raters' outcomes were analyzed to support this assumption, in terms of diagnostic descriptors and comments on examinees' speaking performances. In the analysis of the thirty diagnostic descriptor ratings, the relationships between raters' markings of the diagnostic descriptors and three proficiency levels were identified. The quality of raters' comments on examinees' speaking performances was determined by comparing positive and negative comments across the three proficiency level ratings. These comparisons were also conducted based upon the six criteria in the scoring rubric: functional competency, comprehensibility, pronunciation, fluency, vocabulary, and grammar.

The third assumption is that examinees’ performances on the OPI are evaluated adequately in such a way that yields observed scores reflective of different speaking abilities. To find evidence for the assumption, OPI scores were analyzed to examine the extent to which the OPI scores are distributed across different ability levels.

Generalization inference

The third inference is ‘generalization’ that links the observed scores to expected scores. The warrant is that observed scores recorded in R-Plat are dependable estimates of expected scores over the relevant parallel versions of prompts, and consistent within intended prompt levels and across/within raters. There are three assumptions underlying this warrant.

The first assumption is that a test reliably distinguishes examinees’ different speaking proficiency levels. Individual ratings on the OPI prompts were analyzed to address this assumption.

The second assumption is that examinees’ proficiency is evaluated consistently across prompts at the same difficulty levels. This assumption was studied by a comparison of the observed prompt difficulty levels with the intended prompt difficulty levels.

The third assumption is that examinees’ proficiency is evaluated consistently within/across raters. To find backing for this assumption, raters’ individual ratings were analyzed to investigate consistency in rater severity and in the raters’ use of rating scales within the test administration.


Explanation inference

The fourth inference is ‘explanation’, which links expected scores to the construct. This warrant is that expected scores are attributed to a construct of speaking ability in English-medium classrooms of higher education. Three assumptions underlie this warrant.

The first assumption is that the linguistic knowledge, processes, and strategies required to complete OPI prompts are pertinent to the English 180 syllabus, class assignments, and activities. Backing needs to be collected through empirical studies that compare the linguistic features and communicative strategies required in OPI prompts with authentic classroom language.

The second assumption is that scores collected via R-Plat relate to scores on other speaking assessments. This assumption requires research comparing examinees' performance on the OPI with their performance on other comparable speaking assessments, such as the International English Language Testing System (IELTS).

The third assumption is that the internal structure of the OPI scores collected via R-Plat is consistent with a theoretical view of speaking proficiency as a number of highly interrelated components. Reliability analysis and factor analysis should be conducted to examine whether the internal structure of the OPI scores represents the theoretical structure of speaking proficiency.

Extrapolation inference

The fifth inference is ‘extrapolation’, which connects the construct to the target scores. The warrant is that the construct for speaking proficiency as evaluated by the OPI

62 in support of R-Plat are relevant to the quality of linguistic performance required in the

English-medium classrooms of higher education.

The underlying assumption is that performance on the test is related to other criteria for language proficiency in real classroom instruction in higher education. Criterion-related validity studies are needed, employing examinees' self-assessment and an investigation of examinees' performance in English speaking classes for prospective ITAs. These studies would examine the relationships between OPI scores and other indicators of language performance in university classroom instruction.

Utilization inference

The sixth inference in the interpretive argument, utilization, links the target score to test use. This inference is based on the warrant that test results collected via R-Plat are useful for making decisions about teaching assignments and appropriate placement in ESL speaking classes for prospective international teaching assistants. The warrant is based on two assumptions.

The first assumption is that placement of examinees based on the test result is appropriate. This assumption needs to be supported by the opinions of instructors, both of English speaking courses for prospective ITAs and of content courses, about examinees' placement results.

The second assumption is that score reports delivered via R-Plat are clearly interpretable by diverse stakeholders, such as examinees, instructors, and administrators. Empirical research using questionnaires and interviews should be conducted to explore diverse stakeholders' opinions about the interpretability, clarity, and comprehensibility of the OPI test scores, diagnostic descriptors, and comments.


Impact

The last inference, ‘impact’, links the test use to its impact on stakeholders. This inference is based on the warrant that test results collected via R-Plat provide a positive influence on the course curriculum, test development, and diverse stakeholders.

The underlying assumption is that test results collected via R-Plat provide useful and rich information to stakeholders (examinees, instructors of English speaking classes, and test administrators) and contribute to the development of the test. This assumption should be backed by examining diverse stakeholders' perspectives on the usefulness and impact of the test results on (a) examinees' enhancement of English speaking skills, (b) curriculum development for the English speaking classes, and (c) test development. Interviews and questionnaires can be administered to collect this evidence. Table 3.4 below presents the inferences, warrants, assumptions, and backing for the interpretive argument for the OPI scores.

Table 3.4 Inferences, warrants, assumptions, and backing in the interpretive argument for the OPI in support of R-Plat

Domain description
  Warrant: Observation of performance on the OPI reveals relevant knowledge and strategies that are required in English-medium classrooms of higher education.
  Assumption 1: Critical English language skills, knowledge, and processes needed for English-medium class instruction of higher education can be identified.
  Backing 1: Domain analysis: expert opinion and consensus; surveys or interviews about classroom language; discourse analysis.
  Assumption 2: OPI prompts that require speaking skills used for class instruction and daily communication can be simulated.
  Backing 2: Analysis of the extent to which OPI prompts reflect authentic discourse features and tasks in classroom instruction.

Evaluation
  Warrant: Observed performance on the OPI tasks collected via R-Plat is evaluated to provide observed scores and observed performance descriptors reflective of targeted speaking ability.
  Assumption 1: Rating procedures in support of R-Plat are appropriate for raters to assess targeted speaking abilities.
  Backing 1: Analysis of raters' perceptions towards and their use of R-Plat.
  Assumption 2: Test administration conditions in support of R-Plat are appropriate for providing evidence of targeted speaking abilities.
  Backing 2: Analysis of rater outcomes collected via R-Plat: diagnostic descriptor ratings and raters' comments.
  Assumption 3: Examinees' performances on the OPI are evaluated adequately in such a way that yields observed scores reflective of speaking ability levels.
  Backing 3: Analysis of score distribution across different ability levels.

Generalization
  Warrant: Observed scores recorded in R-Plat are dependable estimates of expected scores over the relevant parallel versions of prompts, are consistent within intended prompt levels, and are consistent across/within raters.
  Assumption 1: The test reliably distinguishes examinees' different speaking proficiency levels.
  Backing 1: Dependability of the test.
  Assumption 2: Examinees' proficiency is evaluated consistently across prompts at the same difficulty levels.
  Backing 2: Analysis of prompt difficulty.
  Assumption 3: Examinees' proficiency is evaluated consistently within/across raters.
  Backing 3: Analysis of raters' rating patterns within/across raters.

Explanation
  Warrant: Expected scores are attributed to a construct of speaking ability in English-medium classrooms of higher education.
  Assumption 1: The linguistic knowledge, processes, and strategies required to complete OPI prompts are pertinent to the English 180 syllabus, class assignments, and activities.
  Backing 1: Comparative analysis of the key language features required to complete OPI prompts and the instructional tasks in English speaking classes for prospective ITAs and other content classes.
  Assumption 2: Scores collected via R-Plat relate to scores on other speaking assessments.
  Backing 2: Concurrent correlation studies.
  Assumption 3: The internal structure of the OPI scores collected via R-Plat is consistent with a theoretical view of speaking proficiency as a number of highly interrelated components.
  Backing 3: Studies of reliability and factor analysis.

Extrapolation
  Warrant: The construct of speaking proficiency as evaluated by the OPI in support of R-Plat is relevant to the quality of linguistic performance required in English-medium classrooms of higher education.
  Assumption: Performance on the test is related to other criteria for language proficiency in real classroom instruction of higher education.
  Backing: Criterion-related validity studies: examinees' self-assessment; examinees' performance in English speaking classes for prospective ITAs.

Utilization
  Warrant: Test results collected via R-Plat are useful for making decisions about teaching assignments and appropriate placement in ESL speaking classes for prospective international teaching assistants.
  Assumption 1: Placement of examinees based on the test result is appropriate.
  Backing 1: Washback: opinions of instructors (of English speaking courses for ITAs and of content courses) about examinees' placement results.
  Assumption 2: Score reports delivered via R-Plat are clearly interpretable by diverse stakeholders such as examinees, instructors, and administrators.
  Backing 2: Washback: stakeholders' opinions about the interpretability of score reports.

Impact
  Warrant: Test results collected via R-Plat give a positive influence on the course curriculum, test development, and diverse stakeholders.
  Assumption: Test results collected via R-Plat provide useful and rich information to stakeholders (examinees, instructors of English speaking classes, test administrators, and students of ITAs) and contribute to the development of the test itself.
  Backing: Washback: stakeholders' perspectives on the usefulness and impact of the test results.


3.4 Research Questions

Seven research questions were posed pertaining to the evaluation and generalization inferences. The first four questions were drawn from the evaluation inference; they were intended to collect evidence by investigating raters' perceptions towards R-Plat, the quality of the diagnostic descriptor ratings, and raters' comments associated with proficiency levels. The remaining three research questions were devised to support the assumptions of the generalization inference. These questions were formulated to collect evidence by scrutinizing the quality of the OPI scores and ratings in terms of how well they distinguish among examinees' abilities, how well the intended prompt difficulty levels match the observed difficulty levels, and how consistent raters' behaviors are.

§ RQ1: Did both experienced raters and new raters perceive R-Plat as appropriate for rating examinees' speaking ability during the OPI? (Evaluation: Assumption 1)

§ RQ2: To what extent are the raters' markings of diagnostic features indicative of examinees' speaking ability? (Evaluation: Assumption 2)

§ RQ3: To what extent are raters' comments indicative of examinees' speaking ability? (Evaluation: Assumption 2)

§ RQ4: Do the OPI ratings place examinees into different proficiency levels? (Evaluation: Assumption 3)

§ RQ5: To what extent do the scores reliably separate examinees based on speaking abilities? (Generalization: Assumption 1)

§ RQ6: To what extent do the intended difficulty levels of the OPI items match the observed difficulty levels? (Generalization: Assumption 2)

§ RQ7: To what extent are raters consistent in their severity and use of rating scales within each test administration? (Generalization: Assumption 3)

This chapter introduced the interpretive argument for the OPI scores, consisting of seven inferences. Of the seven, this study centered on the evaluation inference and the generalization inference. Four research questions were generated from the assumptions underlying the evaluation inference, and the remaining three stemmed from the assumptions of the generalization inference. The research questions were used to collect validity evidence. The next chapter, Chapter 4, specifies the methodologies used to conduct the empirical studies linked to each research question.


CHAPTER 4

METHODOLOGY

This chapter presents the methodology used in this dissertation. It begins with the research design. Next, it describes the participants, who included new and experienced raters of the OPI. The materials used in this dissertation comprise a questionnaire, focus group and individual interview protocols, OPI prompts, a scoring rubric, and diagnostic descriptors. The following section elaborates on the procedures for data collection and the rating results. The chapter concludes with the specific procedures for analyzing the data with respect to each of the seven research questions.

4.1 Research Design

The current study employed a mixed methods research design in which a convergent parallel design is embedded in a sequential design. The mixed methods design was adopted because it allows researchers to collect converging and convincing evidence by triangulating quantitative and qualitative data to reach a conclusion (Johnson & Onwuegbuzie, 2004). A visual representation of the data types and the data collection procedure is presented in Figure 4.1.

Figure 4.1 A sequential embedded mixed methods design of the current study


For administration 1 (ADMIN 1), a convergent design (Creswell & Clark, 2011) was employed to collect both quantitative and qualitative data concurrently. The quantitative data consisted of OPI ratings, diagnostic descriptor ratings, and raters' responses to the yes/no questions and six-point Likert statements in the questionnaire. The qualitative data contained raters' responses to the open-ended questions in the questionnaire, individual interviews, focus groups, and raters' written comments on examinees' test performances. The quantitative and qualitative data were analyzed separately and then triangulated to answer research questions 1 through 3. The convergent design at ADMIN 1 was embedded in an overarching sequential design because it connected with the subsequent data collection through five further test administrations. From ADMIN 1 to ADMIN 6, OPI ratings were collected for each test administration (ADMIN 1: 98 ratings, ADMIN 2: 104 ratings, ADMIN 3: 200 ratings, ADMIN 4: 114 ratings, ADMIN 5: 155 ratings, and ADMIN 6: 132 ratings). The OPI rating data were mainly used to answer research questions 4 through 7.

4.2 Participants

The participants were 18 OPI raters grouped into experienced and new raters.

Experienced raters had prior OECT (OPI and TEACH) rating experience. To be specific, all experienced raters were officially certified and had at least one year of rating experience. Among them, two raters, who were native speakers of English, had worked as certified raters for over seventeen years. The other experienced raters, both native and non-native speakers of English, were graduate students in the applied linguistics program at this institution. In addition, the experienced raters had completed the transition from the paper-based rating format to R-Plat. Prior to each test administration, all experienced raters joined 'rater brush-up sessions' for a minimum of three hours. These sessions were intended to provide consistent training to raters to ensure that they evaluated examinees based on the scoring rubric and administered the test as expected, so that the scores from the raters would accurately represent examinees' proficiency levels. During these sessions, raters previewed the prompts to be used for the upcoming test administration, reviewed the scoring rubrics, and practiced with sample video recordings.

In contrast to the experienced raters, the new raters were not certified as official raters and had no experience using R-Plat when I collected the data during ADMIN 1. They started to receive rater training from ADMIN 1. All were non-native speakers of English studying as graduate students in the applied linguistics program. To qualify as official raters, they were required to complete the new rater training, which began with an introduction to the test procedure, the scoring rubrics, the prompts, and R-Plat. These raters then practiced OPI rating by evaluating examinees' performances in the sample videos, followed by practice with actual students. One component of the rater training was an introduction to R-Plat: for an hour, the features and functionality of R-Plat were presented, and the new raters then used R-Plat during three hours of mock rating sessions with actual students.

The number of raters participating in this study differed at each test ADMIN due to their availability. Table 4.1 presents the number of raters participating in the official test administrations, the questionnaire, and the interviews.


Table 4.1 Number of new and experienced raters participating in the official rating sessions for each test administration, the questionnaire, and the interviews

                                        Test Administration
Instrument            Raters          1    2    3    4    5    6
OPI rating sessions   New             0    3    4    2    2    0
                      Experienced     8    9   10    6    9    7
Questionnaire         New             8    -    -    -    -    -
                      Experienced     6    -    -    -    -    -
Interviews            New             6    -    -    -    -    -
                      Experienced     6    -    -    -    -    -
Note: Questionnaire and interview data were collected only at test administration 1.

4.3 Materials

This section describes the five types of materials used to collect the data. To collect raters’ rating experiences and perceptions towards R-Plat, questionnaire and focus group/individual interview protocols were employed. In addition, OPI prompts, a scoring rubric, and diagnostic descriptors were utilized to investigate dependability of the OPI ratings, task difficulty, and raters’ rating practices.

4.3.1 Questionnaire

The purpose of the questionnaire, administered to eight new and six experienced raters, was to investigate the extent to which the raters were able to use R-Plat without difficulty during the rating process. The questionnaire contained yes/no questions, six-point scale statements, and open-ended questions (see Appendix C). It asked the raters to evaluate (a) the clarity of R-Plat features, (b) the effectiveness of R-Plat, (c) the raters' comfort levels with R-Plat, and (d) their satisfaction with R-Plat. In addition, the raters were asked how they used the diagnostic descriptor and comment features for evaluation. The questionnaire ended with questions eliciting the strengths and weaknesses of R-Plat and suggestions for its future improvement.

4.3.2 Focus group and individual interviews

Focus group and individual interview protocols were devised to investigate raters' experiences with, uses of, and perceptions towards R-Plat. The focus group interviews were mainly employed to collect rater perception data, particularly to elicit more in-depth descriptions of participants' experiences and opinions in a more interactive group setting (Krueger & Casey, 2009). Individual interviews were carried out with raters who could not join the focus group interviews; the same questions were used for both formats. The first part of the interview questions was intended to elicit raters' opinions about R-Plat in terms of its clarity and effectiveness, the raters' comfort level, and their satisfaction. The questions also asked how raters marked diagnostic descriptors and wrote comments to indicate different speaking abilities. Finally, the last part of the interview asked raters to share their opinions about the strengths and weaknesses of R-Plat and suggestions for its future improvement. The protocols for the focus group and individual interviews are presented in Appendix D.

4.3.3 OPI prompts

OPI prompts are developed to elicit examinees' speech samples and to determine examinees' functional abilities. The prompts fall into two categories: (1) prompts for impromptu questions and (2) prompts for role-play questions. These prompts are grouped into four intended difficulty levels: advanced, intermediate-high, intermediate-mid, and intermediate-low. Prompts at each level cover different ranges of content and functions.

Advanced level prompts elicit responses on concrete or abstract topics in a range of academic and non-academic topics, such as topics on practical and professional issues, or social, political, and environmental issues. These prompts intend to require examinees to convey abstract and complex ideas by constructing a persuasive argument or supporting their views on a given topic. They also ask examinees to project their opinions by hypothesizing or exploring alternative possibilities in a given situation.

Intermediate-high level prompts cover abstract and most concrete topics in a range of familiar and unfamiliar non-academic areas associated with community or worldwide issues. These prompts are designed to have examinees resolve situations involving a complication and to elicit their perspectives on the issues by comparing or contrasting different aspects of them. The prompts require examinees to explain, narrate, and describe issues or events in sufficient detail.

Intermediate-mid level prompts contain uncomplicated, familiar academic or non-academic topics relevant to personal experiences in work, school, home, recreation, leisure, family, and daily routine. Examinees are asked to convey meaning through simple narration, explanation, comparison, and description in simple situations.

Finally, prompts at the intermediate-low level ask examinees to simply narrate or enumerate ideas about daily life issues, such as basic objects, body parts, situations, colors, clothing, food, etc. The prompts for this level lead examinees to list, enumerate, imitate, and respond to simple, direct questions or requests.


Table 4.2 illustrates how OPI prompts at each difficulty level are developed to elicit different content and functions from examinees (Academic Communication Program, 2014). For each of the four difficulty levels, ranging from advanced to intermediate-low, the table presents an actual prompt associated with the topic of birthdays and the key function elicited by that prompt.

Table 4.2 Examples of OPI prompts (topic: birthdays)

Advanced: "What would people do if celebrating birthdays was prohibited?" (function: hypothesize)
Intermediate-high: "Some people spend a lot of money on their birthday party. Why do you think this is right or wrong?" (function: express opinions)
Intermediate-mid: "How do people in your home country usually spend their birthday?" (function: describe)
Intermediate-low: "What do you usually do on your birthday?" (function: list/enumerate)

4.3.4 Scoring rubric

The scoring rubric is categorized into six criteria: functional competency, comprehensibility, pronunciation, fluency, vocabulary, and grammar. Functional competency concerns the extent to which a speaker is able to perform tasks by using appropriate language and strategies. Comprehensibility is associated with the extent to which an examinee's speech is understandable without difficulty. Pronunciation relates to the extent to which an examinee articulates English sounds with adequate accent at the word and sentence levels. Fluency refers to an examinee's ability to link sentences into paragraphs and to speak with ease, without pauses, hesitations, or halting. Vocabulary is measured based on the scope of vocabulary and the use of native-like vocabulary and expressions. Grammar concerns the complexity of the sentence structures constructing one's speech and adequate control of grammar without errors. Specific descriptions of each criterion for different proficiency levels are provided in Appendix E. In the rating results, the holistic scores are presented in four score bands to indicate different ability levels: Advanced (230-300), Intermediate-high (210-220), Intermediate-mid (170-200), and Intermediate-low (below 160). Details about the scoring rubric are presented in Appendix F.
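As a worked illustration of the four score bands, the small function below maps a holistic score to a band label. The band boundaries follow the rubric described above, but the function itself, including how it treats any score falling between bands, is an assumption for the sketch rather than OECT policy.

    # Illustrative mapping from a holistic OPI score to the four score bands
    # described above; function name and handling of in-between scores are
    # assumptions for this sketch.
    def score_band(score: int) -> str:
        if score >= 230:
            return "Advanced"
        elif score >= 210:
            return "Intermediate-high"
        elif score >= 170:
            return "Intermediate-mid"
        else:
            return "Intermediate-low"

    print(score_band(240))  # Advanced
    print(score_band(180))  # Intermediate-mid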

4.3.5 Diagnostic descriptors

Diagnostic descriptors included in the R-Plat rating page indicate different aspects of English speaking ability. Thirty diagnostic descriptors were categorized into seven representative features of speaking ability: comprehensibility, pronunciation, fluency, vocabulary, grammar, pragmatics, and listening. During the rating process, raters had the option of marking the thirty diagnostic descriptors in addition to assigning holistic scores. The categorization of the diagnostic descriptors is presented in Table 4.3.

Table 4.3 Thirty diagnostic descriptors grouped by seven features of speaking ability

Comprehensibility: ease of understanding, accent, volume
Pronunciation: vowels, consonants, reduction, word stress, intonation, enunciation, insertion, rhythm
Fluency: phrasing, choppiness, halting, false starts, pauses, incomplete utterances/ideas, pace
Vocabulary: breadth of vocabulary, word choice/expression
Grammar: grammatical complexity, singular/plural, verb tenses/forms, pronouns, word order, articles, word form
Pragmatics: interaction, compensation strategies
Listening: listening


Each diagnostic descriptor can be marked on a five-point scale in R-Plat. On the scale, the leftmost point refers to the poorest performance, whereas the rightmost point indicates the best performance. For example, Figure 4.2 shows the scale used to evaluate comprehensibility, which was characterized by three diagnostic descriptors: ease of understanding, accent, and volume. During the rating session, raters could mark a point for any or all of these descriptors to indicate their judgment of the comprehensibility of an examinee's speech.

Figure 4.2 Evaluation of diagnostic descriptors for comprehensibility based on five-point scales
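The sketch below shows one plausible way to represent a rater's optional descriptor markings in code, using the category and descriptor names from Table 4.3 and the five-point scale described above; the data structure is an illustration, not R-Plat's internal format.

    # Sketch of one rater's optional diagnostic descriptor markings for a single
    # examinee; the dictionary representation is hypothetical.
    DESCRIPTORS = {
        "Comprehensibility": ["ease of understanding", "accent", "volume"],
        "Fluency": ["phrasing", "choppiness", "halting", "false starts",
                    "pauses", "incomplete utterances/ideas", "pace"],
        # the remaining categories (Pronunciation, Vocabulary, Grammar,
        # Pragmatics, Listening) would be listed the same way
    }

    # Raters may mark any subset of descriptors on the 1-5 scale;
    # unmarked descriptors are simply absent.
    markings = {
        ("Comprehensibility", "ease of understanding"): 4,
        ("Fluency", "pauses"): 2,
    }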

4.4 Procedure

Prior to data collection, raters’ agreements to participate in this study were obtained through consent forms approved by the ISU Institutional Review Board.

Different types of data were collected through six test administrations (ADMINS) as described in Figure 4.3. For ADMIN 1, the questionnaire and focus group/individual interviews data were collected. A separate data collection was employed for new and experienced rater groups. From ADMIN 1 through ADMIN 6, OPI rating results were collected from the certified raters. The following sections explain the specific data collection procedure for each ADMIN instrument.


Figure 4.3 Data collection procedure and timeline

4.4.1 Questionnaire

In ADMIN 1, the questionnaire was implemented via an online questionnaire platform, Qualtrics, with the purpose of investigating raters' perceptions and uses of R-Plat. Eight new raters and six experienced raters completed the questionnaire separately. Specifically, the eight new raters completed the questionnaire after the three-hour new rater training sessions for R-Plat in April 2014. During the training, they were introduced to the features and functionality of R-Plat, and they then used R-Plat for mock rating practice with the sample video recordings and with actual students. The six experienced raters completed the questionnaire after the official OPI testing periods at ADMIN 1 in May 2014. Questionnaire responses were extracted from the Qualtrics server and saved in Excel format for data analysis.

4.4.2 Focus group and individual interviews

Separate interviews were conducted with the new and experienced rater groups at ADMIN 1. These interviews were intended to obtain raters' in-depth descriptions of their rating experiences with R-Plat. Six new raters took part in a one-hour focus group interview after completing the questionnaire (one recording file). For the experienced raters, three joined a one-hour focus group interview after completing the questionnaire (one recording file), and the remaining three, who could not make it to the focus group interview, each participated in a one-hour individual interview (three recording files in total). Conversations during the interviews were audio-recorded, and five recordings were obtained from the interview sessions. The recordings were then transcribed for analysis, and raters' names were replaced with random numbers to ensure confidentiality.

4.4.3 OPI prompt rotation

Prior to collecting the OPI rating results, prompts were systematically rotated in an attempt to preserve the connectedness of ratings through repeating prompts. This was necessary because raters assigned different prompts to individual examinees, adjusting to the examinees' responses to the given prompts.

First, the prompts that were used most frequently at ADMIN 1 and ADMIN 2 were identified. The selected sets of OPI impromptu question tasks were 2, 5, 13, 16, 20, and 21, and the selected role-play sets were 5 and 7. The main reason for this prompt selection was that I intended to include the OPI ratings on these prompts from the two ADMINs in the statistical analysis, which could subsequently enhance the statistical power of the results. Next, each impromptu question prompt was labeled with its own unique number, ranging from 1 to 90, for subsequent data analysis. Table 4.4 presents the typology of impromptu question prompts and the associated set numbers.

Table 4.4 Prompts for impromptu question tasks

Set number   Prompt difficulty level   Prompt numbers
2            Advanced                   1   2   3
             Intermediate-high          4   5   6
                                        7   8   9
             Intermediate-mid          10  11  12
             Intermediate-low          13  14  15
5            Advanced                  16  17  18
             Intermediate-high         19  20  21
                                       22  23  24
             Intermediate-mid          25  26  27
             Intermediate-low          28  29  30
13           Advanced                  31  32  33
             Intermediate-high         34  35  36
                                       37  38  39
             Intermediate-mid          40  41  42
             Intermediate-low          43  44  45
16           Advanced                  46  47  48
             Intermediate-high         49  50  51
                                       52  53  54
             Intermediate-mid          55  56  57
             Intermediate-low          58  59  60
20           Advanced                  61  62  63
             Intermediate-high         64  65  66
                                       67  68  69
             Intermediate-mid          70  71  72
             Intermediate-low          73  74  75
21           Advanced                  76  77  78
             Intermediate-high         79  80  81
                                       82  83  84
             Intermediate-mid          85  86  87
             Intermediate-low          88  89  90


In addition, each role-play set contains 12 prompts, grouped by the four intended difficulty levels. Each role-play prompt was given a unique number, ranging from 91 to 114, as shown in Table 4.5.

Table 4.5 Prompts of role-play tasks

Set number   Prompt difficulty level   Prompt numbers
5            Advanced                   91   92   93
             Intermediate-high          94   95   96
             Intermediate-mid           97   98   99
             Intermediate-low          100  101  102
7            Advanced                  103  104  105
             Intermediate-high         106  107  108
             Intermediate-mid          109  110  111
             Intermediate-low          112  113  114

Third, after the selection of the prompts, three sets of impromptu question tasks and one set of role-play tasks were utilized for each day of administration. Table 4.6 presents how the sets were rotated at ADMIN 3, as an example. On day 1, the impromptu question sets (2, 5, 13) and one role-play set (7) were utilized, and the remaining sets were used on the following day; these sets were then rotated over the following days. Furthermore, interviewers assigned different prompts from the given sets to successive examinees in order to prevent examinees from cheating during the same testing day.

Table 4.6 Rotation of impromptu question prompts and role-plays at administration 3

                        The administration of the OECT
OPI tasks               Day 1       Day 2        Day 3       Day 4
Impromptu questions     2, 5, 13    16, 20, 21   2, 5, 13    16, 20, 21
Role-play               7           5            7           5
Note: Numbers indicate the sets of the impromptu question tasks and the role-play tasks.
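As an illustration of this rotation, the short sketch below alternates the two bundles of prompt sets across testing days, mirroring Table 4.6; the loop and variable names are invented for the example.

    # Illustrative day-by-day rotation of prompt sets at one administration:
    # two bundles of impromptu-question sets plus a role-play set alternate
    # across testing days (cf. Table 4.6).
    from itertools import cycle

    bundles = cycle([
        {"impromptu_sets": [2, 5, 13], "role_play_set": 7},
        {"impromptu_sets": [16, 20, 21], "role_play_set": 5},
    ])

    for day, bundle in zip(range(1, 5), bundles):
        print(f"Day {day}: impromptu sets {bundle['impromptu_sets']}, "
              f"role-play set {bundle['role_play_set']}")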


Likewise, the rotation of the prompts allowed for a crossed rating design for the MFRM analysis. However, it should be noted that it was challenging to completely restrict the range of prompt sets used for each examinee, because different prompts were assigned in response to examinees' performance during the test. Therefore, post-hoc connecting methods were employed when disjoint prompt subsets were found in the FACETS results: prompts in the disjoint subsets were connected by anchoring them at the average of the corresponding prompt subsets.
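To illustrate what disconnected subsets mean here, the sketch below runs a simple union-find routine over examinees, raters, and prompts linked by each rating and counts the resulting components; the ratings shown are made up, and this is an illustration of the connectedness check rather than the FACETS procedure itself.

    # Minimal sketch for detecting disjoint subsets in a rating design before an
    # MFRM run: each rating links an examinee, a rater, and a prompt, and the
    # design is connected when all nodes end up in a single component.
    ratings = [  # (examinee, rater, prompt) -- made-up data
        ("S1", "R1", 4), ("S1", "R2", 20), ("S2", "R2", 4),
        ("S3", "R3", 35), ("S4", "R3", 49),   # a second, unlinked cluster
    ]

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for examinee, rater, prompt in ratings:
        union(("E", examinee), ("R", rater))
        union(("E", examinee), ("P", prompt))

    components = {find(node) for node in parent}
    print(f"{len(components)} disjoint subset(s)")   # 2 here, so this design is not connected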

4.4.4 Rating results

With approval from the university’s Institutional Review Board, the raters’ ratings were requested from the academic communications program in charge of the test. Three types of raters’ rating results were mainly collected via R-Plat—a) thirty diagnostic descriptor ratings, b) raters’ comments, and c) OPI ratings for each prompt given to individual examinees. The first two types of rating results were collected only at ADMIN

1; whereas, OPI ratings were collected via R-Plat across the six administrations.

In particular, the OPI ratings were systematically gathered by controlling raters' rating assignments and limiting the range of OPI prompts each rater rated. This control of the rating assignments and the range of prompts was necessary to preserve the data connectedness required for conducting the Many-facet Rasch measurement (MFRM) analysis. To be specific, during data collection, raters were assigned to work in groups of two or three to evaluate certain examinee groups, generating separate rater groups. The separate rater groups were connected through repeating raters who were intentionally allocated to rate across different rater groups. This data collection procedure allowed for a three-facet, partially crossed rating design (Shavelson & Webb, 1991).

As shown in Figure 4.4, different examinee groups are assessed by different rater groups and these rater groups are partially crossed through repeating raters.

Figure 4.4 Partially crossed rating design

For example, examinees S1, S2, and S3 belong to Rater group 1 (R1, R2, and R3), while examinees S4, S5, and S6 are rated by Rater group 2 (R4, R5, and R6). Rater group 3 (R3, R5, and R6) then evaluated examinees S7, S8, and S9; their ratings are partially crossed with Rater group 1 through R3 and with Rater group 2 through R5 and R6. This allows the ratings by Rater group 1 to connect with those by Rater group 2 through Rater group 3. The same pattern of partially crossed rater groups was observed among the remaining raters. With this design, the ratings from the different rater groups were connected through the repeating raters who were partially crossed between the groups.

The partially crossed rating design was suitable for the OPI rating data in this study because the MFRM analysis is robust with regard to missing data and designs that are not completely crossed. However, the more loosely connected the data, the larger the errors in the estimates. After the data were collected based on this rating design, the rating results were extracted from the R-Plat database and saved as csv data files. All identifiers in the data, such as examinees' numbers and rater identifiers, were removed and replaced with random numbers.

Table 4.7 summarizes the research questions, data types, and analytic methods, organized by the two key inferences of this dissertation: evaluation and generalization. The research questions were derived from the assumptions underlying these two inferences; for each question, the table lists the data types described in this section and the analytic methods used to analyze the data.

Table 4.7 Summary of research questions, analytic methods, and data types

Evaluation inference
  RQ1: Did both experienced raters and new raters perceive R-Plat as appropriate for rating examinees' speaking ability during the OPI?
    Data types: 14 raters' responses to the six-point scale questionnaire items; 14 raters' responses to the open-ended questionnaire questions; focus group interview recordings with 6 new raters and with 3 experienced raters; individual interview recordings with 3 experienced raters
    Data analysis: descriptive statistics; one-way ANOVA; independent t-test; grounded theory
  RQ2: To what extent are the raters' markings of diagnostic features indicative of examinees' speaking ability?
    Data types: 2524 diagnostic descriptor markings from 146 ratings collected during ADMIN 1
    Data analysis: Chi-square test
  RQ3: To what extent are raters' comments indicative of examinees' speaking ability?
    Data types: 1900 evaluative units in raters' comments from 146 ratings collected during ADMIN 1
    Data analysis: analysis of raters' comments based on grounded theory; inter-coder reliability; descriptive statistics; Chi-square tests
  RQ4: Do the OPI ratings place examinees into different proficiency levels?
    Data types: OPI scores of 279 examinees collected through 6 administrations
    Data analysis: descriptive statistics

Generalization inference
  RQ5: To what extent do the scores reliably separate examinees based on speaking abilities?
    Data types: 803 individual raters' ratings collected through 6 administrations
    Data analysis: multiple imputation; principal component analysis; Many-facet Rasch measurement
  RQ6: To what extent do the intended difficulty levels of the OPI items match the observed difficulty levels?
    Data types: 803 individual raters' ratings on 73 prompts collected through 6 administrations
    Data analysis: Many-facet Rasch measurement
  RQ7: To what extent are raters consistent in their severity and use of rating scales within each test administration?
    Data types: 803 individual raters' ratings on 73 prompts collected through 6 administrations
    Data analysis: Many-facet Rasch measurement

The following sections describe the analytic methods utilized to answer each of the seven research questions.

4.5 Data Analysis

This section describes how the data were analyzed to address the seven research questions, which concern (1) raters' perceptions towards R-Plat, (2) the quality of raters' diagnostic descriptor ratings, (3) the quality of raters' comments, (4) the quality of the OPI scores, (5) the dependability of the OPI ratings, (6) the comparison of intended and observed prompt difficulty, and (7) the consistency of raters' ratings. The specific analytic methods used to address each research question are described in turn.


4.5.1 Raters’ perceptions towards R-Plat (RQ1)

The first research question (RQ1) focused on investigating raters’ perceptions towards R-Plat and their uses of R-Plat. Both quantitative and qualitative data collected during ADMIN 1 were analyzed and triangulated to address this research question. The quantitative data included 14 raters’ responses to the six-point scale items in the questionnaire. The qualitative data contained 14 raters’ responses to the open-ended items in the questionnaire, 2 separate recordings from the focus group interviews (1 with the new raters and 1 with the experienced raters), and 3 recordings of individual interviews with 3 experienced raters.

The raters’ responses to the six-point scale statements were analyzed using descriptive statistics, one-way ANOVA, and independent samples t-tests to determine the raters’ perceptions towards R-Plat in terms of clarity of R-Plat, raters’ comfort level with

R-Plat, effectiveness of R-Plat, and raters’ satisfaction with R-Plat. For each aspect of raters’ perceptions, descriptive statistics were first employed to obtain means and standard deviations for raters’ Likert-scale choices, which can demonstrate their general perceptions of experienced and new raters towards R-Plat.

Next, to compare experienced and new raters’ perceptions, a series of one-way

ANOVA analyses were conducted separately for clarity of R-Plat and raters’ comfort levels with R-Plat. Each analysis had one independent variable, raters’ rating experience, which had two treatment levels, new raters and experienced raters. Each analysis also had one dependent variable. The dependent variables were the sum of raters’ responses to 11 six-point scale statements associated with the clarity of R-Plat, and the sum of their responses to 4 six-point scale statements about raters’ comfort. A one-way ANOVA

86 relies on the assumptions of independence, normality and homogeneity of variance

(Bachman, 2004). The assumption of independence is that individual raters’ responses to the six-point scale statements are independent. This assumption was satisfied because experienced and new raters were completely distinguishable in terms of their pervious rating experience and uses of R-Plat. They also responded to six-point scale statements independently. A test for the normality assumption was conducted to establish whether observations from each rater group were normally distributed. In order to investigate normal distribution, skewness, kurtosis, and Shapiro-Wilk statistics were examined.

Values for skewness and kurtosis of between -2 and +2 indicate a normal distribution of data (Bachman, 2004). If the significant value of the Shapiro-Wilk test is greater than

0.05, the data is normal. Lastly, a test for equality of variance was run based on Levene’s statistics. Significant value for Levene’s test should not be significant (α >.05) to satisfy this assumption. The assumption tests yielded acceptable values, as reported in Chapter

5.

In addition, independent samples t-tests were used to compare new and experienced raters' perceptions of the effectiveness of R-Plat and their satisfaction with R-Plat. Each of these aspects was evaluated with one six-point scale statement. The independent variable was raters' rating experience, consisting of two treatment levels: new and experienced raters. The dependent variables were raters' responses to the scale statement for the effectiveness of R-Plat and for their satisfaction with R-Plat, respectively. The assumptions for the independent samples t-test are the same as those required for one-way ANOVA and were tested prior to the t-tests.
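For illustration, the sketch below reproduces this sequence of checks and comparisons on made-up questionnaire sums for the two rater groups, using scipy as a stand-in for whichever statistical package was actually used; the numbers and variable names are hypothetical.

    # Minimal sketch of the assumption checks and group comparisons described
    # above, using made-up questionnaire sums for new vs. experienced raters.
    from scipy import stats

    new_raters = [52, 48, 55, 60, 47, 58, 50, 54]      # sum of 11 clarity items (invented)
    experienced_raters = [57, 61, 49, 63, 56, 59]

    for label, group in [("new", new_raters), ("experienced", experienced_raters)]:
        print(label,
              "skew:", round(stats.skew(group), 2),
              "kurtosis:", round(stats.kurtosis(group), 2),
              "Shapiro-Wilk p:", round(stats.shapiro(group).pvalue, 3))

    # Homogeneity of variance: Levene's test should be non-significant (p > .05).
    print("Levene p:", round(stats.levene(new_raters, experienced_raters).pvalue, 3))

    # With two groups, a one-way ANOVA and an independent samples t-test are
    # equivalent tests of the group-mean difference (F = t squared).
    print("ANOVA:", stats.f_oneway(new_raters, experienced_raters))
    print("t-test:", stats.ttest_ind(new_raters, experienced_raters))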


In addition, the qualitative data, including raters' written responses to the open-ended questionnaire items and their verbal responses in the focus group and individual interviews, were analyzed on the basis of grounded theory, which allows themes to emerge from the data (Glaser & Strauss, 1967). The findings from the qualitative data analysis were then triangulated with the quantitative results to provide richer descriptions of the raters' perceptions and use of the R-Plat features.

4.5.2 Comparisons of diagnostic descriptor markings (RQ2)

The second research question (RQ2) compared the diagnostic descriptor markings across different proficiency levels. A total of 146 proficiency level ratings (50 advanced, 39 intermediate-high, and 57 intermediate-mid) assigned by raters during ADMIN 1 were analyzed. For each rating, frequencies were counted for the thirty diagnostic descriptors, each of which could be marked on a five-point scale. In total, 2524 diagnostic descriptor markings were analyzed to answer this research question. A Chi-square test was implemented to compare diagnostic descriptor markings at each scale point across the three proficiency levels. Furthermore, seven Chi-square tests were conducted separately to compare the diagnostic descriptor markings across the three proficiency levels for each of the seven diagnostic features of speaking ability: comprehensibility, pronunciation, fluency, vocabulary, grammar, pragmatics, and listening. Additionally, raters' responses to the questionnaire and the interviews were examined with the purpose of understanding their perceptions of and rationales for their markings of the diagnostic descriptors.
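The sketch below illustrates this kind of Chi-square comparison with a made-up contingency table of descriptor markings (rows: the three observed proficiency levels; columns: the five scale points); the counts are hypothetical and are not the study's data.

    # Sketch of a Chi-square comparison of descriptor markings across the three
    # proficiency levels; the counts are invented for illustration only.
    import numpy as np
    from scipy.stats import chi2_contingency

    observed = np.array([
        #  scale point:  1    2    3    4    5
        [  5,  20,  60, 150, 200],   # advanced
        [ 15,  60, 120, 110,  40],   # intermediate-high
        [ 60, 150, 160,  70,  20],   # intermediate-mid
    ])

    chi2, p, dof, expected = chi2_contingency(observed)
    print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.4f}")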


4.5.3 Raters’ comments as indicators of speaking ability levels (RQ3)

The third research question (RQ3) aimed to investigate the extent to which raters' comments provide indicators of examinees' speaking abilities. The data included 1900 evaluative units extracted from the raters' comments about the examinees' performance in 146 ratings. These ratings were at the advanced, intermediate-high, and intermediate-mid levels; no ratings were assigned at the intermediate-low level.

The analysis of raters’ comments unfolded as follows. First, the grounded theory method (Glaser & Strauss, 1967) was employed to identify themes in the raters’ comments about individual examinees’ performance collected at ADMIN 1. These themes served as the analytic scheme to code types of raters’ comments. Eight experienced raters officially rated at ADMIN 1, leaving comments in R-Plat as they listened to examinees’ performances. Among the multiple themes that emerged from the data, two representative categories of comments were identified—positive and negative.

Raters’ positive comments often included complimentary expressions, such as

“excellent,” “strong,” and “good,” etc. Negative comments included expressions about erroneous language use and raters’ criticisms about examinees’ weaknesses.

Classifying qualitative data into positive and negative categories is not a common approach in the area of language assessment. However, in the field of business and technical communication, which focuses on written discourse, Mackiewicz (2007) used this approach to examine the frequencies of compliments and criticisms in 48 book reviews from three business communication journals. Considering that raters' comments on examinees' performance exhibited positive and negative statements, I adopted this approach and divided the comments into positive and negative categories. The comments in each category were then further analyzed using the metric of the evaluative unit, defined as a segment (a word, phrase, or clause) that expresses a rater's evaluation of an examinee's language.

Table 4.8 displays selected examples of raters' positive and negative comments and the number of evaluative units extracted from these comments. These comments were selected from the advanced-level ratings because they demonstrate the different types of evaluative units that were often observed in the comments.

Table 4.8 Examples of raters’ comments and their corresponding evaluative units selected from the advanced-level rating Comment Evaluative Examples of Raters’ Comments Types Units (N) “No effort to understand. Excellent enunciation, Positive 3 vocabulary.” (Rater 2) “Some word stress issues. Lots of pausing and Negative halting when nervous. Some sounds deleted. ("w" in 4 wooden)” (Rater 1)

An example of a positive comment, from Rater 2, is "No effort to understand. Excellent enunciation, vocabulary." This comment consists of three segments describing a test taker's speaking ability at the advanced proficiency level. An example of a negative comment, from Rater 1, is "Some word stress issues. Lots of pausing and halting when nervous. Some sounds deleted. ('w' in wooden)." In this comment, four segments, including "word stress," "pausing," "halting," and "deletion," were identified. Based upon the coding scheme, frequencies were counted for positive and negative evaluative units, and these two sets of frequencies were compared (1) across the three observed proficiency level groups and (2) for each of the six scoring criteria on the rubric (functional competency, comprehensibility, pronunciation, fluency, vocabulary, and grammar).


Figure 4.5 depicts the procedures for analyzing raters’ comments to obtain evaluative units for each proficiency level rating. This scheme identifies evaluative units and categorizes them into positive and negative components, and each of the six categories for the scoring rubric.

Figure 4.5 Schematic diagram of procedures for analyzing raters' comments (Note: E.U. refers to an evaluative unit)

Next, a statistical analysis was conducted using descriptive statistics and Chi- square tests. Descriptive statistics were employed to obtain frequencies and percentages of positive and negative evaluative units for each proficiency level and for each category of the scoring rubric. Then, seven Chi-square tests were employed to investigate

1 E.U. refers to an evaluative unit.

91 differences in frequencies of positive and negative evaluative units across the three proficiency levels for each category of the scoring rubric. The two categorical variables were proficiency levels, and positive and negative evaluative units. The significant Chi- square test with a critical p-value of less than .05 indicates proficiency levels are dependent upon frequencies of positive and negative evaluative units. For example, if frequencies of positive evaluative units of the advanced proficiency level were significantly greater than those for the negative evaluative units, and the patterns were different from that for the intermediate-high and the intermediate-mid levels, the question about the raters’ comments providing indicators of examinees’ speaking ability would be positive.
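To make this analysis concrete, the sketch below runs one such Chi-square test of independence in Python with pandas and scipy rather than the SPSS procedure used in the study; the frequency counts are hypothetical placeholders, not the study’s data.

```python
# A minimal sketch of one of the seven Chi-square tests, assuming
# hypothetical counts of positive and negative evaluative units for a
# single scoring-rubric category (the study itself used SPSS).
import pandas as pd
from scipy.stats import chi2_contingency

# Rows: proficiency levels; columns: polarity of evaluative units.
counts = pd.DataFrame(
    {"positive": [120, 80, 45], "negative": [30, 70, 110]},
    index=["advanced", "intermediate-high", "intermediate-mid"],
)

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
if p < .05:
    # Polarity of evaluative units is associated with proficiency level.
    print("Frequencies of positive/negative units differ across levels.")
```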

To determine the quality of coding, inter-coder reliability was calculated to estimate the consistency of the coding of raters’ comments between two coders. I invited a second coder, a certified OECT rater and Ph.D. student in the applied linguistics and technology program, whose background in language assessment research and rating experience were taken into consideration for selection. The second coder then attended a one-hour training session to become familiar with the coding scheme for positive and negative evaluative units as well as for each category of the scoring rubric. After practice coding with sample raters’ comments, the second coder was asked to code twenty percent of the evaluative units (380 evaluative units) drawn from each proficiency level rating (advanced, intermediate-high, and intermediate-mid).

When the data were shared with the second coder, the original proficiency levels associated with the comments were not disclosed, to prevent any possible influence of proficiency level information on the coder’s decision-making processes. Using the template shown in Table 4.9, the second coder identified each evaluative unit as positive or negative and assigned it to one of the criteria in the scoring rubric. In Table 4.9, the first two columns include the original comments and the extracted evaluative units. The third and fourth columns are where the second coder labeled positive (P) or negative (N) evaluative units. The next six columns correspond to the categories of the scoring rubric: functional competency (F), comprehensibility (C), pronunciation (Pr), fluency (Fl), vocabulary (V), and grammar (G). His coding was saved in an Excel file for computing inter-coder reliability using SPSS version 22.

Table 4.9 An example template for coding evaluative units

Original comment: “No effort to understand. Excellent enunciation, vocabulary.”
  Evaluative unit “No effort to understand.”: Positive (P); Comprehensibility (C)
  Evaluative unit “Excellent enunciation, vocabulary.”: Positive (P); Pronunciation (Pr)

Original comment: “Some word stress issues (see paper copy). Lots of pausing and halting when nervous. Some sounds deleted. (‘w’ in wooden)”
  Evaluative unit “Some word stress issues.”: Negative (N); Pronunciation (Pr)
  Evaluative unit “Lots of pausing and halting when nervous.”: Negative (N); Fluency (Fl)
  Evaluative unit “Some sounds deleted.”: Negative (N); Pronunciation (Pr)

Note: Pos = Positive (P); Neg = Negative (N); Func = functional competency (F); Comp = comprehensibility (C); Pronun = pronunciation (Pr); Fluen = fluency (Fl); Vocab = vocabulary (V); Gram = grammar (G).

After obtaining the second coder’s coding, the labels were converted to numeric codes for the statistical analysis: Positive (1), Negative (2), Functional competency (3), Comprehensibility (4), Pronunciation (5), Fluency (6), Vocabulary (7), and Grammar (8). With these prepared data, Cohen’s Kappa was used to compute two components of inter-coder agreement: (1) identification of positive and negative evaluative units, and (2) identification of categories in the scoring rubric. Cohen’s Kappa (Cohen, 1960), computed with SPSS version 22, was selected because it estimates the proportion of agreement after excluding agreement expected by chance, which enhances the accuracy of the agreement estimate. Interpretation of Cohen’s Kappa was based on the guidelines of Landis and Koch (1977): less than chance agreement (< 0), slight agreement (0.01–0.20), fair agreement (0.21–0.40), moderate agreement (0.41–0.60), substantial agreement (0.61–0.80), and almost perfect agreement (0.81–0.99). To support the quality of the coding, a Cohen’s Kappa of 0.81 or higher was required.
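As an illustration of this agreement check, the sketch below computes Cohen’s Kappa with scikit-learn rather than SPSS; the coder labels are invented examples using the numeric codes listed above, not the actual codings.

```python
# A minimal sketch of the two inter-coder agreement estimates, assuming the
# numeric codes above (1 = Positive, 2 = Negative; 3-8 = rubric categories).
from sklearn.metrics import cohen_kappa_score

# Polarity codes assigned to the same evaluative units by the two coders.
coder1_polarity = [1, 1, 2, 2, 1, 2, 2, 1]
coder2_polarity = [1, 1, 2, 2, 1, 2, 1, 1]

# Rubric-category codes for the same units (e.g., 5 = pronunciation).
coder1_category = [4, 5, 5, 6, 7, 5, 6, 8]
coder2_category = [4, 5, 5, 6, 7, 5, 5, 8]

kappa_polarity = cohen_kappa_score(coder1_polarity, coder2_polarity)
kappa_category = cohen_kappa_score(coder1_category, coder2_category)
print(f"Kappa (positive/negative): {kappa_polarity:.2f}")
print(f"Kappa (rubric category):   {kappa_category:.2f}")
# Values are interpreted against Landis and Koch (1977); 0.81 or higher
# was required here to support the quality of the coding.
```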

4.5.4 Descriptive statistics (RQ4)

The fourth research question (RQ4) aimed to examine whether OPI scores separated examinees into different proficiency levels. Descriptive statistics for the OPI scores of 279 examinees, drawn from six administrations, were computed using SPSS version 22. First, the ranges of the OPI scores across different ability levels at each ADMIN were examined. The distributions of the OPI scores, together with their skewness and kurtosis, were then examined based on histograms for each ADMIN to determine whether the scores were normally distributed across a wide range, which would support the adequacy of the test for norm-referenced purposes. Finally, the standard deviations of the OPI scores at each ADMIN were examined under the assumption that the groups of examinees were equivalent across ADMINs.
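For illustration only, the sketch below computes the same kinds of descriptive statistics with pandas instead of SPSS; the DataFrame and its values are hypothetical, not the 279 examinees’ actual OPI scores.

```python
# A minimal sketch of per-administration descriptive statistics, assuming a
# hypothetical table of OPI scores (the study used SPSS version 22).
import pandas as pd

scores = pd.DataFrame({
    "admin": ["ADMIN1", "ADMIN1", "ADMIN1", "ADMIN1",
              "ADMIN2", "ADMIN2", "ADMIN2", "ADMIN2"],
    "opi_score": [23.5, 18.0, 21.5, 20.0, 21.0, 25.5, 19.5, 24.0],
})

summary = scores.groupby("admin")["opi_score"].agg(
    n="count", mean="mean", sd="std", minimum="min", maximum="max",
    skewness=pd.Series.skew, kurtosis=pd.Series.kurt,
)
print(summary)  # ranges, SDs, skewness, and kurtosis for each ADMIN
```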


4.5.5 OPI ratings as reliable indicators of different speaking ability levels (RQ5)

The fifth research question (RQ5) investigated the extent to which OPI ratings could reliably separate examinees into different ability levels. This question was answered mainly with descriptive statistics and a Many-facet Rasch Measurement (MFRM) analysis. The MFRM analysis was conducted after checking the unidimensionality assumption. Checking this assumption required treating prompts to which examinees did not respond as missing and imputing values for these missing data, which made it possible to investigate the dimensionality of the data set with principal component analysis (PCA). The data for this research question came from the 803 individual raters’ ratings collected from all six ADMINs.

Descriptive statistics were computed using SPSS version 22 to examine whether the OPI ratings were normally distributed across different ability levels at each ADMIN. Next, the unidimensionality assumption was checked to examine whether the prompts within each of the four difficulty levels measured a single construct, hypothesized to be speaking ability in the OPI. This assumption check was essential for the MFRM analysis. The MFRM is an extension of the one-parameter (1PL) Rasch model within item response theory (IRT). In the 1PL model, the observed data are viewed as a manifestation of a person-oriented latent factor, which assumes the unidimensionality of the observed variables. That is, the unidimensionality assumption posits that “the observations on the manifest variable are solely a function of a single continuous latent person variable” (Ayala, 2009, p. 20). In this study, a single latent speaking ability variable was assumed to underlie the examinees’ performance on the OPI prompts.


To examine the dimensionality of the data set, missing data in the 803 individual raters’ ratings collected from all six ADMINs were imputed using the multiple imputation method in SPSS version 22. Multiple imputation is a simulation-based procedure that replaces missing data with imputed values generated from a specified regression model. In the current study, the multiple imputation was based on linear regression: the independent variable was the prompt difficulty level, and the dependent variable was the individual examinee’s score at each prompt level.

Missing data arose in the 803 individual raters’ ratings because each examinee was randomly tested on only four of the 99 possible prompts during the OPI. To impute these missing data, I first identified the prompts given to each examinee by reviewing the audio recordings associated with the collected OPI ratings. Then, I averaged the two or three raters’ ratings on each prompt given to an individual examinee to determine the examinee’s average ability level on that prompt.

An individual examinee’s scores for each prompt were grouped by prompt level. Finally, the averaged scores on each intended prompt level were calculated for each examinee.

This procedure was possible because the prompts at each intended level were assumed to be equal in difficulty, according to the test specification of the OPI. By obtaining the average scores for each intended prompt level, the rate of missing data was reduced, and a multiple imputation analysis was then conducted in SPSS version 22 to impute the remaining missing data. The imputed variable was always the same: individual examinees’ scores at each prompt level. Since the number of imputations was set to five (the default), the multiple imputation generated five separate datasets containing imputed values. These five datasets were then used to conduct the principal component analysis in SPSS version 22, generating five separate outputs. It should be noted that multiple imputation is not commonly used by researchers in the area of language testing to deal with missing data. However, the procedure may be defensible in this dissertation because the imputed data were used only to check the unidimensionality assumption as part of the Rasch measurement.
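The sketch below illustrates the general idea of generating five regression-based imputed datasets, using scikit-learn’s IterativeImputer rather than the SPSS linear-regression imputation described above; the score matrix is hypothetical.

```python
# A minimal sketch of multiple imputation producing five completed datasets,
# assuming a hypothetical examinee-by-prompt-level score matrix with missing
# cells (the study used the SPSS multiple imputation procedure, not sklearn).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Rows: examinees; columns: averaged scores at the four intended prompt levels.
scores = np.array([
    [22.0, 21.5, np.nan, 20.0],
    [np.nan, 24.0, 23.5, np.nan],
    [18.0, np.nan, 17.5, 16.0],
    [25.0, 24.5, np.nan, 23.0],
    [20.5, 19.0, 18.5, np.nan],
])

imputed_datasets = []
for m in range(5):  # five imputations, mirroring the SPSS default
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    imputed_datasets.append(imputer.fit_transform(scores))

# Each of the five completed matrices would then be submitted to a separate PCA.
print(len(imputed_datasets), imputed_datasets[0].shape)
```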

Next, the PCA was completed for each set of imputed data, yielding five separate outputs. In the PCA outputs, the unidimensionality of the prompts was evaluated with reference to (a) Bartlett’s test of sphericity and the Kaiser-Meyer-Olkin measure of sampling adequacy, which indicate the appropriateness of the common factor model, (b) the proportion of variance explained by the first factor relative to the second factor, (c) the eigenvalue of the first factor, and (d) the scree plots. To be specific, all correlation matrices were first examined for the appropriateness of the common factor model based on Bartlett’s test of sphericity and the Kaiser-Meyer-Olkin measure of sampling adequacy. For Bartlett’s test of sphericity, a small significance level for an iteration output leads to rejection of the hypothesis that the population correlation matrix is an identity matrix. For the Kaiser-Meyer-Olkin measure of sampling adequacy, the iteration outputs showed values above .50 for all matrices, which indicates adequate sampling (Kaiser, 1974). Second, if the total variance explained by the first factor is substantially greater than that of the subsequent factors, this suggests evidence of unidimensionality. Third, the number of eigenvalues greater than one indicates the number of common factors specified in the model. Finally, in the scree plots, the point where the slope of the curve clearly levels off indicates the number of factors generated by the analysis.
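As an illustration of these checks, the sketch below uses the factor_analyzer and numpy packages on one simulated, already-imputed dataset; the data, and the use of Python instead of SPSS, are assumptions for the example only.

```python
# A minimal sketch of the unidimensionality checks (Bartlett, KMO,
# eigenvalues) on one hypothetical imputed dataset of examinee scores
# at the four prompt levels; the study ran these checks in SPSS.
import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity, calculate_kmo)

rng = np.random.default_rng(1)
ability = rng.normal(size=200)                      # one latent speaking ability
X = pd.DataFrame(ability[:, None] + rng.normal(scale=0.5, size=(200, 4)),
                 columns=["advanced", "int_high", "int_mid", "int_low"])

chi2, p = calculate_bartlett_sphericity(X)          # identity-matrix hypothesis
kmo_per_item, kmo_total = calculate_kmo(X)          # sampling adequacy (> .50)

eigenvalues = np.sort(np.linalg.eigvalsh(X.corr().to_numpy()))[::-1]
print(f"Bartlett chi2 = {chi2:.1f}, p = {p:.4f}; KMO = {kmo_total:.2f}")
print("Eigenvalues:", np.round(eigenvalues, 2))
# A single eigenvalue above 1 and a dominant first factor would support
# unidimensionality; a scree plot would level off after the first factor.
```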


Finally, MFRM was implemented to investigate the dependability of the scores in the 803 rating results using FACETS 3.71.4 (Linacre, 2012). A three-facet rating scale model was adopted. The command SE = Real was included in the syntax to estimate standard errors for each facet under the assumption that the error was systematic (Bonk & Ockey, 2003). The rater facet was set to float, while the examinee and prompt facets were automatically centered at zero logits. The output was examined to determine whether the examinee measures spread widely across the vertical rulers, which would indicate that the OPI ratings dependably separated examinees into different levels. High separation and reliability indices for the examinee facet were expected, suggesting high dependability of the OPI ratings.
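To clarify what these indices summarize, the sketch below reproduces the standard definitions of the Rasch separation and reliability indices from a facet’s measures and standard errors; the logit values are hypothetical, and in the study these indices were taken directly from the FACETS output.

```python
# A minimal sketch of how separation and reliability indices are defined,
# assuming hypothetical examinee logit measures and standard errors
# (the study obtained these statistics directly from FACETS 3.71.4).
import numpy as np

measures = np.array([-1.2, -0.4, 0.1, 0.7, 1.5, 2.2])        # examinee logits
std_errors = np.array([0.35, 0.30, 0.28, 0.30, 0.33, 0.40])  # model SEs

observed_var = measures.var(ddof=1)            # observed variance of measures
error_var = np.mean(std_errors ** 2)           # mean-square measurement error
true_var = max(observed_var - error_var, 0.0)  # error-adjusted ("true") variance

separation = np.sqrt(true_var) / np.sqrt(error_var)  # true SD / RMSE
reliability = true_var / observed_var                # reliability of separation
print(f"Separation = {separation:.2f}, Reliability = {reliability:.2f}")
# High values indicate that the ratings dependably spread examinees across
# different ability levels on the vertical rulers.
```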

4.5.6 Consistency of OPI prompts at different difficulty levels (RQ6)

The sixth research question (RQ6) investigated the extent to which the intended prompt difficulty levels matched the observed difficulty levels. Many-facet Rasch Measurement (MFRM) was applied to analyze the rating results grouped by the four intended prompt difficulty levels. Data came from the individual prompt ratings, divided by intended prompt level, in the 803 individual raters’ ratings collected across the six administrations. The data included ratings on the OPI prompts that were used at least ten times across all administrations, yielding individual raters’ ratings on 73 of the 99 possible prompts (advanced: 19 prompts; intermediate-high: 24 prompts; intermediate-mid: 27 prompts; intermediate-low: 3 prompts). For the analysis, the ratings were divided into the four intended levels, and a separate MFRM run was conducted for each prompt difficulty level, as depicted in Figure 4.6.


Figure 4.6 Procedures for examining consistency in intended prompt difficulty levels (individual raters’ ratings from ADMIN 1 to ADMIN 6 were split into advanced, intermediate-high, intermediate-mid, and intermediate-low prompt levels, and a separate MFRM analysis was conducted for each level)

Based on the three-facet rating scale model, FACETS was employed separately for each of the four prompt difficulty levels. The facets were examinees, raters, and prompts, and the rater facet was set to float. The disconnected data were connected using the group-anchoring method; the detailed anchoring process is presented in the results section.

This research question was addressed by examining (a) the separation and reliability indices, (b) the measures of the prompt facet on the all-facet vertical rulers, (c) fair averages, and (d) infit mean squares. The separation index indicates the extent to which prompts are separated into different difficulty levels, and the reliability index is the reliability of that separation. Since the current study seeks consistency among prompts at the same intended difficulty level, a lower separation index with a lower reliability index is desirable. The distribution of the prompts on the logit scale in the all-facet vertical rulers was examined for consistency of prompt difficulty: if the prompts have similar logit values on the vertical rulers, this suggests consistency in prompt difficulty at each intended prompt difficulty level. In the prompt measurement report, the fair average scores for each prompt were used to determine prompt difficulty on the original scale after adjusting for rater severity in the model, and the model standard errors for each prompt were included to indicate the precision of estimation. Finally, infit mean squares were interpreted to identify problematic prompts within the same intended difficulty level. The normal range for fit statistics is between 0.4 and 1.5 (Linacre, 2002). Any prompt with an infit mean square over 1.5 would be considered misfitting, because it may be measuring something different from the other prompts at the same difficulty level. On the other hand, prompts with infit mean squares smaller than 0.4 provide overly predictable, largely deterministic information, suggesting that they do not contribute meaningful measurement of the construct.
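The sketch below illustrates how an infit mean square for a single prompt is computed from observed ratings, model-expected ratings, and model variances, and how the 0.4–1.5 screening rule would be applied; all values are hypothetical, since FACETS reports these fit statistics directly.

```python
# A minimal sketch of the infit (information-weighted) and outfit mean
# squares for one prompt, assuming hypothetical observations, Rasch-model
# expected scores, and model variances (FACETS computes these internally).
import numpy as np

observed = np.array([3, 4, 2, 5, 4, 3, 2, 4], dtype=float)     # ratings
expected = np.array([3.2, 3.8, 2.5, 4.6, 3.9, 3.1, 2.4, 4.1])  # model expectations
variance = np.array([0.9, 0.8, 1.0, 0.6, 0.8, 0.9, 1.0, 0.7])  # model variances

residuals = observed - expected
infit_ms = np.sum(residuals ** 2) / np.sum(variance)   # weighted by information
outfit_ms = np.mean(residuals ** 2 / variance)         # unweighted, outlier-sensitive

print(f"Infit MS = {infit_ms:.2f}, Outfit MS = {outfit_ms:.2f}")
if infit_ms > 1.5:
    print("Prompt misfits (> 1.5): may measure something different.")
elif infit_ms < 0.4:
    print("Prompt overfits (< 0.4): information is largely deterministic.")
else:
    print("Prompt fits within the 0.4-1.5 range.")
```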

4.5.7 Consistency of raters within each administration (RQ7)

The seventh research question (RQ7) inquired whether individual raters were consistent in their severity and use of the rating scales at each administration. Data included the 803 individual ratings collected across all six ADMINs: 98 ratings from ADMIN 1, 104 from ADMIN 2, 200 from ADMIN 3, 114 from ADMIN 4, 155 from ADMIN 5, and 132 from ADMIN 6. FACETS was run for each test administration based on the three-facet rating scale model. Among the examinee, rater, and prompt facets, the rater facet was set to float.

Rater severity for each ADMIN was examined through the separation index and the reliability index. A lower separation index was desirable, since it indicates that raters were not separated into distinct severity levels. In the FACETS output, the reliability index does not refer to a conventional inter-rater reliability measure; rather, it indicates how reliably raters are separated by severity. Therefore, a lower reliability index was expected, suggesting consistency (similarity in severity) among raters. The spread of the rater measures on the logit scale was also interpreted to examine consistency in severity: raters with higher logit values are more severe, those with lower logit values are more lenient, and similar logit values across raters signify consistency in severity. In addition, raters’ consistency in using the rating scales was investigated based on infit mean squares. Acceptable infit mean squares range from 0.4 to 1.5 (Linacre, 2002). Raters whose infit mean squares were above 1.5 were considered inconsistent in their ratings; in other words, they were not using the scales consistently compared to the other raters. Raters with infit values of less than 0.4 were considered muted; that is, these raters were using only the middle part of the scale.

This chapter described the methodology used to collect and analyze the data. It began with an explanation of the mixed-methods research design used in this study. Next, the participants’ backgrounds were described in terms of their numbers, rating experience, native languages, and completion of rater training. Descriptions of the data collection instruments were then provided: the questionnaire, the focus group and individual interview protocols, the OPI prompts, the scoring rubric, and the diagnostic descriptors. The chapter then detailed the data collection procedures and the data analysis methods. Chapter 5 presents the findings for each research question.


CHAPTER 5

RESULTS

This chapter presents the results for the seven research questions drawn from the evaluation and generalization inferences. The research questions were addressed to find validity evidence for (1) raters’ perceptions of R-Plat in terms of clarity, effectiveness, satisfaction, and comfort level; (2) the quality of raters’ diagnostic descriptor markings; (3) the quality of raters’ comments; (4) the quality of the OPI scores; (5) the quality of individual raters’ OPI ratings; (6) prompt difficulty; and (7) raters’ rating practices. The data included (a) 14 raters’ responses to open-ended questions about their perceptions of R-Plat, (b) five recordings of individual and focus group interviews on their perceptions, (c) 1,900 evaluative units extracted from raters’ comments about examinees’ speaking performance, (d) 14 raters’ responses to six-point scale statements about their perceptions, (e) 2,524 diagnostic descriptor markings on examinees’ speaking ability, (f) OPI scores for 279 examinees, (g) 803 individual raters’ ratings, (h) individual prompt ratings divided by intended prompt level, given by each rater, and (i) individual raters’ ratings on the given prompts, grouped by test administration. Findings showed that the assumptions for the evaluation inference were successfully supported, and those for the generalization inference were at least partially supported by backing. The following sections report the detailed findings for each of the seven research questions.

5.1 Raters’ Perceptions towards R-Plat

The first assumption for the evaluation inference is that rating procedures with R-Plat are appropriate for raters to assess the speaking abilities of test takers. To provide backing for this assumption, the first research question was posed to ascertain whether both experienced and new raters perceived R-Plat as an appropriate rating tool. Perceptions from the two rater groups were elicited to offer evidence for the clarity of R-Plat, raters’ comfort with R-Plat, the effectiveness of R-Plat, and raters’ satisfaction with R-Plat.

5.1.1 Clarity of R-Plat

The clarity of each function in R-Plat was evaluated through 11 six-point scale statements (1 = very unclear to 6 = very clear). The statements were categorized into four areas: the Oral Proficiency Interview (OPI) page, the TEACH page, the final score confirmation page, and the rating path. Each of these areas was divided into sub-categories as follows. On the OPI rating page, raters evaluated the clarity of (a) test takers’ information, (b) the ratings for each question and the impression score, (c) the comment boxes, and (d) the overall comments. On the TEACH page, raters assessed the clarity of (a) test takers’ information, (b) the ratings for the TEACH score, (c) the ratings for cultural ability, (d) the comment boxes for each rating criterion, and (e) the overall comment boxes. In addition, raters evaluated the clarity of the final score confirmation page and the rating path.

Table 5.1 displays the descriptive statistics for the two rater groups’ ratings of the clarity of R-Plat. The table reports the mean and standard deviation for each statement for the experienced raters, the new raters, and the total group. A one-way ANOVA was then employed to compare the means of the two rater groups’ responses to the 11 statements about the clarity of R-Plat.


Table 5.1 Descriptive statistics for experienced and new raters, and total group responses to statements about clarity of R-Plat

Area: Experienced (N = 6) M / S.D.; New (N = 8) M / S.D.; Total (N = 14) M / S.D.

Oral Proficiency Interview
  Information about test takers (Name, Test number, Test date, Interviewer): 5.17 / .98; 4.75 / 1.04; 4.93 / .98
  Rating for each question / impression score: 5.83 / .41; 5.38 / .74; 5.57 / .65
  Comment boxes: 6.00 / .00; 5.63 / .74; 5.79 / .58
  Overall comments: 5.50 / .55; 5.25 / .71; 5.36 / .63

TEACH
  Information about test takers (Name, Test number, Test date, Interviewer): 5.50 / .84; 5.00 / 1.07; 5.21 / .96
  Ratings for TEACH score: 5.33 / 1.21; 5.13 / 1.46; 5.21 / 1.31
  Ratings for cultural aspects: 4.00 / 1.90; 4.13 / 1.46; 4.07 / 1.60
  Comment boxes: 5.67 / .82; 5.38 / .74; 5.50 / .76
  Overall comments: 5.33 / .82; 5.00 / 1.07; 5.14 / .95

Clarity of final score confirmation: 5.67 / .52; 5.25 / .89; 5.43 / .76
Clarity of rating path: 4.83 / 1.47; 5.38 / .74; 5.14 / 1.10

Note: M = Mean; S.D. = Standard Deviation. Response scale: 1 (Very unclear), 2 (Mostly unclear), 3 (Unclear), 4 (Clear), 5 (Mostly clear), and 6 (Very clear)

In the descriptive statistics, the average perception of clarity for all raters ranged from 4.07 to 5.79, indicating that the raters perceived R-Plat as “clear,” or even “mostly clear” for some features. The average scores for the experienced raters’ responses ranged from 4.00 to 6.00 (Clear to Very clear), and those for the new raters ranged from 4.13 to 5.63 (Clear to Mostly clear). Both the experienced and the new raters rated the comment boxes on the OPI page as the clearest feature (experienced raters: M = 6.00; new raters: M = 5.63), followed by the rating for each question (experienced raters: M = 5.83; new raters: M = 5.38) and the comment boxes on the TEACH page (experienced raters: M = 5.67; new raters: M = 5.38). Compared to the other areas of R-Plat, both rater groups considered the rating for cultural ability the least clear feature (experienced raters: M = 4.00; new raters: M = 4.13), although the scores still fell within the Clear range. Given that the raters’ clarity ratings were mostly above 5 (Mostly clear), most features of R-Plat appeared to be clearly presented to the raters. Following the descriptive statistics, a one-way ANOVA was conducted to compare raters’ scores on the 11 scale statements about the clarity of R-Plat between the experienced and the new raters. The independent variable was the raters’ rating experience, with two levels: experienced raters and new raters. The dependent variable was the sum of each rater’s scores on the 11 scale statements.

Prior to the one-way ANOVA, the assumptions of normality and equal variances were tested. Overall, the normality assumption was considered satisfied: although the kurtosis for the experienced rater group (3.073) was greater than +2, the skewness values (experienced raters: -1.669; new raters: -.642) and the kurtosis for the new rater group (-1.045) were within acceptable ranges, and the Shapiro-Wilk significance values for both rater groups were greater than 0.05. The equal variance assumption was also satisfied because Levene’s test was not significant (p = .192).

Table 5.2 shows the results of the one-way ANOVA for raters’ perceptions of the clarity of R-Plat. The results indicated no significant difference between the two rater groups in their perceptions of the clarity of R-Plat, F(1, 12) = .504, p = .491.


Table 5.2 Results of ANOVA test for clarity of R-Plat

Source: Sum of Squares; df; MS; F; Sig.
Between Groups: 22.881; 1; 22.881; .504; .491
Within Groups: 544.333; 12; 45.361
Total: 567.214; 13
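For illustration, the sketch below performs the same sequence of checks and comparisons (Shapiro-Wilk, Levene’s test, one-way ANOVA, and an independent samples t-test) with scipy instead of SPSS; the summed questionnaire scores are hypothetical, not the 14 raters’ actual responses.

```python
# A minimal sketch of the assumption checks and two-group comparisons used
# throughout this chapter, assuming hypothetical summed clarity scores for
# the two rater groups (the study ran these analyses in SPSS version 22).
from scipy import stats

experienced = [58, 62, 55, 64, 60, 61]          # summed scores, N = 6
new_raters = [54, 59, 57, 61, 52, 60, 58, 56]   # summed scores, N = 8

# Normality per group (Shapiro-Wilk) and homogeneity of variances (Levene).
print("Shapiro-Wilk p:", stats.shapiro(experienced).pvalue,
      stats.shapiro(new_raters).pvalue)
print("Levene p:", stats.levene(experienced, new_raters).pvalue)

# With two groups, a one-way ANOVA and an independent samples t-test are
# equivalent tests of the group-mean difference.
f_stat, p_anova = stats.f_oneway(experienced, new_raters)
t_stat, p_ttest = stats.ttest_ind(experienced, new_raters)
print(f"ANOVA:  F = {f_stat:.3f}, p = {p_anova:.3f}")
print(f"t-test: t = {t_stat:.3f}, p = {p_ttest:.3f}")
```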

The raters’ responses to the open-ended questions further supported the raters’ positive views on the clarity of R-Plat. In the questionnaire, three experienced and five new raters appreciated the clear interface of R-Plat and enumerated specific features to support their views. The raters highlighted the simple design of R-Plat and the clear presentation of each function, such as the radio buttons, comment boxes, and final score confirmation. For example, an experienced rater (Rater 5) and four new raters (Raters 6, 8, 9, and 10) stated that R-Plat had a clear layout and clear functions.

• “Most of the sections on the OPI rating page are clear…R-Plat is almost the same as the paper-based rating path.” (Rater 5, experienced; questionnaire)
• “It's (The final score confirmation page) straight forward.” (Rater 6, new; questionnaire)
• “The layout of the pages is mostly clear. It is easy to click on ratings and to type comments in boxes.” (Rater 8, new; questionnaire)
• “It is fairly clear.” (Rater 9, new; questionnaire)
• “I think it is quite clear and easy to get.” (Rater 10, new; questionnaire)

Since the interface of R-Plat replicated the layout of the conventional paper-based rating form, the experienced raters, who had used both the paper rating forms and R-Plat, were quite familiar with most of its features. The rating pages in R-Plat consisted of simple functions, such as radio buttons, rating scales designed with star marks, and comment boxes, and these features did not require advanced computer skills. In addition, Rater 5 favored the clear presentation of the final score confirmation page, stating, “It (the final score confirmation page) synthesizes the results in a clear way.”

Alongside these positive opinions, however, the raters also pointed out unclear features of R-Plat. The most prevalent issue mentioned in the questionnaire was the meaning of cultural ability, one of the rating categories on the TEACH rating page, which confused three experienced raters and three new raters. Three raters (Raters 4, 12, and 13) stated that this category created confusion while they were assigning scores on the TEACH page:

• “The cultural aspects have always been an enigma to me.” (Rater 4, experienced; questionnaire)
• “As for the clarity of cultural ability, I think the interface itself is okay, but the problem is the concept of using it. So we should review the concept before using R-Plat. Then we don’t have much trouble.” (Rater 12, new; focus group interview)
• “It would be good to have more detailed explanation about cultural code. The concept is not clear enough.” (Rater 13, new; questionnaire)

These comments about the ambiguous definition of cultural ability help explain why the ratings for cultural ability on the TEACH rating page received the lowest clarity scores in the questionnaire (experienced raters: M = 4.00; new raters: M = 4.13). Based on these comments, it is evident that the confusion among the raters was attributable to the unclear definition of “cultural aspect” as one of the rating criteria for TEACH rather than to the presentation of the rating in R-Plat. This result points to the need for well-defined rating scales for evaluating speaking ability, a point taken up further in the discussion.


5.1.2 Level of raters’ comfort with R-Plat

To understand raters’ perceptions towards R-Plat, it was also worth investigating the extent to which the raters were comfortable with using R-Plat. As raters’ anxiety in using technology, like R-Plat, may influence test scores as construct-irrelevant variance, it was necessary to ensure that raters would feel comfortable with using R-Plat. In the questionnaire, the level of raters’ comfort was evaluated through four six-point scale statements, and the follow-up open-ended questions for each statement. The statements were categorized into four parts, namely raters’ comfort levels when using R-Plat for 1) interviewing test takers, 2) marking scores, 3) marking diagnostic features, and 4) inserting comments.

Raters’ responses to the four statements were analyzed through descriptive statistics to understand the overall level of raters’ comfort. A one-way ANOVA was then conducted to compare the mean scores of the two rater groups’ ratings on the scale statements. The independent variable was the raters’ rating experience, with two levels: experienced raters and new raters. The dependent variable was the sum of each rater’s scores on the four six-point scale statements.

For the normality assumption check, the values for skewness (experienced rater group: -.380; new rater group: .441) and kurtosis (experienced rater group: -1.410; new rater group: -1.587) were between -2 and +2 for both groups. The results of the Shapiro-Wilk tests (p > 0.05) also supported the normality assumption. The homogeneity of variances assumption was satisfied, as Levene’s statistic was not significant (p = .099).


Table 5.3 Descriptive statistics for experienced raters and new raters, and total group responses to statements about the level of comfort with R-Plat

Aspect: Experienced (N = 6) M / S.D.; New (N = 8) M / S.D.; Total (N = 14) M / S.D.
  Interviewing with R-Plat: 4.83 / .75; N/A; N/A
  Rating with R-Plat: 5.00 / 1.1; 4.88 / .84; 4.93 / .92
  Checking diagnostic features: 4.33 / 1.37; 4.25 / .89; 4.29 / 1.06
  Typing comments: 5.17 / .41; 4.75 / 0.71; 4.93 / 0.62

Note: M = Mean; S.D. = Standard Deviation; N/A (not applicable) = new raters did not interview test takers with R-Plat during the training. Response scale: 1 (Very uncomfortable), 2 (Mostly uncomfortable), 3 (Uncomfortable), 4 (Comfortable), 5 (Mostly comfortable), and 6 (Very comfortable)

As shown in Table 5.3 above, the descriptive statistics show that the total group means for comfort ranged from 4.29 to 4.93, suggesting that raters felt Comfortable with R-Plat. The experienced raters’ means ranged from 4.33 to 5.17 (Comfortable to Mostly comfortable), and the new raters’ means ranged from 4.25 to 4.88. The experienced raters felt most comfortable when typing comments (M = 5.17), followed by rating with R-Plat (M = 5.00), checking diagnostic features (M = 4.33), and interviewing with R-Plat (M = 4.17). The new raters felt most comfortable when rating with R-Plat (M = 4.88), followed by typing comments (M = 4.75) and checking diagnostic features (M = 4.25). As shown in Table 5.4, a one-way ANOVA was conducted to compare raters’ scores on the four scale statements about their comfort with R-Plat between the experienced and the new raters, F(1, 12) = .973, p = .343. The result suggests no significant difference in the two groups’ perceptions of their comfort with R-Plat.


Table 5.4 Results of ANOVA test for raters’ comfort with R-Plat

Source: Sum of Squares; df; MS; F; Sig.
Between Groups: 6.881; 1; 6.881; .973; .343
Within Groups: 84.833; 12; 7.069
Total: 91.714; 13

The raters’ responses to the open-ended questions provided additional information about their comfort with R-Plat. The comments from five experienced and eight new raters contained both positive and negative views on their comfort level with R-Plat. On the positive side, the raters felt comfortable using R-Plat because of its convenient features; in particular, they felt comfortable when they could “mark scores and type comments” and “calculate scores automatically.” Two experienced raters (Raters 5 and 6) and two new raters (Raters 7 and 8) particularly appreciated the features for marking scores and typing comments.

• “Easy (for him) to mark scores and put commas.” (Rater 5, experienced; questionnaire)
• “Easy to mark scores and puts comments.” (Rater 6, experienced; questionnaire)
• “Clicking is easier to do. Very convenient to give diagnostic feedback.” (Rater 7, new; questionnaire)
• “Generally it is very easy to use. Just click. Easy!” (Rater 8, new; questionnaire)

The automated score calculation function in R-Plat was another asset that contributed to raters’ comfort. Conventionally, raters had to calculate scores manually during the test to obtain the final scores. The new function in R-Plat lessened this burden. A new rater (Rater 8) mentioned that this function made it easier for him to obtain a test taker’s final score.


“I like the idea of calculating grades automatically, which is more convenient than using paper.” (Rater 8, new; questionnaire)

In contrast to these positive experiences, raters encountered some challenges when interviewing test takers and when typing pronunciation errors. Three experienced raters (Raters 1, 2, and 3) reported that it was more challenging to interview students with R-Plat than with the traditional paper forms. Rater 2 and Rater 3 shared their strategies for using R-Plat during the interview: to manage R-Plat while interviewing, they wrote comments during transition periods, such as between the impromptu questions and the role-play question in the OPI, and between the OPI rating page and the TEACH page. The raters’ quotes are as follows:

• “Really the only issues I have interviewing with R-Plat are that I'm used to jotting down notes by hand (for example, thinking of ideas for follow up questions) without really looking at the paper much, but need to locate the cursor at an appropriate point on the screen to write potential ideas.” (Rater 1, experienced; questionnaire)
• “I find it hard to take notes when I am concentrating on the interview. I usually just give a numeric score (e.g. 21,22) and add comments during the two minutes before the teach lecture starts.” (Rater 2, experienced; questionnaire)
• “I use strategies to be able to comment on the speaker during the role play prep time and the teach prep time and I generally avoid using R-Plat during the actual interviews because I am unable to think of follow up questions and write comments/use diagnostic features at the same time.” (Rater 4, experienced; questionnaire)

Three new raters also revealed concerns about interviewing with R-Plat, although they had not yet conducted interviews with it. For instance, a new rater (Rater 10) explained:


• “Since I'm not familiar with the questions and rating scale, I think it is hard to listen to candidates, thinking of their level of proficiency, formulating follow-up questions and going back and forth between the screen and the hard copy of the rating scale. I also don't think it's a good idea to switch between the rubric on the screen and R-Plat.” (Rater 10, new; focus group interview)

Given the reports from the experienced and the new raters, the interviewers appeared to feel more pressure when using R-Plat because they needed to select appropriate items to calibrate a test taker’s proficiency level while interacting with the test taker.

Another challenge, raised by four experienced raters (Raters 2, 3, 4, and 5), was the inability to insert phonetic symbols into the system to mark pronunciation errors. Rater 3 explained:

“The only issue would be in searching for an efficient and standardized way to write about 1) what the speaker said, and 2) what they should have said. This includes an easier way to type in phonetic symbols.” (Rater 3, experienced; questionnaire)

During the focus group interview with three experienced raters, Rater 2 reported that he had trouble marking phonetic transcriptions in R-Plat, stating:

“Also in paper, I wrote a lot of phonetic transcription, especially the vowels, using the phonetic symbols for the vowels. Whereas in the computer, you should spell it out because the computer doesn't have all those phonetic symbols. Even if it did, that will be hard to find.” (Rater 2, experienced; focus group interview)

On the paper rating form, raters were able to write phonetic symbols freely to indicate test takers’ pronunciation errors. The technical limitation on typing phonetic symbols in R-Plat, however, turned out to interfere noticeably with the rating procedures. In brief, the results indicated that the raters mostly felt comfortable using R-Plat, given that the average ratings of their comfort level were mostly above 4 (Comfortable). The technical restrictions should be addressed to provide raters with a more positive rating experience with R-Plat.

5.1.3 Effectiveness of R-Plat and raters’ satisfaction

Both rater groups were asked to evaluate the effectiveness of R-Plat and their level of satisfaction when using it. Each aspect was evaluated via a single six-point scale statement. With respect to effectiveness, both rater groups rated R-Plat as mostly effective (experienced: 5.17; new: 5.25), as shown in Table 5.5.

Table 5.5 Descriptive statistics for experienced and new raters, and total group responses to the statement about the effectiveness of R-Plat

Aspect: Experienced (N = 6) M / S.D.; New (N = 8) M / S.D.; Total (N = 14) M / S.D.
  Effectiveness: 5.17 / 0.75; 5.25 / 0.71; 5.21 / .70

Note: M = Mean; S.D. = Standard Deviation. Response scale for effectiveness: 1 (Very ineffective), 2 (Mostly ineffective), 3 (Ineffective), 4 (Effective), 5 (Mostly effective), and 6 (Very effective)

The normality assumption was satisfied given the acceptable skewness (experienced rater group: -.313; new rater group: -.404) and kurtosis (experienced rater group: -.104; new rater group: -.229) values and the non-significant Shapiro-Wilk test results (experienced rater group: p = .212; new rater group: p = .056). The homogeneity of variances assumption was also met given the non-significant Levene’s test result (p = .975). The results of the independent samples t-test, with mean differences evaluated for significance at the 0.05 level, indicated no significant difference in raters’ views on the effectiveness of R-Plat between the experienced rater group (M = 5.17, SD = 0.75) and the new rater group (M = 5.25, SD = 0.71), t(12) = -.212, p = .835.

In the raters’ follow-up comments on their ratings of the Likert scale statement, three raters (Raters 1, 7, and 9) stated that R-Plat was more efficient than the paper rating form. Rater 1, an experienced rater, noted that he tended to write more than he did on paper. Raters 7 and 9 also explained that R-Plat allowed them to type and edit comments efficiently. Rater 11 mentioned the automatic score calculation function as another example of an effective feature in R-Plat. The following quotes come from the open-ended questions in the questionnaire.

• “I love its efficiency, and that we don't have to use paper and pencil any more. I feel I write more comments using R-Plat versus what I wrote with pencil and paper.” (Rater 1, experienced; questionnaire)
• “Much efficient than using paper.” (Rater 7, new; questionnaire)
• “It is fairly straightforward to use. Typing compared with handwriting means a higher efficiency and also makes editing more convenient.” (Rater 9, new; questionnaire)
• “It helps reduce calculation, and navigate back and forth, very organized.” (Rater 11, new; questionnaire)

The last aspect used to evaluate the quality of R-Plat was raters’ satisfaction with it. Table 5.6 presents the descriptive statistics for raters’ satisfaction.


Table 5.6 Descriptive statistics for experienced raters and new raters, and total group responses to the statement about the level of raters’ satisfaction with R-Plat

Aspect: Experienced (N = 6) M / S.D.; New (N = 8) M / S.D.; Total (N = 14) M / S.D.
  Satisfaction: 5.17 / 0.75; 5.13 / 0.99; 5.14 / .86

Note: M = Mean; S.D. = Standard Deviation. Response scale for satisfaction: 1 (Very dissatisfied), 2 (Mostly dissatisfied), 3 (Dissatisfied), 4 (Satisfied), 5 (Mostly satisfied), and 6 (Very satisfied)

The raters in both groups were mostly satisfied with R-Plat, considering the mean scores for each rater group (experienced: 5.17; new: 5.13).

A test of the normality assumption indicated that the observations from the experienced rater group (skewness: -.313; kurtosis: -.104; Shapiro-Wilk p = .212) were normally distributed, but those from the new rater group were not, given its skewness (-1.486), kurtosis (2.973), and Shapiro-Wilk test value (0.15). However, the t-test is robust to violations of the normality assumption in practice. The homogeneity of variances assumption was satisfied based on the non-significant Levene’s test result (p = .764). The result of the independent samples t-test indicated no significant difference between the experienced rater group (M = 5.17, SD = 0.75) and the new rater group (M = 5.13, SD = 0.99), t(12) = .086, p = .933.

In the open-ended questions in the questionnaire, three raters (one experienced and two new) stated that they were satisfied with R-Plat mainly because of its user-friendly features. For example, the comment boxes, scoring buttons, and drop-down menus for selecting topics appeared to be appealing to the raters. Rater 3, an experienced rater, gave examples of the features she found convenient:


“Comment boxes; ease of scoring using gold reference stars; ease of putting in the OPI question topic. In general, the layout is highly visual and simple to see and use.” (Rater 3, experienced; questionnaire)

Two raters (Rater 7 and Rater 9) stated that R-Plat was user-friendly because of its visually appealing features and its ease of use.

• “Stars are attractive symbols. Imitation of a paper like rating sheet, user friendly.” (Rater 7, new; questionnaire)
• “It is user-friendly and does not require highly advanced computer skills.” (Rater 9, new; questionnaire)

These user-friendly, visually appealing features, which did not require advanced computer skills, contributed to the raters’ high satisfaction with R-Plat.

In a nutshell, the results for the first research question revealed that both experienced and new raters generally held positive views of R-Plat in terms of its clarity, raters’ comfort, its effectiveness, and raters’ satisfaction. The average scores for the raters’ responses to all the six-point scale statements were above 4, reflecting positive perceptions, and the independent samples t-tests and ANOVAs showed no differences between the two rater groups in their perceptions of the four aspects. Raters’ comments from the open-ended questions in the questionnaire and from the interviews provided partial evidence to support raters’ positive rating experience with R-Plat. It should also be noted that the technical limitations of R-Plat (e.g., the inability to type phonetic symbols) and the challenges that the raters experienced (e.g., difficulties in interviewing, the vague definition of cultural ability) hindered the raters from evaluating test takers’ ability appropriately while using R-Plat. Further improvement of R-Plat should be considered to facilitate the rating procedures by overcoming these limitations. Consequently, the evidence for the first research question provides backing for the assumption that rating procedures with R-Plat are appropriate for raters to assess the speaking abilities of examinees. The backing for the second assumption of the evaluation inference is reported in the next section.

5.2 Use of Diagnostic Descriptors to Support Proficiency Level Ratings

The second research question was addressed to support the second assumption for the evaluation inference: that the test administration conditions related to the rating procedures supported by R-Plat are appropriate for providing evidence of targeted language abilities. The evidence for this assumption is based on an analysis of the quality and use of the thirty diagnostic descriptors associated with the overall proficiency levels assigned by the raters. First, the diagnostic descriptors marked for each proficiency level were compared to assess the extent to which advanced-level ratings were associated with high-level diagnostic descriptor markings, intermediate-high ratings with mid-level markings, and intermediate-mid ratings with low-level markings. Second, the same comparisons were made for each of the seven categories of diagnostic descriptors. Third, raters’ responses to open-ended questions were analyzed to assess how they used the descriptors to support the proficiency level ratings.

5.2.1 Proficiency level comparisons of diagnostic descriptor markings

The quality of the thirty diagnostic descriptors for supporting proficiency level ratings was examined by comparing diagnostic descriptor markings at each scale point across the three proficiency levels. To compare the markings, the raters’ 146 proficiency level ratings were analyzed, including 50 advanced, 39 intermediate-high, and 57 intermediate-mid level ratings. For each of the 146 level ratings, frequencies were counted for the thirty diagnostic descriptors that could potentially be marked at each scale point. Table 5.7 shows the frequencies and percentages of the markings for the diagnostic descriptors, divided by the three proficiency levels. A Chi-square test was employed to compare the frequencies at each scale point across proficiency levels.

Table 5.7 Frequencies and percentages of diagnostic descriptors at each scale point divided by proficiency level

Proficiency level, Frequency (% within level): Weak; Fair; Satisfactory; Good; Excellent
  Advanced (n markings = 844): 2 (0.2); 17 (2.0); 138 (16.4); 375 (44.4); 312 (37.0)
  Intermediate-high (n markings = 836): 8 (1.0); 158 (18.9); 351 (42.0); 249 (29.8); 70 (8.4)
  Intermediate-mid (n markings = 844): 51 (6.0); 204 (24.2); 404 (47.9); 173 (20.5); 12 (1.4)
  Total (n markings = 2524): 61 (2.4); 379 (15.0); 893 (35.4); 797 (31.6); 394 (15.6)

Note: (%) = Percentage within proficiency level; *p ≤ .05

In total, raters made 2,524 markings for the diagnostic descriptors. For each of the 146 rated samples, thirty diagnostic descriptors were available for marking, and raters tended to select fewer than half of the available descriptors to add descriptions to their ratings. The data consisted of 844 markings at the advanced level, 836 at the intermediate-high level, and 844 at the intermediate-mid level. The results reveal an association between proficiency level ratings and the scale points raters tended to mark for the descriptors. At the advanced proficiency level, the percentage of descriptors marked Good and Excellent is the greatest (81.4%). At the intermediate-high level, the greatest percentage of markings (42.0%) is Satisfactory; at the intermediate-mid level, the greatest percentage is also Satisfactory (47.9%), and the percentages of Good and Excellent markings are lower. The Chi-square test revealed significant differences in the frequencies of the diagnostic descriptor markings at each scale point across the three proficiency level ratings, χ²(8, N markings = 2524) = 830.92, p < .01.

The comparison of percentages across the three levels is displayed in Figure 5.1. The horizontal axis denotes each point on the five-point scale from Weak to Excellent performance, while the vertical axis indicates the percentage of markings on the diagnostic descriptors at each scale point.

Figure 5.1 Distribution of markings on thirty diagnostic descriptors at each scale point across three proficiency levels (x-axis: scale point for diagnostic descriptors; y-axis: percentage of markings; series: Advanced, n = 844; Intermediate-high, n = 836; Intermediate-mid, n = 844)

As shown in Figure 5.1, the percentages of the diagnostic descriptors marked at the advanced level are skewed markedly towards Excellent (38.0%) and Good (44.0%). In addition, the percentages of descriptors marked Excellent and Good at the advanced level surpass the percentages of such markings at the intermediate-high (Excellent: 9.1%; Good: 30.3%) and intermediate-mid levels (Excellent: 1.4%; Good: 21.4%). On the contrary, fewer diagnostic descriptors were marked Fair (2.0%) and Weak (0.2%) in the advanced-level ratings than in the intermediate-high (Fair: 18.4%; Weak: 1.0%) and intermediate-mid levels (Fair: 23.7%; Weak: 6.0%).


In brief, markings at the higher scale points (Good and Excellent) are more frequent at the advanced proficiency level, whereas markings at the lower scale points (Weak and Fair) are more frequent at the intermediate-high and intermediate-mid proficiency levels. The distribution of the diagnostic descriptor markings across scale points suggests that more proficient examinees tended to receive higher markings on the diagnostic descriptors than less proficient examinees. This finding indicates that the diagnostic descriptor markings support the proficiency level ratings, especially in distinguishing the advanced level from the two lower levels.

5.2.2 Seven categories of diagnostic descriptors

To support the value of the diagnostic descriptors for representing proficiency level ratings, comparisons of diagnostic descriptor markings at each scale point, grouped by the seven categories of diagnostic descriptors, were made across the three proficiency levels. The thirty diagnostic descriptors fall into seven diagnostic categories: comprehensibility (ease of understanding, accent, and volume), pronunciation (vowels, consonants, insertion, enunciation, reduction, intonation, rhythm, and word stress), fluency (phrasing, choppiness, halting, false starts, pauses, incomplete utterances, and pace), vocabulary (breadth of vocabulary, and word choice and expression), grammar (grammatical complexity, word order, verbs, word form, singular or plural, pronouns, and articles), pragmatics (interaction and compensation strategies), and listening (listening). Proficiency level comparisons of diagnostic descriptor markings at each scale point are displayed in tables and figures for each of the seven diagnostic categories, and a chi-square test for each category is presented.


Comprehensibility. Raters made 340 markings for comprehensibility, consisting of 105 markings at the advanced, 112 at the intermediate-high, and 123 at the intermediate-mid proficiency level, as shown in Table 5.8.

Table 5.8 Frequencies and percentages of diagnostic descriptors at each scale point for comprehensibility across proficiency levels

Comprehensibility, by proficiency level, Frequency (% within level): Weak; Fair; Satisfactory; Good; Excellent
  Advanced (n markings = 105): 0 (0.0); 1 (1.0); 22 (21.0); 43 (41.0); 39 (37.1)
  Intermediate-high (n markings = 112): 1 (0.9); 20 (17.9); 51 (45.5); 27 (24.1); 13 (11.6)
  Intermediate-mid (n markings = 123): 2 (1.6); 20 (16.3); 58 (47.2); 42 (34.1); 1 (0.8)
  Total (n markings = 340): 3 (0.9); 41 (12.1); 131 (38.5); 112 (32.9); 53 (15.6)

Note: (%) = Percentage within proficiency level

At the advanced proficiency level, the greatest percentage of descriptors is marked Good (41.0%), followed by Excellent (37.1%), whereas almost none are marked Weak (0.0%) or Fair (1.0%). At the intermediate-high level, the highest percentage of diagnostic descriptor markings is Satisfactory (45.5%), followed by Good (24.1%); fewer markings are made at Fair (17.9%) and Excellent (11.6%), and only 0.9 percent are marked Weak. At the intermediate-mid level, the majority of diagnostic descriptors are marked Satisfactory (47.2%) and Good (34.1%), and only 0.8 percent are marked Excellent. The chi-square test revealed significant differences in the markings for the diagnostic descriptors of comprehensibility across proficiency levels, χ²(8, N = 340) = 83.99, p < .01. Figure 5.2 displays the comparisons of diagnostic descriptor markings for comprehensibility at each scale point across the three proficiency levels.

< .01. Figure 5.2 displays the comparisons of diagnostic descriptor markings for comprehensibility at each scale point across the three proficiency levels.

Figure 5.2 Distribution of markings on diagnostic descriptors of comprehensibility at each scale point across three proficiency levels (x-axis: scale point for diagnostic descriptors; y-axis: percentage of markings; series: Advanced, n = 105; Intermediate-high, n = 112; Intermediate-mid, n = 123)

This pattern of descriptor choices supports the assumption that raters used the descriptors for comprehensibility to support the overall proficiency level ratings. The percentages of diagnostic descriptors marked Excellent (37.1%) and Good (41.0%) at the advanced level are greater than the percentages marked at the same scale points at the intermediate-high (Excellent: 11.6%; Good: 24.1%) and intermediate-mid levels (Excellent: 0.8%; Good: 34.1%). On the contrary, the percentages of diagnostic descriptor markings at Satisfactory (21.0%), Fair (1.0%), and Weak (0.0%) at the advanced level are lower than the percentages at these scale points at the intermediate-high (Satisfactory: 45.5%; Fair: 17.9%; Weak: 0.9%) and intermediate-mid proficiency levels (Satisfactory: 47.2%; Fair: 16.3%; Weak: 1.6%).


Pronunciation. For diagnostic descriptors of pronunciation, 707 markings were made. Table 5.9 displays frequencies and percentages of diagnostic descriptor markings associated with pronunciation grouped by the three proficiency levels, consisting of 257 markings in the advanced level, 216 markings in the intermediate-high level, and 234 markings in the intermediate-mid level.

Table 5.9 Frequencies and percentages of diagnostic descriptors at each scale point for pronunciation across proficiency levels

Pronunciation, by proficiency level, Frequency (% within level): Weak; Fair; Satisfactory; Good; Excellent
  Advanced (n markings = 257): 0 (0.0); 12 (4.7); 63 (24.5); 118 (45.9); 64 (24.9)
  Intermediate-high (n markings = 216): 4 (1.9); 76 (35.2); 74 (34.3); 44 (20.4); 18 (8.3)
  Intermediate-mid (n markings = 234): 15 (6.4); 61 (26.1); 107 (45.7); 47 (20.1); 4 (1.7)
  Total (n markings = 707): 19 (2.7); 149 (21.1); 244 (34.5); 209 (29.6); 86 (12.2)

Note: (%) = Percentage within proficiency level

At the advanced proficiency level, the percentage of diagnostic descriptors marked Good (45.9%) is the highest, followed by Excellent (24.9%), Satisfactory (24.5%), and Fair (4.7%); raters did not mark Weak for pronunciation at this level. At the intermediate-high level, the greatest percentage of diagnostic descriptor markings is Fair (35.2%), followed by Satisfactory (34.3%) and Good (20.4%), with smaller percentages at Weak (1.9%) and Excellent (8.3%). At the intermediate-mid level, raters marked Satisfactory (45.7%) most often, followed by Fair (26.1%) and Good (20.1%); the percentages for Excellent (1.7%) and Weak (6.4%) each account for less than ten percent of the markings at this level. The chi-square test for pronunciation confirms significant differences in the frequencies of diagnostic descriptor markings across proficiency levels, χ²(8, N = 707) = 185.95, p < .01. The pattern of raters’ diagnostic descriptor markings at each scale point for pronunciation across the three proficiency levels is displayed in Figure 5.3.

[Figure: grouped bar chart of the percentages of diagnostic descriptor markings at each scale point (Weak, Fair, Satisfactory, Good, Excellent) for pronunciation, by proficiency level: Advanced (n = 257), Intermediate-high (n = 216), Intermediate-mid (n = 234).]

Figure 5.3 Distribution of markings on diagnostic descriptors of pronunciation at each scale across three proficiency levels

The pattern of diagnostic descriptors in Figure 5.3 supports the assumption that diagnostic descriptors for pronunciation represent the overall proficiency level ratings. The percentages of diagnostic descriptor markings for Excellent (24.9%) and Good (45.9%) in the advanced level are greater than the percentages of diagnostic descriptor markings in the intermediate-high (Excellent: 8.3% and Good: 20.4%) and the intermediate-mid (Excellent: 1.7% and Good: 20.1%) level ratings. However, the percentages of diagnostic descriptor markings for Satisfactory (24.5%), Fair (4.7%), and Weak (0.0%) in the advanced level are lower than the percentages of the diagnostic descriptor markings in the intermediate-high level (Satisfactory: 34.3%, Fair: 35.2%, and Weak: 1.9%) and the intermediate-mid proficiency level (Satisfactory: 45.7%, Fair: 26.1%, and Weak: 6.4%).
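For readers who wish to verify these results, the chi-square test of independence can be reproduced directly from the frequency tables reported above. The following is a minimal sketch, assuming Python with scipy (the dissertation does not state which statistical package was used), applied to the pronunciation frequencies in Table 5.9; it should return approximately the reported value of X2 (8, N = 707) = 185.95.

# Minimal sketch (assumed tooling: Python + scipy) reproducing the chi-square
# test of independence for the pronunciation frequencies in Table 5.9.
from scipy.stats import chi2_contingency

# Rows: Advanced, Intermediate-high, Intermediate-mid
# Columns: Weak, Fair, Satisfactory, Good, Excellent
observed = [
    [0, 12, 63, 118, 64],   # Advanced (n markings = 257)
    [4, 76, 74, 44, 18],    # Intermediate-high (n markings = 216)
    [15, 61, 107, 47, 4],   # Intermediate-mid (n markings = 234)
]

chi2, p, dof, expected = chi2_contingency(observed)
n = sum(sum(row) for row in observed)
print(f"X2({dof}, N = {n}) = {chi2:.2f}, p = {p:.4g}")
# Expected to print approximately: X2(8, N = 707) = 185.95, p < .01

The same call, with the observed frequencies swapped in from the corresponding table, applies to the chi-square results reported for the other diagnostic categories in this section.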

Fluency. Raters made 610 markings of diagnostic descriptors for fluency, consisting of 187 markings in the advanced, 216 markings in the intermediate-high, and 207 markings in the intermediate proficiency level, as shown in Table 5.10.

Table 5.10 Frequencies and percentages of diagnostic descriptors at each scale point for fluency across proficiency levels

Proficiency level                       Weak       Fair         Satisfactory   Good          Excellent
Advanced (n markings = 187)             2 (1.1)    3 (1.6)      42 (22.5)      63 (33.7)     77 (41.2)
Intermediate-high (n markings = 216)    3 (1.4)    28 (13.0)    99 (45.8)      73 (33.8)     13 (6.0)
Intermediate-mid (n markings = 207)     15 (7.2)   64 (30.9)    100 (48.3)     23 (11.1)     5 (2.4)
Total (n markings = 610)                20 (3.3)   95 (15.6)    241 (39.5)     159 (26.1)    95 (15.6)
Note: Cells show frequencies of diagnostic descriptor markings; (%) = percentage within proficiency level.

At the advanced level, the highest percentage of diagnostic descriptor markings is made for Excellent (41.2%), followed by Good (33.7%) and Satisfactory (22.5%). The percentages of markings at Weak (1.1%) and Fair (1.6%) are lower than two percent of the total markings of diagnostic descriptors in this proficiency level. At the intermediate-high proficiency level, a majority of diagnostic descriptor markings is made for Satisfactory (45.8%) and Good (33.8%). Fewer markings are made for Fair (13.0%), Excellent (6.0%), and Weak (1.4%). At the intermediate-mid proficiency level, the greatest percentage of diagnostic descriptor markings is Satisfactory (48.3%), followed by Fair (30.9%) and Good (11.1%). Fewer than ten percent of diagnostic descriptor markings are made for Excellent (2.4%) and Weak (7.2%). The chi-square test found significant differences in frequencies of the diagnostic descriptor markings for fluency across proficiency level ratings; X2 (8, N = 610) = 232.02, p < .01. The distribution of diagnostic descriptor markings at each scale point across proficiency levels is presented in Figure 5.4.

[Figure: grouped bar chart of the percentages of diagnostic descriptor markings at each scale point (Weak, Fair, Satisfactory, Good, Excellent) for fluency, by proficiency level: Advanced (n = 187), Intermediate-high (n = 216), Intermediate-mid (n = 207).]

Figure 5.4 Distribution of markings on diagnostic descriptors of fluency at each scale across three proficiency levels.

The distribution of diagnostic descriptors associated with fluency supports the assumption that raters used the descriptors for fluency to support the three proficiency level ratings. The percentage of the diagnostic descriptors marked Excellent (41.2%) in the advanced level exceeds the percentages of diagnostic descriptor markings in the two lower proficiency levels (the intermediate-high: 6.0% and the intermediate-mid level: 2.4%). The percentages of diagnostic descriptors marked Good in the advanced (33.7%) and the intermediate-high level (33.8%) are higher than the percentages of diagnostic descriptor markings (11.1%) in the intermediate-mid level. On the other hand, the percentages of the diagnostic descriptors marked at Satisfactory (48.3%), Fair (30.9%) and Weak (7.2%) in the

intermediate-mid level are greater than the percentages of diagnostic descriptor markings at these scale points in the intermediate-high level (Satisfactory: 45.8%, Fair: 13.0%, and

Weak: 1.4%) and the advanced level (Satisfactory: 22.5%, Fair: 1.6%, and Weak: 1.1%).

Vocabulary. A total of 178 markings were made for diagnostic descriptors of vocabulary. As shown in Table 5.11, the markings included 54 markings in the advanced level, 62 markings in the intermediate-high, and 62 markings in the intermediate-mid proficiency level.

Table 5.11 Frequencies and percentages of diagnostic descriptors at each scale point for vocabulary across proficiency levels

Proficiency level                       Weak       Fair         Satisfactory   Good          Excellent
Advanced (n markings = 54)              0 (0.0)    0 (0.0)      0 (0.0)        29 (53.7)     25 (46.3)
Intermediate-high (n markings = 62)     0 (0.0)    0 (0.0)      31 (50.0)      29 (46.8)     2 (3.2)
Intermediate-mid (n markings = 62)      5 (8.1)    13 (21.0)    29 (46.8)      15 (24.2)     0 (0.0)
Total (n markings = 178)                5 (2.8)    13 (7.3)     60 (33.7)      73 (41.0)     27 (15.2)
Note: Cells show frequencies of diagnostic descriptor markings; (%) = percentage within proficiency level.

All diagnostic descriptor markings for vocabulary at the advanced level are made for Good (53.7%) and Excellent (46.3%), whereas no markings are made for Satisfactory, Fair, or Weak. In the intermediate-high level, the percentage of the diagnostic descriptors marked Satisfactory (50.0%) is the highest, followed by Good (46.8%) and Excellent (3.2%). No markings for Weak and Fair are made in this proficiency level. At the intermediate-mid level, the greatest percentage of diagnostic descriptor markings is Satisfactory (46.8%), followed by Good (24.2%) and Fair (21.0%). Fewer diagnostic descriptors are marked at Weak (8.1%) and no markings are made for Excellent. The chi-square test indicated significant differences in frequencies of diagnostic descriptors across proficiency levels; X2 (8, N = 178) = 116.53, p < .01. Figure 5.5 shows a graphical representation of the distribution of the diagnostic descriptor markings across proficiency levels.

[Figure: grouped bar chart of the percentages of diagnostic descriptor markings at each scale point (Weak, Fair, Satisfactory, Good, Excellent) for vocabulary, by proficiency level: Advanced (n = 54), Intermediate-high (n = 62), Intermediate-mid (n = 62).]

Figure 5.5 Distribution of markings on diagnostic descriptors of vocabulary at each scale across three proficiency levels

The distribution of diagnostic descriptors for vocabulary supports the assumption that the descriptors for vocabulary distinguish the overall proficiency level ratings. The percentages of diagnostic descriptor markings for Good (53.7%) and Excellent (46.3%) in the advanced level are greater than the percentages of the diagnostic descriptor markings at these scale points in the intermediate-high (Good: 46.8%, Excellent: 3.2%) and the intermediate- mid levels (Good: 24.2%, Excellent: 0.0%). However, the percentage of markings for

Satisfactory (50.0%) is the greatest in the intermediate-high, followed by the percentage of markings for Satisfactory (46.8%) in the intermediate-mid level. No markings are made for


Satisfactory in the advanced level. Markings at Fair (21.0%) and Weak (8.1%) are made only in the intermediate-mid level.

Grammar. Raters made 531 markings for diagnostic descriptors associated with grammar, including 187 markings at the advanced level, 176 markings at the intermediate-high level, and 168 markings at the intermediate-mid proficiency level, as shown in Table 5.12.

Table 5.12 Frequencies and percentages of diagnostic descriptors at each scale point for grammar across proficiency levels

Proficiency level                       Weak       Fair         Satisfactory   Good          Excellent
Advanced (n markings = 187)             0 (0.0)    1 (0.5)      7 (3.7)        99 (52.9)     80 (42.8)
Intermediate-high (n markings = 176)    0 (0.0)    30 (17.0)    80 (45.5)      53 (30.1)     13 (7.4)
Intermediate-mid (n markings = 168)     13 (7.7)   42 (25.0)    87 (51.8)      26 (15.5)     0 (0.0)
Total (n markings = 531)                13 (2.4)   73 (13.7)    174 (32.8)     178 (33.5)    93 (17.5)
Note: Cells show frequencies of diagnostic descriptor markings; (%) = percentage within proficiency level.

At the advanced proficiency level, the great majority of markings (95.7%) is made for Good and Excellent, whereas fewer markings are made for Satisfactory (3.7%), Fair (0.5%), and Weak (0.0%). At the intermediate-high proficiency level, the greatest percentage of markings is Satisfactory (45.5%), followed by Good (30.1%) and Fair (17.0%). The percentages of markings at Excellent (7.4%) and Weak (0.0%) are lower. At the intermediate-mid level, the greatest percentage of markings is Satisfactory (51.8%), followed

by Fair (25.0%) and Good (15.5%). Fewer markings are made for Weak (7.7%) and

Excellent (0.0%). The result of the chi-square test shows significant differences in frequencies of diagnostic descriptors at each scale point across proficiency levels for

Grammar; X2 (8, N = 531) = 289.83, p < .01. Figure 5.6 presents a visual representation of diagnostic descriptor markings at each scale for grammar grouped by the three proficiency levels.

[Figure: grouped bar chart of the percentages of diagnostic descriptor markings at each scale point (Weak, Fair, Satisfactory, Good, Excellent) for grammar, by proficiency level: Advanced (n = 187), Intermediate-high (n = 176), Intermediate-mid (n = 168).]

Figure 5.6 Distribution of markings on diagnostic descriptors of grammar at each scale across three proficiency levels

This tendency of diagnostic descriptors corroborates the assumption that raters used the descriptors for grammar to support the overall proficiency level ratings. In Figure 5.6, the percentages of diagnostic descriptors marked Excellent (42.8%) and Good (52.9%) in the advanced level surpass the percentages of diagnostic descriptors in the intermediate-high

(Excellent: 7.4% and Good: 30.1%) and the intermediate (Excellent: 0% and Good: 15.5%) level. Contrarily, the percentages of diagnostic descriptors marked at lower scale points

(Satisfactory: 51.8%, Fair: 25%, and Weak: 7.7%) in the intermediate-mid level are greater than the percentages at these scale points in the intermediate-high (Satisfactory: 45.5%, Fair:

17%, and Weak: 0.0%) and the advanced level (Satisfactory: 3.7%, Fair: 0.5%, and Weak:

0.0%).


Pragmatics. Raters made 158 markings for pragmatics, consisting of 54 markings in the advanced level, 54 markings in the intermediate-high, and 50 markings in the intermediate-mid levels. Table 5.13 presents frequencies and percentages of diagnostic descriptors at each scale point for pragmatics across proficiency levels.

Table 5.13 Frequencies and percentages of diagnostic descriptors at each scale point for pragmatics across proficiency levels

Proficiency level                       Weak       Fair         Satisfactory   Good          Excellent
Advanced (n markings = 54)              0 (0.0)    0 (0.0)      4 (7.4)        23 (42.6)     27 (50.0)
Intermediate-high (n markings = 54)     0 (0.0)    4 (7.4)      16 (29.6)      23 (42.6)     11 (20.4)
Intermediate-mid (n markings = 50)      1 (2.0)    4 (8.0)      23 (46.0)      20 (40.0)     2 (4.0)
Total (n markings = 158)                1 (0.6)    8 (5.1)      43 (27.2)      66 (41.8)     40 (25.3)
Note: Cells show frequencies of diagnostic descriptor markings; (%) = percentage within proficiency level.

At the advanced level, the combined percentage of diagnostic descriptors marked Good and Excellent is the greatest (92.6%). Only 7.4% of markings are made for Satisfactory, and no markings are assigned to Weak or Fair in this proficiency level. At the intermediate-high level, the greatest percentage of diagnostic descriptor markings is made for

Good (42.6%), followed by Satisfactory (29.6%) and Excellent (20.4%). The percentage of diagnostic descriptor markings at Fair (7.4%) is less than ten percent of the total markings and no markings are made for Weak in this proficiency level. At the intermediate-mid level,

the percentage of descriptors marked Satisfactory and Good combined is the greatest (86.0%). The percentages of descriptors marked Fair (8.0%), Excellent (4.0%), and Weak (2.0%) are very small. The chi-square test confirms significant differences in frequencies of the diagnostic descriptor markings for pragmatics across proficiency levels; X2 (8, N = 158) = 43.26, p

< .01. The distribution of diagnostic descriptor markings for pragmatics distinguishes three proficiency levels as shown in Figure 5.7.

[Figure: grouped bar chart of the percentages of diagnostic descriptor markings at each scale point (Weak, Fair, Satisfactory, Good, Excellent) for pragmatics, by proficiency level: Advanced (n = 54), Intermediate-high (n = 54), Intermediate-mid (n = 50).]

Figure 5.7 Distribution of markings on diagnostic descriptors of pragmatics at each scale across three proficiency levels

Generally, this distribution of diagnostic descriptors serves as backing for the assumption that raters used the descriptors for pragmatics to support the overall proficiency level ratings, except for the markings at Good. The percentages of diagnostic descriptors marked Good are similar across the three proficiency levels (the advanced: 42.6%, the intermediate-high: 42.6%, and the intermediate-mid: 40.0%). This tendency may be explained by the fact that pragmatics is not a sub-trait that is directly relevant to speaking proficiency. Apart from the ambiguous markings at Good, the percentage of diagnostic descriptor markings at Excellent (50.0%) in the advanced level surpasses the percentages of diagnostic descriptor markings at the same scale point in the intermediate-high (20.4%) and the intermediate-mid (4.0%) proficiency levels. At the lower scale points, the percentages of diagnostic descriptors marked Satisfactory (46.0%), Fair (8.0%), and Weak (2.0%) in the intermediate-mid level are higher than the percentages of diagnostic descriptor markings at these scale points in the intermediate-high (Satisfactory: 29.6%, Fair: 7.4%, and Weak: 0.0%) and the advanced proficiency levels (Satisfactory: 7.4%, Fair: 0.0%, and Weak: 0.0%).

Listening. For diagnostic descriptors of listening, raters checked 73 markings, consisting of 25 markings in the advanced, 24 markings in the intermediate-high and 24 markings in the intermediate-mid levels, as displayed in Table 5.14.

Table 5.14 Frequencies and percentages of diagnostic descriptors at each scale point for listening across proficiency levels

Proficiency level                       Weak       Fair         Satisfactory   Good          Excellent
Advanced (n markings = 25)              0 (0.0)    0 (0.0)      0 (0.0)        7 (28.0)      18 (72.0)
Intermediate-high (n markings = 24)     1 (4.2)    0 (0.0)      3 (12.5)       12 (50.0)     8 (33.3)
Intermediate-mid (n markings = 24)      1 (4.2)    2 (8.3)      8 (33.3)       13 (54.2)     0 (0.0)
Total (n markings = 73)                 2 (2.7)    2 (2.7)      11 (15.1)      32 (43.8)     26 (35.6)
Note: Cells show frequencies of diagnostic descriptor markings; (%) = percentage within proficiency level.

At the advanced level, all diagnostic descriptor markings are made for Excellent (72.0%) and Good (28.0%). The greatest percentage of markings in the intermediate-high level is Good (50.0%), followed by Excellent (33.3%) and Satisfactory (12.5%). Only one marking for Weak (4.2%) and no marking for Fair are made in this proficiency level. At the intermediate-mid proficiency level, the greatest percentages of diagnostic descriptor markings for listening are made for Good (54.2%) and Satisfactory (33.3%). Smaller percentages of diagnostic descriptor markings are observed at Fair (8.3%), Weak (4.2%), and Excellent (0.0%) in this proficiency level. The chi-square test indicates that there are significant differences in frequencies of diagnostic descriptors for listening grouped by proficiency levels; X2 (8, N = 73) = 34.41, p <

.01. Figure 5.8 shows a visual presentation of diagnostic descriptors for listening at each scale point divided by the three proficiency levels.

[Figure: grouped bar chart of the percentages of diagnostic descriptor markings at each scale point (Weak, Fair, Satisfactory, Good, Excellent) for listening, by proficiency level: Advanced (n = 25), Intermediate-high (n = 24), Intermediate-mid (n = 24).]

Figure 5.8 Distribution of markings on diagnostic descriptors of listening at each scale across three proficiency levels

This visual presentation of diagnostic descriptors supports the assumption that raters checked the descriptors for listening to indicate the overall proficiency level ratings, except for the diagnostic descriptors marked Good. In Figure 5.8, the percentage of diagnostic descriptor markings at Excellent (72.0%) in the advanced level exceeds the percentages of diagnostic descriptors marked at this scale point in the intermediate-high (33.3%) and the intermediate-mid level (0.0%). At the lower scale points, the percentages of diagnostic descriptors marked Satisfactory (33.3%), Fair (8.3%), and Weak (4.2%) in the intermediate-mid level are greater than the percentages of diagnostic descriptor markings in the intermediate-high level (Satisfactory: 12.5%, Fair: 0.0%, Weak: 4.2%). No markings are made at these scale points in the advanced level ratings. However, the percentages of diagnostic descriptors marked Good do not distinguish the three proficiency levels, since the percentage of descriptors marked Good in the advanced level (28.0%) is lower than the percentages of descriptors marked at the same scale point in the intermediate-high (50.0%) and the intermediate-mid (54.2%) proficiency levels. In line with the results for pragmatics, listening is not a sub-trait of speaking proficiency, and thus it seems to be challenging for raters to evaluate listening ability as they assess the other dimensions of speaking proficiency such as comprehensibility, pronunciation, fluency, and grammar.

In summary, the comparisons of descriptor choices across the three proficiency levels supported the assumption that the raters used the descriptors in support of the overall proficiency levels they had assigned. This evidence was found when the thirty diagnostic descriptors were analyzed together and when each of the seven diagnostic categories was analyzed. All chi-square tests revealed statistically significant differences and the visual plots showed the expected pattern of choices of scale points to support the proficiency level ratings. When it comes to percentages of diagnostic descriptors at the various Likert-scale points across the three proficiency levels, high-level diagnostic descriptor marking

(Excellent) is consistently higher at the advanced proficiency level. On the contrary, the percentages of low-level diagnostic descriptor markings (Weak and Fair) are higher at the intermediate-mid and the intermediate-high proficiency levels in all of the diagnostic categories. These findings corroborate that the markings of the diagnostic descriptors, grouped into the seven diagnostic categories, reflect test takers' speaking proficiency levels.

5.2.3 Raters’ reasons for selecting diagnostic descriptors

Raters’ use of the thirty diagnostic descriptors to support proficiency level ratings was investigated by raters’ responses to an open-ended question in the questionnaire. In the questionnaire distributed separately to fourteen raters (six experienced and eight new raters), an open-ended question asked them to share their experiences in marking thirty diagnostic

135

descriptors. Of the fourteen raters who answered the question, ten raters specifically mentioned how they used diagnostic descriptors during the rating procedure. The main theme identified from the ten raters' responses revealed that the raters concentrated on the most noticeable linguistic features of the test takers and chose the diagnostic descriptors that indicated test takers' weaknesses.

Raters’ intentional use of diagnostic descriptors is observed in ten raters’ responses.

All of the raters indicated that they tended to check diagnostic descriptors that indicated test

takers’ speaking ability. The following excerpts were selected as they explained the raters’

deliberate approaches to mark diagnostic descriptors. For example, three experienced raters

(Rater 1, 2, and 6) stated that they often clicked on the most significant diagnostic descriptors

to add specific descriptions to their ratings. The key phrases showing their deliberate use of

the descriptors in the excerpts are underlined.

• "I usually do the number score first. Then I go to the diagnostic features and fill in most of them. Sometimes I skip some of them and mark only the descriptors, which I think are most significant (e.g., EN, AC, VT). I guess I assume that if I don't give any score, there is no problem with the feature." [Rater 2, Experienced, the questionnaire]

• "I use the diagnostic features as descriptors of score level. I usually pick out descriptors that are truly representative of a students' level and then fill out the other descriptors only if they apply to the student." [Rater 6, Experienced, the questionnaire]

In line with the experienced raters' approaches, the new raters used diagnostic descriptors that indicated noticeable weaknesses in the students' language. For example, two new raters (Raters 8 and 12) articulated that they highlighted test takers' weaknesses when checking diagnostic descriptors.


• "I usually focus on the features that are most relevant to a test taker's performance. I first check the features with which the test taker has the most problems. Then if I have time left, I check the rest of the features." [Rater 8, New, the questionnaire]

• "If there are certain features that are quite noticeable (especially in terms of weaknesses), I use them." [Rater 12, New, the questionnaire]

The raters' self-reports on the use of diagnostic descriptors revealed that diagnostic descriptor markings offered informative evidence about test takers' speaking ability.

To put it briefly, the analysis of the quality and use of the thirty diagnostic descriptors for supporting the three proficiency levels provided evidence for the second assumption in three ways. First, the association between markings of the thirty diagnostic descriptors and the three proficiency levels suggested that the diagnostic descriptors supported the test takers' proficiency level ratings. Second, the same finding appeared in each of the seven categories of descriptors. Finally, the raters' responses to the questionnaire about their use of diagnostic descriptors uncovered that the raters' selections of diagnostic descriptors reflected examinees' speaking performance precisely, because the raters focused on the most prominent aspects of their performance with particular attention to strengths and weaknesses in their speaking ability. Consequently, the findings served as backing for the second assumption that test administration conditions in terms of rating procedures with R-Plat are appropriate for providing evidence of targeted language abilities.

5.3 Use of Raters’ Comments to Support Proficiency Level Ratings

The third research question was proposed to provide another piece of evidence for the second assumption for evaluation inference. The assumption is that test administration

conditions related to rating procedures with R-Plat are appropriate for providing evidence of targeted language ability. Backing was found from analysis of the quality of raters' open-ended comments on test takers' performance to support the three proficiency level ratings.

First, raters’ positive and negative comments on test takers’ speaking ability were examined to assess their association with proficiency level ratings. Second, the same analysis was done for comments made on the six evaluation criteria in the OECT scoring rubric. Examples of raters’ comments pertaining to each of the six criteria are presented to demonstrate how raters’ comments supported the proficiency level ratings.

5.3.1 Inter-coder reliability

To verify the quality of coding for raters' comments, the second coder analyzed 20% of the evaluative units at each proficiency level. Then, Cohen's kappa was run to determine whether there was agreement between the two coders' judgments on types of evaluative units, in terms of positive and negative, and on each of the six evaluation criteria. There was strong agreement on the categorization of positive and negative evaluative units between the two coders' judgments, k = .900 (p < .001). Strong agreement was also found in the coders' judgments on each of the six evaluation criteria, k = .823 (p < .001). The strong agreements on the categorizations of the evaluative units between the two coders provided evidence for consistent coding of raters' evaluative units, which became the basis of the following analysis.
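To make the agreement calculation concrete, the following is a minimal sketch of unweighted Cohen's kappa for two coders' categorical judgments, written in Python. The coder labels shown are hypothetical placeholders rather than the actual coding data; applying the same calculation to the real judgments yields the values reported above (k = .900 for positive/negative and k = .823 for the six criteria).

# Minimal sketch of unweighted Cohen's kappa for two coders' judgments.
# The label lists below are hypothetical placeholders, not the study's data.
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    assert len(coder1) == len(coder2) and len(coder1) > 0
    n = len(coder1)
    # Observed agreement: proportion of units the two coders categorized identically
    p_o = sum(a == b for a, b in zip(coder1, coder2)) / n
    # Chance agreement: sum over labels of the product of each coder's marginal proportions
    c1, c2 = Counter(coder1), Counter(coder2)
    p_e = sum((c1[label] / n) * (c2[label] / n) for label in set(coder1) | set(coder2))
    return (p_o - p_e) / (1 - p_e)

coder_a = ["positive", "negative", "negative", "positive", "negative", "negative"]
coder_b = ["positive", "negative", "negative", "positive", "positive", "negative"]
print(round(cohens_kappa(coder_a, coder_b), 3))  # prints 0.667 for this toy example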

5.3.2 Comparison of positive and negative comments across proficiency levels

The value of raters’ comments for supporting the overall proficiency level ratings was examined by comparing raters’ comments across proficiency levels. During the rating

procedures for the OECT, the eight experienced raters opted to type comments in R-Plat as they listened to test takers' performances. The raters' comments on performances associated with 146 proficiency level ratings were analyzed. The analysis was based upon grounded theory to identify themes that emerged from the data. Two themes were identified, namely positive and negative aspects in the raters' comments. Raters' positive comments often included complimentary words and expressions such as "excellent," "strong," and "good." Negative comments included mention of specific examples of errors and raters' criticisms of test takers' weaknesses.

To compare the frequencies of positive and negative comments across the three proficiency levels, a unit of analysis, called the "evaluative unit," was developed for raters' comments, since the comments consisted of words, phrases, and clauses. Using this unit, comments were easier to count. An evaluative unit is defined as a segment of the rater's comment that states an evaluation of the test taker's language. Table 5.15 displays examples of raters' positive and negative comments and the number of evaluative units identified from the raters' comments. The raters' comments were selected from the advanced level ratings, as they provided examples of various evaluative units extracted from the raters' comments.

Table 5.15 Examples of raters' comments and their corresponding evaluative units selected from the advanced-level rating

Comment type   Example of rater's comment                                                   Number of evaluative units
Positive       "No effort to understand. Excellent enunciation, vocabulary." (Rater 2)      3
Negative       "Some word stress issues. Lots of pausing and halting when nervous.          4
               Some sounds deleted. ("w" in wooden)" (Rater 1)


An example of a positive comment, from Rater 2, is "No effort to understand. Excellent enunciation, vocabulary." This comment consists of three different evaluative segments about a test taker's speaking ability at the advanced proficiency level. An example of a negative comment, from Rater 1, is "Some word stress issues. Lots of pausing and halting when nervous. Some sounds deleted. ('w' in wooden)." In this comment, four segments, including "word stress," "pausing," "halting," and "deletion," were identified. Based upon the coding scheme, frequencies of evaluative units were counted to make comparisons of raters' positive and negative evaluative units for each of the three proficiency levels. Figure 5.9 depicts the procedures for analyzing raters' comments to obtain evaluative units for each proficiency level rating.

Figure 5.9 Procedures for analyzing raters’ comments3

3 E.U. refers to an evaluative unit
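To make the coding output easier to picture, the sketch below shows one possible way of representing and tallying coded evaluative units in Python. The records are hypothetical illustrations modeled on the examples in Table 5.15; the actual coding was done manually following the procedure depicted in Figure 5.9.

# Minimal sketch of a possible data structure for coded evaluative units
# (hypothetical records modeled on the examples in Table 5.15).
from collections import Counter

coded_units = [
    {"rater": "Rater 2", "level": "Advanced", "type": "positive", "text": "no effort to understand"},
    {"rater": "Rater 2", "level": "Advanced", "type": "positive", "text": "excellent enunciation"},
    {"rater": "Rater 2", "level": "Advanced", "type": "positive", "text": "excellent vocabulary"},
    {"rater": "Rater 1", "level": "Advanced", "type": "negative", "text": "some word stress issues"},
    {"rater": "Rater 1", "level": "Advanced", "type": "negative", "text": "lots of pausing when nervous"},
    {"rater": "Rater 1", "level": "Advanced", "type": "negative", "text": "halting when nervous"},
    {"rater": "Rater 1", "level": "Advanced", "type": "negative", "text": "some sounds deleted ('w' in wooden)"},
]

# Tally evaluative units by proficiency level and comment type
tally = Counter((unit["level"], unit["type"]) for unit in coded_units)
for (level, unit_type), count in sorted(tally.items()):
    print(level, unit_type, count)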


Results of the comparison of positive and negative evaluative units across the three proficiency levels are displayed in Table 5.16 and Figure 5.10. In Table 5.16, raters' comments consisted of 1900 evaluative units in total, with a total of 517 positive (27.2%) and 1383 negative (72.8%) evaluative units.

Table 5.16 Positive and negative evaluative units in raters' comments divided by proficiency levels

Proficiency level                              Positive       Negative
Advanced (n evaluative units = 475)            290 (61.1)     185 (38.9)
Intermediate-high (n evaluative units = 523)   106 (20.3)     417 (79.7)
Intermediate-mid (n evaluative units = 902)    121 (13.4)     781 (86.6)
Total (n evaluative units = 1900)              517 (27.2)     1383 (72.8)
Note: Cells show frequencies of evaluative units; (%) = percentage within proficiency level.

Overall, the raters made more negative evaluative units (72.8%) than positive evaluative units (27.2%). Both in the intermediate-high and the intermediate proficiency levels, the percentages of negative evaluative units (the intermediate-high: 79.7% and the intermediate:

86.6%) exceed those of positive evaluative units (the intermediate-high: 20.3% and the intermediate: 13.4%). Only at the advanced level, the raters stated more positive units

(61.1%) than negative units (38.9%). A chi-square test supports significant differences in frequencies of positive and negative evaluative units across the three proficiency levels; X2

(2, N = 1900) = 386.50, p < .01. The patterns of raters’ positive and negative evaluative units across proficiency levels are presented visually in Figure 5.10.


[Figure: grouped bar chart of the percentages of positive and negative evaluative units in raters' comments at the Intermediate-mid, Intermediate-high, and Advanced proficiency levels.]

Figure 5.10 Overall distribution of positive and negative evaluative units across proficiency levels

When it comes to the comparisons of positive and negative evaluative units across the three proficiency levels, the percentage of positive evaluative units (61.1%) is the greatest at the advanced level. Conversely, the percentage of negative evaluative units (86.6%) is the highest at the intermediate-mid level, followed by the intermediate-high (79.7%) and the advanced (38.9%) levels. The clear patterns of the positive and negative evaluative units depending on proficiency levels offered evidence that raters' comments were reflective of test takers' different proficiency levels. A further investigation was made to examine whether these clear patterns appeared consistently when the positive and negative evaluative units were grouped by the six criteria of the OECT scoring rubric across the three proficiency level ratings.
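As a worked illustration of the arithmetic behind Table 5.16, the short sketch below (plain Python; a spreadsheet or any statistical package would serve equally well) recomputes the within-level percentages of positive and negative evaluative units from the raw counts.

# Minimal sketch recomputing the within-level percentages reported in Table 5.16
# from the raw counts of positive and negative evaluative units.
counts = {
    "Advanced":          {"positive": 290, "negative": 185},
    "Intermediate-high": {"positive": 106, "negative": 417},
    "Intermediate-mid":  {"positive": 121, "negative": 781},
}

for level, units in counts.items():
    total = sum(units.values())
    pos_pct = 100 * units["positive"] / total
    neg_pct = 100 * units["negative"] / total
    print(f"{level}: n = {total}, positive = {pos_pct:.1f}%, negative = {neg_pct:.1f}%")
# Reproduces, for example, Advanced: n = 475, positive = 61.1%, negative = 38.9%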

5.3.3 Comparison of positive and negative evaluative units grouped by the OPI scoring criteria across proficiency levels

To support the quality of raters’ comments to represent proficiency level ratings, evaluative units were grouped by each of the six criteria for the scoring rubric: functional competency, comprehensibility, pronunciation, fluency, vocabulary, and grammar.

Functional competency is related to a speaker’s ability to complete tasks by using appropriate

language and diverse strategies; the latter five criteria are defined above. Of the overall 1900 evaluative units, the numbers in each category were functional competency (n=348), comprehensibility (n=150), pronunciation (n=526), fluency (n=424), vocabulary (n=167), and grammar (n=285). Then, I compared the number of positive and negative evaluative units grouped by the six criteria across the three proficiency levels.

Functional competency. The 348 evaluative units associated with functional competency consisted of 119 evaluative units in the advanced, 70 evaluative units in the intermediate-high, and 159 evaluative units in the intermediate-mid proficiency level, as shown in Table 5.17.

Table 5.17 Frequencies of positive and negative evaluative units for functional competency grouped by proficiency levels

Proficiency level                   Positive       Negative
Advanced (n units = 119)            106 (89.1)     13 (10.9)
Intermediate-high (n units = 70)    31 (44.3)      39 (55.7)
Intermediate-mid (n units = 159)    68 (42.8)      91 (57.2)
Total (n units = 348)               205 (58.9)     143 (41.1)
Note: Cells show frequencies of evaluative units; (%) = percentage within proficiency level.

At the advanced proficiency level, the percentage of positive evaluative units (89.1%) is remarkably higher than that of negative evaluative units (10.9%). At the intermediate-high level and the intermediate-mid level, on the contrary, the percentages of negative evaluative units (the intermediate-high: 55.7% and the intermediate-mid: 57.2%) are greater than the percentages of positive evaluative units (the intermediate-high: 44.3% and the intermediate-mid:


42.8%). A chi-square test for frequencies of evaluative units about functional competency shows significant differences in frequencies of positive and negative evaluative units across proficiency levels; X2 (2, N =348) = 68.04, p < .01. Figure 5.11 shows a graphical representation of positive and negative evaluative units relevant to functional competency at the three proficiency levels.

[Figure: grouped bar chart of the percentages of positive and negative evaluative units for functional competency at the Intermediate-mid (n = 159), Intermediate-high (n = 70), and Advanced (n = 119) levels.]

Figure 5.11 Distribution of positive and negative evaluative units across proficiency levels for functional competency

The percentage of positive evaluative units is the greatest at the advanced proficiency level (89.1%), followed by the intermediate-high (44.3%) and the intermediate-mid (42.8%). Conversely, the percentage of negative evaluative units (57.2%) is the highest at the intermediate-mid level, followed by the intermediate-high (55.7%) and the advanced level (10.9%). The clear distribution of positive and negative units for functional competency across each proficiency level echoed test takers' expected performance depending on proficiency levels. The divergent patterns of the positive and negative evaluative units were further supported by the characteristics of the positive and negative evaluative units that represented test takers' proficiency levels.

Raters' comments about functional competency for the advanced level performance often contained positive descriptors, such as "strong", "good", and "quite", to note speakers' ability to deal with given topics and to develop their arguments competently. Some examples of raters' comments showing each of these features are shown below:

• The positive adjectives describing the advanced test takers.

“has strong classroom presence.” (Rater1) “having a very good command in oral English.” (Rater5) “quite confident in explaining the concepts.” (Rater16)

• Test takers’ ability to deal with topics competently

"the topic and how difficult the class assignment load is. Does a great job of presenting the problem and solutions.” (Rater1)

“can handle an abstract concept in a general sense with good command of English.” (Rater5)

• Test takers’ ability to develop arguments

“showing no difficulty or struggle in expressing his opinions and making arguments.” (Rater5)

“good development of argument through explanation/details.” (Rater6)

“She makes good arguments.” (Rater7)

On the other hand, negative evaluative units relevant to functional competency in the advanced proficiency level criticized test takers’ short (“somewhat limited answer, but not due to linguistic ability”, Rater 2) and inappropriate responses to given topics (“the pace and organization of the lecture is a little off”, Rater 4).

At the intermediate-high proficiency level, the raters' evaluative units appreciated test takers' abilities to develop arguments and to deal with given topics. However, compared with the adjectives used in the advanced level, the qualifying expressions used at this proficiency level contained "to some extent", "relatively", and "fairly", which implied test takers' somewhat limited abilities, as displayed in the following excerpts.

• The adjectives describing the intermediate-high test takers

“can freely express and develop ideas, to some extent.” (Rater5) “relatively good development of explanations.” (Rater7) “communicated fairly effectively.” (Rater6)

Negative evaluative units at the intermediate-high proficiency level addressed test takers’ difficulties in maintaining performance and providing sufficient responses to questions as shown in the following excerpts.

• Test takers’ difficulties in sustaining performance “because she's having difficulties in developing content.” (Rater7) “he struggled a lot to keep talking about the topic.” (Rater16)

• Test takers’ short responses “short responses, not sure if he is struggling because of the topic or due to limited language.” (Rater5) “giving very short answers.” (Rater1) “doesn't have much to say at times.” (Rater2)

At the intermediate-mid proficiency level, the positive evaluative units highlighted the test takers' abilities to present arguments and to respond to questions, as observed in the two upper proficiency levels. However, the negative evaluative units revealed that test takers struggled more to sustain their performance and to understand questions properly, as shown in the following excerpts.

• Test takers’ difficulties in maintaining performance “he's having difficulties in developing the topic.” (Rater7) “had more difficulty with this task.” (Rater4) “experiences a number of breakdowns in communication.” (Rater1)


• Difficulties in comprehending questions “did not understand the question at first.” (Rater4) “experiences a number of breakdowns in communication.” (Rater 1) “could not deal with the situation due to the lack of the understanding of the topic.” (Rater16)

This result shows that the raters’ evaluative units for functional competency mirrored test takers’ speaking proficiency at different proficiency levels.

Comprehensibility. Evaluative units for comprehensibility included 70 evaluative units at the advanced, 46 evaluative units at the intermediate-high, and 34 evaluative units at the intermediate–mid level ratings as shown in Table 5.18.

Table 5.18 Frequencies of positive and negative evaluative units for comprehensibility grouped by proficiency levels

Proficiency level                   Positive       Negative
Advanced (n units = 70)             48 (68.6)      22 (31.4)
Intermediate-high (n units = 46)    15 (32.6)      31 (67.4)
Intermediate-mid (n units = 34)     9 (26.5)       25 (73.5)
Total (n units = 150)               72 (48.0)      78 (52.0)
Note: Cells show frequencies of evaluative units; (%) = percentage within proficiency level.

At the advanced proficiency level, the percentage of positive evaluative units (68.6%) surpasses the percentage of negative evaluative units (31.4%). In the two lower proficiency levels, on the other hand, the percentages of negative evaluative units (the intermediate-high: 67.4% and the intermediate-mid: 73.5%) are greater than those of positive evaluative units (the intermediate-high: 32.6% and the intermediate-mid: 26.5%). A chi-square test reports significant differences in frequencies of positive and negative evaluative units for comprehensibility across the three proficiency levels; X2 (2, N = 150) = 22.55, p < .01.

Figure 5.12 presents differences in the percentages of the two types of evaluative units for comprehensibility at each proficiency level.

[Figure: grouped bar chart of the percentages of positive and negative evaluative units for comprehensibility at the Intermediate-mid (n = 34), Intermediate-high (n = 46), and Advanced (n = 70) levels.]

Figure 5.12 Distribution of positive and negative evaluative units across proficiency levels for comprehensibility

The distribution of positive and negative evaluative units for comprehensibility is consistent with the patterns of the evaluative units for functional competency. The percentage of positive evaluative units (68.6%) is the greatest at the advanced proficiency level, followed by the percentages of positive units in the two lower proficiency levels. On the other hand, the percentage of negative evaluative units (73.5%) is the greatest at the intermediate-mid level, followed by the percentages of negative units in the two upper proficiency levels. Thus, the comparisons of positive and negative evaluative units relevant to comprehensibility across the three proficiency levels indicated that the evaluative units provided evidence to support test takers' different proficiency levels. Excerpts from raters' evaluative units for comprehensibility are presented below to elucidate test takers' proficiency levels.


At the advanced proficiency level, positive evaluative units for comprehensibility indicated that test takers’ speech was mostly easy to understand. Although test takers had problems with word stress or accented speech, these issues did not impede overall comprehensibility as shown in the following excerpts.

• Test taker’s comprehensible speech. “no effort to understand.” (Rater2, Rater1) “accented speech but comprehensible.” (Rater2, Rater4) “some intrusive p/d/t aspirated sounds, but again, does not impede my understanding of her speech.” (Rater1)

Negative evaluative units at the advanced level also pinpointed issues relevant to word stress and accents, as shown in the following excerpts. Considering the relatively low percentage of negative evaluative units, however, these limitations were not widely observed among the advanced test takers.

“somewhat strong accent due to L1 impeded the comprehension during the question 1 and 2.” (Rater16) “has a slight accent.” (Rater4)

At the intermediate-high proficiency level, negative evaluative units surpassed positive units. Compared to the advanced test takers, negative evaluative units for comprehensibility uncovered that the raters had to put more effort into understanding test takers' speech because of test takers' stronger accents and lack of enunciation, which impeded comprehensibility.

• Lack of comprehensibility due to strong accent and lack of enunciation “strong accent hinders comprehensibility, requires some listener effort for understanding.” (Rater5)


“There are moments that he is not intelligible because of the speech rate and strong accent.” (Rater7) “however, somewhat strong accent and lack of enunciation (vowels, and consonant) impeded comprehension.” (Rater16)

At the intermediate-mid level, negative evaluative units addressed severe limitations of test takers' speech that impeded comprehensibility. In particular, test takers' strong accent, lack of enunciation, and frequent pauses contributed to weakened comprehensibility at this proficiency level, as described in the following excerpts.

• Limited comprehensibility due to strong accent, lack of enunciation, and pauses “low comprehensibility.” (Rater1) “several words are not intelligible.”(Rater16) “the frequent hesitations and pauses impede comprehensibility.” (Rater4)

The findings supported that the raters' evaluative units for comprehensibility mirrored test takers' speaking abilities at different proficiency levels.

Pronunciation. Raters’ evaluative units for pronunciation consisted of 121 units in the advanced, 171 units in the intermediate-high, and 234 units in the intermediate proficiency level, as shown in Table 5.19.

Table 5.19 Frequencies of positive and negative evaluative units for pronunciation grouped by proficiency levels

Proficiency level                   Positive       Negative
Advanced (n units = 121)            22 (18.2)      99 (81.8)
Intermediate-high (n units = 171)   10 (5.8)       161 (94.2)
Intermediate-mid (n units = 234)    8 (3.4)        226 (96.6)
Total (n units = 526)               40 (7.6)       486 (92.4)
Note: Cells show frequencies of evaluative units; (%) = percentage within proficiency level.

Across three proficiency levels, the percentages of negative units (the advanced:

81.8%, the intermediate-high: 94.2%, and the intermediate-mid level: 96.6%) are much greater than those of positive evaluative units (the advanced: 18.2%, the intermediate-high:

5.8%, and the intermediate–mid level: 3.4%). The result from the chi-square test indicates significant differences in frequencies of positive and negative evaluative units pertaining to pronunciation across the three proficiency level ratings; X2 (2, N = 526) = 25.85, p < .01.

Figure 5.13 exhibits the dramatic differences between positive and negative evaluative units at each proficiency level.

[Figure: grouped bar chart of the percentages of positive and negative evaluative units for pronunciation at the Intermediate-mid (n = 234), Intermediate-high (n = 171), and Advanced (n = 121) levels.]

Figure 5.13 Distribution of positive and negative evaluative units across proficiency levels for pronunciation

The percentage of negative evaluative units (96.6%) is the highest at the intermediate-mid level, followed by the intermediate-high (94.2%) and the advanced proficiency levels


(81.8%). In contrast, the percentage of positive evaluative units is the greatest (18.2%) in the advanced proficiency level. The distributions of positive and negative units for pronunciation were consistent with the findings associated with functional competency and comprehensibility. Analysis of raters' evaluative units added further description to the patterns of positive and negative evaluative units for pronunciation.

At all three proficiency level ratings, the positive evaluative units contained general statements appreciating test takers' good pronunciation, such as "good pronunciations" or "Pronunciation is mostly intelligible." On the contrary, the negative evaluative units for pronunciation mostly contained specific error examples that described individual test takers' pronunciation errors. Among all error examples, the major pronunciation issues were associated with pronouncing consonants, vowels, stress, and insertion. The following excerpts present examples of the four major pronunciation errors for each proficiency level.

• Consonants “enviro[n]ment => weakened.” (the advanced, Rater5) “[fut] food.” (the intermediate-high, Rater2) “has difficulty with word final consonants and consonant clusters. For example, test takers said “Rai now” (Right now).” (the intermediate, Rater4)

• Vowels “some vowel issues - OY, AY sounds pronounced for "a" and "o" sounds” (the advanced) (Rater1) “v[e]ctor -> [i.]” (the intermediate-high, Rater16) “meedle(middle).” (the intermediate, Rater3)

• Word stress “Some word stress issues (but not overwhelming) like ‘inTEger’ instead of ‘Integer’...which is a little confusing or ‘PREferred’ instead of ‘preFERRED’.” (the advanced, Rater 1)


“protein (slight word stress issue).” (the intermediate-high, Rater7) “equal stress on all syllables.” (the intermediate, Rater3)

• Insertion “inserted sounds [next a skills]” (the advanced, Rater4) “insertion of vowels at the beginning of some words (i.e. stress -> estress)” (the intermediate-high, Rater7) “soft-uh-ware” (the intermediate, Rater3)

Findings from the analysis of positive and negative evaluative units pertaining to pronunciation suggested that raters' comments for pronunciation were reflective of test takers' different proficiency levels.

Fluency. The ratings involved 82 evaluative units for fluency in the advanced, 122 units in the intermediate-high, and 220 units in the intermediate-mid level ratings, as shown in Table 5.20.

Table 5.20 Frequencies of positive and negative evaluative units for fluency grouped by proficiency levels

Proficiency level                   Positive       Negative
Advanced (n units = 82)             58 (70.7)      24 (29.3)
Intermediate-high (n units = 122)   20 (16.4)      102 (83.6)
Intermediate-mid (n units = 220)    12 (5.5)       208 (94.5)
Total (n units = 424)               90 (21.2)      334 (78.8)
Note: Cells show frequencies of evaluative units; (%) = percentage within proficiency level.

The percentage of positive evaluative units (70.7%) is greater than the percentage of negative evaluative units (29.3%) only in the advanced proficiency level. By contrast, the percentages of negative evaluative units (Intermediate-high: 83.6% and Intermediate: 94.5%)

are higher than those of positive evaluative units both in the intermediate-high (16.4%) and the intermediate proficiency levels (5.5%). A chi-square test found statistically significant differences in frequencies of positive and negative evaluative units for fluency across the three proficiency level ratings; X2 (2, N = 424) = 154.62, p < .01. Figure 5.14 provides visual presentations of positive and negative evaluative units for fluency.

[Figure: grouped bar chart of the percentages of positive and negative evaluative units for fluency at the Intermediate-mid (n = 220), Intermediate-high (n = 122), and Advanced (n = 82) levels.]

Figure 5.14 Distribution of positive and negative evaluative units across proficiency levels for fluency

The percentage of positive evaluative units for fluency is the greatest in the advanced level (70.7%), followed by the intermediate-high (16.4%) and the intermediate-mid (5.5%). The percentage of negative evaluative units for fluency is the greatest in the intermediate-mid level (94.5%), followed by the intermediate-high (83.6%) and the advanced level (29.3%). The following excerpts provide further description of the pattern of positive and negative evaluative units for fluency at each proficiency level.

At the advanced level, positive evaluative units for fluency indicated that test takers delivered speech fluently without serious hesitation or halting. The evaluative units often involved evaluative words such as "very", "good", and "highly" to depict test takers' strengths in fluency, as shown in the following excerpts.

“she improves ideas fluently, also speaks fluently.” (Rater7) “can express her ideas and opinions without hesitation at all; very fluent” (Rater5)


“good phrasing” (Rater3, Rater2) “highly fluent.” (Rater4) “good pace and intonation” (Rater6)

At the intermediate-high and the intermediate-mid proficiency levels, raters wrote relatively few positive evaluative units that recognized test takers' fluent speech. On the contrary, the raters' comments at these two proficiency levels were full of negative evaluative units describing features that impeded the fluency of test takers' speech. The noticeable issues associated with fluency were test takers' halting, choppiness, and lack of control of thought groups.

Examples of raters’ negative evaluative units corresponding to each of the issues are presented for each proficiency level in the following excerpts.

• Halting

“lots of pausing and halting when nervous.” (the advanced, Rater1) “starts with several hesitations.” (the advanced, Rater6) “lots of pauses.” (the intermediate-high, Rater2, Rater16) “halting.” (the intermediate-high, Rater5, Rater2, Rater4, Rater16) “lots of hesitations.” (the intermediate, Rater2, Rater4) “too many hesitations, it impedes fluency and comprehensibility.” (the intermediate, Rater4) • Choppiness

“sometimes a bit choppy.” (the advanced, Rater2, Rater5)

“choppy.” (the advanced, Rater1, Rater5, Rater7)

“sometimes gets a bit choppy and word-by-word.” (the intermediate-high, Rater5, Rater7) “sometimes sounds a bit choppy.” (the intermediate-high, Rater1, Rater2, Rater5, Rater7, Rater16) “repetitions as gap fillers.” (the intermediate, Rater7) “MUCH word-by-word prosody.” (the intermediate, Rater1)


• Lack of controls in thought groups

“talks too fast without pauses.” (the advanced, Rater7) “false starts.” (the intermediate-high, Rater1, Rater5, Rater6, Rater7) “relative long pauses between phrases, seems searching for expressions or ideas.” (the intermediate, Rater5)

The distribution and examples of positive and negative evaluative units for fluency across proficiency levels showed that raters' comments for fluency described test takers' different proficiencies.

Vocabulary. Evaluative units associated with vocabulary consisted of 36 evaluative units in the advanced, 40 units in the intermediate-high, and 91 units in the intermediate proficiency level, as shown in Table 5.21.

Table 5.21 Frequencies and percentages of positive and negative evaluative units for vocabulary grouped by proficiency levels

Proficiency level                   Positive       Negative
Advanced (n units = 36)             33 (91.7)      3 (8.3)
Intermediate-high (n units = 40)    19 (47.5)      21 (52.5)
Intermediate-mid (n units = 91)     12 (13.2)      79 (86.8)
Total (n units = 167)               64 (38.3)      103 (61.7)
Note: Cells show frequencies of evaluative units; (%) = percentage within proficiency level.

The percentage of positive evaluative units (91.7%) for vocabulary is much greater than that of negative units (8.3%) in the advanced level. At the intermediate-high level, the percentage of negative evaluative units (52.5%) is slightly greater than that of positive evaluative units (47.5%). At the intermediate-mid level, the percentage of negative evaluative units (86.8%) is much higher than that of positive evaluative units (13.2%). A chi-square test indicates there are significant differences in frequencies of positive and negative evaluative units for vocabulary across proficiency levels; X2 (2, N = 167) = 69.09, p

< .01. Figure 5.15 shows the distribution of positive and negative evaluative units for vocabulary at each proficiency level.

[Figure: grouped bar chart of the percentages of positive and negative evaluative units for vocabulary at the Intermediate-mid (n = 91), Intermediate-high (n = 40), and Advanced (n = 36) levels.]

Figure 5.15 Distribution of positive and negative evaluative units across proficiency levels for vocabulary

The percentage of positive evaluative units (91.7%) in the advanced level is greater than the percentages of positive units in the two lower proficiency levels (Intermediate-high:

47.5% and Intermediate: 13.2%). On the contrary, the percentage of negative units (86.8%) in the intermediate-mid level is the greatest, followed by the intermediate-high (52.5%), and the advanced proficiency level (8.3%). Raters’ specific comments support the distribution of positive and negative evaluative units for vocabulary across the three proficiency levels.

At the advanced proficiency level, positive evaluative units were full of appreciation focusing on test takers' strong vocabulary and sophisticated, wide-ranging language and expressions, whereas only a few negative evaluative units were observed. The following excerpts exhibit examples of raters' positive evaluative units expressing this appreciation.

“strong vocabulary.” (Rater2)

“sophisticated language and expressions.” (Rater2, Rater5)


“a wide range use of vocabulary, maybe it's related her discipline?” (Rater16) “can use a variety of expressions and syntactic structures.” (Rater5)

At the intermediate-high proficiency level, positive evaluative units for vocabulary described test takers who did not use as wide a range of vocabulary and expressions as the advanced-level test takers did. However, they were able to use developed language and expressions "to some extent" to deliver their speech, as shown in the following excerpts.

“somewhat developed language, both structurally and vocabulary-wise.” (Rater5) “can express and develop her ideas to some extent.” (Rater5)

In negative evaluative units, test takers appeared to repeat the same words and expressions somewhat frequently and to use vocabulary and expressions inappropriately. The following excerpts present examples of test takers' limitations regarding vocabulary.

• Repetition of same words and expressions

“repetition of simple expression.” (Rater16) “sometimes repeats ideas, maybe.” (Rater7)

• Inappropriate use of words and expressions

“some word choice inappropriate or non-native like ‘I'm going to discuss to you’; non-native like choice of some words.” (Rater1)

“but lacks the vocabulary to deliver beyond some concrete thoughts.” (Rater6)

At the intermediate-mid level, negative evaluative units for vocabulary predominated. Test takers' use of vocabulary at this proficiency level was described as "unnatural" and as showing "limited use of words and expressions."


• Unnatural use of vocabulary “Student was not able to expand/elaborate on ideas using sophisticated vocabulary or grammatical structure.” (Rater6) “somewhat unnatural expressions from time to time.” (Rater5) “we can see the beautiful sign (word choice: scenery) on the road.” (Rater4)

• Limited scope of vocabulary use “simple grammar and vocabulary.” (Rater1, Rater3) “use a limited set of vocabulary.” (Rater4) “but struggles to portray message due to a limited vocabulary.” (Rater6)

The analysis of the quality of positive and negative evaluative units for vocabulary supported that raters' comments associated with vocabulary were good indicators of test takers' different proficiency levels.

Grammar. Evaluative units pertaining to grammar included 47 evaluative units at the advanced, 74 units at the intermediate-high, and 164 units at the intermediate-mid proficiency level, as shown in Table 5.22.

Table 5.22 Frequencies of positive and negative evaluative units for grammar grouped by proficiency levels

OECT scoring criterion: Grammar

                                               Types of evaluative units
Proficiency level                               Positive    Negative
Advanced (n units = 47)           Frequency         23          24
                                  (%)            (48.9)      (51.1)
Intermediate-high (n units = 74)  Frequency         11          63
                                  (%)            (14.9)      (85.1)
Intermediate-mid (n units = 164)  Frequency         12         152
                                  (%)             (7.3)      (92.7)
Total (n units = 285)             Frequency         46         239
                                  (%)            (16.1)      (83.9)
Note: (N units) = Number of evaluative units, (%) = Percentage within level


For grammar, the percentages of negative evaluative units are greater than those of positive units at all three proficiency levels. At the advanced level, the percentage of negative evaluative units for grammar (51.1%) is slightly greater than that of positive units (48.9%). The percentages of negative evaluative units (intermediate-high: 85.1% and intermediate-mid: 92.7%) exceed those of positive evaluative units (intermediate-high: 14.9% and intermediate-mid: 7.3%) at both the intermediate-high and the intermediate-mid proficiency levels. A chi-square test shows significant differences in frequencies of positive and negative evaluative units for grammar across the three proficiency levels; X2 (2, N = 285) = 46.87, p < .01. Figure 5.16 displays the noticeable distribution of positive and negative evaluative units across the three proficiency levels.

[Bar chart: percentages of positive and negative evaluative units for grammar at the intermediate-mid (n = 164), intermediate-high (n = 74), and advanced (n = 47) levels; y-axis: percentages of evaluative units, x-axis: proficiency levels.]

Figure 5.16 Distribution of positive and negative evaluative units across proficiency levels for grammar

The percentage of positive evaluative units is the greatest at the advanced level (48.9%), followed by the intermediate-high (14.9%) and the intermediate-mid (7.3%). By contrast, the percentage of negative evaluative units is the highest at the intermediate-mid level (92.7%), followed by the intermediate-high (85.1%) and the advanced level (51.1%). The following raters' excerpts illustrate the distribution of positive and negative evaluative units for grammar.


At the advanced proficiency level, positive evaluative units revealed test takers' good control in constructing advanced sentence structures and their accurate use of grammar and sentence structures, as displayed in the following excerpts.

• Advanced sentence structure “perfectly accurate grammar with ability to create complex grammatical patterns.” (Rater1) “can use a variety of expressions and syntactic structures.” (Rater5)

• Accurate use of grammar and sentence structures “grammar overall is accurate.” (Rater6) “accurate grammar, similar to that of an educated native speaker in the U.S.” (Rater1)

On the other hand, positive evaluative units at the intermediate-high proficiency level and the intermediate-mid level indicated test takers used appropriate but not sophisticated grammar and sentence structures.

“but she can use somewhat complex language.” (the intermediate-high, Rater5) “vocabulary and grammar appropriate.” (the intermediate-high, Rater1) “fairly good grammar.” (the intermediate, Rater4) “grammar use is not bad.” (the intermediate, Rater7)

In contrast to the positive evaluative units, negative evaluative units across all three proficiency levels revealed a wide range of grammar errors that test takers produced. The types of grammatical errors noticed in the evaluative units were subject-verb agreement, verb tense, and singular-plural nouns and verbs. The following excerpts exemplify each of these grammar errors identified from the negative evaluative units across the three proficiency levels.


• Subject-verb agreement

“subject-verb agreement [s]” (the advanced, Rater 16) “it [do] something : subject-verb agreement.” (the intermediate-high, Rater16) “subject-verb agreement issues, has/have.” (the intermediate, Rater1)

• Verb tense

“There are occasional grammatical issues, such as the omission of an article or a misplaced tense.” (the advanced, Rater 4) “we have [had]; inconsistent use of past.” (the intermediate-high, Rater4) “I write (wrote) (Verb Tense).” (the intermediate, Rater2)

• Singular-plural

“plural of research: researches; those softwares (that software for plural)” (the advanced, Rater 3) “one of my friend.” (the intermediate-high, Rater5) “Some question (singular and plural).” (the intermediate, Rater7)

Analysis of raters’ positive and negative evaluative units for grammar supported that raters’ evaluative units represented test takers’ proficiency levels.

In summary, the analysis of the quality of raters' comments for supporting the three proficiency levels corroborated the second assumption. This finding was evident from the analysis of the association between all positive and negative evaluative units and the three proficiency levels. The finding was also apparent in comparisons between positive and negative evaluative units referring to each of the six OECT scoring criteria across the proficiency levels. Overall, the findings from the analysis of raters' comments supplied backing for the assumption that test administration conditions, in terms of rating procedures with R-Plat, are appropriate for providing evidence of targeted language abilities.


5.4 Descriptive Statistics

The fourth research question asked whether the OPI scores effectively separated examinees into different proficiency levels. Data included the OPI scores for 279 examinees collected at six test administrations (administration 1: 36 examinees, administration 2: 37 examinees, administration 3: 68 examinees, administration 4: 36 examinees, administration 5: 52 examinees, and administration 6: 50 examinees). The descriptive statistics were calculated to investigate the characteristics of the scores on the OPI for each test administration (ADMIN), and for all test administrations pooled. Table 5.23 presents the results of the descriptive statistics of the OPI scores for each test administration and all test administrations pooled. To address the research question, the range of the scores, the standard deviations, skewness, kurtosis, and distributions of the scores in the histograms at each test administration were examined.

Table 5.23 Descriptive statistics of the OPI scores for each test administration and all test administrations pooled

                                                     Skewness         Kurtosis
ADMIN        N     Mean     S.D.    Min     Max       Stat   S.E.      Stat   S.E.
ADMIN 1      36    217.30   18.28   185     264.17     .34    .39      -.27    .77
ADMIN 2      37    209.47   20.07   180     268        .55    .39       .49    .76
ADMIN 3      68    212.14   20.56   175.83  288.75     .93    .29      1.71    .57
ADMIN 4      36    210.19   16.13   175     234.17    -.33    .39      -.58    .77
ADMIN 5      52    214.06   17.95   183.33  253.33    -.12    .33     -1.06    .65
ADMIN 6      50    213.68   25.26   147.50  299.17     .68    .34      2.56    .66
All ADMINS   279   212.83   20.15   147.50  299.17     .53    .15      1.46    .29
Note: N refers to the number of examinees; ADMIN refers to an individual test administration.

The results of the descriptive statistics showed the OPI scores separated examinees into different proficiency levels, in general. First, the ranges of the OPI scores showed that the OPI scores place the examinees into different ability levels, mostly from the intermediate-mid to the advanced level, from ADMIN 1 through ADMIN 5. For example, at ADMIN 1, the range of scores is between 185 (minimum) and 264.17 (maximum), which places examinees in the intermediate-mid to the advanced level. However, the OPI scores separated the examinees from the intermediate-low to the advanced level at ADMIN 6 and in all test administrations pooled.

For the OPI scores from all test administrations pooled, the scores range from 147.50 (minimum) to 299.17 (maximum)—between the intermediate-low and the advanced level. Next, the histograms of the score distributions at most test ADMINs show a bell-shaped curve, which is evidence of a normal distribution of scores across different ability levels (Figure 5.17), except for ADMIN 6. Although the skewness and peaks of the bell-shaped curves at each test administration vary slightly, the skewness and the kurtosis of these scores, from ADMIN 1 through ADMIN 5 and for all test administrations pooled, fall into the normal range between -2 and +2 (George & Mallery, 2010). This result shows that the OPI scores are normally distributed across different ability levels in general. However, the kurtosis at ADMIN 6 is 2.56, showing a peaked distribution for the OPI scores.
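As a minimal sketch of how such per-administration descriptive statistics can be computed (Python with pandas and SciPy is assumed here for illustration, not the SPSS procedures used in the study; the file name and column names are hypothetical):

    import pandas as pd
    from scipy.stats import skew, kurtosis

    # Hypothetical input: one row per examinee, with the administration number
    # ('admin') and the final OPI score ('score').
    scores = pd.read_csv("opi_scores.csv")

    def describe(s):
        # Mean, spread, and shape statistics for one set of scores.
        return pd.Series({
            "N": s.count(),
            "Mean": s.mean(),
            "S.D.": s.std(),
            "Min": s.min(),
            "Max": s.max(),
            "Skewness": skew(s, bias=False),      # bias-corrected sample skewness
            "Kurtosis": kurtosis(s, bias=False),  # bias-corrected excess kurtosis
        })

    by_admin = scores.groupby("admin")["score"].apply(describe).unstack()
    pooled = describe(scores["score"])
    print(by_admin.round(2))
    print(pooled.round(2))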

However, assuming the equivalence of the groups of examinees across administrations, it was found that the scores did not exhibit the characteristics of parallel tests across test administrations due to dissimilar standard deviations for each ADMIN. The highest standard deviation was observed for ADMIN 6 (25.26), followed by ADMIN 3 (20.56), ADMIN 2 (20.07), ADMIN 1 (18.28), ADMIN 5 (17.95), and ADMIN 4 (16.13). The difference between the highest and the smallest standard deviation was 9.13, which is large.


[Histograms of final OPI scores (frequency by final score) for Administration 1 (N = 36), Administration 2 (N = 37), Administration 3 (N = 68), Administration 4 (N = 36), Administration 5 (N = 52), and Administration 6 (N = 50).]

Figure 5.17 Histograms of the OPI scores for each test administration

Overall, it was found that the OPI scores separated the examinees into different ability levels, given the spread of OPI scores across different levels and the normal distributions of the histograms at most test ADMINs. However, it should be noted that the different standard deviations at each test ADMIN suggest the tests administered during the six different time periods were not entirely parallel, assuming equivalence across groups of examinees.


5.5 Dependability of OPI Ratings

The fifth research question investigated whether the OPI ratings can dependably separate examinees into different ability levels. The distributions of ratings need investigation because they indicate whether a test can distinguish high-ability examinees from lower-ability examinees. Data consisted of 803 individual raters' ratings collected during six test administrations (Administration 1: 98 ratings, Administration 2: 104 ratings, Administration 3: 200 ratings, Administration 4: 114 ratings, Administration 5: 155 ratings, and Administration 6: 132 ratings). This section begins with the results of descriptive statistics to examine the overall distributions of the OPI ratings. Next, it shows the results of the unidimensionality assumption check using principal component analysis in SPSS 22, which is required for the subsequent MFRM analysis. Finally, this section concludes with results for score dependability obtained from MFRM analysis using FACETS 3.71.4.

5.5.1 Descriptive statistics for OPI ratings

Descriptive statistics were calculated to examine how individual raters' ratings are distributed at each test ADMIN and across all test administrations pooled. Table 5.24 presents the results of the descriptive statistics of the OPI ratings for each test administration and all test administrations pooled, including the average ratings, the ranges of these ratings, and the standard deviations of these ratings at each test administration. In addition, the histograms for the ratings, along with skewness and kurtosis, are reported to provide graphical representations of the score distribution at each test administration and for all test administrations pooled.


Table 5.24 Descriptive statistics of the OPI ratings for each test administration and all test administrations pooled

                                                  Skewness           Kurtosis
ADMIN        N     Mean     S.D.    Min    Max      Stat    S.E.       Stat    S.E.
ADMIN 1      98    217.93   20.36   170    290       .18     .12        .19     .24
ADMIN 2      104   209.11   22.47   160    300       .59     .12       1.06     .23
ADMIN 3      200   211.80   22.14   150    300       .53     .08        .89     .17
ADMIN 4      114   209.65   18.60   160    250      -.26     .11       -.51     .22
ADMIN 5      155   213.92   20.20   170    300      -.066    .098      -.31     .19
ADMIN 6      132   214.85   27.11   140    300       .610    .106      1.43     .21
All ADMINS   803   212.81   22.22   140    300       .398    .043      1.00     .08
Note: N refers to the number of individual raters' ratings; ADMIN refers to an individual test administration.

The results showed that the individual raters' ratings are distributed across the different ability levels, from the intermediate-low to the advanced level. First, the ranges of the ratings are between 140 and 300, which place examinees from the intermediate-low level to the advanced level, in most test administrations, including ADMIN 2, ADMIN 3, ADMIN 4, ADMIN 6, and all test ADMINs pooled. Although the ratings at ADMIN 1 and ADMIN 5 fall into the intermediate-mid (170) to the advanced level (300), a wide range of ratings is still observed.

In addition, the histograms for the rating distributions at each test ADMIN provide a clear presentation of the bell-shaped curve evident for the normal distribution of the ratings, as shown in Figure 5.18. The skewness and kurtosis of the scores at each test ADMIN fall within the normal range, indicating the OPI ratings are normally distributed across different ability levels. However, different standard deviations were observed for each distribution. The highest standard deviation was observed in ADMIN 6 (27.11), followed by ADMIN 2 (22.47), ADMIN 3 (22.14), ADMIN 1 (20.36), ADMIN 5 (20.20), and ADMIN 4 (18.60). The difference between the highest and smallest standard deviation was 8.51, which is large.

[Histograms of final OPI ratings (frequency by final score) for Administration 1 (N = 98), Administration 2 (N = 104), Administration 3 (N = 200), Administration 4 (N = 114), Administration 5 (N = 155), and Administration 6 (N = 132).]

Figure 5.18 Histograms of the OPI ratings for each administration

To conclude, despite the different standard deviations for each administration, the findings supported that the OPI ratings are distributed across the different ability levels across the six test administrations.


5.5.2 Unidimensionality assumption check

A unidimensionality assumption check was conducted using the principal component analysis (PCA) in SPSS 22 to examine whether the prompts within each of the four intended levels measured a single construct, hypothesized as speaking ability in the OPI. Data were collected from 803 individual raters’ ratings from six test administrations.

To examine the number of constructs, missing data in the file containing the 803 rating results were imputed using the multiple imputation method, which generated five sets of imputed data in SPSS 22. Next, the PCA was run with each of the five sets of imputed data, yielding five separate outputs. In the PCA outputs, the unidimensionality of the prompts was confirmed by (a) Bartlett's test of sphericity and the Kaiser-Meyer-Olkin measure of sampling adequacy for appropriateness of the common factor model, (b) the proportion of the first factor variance relative to the second factor variance, (c) the eigenvalues for the first factor, and (d) the scree plots.
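A minimal sketch of the core of this check is given below (Python is assumed here for illustration rather than SPSS; the file name and column layout are hypothetical, the multiple-imputation step is omitted, and the KMO index, which SPSS reports directly, is not reproduced):

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2

    # Hypothetical ratings matrix: one row per rated performance, one column per
    # prompt-level score entering the rating (a single complete or imputed data set).
    ratings = pd.read_csv("opi_ratings.csv").dropna()
    X = ratings.to_numpy()

    # Eigenvalues of the correlation matrix and the variance each factor explains.
    R = np.corrcoef(X, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
    variance_pct = 100 * eigvals / eigvals.sum()
    print("Eigenvalues:", np.round(eigvals, 2))
    print("Variance %: ", np.round(variance_pct, 2))

    # Bartlett's test of sphericity: H0 = the correlation matrix is an identity matrix.
    n, p = X.shape
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    print("Bartlett chi-square:", round(statistic, 2), " p =", chi2.sf(statistic, df))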

The results of the principal component analysis are presented for each iteration in Table 5.25. First, the small significance values (p-values < 0.05) for Bartlett's test of sphericity and the Kaiser-Meyer-Olkin measures above 0.5 indicate that a substantial proportion of the variance among the variables is common variance, an indicator of latent common factors and of the appropriateness of the common factor model. Second, in all outputs from the five separate iterations, the variance explained by the first factor is far greater than that explained by the second factor, suggesting the first factor primarily accounts for the variance in the data. For example, in the first iteration the first factor variance is 93.19% and the second factor variance is 4.34%. Third, in all iterations, the eigenvalues for the first factor were in excess of 3, whereas those for the second factor were substantially less than one.

Table 5.25 Results of the principal component analysis

                                              Iteration
                                    1        2        3        4        5
Factor 1   Eigenvalues             3.72     3.79     3.77     3.49     3.82
           Total variance %       93.19    94.74    94.43    87.39    95.59
Factor 2   Eigenvalues              .17      .15      .15      .42      .08
           Total variance %        4.34     3.81     3.80    10.67     2.10
Bartlett's test of sphericity       .00      .00      .00      .00      .00
Kaiser-Meyer-Olkin measure          .84      .77      .63      .70      .86

Finally, in a scree plot, the point where the slope of the curve levels off indicates the number of factors. The scree plot in Figure 5.19 is drawn from the data set in the first iteration; the component number on the horizontal axis refers to the factor number.

Figure 5.19 A scree plot with the imputed data from the first iteration

The slope of the curve clearly levels off beginning at component 2, and the line from component 3 onward is almost flat, meaning each successive factor accounts for a smaller amount of the total variance. The results showed the existence of one factor in the model. The scree plots from the other iteration outputs showed the same pattern of leveling between components 1 and 2. In short, the results of the principal component analysis indicated the prompts across the four intended levels measured a single underlying construct. This finding allowed for MFRM analysis to investigate the dependability of the OPI ratings, task difficulty, and raters' severity and rating scale use. Results from the MFRM analyses are presented next.

5.5.3 Dependability of the OPI ratings

Dependability of the OPI ratings was examined by analyzing the 803 individual raters' ratings using MFRM analysis. Using a three-facet rating scale model, the examinee facet and the prompt facet were set to zero logits in FACETS, whereas the rater facet was non-centered. This was done because one facet must be allowed to float for FACETS to run. By floating the rater facet, rater severity relative to the other facets was examined. To investigate whether the scores separated examinees into different ability levels, this study examined (a) the distribution of the examinee measures on the vertical ruler, (b) the separation index, and (c) the reliability index.
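For reference, the three-facet rating scale model used here is conventionally written in the many-facet Rasch measurement literature as follows (this is the standard form of the model, not a quotation from the FACETS documentation):

\[
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \alpha_i - \delta_j - \tau_k
\]

where \(\theta_n\) is the ability of examinee n, \(\alpha_i\) is the severity of rater i, \(\delta_j\) is the difficulty of prompt j, and \(\tau_k\) is the threshold of rating category k relative to category k-1 on the shared rating scale.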

Figure 5.20 shows the vertical ruler displaying the three facets on the common logit scale. FACETS "converts raw scores to logits on an objective measurement scale" and presents the estimation of individual examinee ability, prompt difficulty, and rater severity on a common logit scale (Bond & Fox, 2007, p. 309). The first column indicates a range of logit values from -17 to 10 logits. The second column presents individual examinees' relative abilities on the logit scale; positive logit values refer to more able examinees, whereas negative logit values present the opposite case. In this output, the measures of the examinee facet spanned -16.21 to 9.76 logits, a range of 25.97 logits, showing that the examinees are spread broadly from high to low ability levels. The third column denotes the relative severity of raters: raters higher on the logit scale are more severe, while those lower on the scale are more lenient. The fourth column indicates prompt difficulties, ranging from difficult prompts higher on the scale to easy prompts lower on the scale. The last column shows the rating scale ranging from 14 to 30, which corresponds to 140 to 300, respectively, on the original scale.

[FACETS vertical ruler: examinee ability measures, rater severity, and prompt difficulty plotted on a common logit scale from approximately -17 to +10 logits, with the rating scale (labeled SPEAK, 14-30) in the rightmost column.]

Figure 5.20 Vertical ruler with all data from all test administrations

Dependability of the OPI ratings is determined based on the separation index and the reliability index. Results showed that the separation index is 6.53 and the reliability of the separation is 0.98, showing the test scores could dependably separate examinees into at least six ability levels. The OPI aims to separate examinees into four ability levels, namely, the advanced, intermediate-high, intermediate-mid, and intermediate-low levels. The separation index suggested the OPI separates examinees into more levels than the original ability bands. However, considering that this research question mainly focused on the overall separation of the OPI ratings, this result still supports the dependability of the test scores. In addition, in practice the intermediate-mid level is divided into upper and lower bands, indicating five possible ability levels. To conclude, given the spread of examinees on the vertical ruler, the separation index, and the reliability index, these findings supported the dependability of the test scores in terms of discriminating examinees of different proficiency levels. They served as backing for the assumption in the evaluation inference that the test reliably distinguishes examinees' different speaking proficiency levels.
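For readers unfamiliar with these indices, the standard Rasch definitions (a general formulation, not output specific to this study) are

\[
G = \frac{SD_{true}}{RMSE}, \qquad R = \frac{SD_{true}^{2}}{SD_{observed}^{2}} = \frac{G^{2}}{1+G^{2}}
\]

where G is the separation index (the spread of the "true" examinee measures relative to their average measurement error) and R is the separation reliability. With the reported separation of 6.53, R = 6.53^2 / (1 + 6.53^2) ≈ 0.98, which is consistent with the reliability reported above.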

5.6 Comparison of Intended Prompt Level and Observed Difficulty

The sixth research question investigated the extent to which prompts intended to be at the same level are similar to one another in their observed (empirical) difficulties. Data were the raters' ratings on 73 prompts, consisting of 19 prompts at the advanced level, 24 prompts at the intermediate-high level, 27 prompts at the intermediate-mid level, and 3 prompts at the intermediate-low level.

A separate FACETS analysis was executed for each prompt level using the three-facet rating scale model with the prompt facet being non-centered. In the initial analyses, disjoint subsets of data were observed in the data sets for the intermediate-mid and intermediate-low prompt levels. To connect these unconnected subsets, the group-anchoring method was utilized. Group-anchoring allows groups of elements to be anchored so that their mean is fixed, while individual elements float relative to that mean (Linacre, 2012). To connect the disjoint subsets, therefore, the examinee subsets were anchored at the means of each examinee subset group. Detailed descriptions of the anchoring methods are provided in the analysis for each prompt level.
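As a minimal conceptual sketch of what anchoring a subset at its mean amounts to (illustrative Python only; this is not FACETS's estimation routine, and the example measures are hypothetical):

    import numpy as np

    def group_anchor(measures, anchor_mean):
        # Shift a subset of logit measures so that the subset mean equals the
        # specified anchor value, preserving differences among elements within it.
        measures = np.asarray(measures, dtype=float)
        return measures + (anchor_mean - measures.mean())

    # Hypothetical examinee measures in two disjoint subsets, anchored at the
    # subset means used in the intermediate-mid analysis (.00 and .54 logits).
    subset1 = group_anchor([-0.8, 0.1, 0.9], anchor_mean=0.00)
    subset2 = group_anchor([-0.3, 0.6, 1.2], anchor_mean=0.54)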

For each level, the output was examined for (a) the separation and reliability indices, (b) the measures for the prompt facet on the all-facet vertical ruler, and (c) the fair averages, to determine the consistency of prompts in difficulty while taking rater severity into account. In addition, (d) infit mean squares were utilized to identify problematic prompts.

5.6.1 Prompt difficulty at the advanced level

Prompt difficulty at the advanced level was investigated with individual raters' ratings on the 19 prompts, namely prompts 1, 6, 11, 21, 26, 36, 41, 51, 56, 66, 71, 81, 86, 91, 92, 93, 103, 104, and 105. The FACETS output revealed no unconnected subsets in the data.

In the output, the separation index is 1.03, indicating the prompts were similar in difficulty to each other because for the most part they fell into one difficulty level. The reliability of the separation reflects a lack of dispersion among the prompts, with a value of 0.51, which is quite low. This means that the prompts are not separable into different difficulty levels. With respect to the measures for the prompt facet, the observed prompt difficulty ranged from -.17 (prompt 92) to 1.32 (prompt 1), a spread of 1.49 logits, as shown in the fourth column of the vertical ruler in Figure 5.21. Most prompts gathered around the average difficulty of the prompts (.62 logits) and were within one logit. This analysis shows the prompts at the advanced level performed very similarly to each other with respect to their observed difficulties.


Figure 5.21 The vertical ruler for prompts at the advanced level

Consistency of the observed prompt difficulty was further examined using the range of fair averages of the prompts in the prompt difficulty table (Table 5.26). In Table 5.26, the 19 prompts at the advanced level are presented in order of their observed difficulty, from the most difficult at the top (Prompt 1) to the easiest at the bottom (Prompt 92).

Table 5.26 Prompt difficulty for the advanced level

Prompt ID   Count   Fair Average   Severity Measure   Model S.E.   Infit MnSq   ZStd
01           16        22.41            1.32             .41          1.80       1.7
71           26        22.61             .97             .22          1.02        .1
103          74        22.67             .85             .17          1.35       1.6
11           50        22.73             .75             .17          1.08        .4
36           37        22.73             .74             .19           .67      -1.3
81           45        22.74             .73             .20           .62      -1.7
91           24        22.74             .72             .26           .89       -.2
86           65        22.76             .67             .15           .85       -.7
66           34        22.77             .66             .21           .41      -2.6
93           58        22.77             .65             .17          1.39       1.7
21           50        22.80             .59             .18          1.12        .5
6            56        22.82             .55             .16          1.20        .9
56           51        22.83             .53             .16          1.00        .0
105          44        22.83             .53             .18           .51      -2.3
41           35        22.85             .48             .20           .92       -.2
51           34        22.86             .45             .21           .79       -.6
26           46        22.89             .39             .17           .70      -1.3
104          44        22.94             .28             .16           .67      -1.4
92           34        23.14            -.17             .20          1.12        .4
Mean         43.3      22.78             .62             .20           .95       -.3
S.D.         14.4        .14             .30             .06           .34       1.3
Separation 1.03   Reliability .51

The first column indicates the prompt ID. The second column shows the number of responses for each prompt over the six administrations. In the third column, the fair average indicates the average score of the prompt after taking rater severity into account. The fair averages among the advanced-level prompts range from 22.41 (the most difficult prompt) to 23.14 (the easiest prompt), which correspond to 220.41 and 230.14 on the original scale. The span of the fair averages is approximately 10 score points on the original scale, which is narrow, demonstrating that the observed difficulty of the advanced-level prompts appeared consistent. The measures in the fourth column indicate the estimated logit values for each prompt. In the fifth column, the model standard errors indicate the precision of the estimated measure for each prompt.
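As a point of reference, the fair average is commonly defined (this is a paraphrase of the usual many-facet Rasch definition, not the FACETS documentation) as the expected rating for a prompt when the other facets are set at their mean measures:

\[
\text{Fair average}_j = \sum_{k} k \; P_k\!\left(\bar{\theta}, \bar{\alpha}, \delta_j\right)
\]

where \(P_k\) is the model probability of rating category k given the average examinee ability \(\bar{\theta}\), the average rater severity \(\bar{\alpha}\), and the difficulty \(\delta_j\) of prompt j; the result is expressed back on the rating scale (here, 14-30).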

The rightmost columns show the infit statistics, consisting of the infit mean square for each prompt and the corresponding standardized value (ZStd). The normal range of infit statistics is between 0.4 and 1.5 (Linacre, 2002). Any prompt with an infit mean square over 1.5 is a poorly fitting prompt that may be measuring a different construct compared with the other prompts at the same intended difficulty level. Prompts with infit mean squares smaller than 0.4 indicate that they are not measuring a meaningful construct. In this output, the infit statistics for 18 of the prompts fall within the normal range between 0.4 and 1.5; the exception is Prompt 1. Prompt 1, whose infit mean square is 1.8, produces noise in an otherwise stable set of prompts. The noise from this prompt could be attributable to the prompt's underuse across the test administrations. The other prompts at the advanced level were used at least 24 times; Prompt 1, by comparison, was used only 16 times and thus had fewer responses than the other prompts. The standard error for Prompt 1 is 0.41—the greatest among the standard errors for these prompts—and approximately 1.5 times greater than the second largest standard error (.26). This is because fewer responses increase error in estimation, which subsequently leads to a lack of precision in the measures. In brief, the prompts at the intended advanced level were similar in their observed difficulty despite the misfit of Prompt 1. Overall, a good match was observed between the intended prompt level and the observed prompt difficulty.
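As a hedged sketch of how an infit mean square of this kind is obtained (Python; this follows the standard Rasch definition rather than FACETS's internal routines, and the input arrays are hypothetical):

    import numpy as np

    def infit_mean_square(observed, expected, variance):
        """Information-weighted (infit) mean square for one element (e.g., one prompt).

        The inputs are arrays over all ratings involving that element: the observed
        rating category, the model-expected category, and the model variance of
        each observation.
        """
        observed = np.asarray(observed, dtype=float)
        expected = np.asarray(expected, dtype=float)
        variance = np.asarray(variance, dtype=float)
        # Infit weights each squared residual by its information (the model variance),
        # so it is most sensitive to unexpected responses near the element's own level.
        return np.sum((observed - expected) ** 2) / np.sum(variance)

    # Values roughly between 0.4 and 1.5 are treated as acceptable fit here; larger
    # values flag noisy, poorly fitting elements such as Prompt 1 above.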

5.6.2 Prompt difficulty at the intermediate-high level

The consistency of the prompts at the intermediate-high level was examined using the individual raters' ratings on 24 prompts at the intended intermediate-high level. The prompts included 2, 7, 12, 17, 22, 27, 32, 37, 42, 47, 52, 57, 62, 67, 72, 77, 82, 87, 94, 95, 96, 106, 107, and 108. The output showed a successful connection of the data without any disjoint subsets.

To estimate the consistency of the 24 prompts' difficulties, the separation index was obtained. At 1.17, this indicated prompt difficulty did not vary considerably, although individual prompt differences did exist. The reliability index is 0.58, showing the prompts were not reliably separable. In the vertical ruler (Figure 5.22), the prompts range from -.47 (Prompt 107: the easiest) to .59 (Prompt 95: the most difficult), a spread of about 1.06 logits, almost within one logit. The prompts appear to bundle in a very narrow band around 0.00 on the logit scale.

Figure 5.22 Vertical ruler for prompts at intermediate-high level

Next, the range of the fair average is also narrow. As shown in the prompt difficulty table (Table 5.27), the fair average scores range between 21.00 (the most difficult) and 21.90 (the easiest), interpreted as 210 and 210.90, respectively, in the original scale. This range supports the narrow scope of the score distribution after taking account of rater severity.

With regard to problematic prompts, most prompts at the intermediate-high level functioned properly to measure examinees' abilities because the infit statistics for most of the prompts fall within the normal range. However, Prompt 107 was flagged as problematic given its infit mean square (1.59) and the largest standard error (0.39) among the prompts at this difficulty level. The standard error for Prompt 107 is 1.5 times greater than the second largest standard error, that for Prompt 95 (.23). The large infit mean square and the largest standard error might derive from a lack of responses for Prompt 107, which was used only 16 times, compared with the other prompts, which were used at least 28 times.

Table 5.27 Prompt difficulty for intermediate-high level

Prompt ID   Count   Fair Average   Severity Measure   Model S.E.   Infit MnSq   ZStd
95           28        21.00             .59             .23          1.21        .8
42           59        21.16             .42             .14           .92       -.3
22           63        21.22             .35             .14           .91       -.4
77           57        21.22             .34             .16           .61      -2.1
27           62        21.28             .27             .17          1.25       1.3
47           56        21.40             .14             .16           .96       -.1
12           57        21.40             .13             .16          1.06        .3
17           56        21.41             .13             .17           .94       -.2
108          55        21.44             .10             .16          1.14        .7
37           79        21.45             .08             .13           .89       -.6
52           57        21.45             .07             .16          1.17        .9
94           45        21.47             .05             .17          1.08        .4
87           38        21.51             .01             .19          1.00        .0
82           71        21.57            -.06             .13           .97       -.1
32           50        21.59            -.08             .17           .81       -.8
2            56        21.60            -.10             .16          1.19        .9
62           50        21.68            -.19             .16           .72      -1.4
106          50        21.71            -.24             .16           .77      -1.1
67           66        21.72            -.24             .15          1.13        .7
57           39        21.75            -.28             .21          1.28       1.2
7            58        21.76            -.30             .15           .82       -.9
96           55        21.77            -.30             .16          1.04        .2
72           76        21.86            -.41             .13           .87       -.8
107          14        21.90            -.47             .38          1.59       1.4
Mean         54        21.51             .00             .17          1.01        .0
S.D.         14.2        .23             .27             .05           .21        .9
Separation 1.17   Reliability .58


Despite one problematic prompt, the overall results reveal the prompts at the intended intermediate-high level are consistent in their observed difficulty levels.

5.6.3 Prompt difficulty at the intermediate-mid level

The prompt difficulty at the intended intermediate-mid level was investigated with individual raters' ratings on 27 prompts, consisting of prompts 3, 8, 13, 18, 23, 28, 33, 38, 43, 44, 48, 53, 54, 58, 59, 63, 68, 73, 78, 83, 88, 97, 98, 99, 109, 110, and 111. Since two unconnected subsets were identified in the examinee facet in the initial FACETS analysis, the group-anchoring method was utilized to anchor each disjoint subset at the mean for that subset. The examinees in subset 1 were anchored at the mean of the subset 1 group (.00 logits), while the examinees in subset 2 were anchored at .54 logits, resulting in connected data.

Consistency in the intermediate-mid level prompts was estimated by the low separation index (.68) and the low reliability index (.32), indicating that the prompts are not reliably separable into different difficulty levels. Furthermore, on the vertical ruler, the prompts cluster around the mean of the prompt measures, which is 0 logits on the scale, as displayed in Figure 5.23. The narrow range of the prompt distribution indicates the prompts at the intended intermediate-mid level are similar in their observed difficulty. The prompt measures range between -.57 (Prompt 3: the easiest) and .50 (Prompt 54: the most difficult), a range of 1.07 logits.


Figure 5.23 Vertical ruler for prompts at intermediate-mid level

Next, the narrow scope of the fair average scores supported the consistency of the prompts at this level. In the prompt difficulty table (Table 5.28), the fair average scores ranged from 19.82 (the most difficult prompt) to 20.84 (the easiest prompt), from 190.82 to 200.84 in the original scale, after adjusting for rater severity.

Finally, the prompts at the intermediate-mid level performed adequately, except for three problematic prompts (38, 43, and 53). Most of the prompts fall within the acceptable range of infit mean squares between 0.4 and 1.5. The infit mean squares for these three prompts, on the other hand, are 1.7 (Prompt 38), 2.00 (Prompt 43), and 1.62 (Prompt 53), suggesting these prompts may be measuring something different from the other prompts at this level. In the prompt analyses for the advanced and intermediate-high levels, the problematic prompts were attributable to a lack of responses. However, the prompts at this level obtained sufficient responses, and although the corresponding standard errors appeared relatively high, they did not differ greatly from those of the other prompts. This result points to further investigation of the quality of the three problematic prompts.

Table 5.28 Prompt difficulty for intermediate-mid level

Prompt ID   Count   Fair Average   Severity Measure   Model S.E.   Infit MnSq   ZStd
54           12        19.82             .50             .33           .40      -1.7
43           28        19.87             .45             .29          2.00       3.0
44           12        19.94             .37             .32           .97        .0
53           24        19.95             .36             .29          1.62       1.9
28           16        20.00             .31             .34          1.46       1.2
58           22        20.07             .23             .24          1.07        .3
33           72        20.07             .23             .13          1.09        .6
83           12        20.07             .23             .31           .60      -1.0
48           61        20.09             .21             .15          1.05        .3
111          20        20.11             .19             .26          1.01        .1
88           25        20.15             .14             .21           .97        .0
97           46        20.16             .14             .19          1.41       1.8
78           65        20.20             .09             .13           .81      -1.0
38           15        20.29             .00             .36          1.70       1.7
98           40        20.38            -.09             .17           .81       -.8
68           24        20.38            -.10             .22           .86       -.4
13           27        20.45            -.17             .21           .76       -.8
18           63        20.45            -.17             .14           .83       -.9
59           12        20.46            -.18             .34           .65       -.8
73           22        20.49            -.21             .24          1.01        .1
110          53        20.55            -.27             .15          1.01        .0
63           75        20.58            -.30             .13           .83      -1.0
109          32        20.58            -.31             .21          1.14        .6
23           14        20.60            -.32             .30           .93        .0
99           36        20.65            -.38             .18           .91       -.3
8            20        20.66            -.38             .25           .74       -.7
3            64        20.84            -.57             .14           .69      -1.9
Mean         33.8      20.29             .00             .23          1.01        .0
S.D.         20.7        .28             .29             .08           .36       1.2
Separation .68   Reliability .32


In short, the prompts at the intended intermediate-mid level appeared similar in their empirical difficulty, but there were problematic prompts that require further action.

5.6.4 Prompt difficulty at the intermediate-low level

The analysis of prompt difficulty at the intended intermediate-low level was completed with only three prompts (100, 101, and 113) that were used at least ten times during the test administrations. The initial output yielded two disjoint subsets within the data. The disjoint subsets were anchored by group-anchoring the examinee subsets at the means of each subset. The prompts for disjoint subset 1 were anchored at -4.36 logits, and those for disjoint subset 2 were anchored at -1.05 logits, which allowed them to be treated as connected data sets.

Results revealed the three prompts were not consistent in their difficulty because the separation index is 2.58, indicating the prompts were divided into at least two difficulty levels. The reliability index is .87, showing the prompts are reliably separable. In addition, the prompts showed a wide range of difficulty, ranging from -1.38 to 1.52 logits, as shown in the vertical ruler (Figure 5.24).

Figure 5.24 Vertical ruler for prompts at intermediate-low level


Next, as shown in the prompt difficulty table (Table 5.29), the fair average scores for the three prompts range from 19.28 (Prompt 100: the most difficult) to 19.90 (Prompt 101: the easiest), or 190.28 and 190.90 in the original scale after adjusting for rater severity.

With regard to the quality of the prompts at the intermediate-low level, all three prompts fall into the normal range considering their infit mean squares (Prompt 100: 1.29, Prompt 113: .44, and Prompt 101: .43). However, the standard errors for these three prompts are relatively large, which can be explained by the lack of sufficient responses collected for these prompts. Generally, in contrast to the prompts at the three upper ability levels, the prompts at the intermediate-low level showed inconsistency in their observed difficulty.

Table 5.29 Prompt difficulty for intermediate-low level

Prompt ID   Count   Fair Average   Severity Measure   Model S.E.   Infit MnSq   ZStd
100          12        19.28            1.52             .56          1.29        .7
113          18        19.71            -.14             .42           .44      -1.7
101          11        19.90           -1.38             .59           .43      -1.6
Mean         13.7      19.63             .00             .52           .72       -.9
S.D.          3.8        .32            1.46             .09           .49       1.4
Separation 2.58   Reliability .87

In summary, the results of the prompt difficulty analyses overall provided positive evidence to support that the prompts at three of the intended prompt levels—advanced, intermediate-high, and intermediate-mid—were consistent in their respective observed difficulty levels. However, observed prompt difficulty at the intermediate-low level showed inconsistency, due to the small number of responses for these prompts. Therefore, results from the prompt analyses largely supported consistency between the intended prompt difficulty levels and the observed difficulty levels.


5.7 Rater Consistency Within Each Test Administration

The seventh research question investigated the extent to which raters are consistent in terms of severity and use of the rating scales within each test administration. The following data collection procedure was utilized to ensure data connectedness, which is required for FACETS analysis.

During the data collection at each test ADMIN, raters were systematically assigned to work in groups of two or three to assess examinees. This yielded several rater groups examining the same examinee groups. The separate rater groups were connected through repeating raters, who were intentionally allocated to work across the different rater groups. This data collection process resulted in a partially crossed rating design in which examinees were nested in different groups of raters and the groups of raters were partially crossed through the repeating raters.

Next, the six sets of impromptu question tasks and the two sets of role-play tasks used during the data collection were systematically rotated to connect the different prompts. Each prompt set included 15 individual prompts for the impromptu question tasks and 12 individual prompts for the role-play tasks. Since raters assigned different prompts in response to examinees' performances during the OPI, in practice it was challenging to completely control the exact prompts used for each examinee and to connect the individual prompts with each other. Therefore, post-hoc connecting methods were employed when prompt subsets were identified in the FACETS results. In other words, disjoint subsets were connected by anchoring each prompt subset at the average of the corresponding subset. This anchoring procedure was conducted at ADMIN 1, ADMIN 4, and ADMIN 6, where prompt subsets were identified. The data for the remaining test ADMINs were all successfully connected and did not require the post-hoc anchoring method.
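One simple way to verify this kind of connectedness before running FACETS is to treat the ratings as a graph and check that it forms a single connected component. The sketch below is a hypothetical illustration in Python with networkx (the file name and column names are assumptions); the same idea extends to the prompt facet.

    import pandas as pd
    import networkx as nx

    # Hypothetical long-format ratings: one row per rating with rater, examinee,
    # prompt, and score columns.
    ratings = pd.read_csv("ratings_long.csv")

    G = nx.Graph()
    for _, row in ratings.iterrows():
        # Prefix node names so rater and examinee IDs cannot collide.
        G.add_edge(f"rater:{row['rater']}", f"examinee:{row['examinee']}")

    components = list(nx.connected_components(G))
    if len(components) == 1:
        print("Rating design is connected: all measures share one frame of reference.")
    else:
        print(len(components), "disjoint subsets found; group-anchoring or additional "
              "linking ratings would be needed.")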

Based on this rating design and prompt rotation, the data consisted of individual raters' ratings for each prompt assigned to each examinee during the OPI. In the current OPI, for example, each rater produced individual ratings on the four prompts that were provided to an examinee. The individual raters' ratings for each prompt were grouped by administration, and a separate MFRM analysis was completed for each ADMIN. Data included individual raters' ratings for the given prompts: 392 ratings (ADMIN 1), 416 ratings (ADMIN 2), 800 ratings (ADMIN 3), 456 ratings (ADMIN 4), 620 ratings (ADMIN 5), and 528 ratings (ADMIN 6). A three-facet rating scale model was utilized in FACETS, including examinee, rater, and prompt facets. The rater facet was set to be non-centered so that it could float relative to the other facets. Rater severity for each test ADMIN was examined using the separation and reliability indices and the range of the rater measures on the logit scale. Next, consistency in raters' use of the rating scales was investigated based on the infit mean squares for each rater. Results are presented for each test ADMIN in the following sections.

5.7.1 Administration 1

At administration 1 (ADMIN 1), eight raters were divided into groups of two or three, and each assessed five to 17 examinees (36 examinees in total). Two disjoint subsets of the prompt facet were identified in the initial FACETS analysis and were connected by anchoring each subset at the average of the corresponding subset. The average of the prompts in subset 1 is 0.12 logits, whereas that for subset 2 is -0.473 logits; to connect the prompt subsets, the prompts in subset 1 were anchored at 0.12 logits, while the prompts in subset 2 were anchored at -0.473 logits.


Table 5.30 presents rater severity and the raters' use of the rating scales at administration 1. The first column refers to the individual rater ID, ordered by degree of severity from the harshest to the most lenient. The second column, "Examinees," indicates the number of examinees rated by each rater. The third column, "Count," refers to the number of ratings each rater provided. The fourth column refers to the severity measures. The fifth column indicates the standard error (Model S.E.) for each rater. The rightmost columns refer to the infit mean squares and standardized mean squares, which determine the consistency of rating scale usage.

Table 5.30 Rater severity and rating scale use for administration 1

Rater   Examinees   Count   Severity Measure   Model S.E.   Infit MnSq   ZStd
R01         13        52          2.31            .16          1.01        .1
R03          7        28          1.97            .25          1.52       1.7
R04         17        68          1.83            .14           .88       -.6
R07         13        52          1.76            .18          1.34       1.6
R16         12        48          1.63            .18           .59      -2.2
R02         14        56          1.25            .16           .87       -.6
R05         17        68           .99            .14           .88       -.6
R06          5        20           .89            .26           .44      -2.1
Mean         -        49.0        1.58            .18           .94       -.3
S.D.         -        17.2         .49            .05           .35       1.5
Separation 2.40   Reliability .85

To estimate the consistency in rater severity, the separation and reliability indices were examined. The separation index is 2.40 with a reliability index of 0.85, indicating the raters were separated into at least two distinct levels of severity. Second, the wide band of logit values under "Severity Measure" indicates the raters were not similar in their severity during ADMIN 1: the raters' severity estimates range from 0.89 (Rater 6: most lenient) to 2.31 (Rater 1: harshest), a range of about 1.42 logits.


In addition, consistency in using the rating scales was estimated based on the infit statistics. The infit statistics for most raters fell within the acceptable range (0.4-1.5); the one exception was Rater 3, whose infit mean square (1.52) was marginally above the upper bound. This finding means that the raters participating in the ADMIN 1 rating sessions generally used the rating scales consistently to distinguish advanced examinees from lower-level examinees. In short, at ADMIN 1, the raters were not similar in their severity, but, with the exception of one rater, they used the rating scales in a consistent way to assess examinees of different abilities.

5.7.2 Administration 2

For analysis of rater consistency at administration 2 (ADMIN 2), 12 raters worked in groups of two or three, and each evaluated from one to 22 examinees (total of 37 examinees). A successful data connection was achieved in the FACETS analysis because no subsets were observed.

As shown in Table 5.31, rater severity for ADMIN 2 did not vary substantially, although the raters' levels of severity were not equal, considering the separation index of 1.65 with a reliability index of 0.73. Next, the range of severity appears to be wide: the most lenient rater received a measure of 1.07 (Rater 15), while the harshest rater (Rater 5) is at 2.97 logits on the scale, a spread of 1.90 logits.

Table 5.31 Rater severity and rating scale use for administration 2

Rater   Examinees   Count   Severity Measure   Model S.E.   Infit MnSq   ZStd
R05          1         4          2.97            .60           .15      -1.6
R01          8        32          2.33            .19           .83       -.6
R20          2         8          2.27            .38           .32      -1.0
R03         16        64          1.81            .17          1.66       2.7
R08          7        28          1.78            .21           .86       -.4
R06          5        20          1.68            .23           .87       -.1
R17         22        88          1.66            .13          1.26       1.4
R02          7        28          1.60            .19           .75       -.7
R16         13        52          1.45            .13           .58      -2.3
R10          9        36          1.44            .18           .79       -.9
R07          1         4          1.36            .24           .04      -4.2
R15         13        52          1.07            .14           .61      -2.0
Mean         -        34.7        1.79            .23           .73       -.8
S.D.         -        25.6         .51            .14           .45       1.8
Separation 1.65   Reliability .73

For consistency in rating scale use, the infit mean squares for the raters fell into the acceptable range, except for Raters 3, 5, 7, and 20. Rater 3, with an infit mean square of 1.66, was considered inconsistent in rating scale use. On the other hand, Rater 5 (infit mean square: 0.15), Rater 7 (0.04), and Rater 20 (0.32) were raters who used a narrow range of the rating scales. In particular, the standard error for Rater 5 is the greatest (.60), and those for Rater 7 and Rater 20 are relatively high. A noticeable fact is that these three raters evaluated only one or two examinees, which contributes to their infit mean squares and high standard errors. In short, Rater 3 appeared to use the rating scales inconsistently compared to the other raters, which calls for additional investigation of Rater 3's rating patterns and the provision of rater training to this rater.

5.7.3 Administration 3

During administration 3 (ADMIN 3), 14 raters were divided into groups of two or three, and each rated 3 to 33 examinees (total of 68 examinees). No disjoint subsets were identified in the FACETS analysis. Table 5.32 presents rater severity and rating scale use for ADMIN 3.

Results showed inconsistency in rater severity because the separation index is 3.77 with a reliability index of 0.93, indicating the raters are highly separated into approximately four distinct levels. In addition, the spread between the most lenient rater (Rater 6) and the harshest rater (Rater 16) is from about -0.53 logits to 2.77 logits (about a 3.30 logit spread), indicating the raters were inconsistent in their severity at ADMIN 3.

To examine the raters’ consistent usage of the rating scales, the infit mean squares for each rater were examined. The infit mean squares for the raters fell into the normal range, except Rater 5. Rater 5 with an infit mean square of 0.26 appeared to use a middle part of the scale. However, considering the fewer examinees evaluated by Rater 5 and its standard error

(.31), it can be concluded that most raters working at ADMIN 3 showed appropriate use of rating scales.

Table 5.32 Rater severity and rating scale use for administration 3

Rater   Examinees   Count   Measure   Model S.E.   Infit MnSq   ZStd
R16          6        24      2.77        .23          .78       -.6
R03         18        72      2.37        .16         1.49       2.5
R01         19        76      2.04        .13          .75      -1.5
R10          5        20      1.72        .25          .89       -.2
R08         16        64      1.70        .14          .62      -2.4
R17         33       132      1.62        .11         1.12        .9
R20         31       124      1.42        .11         1.27       2.0
R05          4        16      1.42        .31          .26      -2.6
R21         19        76      1.37        .13          .78      -1.4
R14          4        16       .95        .33         1.53       1.4
R15          7        28       .50        .22          .72      -1.1
R02         29       116       .45        .11          .62      -3.1
R18          6        24       .18        .24          .80       -.5
R06          3        12      -.53        .38          .47      -1.2
Mean         -        57.1    1.28        .20          .86       -.6
S.D.         -        43.3     .90        .09          .37       1.7
Separation 3.92   Reliability .94

5.7.4 Administration 4

Eight raters were divided into groups of two or three to examine a total of 36 examinees; each rater assessed six to 25 examinees at ADMIN 4. By anchoring each subset at its own average logit value (subset 1: -0.4 logits, subset 2: 0.23 logits), two disjoint subsets of the prompt facet were connected. Specifically, prompts in subset 1 were anchored at -0.4, while prompts in subset 2 were anchored at 0.23.

The raters working at ADMIN 4 were highly divided into at least three distinct levels because the separation index is 3.12 and the reliability index is 0.91, as shown in Table 5.33. Furthermore, the range of rater severity supported inconsistency in rater severity: the measures range from -0.96 (Rater 2: most lenient) to 0.66 (Rater 10: harshest), a span of 1.62 logits.

With regard to consistency in rating scale use, all raters at ADMIN 4 used the rating scale adequately in terms of distinguishing advanced examinees from lower-level examinees. The infit mean squares for all raters fell within the acceptable range of 0.4 to 1.5. This also indicates the scores assigned by all raters were reliable in estimating examinees' abilities.


Table 5.33 Rater severity and rating scale use for administration 4

Rater   Examinees   Count   Measure   Model S.E.   Infit MnSq   ZStd
R10         13        52       .66        .18         1.29       1.4
R03         18        72       .64        .15         1.22       1.2
R01         11        44       .13        .18          .54      -2.4
R20         19        76      -.05        .13          .87       -.7
R17         25       100      -.21        .12          .94       -.3
R06          6        24      -.42        .26         1.12        .5
R15         11        44      -.70        .18          .65      -1.7
R02         11        44      -.96        .19          .91       -.3
Mean         -        57.0    -.11        .17          .94       -.3
S.D.         -        24.1     .59        .04          .26       1.4
Separation 3.12   Reliability .91

5.7.5 Administration 5

For administration 5 (ADMIN 5), 11 raters working in groups of two or three rated a total of 52 examinees, each assessing 1 to 20 examinees. No disjoint subsets were detected in the FACETS analysis.

With regard to the consistency among raters' ratings, the separation index was 3.02, indicating that the raters were divided into approximately three different levels, as shown in Table 5.34. The reliability index was quite high, 0.90, suggesting that we can be very confident the raters differed in severity. Moreover, the range of the rater measures was 2.47 logits, from -0.38 (Rater 5: most lenient) to 2.09 (Rater 14: harshest). This wide range of raters on the logit scale indicates the raters participating in ADMIN 5 showed different levels of severity.


Table 5.34 Rater severity and rating scale use for administration 5

Rater   Examinees   Count   Measure   Model S.E.   Infit MnSq   ZStd
R14         15        60      2.09        .15          .70      -1.7
R20         11        44      1.90        .22         1.45       1.8
R03         19        76      1.85        .18         1.74       3.6
R18         19        76      1.84        .13          .95       -.2
R07         15        60      1.75        .16          .41      -4.0
R08         20        80      1.62        .13          .97       -.1
R01         11        44      1.61        .20         1.19        .9
R17         17        68      1.51        .15          .76      -1.4
R16         10        40      1.17        .21          .30      -3.9
R02         17        68      -.11        .14          .73      -1.4
R05          1         4      -.38        .67          .38       -.4
Mean         -        56.4    1.35       1.35          .87       -.7
S.D.         -        22.2     .83        .83          .45       2.3
Separation 3.02   Reliability .90

For use of the rating scales, seven raters showed appropriate use of the rating scales considering the normal range of their infit mean squares, as shown in Table 5.34 above. However, there were three problematic raters: Rater 3, Rater 16, and Rater 5. Rater 3, with an infit mean square of 1.74, used the rating scales inconsistently. Rater 16, with an infit index of 0.30, and Rater 5, with an infit index of 0.38, used a restricted range of the rating scales. However, Rater 5 assessed only one examinee, and this rater's standard error is the greatest (.67). In short, during ADMIN 5, Rater 3 and Rater 16 appeared to use the rating scales inappropriately.

5.7.6 Administration 6

For administration 6 (ADMIN 6), seven raters were divided into groups of two or three, and each rater rated one to 49 examinees (total of 50 examinees). By anchoring each subset at its group mean logit value, two disjoint subsets of the prompt facet were connected: subset 1 was anchored at -0.66, and subset 2 was anchored at 0.57. The rater facet was not centered, whereas the examinee facet and the prompt facet were set to 0 as defaults in FACETS.

As shown in Table 5.35, the raters participating in ADMIN 6 were highly separated into approximately three distinct levels because the separation index was 2.98 and the reliability index was 0.90. For the range of severity, the results showed a wide span (2.26 logits), from -0.39 logits (Rater 7: most lenient) to 1.87 logits (Rater 3: harshest).

Table 5.35 Rater severity and rating scale use for administration 6

Rater   Examinees   Count   Measure   Model S.E.   Infit MnSq   ZStd
R03         19        76      1.87        .23         1.05        .3
R05          1         4      1.09        .16          .35       -.9
R01         12        48       .56        .13          .76      -1.1
R17         49       196       .47        .25          .89      -1.1
R16         16        64       .09        .14          .53      -3.0
R02         27       108      -.35        .11          .93       -.4
R07          8        32      -.39        .11         1.18        .7
Mean         -        75.4     .48        .20          .81       -.8
S.D.         -        62.5     .81        .17          .29       1.2
Separation 2.98   Reliability .90

To identify consistency in the raters' rating scale use, infit mean squares were examined, showing that most of the raters fell within the normal range, except for Rater 5. Rater 5 appeared to use a restricted range of the scales to assess examinees, based on an infit mean square of .35. However, considering the number of examinees rated by Rater 5, it is not surprising that this rater used a narrow range of the scales. In short, the raters differed in their degree of severity, and most raters used the rating scales appropriately during ADMIN 6.


In general, the raters exercised different levels of severity in rating the examinees throughout the six administrations. This is because the levels of rater severity were separated into at least two levels, based on the separation and reliability indices, in five administrations (ADMIN 1, 3, 4, 5, and 6). For ADMIN 2, moreover, it is difficult to claim consistency in rater severity: raters' severity did not differ considerably, but individual differences in severity did exist, considering the separation index (1.65).

When it comes to raters’ use of the rating scales, most raters exhibited an adequate, consistent use of rating scales to distinguish examinees of different ability levels within each

ADMIN, except Rater 3 and Rater 16. Rater 3 was flagged as a problematic rater who measured a different construct during ADMIN 2 and ADMIN 5. Rater 16 turned out to use a limited range of rating scales to assess examinees during ADMIN 5. This result suggests that

Rater 3’s rating practices should be consistently monitored and additional rating trainings should be provided to this rater.

To conclude, the findings provided both positive and negative evidence of consistency among raters within each test administration: the raters showed inconsistency in severity but consistent use of the rating scales within each administration.


CHAPTER 6

DISCUSSION AND CONCLUSIONS

The empirical findings in response to the seven research questions became the backing to support the assumptions underlying the evaluation and generalization inferences in the interpretive argument of this study. In this chapter, the findings are summarized and integrated to show how they were used to formulate the validity argument for the interpretation and use of the OPI scores with R-Plat integrated into the rating procedure. The remaining sections of this chapter present the limitations of the study, suggestions for future research, and implications.

6.1 Validity Argument for OPI Scores with R-Plat Web-based Rating System

In this dissertation, the complete interpretive argument outlined multiple types of backing that will be needed for its seven inferences, namely, domain description, evaluation, generalization, explanation, extrapolation, utilization, and impact. Beginning this process, empirical research conducted for this dissertation addressed only the evaluation and generalization inferences to investigate the extent to which support could be found for these two foundational parts of the argument for uses and interpretation of the OPI scores.

The following sections present how the findings serve as backing for the validity argument, focusing on the evaluation and generalization inferences. Findings from the seven research questions are integrated and reported to support each assumption underlying the warrants for these two inferences.


6.1.1 Evaluation inference

The evaluation inference links observed performance to observed scores on the OPI.

As illustrated in Figure 6.1, observations of performance serve as the grounds for this step in the validity argument. They are linked to an intermediate conclusion, which states that the observed scores reflect relevant aspects of the examinee's observed performance. The warrant needed to support the inference is that observed performance on the OPI prompts collected via R-Plat is evaluated to provide observed scores and observed performance descriptors reflective of targeted speaking ability. Figure 6.1 depicts three assumptions underlying the warrant for the evaluation inference and backing to support each assumption.

Figure 6.1 Evaluation inference with three assumptions and backing


The first assumption is that rating procedures in support of R-Plat are appropriate for raters to assess targeted speaking abilities. Backing for this assumption is that both experienced and new raters revealed positive perceptions and experiences in using R-Plat for rating purposes. This backing came from raters' responses to the questionnaire and the individual interviews/focus groups. Findings showed that both the six experienced and the eight new raters mostly held positive views of R-Plat in terms of (a) clarity, (b) comfort, (c) effectiveness, and (d) satisfaction with R-Plat: the mean scores for their responses were above 4 for all six-point scale statements in the questionnaire, with 6 being the most positive. Next, the independent t-tests and ANOVA showed no significant differences between the two rater groups with regard to their perceptions of the four aspects, indicating that both groups held positive views. Furthermore, the raters' written responses and verbal reports revealed their successful rating practices with R-Plat. It should be noted that they also mentioned some limitations and challenges in using R-Plat for rating purposes, which call for future improvement of R-Plat. Despite these limitations, the raters generally shared positive opinions and experiences with R-Plat, supporting the appropriateness of using R-Plat for the rating purpose.
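As a concrete illustration of the group comparison summarized above, the sketch below runs an independent-samples t-test on per-rater questionnaire means for the two groups. The numbers and group sizes are hypothetical, invented only to show the form of the analysis; they are not the study's data, and the study's exact test settings are not assumed.

# Hypothetical sketch of the experienced-vs-new rater comparison on one
# six-point questionnaire aspect (values invented for illustration).
from scipy import stats

experienced = [4.8, 5.2, 4.5, 5.0, 4.7, 5.1]          # per-rater mean ratings, 1-6 scale
new_raters = [4.6, 5.0, 4.9, 4.4, 5.3, 4.8, 5.1, 4.7]

t_stat, p_value = stats.ttest_ind(experienced, new_raters)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A p-value above .05 would mirror the reported non-significant group difference.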

The second assumption is that test administration conditions in support of R-Plat are appropriate for providing evidence of targeted speaking abilities. It was supported by backing showing that the diagnostic descriptor markings and the raters' comments were indicative of different ability levels. This backing was determined from investigations of the thirty diagnostic descriptor markings and of the raters' comments on examinees' speaking performances. Regarding the diagnostic descriptor markings, the expected patterns in the visual plots first supported the relationships between the choices of scale points and the proficiency level ratings: the proportion of high-level diagnostic descriptor markings was consistently higher at the advanced proficiency level ratings, whereas low-level diagnostic descriptor markings were higher at the intermediate and intermediate-high proficiency levels. Second, all chi-square tests exhibited statistically significant differences, suggesting a significant relationship between the diagnostic descriptor markings and the proficiency level ratings. Third, the same results were observed for each of the seven diagnostic descriptor categories: comprehensibility, pronunciation, fluency, vocabulary, grammar, pragmatics, and listening. Finally, the raters' responses to the questionnaire revealed that they marked the diagnostic descriptors to characterize an examinee's speaking performance, focusing on the individual examinee's strengths and weaknesses.
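For readers unfamiliar with the chi-square analyses referred to above, the short sketch below shows the general form of such a test on a descriptor-by-proficiency contingency table. The counts are hypothetical and invented purely for illustration; they are not the study's data.

# Hypothetical chi-square test of association between descriptor level and
# proficiency level rating (counts invented for illustration).
from scipy.stats import chi2_contingency

# Rows: low-level vs. high-level descriptor markings.
# Columns: intermediate-mid, intermediate-high, advanced ratings.
table = [
    [120, 85, 20],
    [30, 60, 140],
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.4f}")
# A small p-value indicates the descriptor markings are not independent of
# the proficiency level ratings, as the study reported.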

The second assumption was further supported by findings of significant relationships between the evaluative units extracted from raters' comments and the three proficiency level ratings. First, clear patterns existed for the positive and negative evaluative units depending on proficiency level: the percentages of positive evaluative units were greatest at the advanced level ratings, whereas the percentages of negative evaluative units were highest at the intermediate-mid level, followed by the intermediate-high level ratings. Second, all chi-square tests exhibited statistically significant differences, strongly supporting the associations between positive and negative units and the proficiency level ratings. Third, the same patterns were observed in comparisons of positive and negative evaluative units with proficiency levels for the six scoring criteria: functional competency, comprehensibility, pronunciation, fluency, vocabulary, and grammar. Finally, analysis of the raters' actual comments showed that the comments for each criterion reflected the typical characteristics of examinees' speaking abilities at different proficiency levels.

The third assumption is that examinees' performance on the OPI is evaluated adequately in such a way that yields observed scores reflective of speaking ability level. It was supported by findings showing that the OPI scores were spread across different proficiency levels. Backing was sought from the investigation of the OPI scores for 279 examinees collected at six ADMINS. First, the OPI scores ranged from the intermediate-mid to the advanced level from ADMIN 1 through ADMIN 5. The histograms of the OPI scores at most ADMINS showed bell-shaped curves, supporting a normal distribution of the OPI scores across ADMINS. However, these scores did not represent entirely parallel test forms across the six ADMINS because of dissimilar standard deviations and the peaked shape of the OPI score distribution for ADMIN 6. Results indicated that there were differences in the test forms used for each ADMIN, but in general the OPI scores were distributed across different proficiency levels throughout the six ADMINS. The overall findings supported the conclusion that the OPI scores separated examinees into different proficiency levels. These findings point to further investigation with an approximately equal number of examinees participating in each ADMIN.
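The comparison of score distributions across administrations described above amounts to inspecting per-ADMIN descriptive statistics and histograms. A minimal sketch of that step follows; the scores are invented stand-ins on the 0-300 band, not the study's data.

# Hypothetical per-administration descriptive check of OPI score distributions
# (scores invented for illustration; the OPI reports scores on a 0-300 band).
import statistics

scores_by_admin = {
    "ADMIN 1": [190, 210, 230, 250, 270, 220, 240],
    "ADMIN 6": [230, 235, 240, 240, 245, 250, 240],   # narrower, more peaked spread
}
for admin, scores in scores_by_admin.items():
    print(admin, "mean =", round(statistics.mean(scores), 1),
          "SD =", round(statistics.stdev(scores), 1))
# Dissimilar standard deviations across administrations, as found for ADMIN 6,
# suggest the forms were not strictly parallel.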

6.1.2 Generalization inference

The next inference in the validity argument is generalization. Observed scores, the grounds for this step, are connected with the intermediate conclusion stating that expected scores reflect what observed scores would be across parallel tasks and within and across raters. The warrant is that observed scores recorded in R-Plat are dependable estimates of expected scores over the relevant parallel versions of prompts and are consistent within intended prompt levels and across/within raters. Figure 6.2 shows the three assumptions underlying the warrant for the generalization inference and the backing obtained from this research.

Figure 6.2 Generalization inference with three assumptions and backing


The first assumption is that a test reliably distinguishes examinees' different speaking proficiency levels. Backing for this assumption was that the OPI ratings dependably separated examinees into different speaking levels. The backing was generated from the MFRM analysis of the 803 individual raters' ratings. The results showed that the examinee facet spread from -16.21 to 9.76 logits, a range of 25.97 logits, suggesting the examinees were distributed from high to low ability levels. This assumption was further supported by the separation index (6.53) and the reliability index (0.98), meaning the individual ratings dependably separated examinees into at least six ability levels. This result suggests a potential modification of the OPI rating scale. Currently, the OPI rating scale is divided into four score bands: the advanced, intermediate-high, intermediate-mid, and intermediate-low levels. The intermediate-mid level is further divided into upper and lower score bands, but this distinction does not affect the final placement of examinees. Future research should be conducted to identify the appropriate number of scale categories and to examine whether the modified scale can adequately distinguish examinees' different speaking ability levels.

The second assumption is that examinees' proficiency is evaluated consistently across prompts at intended prompt levels. The assumption was partially supported because observed prompt difficulty was largely consistent with intended prompt levels, but problematic prompts were identified at some intended prompt levels.

For consistency between observed prompt difficulty and intended prompt levels, observed prompt difficulty matched the intended prompt levels well for the three upper intended prompt levels; the exception was the intermediate-low intended level, which included only three prompts. Backing was obtained from MFRM analyses of the individual ratings for 73 prompts at the different intended levels. The separation indices were below 1.17 and the reliability indices were quite low for the three upper prompt levels. The prompt facets at these three levels were also clustered around 0 logits on the vertical rulers, supporting the same conclusion about prompt consistency, and the ranges of fair averages for the prompts at these levels were narrow. However, the three prompts at the intermediate-low level exhibited inconsistency, given the high separation index (2.58) and the wide range of the prompt logits on the vertical ruler. Considering the very small number of prompts at this level and the small number of responses to them, it would be premature to draw conclusions about the inconsistent prompts at this level. Including more prompts with more respondents should enhance the precision of this estimation.

With regard to the problematic prompts that behaved differently from the others, only one prompt each was discovered for the advanced and the intermediate-high levels. These results may be attributable to instability in the data for these prompts because of their lower numbers of responses. Three prompts at the intermediate-mid prompt level and two prompts at the intermediate-low prompt level were also flagged; however, they received numbers of responses equivalent to those of the other prompts at the same difficulty levels. These results point to a need to collect more responses for the advanced and intermediate-high level prompts and to revise the flagged prompts at the intermediate-mid and intermediate-low prompt levels.

With regard to rater severity, raters exhibited different severity levels within each ADMIN. Based on the separation and reliability indices, raters were separable into at least two severity levels for most ADMINS. Although the raters at ADMIN 2 were less separable than those at the other ADMINS given its separation index (1.65), the results still suggest individual differences in severity levels at ADMIN 2.

For raters' rating scale use, most raters participating in each ADMIN used the rating scales appropriately to distinguish examinees' different ability levels, given infit statistics falling within the acceptable range (0.4–1.5). The results did show some problematic raters who did not use the rating scales adequately; however, these raters had fewer responses, which might weaken the precision of the estimation. Considering the contrasting results for rater severity and rating scale use, this study concludes that the assumption was supported by some of the garnered evidence but not by other evidence.
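The infit statistic used throughout this chapter is the information-weighted mean-square residual reported by FACETS. The following is a minimal sketch of how it is computed, assuming that observed ratings together with the model-expected ratings and their variances are already available; all values below are hypothetical.

# Sketch of the infit mean-square statistic used to flag raters.
# Infit = sum((x - E)^2) / sum(W), where x is the observed rating, E the
# model-expected rating, and W the model variance of the observation.

def infit_mean_square(observed, expected, variance):
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    return sum(squared_residuals) / sum(variance)

# Hypothetical ratings by one rater, with expectations and variances from a Rasch model:
observed = [4, 2, 4, 3, 1]
expected = [2.9, 2.6, 3.3, 3.2, 2.0]
variance = [0.9, 0.8, 0.9, 0.8, 0.7]

print(round(infit_mean_square(observed, expected, variance), 2))  # about 0.76
# Values near 1.0 indicate the expected amount of variation; this study treated
# 0.4-1.5 as the acceptable range.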

The results about the consistency of raters' rating behaviors suggest implications for future rater training and for the analysis of scores using FACETS for decision-making. Given that FACETS offers rich information about rater behaviors (e.g., severity and rating scale use), OPI test administrators can provide enhanced rater training at multiple stages for different rater groups. On the one hand, new raters' rating patterns should be investigated using FACETS at the new-rater training stage; once their rating patterns are comparable with those of certified raters, they can take part in official rating sessions. On the other hand, experienced raters' rating behaviors should be monitored on a regular basis. For example, in this study, Rater 3 was relatively harsh across the six test ADMINS and appeared to measure different constructs during ADMIN 2 and ADMIN 5. Considering that this rater had more than 10 years of rating experience, test administrators should provide Rater 3 with more targeted training based on detailed information about the rating patterns. The necessity of training or retraining, especially for problematic raters, was also recommended in earlier literature (e.g., Lunz, Wright, & Linacre, 1990; Stahl & Lunz, 1991). For other experienced raters, individualized feedback on their rating behaviors can help them adopt standardized scoring criteria for the OPI test. For example, research on the influence of individualized feedback on raters' rating behaviors revealed that such feedback, based on results from FACETS, enhanced raters' awareness of their rating behaviors and the consistency of their patterns (Elder, Knoch, Barkhuizen, & Randow, 2005). Likewise, OPI rater training can be further improved with the support of the rich information from FACETS.

Moreover, OPI administrators might consider screening prospective ITAs based on results from FACETS. The OPI administrators currently rely largely on raters' agreement on the original observed scores to make a final decision. However, this decision-making process does not take into account possible variation among raters and its impact on test scores. The approach may not be suitable because an examinee's ability can be either underestimated or overestimated depending upon the degree of rater severity. FACETS, in contrast, generates not only observed scores but also fair scores, which adjust the original observed scores for rater severity. Therefore, using fair scores from FACETS may provide the OPI test administrators with a more precise estimate of an examinee's speaking ability.
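To make the notion of a fair score concrete, the sketch below computes an expected rating under a many-facet rating scale model twice: once with a harsh rater's severity included and once with severity set to the facet mean of 0, which is the logic behind a fair average. All parameter values are hypothetical, and the code is an illustration of the adjustment rather than the FACETS implementation itself.

import math

def expected_rating(ability, severity, difficulty, thresholds):
    # Rating-scale MFRM: the cumulative logit of category k is
    # the sum over j <= k of (ability - severity - difficulty - tau_j).
    logits = [0.0]          # category 0 as the reference
    cumulative = 0.0
    for tau in thresholds:
        cumulative += ability - severity - difficulty - tau
        logits.append(cumulative)
    denominator = sum(math.exp(l) for l in logits)
    probabilities = [math.exp(l) / denominator for l in logits]
    return sum(k * p for k, p in enumerate(probabilities))

# Hypothetical values: an examinee rated by a harsh rater (+1.5 logits severity).
ability, difficulty = 1.0, 0.0
thresholds = [-2.0, -0.5, 0.5, 2.0]     # five-category scale (0-4), invented

observed = expected_rating(ability, 1.5, difficulty, thresholds)
fair = expected_rating(ability, 0.0, difficulty, thresholds)   # severity at the facet mean
print(round(observed, 2), round(fair, 2))
# The fair value is higher, illustrating how adjusting for severity protects
# examinees from being underestimated by harsh raters.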

Table 6.1 summarizes the inferences, warrants, assumptions, and backing for the validity argument for the interpretation and use of the OPI scores assigned by raters using R-Plat during the rating process. For each assumption, the table also indicates whether the backing evidence supported it.


Table 6.1. Validity argument for the OPI scores assigned by raters using R-Plat

Evaluation inference
Warrant: Observed performance on the OPI tasks collected via R-Plat is evaluated to provide observed scores and observed performance descriptors reflective of targeted speaking ability.

Assumption 1: Rating procedures in support of R-Plat are appropriate for raters to assess targeted speaking abilities.
Backing: Raters' responses to the questionnaire and the individual interviews/focus groups showed their positive attitudes towards the clarity of R-Plat, the effectiveness of R-Plat, their comfort level with using R-Plat, and their satisfaction with R-Plat.
Supported: Yes

Assumption 2: Test administration conditions in support of R-Plat are appropriate for providing evidence of targeted speaking abilities.
Backing: Diagnostic descriptors and raters' comments were indicative of different ability levels.
Diagnostic descriptors reflected different ability levels because: (a) high-level diagnostic descriptor markings were consistently higher at the advanced level ratings, whereas low-level diagnostic descriptor markings were higher at the lower level ratings; (b) chi-square tests (p = .00) revealed significant relationships between the diagnostic descriptor markings and the three proficiency level ratings; (c) the same findings were found in the analysis of the seven categories of diagnostic descriptors; and (d) in the questionnaire, raters reported that they used the diagnostic descriptors to indicate different speaking ability levels, focusing on individual examinees' strengths and weaknesses.
Raters' comments reflected different ability levels because: (a) the percentages of positive evaluative units were the greatest at the advanced level ratings, whereas those of the negative evaluative units were the highest at the intermediate-mid level, followed by the intermediate-high level ratings; (b) chi-square tests (p = .00) revealed significant relationships between positive/negative evaluative units and the three proficiency level ratings; (c) the same patterns of positive/negative evaluative units and chi-square results were found for the six OPI scoring criteria; and (d) the raters' comments for each scoring criterion reflected typical characteristics of examinees' speaking ability at different proficiency levels.
Supported: Yes

Assumption 3: Examinees' performance on the OPI is evaluated adequately in such a way that yields observed scores reflective of speaking ability level.
Backing: OPI scores were distributed across different proficiency levels throughout the six ADMINS, although there were some differences in the test forms used for each ADMIN. Further research is needed with an approximately equal number of examinees participating at each test ADMIN.
Supported: Yes

Generalization inference
Warrant: Observed scores recorded in R-Plat are dependable estimates of expected scores over the relevant parallel versions of prompts, are consistent within intended prompt levels, and are consistent across/within raters.

Assumption 1: A test reliably distinguishes examinees' different speaking proficiency levels.
Backing: The test scores reliably separated examinees into six ability levels.
Supported: Yes

Assumption 2: Examinees' proficiency is evaluated consistently across prompts at intended prompt levels.
Backing: Observed prompt difficulty was largely consistent with intended prompt levels, but problematic prompts were found at some intended prompt levels.
1. Observed prompt difficulty matched the intended prompt levels well because (a) low separation (below 1.17) and low reliability indices were observed at the three upper prompt levels, (b) the prompt facets were clustered around 0 logits on the vertical rulers, (c) the ranges of fair averages of the prompts at the three upper prompt levels were narrow, and (d) high separation and wide ranges of prompt facets and fair averages were found at the intermediate-low level, which calls for further investigation with more respondents.
2. Problematic prompts were observed at some intended prompt levels. Future research is thus needed with more respondents for the problematic prompts at the advanced and intermediate-high levels; the problematic prompts observed at the intermediate-mid and intermediate-low levels need further revision and investigation.
Supported: Yes/No

Assumption 3: Examinees' proficiency is evaluated consistently within/across raters.
Backing: Raters exhibited inconsistency in severity but used the rating scales adequately to distinguish different proficiency levels.
1. Raters' severity was separated into at least two levels at five ADMINS. Raters at ADMIN 2 showed individual differences in severity given its separation index (1.65), which requires further investigation with an approximately equal number of raters assessing an equal number of examinees.
2. Most raters at each ADMIN showed adequate use of the rating scales within each ADMIN, as the raters' infit statistics fell within the acceptable range (0.4–1.5).
Supported: Yes/No


6.2 Conclusions

Overall, the three assumptions for the evaluation inference were supported by the empirical results. For the generalization inference, the first assumption was supported, while the second and third were only partially supported. This section addresses the limitations of this study as well as suggestions for future research and implications.

6.2.1 Limitations and recommendations for future research

Although the results of this dissertation add valuable empirical knowledge to the field of language testing, the study had a number of limitations that can be addressed by future research. The first limitation lies in the small sample size of the rater participants. Only eight experienced and six new raters' opinions were collected through the questionnaire and interviews during ADMIN 1; these were the only raters available to participate during the data collection period. Raters' opinions about R-Plat could be further explored in a longitudinal study that gauges raters' perceptions over time. Findings from such research could provide a stronger justification for the benefits of integrating R-Plat into the rating procedure and for continuing to improve the R-Plat features.

The second limitation is the restricted types of instruments employed to investigate raters' perceptions of R-Plat. To collect stronger validity evidence associated with raters' use of R-Plat, future research should go beyond questionnaire and interview data. For instance, raters' rating behaviors with R-Plat could be recorded using screen-capturing software such as Camtasia to capture how they utilize the various features of R-Plat. Used for stimulated recalls, the screen captures would allow for scrutinizing raters' concurrent thinking during the rating process with R-Plat, although it should be noted that during stimulated recalls participants may fail to report something important or may not accurately remember some details (Aranyi, Schaik, & Barker, 2012). Furthermore, eye-tracking technology could be utilized to capture raters' interactions with R-Plat because it would allow researchers to examine raters' eye movements while they look at the computer screen (Bax, 2013). Rayner (1998) asserted, "eye movement data reflect moment-to-moment cognitive processes" (p. 372). Despite the high cost of eye-tracking equipment and the complexity of coding the data, analysis of raters' eye movements in a future study could explain not only raters' physical behaviors but also the underlying cognitive processes relevant to their use of R-Plat. Moreover, heat maps generated from an eye-tracking system can display a series of fixations by multiple users on the same page (Tullis & Albert, 2008). Consequently, heat maps would show which parts of R-Plat raters gaze at most often at a given point in the rating procedure when they evaluate examinees at different proficiency levels.
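As a small illustration of the kind of heat map described above, fixation coordinates can be binned into a two-dimensional histogram. The fixation data below are randomly generated stand-ins, not eye-tracking output, and the screen dimensions and clustering are assumed only for the sake of the example.

# Hypothetical fixation heat map for one R-Plat page (fixation data invented).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(640, 120, 500)   # fixation x-coordinates in pixels, clustered near a comment box
y = rng.normal(400, 80, 500)    # fixation y-coordinates in pixels

plt.hist2d(x, y, bins=40, cmap="hot")   # 2-D histogram of fixations = simple heat map
plt.gca().invert_yaxis()                # screen coordinates run top to bottom
plt.colorbar(label="fixation count")
plt.title("Hypothetical fixation heat map for an R-Plat page")
plt.show()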

The third limitation is the small number of prompts used for the analysis of prompt difficulty at the intermediate-low intended difficulty level. At this intended level, only three prompts were available for investigating observed difficulty, and the results revealed that the difficulties of these three prompts were not equal. However, a small number of prompts increases estimation error and may therefore reduce the precision of the prompt difficulty estimates. In future research, more prompts at the intermediate-low difficulty level are required to draw a firm conclusion about the consistency of the prompts at this difficulty level.

In addition to the future research associated with these limitations, future research can extend the investigation of raters' rating behaviors. Considering the scope of the current study, which focused on raters' overall rating behaviors for each test administration, future research could examine how raters' ratings change over time. Ongoing investigation of raters' rating patterns on a regular basis would inform both raters and test administrators about raters' rating practices. This would also allow the OPI test administrators to provide more immediate and individualized training for raters who rate markedly differently from other raters.

Another type of new research could examine raters’ rating consistency by comparing new and experienced raters. The current study examined raters’ severity and rating scale use regardless of the length of their rating experience. In this sense, future research could concentrate on comparing new raters’ rating behaviors to experienced raters’ rating behaviors

(e.g., Weigle, 1998; Lim, 2011). Findings from this research would also present implications, especially for the training of new raters.

6.2.2 Implications

This dissertation has practical, empirical, and theoretical implications.

Practical implications

Practical implications are associated with the improvement of OPI test delivery through R-Plat in the targeted ITA speaking assessment context, because R-Plat was found to facilitate the rating procedure. Findings about problematic prompts at the intermediate-mid difficulty level provide useful information for future modification of prompts. In addition, given that this study identified problematic raters, the test administrators will be able to address particular raters' rating behaviors and provide more tailored rater training. R-Plat itself can be further developed to integrate new features in response to raters' concerns, such as the lack of an affordance for typing phonetic symbols in comment boxes. Moreover, an improved version of R-Plat could provide both raters and test administrators with individual reports on raters' rating behaviors in terms of severity and rating scale use. This new feature may contribute to improved rating patterns because the reports delivered through R-Plat may help raters adopt a shared understanding of adequate rating practices for the OPI test.

Empirical implications

First and foremost, this research introduces an innovative application of a web-based rating system to assess speaking ability. In contrast to the emerging technologies designed for the rating process in writing assessments (e.g., automated writing evaluation systems), only a few attempts have been made to explore a computer-supported rating system in an interview-format speaking assessment. Although the current study was restricted to a local context, positive evidence was found to support the usefulness of R-Plat for the rating procedure. Therefore, findings from this study suggest the potential of a web-based rating system for interview-format speaking assessments in other contexts, such as institutional speaking assessments or a high-stakes speaking test like the International English Language Testing System (IELTS). However, it should be noted that R-Plat was devised specifically for the OPI test format in this institution, in terms of the specifications and the number of prompts assigned to each examinee as well as the scoring bands (ranging from 0 to 300). To use R-Plat in other contexts, its features would need to be customized in view of other test formats, purposes, and target examinees.

Another implication of this study for research is that it reemphasized the significance of investigating prompt (task) and rater variability in speaking assessments. First, the findings about prompt variability provided positive evidence to support the quality of pre-determined prompt levels. These findings indicated a successful match between observed prompt difficulty and intended difficulty levels, suggesting that prompts at the same intended level are well developed in such a way that they elicit similar functions, linguistic features, and communication strategies from examinees tested at the same prompt levels. This finding also demonstrates the importance of prompt quality checks, especially when prompt difficulties are pre-determined in the test specifications, as in adaptive language testing. As for the findings about rater variability, they are in line with previous research. Most studies on rater severity revealed that severity often changes based upon interactions with other variables such as raters' L1 background (e.g., Carey, Mannell, & Dunn, 2010; Winke, Gass, & Myford, 2013; X. Yan, 2014), rating experience (e.g., Weigle, 1998; Lim, 2011), and time (e.g., Lunz & Stahl, 1990; Bonk & Ockey, 2003). Early research on rating scale use, in turn, showed that inconsistency in this respect depends on raters' rating experience (e.g., Cumming, 1990; Barkaoui, 2010) and examinees' language ability (e.g., Barkaoui, 2010; Meiron, 1998; Pollitt & Murray, 1996). The OPI raters showed inconsistency in their severity at each administration, but they used the scales adequately, which reinforced the interpretation of the OPI rating results. These findings broaden the scope of empirical studies of rater and prompt variability in various speaking assessment contexts, especially when technology is integrated into the testing process.

Theoretical implications

This dissertation expands the scope of the argument-based validation framework. An increasing number of studies have adopted the argument-based validation approach to support scores derived from diverse computer-assisted language assessments (e.g., Jun, 2014; Chung, 2015; Chapelle, Cotos, & Lee, 2015). However, existing studies have primarily focused on the validation of writing or grammar assessments rather than speaking assessments. The current study demonstrates how an argument-based approach was adopted to collect validity evidence supporting the inferences underlying the use of scores from a speaking assessment in which a computer was used for rating. In particular, it investigated backing for the evaluation and generalization inferences by taking into account rater and task variability in computer-based rating of speaking ability. Therefore, this dissertation extends the application of the argument-based approach for computer-assisted language assessments to under-investigated contexts.

To conclude, this dissertation makes a meaningful contribution to the field of language assessment by demonstrating how an argument-based approach can be applied to the validation of an Oral Proficiency Interview test. Based on the interpretive argument for the OPI scores, multiple types of validity evidence were collected. The latter part of this dissertation demonstrates how the collected backing was synthesized to support the assumptions underlying the evaluation and generalization inferences. Moreover, this study introduces an example of an innovative integration of computer technology into the rating process of a speaking assessment.


REFERENCES

Academic Communication Program. (2014). Oral English Certification Test (OECT) Rater Manual. Ames, IA: Iowa State University.

American College Testing (ACT). (2014). Compass ESL Placement test. Retrieved from https://www.act.org/compass/tests/esl.html.

Aranyi, G., Schaik, P., & Barker, P. (2012). Using think-aloud and psychometrics to explore users' experience. Interacting with Computers, 24, 69-77.

Armstrong, P. (2014). Language assessment for court translators and interpreters. In A. J. Kunnan (2014), The Companion to Language Assessment, 21, John Wiley & Sons, Inc.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L. F. (2002). Some reflections on task-based language performance assessment. Language Testing, 19(4), 453-76.

Bachman, L. F. (2001, February). Speaking as a realization of communicative competence. Paper presented at the meeting of the American Association of Applied Linguistics – International Language Testing Association (AAAL-ILTA) Symposium, St. Louis, Missouri.

Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 19(4), 453 - 476.

Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and rater judgments in a performance test of foreign language speaking. Language Testing, 12, 238–257.


Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford, UK: Oxford University Press.

Barkaoui, K. (2010). Variability in ESL essay rating processes: The role of the rating scale and rater experience. Language Assessment Quarterly, 7, 54–74.

Barrera, Rule, & Diemart. (2001). The effect of writing with computers versus handwriting on the writing achievement of first-graders. Information Technology in Childhood Education, 215-229.

Bennett, R. E. (2000). Reinventing Assessment: speculations on the future of large-scale educational testing. Princeton, NJ: ETS.

Bonk, W.J., & Ockey, G. J. (2003). A many-facet Rasch Analysis of the second language group oral discussion task. Language Testing, 20(1), 89-110.

Brindley, G. & Slatyer, H. (2002). Exploring task difficulty in ESL listening assessment. Language Testing, 19 (4), 369-394.

Brown, A. (2000). An investigation of the rating process in the IELTS Speaking Module. In R. Tulloh (Ed.), Research reports (1999, Vol. 3, pp. 49-85). Canberra, Australia: IELTS Australia.

Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20, 1–25.

Brown, G & Yule, G. (1983). Teaching the Spoken Language. Cambridge: CUP.

Brown, J. D. (1989). Cloze item difficulty. JALT Journal, 11, 46–67.

Brown, J. D. (2012). Research on computers in language testing: Past, present, and future, In M. Thomas, H. Reinders, & M. Warschauer (Eds), Contemporary Computer-Assisted Language Learning (p. 73-94). London, UK: Bloomsbury.


Brown, A. & Iwashita, N. (1996). The role of language background in the validation of a computer-adaptive test, System, 24(2), 199-206.

Bugbee, A. C. (1996). The equivalence of paper-and-pencil and computer-based testing. Journal of Research on Computing in Education, 28, 282-299.

Cambridge ESOL. (2014). Business Language Testing Service (BULATS). Retrieved from http://www.bulats.org/

Cambridge English Language Assessment. (2014). Computer-based International Language Testing System. Retrieved from http://www.ielts.org/

Canale, M. (1986). The promise and threat of computerized adaptive assessment of reading comprehension. In C. Stansfield (Ed.), Technology and language testing (pp. 30–45). Washington, DC: TESOL.

Carey, M. D., Mannell, R. H., & Dunn, P. K. (2010). Does a rater's familiarity with a candidate's pronunciation affect the rating in oral proficiency interviews? Language Testing, 28(2), 201–219.

Center for Applied Linguistics. (2014). Basic English Skills Test (BEST) Plus. Retrieved from http://www.cal.org/aea/bestplus/

Chapelle, C. A. (2012). Conceptions of validity. In Fulcher G. & Davidson, F. (Eds.), The routledge handbook of language testing, 21-33, New York, Routledge

Chapelle, C. A., & Douglas, D. (2006). Assessing language through computer technology. Cambridge: Cambridge University Press.

Chapelle, C. A., Enright, M. K., & Jamieson, J. (Eds.). (2008). Building a validity argument for the Test of English as a Foreign LanguageTM. New York: Routledge.


Chapelle, C. A., Cotos, E., & Lee, J. (2015). Validity arguments for diagnostic assessment using automated writing evaluation. Language Testing, 33(2), 385–405.

Choi, I.-C., Kim, K. S., & Boo, J. (2003). Comparability of a paper-based language test and a computer-based language test. Language Testing, 20(3), 295–320.

Chung, Y. (2014). A test of productive English grammatical ability in academic writing: Development and validation. Unpublished doctoral dissertation, Iowa State University, Ames.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

Coniam, D. (2006). Evaluating computer-based and paper-based versions of an English- language listening test. ReCALL, 18, 193–211.

Congdon, P. J. & McQueen, J. (2000). The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement, 37, 163-178.

Crookes, G., & Rulon, K. (1988). Topic and feedback in native-speaker/ non-native speaker conversation. TESOL Quarterly, 22, 675–681.

Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7, 31—51.

Deville, C. & Chalhoub-Deville, M. (2006). Old and new thoughts on test scores variability: implications for reliability and validity. In M. Chalhoub-Deville. C.A. Chapelle & P. Duff (eds.) Inference and Generalizability in Applied Linguistics: Multiple perspectives. Amsterdam, The Netherlands: John Benjamins Publishing Company, 9-25.

DIALANG. (2014). Retrieved from http://www.lancaster.ac.uk/researchenterprise/dialang/about.htm


Dooey, P. (2008). Language testing and technology: Problems of transition to a new era. ReCALL, 20(1), 21–34.

Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991), Quality control in the development and use of performance assessment. Applied Measurement in Education, 4, 289-303.

Dunkel, P. A. (1999). Considerations in developing or using second/foreign language proficiency computer-adaptive tests. Language Learning & Technology, 2(2), 77-93.

Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197-221.

Educational Testing Service. (2014). Computer-based TOEFL (CBT TOEFL). Retrieved from http://www.ets.org/toefl

Educational Testing Service. (2014). Internet-Based TOEFL (iBT TOEFL). Retrieved from http://www.ets.org/toefl.

Engelhard, G., Jr. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31, 93–112.

Farnsworth, T. (2014). Assessing the oral English abilities of international teaching assistants in the USA. In A. J. Kunnan (2014), The Companion to Language Assessment, 29, John Wiley & Sons, Inc.

Fulcher, G. (1993). The construction and validation of rating scales for oral tests in English as a foreign language. University of Lancaster, UK: Unpublished PhD thesis.

Fulcher, G. (1996a). Testing tasks: issues in task design and the group oral. Language Testing,13, 23–51.

Fulcher, G. (2003a). Testing second language speaking. London: Longman.

Fulcher, G. (2003b). Interface design in computer-based language testing. Language Testing, 20(4), 384–408.


Fulcher, G. & Reiter, R. (2003). Task difficulty in speaking tests. Language Testing, 20(3), 321–344.

Gass, S., & Varonis, E. M. (1984). The effect of familiarity on the comprehensibility of nonnative speech. Language Learning, 34(1), 65–89.

George, D. & Mallery, M. (2010). Using SPSS for Windows step by step: a simple guide and reference. Boston, MA: Allyn & Bacon.

Ginther, A. (2002). Context and content visuals and performance on listening comprehen- sion stimuli. Language Testing, 19(2), 133–67.

Glaser, B. G., & Strauss, A. L. (1967). The discovery of grounded theory. New York: Aldine.

Hosseini, M., Abidin, M., & Baghdarnia, M. (2014). Comparability of Test Results of Computer Based Tests (CBT) and Paper and Pencil Tests (PPT) among English Language Learners in Iran. Procedia - Social and Behavioral Sciences, 98, 659 – 667.

Hoyt, W. T. & Kerns, M.-D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4, 403–424.

Hymes, D. (1972). On communicative competence. In Pride, J. & Holmes, J., editors, Sociolinguistics: Selected readings, 269–293. Harmondsworth, Middlesex: Penguin.

Iwashita, N. (1996). The validity of the paired interview format in oral performance assessment. Melbourne Papers in Language Testing, 5(2), 51–66.

Iwashita, N. (1998, March). The effect of visual support on the quality and quantity of interaction in task-based conversation. Paper presented at the 3rd Pacific Second Language Research Forum, Aoyama Gakuin University, Tokyo, Japan.

Iwashita, N., McNamara, T. & Elder, C. (2001). Can we predict task difficulty in an oral proficiency test? Exploring the potential of an information processing approach to task design. Language Learning, 51, 401–36.


Jamieson, J. (2005). Trends in Computer-Based Second Language Assessment. Annual Review of Applied Linguistics, 25, 228–242.

Jamieson, J., Eignor, D., Grabe, W., & Kunnan, A. J. (2008). The frameworks for the re- conceptualization of TOEFL. In C. Chapelle, J. Jamieson & M. Enright (Eds.), The new TOEFL (pp. 55–95). Mahwah, NJ: LEA.

Johnson, R. B. & Onwuegbuzie, A. J. (2004). Mixed methods research: a research paradigm whose time has come, Educational Researcher, 33(7), pp. 14–26.

Jun, H. (2014). A validity argument for the use of scores from a web-search-permitted and web-source-based integrated writing test. Unpublished doctoral dissertation, Iowa State University, Ames.

Kaiser, H. F. (1974). An index of factorial simplicity. Psychometrika, 39, 31-36.

Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527-535.

Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319-342.

Kane, M. T. (2002). Validating high-stakes testing programs. Educational Measurement: Issues and Practice, 21(1), 31-41.

Kane, M. T. (2004). Certification testing as an illustration of argument-based validation. Measurement: Interdisciplinary Research and Perspectives, 2(3), 135-170.

Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement, (4th ed.), 17-64, Westport, CT: American Council on Education.

Kane, M. T., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5-17.


Kenyon, D. (1992, February). Development and use of rating scales in language testing. Paper presented at the annual meeting of the Language Testing Research Colloquium, Vancouver, Canada.

Kenyon, D. & Malabonga, V. (2001). Comparing examinee attitudes toward computer- assisted and other oral proficiency assessments. Language Learning and Technology, 5, 60–83.

Kim, J. T. (2006). The effectiveness of test-takers’ participation in development of an innovative web-based speaking test for international teaching assistants at American colleges (Unpublished doctoral dissertation). University of Illinois at Urbana- Champaign.

Krueger, R. A., & Casey, M. A. (2009). Focus groups: A practical guide for applied research (4th ed.). Thousand Oaks, CA: SAGE Publications.

Kunnan, A. J. (2012). Language assessment for immigration and citizenship. In Fulcher, G. & Davidson, F. (Eds.), The routledge handbook of language testing, New York, Routledge.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.

Larson, J.W. & Madsen, H.S. (1985). Computer-adaptive language testing: Moving beyond computer-assisted testing. CALICO Journal, 2(3), 32-36.

Larson, J. (1999). Considerations for testing reading proficiency via computer-adaptive testing. In Chalhoub-Deville, M. (Ed.), Studies in language testing, Vol. 10. Issues in computer-adaptive testing of reading proficiency. Cambridge: University of Cambridge Press, 71–90.

Lazaraton, A. (2014). Spoken Discourse, In Kunnan, A. J. (2014). The Companion to Language Assessment. John Wiley & Sons, Inc.


Lim, S. G. (2009). Prompt and rater effects in second language writing performance assessment (Unpublished dissertation). The University of Michigan. Ann Arbor.

Lim, S. G. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced raters, Language Testing, 28(4), 543-560.

Linacre, M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3, 85–106.

Linacre, J. M. (2012). FACETS 3.71.4. Many-Facet Rasch Measurement: Facets Tutorial.

Liu, M., Moore, Z., Graham, L., & Lee, S. (2002). A look at the research on computer-based technology use in second language learning: A review of the literature from 1990– 2000. Journal of Research on Technology in Education, 34 (3).

Loyd, B. H., & Gressard, C. (1984). Reliability and factorial validity of computer attitude scales. Educational and Psychological Measurement, 44, 501-505.

Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12, 54–71.

Lunz, M. E., & Stahl, J. A. (1990). Judge consistency and severity across grading periods. Evaluation and the Health Professions, 13, 425–44.

Lunz, M.E., Wright, B.D. & Linacre, J.M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3, 331–45.

Luoma, S. (2004). Assessing Speaking. Cambridge University Press. Cambridge.

Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15, 158–180.


Mackiewicz, J. (2007). Compliments and criticisms in book reviews about business communication. Journal of Business and Technical Communication, 21(2), 188-215.

McNamara, T. F. (1996). Measuring second language performance. London: Longman.

Meiron, B. E. (1998). Rating oral proficiency tests: A triangulated study of rater thought processes. (Unpublished master's thesis). University of California at Los Angeles.

Molder, C. & Halleck, G. (2012). Designing language tests for specific social uses. In Fulcher G. & Davidson, F. (Eds.), The routledge handbook of language testing, 137- 149, New York, Routledge.

Munro, M. J., Derwing, T. M., & Morton, S. L. (2006). The mutual intelligibility of L2 speech. Studies in Second Language Acquisition, 28(1), 111-131.

Norris, J. M., Brown, J. D., Hudson, T. D., & Bonk, W. (2002). Examinee abilities and task difficulty in task-based second language performance assessment. Language Testing, 19(4), 395-418.

Norris, J.M., Brown, J.D., Hudson, T.D. & Bonk, W. (2000). Assessing performance on complex L2 tasks: investigating raters, examinees and tasks. Paper presented at the 22nd Language Testing Research Colloquium, Vancouver, March.

Ockey, G. J. (2009). The effects of a test taker’s group members’ personalities on the test taker’s second language group oral discussion test scores. Language Testing, 26 (2), 161–186.

Oral English Proficiency Program (2013). OEPT Technical Manual, Purdue University. Retrieved from http://www.purdue.edu/oepp/oept/

Oller, J. W. (2012). Language assessment for communication disorders. In Fulcher G. & Davidson, F. (Eds.), The routledge handbook of language testing, 150-161, New York, Routledge.


Park, Y. M. (1998). Academic and ethnic background as factors affecting writing performance. In A. C. Purves (Ed.), Writing across languages and cultures: Issues in contrastive rhetoric, 261-272, Newbury Park, CA: SAGE Publications.

Pearson Education, Inc. (2009a). Versant tests, Retrieved November 27, 2015 from http://www.ordinate.com/versant/versant.jsp.

Pearson Education, Inc. (2009b). Versant English Test: Test description and validation summary. Palo Alto, CA: Pearson. Retrieved November 27, 2015 from http://ordinate.com/technology/VersantEnglishTestValidation.pdf.

Pearson Education, Inc. (2014). Pearson Test of English (PTE) AcademicTM. Retrieved from http://pearsonpte.com/

Pearson Education, Inc. (2014). Versant series of speaking assessment. Retrieved from https://www.versanttest.com/

Pollitt, A. (1991). Giving students a sporting chance: assessment by counting and judging. In Alderson, J.C. and North, B., editors, Language testing in the 1990s. Modern English Publications in Association with the British Council, 46–59.

Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to? In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium, 74-91. Cambridge, England: Cambridge University Press.

Reed, D. J., & Halleck, G. B. (1997). Probing above the ceiling in oral interviews: What's up there? In A. Huhta, V. Kohonen, L. Kurki-Suonio, & S. Luoma (Eds.), Current developments and alternatives in language assessment: Proceedings of LTRC 96 (pp. 225–238). Jyväskylä: University of Jyväskylä and University of Tampere.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.


Skehan, P. (1998). Processing perspectives to second language development, instruction, performance, and assessment. Thames Valley Working Papers in Applied Linguistics, 4, 70–88.

Stahl, J.A. & Lunz, M.E. (1991). Judge performance reports: media and message. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Suvorov, R. (2009). Context visuals in L2 listening tests: The effects of photographs and video vs. audio-only format. In C. A. Chapelle, H. G. Jun, & I. Katz (Eds.), Developing and evaluating language learning materials (pp. 53–68). Ames: Iowa State University.

Suvorov, R. & Hegelheimer, V. (2014). Computer-assisted language testing. In the Companion to Language Assessment, Kunnan, A. (Eds.), Ch. 36, John Wiley & Sons, Inc.

Tauroza, S., & Luk, J. (1997). Accent and second language listening comprehension. RELC Journal, 28(1), 54–71.

Taylor, C., Jamieson, J., Eignor, D. & Kirsch, I. (1998). The Relationship between Computer Familiarity and Performance on Computer-based TOEFL Test Tasks (TOEFL Research Report No. 61). Princeton, NJ: ETS.

Trites, L. & McGroarty, M. (2005). Reading to learn and reading to integrate: new tasks for reading comprehension tests?, Language Testing, 22, 174-210.

Wagner, E. (2007). Are they watching? Test-taker viewing behavior during an L2 video listening test. Language Learning & Technology, 11(1), 67–86.

Wall, D. & Horák, T. (2006). The impact of changes in the TOEFL examination on teaching and learning in central and eastern Europe: Phase 1, The baseline study (TOEFL Research Report RR-06-18, TOEFL-MS-34). Princeton, NJ: Educational Testing Service.


Perpetual Technology Group (PTG). (2014). Web Computerized Adaptive Placement Exam (Web-CAPE). Retrieved from http://www.perpetualworks.com.

Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15, 263–87.

Weir, C. (2005). Language Testing and validation: An evidence-based approach. Palgrave Macmillan. New York, N.Y.

Winke, P., Gass, S., & Myford, C. (2011). The relationship between raters’ prior language study and the evaluation of foreign language speech samples (TOEFL iBT research report no. 16). Princeton, NJ: ETS.

Winke, P., Gass, S., & Myford, C. (2013). Raters’ L2 background as a potential source of bias in rating oral performance. Language Testing, 30(2). 231-252.

Wolfe, W. E. & Manalo, R. J. (2005). An investigation of the impact of composition medium on the quality of TOEFL writing scores. ETS Research Report, RR-04-29.

Xi, X. (2008). Methods of test validation. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of language and education. Vol. 7: Language testing and assessment (2nd ed., pp. 177-96). New York, NY: Springer.

Xi, X., Bridgeman, B. & Wendler, C. (2014). Test of English for academic purposes in university admission, In A. J. Kunnan (2014), The Companion to Language Assessment, 19, John Wiley & Sons, Inc.

Yan, X. (2014). An examination of rater performance on a local oral English proficiency test: A mixed-methods approach. Language Testing, 31(4), 501-527.

Yan, R. (2014). Assessing the English language proficiency of international aviation staff. In A. J. Kunnan (2014), The Companion to Language Assessment, 29, John Wiley & Sons, Inc.

Yang, H. (2012). Needs Analysis: Investigating diverse stakeholders’ needs from an oral proficiency test for ITAs. Unpublished manuscript.


Young, R., Shermis, M. D., Brutten, S. & Perkins, K. (1996). From conversational to computer adaptive testing of ESL reading comprehension. System, 24(1), 32-40.

Yu, X., & Lowe, J. (2007). Computer assisted testing of spoken English: A study to evaluate the SFLEP college English oral test in China. In F. Khandia (Ed.), 11th CAA international computer assisted assessment conference: Proceedings of the conference on 10th and 11th July 2007 at Loughborough University (pp. 489-502). Loughborough Leicestershire, UK: Loughborough University.


APPENDIX A

SCREENSHOT OF THE TEACH RATING PAGE IN R-PLAT


APPENDIX B

SCREENSHOT OF THE FINAL SCORE CONFIRMATION PAGE IN R-PLAT


APPENDIX C

SAMPLE INFORMED CONSENT FORM

Title of Study: Investigating raters’ needs and perceptions towards the current and the revised OECT rating system, especially diagnostic features, for developing a web-based OECT feedback system.

Investigators: Hyejin Yang

This is a research study. Please take your time in deciding if you would like to participate. Please feel free to ask questions at any time.

INTRODUCTION The purpose of this study is to investigate OECT (Oral English Certification Test) raters’ experience and perception towards the OECT rating procedures and feedback. You are being invited to participate in this study because you have rated test takers’ performance in the OECT at Iowa State University.

DESCRIPTION OF PROCEDURES If you agree to participate, you will be asked to take part in a one-hour orientation session and a one-hour focus group interview. During the orientation session, you will be briefly informed of the specific purpose of the study and the procedures of the focus group interview. During the interview, you will be asked about your experiences in rating students' performance and providing feedback to students during the OECT rating procedures. You will also be asked about your perceptions of the current rating sheets and diagnostic features. The interview will last for an hour and will be audio-recorded.

RISKS While participating in this study, you will not experience any risks. However, if you feel psychological stress due to the focus group interview while you share your opinions, you can take a break from the interview sessions. Besides, you can quit the study at any time.

BENEFITS If you decide to participate in this study, there may be no direct benefit to you. It is hoped that the information gained in this study will benefit society by providing a sound basis and valuable implications for the future development of language test administration and rating procedures that adopt technology for language testing.

ALTERNATIVES TO PARTICIPATION It does not apply to the current study.


COSTS AND COMPENSATION You will not have any costs from participating in this study. You will not be compensated for participating in this study.

PARTICIPANT RIGHTS Your participation in this study is completely voluntary and you may refuse to participate or leave the study at any time. If you decide to not participate in the study or leave the study early, it will not result in any penalty or loss of benefits to which you are otherwise entitled. Furthermore, since this study is being limited to students 18 and older, volunteers must be 18 and older to participate.

RESEARCH INJURY Emergency treatment of any injuries that may occur as a direct result of participation in this research is available at the Iowa State University Thomas B. Thielen Student Health Center, and/or referred to Mary Greeley Medical Center or another physician or medical facility at the location of the research activity. Compensation for any injuries will be paid if it is determined under the Iowa Tort Claims Act, Chapter 669 Iowa Code. Claims for compensation should be submitted on approved forms to the State Appeals Board and are available from the Iowa State University Office of Risk Management and Insurance.

CONFIDENTIALITY Records identifying participants will be kept confidential to the extent permitted by applicable laws and regulations and will not be made publicly available. However, federal government regulatory agencies, auditing departments of Iowa State University, and the Institutional Review Board (a committee that reviews and approves human subject research studies) may inspect and/or copy your records for quality assurance and data analysis. These records may contain private information.

To ensure confidentiality to the extent permitted by law, the following measures will be taken: subjects will be assigned pseudonyms (that is, fake names). The researcher of this study, Hyejin Yang, will have access to the study records, which will be kept confidential in a locked filing cabinet and in password-protected computer files. The records will be retained for five years before erasure or destruction. If the results are published, your identity will remain confidential.

QUESTIONS OR PROBLEMS You are encouraged to ask questions at any time during this study. • For further information about the study contact Hyejin Yang at 217-722-6966 / [email protected] or Carol Chapelle at 515-294-7274 / [email protected] or Elena Cotos at 515-294-1958 / [email protected] • If you have any questions about the rights of research subjects or research-related injury, please contact the IRB Administrator, (515) 294-4566, [email protected], or Director, (515) 294-3115, Office for Responsible Research, Iowa State University, Ames, Iowa 50011.


*************************************************************************

Your signature indicates that you voluntarily agree to participate in this study, that the study has been explained to you, that you have been given the time to read the document and that your questions have been satisfactorily answered. You will receive a copy of the written informed consent prior to your participation in the study.

Participant’s Name (printed)

(Participant’s Signature) (Date)

(Signature of Parent/Guardian or Legally Authorized Representative) (Date)


APPENDIX D

QUESTIONNAIRE FOR NEW AND EXPERIENCED RATERS

1. How many testing days have you rated with R-Plat? 1) None 2) 1 session 3) 2-3 sessions 4) 4-5 sessions 5) 6-7 sessions 6) more than 8 sessions

2. How comfortable are you INTERVIEWING with R-Plat? (Please answer only if you have interviewed a test taker with R-Plat)

(Very uncomfortable) 1 2 3 4 5 6 (Very comfortable)

2-1. Please explain your answer.

3. How comfortable are you RATING with R-Plat?

(Very uncomfortable) 1 2 3 4 5 6 (Very comfortable)

3-1. Please explain your answer.

4. How comfortable are you checking diagnostic features and their descriptors with R-Plat? (Very uncomfortable) 1 2 3 4 5 6 (Very comfortable)

4-1. Please explain your answer.

5. How clear is the way Diagnostic Features and their descriptors are presented in R-Plat?
Comprehensibility (Ease of understanding, Accent, Volume): (Very unclear) 1 2 3 4 5 6 (Very clear)
Pronunciation (Vowels, consonants, insertion, enunciation, reduction, intonation, rhythm, word stress): (Very unclear) 1 2 3 4 5 6 (Very clear)
Fluency (Phrasing, choppiness, hesitations, halting, false starts, pauses, incomplete utterances/ideas, pace/speed): (Very unclear) 1 2 3 4 5 6 (Very clear)
Vocabulary (Breadth instead of scope, Word choice/expression): (Very unclear) 1 2 3 4 5 6 (Very clear)
Grammar (Grammatical complexity, word order, verb tenses/forms, word form, singular/plural, pronouns, articles): (Very unclear) 1 2 3 4 5 6 (Very clear)
Pragmatics (Interaction, compensation strategies): (Very unclear) 1 2 3 4 5 6 (Very clear)
Listening: (Very unclear) 1 2 3 4 5 6 (Very clear)


6. How comfortable are you typing comments in R-Plat? (Very uncomfortable) 1 2 3 4 5 6 (Very comfortable)

6-1. Please explain your answer.

7. How clear is the OPI rating page in R-Plat? (Test Information, Scoring features, Comment boxes)
Information about test takers (Name, Test number, Test date, Interviewer): (Very unclear) 1 2 3 4 5 6 (Very clear)
Rating for each question / impression question: (Very unclear) 1 2 3 4 5 6 (Very clear)
Comment boxes: (Very unclear) 1 2 3 4 5 6 (Very clear)
Overall: (Very unclear) 1 2 3 4 5 6 (Very clear)

7-1. Please explain your answer.

8. How clear is the TEACH rating page in R-Plat? (Test Information, Scoring features, Comment boxes)
Information about test takers (Name, Test number, Test date, Interviewer): (Very unclear) 1 2 3 4 5 6 (Very clear)
Rating for each question / impression question: (Very unclear) 1 2 3 4 5 6 (Very clear)
Comment boxes: (Very unclear) 1 2 3 4 5 6 (Very clear)
Overall: (Very unclear) 1 2 3 4 5 6 (Very clear)

8-1. Please explain your answer.

9. How clear is the final score confirm page in R-Plat?

(Very unclear) 1 2 3 4 5 6 (Very clear)

9-1. Please explain your answer.

10. If any diagnostic features remain unclear, please identify them and state the reasons.

11. How clear is the web-based RATING PATH in R-Plat? Rating Path: from Log in to Final Score Confirm page?

(Very unclear) 1 2 3 4 5 6 (Very clear)

11-1. Please explain your answer.


12. In general, how do you use diagnostic features and their descriptors when scoring?

13. When you make a final decision for scores, do you typically review your ratings for diagnostic features? 1) Yes 2) No 13-1. Please explain your answer.

14. Do your ratings of diagnostic features influence your final decision? 1) Yes 2) No

14-1. Please explain your answer.

15. What do you find most challenging about use of the diagnostic features when scoring?

16. When you leave comments while rating, what do you typically mention, and why? (e.g. error examples: pronunciation errors, grammar errors, overall performance, etc.)

17. When you make a final decision for scores, do you typically review your comments?

1) Yes 2) No

17-1. Please explain your answer.

18. Do your comments typed in R-Plat influence your final decision?

1) Yes 2) No

18-1. Please explain your answer.

19. What is most challenging about writing comments in R-Plat?

20. Overall, how effective is R-Plat for rating the OECT?

(Very ineffective) 1 2 3 4 5 6 (Very effective)

20-1. Please explain your answer.

21. Overall, how satisfied are you with R-Plat for rating the OECT?

(Very unsatisfied) 1 2 3 4 5 6 (Very satisfied)

21-1. Please explain your answer.

22. What are the strengths / weaknesses of the OPI page?

23. What are the strengths / weaknesses of the TEACH page?


24. What are the strengths / weaknesses of the Final score confirm page?

25. Please provide any suggestions you have for improving R-Plat.

26. Please provide any suggestions you have for the next R-Plat workshop.


APPENDIX E

PROTOCOL FOR THE FOCUS GROUP INTERVIEWS

Introduction:

Hi, my name is Hyejin and I will be helping to lead this discussion today. We will be here about an hour. You are here today because you have rated the OECT at Iowa State University. I am interested in hearing your ideas about your experience and opinions with regard to the rating procedures and the provision of feedback in the web-based rating system. This type of study is called a ‘focus group.’ Before we get started, have any of you been in a focus group before? For those of you who haven't, I'll give you some information. First of all, I am not here to convince you of anything. I am just here to help lead the discussion, so please feel free to make any positive or negative comments today. I want to hear your opinions whatever they are.

This is a free-flowing discussion. We're here to learn as much as possible about everyone's ideas. There are no wrong answers.

Ground Rules:

Here are some guidelines for you to know about:

Recording: Please notice the audio recorder. I am recording this conversation so that I don't have to take notes. I will use the tapes for a report I have to write. No names will be used. Because of the taping, please speak in a loud voice and speak one at a time.

Please avoid side conversations with the person sitting next to you. Those side comments are usually the best information, so please share them with the whole group.

You do not need to address all of your comments to me. You can respond directly to another person who has made the point.

Not everyone has to answer every single question, but please make sure I hear from each of you at some point this morning. If I don't hear from you, I'll assume you agree with what is being said.

If I cut you off, I'm not trying to be rude. We just have a lot to cover in our short time together.

Acknowledgement:

I want to thank you each for being here. Your time is very valuable and your opinions are important. Let's get started by having you introduce yourself to the group and tell us:


APPENDIX F

QUESTIONS FOR THE FOCUS GROUP AND INDIVIDUAL INTERVIEWS

1. How many testing days have you rated with R-Plat?
2. How comfortable are you INTERVIEWING with R-Plat? (Please answer only if you have interviewed a test taker with R-Plat)
3. How comfortable are you RATING with R-Plat?
4. How comfortable are you checking diagnostic features and their descriptors with R-Plat?
5. How clear is the way Diagnostic Features and their descriptors are presented in R-Plat?
6. How comfortable are you typing comments in R-Plat?
7. How clear is the OPI rating page in R-Plat? (Test Information, Scoring features, Comment boxes)
8. How clear is the TEACH rating page in R-Plat? (Test Information, Scoring features, Comment boxes)
9. How clear is the final score confirm page in R-Plat?
10. If any diagnostic features remain unclear, please identify them and state the reasons.
11. How clear is the web-based RATING PATH in R-Plat? Rating Path: from Log in to Final Score Confirm page?
12. In general, how do you use diagnostic features and their descriptors when scoring?
13. When you make a final decision for scores, do you typically review your ratings for diagnostic features?
14. Do your ratings of diagnostic features influence your final decision?
15. What do you find most challenging about use of the diagnostic features when scoring?
16. When you leave comments while rating, what do you typically mention, and why? (e.g. error examples: pronunciation errors, grammar errors, overall performance, etc.)
17. When you make a final decision for scores, do you typically review your comments?
18. Do your comments typed in R-Plat influence your final decision?
19. What is most challenging about writing comments in R-Plat?


20. Overall, how effective is R-Plat for rating the OECT?
21. Overall, how satisfied are you with R-Plat for rating the OECT?
21-1. Please explain your answer.
22. What are the strengths / weaknesses of the OPI page?
23. What are the strengths / weaknesses of the TEACH page?
24. What are the strengths / weaknesses of the Final score confirm page?
25. Please provide any suggestions you have for improving R-Plat.
26. Please provide any suggestions you have for the next R-Plat workshop.


APPENDIX G

SCORING RUBRIC (OECT Rater Manual, 2014)

Score levels: Advanced (230-300), Intermediate-high (210-220), Intermediate-mid (170-200), Intermediate-low (below 160)

Functional competency
Advanced: Support arguments, hypothesize, discuss in detail; highly competent to convey ideas on familiar & unfamiliar, concrete & abstract topics & to handle complicated communicative tasks in all situations.
Intermediate-high: Explain, narrate, describe, compare; fairly competent to convey ideas on concrete, familiar topics & handle unsophisticated tasks in many formal and informal situations; linguistic performance noticeably weakens when handling abstract, unfamiliar topics or performing more complicated tasks.
Intermediate-mid: Explain, narrate, describe, compare in simple ways, maintain conversation; able to convey ideas on basic & concrete topics of personal relevance in informal and few formal situations; can occasionally perform functions of Level 2 but unable to sustain performance.
Intermediate-low: Little to no functional ability; able to provide basic information & respond to simple questions/requests but use language reactively; often unable to understand the task.

Comprehensibility
Advanced: errors rarely interfere with communication or distract the native speaker from the message
Intermediate-high: understood without difficulty by native speakers unaccustomed to non-native speaker speech
Intermediate-mid: understood with some difficulty by native speakers unaccustomed to non-native speaker speech
Intermediate-low: may be difficult to understand even by native speakers accustomed to non-native speaker speech

Pronunciation
Advanced: very little or no interference with meaning
Intermediate-high: some interference from native language, generally comprehensible
Intermediate-mid: errors frequent, but generally comprehensible to a sympathetic listener
Intermediate-low: very strong interference from native language; little or not comprehensible

Fluency
Advanced: ability to link in paragraphs and speak with ease
Intermediate-high: able to link sentences into paragraphs with occasional grasping for words and phrases
Intermediate-mid: uses mostly discrete or isolated sentences; periodic groping for words and phrases
Intermediate-low: very poor; continuous groping for words and phrases

Vocabulary
Advanced: confident use of broad vocabulary
Intermediate-high: mostly successful in using vocabulary to convey intended meaning; general vocabulary for general interest topics; able to circumlocute
Intermediate-mid: sufficient vocabulary, generally related to self and immediate environment
Intermediate-low: rote and memorized utterances; unable to adapt memorized vocabulary to convey meaning

Grammar
Advanced: occasional errors with low frequency; complex structures; no patterns of error
Intermediate-high: most grammar constructions accurate, but no thorough control; some more complex structures
Intermediate-mid: accuracy in elementary constructions, but with partial control
Intermediate-low: very poor