Computationally Modeling the Impact of Task-Appropriate Language Complexity and Accuracy on Human Grading of German Essays
Total Page:16
File Type:pdf, Size:1020Kb
Computationally Modeling the Impact of Task-Appropriate Language Complexity and Accuracy on Human Grading of German Essays Zarah Weissa Anja Riemenschneiderb Pauline Schroter¨ b Detmar Meurersa;b a Dept. of Linguistics & LEAD b Institute for Educational Quality Improvement University of Tubingen¨ Humboldt-Universitat¨ zu Berlin [email protected] [email protected] [email protected] [email protected] Abstract to other education systems, such as the German one, where so far there is hardly any discus- Computational linguistic research on the lan- sion of automating the assessment of learning out- guage complexity of student writing typi- cally involves human ratings as a gold stan- comes and no high-stakes testing industry. In the dard. However, educational science shows that German Abitur examination, the official school- teachers find it difficult to identify and cleanly leaving state examination that qualifies students separate accuracy, different aspects of com- for admission to university, teachers grade lan- plexity, contents, and structure. In this paper, guage performance and content in essays without we therefore explore the use of computational technical assistance, using grading templates that linguistic methods to investigate how task- specify content and language expectations. In the appropriate complexity and accuracy relate to language arts and literacy subject-matters (Ger- the grading of overall performance, content performance, and language performance as as- man, English, French, etc.), language performance signed by teachers. is a crucial component of the overall grade across Based on texts written by students for the offi- all states. Yet, unlike content, language require- cial school-leaving state examination (Abitur), ments are only loosely specified in the education we show that teachers successfully assign standards, mentioning complex and diverse syn- higher language performance grades to essays tax and lexis, and a coherent argumentation struc- with higher task-appropriate language com- ture as indicators of high-quality language perfor- plexity and properly separate this from content mance (KMK, 2014b). The exact implementa- scores. Yet, accuracy impacts teacher assess- tion of these language requirements is left to the ment for all grading rubrics, also the content discretion of the teachers. Educational science score, overemphasizing the role of accuracy. has questioned to which extent teachers are biased Our analysis is based on broad computational by construct-irrelevant text characteristics while linguistic modeling of German language com- grading. There is evidence that mechanical ac- plexity and an innovative theory- and data- driven feature aggregation method inferring curacy over-proportionally influences grades and task-appropriate language complexity. even affects the evaluation of unrelated concepts such as content (Cumming et al., 2002; Rezaei 1 Introduction and Lovorn, 2010). Differences in lexical sophis- tication and diversity have been shown to impact Official state education standards highlight the rel- teachers’ evaluation of grammar and essay struc- evance of language complexity for the evalua- ture (Vogelin¨ et al., 2019). This is a potentially tion of text readability and reading skills(CCSSO, severe issue for the German education system. 2010) and academic writing proficiency in stu- dents first and second language (KMK, 2014a,b). We pick up on this issue by investigating which The highly assessment-driven U.S. public edu- role language complexity and accuracy play in cation system has long recognized the benefits teachers’ grading of German Abitur essays. For of automating the evaluation of student learn- this, we build upon previous work on complex- ing outcomes, including very substantial research, ity and accuracy in the context of the Complexity, development, and commercial applications tar- Accuracy, and Fluency (CAF) framework (Wolfe- geting automatic essay scoring (AES, Shermis Quintero et al., 1998; Bulte´ and Housen, 2012) and Burstein, 2013; Vajjala, 2018; Yannakoudakis employed in Second Language Acquisition (SLA) et al., 2018). This situation is not transferable research to model different types of language per- 30 Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 30–45 Florence, Italy, August 2, 2019. c 2019 Association for Computational Linguistics formance (McNamara et al., 2010; Vajjala and and Meurers, 2019). Complexity has also been in- Meurers, 2012; Bulte´ and Housen, 2014). We es- vestigated in relation to (academic) writing profi- tablish an automatically obtained measure of task- ciency of native speakers (Crossley et al., 2011; appropriate overall language complexity. With McNamara et al., 2010). Research on languages this, we identify texts of more and less appropri- other than English, remains rather limited, with ate language complexity, which we then manually some work on German, Russian, Swedish, Italian, assess for their accuracy. We use this to exper- and French (Weiss and Meurers, 2018; Reynolds, imentally examine teaching experts’ grading be- 2016; Pilan´ et al., 2015; Dell’Orletta et al., 2014; haviour and how it is influenced by accuracy and Franc¸ois and Fairon, 2012). complexity. Our results show that while teachers Recently, research has increasingly focused on seem to successfully identify language complexity the influence of task effects on language complex- and include it in their grading when appropriate, ity in writing quality and language proficiency as- they are heavily biased by accuracy even when it sessment, both in terms of their influence on CAF is construct-irrelevant. development in the context of the two main frame- Our work innovates in exploiting computational works (Robinson, 2001; Skehan, 1996) as well as linguistic methods to address questions of broader its implications for AES systems and other forms relevance from the domain of educational science of language proficiency modeling (Yannakoudakis by using sophisticated language complexity mod- et al., 2018; Dell’Orletta et al., 2014). Alex- eling. This is the first computational linguistic opoulou et al.(2017) show that task complexity analysis of German Abitur essays and their hu- and task type strongly affect English as a Foreign man grading, illustrating the potential of cross- Language (EFL) essay writing complexity. Topic disciplinary work bringing together computational and text type, too, have been found to impact CAF linguistics and empirical educational science. The constructs in EFL writing and in particular lan- novel approach presented for the assessment of guage complexity (Yoon and Polio, 2016; Yoon, appropriate overall language complexity also pro- 2017; Yang et al., 2015). Vajjala(2018) demon- vides valuable insights into the task- or text type- strates task effects across EFL corpora to the ex- dependence of complexity features. This is of di- tent that text length strongly impacts essay quality rect relevance for the current discussion of task- negatively on one and positively on the other data effects in CAF research (Alexopoulou et al., 2017; set. Her results further corroborate the importance Yoon, 2017). of accuracy for essay quality across data sets. Ac- The article is structured as follows: We briefly curacy has overall received considerably less at- review related work on complexity assessment tention in SLA research than complexity (Larsen- and insights from educational science into human Freeman, 2006; Yoon and Polio, 2016). grading behavior. We then present our data set and how we automatically extract language complex- An orthogonal strand of research investigates ity measures. Section5 elaborates on the construc- the quality of human judgments of writing qual- tion of appropriate overall language complexity ity and how complexity and accuracy impact them. including a qualitative analysis of task-wise dif- It has been demonstrated that teachers are bi- ferences between document vectors. Section6 re- ased by accuracy and in particular spelling even ports our experiment on teacher grading behavior. when it is irrelevant for the construct under evalu- We close in Section7 with an outlook. ation such as content quality (Rezaei and Lovorn, 2010; Cumming et al., 2002; Scannell and Mar- 2 Related Work shall, 1966). Other studies showed that charac- teristics such as syntactic complexity, text length, Language complexity, commonly defined as “[t]he and lexical sophistication impact inter-rater agree- extent to which the language produced in perform- ment (Lim, 2019; Wind et al., 2017; Wolfe et al., ing a task is elaborate and varied” (Ellis, 2003, 2016).V ogelin¨ et al.(2019) experimentally ma- p. 340), has been studied extensively in the context nipulate the lexical diversity and sophistication of of second language development and proficiency EFL learners’ argumentative essays and let Swiss and text readability in particular with regard to English teachers rate them for their overall qual- the English language (Vajjala and Meurers, 2012; ity, grammar, and essay frame. Their findings Guo et al., 2013; Bulte´ and Housen, 2014; Chen show that when the lexical diversity and sophis- 31 tication of an essay was manually reduced, it re- ceived lower grades not only for its overall quality but also for grammar and the essay’s frame, i.e., the structured presentation of the writing objective through introduction and conclusion. 3 The Abitur Data We analyzed 344 essays that were written