Negative Language Transfer in Learner English: a New Dataset
Total Page:16
File Type:pdf, Size:1020Kb
Negative language transfer in learner English: A new dataset Leticia Farias Wanderley Nicole Zhao Carrie Demmans Epp EdTeKLA Research Group Department of Computing Science University of Alberta {fariaswa,zzhao1,cdemmansepp}@ualberta.ca Abstract Although learner corpora are used to model grammatical error correction systems, they are not Automatic personalized corrective feedback as often employed in the enhancement of learner can help language learners from different back- feedback. Language learners benefit from direct grounds better acquire a new language. This paper introduces a learner English dataset in corrective feedback (Sheen, 2007). Moreover, feed- which learner errors are accompanied by in- back that makes them reflect upon their errors and formation about possible error sources. This distinguish a cause for their mistakes correlates dataset contains manually annotated error to increased performance (Demmans Epp and Mc- causes for learner writing errors. These causes Calla, 2011; Sheen, 2007; Shintani and Ellis, 2013; tie learner mistakes to structures from their Karim and Nassaji, 2020). In this paper, we intro- first languages, when the rules in English and duce a learner English dataset enhanced with error in the first language diverge. This new dataset will enable second language acquisition re- cause information and concrete examples of learner searchers to computationally analyze a large errors that relate to the learners’ first language. It quantity of learner errors that are related to has the potential to help create computational mod- language transfer from the learners’ first lan- els that provide personalized feedback to English guage. The dataset can also be applied in per- language learners based on the learners’ native lan- sonalizing grammatical error correction sys- guages. This new dataset can be accessed by fol- tems according to the learners’ first language lowing the instructions described in our research and in providing feedback that is informed by group’s repository2. the cause of an error. The dataset presented in this paper contains sup- 1 Introduction plementary explanations for errors made by Chi- English has become an international language. It nese native speakers when writing in English. Chi- is the lingua-franca that unites native speakers of nese learners represent a growing share of the En- other languages around the world (Lysandrou and glish as a Second Language market. A nationwide Lysandrou, 2003). For that reason, it is not hard language survey from the Chinese government re- to believe that the teaching of English as a Sec- ports that at the beginning of 2001 at least one third ond Language1 has caught a lot of attention from of China’s population was learning a new language the research community (Caine, 2008). Over the and out of those, 93% were learning English (Wei years, computational linguistics researchers have and Su, 2012). These numbers have only seemed collected corpora containing text written by lan- to increase in recent years. The latest survey of guage learners. These corpora have made possible international students in the US that was conducted several advances in language teaching, such as au- by the Institute of International Education(2020) tomatic writing assessment (Rahimi et al., 2017; shows that 35% of these students come from China. Yannakoudakis et al., 2011) and automatic error With that in mind, it is reasonable to say that this detection and correction (Chollampatt et al., 2016; large portion of English learners can benefit from Nadejde and Tetreault, 2019; Omelianchuk et al., receiving personalized feedback on their writing 2020). errors. 1Throughout this manuscript, we use the term second lan- guage to refer to any additional language beyond the mother tongue, whether the speaker is in a second or foreign language 2https://github.com/EdTeKLA/ learning context. LanguageTransfer 3129 Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3129–3142 June 6–11, 2021. ©2021 Association for Computational Linguistics 1.1 Grammatical error correction awareness (Kupferberg, 1999; Han, 2001). One computational task that can benefit from the Advances in learner data annotation foster lan- contrast between first (L1) and second language guage transfer research by providing details that (L2) is Grammatical Error Correction (GEC). In can be used to inform the contrast between learners’ this task, the objective is to find and correct gram- L1s and L2s and possibly further explain incorrect matical errors in learner text (Ng et al., 2013, 2014; learner utterances. Highlighting this contrast is Bryant et al., 2019). Since the GEC task was intro- beneficial to learners as it has the potential of in- duced in 2013, many types of grammatical errors creasing their metalinguistic awareness. That is, it have been added. The BEA-2019 Shared Task up- can improve the learners’ capacity to think about graded the task’s error pool by adding new test sets language as an object. It supports their ability to containing essays written by learners from a more recognise the mismatch between their L1 and L2 diverse set of nationalities. This update is mean- as well as their ability to refrain from incorrectly ingful as it exposes the GEC models to a more using L1 rules in L2 utterances (Wanderley and general set of error types. In the previous tasks, the Demmans Epp, 2020). essays analyzed were written by South-East Asian Considering the importance of feedback for students and due to that, the distribution of gram- learners, Nagata(2019) introduced the task of feed- matical error types in the dataset was skewed to- back comment generation. In this task, the ob- wards that group’s most common mistakes (Bryant jective is to automatically generate feedback for et al., 2019). learner essays. Along with this new task, the au- Grammatical error correction research shows thor introduced a dataset that contains learner es- that GEC systems benefit from L1 specific learner says and their respective annotated feedback. The data. Rozovskaya and Roth(2011) used L1 spe- annotation available in this new dataset contains cific learner data to adapt a Naïve Bayes GEC sys- feedback regarding preposition usage errors and tem. They applied priors extracted from L1 specific text organization. It also contains annotation sam- learner error distributions to improve the correction ples in which the feedback praises the learners’ of preposition replacement errors. Chollampatt writing. While our annotation procedure focused et al.(2016) used L1-specific data from Russian, on annotating Chinese L1 learner errors, with a spe- Spanish, and Chinese learners to adapt a general cial focus on whether those errors were related to GEC model. The resulting adapted models out- negative language transfer, our datasets may com- performed their general counterpart. Nadejde and plement the one described by Nagata(2019). As, Tetreault(2019) expand on this topic by adapting ultimately, both efforts aim to provide more person- general GEC models to L1-specific and proficiency- alized feedback to language learners. specific learner data. Their experimental setup cov- ered twelve different L1s and five proficiency levels. 1.3 Native language identification Both L1 and proficiency adaptations outperformed The differences between L1s and English can pro- the baseline, and the models which achieved the vide valuable features that help identify learners’ best performance were the ones that were adapted L1s. Information about learner errors and their as- to both features at the same time. sociation with the learners’ L1s can be useful in tasks such as native language identification. This 1.2 Writing feedback task takes advantage of latent signals in non-native Direct corrective feedback, such as grammatical written data to identify the authors’ L1s (Tetreault error correction, helps language learners improve et al., 2013). Wong and Dras(2009) apply the con- their writing proficiency (Liaqat et al., 2020; Sheen, trastive analysis hypothesis (Lado, 1957), which 2007). In addition to that, feedback that contrasts correlates the learner’s more common errors to erroneous utterances with correct ones facilitates the divergences between L1 and L2 in a native the acquisition of accurate language structures. language identification task. They analyzed three This facilitation occurs both when the feedback types of syntactic errors and found evidence that is applied to L1-transfer and non-transfer errors the contrastive analysis hypothesis can aid in L1 (Tomasello and Herron, 1989). Fine-tuning error detection. The distribution of learner errors alone feedback by contrasting L1 and L2 has been shown can also be employed in native language Iidentifi- to increase learners’ language understanding and cation. Flanagan et al.(2015) showed that writing 3130 error patterns performed well as features in the by Berzak et al.(2016); they created a manually prediction of learners’ native languages. annotated syntactic treebank for learner English. The Treebank of Learner English they created aims 1.4 Negative language transfer to facilitate second language acquisition research The correlation between L1s and writing error pat- and research on the processing of ungrammatical terns happens because language learners, some- language. It contains part-of-speech tags and de- times unknowingly, use certain strategies when pendency parse trees for erroneous learner English they are learning