
Neural and FST-based approaches to grammatical error correction

Zheng Yuan♠♦   Felix Stahlberg|   Marek Rei♠♦   Bill Byrne|   Helen Yannakoudakis♠♦

♠ Department of Computer Science & Technology, University of Cambridge, United Kingdom
♦ ALTA Institute, University of Cambridge, United Kingdom
| Department of Engineering, University of Cambridge, United Kingdom
{zheng.yuan, marek.rei, [email protected]}, {fs439, [email protected]}

Abstract

In this paper, we describe our submission to the BEA 2019 shared task on grammatical error correction. We present a system pipeline that utilises both error detection and correction models. The input text is first corrected by two complementary neural machine translation systems: one using convolutional networks and multi-task learning, and another using a neural Transformer-based system. Training is performed on publicly available data, along with artificial examples generated through back-translation. The n-best lists of these two machine translation systems are then combined and scored using a finite state transducer (FST). Finally, an unsupervised re-ranking system is applied to the n-best output of the FST. The re-ranker uses a number of error detection features to re-rank the FST n-best list and identify the final 1-best correction hypothesis. Our system achieves 66.75% F0.5 on error correction (ranking 4th), and 82.52% F0.5 on token-level error detection (ranking 2nd) in the restricted track of the shared task.

1 Introduction

Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in written text. In this paper, we describe our submission to the restricted track of the BEA 2019 shared task on grammatical error correction (Bryant et al., 2019), where participating teams are constrained to using only the provided datasets as training data. Systems are expected to correct errors of all types, including grammatical, lexical and orthographical errors. Compared to previous shared tasks on GEC, which have primarily focused on correcting errors committed by non-native speakers (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2013, 2014), a new annotated dataset is introduced, consisting of essays produced by native and non-native English language learners, with a wide coverage of language proficiency levels for the latter, ranging from elementary to advanced.

Neural machine translation (NMT) systems for GEC have drawn growing attention in recent years (Yuan and Briscoe, 2016; Xie et al., 2016; Ji et al., 2017; Sakaguchi et al., 2017; Chollampatt and Ng, 2018; Junczys-Dowmunt et al., 2018), as they have been shown to achieve state-of-the-art results (Ge et al., 2018; Zhao et al., 2019). Within this framework, error correction is cast as a monolingual translation task, where the source is a sentence (written by a language learner) that may contain errors, and the target is its corrected counterpart in the same language.

Due to the fundamental differences between a “true” machine translation task and the error correction task, previous work has investigated the adaptation of NMT for the task of GEC. Byte pair encoding (BPE) (Chollampatt and Ng, 2018; Junczys-Dowmunt et al., 2018) and a copying mechanism (Zhao et al., 2019) have been introduced to deal with the “noisy” input text in GEC and the non-standard language used by learners. Some researchers have investigated ways of incorporating task-specific knowledge, either by directly modifying the training objectives (Schmaltz et al., 2017; Sakaguchi et al., 2017; Junczys-Dowmunt et al., 2018) or by re-ranking machine-translation-system correction hypotheses (Yannakoudakis et al., 2017; Chollampatt and Ng, 2018). To ameliorate the lack of large amounts of error-annotated learner data, various approaches have proposed to leverage unlabelled native data within a number of frameworks, including artificial error generation with back-translation (Rei et al., 2017; Kasewa et al., 2018), fluency boost learning (Ge et al., 2018), and pre-training with denoising autoencoders (Zhao et al., 2019).

Previous work has shown that a GEC system targeting all errors may not necessarily be the best approach to the task, and that different GEC systems may be better suited to correcting different types of errors, and can therefore be complementary (Yuan, 2017). As such, hybrid systems that combine different approaches have been shown to yield improved performance (Felice et al., 2014; Rozovskaya and Roth, 2016; Grundkiewicz and Junczys-Dowmunt, 2018). In line with this work, we present a hybrid approach that 1) employs two NMT-based error correction systems: a neural convolutional system and a neural Transformer-based system; 2) a finite state transducer (FST) that combines and further enriches the n-best outputs of the NMT systems; and 3) a re-ranking system that re-ranks the n-best output of the FST based on error detection features.

The remainder of this paper is organised as follows: Section 2 describes our approach to the task; Section 3 describes the datasets used and presents our results on the shared task development set; Section 4 presents our official results on the shared task test set, including a detailed analysis of the performance of our final system; and, finally, Section 5 concludes the paper and provides an overview of our findings.

2 Approach

We approach the error correction task using a pipeline of systems, as presented in Figure 1. In the following sections, we describe each of these components in detail.

[Figure 1: Overview of our best GEC system pipeline. Input text → CNN / Transformer → FST → Re-ranking → Output text.]
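As a rough, implementation-agnostic illustration of this pipeline (not the authors' actual code), the sketch below composes the stages of Figure 1. All function names (cnn_nbest, transformer_nbest, fst_rescore, rerank) are hypothetical placeholders for the components described in the following sections.

from typing import Callable, List, Tuple

# Minimal sketch of the pipeline in Figure 1 (hypothetical interfaces).
# Two NMT systems each produce an n-best list of correction hypotheses;
# the merged lists are rescored with an FST, and a final re-ranker picks
# the 1-best correction using error detection features.

Hypothesis = Tuple[str, float]  # (corrected sentence, model score)

def correct_sentence(sentence: str,
                     cnn_nbest: Callable[[str], List[Hypothesis]],
                     transformer_nbest: Callable[[str], List[Hypothesis]],
                     fst_rescore: Callable[[str, List[Hypothesis]], List[Hypothesis]],
                     rerank: Callable[[str, List[Hypothesis]], List[Hypothesis]]) -> str:
    """Run one sentence through the full correction pipeline."""
    # 1. n-best correction hypotheses from the two complementary NMT systems.
    hypotheses = cnn_nbest(sentence) + transformer_nbest(sentence)
    # 2. Combine and rescore the candidate set with a finite state transducer.
    rescored = fst_rescore(sentence, hypotheses)
    # 3. Re-rank the FST n-best list using error detection features; keep the 1-best.
    return rerank(sentence, rescored)[0][0]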
2.1 The convolutional neural network (CNN) system

We use a neural sequence-to-sequence model and an encoder–decoder architecture (Cho et al., 2014; Sutskever et al., 2014). An encoder first reads and encodes an entire input sequence x = (x_1, x_2, ..., x_n) into hidden state representations. A decoder then generates an output sequence y = (y_1, y_2, ..., y_m) by predicting the next token y_t based on the input sequence x and all the previously generated tokens {y_1, y_2, ..., y_{t-1}}:

    p(y) = \prod_{t=1}^{m} p(y_t | \{y_1, ..., y_{t-1}\}, x)    (1)

Our convolutional neural system is based on a multi-layer convolutional encoder–decoder model (Gehring et al., 2017), which employs convolutional neural networks (CNNs) to compute intermediate encoder and decoder states. The parameter settings follow Chollampatt and Ng (2018) and Ge et al. (2018). The source and target word embeddings have size 500, and are initialised with fastText embeddings (Bojanowski et al., 2017) trained on the native English Wikipedia corpus (2,405,972,890 tokens). The encoder and the decoder are each made up of seven convolutional layers, with a convolution window width of 3. We apply a left-to-right beam search to find a correction that approximately maximises the conditional probability in Equation 1.
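To make the decoding step concrete, the following is a minimal sketch of left-to-right beam search under the factorisation in Equation 1. The next_token_log_probs function and the default beam size are assumptions of this sketch: in our system the per-token scores would come from the trained convolutional (or Transformer) decoder, and the actual decoding hyperparameters are not shown here.

from typing import Callable, Dict, List, Tuple

# Sketch of left-to-right beam search under Equation 1 (illustrative only).
# `next_token_log_probs(prefix, source)` is assumed to return log p(y_t | y_<t, x)
# for each candidate token y_t.

def beam_search(source: List[str],
                next_token_log_probs: Callable[[List[str], List[str]], Dict[str, float]],
                beam_size: int = 12,
                max_len: int = 100,
                eos: str = "</s>") -> List[str]:
    beams: List[Tuple[List[str], float]] = [([], 0.0)]   # (prefix, summed log-prob)
    finished: List[Tuple[List[str], float]] = []
    for _ in range(max_len):
        candidates: List[Tuple[List[str], float]] = []
        for prefix, score in beams:
            # Expand every live hypothesis with every candidate next token.
            for token, logp in next_token_log_probs(prefix, source).items():
                candidates.append((prefix + [token], score + logp))
        # Keep the `beam_size` highest-scoring expansions; hypotheses that end
        # with the end-of-sentence token move to the finished pool.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    best_prefix, _ = max(finished or beams, key=lambda c: c[1])
    return [token for token in best_prefix if token != eos]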
BPE is introduced to alleviate the rare-word problem: rare and unknown words are split into multiple frequent subword tokens (Sennrich et al., 2016b). NMT systems often limit vocabulary size on both the source and target sides due to the computational complexity of training, and are therefore unable to translate out-of-vocabulary (OOV) words, which are treated as unknown tokens, resulting in poor translation quality. As noted by Yuan and Briscoe (2016), this problem is more serious for GEC, as non-native text contains not only rare words (e.g., proper nouns) but also misspelled words (i.e., spelling errors). In our model, each of the source and target vocabularies consists of the 30K most frequent BPE tokens from the source and target side of the parallel training data respectively. The same BPE operation is applied to the Wikipedia data before it is used to train our word embeddings.

Copying mechanism is a technique that has led to performance improvements on various monolingual sequence-to-sequence tasks, such as text summarisation, dialogue systems, and paraphrase generation (Gu et al., 2016; Cao et al., 2017). The idea is to allow the decoder to choose between simply copying an original input word and outputting a translation word. Since the source and target sentences are both in the same language (i.e., monolingual translation) and most words in the source sentence are correct and do not need to change, GEC seems to benefit from the copying mechanism. Following the work of Gu et al. (2016), we use a dynamic target vocabulary, which contains a fixed vocabulary learned from the target side of the training data plus all the unique tokens introduced by the source sentence. As a result, the probability of generating any target token p(y_t | {y_1, ..., y_{t-1}}, x) in Equation 1 is defined as a “mixture” of the generation probability p(y_t, g | {y_1, ..., y_{t-1}}, x) and the copy probability p(y_t, c | {y_1, ..., y_{t-1}}, x):

    p(y_t | \{y_1, ..., y_{t-1}\}, x) = p(y_t, g | \{y_1, ..., y_{t-1}\}, x) + p(y_t, c | \{y_1, ..., y_{t-1}\}, x)    (2)

For multi-task learning, each token x_i in the source sentence x is aligned with a token y_j in the target sentence y. If x_i = y_j, the source token x_i is correct, while if x_i ≠ y_j, the source token x_i is incorrect. Similarly, the source sentence x is correct if x = y, and incorrect otherwise.

Artificial error generation is the process of injecting artificial errors into a set of error-free sentences. Compared to standard machine translation tasks, GEC suffers from the limited availability of large amounts of training data. As manual error annotation of learner data is a slow and expensive process, artificial error generation has been applied to error correction (Felice and Yuan, 2014) and detection (Rei et al., 2017) with some success.
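As an illustration of this idea (and of the back-translated artificial examples mentioned in the abstract), the sketch below pairs error-free sentences with noisy rewrites produced by a reverse model. The corrupt_nbest function is a hypothetical placeholder for a model trained to translate corrected text into learner-like, errorful text; this is a possible instantiation under stated assumptions, not a description of the authors' exact procedure.

import random
from typing import Callable, List, Tuple

# Sketch of artificial error generation via back-translation (illustrative only).
# `corrupt_nbest(sentence)` is assumed to return candidate errorful rewrites of
# an error-free sentence; sampling from them yields synthetic (source, target)
# training pairs for the correction models.

def generate_artificial_data(error_free_sentences: List[str],
                             corrupt_nbest: Callable[[str], List[str]],
                             samples_per_sentence: int = 1,
                             seed: int = 0) -> List[Tuple[str, str]]:
    rng = random.Random(seed)
    pairs: List[Tuple[str, str]] = []
    for target in error_free_sentences:
        candidates = corrupt_nbest(target)   # hypothetical errorful rewrites of `target`
        if not candidates:
            continue
        for _ in range(samples_per_sentence):
            source = rng.choice(candidates)  # pick one noisy hypothesis
            if source != target:             # keep only pairs that contain an injected error
                pairs.append((source, target))
    return pairs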