Arxiv:2106.01609V2 [Cs.CL] 20 Jun 2021

Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction Piji Li Shuming Shi Tencent AI Lab, Shenzhen, China fpijili,[email protected] Abstract Correct ᷉⠨ㄐ〞ℯ晝ⴷ槗⁳漈I feel very happy today! We investigate the problem of Chinese Gram- Type I ᷉⠨ㄐ〞ℯ柝摾槗⁳漈I feel fly long happy today! 晝ⴷ matical Error Correction (CGEC) and present I am always happy when Type II ᷉⠨ㄐℯ晝ⴷⴷ槗⁳漈I come to Fei today! a new framework named Tail-to-Tail (TtT) 〞 non-autoregressive sequence prediction to ad- Type III ᷉⠨ㄐ晝ⴷ〞ℯ槗⁳漈I very feel happy today! dress the deep issues hidden in CGEC. Con- sidering that most tokens are correct and Figure 1: Illustration for the three types of operations can be conveyed directly from source to tar- to correct the grammatical errors: Type I-substitution; get, and the error positions can be estimated Type II-deletion and insertion; Type III-local paraphras- and corrected based on the bidirectional con- ing. text information, thus we employ a BERT- initialized Transformer Encoder as the back- bone model to conduct information modeling and conveying. Considering that only relying in many natural language processing scenarios such on the same position substitution cannot han- as writing assistant (Ghufron and Rosyida, 2018; dle the variable-length correction cases, vari- Napoles et al., 2017; Omelianchuk et al., 2020), ous operations such substitution, deletion, in- search engine (Martins and Silva, 2004; Gao et al., sertion, and local paraphrasing are required 2010; Duan and Hsu, 2011), speech recognition jointly. Therefore, a Conditional Random systems (Karat et al., 1999; Wang et al., 2020a; Fields (CRF) layer is stacked on the up tail Kubis et al., 2020), etc. Grammatical errors may to conduct non-autoregressive sequence prediction by modeling the token dependencies. appear in all languages (Dale et al., 2012; Xing Since most tokens are correct and easily to et al., 2013; Ng et al., 2014; Rozovskaya et al., be predicted/conveyed to the target, then the 2015; Bryant et al., 2019), in this paper, we only fo- models may suffer from a severe class imbal- cus to tackle the problem of Chinese Grammatical ance issue. To alleviate this problem, focal Error Correction (CGEC) (Chang, 1995). loss penalty strategies are integrated into the loss functions. Moreover, besides the typical We investigate the problem of CGEC and the fix-length error correction datasets, we also related corpora from SIGHAN (Tseng et al., 2015) construct a variable-length corpus to conduct and NLPCC (Zhao et al., 2018) carefully, and we experiments. Experimental results on stan- conclude that the grammatical error types as well arXiv:2106.01609v2 [cs.CL] 20 Jun 2021 dard datasets, especially on the variable-length as the corresponding correction operations can be datasets, demonstrate the effectiveness of TtT categorised into three folds, as shown in Figure1: in terms of sentence-level Accuracy, Precision, (1) Substitution. In reality, Pinyin is the most pop- Recall, and F1-Measure on tasks of error De- ular input method used for Chinese writings. Thus, tection and Correction1. the homophonous character confusion (For exam- 1 Introduction ple, in the case of Type I, the pronunciation of the wrong and correct words are both “FeiChang”) is Grammatical Error Correction (GEC) aims to au- the fundamental reason which causes grammatical tomatically detect and correct the grammatical er- errors (or spelling errors) and can be corrected by rors that can be found in a sentence (Wang et al., substitution operations without changing the whole 2020c). It is a crucial and essential application task sequence structure (e.g., length). Thus, substitution 1Code: https://github.com/lipiji/TtT is a fixed-length (FixLen) operation. (2) Deletion I feel very happy today! ᷉⠨ㄐ〞ℯ晝ⴷ槗⁳᷉⠨ㄐ〞ℯ晝ⴷ槗⁳ 漈漈 ᷉⠨ㄐ〞ℯ晝ⴷ槗⁳ 漈 ᷉⠨ㄐ〞ℯ柝摾槗⁳ 漈᷉⠨ㄐℯ晝ⴷⴷ槗⁳漈 ᷉⠨ㄐ晝ⴷ〞ℯ槗⁳漈 I feel fly long happy today! I am always happy when I I very feel happy today! come to Fei today! Type I Type II Type III Figure 2: Illustration of the token information flows from the bottom tail to the up tail. and Insertion. These two operations are used to the vocabulary V to about three times by adding handle the cases of word redundancies and omis- “insertion-” and “substitution-” prefixes to the orig- sions respectively. (3) Local paraphrasing. Some- inal tokens (e.g., “insertion-good”, “substitution- times, light operations such as substitution, dele- paper”) which decrease the computing efficiency tion, and insertion cannot correct the errors directly, dramatically. Moreover, the pure tagging frame- therefore, a slightly subsequence paraphrasing is work needs to conduct multi-pass prediction until required to reorder partial words of the sentence, no more operations are predicted, which is ineffi- the case is shown in Type III of Figure1. Deletion, cient and less elegant. Recently, many researchers insertion, and local paraphrasing can be regarded as fine-tune the pre-trained language models such as variable-length (VarLen) operations because they BERT on the task of CGEC and obtain reason- may change the sentence length. able results (Zhao et al., 2019; Hong et al., 2019; Zhang et al., 2020b). However, limited by the However, over the past few years, although a BERT framework, most of them can only address number of methods have been developed to deal the fixed-length correcting scenarios and cannot with the problem of CGEC, some crucial and es- conduct deletion, insertion, and local paraphrasing sential aspects are still uncovered. Generally, se- operations flexibly. quence translation and sequence tagging are the two most typical technical paradigms to tackle Moreover, during the investigations, we also the problem of CGEC. Benefiting from the devel- observe an obvious but crucial phenomenon for opment of neural machine translation (Bahdanau CGEC that most words in a sentence are correct et al., 2015; Vaswani et al., 2017), attention-based and need not to be changed. This phenomenon is seq2seq encoder-decoder frameworks have been depicted in Figure2, where the operation flow is introduced to address the CGEC problem in a se- from the bottom tail to the up tail. Grey dash lines quence translation manner (Wang et al., 2018; Ge represent the “Keep” operations and the red solid et al., 2018; Wang et al., 2019, 2020b; Kaneko lines indicate those three types of correcting oper- et al., 2020). Seq2seq based translation models ations mentioned above. On one side, intuitively, are easily to be trained and can handle all types of the target CGEC model should have the ability of correcting operations above mentioned. However, directly moving the correct tokens from bottom tail considering the exposure bias issue (Ranzato et al., to up tail, then Transformer(Vaswani et al., 2017) 2016; Zhang et al., 2019), the generated results based encoder (say BERT) seems to be a preference. usually suffer from the phenomenon of hallucina- On the other side, considering that almost all typi- tion (Nie et al., 2019; Maynez et al., 2020) and cal CGEC models are built based on the paradigms cannot be faithful to the source text, even though of sequence tagging or sequence translation, Maxi- copy mechanisms (Gu et al., 2016) are incorpo- mum Likelihood Estimation (MLE) (Myung, 2003) rated (Wang et al., 2019). Therefore, Omelianchuk is usually used as the parameter learning approach, et al.(2020) and Liang et al.(2020) propose to which in the scenario of CGEC, will suffer from purely employ tagging to conduct the problem of a severe class/tag imbalance issue. However, no GEC instead of generation. All correcting opera- previous works investigate this problem thoroughly tions such as deletion, insertion, and substitution on the task of CGEC. can be guided by the predicted tags. Neverthe- To conquer all above-mentioned challenges, we less, the pure tagging strategy requires to extend propose a new framework named tail-to-tail non- Input Nx Output ݔଵ ݋ଵ ݕଵ Multi-Head Multi-Head Self-Attention Layer ݔଶ ݋ ݕଶ Feed-Forward Feed-Forward Layer ଶ Embedding Embedding Layer ݔଷ ݋ଷ ݕଷ eos> ݋ସ ݕସ> mask> ݋ହ ݕହ> <mask> ݋଺ <eos> Bottom Tail Transformer Layers Conditional Random Fields (CRF) Up Tail Figure 3: The proposed tail-to-tail non-autoregressive sequence prediction framework (TtT). autoregressive sequence prediction, which abbrevi- tence X = (x1; x2; : : : ; xT ) which contains gram- ated as TtT, for the problem of CGEC. Specifically, matical errors, where xi denotes each token (Chi- to directly move the token information from the nese character) in the sentence, and T is the length bottom tail to the up tail, a BERT based sequence of X. The objective of the task grammatical error encoder is introduced to conduct bidirectional rep- correction is to correct all errors in X and gener- resentation learning. In order to conduct substi- ate a new sentence Y = (y1; y2; : : : ; yT 0 ). Here, it tution, deletion, insertion, and local paraphrasing is important to emphasize that T is not necessary simultaneously, inspired by (Sun et al., 2019; Su equal to T 0. Therefore, T 0 can be =, >, or < T . et al., 2021), a Conditional Random Fields (CRF) Bidirectional semantic modeling and bottom-to-up (Lafferty et al., 2001) layer is stacked on the up tail directly token information conveying are conducted to conduct non-autoregressive sequence prediction by several Transformer (Vaswani et al., 2017) lay- by modeling the dependencies among neighbour ers. A Conditional Random Fields (CRF) (Lafferty tokens. Focal loss penalty strategy (Lin et al., 2020) et al., 2001) layer is stacked on the up tail to con- is adopted to alleviate the class imbalance problem duct the non-autoregressive sequence generation by considering that most of the tokens in a sentence modeling the dependencies among neighboring to- are not changed.

Load more