
Spelling Error Correction with Soft-Masked BERT

Shaohua Zhang1, Haoran Huang1, Jicong Liu2 and Hang Li1
1ByteDance AI Lab
2School of Computer Science and Technology, Fudan University
{zhangshaohua.cs,huanghaoran,lihang.lh}@bytedance.com
[email protected]

Abstract

Spelling error correction is an important yet challenging task because a satisfactory solution of it essentially needs human-level language understanding ability. Without loss of generality we consider Chinese spelling error correction (CSC) in this paper. A state-of-the-art method for the task selects a character from a list of candidates for correction (including non-correction) at each position of the sentence on the basis of BERT, the language representation model. The accuracy of the method can be sub-optimal, however, because BERT does not have sufficient capability to detect whether there is an error at each position, apparently due to the way of pre-training it using mask language modeling. In this work, we propose a novel neural architecture to address the aforementioned issue, which consists of a network for error detection and a network for error correction based on BERT, with the former being connected to the latter with what we call the soft-masking technique. Our method of using 'Soft-Masked BERT' is general, and it may be employed in other language detection-correction problems. Experimental results on two datasets demonstrate that the performance of our proposed method is significantly better than the baselines, including the one solely based on BERT.

Table 1: Examples of Chinese spelling errors

    Wrong:   埃及有金子塔。(Egypt has golden towers.)
    Correct: 埃及有金字塔。(Egypt has pyramids.)

    Wrong:   他的求胜欲很强,为了越狱在挖洞。(He has a strong desire to win and is digging a hole for a prison break.)
    Correct: 他的求生欲很强,为了越狱在挖洞。(He has a strong desire to survive and is digging a hole for a prison break.)

1 Introduction

Spelling error correction is an important task which aims to correct spelling errors in a text either at word-level or at character-level (Yu and Li, 2014; Yu et al., 2014; Zhang et al., 2015; Wang et al., 2018b; Hong et al., 2019; Wang et al., 2019). It is crucial for many natural language applications such as search (Martins and Silva, 2004; Gao et al., 2010), optical character recognition (OCR) (Afli et al., 2016; Wang et al., 2018b), and essay scoring (Burstein and Chodorow, 1999). In this paper, we consider Chinese spelling error correction (CSC) at character-level.

Spelling error correction is also a very challenging task, because to completely solve the problem the system needs to have human-level language understanding ability. There are at least two challenges here, as shown in Table 1. First, world knowledge is needed for spelling error correction. The character 字 in the first sentence is mistakenly written as 子, where 金子塔 means golden tower and 金字塔 means pyramid. Humans can correct the typo by referring to world knowledge. Second, sometimes inference is also required. In the second sentence, the 4-th character 生 is mistakenly written as 胜. In fact, 胜 and the surrounding characters form a new valid word 求胜欲 (desire to win), rather than the intended word 求生欲 (desire to survive).

Many methods have been proposed for CSC or, more generally, spelling error correction. Previous approaches can be mainly divided into two categories: one employs traditional machine learning and the other deep learning (Yu et al., 2014; Tseng et al., 2015; Wang et al., 2018b). Zhang et al. (2015), for example, proposed a unified framework for CSC consisting of a pipeline of error detection, candidate generation, and final candidate selection using traditional machine learning. Wang et al. (2019) proposed a Seq2Seq model with copy mechanism which transforms an input sentence into a new sentence with spelling errors corrected.

More recently, BERT (Devlin et al., 2018), the language representation model, has been successfully applied to many language understanding tasks including CSC (cf., Hong et al., 2019). In the state-of-the-art method using BERT, a character-level BERT is first pre-trained using a large unlabelled dataset and then fine-tuned using a labeled dataset. The labeled data can be obtained via data augmentation in which examples of spelling errors are generated using a large confusion table. Finally the model is utilized to predict the most likely character from a list of candidates at each position of the given sentence. The method is powerful because BERT has a certain ability to acquire knowledge for language understanding. Our experimental results show that the accuracy of the method can be further improved, however. One observation is that the error detection capability of the model is not sufficiently high, and once an error is detected, the model has a better chance to make a right correction. We hypothesize that this might be due to the way of pre-training BERT with mask language modeling, in which only about 15% of the characters in the text are masked, and thus it only learns the distribution of masked tokens and tends to choose not to make any correction. This phenomenon is prevalent and represents a fundamental challenge for using BERT in certain tasks like spelling error correction.

To address the above issue, we propose a novel neural architecture in this work, referred to as Soft-Masked BERT. Soft-Masked BERT contains two networks, a detection network and a correction network based on BERT. The correction network is similar to that in the method of solely using BERT. The detection network is a Bi-GRU network that predicts the probability that the character is an error at each position. The probability is then utilized to conduct soft-masking of the embedding of the character at that position. Soft masking is an extension of conventional 'hard masking' in the sense that the former degenerates to the latter when the probability of error equals one. The soft-masked embedding at each position is then inputted into the correction network. The correction network conducts error correction using BERT. This approach can force the model to learn the right context for error correction with the help of the detection network, during end-to-end joint training.

We conducted experiments to compare Soft-Masked BERT and several baselines including the method of using BERT alone. As datasets we utilized the benchmark dataset of SIGHAN. We also created a large and high-quality dataset for evaluation named News Title. The dataset, which contains titles of news articles, is ten times larger than the previous datasets. Experimental results show that Soft-Masked BERT significantly outperforms the baselines on the two datasets in terms of accuracy measures.

The contributions of this work include (1) proposal of the novel neural architecture Soft-Masked BERT for the CSC problem, and (2) empirical verification of the effectiveness of Soft-Masked BERT.

2 Our Approach

2.1 Problem and Motivation

Chinese spelling error correction (CSC) can be formalized as the following task. Given a sequence of n characters (or words) X = (x_1, x_2, ..., x_n), the goal is to transform it into another sequence of characters Y = (y_1, y_2, ..., y_n) with the same length, where the incorrect characters in X are replaced with the correct characters to obtain Y. The task can be viewed as a sequential labeling problem in which the model is a mapping function f : X → Y. The task is an easier one, however, in the sense that usually no or only a few characters need to be replaced and all or most of the characters should be copied.

The state-of-the-art method for CSC is to employ BERT to accomplish the task. Our preliminary experiments show that the performance of the approach can be improved if the erroneous characters are designated (cf., Section 3.6). In general the BERT-based method tends to make no correction (or just copy the original characters). Our interpretation is that in pre-training of BERT only 15% of the characters are masked for prediction, resulting in the learning of a model which does not possess enough capacity for error detection. This motivates us to devise a new model.
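To make the formalization concrete, the per-character detection labels used later in Section 2.3 can be read off directly from a training pair (X, Y). The small snippet below is an illustration added here, not code from the paper.

    def detection_labels(x: str, y: str) -> list[int]:
        """Derive per-character labels g_i from an (X, Y) pair of equal length:
        1 if the character must be corrected, 0 if it should be copied."""
        assert len(x) == len(y), "CSC keeps the sequence length unchanged"
        return [int(xc != yc) for xc, yc in zip(x, y)]

    # Example from Table 1: only the character 子 (position 5) must be corrected to 字.
    print(detection_labels("埃及有金子塔。", "埃及有金字塔。"))  # [0, 0, 0, 0, 1, 0, 0]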

2.2 Model

We propose a novel neural network model called Soft-Masked BERT for CSC, as illustrated in Figure 1. Soft-Masked BERT is composed of a detection network based on Bi-GRU and a correction network based on BERT. The detection network predicts the probabilities of errors and the correction network predicts the probabilities of error corrections, while the former passes its prediction results to the latter using soft masking.

Figure 1: Architecture of Soft-Masked BERT

More specifically, our method first creates an embedding for each character in the input sentence, referred to as the input embedding. Next, it takes the sequence of embeddings as input and outputs the probabilities of errors for the sequence of characters (embeddings) using the detection network. After that it calculates the weighted sum of the input embeddings and [MASK] embeddings, weighted by the error probabilities. The calculated embeddings mask the likely errors in the sequence in a soft way. Then, our method takes the sequence of soft-masked embeddings as input and outputs the probabilities of error corrections using the correction network, which is a BERT model whose final layer consists of a softmax function for all characters. There is also a residual connection between the input embeddings and the embeddings at the final layer. Next, we describe the details of the model.

2.3 Detection Network

The detection network is a sequential binary labeling model. The input is the sequence of embeddings E = (e_1, e_2, ..., e_n), where e_i denotes the embedding of character x_i, which is the sum of word embedding, position embedding, and segment embedding of the character, as in BERT. The output is a sequence of labels G = (g_1, g_2, ..., g_n), where g_i denotes the label of the i-th character: 1 means the character is incorrect and 0 means it is correct. For each character there is a probability p_i representing the likelihood of its label being 1. The higher p_i is, the more likely the character is incorrect.

In this work, we realize the detection network as a bidirectional GRU (Bi-GRU). For each character of the sequence, the probability of error p_i is defined as

    p_i = P_d(g_i = 1 | X) = σ(W_d h_i^d + b_d)    (1)

where P_d(g_i = 1 | X) denotes the conditional probability given by the detection network, σ denotes the sigmoid function, h_i^d denotes the hidden state of Bi-GRU, and W_d and b_d are parameters. Furthermore, the hidden state is defined as

    →h_i^d = GRU(→h_{i-1}^d, e_i)    (2)
    ←h_i^d = GRU(←h_{i+1}^d, e_i)    (3)
    h_i^d = [→h_i^d ; ←h_i^d]        (4)

where [→h_i^d ; ←h_i^d] denotes the concatenation of the GRU hidden states from the two directions and GRU is the GRU function.
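As an illustration (ours, not the authors' released implementation), the detection network of Eqs. (1)-(4) can be sketched in PyTorch as follows; the embedding size of 768 is an assumption matching BERT-base, while the Bi-GRU hidden size of 256 follows the setting reported in Section 3.3.

    import torch
    import torch.nn as nn

    class DetectionNetwork(nn.Module):
        """Bi-GRU sequence labeler producing a per-character error probability p_i (Eqs. 1-4)."""

        def __init__(self, embed_dim: int = 768, hidden_size: int = 256):
            super().__init__()
            self.bi_gru = nn.GRU(embed_dim, hidden_size,
                                 batch_first=True, bidirectional=True)
            self.linear = nn.Linear(2 * hidden_size, 1)  # W_d, b_d in Eq. (1)

        def forward(self, input_embeddings: torch.Tensor) -> torch.Tensor:
            # input_embeddings: (batch, seq_len, embed_dim), the summed
            # word/position/segment embeddings e_i described above.
            hidden, _ = self.bi_gru(input_embeddings)  # [fwd ; bwd] states, Eqs. (2)-(4)
            # p_i = sigma(W_d h_i^d + b_d), Eq. (1); shape (batch, seq_len)
            return torch.sigmoid(self.linear(hidden)).squeeze(-1)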

Soft masking amounts to a weighted sum of input embeddings and mask embeddings with the error probabilities as weights. The soft-masked embedding e'_i for the i-th character is defined as

    e'_i = p_i · e_mask + (1 − p_i) · e_i    (5)

where e_i is the input embedding and e_mask is the mask embedding. If the probability of error is high, then the soft-masked embedding e'_i is close to the mask embedding e_mask; otherwise it is close to the input embedding e_i.

2.4 Correction Network

The correction network is a sequential multi-class labeling model based on BERT. The input is the sequence of soft-masked embeddings E' = (e'_1, e'_2, ..., e'_n) and the output is a sequence of characters Y = (y_1, y_2, ..., y_n).

BERT consists of a stack of 12 identical blocks taking the entire sequence as input. Each block contains a multi-head self-attention operation followed by a feed-forward network, defined as

    MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (6)
    head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)           (7)
    FFN(X) = max(0, X W_1 + b_1) W_2 + b_2                  (8)

where Q, K, and V are the same matrices representing the input sequence or the output of the previous block; MultiHead, Attention, and FFN denote multi-head self-attention, self-attention, and feed-forward network respectively; and W^O, W_i^Q, W_i^K, W_i^V, W_1, W_2, b_1, and b_2 are parameters. We denote the sequence of hidden states at the final layer of BERT as H^c = (h^c_1, h^c_2, ..., h^c_n).

For each character of the sequence, the probability of error correction is defined as

    P_c(y_i = j | X) = softmax(W h'_i + b)[j]    (9)

where P_c(y_i = j | X) is the conditional probability that character x_i is corrected as character j in the candidate list, softmax is the softmax function, h'_i denotes the hidden state, and W and b are parameters. Here the hidden state h'_i is obtained by linear combination with the residual connection,

    h'_i = h^c_i + e_i    (10)

where h^c_i is the hidden state at the final layer and e_i is the input embedding of character x_i. The last layer of the correction network exploits a softmax function. The character that has the largest probability is selected from the list of candidates as the output for character x_i.

2.5 Learning

The learning of Soft-Masked BERT is conducted end-to-end, provided that BERT is pre-trained and training data is given which consists of pairs of original sequence and corrected sequence, {(X_1, Y_1), (X_2, Y_2), ..., (X_N, Y_N)}. One way to create the training data is to repeatedly generate a sequence X_i containing errors given a sequence Y_i without an error, using a confusion table, where i = 1, 2, ..., N.

The learning process is driven by optimizing two objectives, corresponding to error detection and error correction respectively:

    L_d = − Σ_{i=1}^{n} log P_d(g_i | X)    (11)
    L_c = − Σ_{i=1}^{n} log P_c(y_i | X)    (12)

where L_d is the objective for training the detection network, and L_c is the objective for training the correction network (and also the final decision). The two functions are linearly combined as the overall objective in learning,

    L = λ · L_c + (1 − λ) · L_d    (13)

where λ ∈ [0, 1] is a coefficient.
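To tie Sections 2.3-2.5 together, the following sketch (again ours; `embedding`, `encoder`, `detector`, `mask_token_id`, and the 768-dimensional hidden size are stand-ins or assumptions rather than details taken from the paper) shows how soft masking (Eq. 5), the residual correction head (Eqs. 9-10), and the combined objective (Eq. 13) fit into one forward pass.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SoftMaskedBert(nn.Module):
        """Sketch of Soft-Masked BERT: detection -> soft masking -> BERT correction."""

        def __init__(self, embedding: nn.Module, encoder: nn.Module,
                     detector: nn.Module, vocab_size: int,
                     mask_token_id: int, hidden_size: int = 768):
            super().__init__()
            self.embedding = embedding      # produces input embeddings e_i
            self.encoder = encoder          # pre-trained BERT blocks, returns final states H^c
            self.detector = detector        # e.g. DetectionNetwork above, returns p_i
            self.classifier = nn.Linear(hidden_size, vocab_size)  # W, b in Eq. (9)
            self.mask_token_id = mask_token_id

        def forward(self, input_ids: torch.Tensor):
            e = self.embedding(input_ids)                                   # e_i
            p = self.detector(e)                                            # Eq. (1)
            e_mask = self.embedding(torch.full_like(input_ids, self.mask_token_id))
            e_soft = p.unsqueeze(-1) * e_mask + (1 - p.unsqueeze(-1)) * e   # Eq. (5)
            h_c = self.encoder(e_soft)           # assumed to return (batch, seq_len, hidden)
            logits = self.classifier(h_c + e)    # residual connection, Eqs. (9)-(10)
            return p, logits

    def joint_loss(p, logits, labels, correct_ids, lam: float = 0.8):
        """Linear combination of correction and detection losses, Eq. (13)."""
        l_d = F.binary_cross_entropy(p, labels.float())             # Eq. (11)
        l_c = F.cross_entropy(logits.transpose(1, 2), correct_ids)  # Eq. (12)
        return lam * l_c + (1 - lam) * l_d

With λ = 0.8, the best value reported in Section 3.5, most of the weight is placed on the correction objective while the detection objective still provides a direct training signal for the Bi-GRU.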

3 Experimental Results

3.1 Datasets

We made use of the SIGHAN dataset, a benchmark for CSC.[1] SIGHAN is a small dataset containing 1,100 texts and 461 types of errors (characters). The texts are collected from the essay section of the Test of Chinese as Foreign Language, and the topics are in a narrow scope. We adopted the standard split of training, development, and test data of SIGHAN.

[1] Following the common practice (Wang et al., 2019), we converted the characters in the dataset from traditional Chinese to simplified Chinese.

We also created a much larger dataset for testing and development, referred to as News Title. We sampled the titles of news articles at Toutiao, a Chinese news app with a large variety of content in politics, entertainment, sports, education, etc. To ensure that the dataset contains a sufficient number of incorrect sentences, we conducted the sampling from lower-quality texts, and thus the error rate of the dataset is higher than usual. Three people conducted five rounds of labeling to carefully correct the spelling errors in the titles. The dataset contains 15,730 texts. There are 5,423 texts containing errors, in 3,441 types. We divided the data into a test set and a development set, each containing 7,865 texts.

In addition, we followed the common practice in CSC to automatically generate a dataset for training. We first crawled about 5 million news titles at the Chinese news app. We also created a confusion table in which each character is associated with a number of homophonous characters as potential errors. Next, we randomly replaced 15% of the characters in the texts with other characters to artificially generate errors, where 80% of them are homophonous characters in the table and 20% of them are random characters. This is because in practice about 80% of spelling errors in Chinese are homophonous characters, due to the use of Pinyin-based input methods.
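The corruption procedure described above can be sketched as follows; the function and the `confusion` dictionary format are illustrative assumptions, not the authors' data-generation code.

    import random

    def corrupt(title, confusion, charset, replace_rate=0.15, homophone_rate=0.8):
        """Inject artificial spelling errors into a correct title (Section 3.1):
        replace ~15% of characters, 80% of the time with a homophone from the
        confusion table and otherwise with a random character."""
        chars = list(title)
        for i, ch in enumerate(chars):
            if random.random() >= replace_rate:
                continue  # keep ~85% of characters unchanged
            if ch in confusion and random.random() < homophone_rate:
                chars[i] = random.choice(confusion[ch])  # homophonous error
            else:
                chars[i] = random.choice(charset)        # random-character error
        return "".join(chars)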
3.2 Baselines

For comparison, we adopted the following methods as baselines. We report the results of the methods from their original papers.

NTOU is a method using an n-gram model and a rule-based classifier (Tseng et al., 2015). NCTU-NTUT is a method utilizing word vectors and conditional random fields (Tseng et al., 2015). HanSpeller++ is a unified framework employing a hidden Markov model to generate candidates and a filter to re-rank candidates (Zhang et al., 2015). Hybrid uses a BiLSTM-based model trained on a generated dataset (Wang et al., 2018b). Confusionset is a Seq2Seq model consisting of a pointer network and copy mechanism (Wang et al., 2019). FASPell adopts a Seq2Seq model for CSC employing BERT as a denoising auto-encoder and a decoder (Hong et al., 2019). BERT-Pretrain is the method of using a pre-trained BERT. BERT-Finetune is the method of using a fine-tuned BERT.

3.3 Experiment Setting

As evaluation measures, we utilized sentence-level accuracy, precision, recall, and F1 score, as in most of the previous work. We evaluated the accuracy of a method in both detection and correction. Obviously correction is more difficult than detection, because the former is dependent on the latter.

The pre-trained BERT model utilized in the experiments is the one provided at https://github.com/huggingface/transformers. In fine-tuning of BERT, we kept the default hyper-parameters and only fine-tuned the parameters using Adam. In order to reduce the impact of training tricks, we did not use the dynamic learning rate strategy and maintained a learning rate of 2e-5 in fine-tuning. The size of the hidden unit in Bi-GRU is 256 and all models use a batch size of 320.

In the experiments on SIGHAN, for all BERT-based models, we first fine-tuned the model with the 5 million training examples and then continued the fine-tuning with the training examples in SIGHAN. We removed the unchanged texts in the training data to improve the efficiency. In the experiments on News Title, the models were fine-tuned only with the 5 million training examples.

The development sets were utilized for hyper-parameter tuning for both SIGHAN and News Title. The best value for the hyper-parameter λ was chosen for each dataset.
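The paper does not spell out the exact sentence-level scoring script, and conventions differ across CSC papers; as one plausible reading of "sentence-level precision, recall, and F1" for the correction task, a hypothetical implementation might look like this.

    def sentence_level_correction_prf(sources, predictions, references):
        """Illustrative sentence-level convention (an assumption, not the paper's script):
        precision over sentences the model changed, recall over sentences with errors,
        where a sentence counts as correct only if the prediction matches the reference."""
        triples = list(zip(sources, predictions, references))
        changed = [(p, r) for s, p, r in triples if p != s]   # model proposed a change
        with_err = [(p, r) for s, p, r in triples if s != r]  # sentence contains errors
        precision = sum(p == r for p, r in changed) / len(changed) if changed else 0.0
        recall = sum(p == r for p, r in with_err) / len(with_err) if with_err else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

Detection-level scores can be computed analogously by comparing the sets of changed character positions instead of the full output sentences.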

Table 2: Performances of Different Methods on CSC

Test Set    Method                 Detection                    Correction
                                   Acc.  Prec.  Rec.  F1.       Acc.  Prec.  Rec.  F1.
SIGHAN      NTOU (2015)            42.2  42.2   41.8  42.0      39.0  38.1   35.2  36.6
            NCTU-NTUT (2015)       60.1  71.7   33.6  45.7      56.4  66.3   26.1  37.5
            HanSpeller++ (2015)    70.1  80.3   53.3  64.0      69.2  79.7   51.5  62.5
            Hybrid (2018b)          -    56.6   69.4  62.3       -     -      -    57.1
            FASPell (2019)         74.2  67.6   60.0  63.5      73.7  66.6   59.1  62.6
            Confusionset (2019)     -    66.8   73.1  69.8       -    71.5   59.5  64.9
            BERT-Pretrain           6.8   3.6    7.0   4.7       5.2   2.0    3.8   2.6
            BERT-Finetune          80.0  73.0   70.8  71.9      76.6  65.9   64.0  64.9
            Soft-Masked BERT       80.9  73.7   73.2  73.5      77.4  66.7   66.2  66.4
News Title  BERT-Pretrain           7.1   1.3    3.6   1.9       0.6   0.6    1.6   0.8
            BERT-Finetune          80.0  65.0   61.5  63.2      76.8  55.3   52.3  53.8
            Soft-Masked BERT       80.8  65.5   64.0  64.8      77.6  55.8   54.5  55.2

Table 3: Impact of Different Sizes of Training Data

Train Set   Method              Detection                    Correction
                                Acc.  Prec.  Rec.  F1.       Acc.  Prec.  Rec.  F1.
500,000     BERT-Finetune       71.8  49.6   48.2  48.9      67.4  36.5   35.5  36.0
            Soft-Masked BERT    72.3  50.3   49.6  50.0      68.2  37.9   37.4  37.6
1,000,000   BERT-Finetune       74.2  54.7   51.3  52.9      70.0  41.6   39.0  40.3
            Soft-Masked BERT    75.3  56.3   54.2  55.2      71.1  43.6   41.9  42.7
2,000,000   BERT-Finetune       77.0  59.7   57.0  58.3      73.1  48.0   45.8  46.9
            Soft-Masked BERT    77.6  60.0   58.5  59.2      73.7  48.4   47.3  47.8
5,000,000   BERT-Finetune       80.0  65.0   61.5  63.2      76.8  55.3   52.3  53.8
            Soft-Masked BERT    80.8  65.5   64.0  64.8      77.6  55.8   54.5  55.2

3.4 Main Results

Table 2 presents the experimental results of all methods on the two test datasets. From the table, one can observe that the proposed model Soft-Masked BERT significantly outperforms the baseline methods on both datasets. Particularly, on News Title, Soft-Masked BERT performs much better than the baselines in terms of all measures. On the News Title dataset, recall at the correction level is greater than 54%, meaning that more than 54% of the errors are found and corrected, and precision at the correction level is better than 55%.

HanSpeller++ achieves the highest precision on SIGHAN, apparently because it can eliminate false detections with its large number of manually-crafted rules and features. Although the use of rules and features is effective, the method has high development cost and may also have difficulties in generalization and adaptation. In some sense, it is not directly comparable with the other learning-based methods including Soft-Masked BERT. The results of all methods except Confusionset are at the sentence level, not at the character level. (The results at the character level can look better.) Nonetheless, Soft-Masked BERT still performs significantly better.

The three methods using BERT, namely Soft-Masked BERT, BERT-Finetune, and FASPell, perform better than the other baselines, while the method of BERT-Pretrain performs fairly poorly. The results indicate that BERT without fine-tuning (i.e., BERT-Pretrain) would not work and BERT with fine-tuning (i.e., BERT-Finetune, etc.) can boost the performance remarkably. Here we see another successful application of BERT, which can acquire a certain amount of knowledge for language understanding. Furthermore, Soft-Masked BERT beats BERT-Finetune by large margins on both datasets. The results suggest that error detection is important for the utilization of BERT in CSC and that soft masking is really an effective means.

3.5 Effect of Hyper Parameter

We present the results of Soft-Masked BERT on (the test data of) News Title to illustrate the effect of the hyper-parameter and of data size.

Table 3 shows the results of Soft-Masked BERT as well as BERT-Finetune learned with different sizes of training data. One can find that the best result is obtained for Soft-Masked BERT when the size is 5 million, indicating that the more training data is utilized, the higher the performance that can be achieved. One can also observe that Soft-Masked BERT is consistently superior to BERT-Finetune.

Table 5 presents the results of Soft-Masked BERT for different values of the hyper-parameter λ. A larger λ value means a higher weight on error correction. Error detection is an easier task than error correction, because essentially the former is a binary classification problem while the latter is a multi-class classification problem. The highest F1 score is obtained when λ is 0.8, which means that a good compromise between detection and correction is reached.

3.6 Ablation Study

We carried out an ablation study on Soft-Masked BERT on both datasets. Table 4 shows the results on News Title. (We omit the results on SIGHAN, which show similar trends, due to space limitation.)

Table 4: Ablation Study of Soft-Masked BERT on News Title

Method                            Detection                    Correction
                                  Acc.  Prec.  Rec.  F1.       Acc.  Prec.  Rec.  F1.
BERT-Finetune+Force (Upper Bound) 89.9  75.6   90.3  82.3      82.9  58.4   69.8  63.6
Soft-Masked BERT                  80.8  65.5   64.0  64.8      77.6  55.8   54.5  55.2
Soft-Masked BERT-R                81.0  75.2   53.9  62.8      78.4  64.6   46.3  53.9
Rand-Masked BERT                  70.9  46.6   48.5  47.5      68.1  38.8   40.3  39.5
BERT-Finetune                     80.0  65.0   61.5  63.2      76.8  55.3   52.3  53.8
Hard-Masked BERT (0.95)           80.6  65.3   63.2  64.2      76.7  53.6   51.8  52.7
Hard-Masked BERT (0.9)            77.4  57.8   60.3  59.0      72.4  44.0   45.8  44.9
Hard-Masked BERT (0.7)            65.3  38.0   50.9  43.5      58.9  24.2   32.5  27.7

Table 5: Impact of Different Values of λ

λ      Detection                    Correction
       Acc.  Prec.  Rec.  F1.       Acc.  Prec.  Rec.  F1.
0.8    72.3  50.3   49.6  50.0      68.2  37.9   37.4  37.6
0.5    72.3  50.0   49.3  49.7      68.0  37.5   37.0  37.3
0.2    71.5  48.6   50.4  49.5      66.9  35.7   37.1  36.4

In Soft-Masked BERT-R, the residual connection in the model is removed. In Hard-Masked BERT, if the error probability given by the detection network exceeds a threshold (0.95, 0.9, or 0.7), then the embedding of the current character is set to the embedding of the [MASK] token; otherwise the embedding remains unchanged. In Rand-Masked BERT, the error probability is randomized with a value between 0 and 1. We can see that all the major components of Soft-Masked BERT are necessary for achieving high performance. We also tried 'BERT-Finetune + Force', whose performance can be viewed as an upper bound. In this method, we let BERT-Finetune make predictions only at the positions where there is an error and select a character from the rest of the candidate list. The result indicates that there is still large room for Soft-Masked BERT to make improvement.

3.7 Discussions

We observed that Soft-Masked BERT is able to make more effective use of global context information than BERT-Finetune. With soft masking the likely errors are identified, and as a result the model can better leverage the power of BERT to make sensible reasoning for error correction by referring to not only the local context but also the global context. For example, there is a typo in the sentence '我会说一点儿,不过一个汉子也看不懂,所以我迷路了' (I can speak a little Chinese, but I don't understand a single 'man', so I got lost.). The word '汉子' (man) is incorrect and should be written as '汉字' (Chinese character). BERT-Finetune cannot rectify the mistake, but Soft-Masked BERT can, because the error correction can only be accurately conducted with global context information.

We also found that there are two major types of errors made by almost all methods including Soft-Masked BERT, which affect the performance. For error statistics, we sampled 100 errors from the test set. We found that 67% of the errors require strong reasoning ability, 11% of the errors are due to lack of world knowledge, and the remaining 22% of the errors have no significant type.

The first type of error is due to lack of inference ability. Accurate correction of such typos requires stronger inference ability. For example, for the sentence '他主动拉了姑娘的手, 心里很高心, 嘴上故作生气' (He intentionally took the girl's hand, was very x, but pretended to be angry.), where the incorrect word 'x' is not comprehensible, there are two possible corrections: changing '高心' to '寒心' (chilled) or changing '高心' to '高兴' (happy), of which the latter is more reasonable for humans. One can see that in order to make more reliable corrections, the models must have stronger inference ability.

The second type of error is due to lack of world knowledge. For example, in the sentence '芜湖: 女子落入青戈江,众人齐救援' (Wuhu: a woman fell into the Qingge River, and people tried hard to rescue her.), '青戈江' (Qingge River) is a typo of '青弋江' (Qingyi River). Humans can discover the typo because the river in Wuhu, China is called Qingyi, not Qingge.

It is still very challenging for existing models and general AI systems to detect and correct such kinds of errors.

4 Related Work

Various studies have been conducted on spelling error correction so far, which plays an important role in many applications, including search (Gao et al., 2010), optical character recognition (OCR) (Afli et al., 2016), and essay scoring (Burstein and Chodorow, 1999).

Chinese spelling error correction (CSC) is a special case, but is more challenging due to its conflation with Chinese word segmentation, and it has received a considerable number of investigations (Yu et al., 2014; Yu and Li, 2014; Tseng et al., 2015; Wang et al., 2019). Early work in CSC followed the pipeline of error detection, candidate generation, and final candidate selection. Some researchers employed unsupervised methods using language models and rules (Yu and Li, 2014; Tseng et al., 2015), while others viewed it as a sequential labeling problem and employed conditional random fields or hidden Markov models (Tseng et al., 2015; Zhang et al., 2015). Recently, deep learning has been applied to spelling error correction (Guo et al., 2019; Wang et al., 2019); for example, a Seq2Seq model with BERT as encoder was employed (Hong et al., 2019), which transforms the input sentence into a new sentence with spelling errors corrected.

BERT (Devlin et al., 2018) is a language representation model with the Transformer encoder as its architecture. BERT is first pre-trained using a very large corpus in a self-supervised fashion (mask language modeling and next sentence prediction). Then, it is fine-tuned using a small amount of labeled data in a down-stream task. Since its inception BERT has demonstrated superior performance in almost all the language understanding tasks, such as those in the GLUE challenge (Wang et al., 2018a). BERT has shown a strong ability to acquire and utilize knowledge for language understanding. Recently, other language representation models have also been proposed, such as XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2019). In this work, we extend BERT to Soft-Masked BERT for spelling error correction and, as far as we know, no similar architecture was proposed before.

5 Conclusion

In this paper, we have proposed a novel neural network architecture for spelling error correction, more specifically Chinese spelling error correction (CSC). Our model, called Soft-Masked BERT, is composed of a detection network and a correction network based on BERT. The detection network identifies likely incorrect characters in the given sentence and soft-masks the characters. The correction network takes the soft-masked characters as input and makes corrections on the characters. The technique of soft-masking is general and potentially useful in other detection-correction tasks. Experimental results on two datasets show that Soft-Masked BERT significantly outperforms the state-of-the-art method of solely utilizing BERT. As future work, we plan to extend Soft-Masked BERT to other problems like grammatical error correction and to explore other possibilities of implementing the detection network.

References

Haithem Afli, Zhengwei Qiu, Andy Way, and Páraic Sheridan. 2016. Using SMT for OCR error correction of historical texts. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 962–966.

Jill Burstein and Martin Chodorow. 1999. Automated essay scoring for nonnative English speakers. In Proceedings of a Symposium on Computer Mediated Language Assessment and Evaluation in Natural Language Processing, pages 68–75. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jianfeng Gao, Xiaolong Li, Daniel Micol, Chris Quirk, and Xu Sun. 2010. A large scale ranker-based system for search query spelling correction. In COLING 2010, 23rd International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2010, Beijing, China, pages 358–366.

Jinxi Guo, Tara N. Sainath, and Ron J. Weiss. 2019. A spelling correction model for end-to-end speech recognition. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5651–5655. IEEE.

Yuzhong Hong, Xianguo Yu, Neng He, Nan Liu, and Junhui Liu. 2019. FASPell: A fast, adaptable, simple, powerful Chinese spell checker based on DAE-decoder paradigm. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 160–169.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Bruno Martins and Mário J. Silva. 2004. Spelling correction for search engine queries. In Advances in Natural Language Processing, 4th International Conference, EsTAL 2004, Alicante, Spain, October 20-22, 2004, Proceedings, pages 372–383.

Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang, and Hsin-Hsi Chen. 2015. Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pages 32–37.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018b. A hybrid approach to automatic corpus generation for Chinese spelling check. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2517–2527.

Dingmin Wang, Yi Tay, and Li Zhong. 2019. Confusionset-guided pointer networks for Chinese spelling check. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5780–5785.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Junjie Yu and Zhenghua Li. 2014. Chinese spelling error detection and correction based on language model, pronunciation, and shape. In Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 220–223, Wuhan, China. Association for Computational Linguistics.

Liang-Chih Yu, Lung-Hao Lee, Yuen-Hsien Tseng, and Hsin-Hsi Chen. 2014. Overview of SIGHAN 2014 bake-off for Chinese spelling check. In Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 126–132.

Shuiyuan Zhang, Jinhua Xiong, Jianpeng Hou, Qiao Zhang, and Xueqi Cheng. 2015. HanSpeller++: A unified framework for Chinese spelling correction. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pages 38–45.
