FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm

Yuzhong Hong, Xianguo Yu, Neng He, Nan Liu, Junhui Liu
Intelligent Platform Division, iQIYI, Inc.
{hongyuzhong, yuxianguo, heneng, liunan, liujunhui}@qiyi.com

Abstract

We propose a Chinese spell checker – FASPell – based on a new paradigm which consists of a denoising autoencoder (DAE) and a decoder. In comparison with previous state-of-the-art models, the new paradigm allows our spell checker to be Faster in computation, readily Adaptable to both simplified and traditional Chinese texts produced by either humans or machines, and to require a much Simpler structure while being just as Powerful in both error detection and correction. These four achievements are made possible because the new paradigm circumvents two bottlenecks. First, the DAE curtails the amount of Chinese spell checking data needed for supervised learning (to <10k sentences) by leveraging the power of unsupervisedly pre-trained masked language models as in BERT, XLNet, MASS, etc. Second, the decoder helps to eliminate the use of a confusion set, which is deficient in flexibility and insufficient in utilizing the salient feature of Chinese character similarity.

1 Introduction

There has been a long line of research on detecting and correcting spelling errors in Chinese texts since some trailblazing work in the early 1990s (Shih et al., 1992; Chang, 1995). However, despite the spelling errors being reduced to substitution errors in most research¹ and the efforts of multiple recent shared tasks (Wu et al., 2013; Yu et al., 2014; Tseng et al., 2015; Fung et al., 2017), Chinese spell checking remains a difficult task. Moreover, methods for languages like English can hardly be directly applied to Chinese, because there are no delimiters between words, whose lack of morphological variations makes the syntactic and semantic interpretation of any Chinese character highly dependent on its context.

1.1 Related work and bottlenecks

Almost all previous Chinese spell checking models deploy a common paradigm in which a fixed set of similar characters for each Chinese character (called a confusion set) is used as candidates, and a filter selects the best candidates as substitutions for a given sentence. This naive design is subject to two major bottlenecks, whose negative impact has not been successfully mitigated:

• Overfitting to under-resourced Chinese spell checking data. Since Chinese spell checking data require tedious professional manual work, they have always been under-resourced. To prevent the filter from overfitting, Wang et al. (2018) propose an automatic method to generate pseudo spell checking data. However, the precision of their spell checking model ceases to improve when the generated data reaches 40k sentences. Zhao et al. (2017) use an extensive amount of ad hoc linguistic rules to filter candidates, only to achieve worse performance than ours, even though our model does not leverage any linguistic knowledge.

• Inflexibility and insufficiency of the confusion set in utilizing character similarity. The feature of Chinese character similarity is very salient, as it is related to the main cause of spelling errors (see subsection 2.2). However, the idea of a confusion set is troublesome in utilizing it:

  1. Inflexibility to address the issue that characters confusing in one scenario may not be confusing in another. The difference between simplified and traditional Chinese shown in Table 1 is an example. Wang et al. (2018) also suggest that confusing characters for machines are different from those for humans. Therefore, in practice, it is very likely that the correct candidates for substitution do not exist in a given confusion set, which harms recall. Also, considering more similar characters to preserve recall will risk lowering precision.

  2. Insufficiency in utilizing character similarity. Since a cut-off threshold of quantified character similarity (Liu et al., 2010; Wang et al., 2018) is used to produce the confusion set, similar characters are actually treated indiscriminately in terms of their similarity. This means the information of character similarity is not sufficiently utilized. To compensate for this, Zhang et al. (2015) propose a spell checker that has to consider many less salient features such as word segmentation, which add more unnecessary noise to their model.

1.2 Motivation and contributions

The motivation of this paper is to circumvent the two bottlenecks in subsection 1.1 by changing the paradigm for Chinese spell checking.

As a major contribution, and as exemplified by our proposed Chinese spell checking model in Figure 1, the most general form of the new paradigm consists of a denoising autoencoder² (DAE) and a decoder. To prove that it is indeed a novel contribution, we compare it with two similar paradigms and show their differences as follows:

1. Similar to the old paradigm used in previous Chinese spell checking models, a model under the DAE-decoder paradigm also produces candidates (by the DAE) and then filters the candidates (by the decoder). However, candidates are produced on the fly based on contexts. If the DAE is powerful enough, we should expect all contextually suitable candidates to be recalled, which prevents the inflexibility issue caused by using a confusion set. The DAE also prevents the overfitting issue because it can be trained unsupervisedly using a large number of natural texts. Moreover, character similarity can be used by the decoder without losing any information.

2. The DAE-decoder paradigm is sequence-to-sequence, which makes it resemble the encoder-decoder paradigm in tasks like machine translation, grammar checking, etc. However, in the encoder-decoder paradigm, the encoder extracts semantic information, and the decoder generates texts that embody that information. In contrast, in the DAE-decoder paradigm, the DAE provides candidates to reconstruct texts from corrupted ones based on contextual features, and the decoder³ selects the best candidates by incorporating other features.

Besides the new paradigm per se, there are two additional contributions in our proposed Chinese spell checking model:

• we propose a more precise quantification method of character similarity than the ones proposed by Liu et al. (2010) and Wang et al. (2018) (see subsection 2.2);

• we propose an empirically effective decoder to filter candidates under the principle of achieving the highest possible precision with minimal harm to recall (see subsection 2.3).

1.3 Achievements

Thanks to the contributions mentioned in subsection 1.2, our model can be characterized by the following achievements relative to previous state-of-the-art models, and thus is named FASPell.

• Our model is Fast. It is shown (subsection 3.3) to be faster in filtering than previous state-of-the-art models, either in terms of absolute time consumption or time complexity.

• Our model is Adaptable. To demonstrate this, we test it on texts from different scenarios – texts by humans, such as learners of Chinese as a Foreign Language (CFL), and by machines, such as Optical Character Recognition (OCR). It can also be applied to both simplified Chinese and traditional Chinese, despite the challenging issue that some erroneous usages of characters in traditional texts are considered valid usages in simplified texts (see Table 1). To the best of our knowledge, all previous state-of-the-art models only focus on human errors in traditional Chinese texts.

• Our model is Simple. As shown in Figure 1, it has only a masked language model and a filter, as opposed to the multiple feature-producing models and filters used in previous state-of-the-art proposals. Moreover, only a small training set and a set of visual and phonological features of characters are required in our model. No extra data are necessary, including a confusion set. This makes our model even simpler.

• Our model is Powerful. On benchmark data sets, it achieves F1 performance similar to that of previous state-of-the-art models on both the detection and correction level (subsection 3.2). It also achieves arguably high precision (78.5% in detection and 73.4% in correction) on our OCR data set.

¹ Likewise, this paper only covers substitution errors.
² The term denoising autoencoder follows the same sense used by Yang et al. (2019), which is arguably more general than the one used by Vincent et al. (2008).
³ The term decoder here is analogous to a Viterbi decoder, in the sense of finding the best path along candidates.

Table 1: Examples on the left are considered valid usages in simplified Chinese (SC). Notes on the right are about how they are erroneous in traditional Chinese (TC), with suggested corrections. This inconsistency exists because multiple traditional characters were merged into identical characters in the simplification process. Our model makes corrections for this type of error only in traditional texts. In simplified texts, they are not detected as errors.

SC Examples      Notes on TC usage
周末 (weekend)    周 → 週; 周 only in 周到, etc.
旅游 (trip)       游 → 遊; 游 only in 游泳, etc.
制造 (make)       制 → 製; 制 only in 制度, etc.

[Figure 1 shows the erroneous input 国际电台苦名丰持人 fed to the masked language model, the top four ranked candidates for each character with their confidences, the confidence-similarity decoder, and the corrected output 国际电台著名主持人; the selected substitutions are 主 (rank 1) and 著 (rank 2).]

Figure 1: A real example of how an erroneous sentence, which is supposed to mean "a famous international radio broadcaster", is successfully spell-checked, with the two erroneous characters 苦 and 丰 being detected and corrected by FASPell. Note that with our proposed confidence-similarity decoder, the final choice for substitution is not necessarily the candidate ranked first.

2 FASPell

As shown in Figure 1, our model uses a masked language model (see subsection 2.1) as the DAE to produce candidates and a confidence-similarity decoder (see subsections 2.2 and 2.3) to filter the candidates. In practice, performing several rounds of the whole process is also proven to be helpful (subsection 3.4).

2.1 Masked language model

A masked language model (MLM) guesses the tokens that are masked in a tokenized sentence. It is intuitive to use an MLM as the DAE to detect and correct Chinese spelling errors because it is in line with the task of Chinese spell checking. In the original training process of the MLM in BERT (Devlin et al., 2018), the errors are the random masks, which are the special token [MASK] 80% of the time, a random token from the vocabulary 10% of the time, and the original token 10% of the time. In cases where a random token is used as the mask, the model actually learns how to correct an erroneous character; in cases where the original token is kept, the model actually learns how to detect whether a character is erroneous or not. For simplicity, FASPell adopts the MLM architecture of BERT (Devlin et al., 2018). Recent variants – XLNet (Yang et al., 2019) and MASS (Song et al., 2019) – have more complex MLM architectures, but they are also suitable.

However, just using a pre-trained MLM raises the issue that the errors introduced by random masks may be very different from the actual errors in spell checking data. Therefore, we propose the following method to fine-tune the MLM on spell checking training sets:

• For texts that have no errors, we follow the original training process as in BERT;

• For texts that have errors, we create two types of training examples by:

  1. given a sentence, masking the erroneous tokens with themselves and setting their target labels to the corresponding correct characters;

  2. to prevent overfitting, also masking tokens that are not erroneous with themselves and setting their target labels to themselves, too.

The two types of training examples are balanced to have roughly similar quantities.

Fine-tuning a pre-trained MLM has proven very effective in many downstream tasks (Devlin et al., 2018; Yang et al., 2019; Song et al., 2019), so one could argue that this is where the power of FASPell mainly comes from. However, we would like to emphasize that the power of FASPell should not be attributed solely to the MLM. In fact, we show in our ablation studies (subsection 3.2) that the MLM by itself can only serve as a very weak Chinese spell checker (its performance can be as poor as an F1 of only 28.9%), and that the decoder utilizing character similarity (see subsections 2.2 and 2.3) is also indispensable to producing a strong Chinese spell checker.
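As a concrete illustration of the two example types described above, here is a minimal sketch of how such fine-tuning targets could be built for one sentence. The function name, the (tokens, corrections) input format, and the sampling of non-erroneous positions are our own illustrative assumptions, not the released FASPell implementation.

```python
import random

def build_finetune_example(tokens, corrections):
    """Build MLM fine-tuning targets for one spell checking sentence.

    tokens      -- list of characters of the (possibly erroneous) sentence
    corrections -- dict {position: correct character} for every spelling error;
                   empty for error-free sentences
    Returns (input_tokens, masked_positions, target_labels). The input tokens
    are left unchanged ("masked with themselves"); only the targets differ.
    """
    if not corrections:
        # Error-free text: fall back to the original BERT masking procedure
        # (80% [MASK], 10% random token, 10% unchanged), omitted here.
        raise NotImplementedError("use the standard BERT masking")

    positions, targets = [], []

    # Type 1: erroneous tokens predict their corresponding correct characters.
    for pos, correct_char in corrections.items():
        positions.append(pos)
        targets.append(correct_char)

    # Type 2: a comparable number of correct tokens predict themselves,
    # so the model also learns when NOT to change a character.
    correct_positions = [i for i in range(len(tokens)) if i not in corrections]
    for pos in random.sample(correct_positions,
                             min(len(corrections), len(correct_positions))):
        positions.append(pos)
        targets.append(tokens[pos])

    return list(tokens), positions, targets
```

Balancing the two types within each sentence, as above, roughly matches the quantity balance described in the text.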

2.2 Character similarity

Erroneous characters in Chinese texts written by humans are usually either visually (subsection 2.2.1) or phonologically (subsection 2.2.2) similar to the corresponding correct characters, or both (Chang, 1995; Liu et al., 2010; Yu and Li, 2014). It is also true that erroneous characters produced by OCR possess visual similarity (Tong and Evans, 1996).

We base our similarity computation on two open databases – the Kanji Database Project⁴ and the Unihan Database⁵ – because they provide shape and pronunciation representations for all CJK Unified Ideographs in all CJK languages.

2.2.1 Visual similarity

The Kanji Database Project uses the Unicode standard Ideographic Description Sequence (IDS) to represent the shape of a character. As illustrated by the examples in Figure 2, the IDS of a character is formally a string, but it is essentially the preorder traversal path of an ordered tree.

[Figure 2 shows the stroke-level IDS string of the simplified character 贫 together with three tree-form IDS representations ①–③ of the same character at different granularity levels.]

Figure 2: The IDS of a character can be given at different granularity levels, as shown in the tree forms ①–③ for the simplified character 贫 (meaning poor). In FASPell, we only use the stroke-level IDS in the form of a string, like the one above the dashed ruling line. Unlike using only actual strokes (Wang et al., 2018), the Unicode standard Ideographic Description Characters (e.g., the non-leaf nodes in the trees) describe the layout of a character. They help us to model the subtle nuances between characters that are composed of identical strokes (see examples in Table 2). Therefore, IDS gives us a more precise shape representation of a character.

In our model, we only adopt the string-form IDS. We define the visual similarity between two characters as one minus the normalized⁶ Levenshtein edit distance between their IDS representations.

The reason for normalization is twofold. Firstly, we want the similarity to range from 0 to 1 for the convenience of later filtering. Secondly, if a pair of more complex characters has the same edit distance as a pair of less complex characters, we want the similarity of the more complex pair to be slightly higher than that of the less complex pair (see examples in Table 2).

We do not use the tree-form IDS, for two reasons, even though it intuitively seems to make more sense. Firstly, even with the most efficient algorithms so far (Pawlik and Augsten, 2015, 2016), tree edit distance (TED) has far greater time complexity than the edit distance of strings (O(mn(m + n)) vs. O(mn)). Secondly, we did try TED in preliminary experiments, but there was no significant difference from using the edit distance of strings in terms of spell checking performance.

⁴ http://kanji-database.sourceforge.net/
⁵ https://unicode.org/charts/unihan.html
⁶ Since the maximal value of the Levenshtein edit distance is the maximum of the lengths of the two strings in question, we normalize it simply by dividing by the maximum length.
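As a concrete illustration of this definition, the following sketch computes the visual similarity as one minus the length-normalized Levenshtein distance between two IDS strings. It is our own minimal implementation for exposition, not the released FASPell code; the final comment checks it against the 午/牛 pair from Table 2.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def visual_similarity(ids_a, ids_b):
    """One minus the edit distance normalized by the longer IDS string."""
    return 1.0 - levenshtein(ids_a, ids_b) / max(len(ids_a), len(ids_b))

# Example from Table 2: the stroke-level IDS strings of 午 and 牛 differ in a
# single character, giving 1 - 1/7 = 0.857.
print(round(visual_similarity("⿱⿰丿一⿻一丨", "⿻⿰丿一⿻一丨"), 3))  # 0.857
```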

Table 2: Examples of the computation of character similarities. IDS is used to compute visual similarity (V-sim), and pronunciation representations in Mandarin Chinese (MC), Cantonese Chinese (CC), Japanese On'yomi (JO), Korean (K) and Vietnamese (V) are used to compute phonological similarity (P-sim). Note that the normalization of the edit distance gives us the desired property that the less complex character pair (午, 牛) has a smaller visual similarity than the more complex pair (田, 由), even though both of their IDS edit distances are 1. Also, note that 午 and 牛 have more similar pronunciations in some languages than in others; combining the pronunciations in multiple languages gives us a more continuous phonological similarity.

            IDS               MC      CC      JO     K     V      V-sim   P-sim
午 (noon)    ⿱⿰丿一⿻一丨       wu3     ng5     go     o     ngọ    0.857   0.280
牛 (cow)     ⿻⿰丿一⿻一丨       niu2    ngau4   gyuu   wu    ngưu
田 (field)   ⿵⿰丨Ì⿱⿻一丨一     tian2   tin4    den    cen   điền   0.889   0.090
由 (from)    ⿻⿰丨Ì⿱⿻一丨一     you2    jau4    yuu    yu    do

2.2.2 Phonological similarity

Different Chinese characters sharing identical pronunciations is very common (Yang et al., 2012), and this is the case for any CJK language. Thus, if we were to use character pronunciations in only one CJK language, the phonological similarity of character pairs would be limited to a few discrete values. However, a more continuous phonological similarity is preferred because it makes the curve used for filtering candidates smoother (see subsection 2.3).

Therefore, we utilize the character pronunciations of all CJK languages (see examples in Table 2), which are provided by the Unihan Database. To compute the phonological similarity of two characters, we first calculate one minus the normalized Levenshtein edit distance between their pronunciation representations in every CJK language (where applicable). Then, we take the mean of the results. Hence, the similarity also ranges from 0 to 1.

2.3 Confidence-Similarity Decoder

Candidate filters in many previous models are based on setting various thresholds and weights for multiple features of the candidate characters. Instead of this naive approach, we propose a method that is empirically effective under the principle of achieving the highest possible precision with minimal harm to recall. Since the decoder utilizes contextual confidence and character similarity, we refer to it as the confidence-similarity decoder (CSD). The mechanism of the CSD is explained, and its effectiveness justified, as follows.

First, consider the simplest case, where only one candidate character is provided for each original character. For candidates that are identical to their original characters, we make no substitution. For those that are different, we can draw a confidence-similarity scatter graph. If we compare the candidates with the ground truths, the graph resembles plot ① of Figure 3. We can observe that true-detection-and-true-correction candidates are denser toward the upper-right corner, false-detection candidates toward the lower-left corner, and true-detection-and-false-correction candidates in the middle area.

If we draw a curve to filter out false-detection candidates (plot ② of Figure 3) and use the rest as substitutions, we optimize character-level precision with minimal harm to character-level recall for detection; if true-detection-and-false-correction candidates are also filtered out (plot ③ of Figure 3), we get the same effect for correction. In FASPell, we optimize correction performance and manually find the filtering curve using a training set, assuming its consistency with the corresponding testing set. In practice, we have to find two curves – one for each type of similarity – and then take the union of the filtering results.

Now, consider the case where there are c > 1 candidates. To reduce it to the simplest case described above, we rank the candidates for each original character according to their contextual confidence and put candidates that have the same rank into the same group (i.e., c groups in total). Thus, we can find a filter, as described above, for each group of candidates. All c filters combined further alleviate the harm to recall because more candidates are taken into account.

In the example of Figure 1, there are c = 4 groups of candidates. We get a correct substitution 丰 → 主 from the group whose rank = 1, another one 苦 → 著 from the group whose rank = 2, and no more from the other two groups.
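The two procedures above – averaging normalized pronunciation distances (subsection 2.2.2) and rank-grouped confidence-similarity filtering (subsection 2.3) – can be sketched in a few lines of Python. Both function names, the data layouts, and the toy filtering curve below are our own illustrative assumptions rather than the released FASPell code; levenshtein() is the helper from the sketch in subsection 2.2.1, and in FASPell the curves are tuned manually on a training set and applied once per similarity type, with the union of the resulting substitutions kept.

```python
def phonological_similarity(prons_a, prons_b):
    """Mean of (1 - normalized edit distance) over the CJK languages in which
    both characters have a pronunciation (cf. Table 2)."""
    scores = [1.0 - levenshtein(a, b) / max(len(a), len(b))
              for a, b in zip(prons_a, prons_b) if a and b]
    return sum(scores) / len(scores) if scores else 0.0

# Table 2: 午 vs 牛 -> mean(0.25, 0.4, 0.25, 0.0, 0.5) = 0.280
print(phonological_similarity(["wu3", "ng5", "go", "o", "ngọ"],
                              ["niu2", "ngau4", "gyuu", "wu", "ngưu"]))

def csd_decode(sentence, candidates, similarity, keep_curves):
    """Rank-grouped confidence-similarity decoding for one sentence.

    sentence    -- list of original characters
    candidates  -- candidates[i] = list of (char, confidence) pairs produced by
                   the MLM for position i, ordered by decreasing confidence
    similarity  -- similarity(a, b) -> float in [0, 1] (visual or phonological)
    keep_curves -- keep_curves[r](confidence, similarity) -> bool, one manually
                   tuned filtering curve per rank, found on a training set
    Returns {position: substitution character}.
    """
    substitutions = {}
    for rank, keep in enumerate(keep_curves):
        for pos, original in enumerate(sentence):
            if pos in substitutions:
                continue  # already substituted using a higher-confidence rank
            char, confidence = candidates[pos][rank]
            if char == original:
                continue  # identical candidates are never substituted
            if keep(confidence, similarity(original, char)):
                substitutions[pos] = char
    return substitutions

# A toy filtering curve in the spirit of plot ③ of Figure 3: keep a candidate
# only when confidence and similarity are jointly high enough.
toy_curves = [lambda conf, sim: conf + sim > 1.3] * 4
```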

[Figure 3 consists of four confidence-similarity scatter plots (confidence on the x-axis, similarity on the y-axis), each showing the same candidates under a different filtering strategy.]

Figure 3: All four plots show the same confidence-similarity graph of candidates, categorized as true-detection-and-true-correction (T-d&T-c), true-detection-and-false-correction (T-d&F-c) and false-detection (F-d). Each plot shows a different way of filtering candidates: in plot ①, no candidates are filtered; in plot ②, the filtering optimizes detection performance; in plot ③, as adopted in FASPell, the filtering optimizes correction performance; in plot ④, as adopted by previous models, candidates are filtered out by setting a threshold for weighted confidence and similarity (0.8 × confidence + 0.2 × similarity < 0.8 as an example in the plot). Note that the four plots use the actual first-rank candidates (using visual similarity) for our OCR data (Trnocr), except that we randomly sampled only 30% of the candidates to make the plots more viewable on paper.

3 Experiments and results

We first describe the data, metrics and model configurations adopted in our experiments in subsection 3.1. Then, in subsection 3.2, we show the performance on spell checking texts written by humans to compare FASPell with previous state-of-the-art models; we also show the performance on data harvested from OCR results to prove the adaptability of the model. In subsection 3.3, we compare the speed of FASPell and three state-of-the-art models. In subsection 3.4, we investigate how hyper-parameters affect the performance of FASPell.

3.1 Data, metrics and configurations

We adopt the benchmark datasets (all in traditional Chinese) and the sentence-level accuracy, precision, recall and F1 metrics⁷ provided by the SIGHAN13–15 shared tasks on Chinese spell checking (Wu et al., 2013; Yu et al., 2014; Tseng et al., 2015). We also harvested 4575 sentences (4516 of which are simplified Chinese) from OCR results of Chinese subtitles in videos, using the OCR method by Shi et al. (2017). Detailed data statistics are given in Table 3.

Table 3: Statistics of datasets.

Dataset   # erroneous sent   # sent   Avg. length
Trn13     350                700      41.8
Trn14     3432               3435     49.6
Trn15     2339               2339     31.3
Tst13     996                1000     74.3
Tst14     529                1062     50.0
Tst15     550                1100     30.6
Trnocr    3575               3575     10.1
Tstocr    1000               1000     10.2

We use the pre-trained masked language model⁸ provided by Devlin et al. (2018); the settings of its hyper-parameters and pre-training are available at https://github.com/google-research/. The other configurations of FASPell used in our major experiments (subsections 3.2–3.3) are given in Table 4. For the ablation experiments, the same configurations are used, except that when the CSD is removed, we take the candidates ranked first as the default outputs. Note that we do not fine-tune the masked language model for the OCR data because we learned in preliminary experiments that fine-tuning worsens performance for this type of data⁹.

⁷ Note that although we do not use character-level metrics (Fung et al., 2017) in evaluation, they are actually important in the justification of the effectiveness of the CSD, as in subsection 2.3.
⁸ https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
⁹ This is probably because OCR errors are subject to random noise in the source pictures rather than learnable patterns as in human errors. However, since this paper is not about OCR, we do not elaborate on this here.

Table 4: Configurations of FASPell. FT means the training set for fine-tuning; CSD means the training set for the CSD; r means the number of rounds and c means the number of candidates for each character. U is the union of all the spell checking data from SIGHAN13–15.

FT          CSD      Test set   r   c   FT steps
U − Tst13   Trn13    Tst13      1   4   10k
U − Tst14   Trn14    Tst14      3   4   10k
U − Tst15   Trn15    Tst15      3   4   10k
(-)         Trnocr   Tstocr     2   4   (-)

3.2 Performance

As shown in Table 6, FASPell achieves state-of-the-art F1 performance at both the detection level and the correction level. It is better in precision than the model by Wang et al. (2018) and better in recall than the model by Zhang et al. (2015). In comparison with Zhao et al. (2017), it is better by any metric. It also reaches comparable precision on the OCR data. The lower recall on the OCR data is partially because many OCR errors are harder to correct, even for humans (Wang et al., 2018).

Table 6 also shows that all the components of FASPell contribute effectively to its good performance. FASPell without both fine-tuning and the CSD is essentially the pre-trained masked language model. Fine-tuning improves recall because FASPell can learn about common errors and how they are corrected. The CSD improves precision with minimal harm to recall, because this is the underlying principle of the design of the CSD.

3.3 Filtering Speed¹⁰

First, we measure the filtering speed of Chinese spell checking in terms of absolute time consumption per sentence (see Table 5). We compare the speed of FASPell with the model by Wang et al. (2018) in this manner because they have reported their absolute time consumption¹¹. Table 5 clearly shows that FASPell is much faster.

Table 5: Speed comparison (ms/sent). Note that the speed of FASPell is the average over several rounds.

Test set   FASPell   Wang et al. (2018)
Tst13      446       680
Tst14      284       745
Tst15      177       566

Second, to compare FASPell with models (Zhang et al., 2015; Zhao et al., 2017) whose absolute time consumption has not been reported, we analyze the time complexity. The time complexity of FASPell is O(scmn + sc log c), where s is the sentence length, c is the number of candidates, mn accounts for computing the edit distance and c log c for ranking the candidates. Zhang et al. (2015) use more features than just the edit distance, so the time complexity of their model has additional factors. Moreover, since we do not use a confusion set, the number of candidates for each character of their model is practically larger than ours: x × 10 vs. 4. Thus, FASPell is faster than their model. Zhao et al. (2017) filter candidates by finding the single-source shortest path (SSSP) in a directed graph consisting of all candidates for every token in a sentence. The algorithm they use has a time complexity of O(|V| + |E|), where |V| is the number of vertices and |E| is the number of edges in the graph (Eppstein, 1998). Translated in terms of s and c, the time complexity of their model is O(sc + c^s). This implies that their model is exponentially slower than FASPell for long sentences.

¹⁰ We consider only the filtering speed because the Transformer, the Bi-LSTM and the language models used before filtering, by previous state-of-the-art models or by us, are already well studied in the literature.
¹¹ We have no access to the 4-core Intel Core i5-7500 CPU used by Wang et al. (2018). To minimize the difference in speed caused by hardware, we only use 4 cores of a 12-core Intel(R) Xeon(R) CPU E5-2650 in the experiments.

Table 6: Spell checking performance at both the detection and correction level. Our model, FASPell, achieves performance similar to that of previous state-of-the-art models. Note that fine-tuning and the CSD both contribute effectively to its performance, according to the results of the ablation experiments. (− FT means removing fine-tuning; − CSD means removing the CSD.)

Test set   Models                Detection: Acc. / Prec. / Rec. / F1 (%)   Correction: Acc. / Prec. / Rec. / F1 (%)
Tst13      Wang et al. (2018)    (-)  / 54.0 / 69.3 / 60.7                 (-)  / (-)  / (-)  / 52.1
           Yeh et al. (2013)     (-)  / (-)  / (-)  / (-)                  62.5 / 70.3 / 62.5 / 66.2
           FASPell               63.1 / 76.2 / 63.2 / 69.1                 60.5 / 73.1 / 60.5 / 66.2
           FASPell − FT          40.9 / 75.5 / 40.9 / 53.0                 39.6 / 73.2 / 39.6 / 51.4
           FASPell − CSD         41.0 / 42.3 / 41.1 / 41.6                 31.3 / 32.2 / 31.3 / 31.8
           FASPell − FT − CSD    47.9 / 65.2 / 47.8 / 55.2                 35.6 / 48.4 / 35.4 / 40.9
Tst14      Zhao et al. (2017)    (-)  / (-)  / (-)  / (-)                  (-)  / 55.5 / 39.1 / 45.9
           Wang et al. (2018)    (-)  / 51.9 / 66.2 / 58.2                 (-)  / (-)  / (-)  / 56.1
           FASPell               70.0 / 61.0 / 53.5 / 57.0                 69.3 / 59.4 / 52.0 / 55.4
           FASPell − FT          57.8 / 54.5 / 18.1 / 27.2                 57.7 / 53.7 / 17.8 / 26.7
           FASPell − CSD         49.0 / 31.0 / 42.3 / 35.8                 44.9 / 25.0 / 34.2 / 28.9
           FASPell − FT − CSD    56.3 / 38.4 / 26.8 / 31.6                 52.1 / 26.0 / 18.0 / 21.3
Tst15      Zhang et al. (2015)   70.1 / 80.3 / 53.3 / 64.0                 69.2 / 79.7 / 51.5 / 62.5
           Wang et al. (2018)    (-)  / 56.6 / 69.4 / 62.3                 (-)  / (-)  / (-)  / 57.1
           FASPell               74.2 / 67.6 / 60.0 / 63.5                 73.7 / 66.6 / 59.1 / 62.6
           FASPell − FT          61.5 / 74.1 / 25.5 / 37.9                 61.3 / 72.5 / 24.9 / 37.1
           FASPell − CSD         65.5 / 49.3 / 59.1 / 53.8                 60.0 / 40.2 / 48.2 / 43.8
           FASPell − FT − CSD    63.7 / 59.1 / 35.3 / 44.2                 57.6 / 38.3 / 22.7 / 28.5
Tstocr     FASPell               18.6 / 78.5 / 18.6 / 30.1                 17.4 / 73.4 / 17.4 / 28.1
           FASPell − CSD         34.5 / 65.8 / 34.5 / 45.3                 18.9 / 36.1 / 18.9 / 24.8

3.4 Exploring hyper-parameters

First, we change only the number of candidates in Table 4 to see its effect on spell checking performance. As illustrated in Figure 4, when more candidates are taken into account, additional detections and corrections are recalled while precision is maximized. Thus, increasing the number of candidates always results in an improvement in F1. The reason we set the number of candidates to c = 4 in Table 4 and no larger is that there is a trade-off with time consumption.

Second, we do the same for the number of rounds of spell checking in Table 4. We can observe in Figure 4 that the correction performance on Tst14 and Tst15 reaches its peak when the number of rounds is 3. For Tst13 and Tstocr, that number is 1 and 2, respectively. A larger number of rounds sometimes helps because FASPell can achieve high precision in detection in each round, so errors undiscovered in one round may be detected and corrected in the next round without falsely detecting too many non-errors.

[Figure 4 contains eight F1 plots, one per test set (Tst13, Tst14, Tst15, Tstocr) for each experiment, with separate curves for detection and correction.]

Figure 4: The four plots in the first row show how the number of candidates for each character affects F1 performance. The four in the second row show the impact of the number of rounds of spell checking.

4 Conclusion

We propose a Chinese spell checker, FASPell, that reaches state-of-the-art performance. It is based on the DAE-decoder paradigm, which requires only a small amount of spell checking data and gives up the troublesome notion of a confusion set. With FASPell as an example, each component of the paradigm is shown to be effective. We make our code and data publicly available at https://github.com/iqiyi/FASPell.

Future work may include studying whether the DAE-decoder paradigm can be used to detect and correct grammatical errors or other less frequently studied types of Chinese spelling errors, such as dialectical colloquialism (Fung et al., 2017) and insertion/deletion errors.

Acknowledgments

The authors would like to thank the anonymous reviewers for their comments. We also thank our colleagues from the IT Infrastructure team of iQIYI, Inc. for the hardware support. Special thanks go to Prof. Yves Lepage from the Graduate School of IPS, Waseda University for his insightful advice about the paper.

References

Chao-Huang Chang. 1995. A new approach for automatic Chinese spelling correction. In Proceedings of the Natural Language Processing Pacific Rim Symposium, volume 95, pages 278–283. Citeseer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

David Eppstein. 1998. Finding the k shortest paths. SIAM Journal on Computing, 28(2):652–673.

Gabriel Fung, Maxime Debosschere, Dingmin Wang, Bo Li, Jia Zhu, and Kam-Fai Wong. 2017. NLPTEA 2017 shared task – Chinese spelling check. In Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017), pages 29–34, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Chao-Lin Liu, Min-Hua Lai, Yi-Hsuan Chuang, and Chia-Ying Lee. 2010. Visually and phonologically similar characters in incorrect simplified Chinese words. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 739–747, Beijing, China. Association for Computational Linguistics.

Mateusz Pawlik and Nikolaus Augsten. 2015. Efficient computation of the tree edit distance. ACM Transactions on Database Systems, 40(1):3:1–3:40.

Mateusz Pawlik and Nikolaus Augsten. 2016. Tree edit distance: Robust and memory-efficient. Information Systems, 56:157–173.

Baoguang Shi, Xiang Bai, and Cong Yao. 2017. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2298–2304.

DS Shih et al. 1992. A statistical method for locating typo in Chinese sentences. CCL Research Journal, pages 19–26.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.

Xiang Tong and David A. Evans. 1996. A statistical approach to automatic OCR error correction in context. In Fourth Workshop on Very Large Corpora.

Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang, and Hsin-Hsi Chen. 2015. Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pages 32–37.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM.

Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018. A hybrid approach to automatic corpus generation for Chinese spelling check. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2517–2527, Brussels, Belgium. Association for Computational Linguistics.

Shih-Hung Wu, Chao-Lin Liu, and Lung-Hao Lee. 2013. Chinese spelling check evaluation at SIGHAN bake-off 2013. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 35–42, Nagoya, Japan. Asian Federation of Natural Language Processing.

Shaohua Yang, Hai Zhao, Xiaolin Wang, and Bao-Liang Lu. 2012. Spell checking for Chinese. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Jui-Feng Yeh, Sheng-Feng Li, Mei-Rong Wu, Wen-Yi Chen, and Mao-Chuan Su. 2013. Chinese word spelling correction based on n-gram ranked inverted index list. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 43–48, Nagoya, Japan. Asian Federation of Natural Language Processing.

Junjie Yu and Zhenghua Li. 2014. Chinese spelling error detection and correction based on language model, pronunciation, and shape. In Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 220–223.

Liang-Chih Yu, Lung-Hao Lee, Yuen-Hsien Tseng, and Hsin-Hsi Chen. 2014. Overview of SIGHAN 2014 bake-off for Chinese spelling check. In Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 126–132, Wuhan, China. Association for Computational Linguistics.

Shuiyuan Zhang, Jinhua Xiong, Jianpeng Hou, Qiao Zhang, and Xueqi Cheng. 2015. HANSpeller++: A unified framework for Chinese spelling correction. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pages 38–45, Beijing, China. Association for Computational Linguistics.

Hai Zhao, Deng Cai, Yang Xin, Yuzhu Wang, and Zhongye Jia. 2017. A hybrid model for Chinese spelling check. ACM Transactions on Asian and Low-Resource Language Information Processing, 16(3):21:1–21:22.
