
An Improved Graph Model for Chinese Spell Checking∗

Yang Xin 1,2, Hai Zhao 1,2,†, Yuzhu Wang 1,2 and Zhongye Jia 1,2
1 Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
[email protected], [email protected], [email protected], [email protected]

Abstract

In this paper, we propose an improved graph model for Chinese spell checking. The model is based on a graph model for generic errors and two independently-trained models for specific errors. First, a graph model represents a Chinese sentence, and a modified single source shortest path algorithm is performed on the graph to detect and correct generic spelling errors. Then, we utilize conditional random fields to solve two specific kinds of common errors: the confusion of “在” (at) (pinyin: zai) with “再” (again, more, then) (pinyin: zai), and of “的” (of) (pinyin: de), “地” (-ly, adverb-forming particle) (pinyin: de) and “得” (so that, have to) (pinyin: de). Finally, a rule based system is exploited to solve the pronoun usage confusions of “她” (she) (pinyin: ta) and “他” (he) (pinyin: ta) and some other fixed collocation errors. The proposed model is evaluated on the standard data set released by the SIGHAN Bake-off 2014 shared task, and gives competitive results.

∗This work was partially supported by the National Natural Science Foundation of China (No. 60903119, No. 61170114, and No. 61272248), the National Basic Research Program of China (No. 2013CB329401), the Science and Technology Commission of Shanghai Municipality (No. 13511500200), the European Union Seventh Framework Program (No. 247619), the Yuanpei Program (CSC fund 201304490199 and 201304490171), and the art and science interdiscipline funds of Shanghai Jiao Tong University (A study on mobilization mechanism and alerting threshold setting for online community, and media image and evaluation: a computational intelligence approach).
†Corresponding author.
1 Pinyin is the official phonetic system for transcribing the sound of Chinese characters into Latin script.

1 Introduction

Spell checking is a routine processing task for every written language: an automatic mechanism to detect and correct human spelling errors. Given sentences, the goal of the task is to return the locations of incorrect words and to suggest the correct words. However, Chinese spell checking (CSC) is very different from spell checking in English or other alphabetical languages in the following ways.

Usually, the object of spell checking is words, but “word” is not a natural concept in Chinese, since there are no delimiters between words in Chinese writing. An English “word” consists of Latin letters, while a Chinese “word” consists of characters, also known as “漢字” (Chinese characters) (pinyin1: han zi). Thus, essentially, the object of CSC is misused characters in a sentence. Meanwhile, sentences for the CSC task are computer-typed rather than handwritten Chinese. In handwritten Chinese, there exist various spelling errors, including non-character errors which are probably caused by stroke errors. In computer-typed Chinese, a non-character spelling error is impossible, because any illegal Chinese character will be filtered by the Chinese input method engine, so CSC never encounters the “out-of-character (OOC)” problem. Thus, Chinese spelling errors come from the misuse of characters, not from the characters themselves.

Spelling errors in alphabetical languages such as English are typically divided into two categories:

• The misspelled word is a non-word, for example “come” is misspelled into “cmoe”;

• The misspelled word is still a legal word, for example “come” is misspelled into “cone”.

While in Chinese, if the misspelled word is a non-word, the word segmenter will not recognize it as a word, but will split it into two or more words with fewer characters. For example, if “你好世界” in Example 1 of Table 1 is misspelled into “你好世節”, the word segmenter will segment it into “你好/世/節” instead of “你好/世節”. For a non-word spelling error, the misspelled word will thus be mis-segmented.

Name        | Example 1       | Example 2
Golden      | 你好/世界        | 好好/地/出去/玩
Misspelled  | 你好/世/節       | 好好/的/出去/玩
Pinyin      | ni hao shi jie  | hao hao de chu qu wan
Translation | hello the world | enjoy yourself outside

Table 1: Two examples of Chinese spelling errors. In both examples, the misspelled text has the same pinyin as the golden text.

Thus the edit distance based methods that are commonly used for alphabetical languages cannot be directly applied to CSC. The CSC task has to deal with the word segmentation problem first, since a misspelled sentence cannot be segmented properly by a word segmenter.

There also exist Chinese spelling errors which are unrelated to word segmentation. For example, “好好地出去玩” in Example 2 of Table 1 is misspelled into “好好的出去玩”, but both of them have the same segmentation. So it is necessary to perform further specific processing.

In this paper, based on our previous work (Jia et al., 2013b) in SIGHAN Bake-off 2013, we describe an improved graph model to handle the CSC task. The improved model includes a graph model for generic spelling errors, conditional random fields (CRF) for two special errors and a rule based system for some collocation errors.

2 Related Work

Over the past few years, many methods have been proposed for the CSC task. (Sun et al., 2010) developed a phrase-based spelling error model from clickthrough data by measuring the edit distance between an input query and the optimal spelling correction. (Gao et al., 2010) explored a ranker-based approach which included visual similarity, phonological similarity, dictionary, and frequency features for large scale web search. (Ahmad and Kondrak, 2005) proposed a spelling error model learned from search query logs to improve query quality. (Han and Chang, 2013) employed maximum entropy models for CSC. They trained a maximum entropy model for each Chinese character on a large raw corpus and used the models to detect spelling errors.

Two key techniques, word segmentation (Zhao et al., 2006a; Zhao and Kit, 2008b; Zhao et al., 2006b; Zhao and Kit, 2008a; Zhao and Kit, 2007; Zhao and Kit, 2011; Zhao et al., 2010) and language models (LM), are also popularly used for CSC. Most of these approaches fall into four categories. The first category consists of methods in which all the characters in a sentence are assumed to be errors and an LM is used for correction (Chang, 1995; Yu et al., 2013). (Chang, 1995) proposed a method that replaced each character in the sentence based on a confusion set and computed the probability of the original sentence and of all modified sentences according to a bigram language model generated from a newspaper corpus. The method is based on the motivation that all typos are caused by either visual similarity or phonological similarity, so they manually built a confusion set as a key factor of their system. Although the method can detect misspelled words well, it was very time consuming, generated too many false positives and was not able to handle an entire paragraph. (Yu et al., 2013) developed a joint error detection and correction system. The method assumed that all characters in the sentence may be errors and replaced every character using a confusion set. Then they segmented all newly generated sentences and scored the segmentation of every sentence using an LM. In fact, this method did not always perform well according to (Yu et al., 2013).

The second category includes methods in which all single-character words are supposed to be errors and an LM is used for correction, for example (Lin and Chu, 2013). They developed a system which supposed that all single-character words may be typos. They replaced all single-character words by similar characters using a confusion set and segmented the newly created sentences again. If a new sentence resulted in a better word segmentation, a spelling error was reported. Their system gave good detection recall and a low false-alarm rate.

The third category utilizes more than one approach for detection and an LM for correction. (Hsieh et al., 2013) used two different systems for error detection. The first system detected error characters based on unknown word detection and LM verification. The second one solved error detection with a suggestion dictionary generated from a confusion set. Finally, the two systems were combined to obtain the final detection result. (He and Fu, 2013) divided typos into three categories, namely character-level errors (CLEs), word-level errors (WLEs) and context-level errors (CLEs), and three different methods were used to detect the different errors respectively. In addition to using the result of word segmentation for detection, (Yeh et al., 2013) also proposed a dictionary-based method to detect spelling errors; the dictionary contained similar pronunciation and shape information for each Chinese character. (Yang et al., 2013) proposed another method to improve the candidate detections. They employed high confidence pattern matching to strengthen the candidate errors after word segmentation.

The last category is formed by methods which use word segmentation for detection and different models for correction (Liu et al., 2013; Chen et al., 2013; Chiu et al., 2013). (Liu et al., 2013) used a support vector machine (SVM) to select the most probable sentence from multiple candidates. They used word segmentation and a machine translation model to generate the candidates respectively, and the SVM was used to rerank the candidates. (Chen et al., 2013) not only applied an LM, but also used various topic models to cover the shortage of the LM. (Chiu et al., 2013) explored a statistical machine translation model to translate sentences containing typos into correct ones. In their model, the sentence with the highest translation probability, which indicated how likely a typo was to be translated into its candidate correct word, was chosen as the final corrected sentence.

3 The Revised Graph Model

The graph model (Jia et al., 2013b) of SIGHAN Bake-off 2013 is inspired by the idea of the shortest path word segmentation algorithm, which is based on the following assumption: a reasonable segmentation should maximize the lengths of all segments or minimize the total number of segments (Casey and Lecolinet, 1996). A directed acyclic graph (DAG) is built from the input sentence in a similar way, and the spelling error detection and correction problem is transformed into a single source shortest path (SSSP) problem on the DAG.

Given a dictionary D and a similar character set C, for a sentence S of m characters {c_1, c_2, ..., c_m}, the original vertices V of the DAG in (Jia et al., 2013b) are:

V = { w_{i,j} | w_{i,j} = c_i ... c_j ∈ D }
  ∪ { w^k_{i,j} | w^k_{i,j} = c_i ... c'_k ... c_j ∈ D, τ ≤ j − i ≤ T, c'_k ∈ C[c_k], k = i, i+1, ..., j }
  ∪ { w_{−,0}, w_{m+1,−} },

where w_{−,0} and w_{m+1,−} are two special vertices that represent the start and the end of the sentence.

However, this graph model cannot handle continuous word errors. Take the following sentence as an example, in which “健康” (health) (pinyin: jian kang) is misspelled into “建缸” (pinyin: jian gang); the substitution strategy above never substitutes two continuous characters simultaneously.

• 然後，我是計劃我們到我家一個附近的‘建缸’ (pinyin: jian gang) 中心去游泳。
  Translation after correction: And then, we plan to go swimming near my house.

For example, the substitution of “建缸” (pinyin: jian gang) may be “碱缸” (pinyin: jian gang), “建鋼” (pinyin: jian gang), “建行” (pinyin: jian hang) and so on, none of which is the desired correction. So we revise the construction method of the graph model. Considering efficiency, we only deal with continuous errors of 2 characters. The revised V is:

V = { w_{i,j} | w_{i,j} = c_i ... c_j ∈ D }
  ∪ { w^k_{i,j} | w^k_{i,j} = c_i ... c'_k ... c_j ∈ D, τ ≤ j − i ≤ T, c'_k ∈ C[c_k], k = i, i+1, ..., j }
  ∪ { w^l | w^l = c'_l c'_{l+1} ∈ D, c'_l ∈ C[c_l], c'_{l+1} ∈ C[c_{l+1}] }
  ∪ { w_{−,0}, w_{m+1,−} }.

With the modified DAG G, “建缸” (pinyin: jian gang) can be substituted by “健康” (health) (pinyin: jian kang), “峴港” (Danang) (pinyin: xian gang), “潛航” (submerge) (pinyin: qian hang) and so on, which now include the desired correction.
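To make the construction concrete, the following minimal sketch (in Python) builds the candidate vertices, including the revised two-continuous-character substitution, and decodes with a shortest path over character positions. The toy dictionary, confusion set and weight function here are illustrative stand-ins, not the resources used in the paper.

from itertools import product

D = {"你好", "世界", "你", "好", "世", "界"}    # toy dictionary
C = {"節": ["界"], "建": ["健"], "缸": ["康"]}  # toy similar character set

def build_vertices(sent, max_len=4):
    # Candidate words w_{i,j}: exact dictionary matches, words reachable by
    # substituting one character, and (the revision) words reachable by
    # substituting two continuous characters at once.
    V, n = [], len(sent)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            chars = sent[i:j]
            if chars in D:
                V.append((i, j, chars))
            for k in range(i, j):                       # one substitution
                for c in C.get(sent[k], []):
                    cand = chars[:k - i] + c + chars[k - i + 1:]
                    if cand in D:
                        V.append((i, j, cand))
        if i + 2 <= n:                                  # two continuous substitutions
            for a, b in product(C.get(sent[i], []), C.get(sent[i + 1], [])):
                if a + b in D:
                    V.append((i, i + 2, a + b))
    return V

def sssp_decode(sent, V, weight):
    # Dynamic program over positions 0..n; visiting vertices by right
    # boundary processes them in topological order of the DAG.
    # Assumes every position is reachable (single characters are in D).
    n = len(sent)
    best, back = [float("inf")] * (n + 1), [None] * (n + 1)
    best[0] = 0.0
    for i, j, w in sorted(V, key=lambda v: v[1]):
        if best[i] + weight(w) < best[j]:
            best[j], back[j] = best[i] + weight(w), (i, w)
    path, j = [], n
    while j > 0:                                        # backtrack best path
        i, w = back[j]
        path.append(w)
        j = i
    return list(reversed(path))

print(sssp_decode("你好世節", build_vertices("你好世節"), lambda w: 1.0 / len(w)))

With the toy weight preferring longer words, the decode recovers ["你好世界"], i.e. the misspelled “節” is corrected through the substituted vertex.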

4 The Improved Graph Model

The graph model based on word segmentation in (Jia et al., 2013b), even with the revision in Section 3, still has limitations. For a sentence, in the graph construction stage, a substitution is only applied when it decreases the number of words in the segmentation, i.e., when it produces a new, longer word. In addition, if the segmentation result of a sentence is a single character, the graph model does not work, because a single character will not be substituted. For example, in the following two sentences, the “他” (he) (pinyin: ta) in the first sentence should be corrected into “她” (she) (pinyin: ta), and the “的” (of) (pinyin: de) in the second sentence should be corrected into “地” (-ly, adverb-forming particle) (pinyin: de); however, the graph model does not work for these cases.

• 雖然我不在我的國家，不能見到媽媽，可是我要給‘他’ (him) (pinyin: ta)打電話！
  Translation after correction: Though I’m not in my country so that I cannot see my mum, I would like to call her!

• 我們也不要想太多；我們來好好‘的’ (of) (pinyin: de)出去玩吧！
  Translation after correction: We should not worry too much, just enjoy ourselves outside now!

The graph model is also powerless in the situation where the wrong character is segmented into a legal word. Take the following sentence as an example: the word “心裡” (in mind, at heart) (pinyin: xin li) will not be separated after building the graph, so “裡” (pinyin: li) cannot be corrected into “理” (pinyin: li).

• 我對心‘裡’ (pinyin: li)研究有興趣。
  Translation after correction: I’m interested in psychological research.

To alleviate the above limitations of the graph model, we utilize a CRF model to deal with two kinds of errors, and a rule based system is established to cope with the pronoun errors of “她” (she) (pinyin: ta) and “他” (he) (pinyin: ta) and with collocation errors.

4.1 CRF Model

Two classifiers using the CRF model are respectively trained to tackle the common character usage confusions: “在” (at) (pinyin: zai) versus “再” (again, more, then) (pinyin: zai), and “的” (of) (pinyin: de), “地” (-ly, adverb-forming particle) (pinyin: de), “得” (so that, have to) (pinyin: de). We assume that the correct character selection is related to the two neighboring words on each side and their part-of-speech (POS) tags. The classifiers are trained on a large five-gram token set extracted from a large POS tagged corpus. The feature selection algorithm follows (Zhao et al., 2013; Wang et al., 2014; Jia et al., 2013a). The feature set for the CRF model is as follows:

w_{j,−2}, pos_{j,−2}, w_{j,−1}, pos_{j,−1}, w_{j,0}, pos_{j,0}, w_{j,1}, pos_{j,1}, w_{j,2}, pos_{j,2}

where j is the token index indicating the position, w_{j,0} is the current candidate character and pos_{j,0} is its POS tag. ICTCLAS (Zhang et al., 2003) is adopted for POS tagging.

The set of feature strings that we used is presented in Table 2. The labels for “的” (of) (pinyin: de), “地” (-ly, adverb-forming particle) (pinyin: de), “得” (so that, have to) (pinyin: de) are 1, 2, 3, and those for “在” (at) (pinyin: zai), “再” (again, more, then) (pinyin: zai) are 1, 2.

4.2 The Rule Based System

To effectively handle the pronoun usage errors for “她” (she) (pinyin: ta) and “他” (he) (pinyin: ta) as well as other collocation errors, we design a rule based system extracted from the development set.

Table 3 lists the rules we set for solving the pronoun usage errors, where prefix[i] is the prefix of the sentence before the current word w[i]. The other rules are divided into five categories, presented in Table 4 – Table 8, in which we only show several typical rules of each category. The negation symbol “¬” in Table 6 and Table 7 means that the word in the corresponding position is not one of those in the brackets. Each rule in the tables is verified with the Baidu2 search engine: if the suspected error pattern legitimately appears in the search results, we do not correct it.

2 http://www.baidu.com/
of the graph model, we utilize CRF model to deal If the error situation is legally emerged in the with two kinds of errors, and a rule based system search result, we will not correct the error any is established to cope with the pronoun errors: more. “她” (she) (pinyin: ta), “他” (he) (pinyin: ta) and collocation errors. 2http://www.baidu.com/

Feature                              | Example 1       | Example 2
w_{j,−2}                             | 來              | 和
w_{j,−1}                             | 好好            | 你
w_{j,1}                              | 出              | 一起
w_{j,−2}, w_{j,−1}                   | 來, 好好        | 和, 你
w_{j,−2}, w_{j,−1}, w_{j,1}          | 來, 好好, 出    | 和, 你, 一起
w_{j,1}, w_{j,2}                     | 出, 去          | 一起, 。
pos_{j,−2}                           | v               | p
pos_{j,−1}                           | z               | r
pos_{j,1}                            | v               | s
pos_{j,−2}, pos_{j,−1}               | v, z            | p, r
pos_{j,−1}, pos_{j,1}                | z, v            | r, s
pos_{j,1}, pos_{j,2}                 | v, v            | s, w
pos_{j,−2}, pos_{j,−1}, pos_{j,1}    | v, z, v         | p, r, s
w_{j,−1}, pos_{j,1}                  | 好好, v         | 你, s
pos_{j,−1}, w_{j,1}                  | z, 出           | r, 一起
pos_{j,−2}, pos_{j,−1}, w_{j,1}      | v, z, 出        | p, r, 一起

Table 2: Feature strings for the sentences “我們來好好地出去玩吧！” and “我只要和你在一起。”.
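As an illustration of how one candidate token expands into the templates of Table 2, here is a small sketch; the helper function is my own, not the paper’s feature extractor, and the (word, POS) window is hand-filled for the first example sentence.

def feature_strings(window):
    # window maps offsets -2..2 (relative to the candidate token) to (word, POS);
    # the POS tags below are hand-assigned for illustration only.
    w = {k: v[0] for k, v in window.items()}
    p = {k: v[1] for k, v in window.items()}
    templates = [
        w[-2], w[-1], w[1],
        (w[-2], w[-1]), (w[-2], w[-1], w[1]), (w[1], w[2]),
        p[-2], p[-1], p[1],
        (p[-2], p[-1]), (p[-1], p[1]), (p[1], p[2]), (p[-2], p[-1], p[1]),
        (w[-1], p[1]), (p[-1], w[1]), (p[-2], p[-1], w[1]),
    ]
    return [",".join(t) if isinstance(t, tuple) else t for t in templates]

# "我們/來/好好/地/出/去/玩/吧" with the candidate 地 at offset 0
window = {-2: ("來", "v"), -1: ("好好", "z"), 0: ("地", "u"),
          1: ("出", "v"), 2: ("去", "v")}
print(feature_strings(window))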

prefix[i] does not contain                                                  | prefix[i] contains                                        | w[i] | corrected w[i]
(媽 and 爸) or (母 and 父) or (女 and 男) or (她 and 他) or (太太 and 先生)     | 她 or 媽 or 母 or 女 or 妹 or 姊 or 姐 or 婆 or 阿姨 or 太太 | 他   | 她
她 or 媽 or 母 or 女 or 妹 or 姊 or 姐 or 婆 or 阿姨 or 太太                   | 他 or 爸 or 父 or 男 or 哥 or 先生                          | 她   | 他

Table 3: Specific rules for the pronouns “她、他” confusion.
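A minimal sketch of how the first row of Table 3 could be applied; the function name and cue lists are written out here for illustration and are not the authors’ released code.

FEMALE_CUES = ["她", "媽", "母", "女", "妹", "姊", "姐", "婆", "阿姨", "太太"]
MIXED_PAIRS = [("媽", "爸"), ("母", "父"), ("女", "男"), ("她", "他"), ("太太", "先生")]

def correct_ta(sentence, i):
    # First row of Table 3: rewrite 他 as 她 when the prefix holds a female
    # cue word and no mixed-gender pair that would make 他 plausible.
    if sentence[i] != "他":
        return None
    prefix = sentence[:i]
    has_cue = any(cue in prefix for cue in FEMALE_CUES)
    mixed = any(a in prefix and b in prefix for a, b in MIXED_PAIRS)
    return "她" if has_cue and not mixed else None

s = "雖然我不在我的國家，不能見到媽媽，可是我要給他打電話！"
print(correct_ta(s, s.index("他")))   # -> 她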

w[i]     | pos[i + 1] | corrected w[i]
阿       | w          | 啊
馬 or 碼 | w          | 嗎
門       | r, n       | 們
把       | r, n       | 吧

Table 4: Rule 1. The correction related to the right-neighbored POS tag.

5 Experiments

5.1 Data Sets and Resources

The proposed method is evaluated on the data sets of the SIGHAN Bake-off shared tasks in 2013 and 2014. In Bake-off 2013, the sentences were taken from formal written tests (Wu et al., 2013). In Bake-off 2014, the sentences were collected from essays3 of learners of Chinese as a foreign language (CFL) selected from the National Normal University, in traditional Chinese.

In Bake-off 2013, the essays were manually annotated with different labels (see Figure 1), and there is at most one error in each sentence. The development set in Bake-off 2014 is enlarged and its error types are more diverse (see Figure 2): more than one error might occur in each sentence, and there exist continuous errors.

Figure 1: A sample of an annotated essay in Bake-off 2013.

Figure 2: A sample of an annotated essay in Bake-off 2014.

3 http://www.cipsc.org.cn/clp2014/webpage/en/four_bakeoffs/Bakeoff2014cfp_ChtSpellingCheck_en.htm

Statistical information on the data sets is shown in Table 9. The three development sets are named Dev13, Dev14C and Dev14B, and the test set is named Test14.

w[i]     | suffix[i] contains                  | corrected w[i]
帶       | 帽, 眼鏡, 皮帶, 手環                 | 戴
負, 府   | 費, 錢, 經濟, 薪水                   | 付
做, 座   | 車, 巴士, 飛機, 捷運, 船, 高鐵        | 坐

Table 5: Rule 2. The correction related to the current word’s suffix.

w[i − 1]                       | w[i] | w[i + 1]                           | corrected w[i]
知                             | 到   | –                                  | 道
¬(內, 肝, 腎)                   | 臟   | –                                  | 髒
–                              | 總   | 於                                 | 終
–                              | 俄   | ¬(羅)                              | 餓
改                             | 以   | 改                                 | 一
¬(很)                          | 多   | 很                                 | 都
心                             | 理   | ¬(学, 研)                          | 裡
¬(一, 二, 這, 兩, 幾, 草, 壓)    | 根   | ¬(部, 本, 據, 源, 基, 治, 除)        | 跟

Table 6: Rule 3. The correction related to neighbored words.
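The ¬ entries make these rules context-sensitive. The following sketch shows how one Table 6 entry might be checked; the data structures are my own illustration of the rule semantics, not the system’s implementation.

def match_slot(slot, char):
    # slot is None for "–" (no constraint) or a pair (negated, set_of_chars).
    if slot is None:
        return True
    negated, chars = slot
    return (char in chars) != negated

# Row "¬(內, 肝, 腎) | 臟 | – | 髒": 臟 becomes 髒 unless an organ word precedes.
rule = {"left": (True, {"內", "肝", "腎"}), "char": "臟",
        "right": None, "corrected": "髒"}

def apply_rule(sent, i, rule):
    left = sent[i - 1] if i > 0 else ""
    right = sent[i + 1] if i + 1 < len(sent) else ""
    if (sent[i] == rule["char"] and match_slot(rule["left"], left)
            and match_slot(rule["right"], right)):
        return sent[:i] + rule["corrected"] + sent[i + 1:]
    return sent

print(apply_rule("房間很臟", 3, rule))   # -> 房間很髒
print(apply_rule("他的肝臟", 3, rule))   # unchanged: 肝 licenses 臟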

w[i − 2] | w[i − 1] | w[i]   | w[i + 1] | w[i + 2] | corrected w[i]
林       | 依       | 神     | –        | –        | 晨
鋼       | 鐵       | 依     | –        | –        | 衣
游       | 泳       | 世     | –        | –        | 池
星       | 期       | 路     | –        | –        | 六
西       | 門       | 丁     | –        | –        | 町
–        | –        | 很     | 不       | 得       | 恨
–        | –        | 仍     | 在       | 了       | 扔
–        | –        | 打     | 出       | 租       | 搭
–        | –        | 機     | 程       | 車       | 計
–        | –        | ¬(少)  | 子       | 化       | 少

Table 7: Rule 4. The correction related to two neighbored words.

w[i − 1] | w[i] | w[i + 1] | w[i + 2] | w[i + 3] | corrected w[i] and w[i + 1]
–        | 自   | 到       | –        | –        | 知道
–        | 式   | 式       | –        | –        | 試試
–        | 蘭   | 滿       | –        | –        | 浪漫
–        | 令   | 令       | –        | –        | 冷冷
–        | 排   | 排       | –        | –        | 拜拜
–        | 柏   | 柏       | –        | –        | 伯伯
–        | 莎   | 增       | –        | –        | 沙僧
–        | 玈   | 管       | –        | –        | 旅館
–        | 棒   | 組       | –        | –        | 幫助
–        | 想   | 心       | –        | –        | 相信
–        | 名   | 性       | –        | –        | 明星
–        | 頂   | 頂       | 大, 有   | 名       | 鼎鼎
–        | 白   | 花       | 商       | 店       | 百貨
為       | 是   | 嗎       | –        | –        | 什麼

Table 8: Rule 5. Two words are simultaneously corrected.
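Unlike Rules 1–4, Rule 5 rewrites two adjacent characters at once. Below is a sketch of one Table 8 entry, using the single row that also constrains w[i − 1]; the dictionary layout is my own illustration.

RULES5 = {("是", "嗎"): ("為", "什麼")}   # (w[i], w[i+1]) -> (required w[i-1], replacement)

def apply_rule5(sent, i):
    # Replace sent[i:i+2] when the pair and its left context match a rule.
    if i + 1 >= len(sent):
        return sent
    hit = RULES5.get((sent[i], sent[i + 1]))
    if hit:
        left_req, repl = hit
        if i > 0 and sent[i - 1] == left_req:
            return sent[:i] + repl + sent[i + 2:]
    return sent

print(apply_rule5("你為是嗎不來？", 2))   # -> 你為什麼不來？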

In Dev14B, there are 4624 errors; statistics for the three common character usage confusions of Section 4 are shown in Table 10, which is why it is necessary to deal with them separately.

The dictionary D used in the SSSP algorithm is the SogouW4 dictionary from Sogou Inc., which is in simplified Chinese; the OpenCC5 converter is used for simplified-to-traditional Chinese conversion. The similar character set C provided by (Liu et al., 2010) is used to substitute the original words in the graph construction stage. The LM is built on the corpus of (Emerson, 2005) with the IRSTLM toolkit (Federico et al., 2008). The CRF model is trained and tuned on the Academia Sinica corpus with the toolkit CRF++ 0.586. For Chinese word segmentation, ICTCLAS20117 is exploited.

4 http://www.sogou.com/labs/dl/w.html
5 http://code.google.com/p/opencc/
6 https://code.google.com/p/crfpp/downloads/list
7 http://www.ictclas.org/ictclas_download.aspx

Name            | Data             | Size (lines) | Character number (k)
Development set | Bake-off 2013    | 700          | 29
                | Bake-off 2014 C1 | 342          | 16
                | Bake-off 2014 B1 | 3004         | 149
Test set        | –                | 1062         | 53

Table 9: Statistical information of data sets.

Error Type | Number | Percent (%)
在, 再     | 101    | 2.18
的, 地, 得 | 398    | 8.61
她, 他     | 101    | 3.98

Table 10: Three common character usage confusions in Dev14B.

5.2 The Improved Graph Model

We treat the graph model without filters in Bake-off 2013 as our baseline in Bake-off 2014. The edge weight function is the linear combination of the similarity weight and the log conditional probability:

ω^L = ω_s − β log P,

where ω_0 ≡ 0 (and is therefore omitted in the equation), and the values of ω_s for the different kinds of characters are shown in Table 11. The LM is set to bigram according to (Yang et al., 2012), and the improved Kneser-Ney method is used for LM smoothing (Chen and Goodman, 1999).

Type                                  | ω_s
same pronunciation, same tone         | 1
same pronunciation, different tone    | 1
similar pronunciation, same tone      | 2
similar pronunciation, different tone | 2
similar shape                         | 2

Table 11: ω_s used in ω^L.

We first use the revised graph model of Section 3 to tackle the continuous word errors. The results achieved by the graph model and by its revision on Dev14B with different β are shown in Figure 3. We can see that the result of the revised graph model is not improved, and is even worse than the baseline. Therefore, for the improved graph model in Bake-off 2014, we retain the graph model of Bake-off 2013 without any modification.

Figure 3: Precision, recall and F1 curves over β on Dev14B for (a) the graph model and (b) the revised graph model.
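For concreteness, the edge weight ω^L above can be computed as in the following sketch; the bigram probability is a toy stand-in for the IRSTLM model, and the substitution-type names merely mirror Table 11.

import math

OMEGA_S = {"same pronunciation, same tone": 1,
           "same pronunciation, different tone": 1,
           "similar pronunciation, same tone": 2,
           "similar pronunciation, different tone": 2,
           "similar shape": 2}

def edge_weight(word, prev_word, sub_type, beta, bigram_prob):
    # ω^L = ω_s − β·log P; ω_s is 0 for words kept verbatim (ω_0 ≡ 0).
    omega_s = OMEGA_S.get(sub_type, 0)
    return omega_s - beta * math.log(bigram_prob(prev_word, word))

toy_lm = lambda prev, w: 0.01        # stand-in for P(w | prev)
print(edge_weight("健康", "的", "similar pronunciation, different tone", 6, toy_lm))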

We utilize the correction precision (P), the correction recall (R) and the F1 score (F) as the metrics. The computational formulas are as follows:

• Correction precision: P = (number of correctly corrected characters) / (number of all corrected characters);

• Correction recall: R = (number of correctly corrected characters) / (number of wrong characters in the gold data);

• F1 macro: F = 2PR / (P + R).
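The following few lines restate these formulas executably; the counts are made up, chosen only so that the output matches the baseline F on Dev13 in Table 14.

def correction_metrics(correctly_corrected, corrected, gold_wrong):
    p = correctly_corrected / corrected
    r = correctly_corrected / gold_wrong
    return p, r, 2 * p * r / (p + r)

# Illustrative counts giving P = 0.8, R = 0.6, F ≈ 0.686 (cf. Table 14).
p, r, f = correction_metrics(120, 150, 200)
print(round(p, 3), round(r, 3), round(f, 3))   # 0.8 0.6 0.686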

To observe the performance of the improved graph model in detail, we report results on the three development sets Dev13, Dev14C and Dev14B under the following settings:

1. CRF. We use the CRF model to process the common character usage confusions “在” (at) (pinyin: zai), “再” (again, more, then) (pinyin: zai) and “的” (of) (pinyin: de), “地” (-ly, adverb-forming particle) (pinyin: de), “得” (have to, get, obtain) (pinyin: de) on all development sets. The results achieved by the CRF model are shown in Table 12.

Development set | P     | R     | F
Dev13           | 0.060 | 0.014 | 0.023
Dev14C          | 0.718 | 0.162 | 0.264
Dev14B          | 0.549 | 0.072 | 0.128

Table 12: The results of the CRF model.

2. Rule. The rule based system is carried out on the development sets to solve the fixed collocation errors. The results achieved by the rule based system are shown in Table 13.

Development set | P     | R     | F
Dev13           | 0.111 | 0.034 | 0.052
Dev14C          | 0.583 | 0.076 | 0.135
Dev14B          | 0.766 | 0.253 | 0.380

Table 13: The results of the rule based system.

3. Graph+CRF. In this setting, the graph model with different β in ω^L is performed on the CRF results. For each development set, an optimal β can be found that yields the best performance.

4. CRF+Graph+Rule_Post. Based on the results of the Graph+CRF model, we add the rule based system. Similarly, the optimal β can be found.

5. CRF+Rule_Pre+Graph. Different from the third setting, we first utilize the rule based system on the development sets, and then use the graph model with different β in ω^L.

6. CRF+Rule_Pre+Graph+Rule_Post. Based on the results of the CRF+Rule_Pre+Graph model, we add the rule based system at the end.

In Table 14, we compare the different improved graph models on the development sets, with β set to 6 in ω^L. We find that though the results of the improved graph model decline somewhat on Dev13, the results on both Dev14C and Dev14B are improved. The results in Table 14 show that the CRF model and the rule based system are effective in covering the shortcomings of the graph model.

Model                    | Dev13 P / R / F       | Dev14C P / R / F      | Dev14B P / R / F
Graph (baseline)         | 0.802 / 0.6 / 0.686   | 0.790 / 0.238 / 0.366 | 0.729 / 0.2 / 0.314
+CRF                     | 0.623 / 0.6 / 0.611   | 0.75 / 0.38 / 0.504   | 0.631 / 0.282 / 0.389
+CRF+Rule_Post           | 0.512 / 0.614 / 0.558 | 0.723 / 0.421 / 0.532 | 0.699 / 0.461 / 0.555
+CRF+Rule_Pre            | 0.526 / 0.614 / 0.567 | 0.75 / 0.38 / 0.504   | 0.706 / 0.479 / 0.571
+CRF+Rule_Pre+Rule_Post  | 0.51 / 0.611 / 0.556  | 0.723 / 0.421 / 0.532 | 0.706 / 0.484 / 0.574

Table 14: The results with different models.

5.3 Results

In Bake-off 2014, we submitted 3 runs, using the CRF+Rule_Pre+Graph model and the weight function ω^L with β set to 0, 6, and 10, respectively. The results on Test14 are listed in Table 15.

Metric               | Run1   | Run2   | Run3
False Positive Rate  | 0.5951 | 0.2279 | 0.1921
Detection Accuracy   | 0.3117 | 0.5471 | 0.5367
Detection Precision  | 0.2685 | 0.5856 | 0.5802
Detection Recall     | 0.2185 | 0.322  | 0.2655
Detection F1-Score   | 0.2409 | 0.4156 | 0.3643
Correction Accuracy  | 0.2938 | 0.5377 | 0.5311
Correction Precision | 0.2349 | 0.5709 | 0.5696
Correction Recall    | 0.1827 | 0.3032 | 0.2542
Correction F1-Score  | 0.2055 | 0.3961 | 0.3516

Table 15: Official results of Bake-off 2014.

6 Conclusion

In this paper we present an improved graph model to deal with the Chinese spell checking problem. The model includes a graph model and two independently-trained models. To begin with, the graph model is utilized to solve the generic spell checking problem, and the SSSP algorithm is adopted as the model implementation. Furthermore, a CRF model and a rule based system are used to cover the shortcomings of the graph model. The effectiveness of the proposed model is verified on the data released by the SIGHAN Bake-off 2014 shared task, and our system gives competitive results according to the official evaluation.

References

Farooq Ahmad and Grzegorz Kondrak. 2005. Learning a spelling error model from search query logs. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 955–962, Vancouver, British Columbia, Canada, October.

Richard G. Casey and Eric Lecolinet. 1996. A survey of methods and strategies in character segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7):690–706.

Chaohuang Chang. 1995. A new approach for automatic Chinese spelling correction. In Proceedings of the Natural Language Processing Pacific Rim Symposium, pages 278–283, Seoul, Korea.

Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–393.

Kuanyu Chen, Hungshin Lee, Chunghan Lee, Hsinmin Wang, and Hsinhsi Chen. 2013. A study of language modeling for Chinese spelling check. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 79–83, Nagoya, Japan, October.

Hsunwen Chiu, Jiancheng Wu, and Jason S. Chang. 2013. Chinese spelling checker based on statistical machine translation. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 49–53, Nagoya, Japan, October.

Thomas Emerson. 2005. The second international Chinese word segmentation Bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 123–133, Jeju Island, Korea.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of the 9th Annual Conference of the International Speech Communication Association, pages 1618–1621, Brisbane, Australia.

Jianfeng Gao, Xiaolong Li, Daniel Micol, Chris Quirk, and Xu Sun. 2010. A large scale ranker-based system for search query spelling correction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 358–366, Beijing, China, August.

Dongxu Han and Baobao Chang. 2013. A maximum entropy approach to Chinese spelling check. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 74–78, Nagoya, Japan, October.
Yu He and Guohong Fu. 2013. Description of HLJU Chinese spelling checker for SIGHAN Bakeoff 2013. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 84–87, Nagoya, Japan, October.

Yuming Hsieh, Minghong Bai, and Kehjiann Chen. 2013. Introduction to CKIP Chinese spelling check system for SIGHAN Bakeoff 2013 evaluation. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 59–63, Nagoya, Japan, October.

Zhongye Jia, Peilu Wang, and Hai Zhao. 2013a. Grammatical error correction as multiclass classification with single model. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 74–81, Sofia, Bulgaria, August.

Zhongye Jia, Peilu Wang, and Hai Zhao. 2013b. Graph model for Chinese spell checking. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 88–92, Nagoya, Japan, October.

Chuanjie Lin and Weicheng Chu. 2013. NTOU Chinese spelling check system in SIGHAN Bake-off 2013. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 102–107, Nagoya, Japan, October.

Chaolin Liu, Minhua Lai, Yihsuan Chuang, and Chiaying Lee. 2010. Visually and phonologically similar characters in incorrect simplified Chinese words. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 739–747, Beijing, China, August.

Xiaodong Liu, Kevin Cheng, Yanyan Luo, Kevin Duh, and Yuji Matsumoto. 2013. A hybrid Chinese spelling correction using language model and statistical machine translation with reranking. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 54–58, Nagoya, Japan, October.

Xu Sun, Jianfeng Gao, Daniel Micol, and Chris Quirk. 2010. Learning phrase-based spelling error models from clickthrough data. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 266–274, Uppsala, Sweden, July.

Peilu Wang, Zhongye Jia, and Hai Zhao. 2014. Grammatical error detection and correction using a single maximum entropy model. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 74–82, Baltimore, Maryland, June.

Shihhung Wu, Chaolin Liu, and Lunghao Lee. 2013. Chinese spelling check evaluation at SIGHAN Bake-off 2013. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 35–42, Nagoya, Japan, October.

Shaohua Yang, Hai Zhao, Xiaolin Wang, and Baoliang Lu. 2012. Spell checking for Chinese. In Proceedings of the International Conference on Language Resources and Evaluation, pages 730–736, Istanbul, Turkey, May.

Tinghao Yang, Yulun Hsieh, Yuhsuan Chen, Michael Tsang, Chengwei Shih, and Wenlian Hsu. 2013. Sinica-IASL Chinese spelling check system at SIGHAN-7. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 93–96, Nagoya, Japan, October.

Juifeng Yeh, Shengfeng Li, Meirong Wu, Wenyi Chen, and Maochuan Su. 2013. Chinese word spelling correction based on N-gram ranked inverted index list. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 43–48, Nagoya, Japan, October.

Liangchih Yu, Chaohong Liu, and Chunghsien Wu. 2013. Candidate scoring using web-based measure for Chinese spelling error correction. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 108–112, Nagoya, Japan, October.

Huaping Zhang, Hongkui Yu, Deyi Xiong, and Qun Liu. 2003. HHMM-based Chinese lexical analyzer ICTCLAS. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 184–187, Sapporo, Japan.

Hai Zhao and Chunyu Kit. 2007. Incorporating global information into supervised learning for Chinese word segmentation. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, pages 66–74, Melbourne, Australia.

Hai Zhao and Chunyu Kit. 2008a. An empirical comparison of goodness measures for unsupervised Chinese word segmentation with a unified framework. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 9–16, Hyderabad, India.

Hai Zhao and Chunyu Kit. 2008b. Unsupervised segmentation helps supervised learning of character tagging for word segmentation and named entity recognition. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 106–111, Hyderabad, India.

Hai Zhao and Chunyu Kit. 2011. Integrating unsupervised and supervised word segmentation: The role of goodness measures. Information Sciences, 181(1):163–183.

Hai Zhao, Chang-Ning Huang, and Mu Li. 2006a. An improved Chinese word segmentation system with conditional random field. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, volume 1082117, pages 162–165, Sydney, Australia, July.

Hai Zhao, Chang-Ning Huang, Mu Li, and Bao-Liang Lu. 2006b. Effective tag set selection in Chinese word segmentation via conditional random field modeling. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, volume 20, pages 87–94, Wuhan, China.

Hai Zhao, Chang-Ning Huang, Mu Li, and Bao-Liang Lu. 2010. A unified character-based tagging framework for Chinese word segmentation. ACM Transactions on Asian Language Information Processing, 9(2):1–32.

Hai Zhao, Xiaotian Zhang, and Chunyu Kit. 2013. Integrative semantic dependency parsing via efficient large-scale feature selection. Journal of Artificial Intelligence Research, 46:203–233.