A New Statistical Approach to Chinese Pinyin Input
Zheng Chen, Kai-Fu Lee
Microsoft Research China
No. 49 Zhichun Road, Haidian District, 100080, China
[email protected], [email protected]

Abstract

Chinese input is one of the key challenges for Chinese PC users. This paper proposes a statistical approach to Pinyin-based Chinese input. The approach uses a trigram-based language model and a statistically based segmentation. To deal with real input, it also includes a typing model, which enables spelling correction in sentence-based Pinyin input, and a spelling model for English, which enables modeless Pinyin input.

1. Introduction

The Chinese input method is one of the most difficult problems for Chinese PC users. There are two main categories of Chinese input method. One is the shape-based input method, such as "wu bi zi xing"; the other is the Pinyin, or pronunciation-based, input method, such as "Chinese CStar", "MSPY", etc. Because it is easy to learn and to use, Pinyin is the most popular Chinese input method: over 97% of the users in China use Pinyin for input (Chen Yuan 1997). Although the Pinyin input method has many advantages, it also suffers from several problems, including Pinyin-to-character conversion errors, user typing errors, and UI problems such as the need for two separate modes when typing Chinese and English.

A Pinyin-based method automatically converts Pinyin to Chinese characters. However, there are only about 406 syllables, and they correspond to over 6,000 common Chinese characters, so it is very difficult for a system to select the correct characters automatically. Higher accuracy can be achieved with sentence-based input: a sentence-based input method chooses characters using a language model based on context, so its accuracy is higher than that of a word-based input method. In this paper, all the technology is based on the sentence-based input method, but it can easily be adapted to word-based input.

In our approach we use a statistical language model to achieve very high accuracy. We design a unified approach to Chinese statistical language modelling, which enhances trigram-based statistical language modelling with automatic, maximum-likelihood-based methods to segment words, select the lexicon, and filter the training data. Compared to a commercial product, our system's error rate is up to 50% lower at the same memory size, and about 76% better with no memory limits (Jianfeng et al. 2000).
However, sentence-based input methods have problems of their own. One is that the system assumes the user's input is perfect; in reality, user input contains many typing errors, and these cause many system errors. Another is that, in order to type both English and Chinese, the user has to switch between two modes, which is cumbersome. In this paper, a new typing model is proposed to solve these problems. The system accepts correct typing but also tolerates common typing errors. Furthermore, the typing model is combined with a probabilistic spelling model for English, which measures how likely it is that the input sequence is an English word. Both models run in parallel, guided by a Chinese language model, to output the most likely sequence of Chinese and/or English characters.

The organization of this paper is as follows. In the second section, we briefly discuss the Chinese language model used by the sentence-based input method. In the third section, we introduce a typing model to deal with typing errors made by the user. In the fourth section, we propose a spelling model for English, which discriminates between Pinyin and English. Finally, we give some conclusions.
2. Chinese Language Model

Pinyin input is the most popular form of text input in Chinese. Basically, the user types a phonetic spelling with optional spaces, like:

    woshiyigezhongguoren

and the system converts this string into a string of Chinese characters, like:

    我是一个中国人 (I am a Chinese)

A sentence-based input method chooses the most probable Chinese words according to the context. In our system, a statistical language model is used to provide adequate information to predict the probabilities of hypothesized Chinese word sequences.

In the conversion of Pinyin to Chinese characters, for a given Pinyin sequence P, the goal is to find the most probable Chinese character sequence H, so as to maximize Pr(H | P). Using Bayes' law, we have:

    Ĥ = argmax_H Pr(H | P) = argmax_H Pr(P | H) Pr(H) / Pr(P)    (2.1)

The problem is thus divided into two parts: the typing model Pr(P | H) and the language model Pr(H). Conceptually, all H's are enumerated, and the one that gives the largest Pr(H, P) is selected as the best Chinese character sequence. In practice, efficient methods such as Viterbi beam search (Kai-Fu Lee 1989; Chin-hui Lee 1996) are used.

The Chinese language model in equation 2.1, Pr(H), measures the a priori probability of a Chinese word sequence; it is usually determined by a statistical language model (SLM), such as a trigram LM. Pr(P | H), called the typing model, measures the probability that the Chinese word sequence H is typed as Pinyin P. Usually H is a combination of Chinese words and can be decomposed into w_1, w_2, ..., w_n, where each w_i can be a Chinese word or a Chinese character. The typing model can then be rewritten as equation 2.2:

    Pr(P | H) ≈ ∏_{i=1..n} Pr(P_f(i) | w_i)    (2.2)

where P_f(i) is the Pinyin of w_i.
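The noisy-channel decoding of equations 2.1 and 2.2 can be sketched in a few lines. The following is a minimal illustration, not the paper's system: it uses a toy bigram language model instead of the trigram described later, and the lexicon, probabilities, and beam width are all invented for the example.

```python
# A minimal sketch of noisy-channel Pinyin-to-character conversion:
# score = log Pr(P|H) + log Pr(H), searched with a Viterbi-style beam.
# The lexicon and bigram probabilities below are toy values, not real data.
import math

# Toy lexicon: pinyin syllable -> {candidate character: Pr(pinyin | character)}.
LEXICON = {
    "wo":  {"我": 1.0},
    "shi": {"是": 1.0, "十": 1.0},
    "yi":  {"一": 1.0, "以": 1.0},
    "ge":  {"个": 1.0},
}

# Toy bigram LM Pr(w_i | w_{i-1}); "<s>" marks the sentence start.
BIGRAM = {("<s>", "我"): 0.5, ("我", "是"): 0.6, ("是", "一"): 0.4, ("一", "个"): 0.7}
FLOOR = 1e-4  # back-off probability for unseen bigrams

def convert(pinyin_syllables, beam_width=5):
    """Return the most probable character sequence for a syllable list."""
    beam = [(0.0, [])]  # hypotheses: (log probability, characters so far)
    for syl in pinyin_syllables:
        new_beam = []
        for logp, chars in beam:
            prev = chars[-1] if chars else "<s>"
            for ch, p_typing in LEXICON.get(syl, {}).items():
                p_lm = BIGRAM.get((prev, ch), FLOOR)
                # accumulate log Pr(P|H) + log Pr(H) incrementally
                new_beam.append((logp + math.log(p_typing) + math.log(p_lm),
                                 chars + [ch]))
        # prune to the best few hypotheses (beam search)
        beam = sorted(new_beam, key=lambda h: h[0], reverse=True)[:beam_width]
    return "".join(beam[0][1])

print(convert(["wo", "shi", "yi", "ge"]))  # → 我是一个
```

With a wide enough beam this search is exact for the bigram case; the pruning only matters when the candidate sets are large, as they are with a real 6,000-character inventory.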
The most widely used statistical language models are the so-called n-gram Markov models (Frederick 1997); bigrams or trigrams are commonly used as the SLM. For English, trigrams are widely used, and with a large training corpus trigrams also work well for Chinese. Many articles from newspapers and the web were collected for training, and new filtering methods were used to select a balanced corpus for building the trigram model; the result is a powerful language model. In practice, perplexity (Kai-Fu Lee 1989; Frederick 1997) is used to evaluate the SLM, as in equation 2.3:

    PP = 2^{-(1/N) Σ_{i=1..N} log P(w_i | w_{i-1})}    (2.3)

where N is the length of the testing data. The perplexity can be roughly interpreted as the geometric mean of the branching factor of the document when presented to the language model; clearly, lower perplexities are better.

We built a cross-domain, general trigram word SLM for Chinese, trained on 1.6 billion characters of training data. Evaluating the perplexity of this system across seven different domains, we found an average per-character perplexity of 34.4. We also evaluated the system on Pinyin-to-character conversion. Compared to a commercial product, our system's error rate is up to 50% lower at the same memory size, and about 76% better with no memory limits (Jianfeng et al. 2000).
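The perplexity computation of equation 2.3 can be sketched directly; this is a toy illustration, with an invented uniform model standing in for a trained trigram, that makes the "branching factor" interpretation concrete.

```python
# Sketch of equation 2.3: PP = 2^(-(1/N) * sum_i log2 P(w_i | w_{i-1})).
# cond_prob is any conditional model P(w | prev); here we use a toy one.
import math

def perplexity(test_words, cond_prob):
    """Compute the perplexity of a word sequence under a bigram-style model."""
    n = len(test_words)
    log_sum = 0.0
    prev = "<s>"  # sentence-start context
    for w in test_words:
        log_sum += math.log2(cond_prob(w, prev))
        prev = w
    return 2 ** (-log_sum / n)

# A uniform model over a 32-word vocabulary gives perplexity exactly 32,
# i.e. the model faces an effective branching factor of 32 at each word.
uniform = lambda w, prev: 1.0 / 32
print(perplexity(["a", "b", "c", "d"], uniform))  # → 32.0
```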
3. Spelling Correction

3.1 Typing Errors

The sentence-based approach converts Pinyin into Chinese words, but it assumes correct Pinyin input. Erroneous input causes errors to propagate through the conversion. The problem is serious for Chinese users because:

1. Chinese users do not type Pinyin as frequently as American users type English.
2. There are many dialects in China. Many people do not speak the standard Mandarin dialect, on which Pinyin is based. For example, people in the southern area of China often do not distinguish 'zh'-'z', 'sh'-'s', 'ch'-'c', 'ng'-'n', etc.
3. It is more difficult to check for errors while typing Pinyin for Chinese, because Pinyin typing is not WYSIWYG. Preview experiments showed that people usually do not check the Pinyin for errors, but wait until the Chinese characters start to show up.

According to statistical data from psychology (William 1983), the most frequent errors made by users can be classified into the following types:

1. Substitution errors: the user types one key instead of another. These are mainly caused by the layout of the keyboard: the correct character is replaced by one immediately adjacent and in the same row, accounting for 43% of typing errors. Substitutions of a neighbouring letter from the same column (column errors) account for 15%, and substitution of the homologous (mirror-image) letter, typed by the same finger in the same position but with the wrong hand, accounts for 10% of the errors overall (William 1983).
2. Insertion errors: the typist inserts extra keys into the letter sequence. One cause is the layout of the keyboard; dialect differences can also result in insertion errors.
3. Deletion errors: some keys are omitted while typing.
4. Other typing errors: all errors except those mentioned above, for example transposition errors, i.e. the reversal of two adjacent letters.

3.2 Spelling Correction

In traditional statistical Pinyin-to-character conversion systems, Pr(P_f(i) | w_i), as introduced in equation 2.2, is usually set to 1 if P_f(i) is an acceptable spelling of the word w_i, and to 0 if it is not. These systems therefore rely exclusively on the language model to carry out the conversion, and have no tolerance for any variability in the Pinyin input.
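The contrast between the conventional 0/1 typing model and an error-tolerant one can be sketched as follows. This is an illustration only: the confusion table and all probability values are invented, whereas the paper's model is trained from real typing data.

```python
# Sketch: a binary typing model assigns Pr(pinyin|word) = 1 only to exact
# spellings, while an error-tolerant model keeps nonzero probability for
# common confusions (e.g. the southern 'zh'/'z' merger). Values are invented.

def binary_typing_model(pinyin, canonical):
    """Traditional model: 1 for an acceptable spelling, 0 otherwise."""
    return 1.0 if pinyin == canonical else 0.0

# Hypothetical learned substitution probabilities for syllable initials.
CONFUSIONS = {("zh", "z"): 0.15, ("sh", "s"): 0.12, ("ch", "c"): 0.13}

def tolerant_typing_model(pinyin, canonical):
    """Error-tolerant model: an exact match keeps most of the probability
    mass, and a known initial confusion keeps a smaller learned share."""
    if pinyin == canonical:
        return 0.9  # mass reserved for correct typing
    for (full, reduced), p in CONFUSIONS.items():
        if canonical.startswith(full) and pinyin == reduced + canonical[len(full):]:
            return p
    return 0.0

# Typing 'zong' for the intended 'zhong' is impossible under the binary
# model but survives with reduced probability under the tolerant one.
print(binary_typing_model("zong", "zhong"))    # → 0.0
print(tolerant_typing_model("zong", "zhong"))  # → 0.15
```

Under the binary model the language model alone must rescue such an input; under the tolerant model the intended reading can still win the argmax in equation 2.1.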
But this can only We use models learned from psychology, address a small fraction of typing errors but train the model parameters from real data, because it is not data-driven (learned from real similar to training acoustic model for speech typing errors).