Neural Network Language Model for Chinese Pinyin Input Method Engine
Shen-Yuan Chen (1,2), Rui Wang (1,2) and Hai Zhao (1,2)*
(1) Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
(2) Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
[email protected], [email protected] and [email protected]
* Corresponding author

Abstract

Neural network language models (NNLMs) have been shown to outperform traditional n-gram language models. However, the high computational cost of NNLMs is the main obstacle to integrating them directly into a pinyin input method engine (IME), which normally requires a real-time response. In this paper, an efficient solution is proposed: NNLMs are converted into back-off n-gram language models, and the converted NNLM is integrated into a pinyin IME. Our experimental results show that the proposed method gives better predictive performance for pinyin IME decoding with satisfactory efficiency.

1 Introduction

An input method engine (IME) is the encoding method used to input a variety of characters into a computer or other information processing and communication devices, such as mobile phones. People who work with computers can hardly live without IMEs. With the development and improvement of IMEs, users pay more and more attention to their efficiency and usability. Since there are thousands of Chinese characters but only 26 English letters on a standard keyboard, Chinese characters cannot be input directly by simply hitting keys. A mapping or encoding from Chinese characters to English letters is required, and an IME is the system software that performs this mapping (Chen and Lee, 2000; Wu et al., 2009; Zhao et al., 2006). Among various encoding schemes, most IMEs adopt Chinese pinyin, the phonetic representation of Chinese characters in Latin (English) letters. The advantage of pinyin IMEs is that they only require knowledge of pinyin rules, which are known to most educated Chinese users. Being easy to learn and use, the pinyin IME is becoming the first choice of more and more Chinese users.

However, there are fewer than 500 pinyin syllables in standard Chinese (Mandarin) but over 6,000 common Chinese characters, which introduces huge ambiguity into the pinyin-to-character mapping (Jia and Zhao, 2014; Xu and Zhao, 2012; Zhang et al., 2014). Modern pinyin IMEs therefore commonly use sentence-based decoding (Chen and Lee, 2000; Zhang et al., 2012), that is, generating a whole Chinese sentence from a pinyin sequence for disambiguation. The decoding usually works with the help of a statistical language model or other techniques.

Back-off n-gram language models (BNLMs) (Chen and Goodman, 1996; Chen and Goodman, 1999; Stolcke, 2002a) have been broadly used for pinyin IMEs because of their simplicity and efficiency. Recently, continuous-space language models, especially neural network language models (NNLMs) (Bengio et al., 2003; Schwenk, 2007; Le et al., 2010), have been used in more and more natural language processing tasks, and they practically outperform BNLMs in prediction accuracy. However, to the best of our knowledge, NNLMs have not yet been considered for IMEs. The main obstacle is that NNLMs are too time-consuming to meet the real-time response requirement of a human-computer interface.

Indeed, the long computation time makes the direct integration of NNLMs into pinyin IMEs infeasible: we can hardly imagine users waiting more than 10 seconds for a result after typing a pinyin sequence, so another way has to be found. Although some work has reduced the decoding time of NNLMs, such as (Arisoy et al., 2014), (Vaswani et al., 2013) and (Devlin et al., 2014), these methods are mainly designed for speech recognition or machine translation and cannot be integrated into an IME directly.

Instead of replacing BNLMs with NNLMs in the IME, we propose to use NNLMs to enhance BNLMs, that is, to recalculate the probabilities of the n-grams in a BNLM with an NNLM. In this way, we take advantage of the probabilities provided by the NNLM without increasing the on-line computational cost. Furthermore, we will demonstrate that our method can indeed improve the prediction performance of pinyin IMEs.

In Section 2, we introduce the typical pinyin IME model and explain how language models work in candidate sentence generation. In Section 3, we describe the basic concepts of BNLMs and NNLMs, and then our approach of converting NNLMs into BNLMs. In Sections 4 and 5, we present experiments and conclude.
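To make the proposed enhancement concrete before it is developed in Section 3, the following Python sketch rescores the n-gram entries of a back-off model with probabilities queried from an NNLM. This is a minimal illustration under assumed data structures (a dict from n-gram tuples to log probabilities) and an assumed nnlm_logprob query function, not the authors' implementation; a complete conversion would also renormalize the model and recompute back-off weights.

```python
def rescore_bnlm_with_nnlm(bnlm_ngrams, nnlm_logprob):
    """Re-estimate the stored n-gram probabilities of a back-off model
    with an NNLM, so that the on-line decoder keeps doing cheap table
    look-ups while benefiting from neural probability estimates.

    bnlm_ngrams:  dict mapping an n-gram tuple, e.g. ("we", "propose"),
                  to its log probability in the original BNLM
    nnlm_logprob: function (history_tuple, word) -> log probability of
                  the word under the NNLM
    """
    rescored = {}
    for ngram in bnlm_ngrams:
        history, word = ngram[:-1], ngram[-1]
        # The NNLM is queried once per stored n-gram, offline; no neural
        # computation happens at typing time.
        rescored[ngram] = nnlm_logprob(history, word)
    return rescored
```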
2 Pinyin IME Model

Basically, the core engine of an IME is a pipeline of three parts: pinyin segmentation, candidate word fetching and candidate sentence generation.

Pinyin segmentation is a word segmentation task, though not as typical as Chinese word segmentation. Since pinyin has a very small vocabulary of about 500 legal syllables, rule-based segmentation methods are widely used, e.g., a backward maximum matching algorithm with additional rules (Goh et al., 2005). Carefully written rules can deal with most ambiguous cases.

Pinyin segmentation breaks the continuous user input into separate pinyin syllables and passes them on to the next stage, candidate word fetching. This is a table look-up task that finds the Chinese words corresponding to the pinyin syllables: a table of candidate words is built according to the pinyin syllables, and each column of the table holds the words corresponding to one syllable, sorted by their probability of occurrence.

The most important part of the pinyin IME is candidate sentence generation, into which we put much of our effort. A language model is commonly used to generate the most likely sentence by providing the conditional probability of a word given its context (Chen and Lee, 2000; Zhao et al., 2013). Sentence generation is thus the task of finding the sentence with the maximum probability that can be constructed from the previously fetched candidate words.

Candidate sentence generation can be regarded as a decoding problem. The goal is to find the most likely Chinese word sequence W^*, i.e., the one with the highest joint probability:

W^* = \arg\max_W P(W \mid S)
    = \arg\max_W \frac{P(W)\,P(S \mid W)}{P(S)}
    = \arg\max_W P(W)\,P(S \mid W)
    = \arg\max_{w_1, w_2, \ldots, w_M} \prod_{w_i} P(w_i \mid w_{i-1}) \prod_{w_i} P(s_i \mid w_i),

where S is the input pinyin syllable sequence, P(s_i | w_i) is the conditional probability mapping a word to its pinyin syllable, and P(w_i | w_{i-1}) is the transition probability. Here P(s_i | w_i) is 1 if the word w_i corresponds to the pinyin syllable s_i and 0 otherwise. Note that in practice the transition probability is P(w_i | w_{i-(n-1)}, ..., w_{i-1}), provided by the language model. We use the Viterbi algorithm (Viterbi, 1967) to decode the character sentence. This problem is the same as finding the shortest path in the candidate word lattice (Forney, 1973), which is shown in Figure 1, with all probabilities determined by the language models. Hence, we have reason to believe that a better language model makes better candidate sentences.

Figure 1: Candidate Sentence Generation
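To illustrate the decoding step above, here is a minimal Python sketch of a first-order Viterbi search over the candidate word lattice. The candidate table and the bigram scoring function are hypothetical stand-ins for the real language model, and pruning, unknown-word handling and higher-order histories are omitted.

```python
def viterbi_decode(syllables, candidates, bigram_logprob):
    """Decode the most likely word sequence for a pinyin syllable sequence.

    syllables:      list of pinyin syllables, e.g. ["bei", "jing"]
    candidates:     dict mapping each syllable to its candidate Chinese words
    bigram_logprob: function (prev_word, word) -> log P(word | prev_word)
    """
    # stages[t] maps each candidate word at position t to
    # (score of the best path ending in that word, best previous word).
    stages = [{"<s>": (0.0, None)}]
    for syl in syllables:
        stage = {}
        for word in candidates[syl]:
            # P(s_i | w_i) is 1 for words in the candidate list and 0
            # otherwise, so only transition probabilities are scored here.
            score, prev = max(
                (s + bigram_logprob(p, word), p)
                for p, (s, _) in stages[-1].items()
            )
            stage[word] = (score, prev)
        stages.append(stage)
    # Trace back from the best final word to recover the sentence.
    word = max(stages[-1], key=lambda w: stages[-1][w][0])
    sentence = []
    for stage in reversed(stages[1:]):
        sentence.append(word)
        word = stage[word][1]
    return list(reversed(sentence))
```

The search maximizes the sum of log transition probabilities along the lattice, which corresponds to the arg-max formula above once P(s_i | w_i) is folded into the construction of the candidate lists.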
3 Our Approach

This section introduces the proposed method, which uses NNLMs to enhance BNLMs.

3.1 Back-off n-gram language model

A BNLM is composed of the conditional probabilities P(w_i | h_i), i.e., the probability of word w_i given the history h_i of the preceding n - 1 words. Its advantage comes from its simplicity. However, it suffers from data sparseness: any history h_i that does not appear in the training data would receive zero probability. Therefore, smoothing techniques are needed to alleviate this shortcoming. In the case of interpolated Kneser-Ney smoothing (Chen and Goodman, 1999), the probability under the BNLM, P_b(w_i | h_i), is

P_b(w_i \mid h_i) = \hat{P}_b(w_i \mid h_i) + \gamma(h_i)\, P_b(w_i \mid w_{i-n+2}^{i-1}).

3.2 Neural network language model

An NNLM provides the conditional probabilities P(w_i | h_i) given the histories of the preceding words, as shown in Figure 2 (Schwenk, 2013). A typical NNLM consists of four layers: an input layer, a projection layer, a hidden layer and an output layer. The input layer represents the history of n - 1 words using one-hot coding. Since this coding makes the input layer very large and sparse, a shared matrix is used to project the high-dimensional input layer to the low-dimensional projection layer. After that, the non-linear hidden layer (commonly sigmoid or hyperbolic tangent), usually with hundreds of neurons, and the non-linear output layer (commonly softmax), with the same size as the full vocabulary, jointly calculate the probability of each word in the vocabulary (Schwenk, 2007; Wang et al., 2013b; Wang et al., 2013c).

Figure 2: Neural Network Language Model

Since NNLMs calculate the probabilities of n-grams in a continuous space, they can estimate the probability of any possible n-gram without the zero-probability problem of BNLMs; hence there is no need to back off to a shorter history. Experiments have shown that an NNLM has lower perplexity than a BNLM trained on the same corpus (Schwenk, 2010; Huang et al., 2013). However, the computational complexity of NNLMs is quite high.
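To make the four-layer architecture concrete, here is a minimal NumPy sketch of the forward pass of such a feed-forward NNLM (shared projection, tanh hidden layer, softmax output). The layer sizes, parameter names and the omission of bias terms are simplifying assumptions for illustration, not the configuration used in the paper.

```python
import numpy as np

class FeedForwardNNLM:
    """Minimal feed-forward NNLM: one-hot input -> projection -> tanh -> softmax."""

    def __init__(self, vocab_size, n=3, proj_dim=100, hidden_dim=200, seed=0):
        rng = np.random.default_rng(seed)
        self.n = n
        # The shared projection matrix C replaces an explicit, huge one-hot
        # input layer: a row lookup equals multiplying by a one-hot vector.
        self.C = rng.normal(0.0, 0.1, (vocab_size, proj_dim))
        self.H = rng.normal(0.0, 0.1, ((n - 1) * proj_dim, hidden_dim))
        self.U = rng.normal(0.0, 0.1, (hidden_dim, vocab_size))

    def probabilities(self, history_ids):
        """history_ids: indices of the n-1 preceding words.
        Returns P(w | history) for every word in the vocabulary."""
        x = np.concatenate([self.C[i] for i in history_ids])  # projection layer
        h = np.tanh(x @ self.H)                               # hidden layer
        logits = h @ self.U                                   # output layer
        e = np.exp(logits - logits.max())                     # stable softmax
        return e / e.sum()

# Example: a trigram NNLM over a 6,000-word vocabulary, queried with a
# two-word history (word indices are arbitrary here).
model = FeedForwardNNLM(vocab_size=6000)
p = model.probabilities([17, 342])
```

The softmax over the full vocabulary in the last step is exactly the expensive operation that motivates converting the NNLM into a back-off model for on-line use.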