KNPTC: Knowledge and Neural Machine Translation Powered Chinese Typo Correction

Hengyi Cai¹, Xingguang Ji², Yonghao Song¹, Yan Jin¹, Yang Zhang³, Mairgup Mansur³, Xiaofang Zhao¹
¹ Institute of Computing Technology, Chinese Academy of Sciences
² Beijing University of Posts and Telecommunications
³ Sogou, Inc.
[email protected], [email protected], {songyonghao, jinyan}@ict.ac.cn, {zhangyang, maerhufu}@sogou-inc.com, [email protected]

Abstract

Chinese pinyin input methods are very important for Chinese language processing. Actually, users may inevitably make typos when they input pinyin. Moreover, pinyin typo correction has become an increasingly important task with the popularity of smartphones and the mobile Internet. How to exploit the knowledge of users' typing behaviors and support typo correction for acronym pinyin remains a challenging problem. To tackle these challenges, we propose KNPTC, a novel approach based on neural machine translation (NMT). In contrast to previous work, KNPTC is able to integrate explicit knowledge into NMT for pinyin typo correction, and is able to learn to correct a variety of typos without the guidance of manually selected constraints or language-specific features. In this approach, we first obtain the transition probabilities between adjacent letters based on large-scale real-life datasets. Then, we construct the "ground-truth" alignments of training sentence pairs by utilizing these probabilities. Furthermore, these alignments are integrated into NMT to capture sensible pinyin typo correction patterns. Applied to correct typos in real-life datasets, KNPTC achieves a 32.77% average increment in the accuracy rate of typo correction compared against the state-of-the-art system.

1 Introduction

Chinese pinyin is the official system to transcribe Chinese characters into the Latin alphabet. Based on this transcription system, the pinyin input method¹ has become the dominant method for entering Chinese text into computers in China.

¹ Throughout this paper, "pinyin input method" means a sentence-based pinyin input method, which generates a sequence of Chinese words upon a sequence of pinyin input.

The typical way to type in Chinese words is sequential [Wang et al., 2001]. For example, assume that users want to type in the Chinese word "你好 (Hello)". First, they type in the corresponding pinyin "nihao" without delimiters such as the "Space" key to segment the pinyin syllables. Then, a Chinese pinyin input method displays a list of Chinese candidate words which share that pinyin. Finally, users search for the target word among the candidates and get the result. In this way, typing in Chinese words is more complicated than typing in English words. In fact, pinyin typos have always been a serious problem for Chinese pinyin input methods. Users often make various typos in the process of inputting. As shown in Figure 1, users may mistype "nihao" as "nihap" (the letter p is very close to the letter o on the QWERTY keyboard). In addition, a user may also fail to input the completely right pinyin simply because he or she is a dialect speaker and does not know the exact pronunciation of the expected Chinese word [Jia and Zhao, 2014]. Moreover, with the boom of smartphones, pinyin typos are more prone to be made, possibly due to the limited size of the soft keyboard and the lack of physical feedback on the touch screen [Jia and Zhao, 2014]. When an input method engine (IME) fails to correct the typo and produce the desired Chinese sentence, the user has to spend extra effort to move the cursor back to the typo and correct it, which results in a very poor user experience [Jia et al., 2013]. Hence Chinese pinyin typo correction has a significant impact on IME performance.

[Figure 1: A correct pinyin (a) and a mistyped pinyin (b).]

However, due to the peculiar characteristics of Chinese pinyin, typo correction for Chinese pinyin is quite different from that in other languages. For example, when users want to input "早上好 (Good morning)", they will type "zaoshanghao" instead of the segmented pinyin sequence "zao shang hao".
Besides, acronym pinyin input² is very common in Chinese daily input. How to support acronym pinyin input for typo correction is also a challenging problem [Zheng et al., 2011a]. For the above reasons, it is impractical to directly adopt previous typo correction methods for English to pinyin typo correction.

² Acronym pinyin input means using the first letters of the syllables, e.g., "zsh" for "早上好".

Meanwhile, machine translation is well established in the field of automatic error correction (AEC) [Junczys-Dowmunt and Grundkiewicz, 2016]. With the recent successful work on neural machine translation, neural encoder-decoder models have been used in AEC as well [Junczys-Dowmunt and Grundkiewicz, 2016; Xie et al., 2016], and have achieved promising results. In this paper, we explore new neural models to tackle the unique challenges of Chinese pinyin typo correction.

The main component of neural machine translation (NMT) systems is a sequence-to-sequence (S2S) model which encodes a sequence of source words into a vector and then generates a sequence of target words from the vector [Ji et al., 2017]. Among different variants of NMT, attentional NMT [Bahdanau et al., 2014; Luong et al., 2015] has become prevalent due to its ability to use the most relevant part of the source sentence in each translation step. Unlike classification-based and statistics-based approaches, NMT is more capable of capturing global contextual information, which is crucial to pinyin typo correction. However, due to the characteristics of Chinese pinyin, to achieve the best performance on pinyin typo correction, we still need to improve the standard NMT model to address several challenges:
• First, due to the limited size of soft keyboards on smartphones, letters close to each other on a Latin keyboard or with similar pronunciations are more likely to be mistyped, which leads to a great proportion of pinyin spelling mistakes [Zheng et al., 2011b]. How to solve these typos caused by proximity keys is crucial for improving the performance of IMEs.

• Second, since acronym pinyin is widely used in Chinese daily input, typo correction for acronym pinyin input greatly affects the user input experience. However, due to its shortness and sparsity, previous pinyin typo correction methods failed to support acronym pinyin input.

• Third, in addition to the aforementioned task-specific challenges, the attention quality of NMT is still unsatisfactory. Lack of explicit knowledge may lead to attention faults and generate inaccurate translations [Zhang et al., 2017].

On the one hand, there is a need to introduce prior knowledge into the attention mechanism to improve the performance of NMT. On the other hand, valuable input patterns can be extracted from user input logs to guide pinyin typo correction. In order to overcome the aforementioned challenges, in this paper, we propose an approach called "KNPTC", which stands for "Knowledge and Neural machine translation powered Chinese Pinyin Typo Correction". KNPTC is able to integrate discrete, probabilistic keyboard neighborhood information about users' pinyin input as prior knowledge into attentional NMT to capture more sensible typo correction patterns. Specifically, we first use large-scale real-life datasets to obtain the transition probabilities between adjacent letters in advance. We then construct the "ground-truth" alignments of training sentence pairs by utilizing these probabilities. Finally, the distance between the NMT attentions and the "ground-truth" alignments is computed as a part of the loss function and minimized during training. Considering that the transition probabilities of adjacent letters provide higher-quality knowledge about real-life users' input behaviors, the attentional NMT with prior knowledge is expected to achieve better performance on the end-to-end pinyin typo correction task.

We find that KNPTC has the following properties: 1) KNPTC can obtain the optimal segmentation and typo correction jointly on the user's original input pinyin sequence; 2) KNPTC is capable of dealing with typos relating to acronym pinyin input. In the experimental study, we compared KNPTC with Google Input Tools³ (henceforth referred to as GoogleIT) and the joint graph model proposed in [Jia and Zhao, 2014] (henceforth referred to as JGM). We conducted experiments on real-life datasets and found that KNPTC achieves significantly better performance than the other two baseline systems.

³ https://www.google.com/inputtools/try/

The contributions of this paper are summarized as follows:

• We propose a solution that exploits prior knowledge of users' input behaviors to seamlessly integrate the transition probabilities between adjacent letters with attentional NMT for Chinese pinyin typo correction.

• Compared with previous work, KNPTC is more capable of typo correction for acronym pinyin, which is widely used in Chinese people's daily input.

• We conduct extensive experiments on real-life datasets. The results show that KNPTC significantly outperforms previous systems (such as GoogleIT and JGM) and is efficient enough for practical use.

2 Related Work

Chen and Lee first tried to resolve the problem of Chinese pinyin typo correction [Chen and Lee, 2000]. They designed a statistical typing model to correct pinyin typos and then used a language model to convert a pinyin sequence to a Chinese character sequence. However, their method can only model a few rules, and thus only a limited number of errors can be corrected. Zheng et al. proposed an error-tolerant pinyin input method called CHIME using a noisy channel model and language-specific features [Zheng et al., 2011a]. Their method assumed that the user's input pinyin sequence had already been properly segmented by the user, which is inconsistent with real-life situations. Our model discards this assumption to make it more practical. Jia and Zhao proposed a joint graph model to find the global optimum of pinyin-to-Chinese conversion and typo correction for the input method engine, and achieved better results than previous studies [Jia and Zhao, 2014]. In this paper, we employ this joint graph model as one of our baseline systems.

In addition, Chen et al. attempted to integrate neural network language models into pinyin IMEs [Chen et al., 2015]. Their main purpose is to improve the decoding predictive performance of IMEs while ensuring the response speed. However, lacking the ability of typo correction, the model they proposed is not convenient to use in real-life situations.

As for automatic error correction in English, most earlier work [Gao et al., 2010; Rozovskaya and Roth, 2010] made use of a lexicon that contains well-spelled words or context features of words. Recently, machine translation has become well established in the field of automatic error correction in English [Junczys-Dowmunt and Grundkiewicz, 2016; Xie et al., 2016]. In this work, we mainly focus on the typo correction of Chinese pinyin, which is quite different from other languages for the reasons mentioned before.

3 Our Approach: KNPTC

In this paper, the task of Chinese pinyin typo correction is formulated as a translation task, in which the source language L is the user input pinyin sequence l_1 l_2 ... l_{T_L}, where l_i is a letter, and the target language S is the sequence of correct segmented pinyin syllables s_1 s_2 ... s_{T_S}, where s_i is a pinyin syllable. For example, taking the user's misspelled pinyin input sequence "nihapma" as a source sentence, the corresponding target sentence is "ni hao ma". KNPTC combines implicit representations of the pinyin sequence and explicit knowledge of adjacent-letter transitions to further improve the performance of pinyin typo correction.

Figure 2 shows the architecture of KNPTC. It consists of a character-level sequence to word-level sequence model as a backbone, and uses the supervised attention component to jointly learn attention and typo correction.

[Figure 2: Architecture of the Overall Model]
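To make the data format concrete, the following minimal sketch shows how one training pair from the running example above can be represented; the variable names are ours, not part of the paper.

```python
# One training pair under the translation formulation above: the source side is
# the raw letter sequence typed by the user, the target side is the corrected,
# segmented pinyin syllable sequence. Variable names are illustrative only.
source = list("nihapma")        # L = l1 l2 ... l_TL, one letter per token
target = ["ni", "hao", "ma"]    # S = s1 s2 ... s_TS, one syllable per token

print(source, "->", target)
# ['n', 'i', 'h', 'a', 'p', 'm', 'a'] -> ['ni', 'hao', 'ma']
```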

3.1 NMT Framework as a Backbone for KNPTC

The backbone of KNPTC closely follows the basic neural machine translation architecture with attention proposed in [Bahdanau et al., 2014]. We here present a key description of this model. The underlying framework of the model is the RNN Encoder-Decoder [Sutskever et al., 2014].

Given a sequence of vectors x = (x_1, \ldots, x_{T_x}) representing the source language, the encoder creates corresponding hidden state vectors e using a bidirectional RNN [Schuster and Paliwal, 1997]:

e = (h_1, \ldots, h_{T_x}).   (1)

The hidden state h_t at time t is obtained by concatenating the forward hidden state \overrightarrow{h}_t and the backward one \overleftarrow{h}_t, i.e., h_t = [\overrightarrow{h}_t^{\top}; \overleftarrow{h}_t^{\top}]^{\top}. The \overrightarrow{h}_t and \overleftarrow{h}_t are computed as follows:

\overrightarrow{h}_t = \overrightarrow{\mathrm{GRU}}_{\mathrm{enc}}(\overrightarrow{h}_{t-1}, x_t), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{GRU}}_{\mathrm{enc}}(\overleftarrow{h}_{t+1}, x_t),   (2)

where \overrightarrow{\mathrm{GRU}}_{\mathrm{enc}} and \overleftarrow{\mathrm{GRU}}_{\mathrm{enc}} represent gated recurrent unit functions. The arrows over the subscript enc indicate the use of two functions with different parameters, denoting the forward and backward encoder units, respectively.
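For concreteness, here is a minimal PyTorch sketch of the bidirectional GRU encoder in equations (1)-(2); the paper does not prescribe an implementation, and the sizes below are taken from the training details in Section 4.3.

```python
# A sketch (not the authors' code) of the bidirectional GRU encoder,
# equations (1)-(2). Embedding size 256 and hidden size 128 follow Section 4.3.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_size=256, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        # bidirectional=True realizes the forward/backward GRUs of equation (2)
        self.gru = nn.GRU(emb_size, hidden_size,
                          bidirectional=True, batch_first=True)

    def forward(self, x):                # x: (batch, T_x) letter ids
        h, _ = self.gru(self.embed(x))
        # h[t] concatenates the forward and backward states, as in
        # h_t = [fwd_h_t ; bwd_h_t], giving e = (h_1, ..., h_Tx) of equation (1)
        return h                         # (batch, T_x, 2 * hidden_size)

encoder = Encoder(vocab_size=30)
e = encoder(torch.randint(0, 30, (1, 7)))   # e.g., the 7 letters of "nihapma"
print(e.shape)                               # torch.Size([1, 7, 256])
```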

The decoding process also uses GRU functions, which use a sequence of hidden states d_1, d_2, \ldots, d_{T_y} to define the probability of the output sequence y_1, y_2, \ldots, y_{T_y} as follows:

p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, d_i, c_i),   (3)

where g is a nonlinear, multi-layered function that outputs the probability distribution of y_i, and d_i is an RNN hidden state at step i, computed as

d_i = \mathrm{GRU}_{\mathrm{dec}}(d_{i-1}, y_{i-1}, c_i).   (4)

The context vector c_i used to predict y_i is equal to \sum_{j=1}^{T_x} \alpha_{ij} h_j. The weight \alpha_{ij} for h_j is given by:

\alpha_{ij} = \frac{\exp(q_{ij})}{\sum_{k=1}^{T_x} \exp(q_{ik})},   (5)

where q_{ij} = a(d_{i-1}, h_j). Here a is a feedforward neural network which is jointly trained with the entire network.

Given a training dataset C containing |C| pairs of training samples in the form of ⟨x, y⟩, the training goal of the model is to minimize the cross-entropy loss as follows:

\mathrm{Loss}_0 = -\sum_{\langle x, y\rangle \in C} \sum_{l=1}^{T_y} \log p(y_l \mid y_{<l}, x).   (6)
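The attention step of equations (3)-(5) can be sketched as below; this is our illustration of the feedforward-scored attention the paper builds on, with illustrative shapes and names, not the authors' code.

```python
# A sketch of equations (3)-(5): a feedforward scorer a(d_{i-1}, h_j), a softmax
# over source positions, and the context vector c_i. Shapes and names are ours.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    def __init__(self, dec_size=128, enc_size=256):
        super().__init__()
        # a(.) is a feedforward network trained jointly with the whole model
        self.score = nn.Linear(dec_size + enc_size, 1)

    def forward(self, d_prev, h):
        # d_prev: (batch, dec_size) decoder state d_{i-1}
        # h:      (batch, T_x, enc_size) encoder states h_1 .. h_Tx
        d_rep = d_prev.unsqueeze(1).expand(-1, h.size(1), -1)
        q = self.score(torch.cat([d_rep, h], dim=-1)).squeeze(-1)  # q_ij
        alpha = F.softmax(q, dim=-1)                               # equation (5)
        c = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)  # c_i = sum_j a_ij h_j
        return c, alpha
```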
3.2 Supervised Attention Model with Keyboard Neighborhoods Information

The attention weights α_{i,1}, α_{i,2}, ..., α_{i,T_x} at each step play an important role in predicting the next word. However, as shown in equation (5), the original attention mechanism, which selectively focuses on the source words, does not take into account any knowledge of the keyboard neighborhoods, which is crucial for Chinese pinyin typo correction. This knowledge is especially useful when tackling typo correction for acronym pinyin input. Thus, in this section, we introduce discrete and probabilistic keyboard neighborhood information of users' pinyin input to capture explicit knowledge about users' input behaviors, enhancing the attention mechanism and improving pinyin typo correction quality. To integrate the keyboard neighborhood information into attentional NMT, we propose a novel alignment model using the transition probabilities between adjacent letters. We then use this alignment model to supervise the learning of the attention mechanism.

Assume that we have such prior knowledge p_t: given a letter l_i, we assign a probability p_t(l_i → l_s) to any other letter l_s. p_t(l_i → l_s) represents the probability that the user wants to enter l_i but types l_s instead, since l_s is adjacent to l_i on the keyboard; i.e., p_t(l_i → l_s) is the probability that letter l_i transfers to letter l_s. Given a vocabulary V_l containing the letters on the keyboard (for instance, the 26 Latin letters), for a letter l_s, the probability p_t(l_i → l_s) will be zero for most letters in V_l. We first describe how to integrate these probabilities into attentional NMT, and then explain how to obtain p_t from the dataset.

Converting Transition Probabilities into an Alignment Model

For each pinyin sequence pair (L, S), we define an alignment matrix φ with T_L + 1 rows (T_L letters and one end-of-sequence symbol ⟨eos⟩) and T_S + 1 columns (T_S pinyin words and one ⟨eos⟩). For 1 ≤ i ≤ T_L and 1 ≤ j ≤ T_S, φ_{ij} is computed as:

\phi_{ij} = \max_{1 \le k \le |s_j|} p_t(s_{jk} \to l_i)\, A_{i,j},   (7)

and for i = T_L + 1 or j = T_S + 1, φ_{ij} is computed as:

\phi_{ij} = \begin{cases} 1 & i = T_L + 1,\ j = T_S + 1 \\ 0 & \text{otherwise,} \end{cases}   (8)

where A_{i,j} = 1 in equation (7) denotes that letter l_i is translated as a part of word s_j, and A_{i,j} = 0 otherwise. In fact, A represents a segmentation of the input sequence. Given the input sequence L and pinyin words S, Algorithm 1 shows how to obtain the segmentation of L.

Algorithm 1 Generating the segmentation of the input sequence L
1:  global variables
2:      SEGMENTATIONS            ▷ Candidate segmentations
3:      SCORES                   ▷ Scores for candidate segmentations
4:      MAX_LEN                  ▷ Max length of Chinese pinyin words
5:  end global variables
6:  procedure GENSEGMENTATION(L, S)
7:      SegNum ← T_S − 1
8:      Initialize Seg as an empty string
9:      SCORES ← []
10:     SEGMENTATIONS ← []
11:     BeginPos ← 0
12:     MAX_LEN ← 6              ▷ 6 is the max length of Chinese pinyin words
13:     SEGMENT(BeginPos, SegNum, Seg, L, S)
14:     MaxScore ← MAX(SCORES)
15:     index ← argmax(SCORES)
16:     return SEGMENTATIONS[index]
17: procedure SEGMENT(BeginPos, SegNum, Seg, L, S)
18:     if SegNum > 0 then
19:         if BeginPos + MAX_LEN < T_L then
20:             EndPos ← BeginPos + MAX_LEN
21:         else
22:             EndPos ← T_L
23:         for i ← BeginPos + 1 to EndPos do
24:             Seg ← Seg + L[BeginPos, i] + "’"
25:             SegNum ← SegNum − 1
26:             SEGMENT(i, SegNum, Seg, L, S)
27:     else
28:         Seg ← Seg + L[BeginPos:]
29:         ζ ← CALCSCORE(Seg, S)
30:         Append ζ to SCORES
31:         Append Seg to SEGMENTATIONS

In line 29 of Algorithm 1, we notice that there is a procedure CALCSCORE. Supposing that the variable Seg contains a segmentation of L represented as {seg_1, seg_2, ..., seg_{T_S}}, we use the procedure CALCSCORE and the edit distance [Wagner and Fischer, 1974] to compute the score ζ of this segmentation by (a Python sketch of this procedure is given after Figure 3 below):

\zeta = -\sum_{n=1}^{T_S} \mathrm{EditDistance}(seg_n, s_n).   (9)

Given an original alignment matrix φ as in Figure 3a, we use a scaling transformation to obtain the probability distribution φ* (in Figure 3, each column is rescaled to sum to one); Figure 3b shows the resulting matrix φ*. We then use φ* to supervise the learning of the attention mechanism.

           s1     s2     s3    ⟨eos⟩                s1     s2     s3    ⟨eos⟩
  l1      0.93    0      0      0          l1      0.62    0      0      0
  l2      0.57    0      0      0          l2      0.38    0      0      0
  l3       0     0.97    0      0          l3       0     0.67    0      0
  l4       0     0.48    0      0          l4       0     0.33    0      0
  l5       0      0     0.86    0          l5       0      0      1      0
  ⟨eos⟩    0      0      0      1          ⟨eos⟩    0      0      0      1
              (a) φ                                    (b) φ*

Figure 3: The original alignment matrix φ (a) and the alignment matrix φ* after transformation (b).
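A direct Python rendering of Algorithm 1 and the scoring in equation (9) might look as follows; this is our sketch under the stated assumptions (pieces of at most MAX_LEN = 6 letters, scores as negated edit-distance sums), not the authors' released code.

```python
# A sketch of Algorithm 1: enumerate segmentations of L into T_S pieces of at
# most MAX_LEN letters and keep the one scoring best under equation (9).
MAX_LEN = 6  # max length of a Chinese pinyin syllable

def edit_distance(a, b):
    # Standard Levenshtein distance [Wagner and Fischer, 1974], one-row DP.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete
                                     dp[j - 1] + 1,      # insert
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def gen_segmentation(L, S):
    """Best segmentation of the letter string L against target syllables S."""
    best_score, best_seg = float("-inf"), None

    def segment(begin, remaining, seg):          # mirrors procedure SEGMENT
        nonlocal best_score, best_seg
        if remaining > 1:                        # SegNum > 0 in Algorithm 1
            end = min(begin + MAX_LEN, len(L))
            for i in range(begin + 1, end + 1):
                segment(i, remaining - 1, seg + [L[begin:i]])
        else:                                    # the last piece takes the rest
            parts = seg + [L[begin:]]
            # CALCSCORE, equation (9): negated sum of edit distances
            score = -sum(edit_distance(p, s) for p, s in zip(parts, S))
            if score > best_score:
                best_score, best_seg = score, parts

    segment(0, len(S), [])
    return best_seg

print(gen_segmentation("nihapma", ["ni", "hao", "ma"]))  # ['ni', 'hap', 'ma']
```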

Using the Alignment Model to Supervise Attentional NMT

Based on the "ground-truth" alignment matrix φ* mentioned above and the attentions φ′ generated by the NMT model, we define the following distance function between φ* and φ′:

\ell(\phi^*, \phi') = \|\phi^* - \phi'\|_2^2.   (10)

Combined with the previous function in equation (6), we finally obtain the following loss function:

\mathrm{Loss} = -\sum_{\langle x, y\rangle \in C} \Big( \sum_{l=1}^{T_y} \log p(y_l \mid y_{<l}, x) - \ell(\phi^*, \phi') \Big).   (11)

As shown above, the loss function consists of two parts. The first part represents the loss term for typo correction, and the second part represents the loss term for the alignment model, which incorporates the transition probabilities between adjacent letters. We minimize them jointly in our training process.
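In code, the combined objective of equations (10)-(11) amounts to adding the squared L2 distance between the attention matrix and φ* to the cross-entropy term. Below is a minimal sketch in our notation, assuming φ* has been arranged to share the attention matrix's shape.

```python
# A sketch of equations (10)-(11) for a single training pair: cross-entropy for
# typo correction plus the alignment distance l(phi_star, phi_prime). We assume
# phi_star has been arranged to the same shape as the attention matrix.
import torch
import torch.nn.functional as F

def knptc_loss(logits, targets, attention, phi_star):
    # logits:    (T_y, vocab) decoder outputs; targets: (T_y,) gold syllable ids
    # attention: (T_y, T_x) weights alpha; phi_star: (T_y, T_x) "ground truth"
    ce = F.cross_entropy(logits, targets, reduction="sum")  # -sum log p(y_l|..)
    align = torch.sum((phi_star - attention) ** 2)          # equation (10)
    return ce + align                                       # minimized jointly
```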

Obtaining the Prior Knowledge p_t from Data

In the previous sections, we proposed a method of using the prior knowledge p_t to supervise the learning of the attention mechanism. Now, we define the way to obtain the prior knowledge p_t from the real-life input behaviors of anonymous users.

p_t(l_i → l_j) represents the probability that the user wants to enter l_i but types l_j instead, since l_j is adjacent to l_i on the keyboard. These probabilities can be learned directly from users' editing actions. Specifically, users press backspace on the keyboard to modify typos: they delete the mistyped letters and re-type the correct ones. Motivated by this observation, p_t(l_i → l_j) is calculated by:

p_t(l_i \to l_j) = \frac{N(l_i \to l_j)}{N(l_i)},   (12)

where N(l_i → l_j) denotes the number of times that the letter l_i is wrongly entered as the letter l_j in the actual editing actions, and N(l_i) indicates the total number of times the letter l_i is inputted.
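Estimating p_t from logs is a counting exercise; the sketch below assumes a hypothetical log format of (intended, typed) letter pairs recovered from backspace corrections.

```python
# A counting sketch of equation (12). `corrections` holds (l_i, l_j) pairs
# meaning the user intended l_i but typed l_j (recovered from backspace edits);
# `typed` holds every letter the users input. The log format is our assumption.
from collections import Counter

def estimate_pt(corrections, typed):
    n_confused = Counter(corrections)   # N(l_i -> l_j)
    n_total = Counter(typed)            # N(l_i)
    return {(li, lj): n / n_total[li]
            for (li, lj), n in n_confused.items() if n_total[li]}

pt = estimate_pt([("o", "p"), ("o", "p"), ("o", "i")], "o" * 10)
print(pt[("o", "p")])   # 0.2 -- 'o' mistyped as its keyboard neighbor 'p'
```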

4 Experiments

4.1 Data Preparation

Our real-life datasets come from user logs of the Chinese input method Sogou-Pinyin⁴. The user logs mainly include the following parts:

• The user's input string, which is an unsegmented sequence of Latin letters (henceforth referred to as "original input");

• The desired Chinese sentence selected by the user from the candidates (henceforth referred to as "target Chinese sentence"), which are generated by the IME and consist of one or more Chinese words;

• The backspace and re-enter operations made by the user, which are recorded if the candidates do not meet the user's needs due to typos or other reasons.

⁴ https://pinyin.sogou.com

We extracted original input and target Chinese sentence pairs from these data for the experiments. Under normal circumstances, users should input the complete pinyin sequence corresponding to the target Chinese sentence. However, original input may contain abnormal input due to typos. We classified abnormal input into two types: acronym pinyin input and misspelled pinyin input (acronym pinyin input also contains typos). According to the acronym pinyin's position in the original input, we simply divided acronym pinyin input into two categories (a classification sketch is given at the end of this subsection):

• Global acronym pinyin input. Every Chinese character is represented by acronym pinyin, for example: 你好吗 → "ni hao ma" → "n h m".

• Local acronym pinyin input. Only a part of the Chinese characters are represented by acronym pinyin, for example: 你好吗 → "ni hao ma" → "ni hao m" or 你好吗 → "ni hao ma" → "ni h m".

User input       Target Chinese sentence (in pinyin form)   Input type
dadianhuageini   da dian hua gei ni                         CP
xianzqub         xian zai qu ba                             LAP
yijianzq         yi jian zhong qing                         LAP
bgnll            bu gen ni liao le                          GAP
niyiubudhi       ni you bu shi                              MP

Table 1: Some sample data extracted from user logs and their corresponding input types. For convenience, we use CP to indicate correct pinyin input, LAP for local acronym pinyin input, GAP for global acronym pinyin input, and MP for misspelled pinyin input.

Input type   #Records   Proportion (%)
CP           488,583    41.58
LAP          408,228    34.74
GAP          211,261    17.98
MP           66,717     5.70

Table 2: Input type distribution. We can observe that there is a large proportion of acronym pinyin input in the real-life data.

Table 1 shows examples that we extracted from the user logs. The distribution of the input types is given in Table 2. It should be noted that the original input is a continuous sequence without segmentation. In addition, unlike English and other languages, the original input only expresses the pronunciation information of the target Chinese sentence, so we translated the target Chinese sentence into its pinyin form to obtain the target pinyin sentence. We treat the original input as the source sentence and the target pinyin sentence as the target sentence in our approach. In total, 1,179,789 anonymous users' typing records over 2 days were collected. From the extracted dataset of about 1M parallel sentences, we randomly selected close to 50K parallel sentences as a development set, 100K for testing, and 850K for training.
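The four input types can be distinguished mechanically from an original input and its target pinyin; the following sketch encodes our reading of the definitions above (full pinyin, all-first-letter, a mix of the two, or anything else) and is not the authors' labeling procedure.

```python
# An illustrative classifier for the input types of Tables 1-2, based on our
# reading of the definitions above (not the authors' labeling code).
def classify_input(original, syllables):
    if original == "".join(syllables):
        return "CP"                                # complete, correct pinyin
    if original == "".join(s[0] for s in syllables):
        return "GAP"                               # every syllable abbreviated
    if is_partial_acronym(original, syllables):
        return "LAP"                               # some syllables abbreviated
    return "MP"                                    # treated as misspelled

def is_partial_acronym(original, syllables):
    # Each syllable contributes either its full form or just its first letter.
    if not syllables:
        return original == ""
    s = syllables[0]
    return any(original.startswith(p) and
               is_partial_acronym(original[len(p):], syllables[1:])
               for p in (s, s[0]))

print(classify_input("xianzqub", ["xian", "zai", "qu", "ba"]))     # LAP
print(classify_input("bgnll", ["bu", "gen", "ni", "liao", "le"]))  # GAP
print(classify_input("niyiubudhi", ["ni", "you", "bu", "shi"]))    # MP
```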

4.2 Evaluation Metrics and Baselines

We evaluate KNPTC with conventional sequence labeling evaluation metrics: word accuracy (W-Acc) and sentence accuracy (S-Acc), which have significant impacts on the user experience of IMEs [Jia and Zhao, 2014].

We compare KNPTC with two approaches: GoogleIT and JGM (the state-of-the-art method). GoogleIT contains a practical pinyin input method; we use its public APIs for testing. JGM is proposed in [Jia and Zhao, 2014]; its basic idea is to adopt a graph model for pinyin typo correction and segmentation jointly. We obtained its source code from the author⁵ and use its default settings for hyperparameters. The metric for evaluating each model is the accuracy of its predictions.

⁵ http://bcmi.sjtu.edu.cn/~zhaohai

4.3 Training Details and Results

The embedding size of the character-level encoder and word-level decoder is set to 256, and the hidden unit size is 128. All the parameters are initialized from a uniform distribution. Our model converges after about 350K iterations; running on a single Tesla K40 GPU, it takes two days to train our model. We choose the K best candidates for pinyin typo correction. Figure 4 shows the results of KNPTC on the development dataset for different values of K. We choose the value of K at which the improvement in W-Acc from the K best candidates falls below a threshold τ. Considering the practical need for the number of candidates in IMEs, we set τ to 0.005 and get K = 10 for pinyin typo correction.

Original input   Target sentence     JGM                GoogleIT              KNPTC
yijiam           yi jian             yi jian            yi jia men            yi jian
yigeciaos        yi ge xiao shi      yi ge xiao a       yi ge ci ao shen      yi ge xiao shi
zhongqiykuail    zhong qiu kuai le   zhong qiu kuai a   zhong qi yi kuai le   zhong qiu kuai le
smshuh           shen me shi hou     si shuo            shuo ming shu he      shen me shi hou
gabgshuix        gang shui xing      gang shui a        ga bu guo shui xing   gang shui xing

Table 3: Example typo corrections made by JGM, GoogleIT and KNPTC. Typo corrections made by KNPTC are in bold.

The examples of typo corrections in Table 3 demonstrate the abilities of KNPTC, which is able to handle a variety of pinyin typos, e.g., those due to keyboard neighborhood (qiy → qiu, gabg → gang) or acronym pinyin (smshuh → shen me shi hou).
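The two metrics can be realized in a few lines; the exact-match definitions below (per-word and whole-sentence agreement with the target pinyin) are the common ones we assume here, since the paper reports but does not formally define them.

```python
# A sketch of W-Acc and S-Acc under the exact-match definitions we assume here.
def w_and_s_accuracy(predictions, references):
    """predictions/references: lists of syllable lists, e.g. [["ni","hao","ma"]]."""
    n_words = n_correct_words = n_correct_sents = 0
    for pred, ref in zip(predictions, references):
        n_words += len(ref)
        n_correct_words += sum(p == r for p, r in zip(pred, ref))
        n_correct_sents += int(pred == ref)
    return n_correct_words / n_words, n_correct_sents / len(references)

w_acc, s_acc = w_and_s_accuracy([["ni", "hao", "ma"], ["zao", "an"]],
                                [["ni", "hao", "ma"], ["zao", "shang"]])
print(w_acc, s_acc)   # 0.8 0.5
```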

[Figure 4: W-Acc and S-Acc with different values of K for KNPTC.]

Considering that users' real-life input contains different types, to further inspect the robustness of KNPTC, we evaluate the performance of each method separately on all input types, i.e., CP, LAP, GAP, MP, and the mixed input type. Table 4 shows the performance of JGM, GoogleIT and KNPTC on the test dataset.

Input type   Acc     JGM     GoogleIT   KNPTC
CP           W-Acc   99.86   99.89      99.83
             S-Acc   99.51   99.78      99.64
LAP          W-Acc   55.06   90.19      93.11
             S-Acc   0       78.20      83.40
GAP          W-Acc   17.01   61.78      62.94
             S-Acc   0       35.77      31.18
MP           W-Acc   68.11   50.90      95.19
             S-Acc   47.84   11.58      88.87
MIX          W-Acc   68.69   84.09      91.25
             S-Acc   44.85   70.58      83.22

Table 4: Comparison with baseline systems on an identical dataset for different input types. MIX stands for CP+LAP+GAP+MP.

Some observations can be derived:

1) KNPTC performs significantly better than JGM in almost all cases, especially for LAP and GAP (with 38.05% and 45.93% improvements in W-Acc). From Table 3 we notice that JGM fails to generate the correct sentence for acronym pinyin input due to its shortness and sparsity, while KNPTC performs well on both LAP and GAP. The reason is that KNPTC is more capable of learning sensible semantic information and the typing patterns of user input. Besides, KNPTC is able to capture the alignment information with transition probabilities between adjacent letters in the decoding stage, and thus it can select more relevant input to predict the next target word. It is worth noting that acronym pinyin input occupies a large part of real-life user input; therefore, KNPTC is more practical than JGM.

2) KNPTC outperforms GoogleIT, especially for MP (with 44.29% and 77.29% improvements in W-Acc and S-Acc). From Table 3 we can observe that GoogleIT fails to correctly segment the original pinyin input due to the existence of typos and therefore cannot produce correct results. KNPTC overcomes this problem by finding the optimal segmentation and typo correction jointly on the user's original input pinyin sequence. We can also observe that KNPTC's S-Acc on GAP is slightly lower than that of GoogleIT. This can be attributed to the fact that the language model of Chinese sentences, which is useful for candidate generation, has not been fully exploited in KNPTC; to make typo correction better, we plan to integrate it with KNPTC in future work.

3) KNPTC's S-Acc on LAP, MP and MIX is significantly better than that of GoogleIT and JGM. This means that KNPTC can generate better candidate sentences, which is critical to the user experience of IMEs. From the results in Table 4, we can further conclude that the alignment information with transition probabilities between adjacent letters helps improve the overall performance.
5 Conclusion

In this paper, we propose an effective approach called KNPTC to integrate the transition probabilities between adjacent letters into attentional NMT to capture more sensible typo correction patterns. KNPTC finds the optimal segmentation and typo correction jointly on the user's original input pinyin sequence. In addition, KNPTC can also cope with typo correction for acronym pinyin input. Experiments show that KNPTC can evidently improve pinyin typo correction performance. To the best of our knowledge, our work is among the earliest studies to leverage neural machine translation for Chinese pinyin typo correction.

References

[Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

[Chen and Lee, 2000] Zheng Chen and Kai-Fu Lee. A new statistical approach to Chinese pinyin input. In 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, China, October 1-8, 2000.

[Chen et al., 2015] Shenyuan Chen, Hai Zhao, and Rui Wang. Neural network language model for Chinese pinyin input method engine. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, PACLIC 29, Shanghai, China, October 30 - November 1, 2015.

[Gao et al., 2010] Jianfeng Gao, Xiaolong Li, Daniel Micol, Chris Quirk, and Xu Sun. A large scale ranker-based system for search query spelling correction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 358-366, 2010.

[Ji et al., 2017] Jianshu Ji, Qinlong Wang, Kristina Toutanova, Yongen Gong, Steven Truong, and Jianfeng Gao. A nested attention neural hybrid model for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 753-762, 2017.
[Jia and Zhao, 2014] Zhongye Jia and Hai Zhao. A joint graph model for pinyin-to-Chinese conversion with typo correction. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, pages 1512-1523, 2014.

[Jia et al., 2013] Zhongye Jia, Peilu Wang, and Hai Zhao. Graph model for Chinese spell checking. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 88-92, 2013.

[Junczys-Dowmunt and Grundkiewicz, 2016] Marcin Junczys-Dowmunt and Roman Grundkiewicz. Phrase-based machine translation is state-of-the-art for automatic grammatical error correction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1546-1556, 2016.

[Luong et al., 2015] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1412-1421, 2015.

[Rozovskaya and Roth, 2010] Alla Rozovskaya and Dan Roth. Generating confusion sets for context-sensitive error correction. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 961-970, 2010.

[Schuster and Paliwal, 1997] Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Processing, 45(11):2673-2681, 1997.

[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112, 2014.

[Wagner and Fischer, 1974] Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168-173, 1974.

[Wang et al., 2001] Jingtao Wang, Shumin Zhai, and Hui Su. Chinese input with keyboard and eye-tracking: an anatomical study. In Proceedings of the CHI 2001 Conference on Human Factors in Computing Systems, Seattle, WA, USA, March 31 - April 5, 2001, pages 349-356, 2001.

[Xie et al., 2016] Ziang Xie, Anand Avati, Naveen Arivazhagan, Dan Jurafsky, and Andrew Y. Ng. Neural language correction with character-based attention. CoRR, abs/1603.09727, 2016.

[Zhang et al., 2017] Jinchao Zhang, Mingxuan Wang, Qun Liu, and Jie Zhou. Incorporating word reordering knowledge into attention-based neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1524-1534, 2017.

[Zheng et al., 2011a] Yabin Zheng, Chen Li, and Maosong Sun. CHIME: an efficient error-tolerant Chinese pinyin input method. In IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011, pages 2551-2556, 2011.

[Zheng et al., 2011b] Yabin Zheng, Lixing Xie, Zhiyuan Liu, Maosong Sun, Yang Zhang, and Liyun Ru. Why press backspace? Understanding user input behaviors in Chinese pinyin input method. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 485-490, 2011.