Chinese Pinyin Aided IME, Input What You Have Not Keystroked Yet
Yafang Huang 1,2, Hai Zhao 1,2,∗
1 Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
[email protected], [email protected]

∗ Corresponding author. This paper was partially supported by the National Key Research and Development Program of China (No. 2017YFB0304100), the National Natural Science Foundation of China (No. 61672343 and No. 61733011), the Key Project of the National Society Science Foundation of China (No. 15-ZDA041), the Art and Science Interdisciplinary Funds of Shanghai Jiao Tong University (No. 14JCRZ04), and a joint research project with Youtu Lab of Tencent.

Abstract

A Chinese pinyin input method engine (IME) converts pinyin into characters so that Chinese characters can be conveniently input into a computer through a common keyboard. IMEs work by relying on their core component, pinyin-to-character conversion (P2C). Usually, Chinese IMEs simply predict a list of character sequences for user choice according only to the user's pinyin input at each turn. However, Chinese inputting is a multi-turn online procedure, which can be exploited to further improve the user experience. This paper thus for the first time introduces a sequence-to-sequence model with a gated-attention mechanism for the core task in IMEs. The proposed neural P2C model is learned by encoding the previous input utterance as extra context, enabling our IME to predict a character sequence from incomplete pinyin input. Our model is evaluated on different benchmark datasets and shows great user experience improvement compared to traditional models, which demonstrates the first engineering practice of building a Chinese aided IME.

1 Introduction

Pinyin is the official romanization of Chinese, and P2C, which converts an inputted pinyin sequence into a Chinese character sequence, is the most basic module of all pinyin-based IMEs. Most previous research on IMEs (Chen, 2003; Zhang et al., 2006; Lin and Zhang, 2008; Chen and Lee, 2000; Jiang et al., 2007; Cai et al., 2017a) focused on the matching correspondence between pinyin syllables and Chinese characters. (Huang et al., 2018; Yang et al., 2012; Jia and Zhao, 2014; Chen et al., 2015) regarded P2C as a translation between two languages and solved it in a statistical or neural machine translation framework. The fundamental difference between the work of (Chen et al., 2015) and ours is that our work is a fully end-to-end neural IME model with extra attention enhancement, while the former still works on a traditional IME with only a converted neural network language model enhancement. (Zhang et al., 2017) introduced an online algorithm to construct an appropriate dictionary for P2C. All of the above mentioned work, however, still relies on a complete input pattern, and IME users have to input a very long pinyin sequence to guarantee the accuracy of the P2C module, since a longer pinyin sequence suffers less decoding ambiguity.

A Chinese IME is supposed to let users input Chinese characters with the least inputting cost, i.e., keystroking, which indicates that extra content prediction from incomplete inputting will be extremely welcomed by all IME users. (Huang et al., 2015) partially realized such extra prediction using a maximum suffix matching postprocessing over the vocabulary after SMT-based P2C to predict longer words than the inputted pinyin.

To offer the most convenience for such an IME, in terms of a sequence-to-sequence model for neural machine translation (NMT) between the pinyin sequence and the character sequence, we propose a P2C model in which the entire previous inputted utterance confirmed by the IME user is used as a part of the source input. Whether the previous utterance is the preceding sentence of the same article or the previous turn of a conversation, the resulting IME can predict far more than what the pinyin IME user has actually input.

In this paper, we adopt the attention-based NMT framework of (Luong et al., 2015) for the P2C task. In contrast to related work that simply extended the source side with context windows of different sizes to improve translation quality (Tiedemann and Scherrer, 2017), we add the entire input utterance chosen by the IME user at the previous time (referred to as the context hereafter). Hence the resulting IME may effectively improve P2C quality with the help of the extra information offered by the context, and it supports incomplete pinyin input while predicting complete, extra, and corrected character output. The evaluation and analysis are performed on two Chinese datasets, including a Chinese open-domain conversation dataset, to verify the effectiveness of the proposed method.

2 Model

[Figure 1: Architecture of the proposed model. (a) The data flow of traditional P2C; (b) the data flow of our gated-attention enhanced P2C; (c) Simple C+ P2C Model: a simple context-enhanced pinyin-to-character model that straightforwardly concatenates the context to the source pinyin sequence; (d) Gated C+ P2C Model: the gated-attention based context-enhanced pinyin-to-character model. The running example uses the previously confirmed context "今天 的 天气 怎么样 呢" (How's the weather?), the pinyin input "tianqi hai keyi de", and the user selection "天气 还 可以 的" (The weather is not bad).]

Since our approach introduces a hybrid source-side input, our model has to handle document-wide translation by considering the discourse relationship between two consecutive sentences. The most straightforward modeling is to simply concatenate the two types of source inputs with a special token 'BC' as separator, as shown in Figure 1(c) and in the sketch below. However, the significant drawback of this model is that a slew of unnecessary words in the extended context (the previous utterance) play a noisy role in the source-side representation.

To alleviate the noise issue introduced by the extra part on the source side, inspired by the work of (Dhingra et al., 2016; Pang et al., 2016; Zhang et al., 2018c,a,b; Cai et al., 2017b), our model adopts a gated-attention (GA) mechanism that performs multiple hops over the pinyin with the extended context, as shown in Figure 1(d).
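As a concrete illustration of the simple concatenation baseline of Figure 1(c), the following Python sketch assembles the source sequence from the previously confirmed characters and the current pinyin syllables. The function name and the use of plain token lists are illustrative assumptions rather than the authors' implementation; the 'BC' separator, the 'EOS' marker, and the example utterances follow Figure 1.

```python
def build_concat_source(prev_choice, cur_pinyin, sep="BC", eos="EOS"):
    """Source side of the Simple C+ P2C baseline (Figure 1(c)):
    previously confirmed characters, a separator token, then the
    current pinyin syllables, closed by an end-of-sequence marker."""
    return list(prev_choice) + [sep] + list(cur_pinyin) + [eos]

# Running example from Figure 1: the user previously confirmed
# "今天 的 天气 怎么样 呢" (How's the weather?) and now types the
# pinyin "tianqi hai keyi de" for "天气 还 可以 的" (The weather is not bad).
prev_choice = ["今天", "的", "天气", "怎么样", "呢"]
cur_pinyin = ["tianqi", "hai", "keyi", "de"]

print(build_concat_source(prev_choice, cur_pinyin))
# ['今天', '的', '天气', '怎么样', '呢', 'BC', 'tianqi', 'hai', 'keyi', 'de', 'EOS']
```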
As illustrated in Figure 1, the core of our P2C model is an attention-based neural machine translation model that converts at the word level. As in the traditional model of Figure 1(a), we still formulize P2C as a translation between pinyin and character sequences. However, a key difference from any previous work is that our source language side includes two types of inputs: the current source pinyin sequence (denoted P) as usual, and the extended context, i.e., the target character sequence inputted by the IME user last time (denoted C). As an IME works dynamically, every time the IME makes a prediction according to a source pinyin input, the user has to indicate the 'right answer', namely the output target character sequence, for P2C model learning. This online work mode of IMEs can be fully exploited by our model, whose work flow is shown in Figure 1(b).

In order to ensure the correlation between the pinyin and Chinese character embeddings, we build a parallel bilingual training corpus and use it to train the pinyin embeddings and the Chinese embeddings at once. We use two bidirectional gated recurrent unit (BiGRU) (Cho et al., 2014) encoders to get contextual representations of the source pinyin and the context respectively, H_p = BiGRU(P) and H_c = BiGRU(C), where the representation of each word is formed by concatenating the forward and backward hidden states.

For each pinyin p_i in H_p, the GA module forms a word-specific representation of the context c_i ∈ H_c using soft attention, and then uses an element-wise product to combine the context representation with the pinyin representation:

α_i = softmax(H_cᵀ p_i),   β_i = H_c α_i,   x_i = p_i ⊙ β_i,

where ⊙ is the element-wise multiplication operator.

The augmented pinyin representation H̃_p = (x_1, x_2, ..., x_k) is then sent into the encoder-decoder framework. The encoder is a bidirectional long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997): the vectorized inputs are fed to forward and backward LSTMs to obtain internal representations in the two directions, and the output for each input is the concatenation of the two direction vectors. Our decoder is based on the global attentional model proposed by (Luong et al., 2015), which considers the hidden states of the encoder when deriving the context vector, so that the output probability is conditioned on a distinct context vector for each target word.

… the user's input does not end with typing enter, we can regard the current input pinyin sequence as an incomplete one.

3.2 Datasets and Settings

                         PD                  DC
                   Train     Test      Train     Test
# Sentence         5.0M      2.0K      1.0M      2.0K
L < 10 (%)         88.7      89.5      43.0      54.0
L < 50 (%)         11.3      10.5      47.0      42.0
L > 50 (%)          0.0       0.0       4.0       2.0
Relativity (%)     18.0      21.1      65.8      53.4
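To make the gated-attention step of Section 2 concrete, the following is a minimal numpy sketch written directly from the equations α_i = softmax(H_cᵀ p_i), β_i = H_c α_i, x_i = p_i ⊙ β_i. The column-major layout (one column per token) and the toy dimensions are assumptions for illustration only; the BiGRU encoders that would produce H_p and H_c, and the downstream BiLSTM encoder and attentional decoder, are omitted.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gated_attention(H_p, H_c):
    """Gated attention over the pinyin representations.

    H_p: (d, k) array, one column per pinyin syllable (assumed BiGRU output).
    H_c: (d, m) array, one column per context token (assumed BiGRU output).
    Returns the augmented pinyin representation H~_p of shape (d, k).
    """
    d, k = H_p.shape
    X = np.empty_like(H_p)
    for i in range(k):
        p_i = H_p[:, i]
        alpha_i = softmax(H_c.T @ p_i)   # attention weights over context tokens
        beta_i = H_c @ alpha_i           # context summary specific to p_i
        X[:, i] = p_i * beta_i           # element-wise gating of the pinyin vector
    return X

# Toy shapes: d = 4 hidden dimensions, k = 3 pinyin syllables, m = 5 context tokens.
rng = np.random.default_rng(0)
H_tilde = gated_attention(rng.normal(size=(4, 3)), rng.normal(size=(4, 5)))
print(H_tilde.shape)  # (4, 3)
```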