Moon IME: Neural-based Chinese Pinyin Aided Input Method with Customizable Association

Yafang Huang1,2,∗, Zuchao Li1,2,∗, Zhuosheng Zhang1,2, Hai Zhao1,2,†
1Department of Computer Science and Engineering, Shanghai Jiao Tong University
2Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
{huangyafang, charlee, [email protected], [email protected]

Abstract

A Chinese pinyin input method engine (IME) lets users conveniently input Chinese into a computer by typing pinyin on a common keyboard. In addition to offering high conversion quality, a modern pinyin IME is expected to aid user input with extended association functions. However, existing solutions for such functions are roughly based on oversimplified word-level matching algorithms, and the resulting products provide only limited extension of user inputs. This work presents the Moon IME, a pinyin IME that integrates an attention-based neural machine translation (NMT) model and information retrieval (IR) to offer amusive and customizable association ability. The released IME is implemented on Windows via the Text Services Framework.

1 Introduction

Pinyin is the official romanization for Chinese, and pinyin-to-character (P2C) conversion, which converts an input pinyin sequence into a Chinese character sequence, is the core module of all pinyin-based IMEs. Previous work in the literature focuses only on the pinyin-to-character conversion itself, paying less attention to user experience with associative advances, let alone predictive typing or automatic completion. However, more agile association outputs from IME prediction can undoubtedly lead to an incomparable user typing experience, which motivates this work.

∗ These authors contribute equally. † Corresponding author. This paper was partially supported by National Key Research and Development Program of China (No. 2017YFB0304100), National Natural Science Foundation of China (No. 61672343 and No. 61733011), Key Project of National Society Science Foundation of China (No. 15-ZDA041), and The Art and Science Interdisciplinary Funds of Shanghai Jiao Tong University (No. 14JCRZ04).

Modern IMEs are expected to extend P2C with association functions that additionally predict the next series of characters the user is attempting to enter. Such extended capability generally falls into two categories: auto-completion and follow-up prediction. The former looks up all possible phrases that might match the user input even when the input is incomplete. For example, when receiving the pinyin syllable "bei", the auto-completion module will predict "北京" (beijing, Beijing) or "背景" (beijing, background) as word-level candidates. The second scenario is when a user completes entering a set of words, in which case the IME presents appropriate collocations for the user to choose. For example, after the user selects "北京" (Beijing) from the candidate list in the above example, the IME will show a list of collocations that follow the word Beijing, such as "市" (city) and "奥运" (Olympics).

This paper presents the Moon IME, a pinyin IME engine with an association cloud platform, which integrates an attention-based neural machine translation (NMT) model with diverse associations to enable a customizable and amusive user typing experience.

Compared to its existing counterparts, Moon IME offers the following promising advantages:
• It is the first attempt to adopt an attentive NMT method for P2C conversion in both IME research and engineering.
• It provides a general association cloud platform which contains follow-up prediction and machine translation modules for typing assistance.
• With an information-retrieval-based module, it realizes fast and effective auto-completion which
can help users type sentences in a more convenient and efficient manner.
• With a powerful customizable design, the association cloud platform can be adapted to specific domains, such as law and medicine, which contain complex specialized terms.

(Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 140–145, Melbourne, Australia, July 15–20, 2018. © 2018 Association for Computational Linguistics.)

The rest of the paper is organized as follows: Section 2 demonstrates the details of our system. Section 3 presents the feature functions of the realized IME. Related work is introduced in Section 4. Section 5 concludes this paper.

[Figure 1: Architecture of the proposed Moon IME. The Text Services Framework passes pinyin to the Moon Input Method Engine (pinyin text segmentation followed by NMT-based P2C conversion), which returns conversion results and candidate lists; the Association Platform supplies auto-completion, question answering, and machine translation candidates backed by a customized knowledge base. The figure's Chinese annotation reads "input pinyin and obtain a series of candidate answers from it".]

2 System Details

Figure 1 illustrates the architecture of Moon IME. Moon IME is built on the Windows Text Services Framework (TSF).1 It extends the open-source project PIME2 with three main components: a) pinyin text segmentation, b) a P2C conversion module, and c) an IR-based association module. The nub of our work is realizing an engine that stably converts pinyin to Chinese while giving reasonable association lists.

2.1 Input Method Engine

Pinyin Segmentation: For convenient reference, hereafter a character in pinyin also refers to an independent syllable where this causes no confusion, and a word means a pinyin syllable sequence corresponding to a true Chinese word.

As (Zhang et al., 2017) shows, P2C conversion in an IME may benefit from decoding longer pinyin sequences for more efficient input. When a given pinyin sequence becomes longer, the list of corresponding legal character sequences shrinks significantly. Thus, we train our P2C model on segmented corpora. We used baseSeg (Zhao et al., 2006) to segment all text and carried out training at both word level and character level.

NMT-based P2C module: Our P2C module is implemented with the OpenNMT toolkit3, as we formulate P2C as a translation between pinyin and character sequences. Given a pinyin sequence X and a Chinese character sequence Y, the encoder of the P2C model encodes the pinyin representation at word level, and the decoder generates the target Chinese sequence that maximizes P(Y|X) under maximum likelihood training.

The encoder is a bi-directional long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997). The vectorized inputs are fed to a forward LSTM and a backward LSTM to obtain the internal features of both directions. The output for each input is the concatenation of the two vectors from both directions: h_t = [→h_t ; ←h_t].

Our decoder is based on the global attentional model proposed by (Luong et al., 2015), which takes the hidden states of the encoder into consideration when deriving the context vector. The probability is conditioned on a distinct context vector for each target word. The context vector is computed as a weighted sum of the previous hidden states.

1 TSF is a system service available as a redistributable for Windows 2000 and later versions of the Windows operating system. A TSF text service provides multilingual support and delivers text services such as keyboard processors, handwriting recognition, and speech recognition.
2 https://github.com/EasyIME/PIME
3 http://opennmt.net
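The bi-directional encoding and global attention described above can be sketched in a few lines. The following plain-Python toy is ours, not the system's (the real implementation uses OpenNMT; the "dot" scoring variant and all names here are illustrative assumptions): it builds each source representation as the concatenation h_t = [→h_t ; ←h_t] and derives a context vector as an attention-weighted sum of encoder states.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def global_attention_context(enc_states, dec_state):
    # Luong-style "dot" global attention: score every encoder hidden
    # state against the current decoder state, normalize the scores,
    # and return the weighted sum of encoder states as the context.
    scores = [dot(h, dec_state) for h in enc_states]
    weights = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, enc_states))
               for i in range(dim)]
    return context, weights

# Bi-directional encoding: each source position is represented by the
# concatenation of its forward and backward LSTM states (toy vectors here).
T, d = 4, 3
h_fwd = [[random.random() for _ in range(d)] for _ in range(T)]
h_bwd = [[random.random() for _ in range(d)] for _ in range(T)]
enc_states = [f + b for f, b in zip(h_fwd, h_bwd)]   # length-2d vectors

dec_state = [random.random() for _ in range(2 * d)]
context, weights = global_attention_context(enc_states, dec_state)
```

The attention weights form a proper distribution over source positions, so the context vector always stays inside the convex hull of the encoder states.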
The probability of each candidate word being the recommended one is predicted using a softmax layer over the inner product between the source and candidate target characters.

Our model is initially trained on two datasets, namely the People's Daily (PD) corpus and the Douban (DC) corpus. The former is extracted from the People's Daily from 1992 to 1998 and carries word segmentation annotations by Peking University. The DC corpus was created by (Wu et al., 2017) from Chinese open-domain conversations; one sentence of the DC corpus contains one complete utterance in a continuous dialogue situation. The statistics of the two datasets are shown in Table 1. With character text available, the needed parallel corpus between pinyin and character texts is automatically created following the approach proposed by (Yang et al., 2012).

… text classification and information retrieval. The TF (term frequency) term is simply a count of the number of times a word appears in a given context, while the IDF (inverse document frequency) term puts a penalty on how often the word appears elsewhere in the corpus. The final TF-IDF score is the product of these two terms:

TF-IDF(w, d, D) = f(w, d) × log( N / |{d ∈ D : w ∈ d}| )

where f(w, d) is the number of times word w appears in context d, N is the total number of dialogues, and the denominator is the number of dialogues in which word w appears.

In the IME scenario, TF-IDF vectors are first calculated for the input context and for each candidate response from the corpus. Given the set of candidate response vectors, the one with the highest cosine similarity to the context vector is selected as the output. For Recall@k, the top k candidates are returned; in this work, we only make use of the top-1 match.
            Chinese   Pinyin
PD  # MIUs    5.04M
    # Vocab   54.3K    41.1K
DC  # MIUs    1.00M
    # Vocab   50.0K    20.3K

Table 1: MIU count and vocabulary size statistics of our training data. PD refers to the People's Daily corpus, DC to the Douban conversation corpus.

3 User Experience Advantages

3.1 High Quality of P2C

We use Maximum Input Unit (MIU) accuracy (Zhang et al., 2017) to evaluate the quality of our P2C module by measuring the conversion accuracy of MIUs, where an MIU is defined as the longest uninterrupted Chinese character sequence inside a sentence. Here are the hyperparameters we used: (a) deep LSTM models, 3 layers, 500 cells; (c) 13 epoch
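The MIU accuracy metric just defined can be sketched as follows. This is a rough illustration only: the regex reading of "uninterrupted Chinese character sequence" and the positional alignment between reference and converted MIUs are our assumptions, not the paper's exact evaluation protocol.

```python
import re

def split_mius(text):
    """Split a sentence into Maximum Input Units: maximal runs of
    uninterrupted Chinese characters (anything else is a separator)."""
    return re.findall(r"[\u4e00-\u9fff]+", text)

def miu_accuracy(references, predictions):
    # An MIU counts as correct only if the converted output reproduces it
    # exactly; accuracy is averaged over all reference MIUs, pairing MIUs
    # by position within each sentence.
    correct = total = 0
    for ref, pred in zip(references, predictions):
        ref_mius = split_mius(ref)
        pred_mius = split_mius(pred)
        total += len(ref_mius)
        correct += sum(r == p for r, p in zip(ref_mius, pred_mius))
    return correct / total if total else 0.0

refs = ["我爱北京, 天安门"]    # two MIUs, separated by punctuation
preds = ["我爱北京, 天按门"]   # first MIU converted correctly, second not
# miu_accuracy(refs, preds) → 0.5
```

Splitting on non-Chinese characters is what makes the metric stricter than word accuracy but more forgiving than whole-sentence accuracy.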