Open Vocabulary Learning for Neural Chinese Pinyin IME
Zhuosheng Zhang 1,2, Yafang Huang 1,2, Hai Zhao 1,2,*
1 Department of Computer Science and Engineering, Shanghai Jiao Tong University
2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China
{zhangzs, [email protected], [email protected]

* Corresponding author. This paper was partially supported by the National Key Research and Development Program of China (No. 2017YFB0304100) and Key Projects of the National Natural Science Foundation of China (U1836222 and 61733011).

Abstract

Pinyin-to-character (P2C) conversion is the core component of pinyin-based Chinese input method engines (IMEs). However, the conversion is seriously compromised both by the ambiguity of the Chinese characters corresponding to pinyin and by predefined fixed vocabularies. To alleviate these inconveniences, we propose a neural P2C conversion model augmented by an online updated vocabulary with a sampling mechanism to support open vocabulary learning while the IME is working. Our experiments show that the proposed method outperforms commercial IMEs and state-of-the-art traditional models on a standard corpus and on real user input history in terms of multiple metrics, and thus that the online updated vocabulary indeed helps our IME effectively follow user input behavior.

1 Introduction

Chinese writing uses up to 20,000 different characters, so it is non-trivial to type a Chinese character directly on a Latin-style keyboard, which has only 26 keys (Zhang et al., 2018a). Pinyin, the official romanization of Chinese, provides a solution: it maps each Chinese character to a string of Latin letters, so that every character has a spelled-out form of its own and users can type pinyin in Latin letters to input Chinese characters into a computer. Converting pinyin to Chinese characters is therefore the most basic module of all pinyin-based IMEs.

As each Chinese character may be mapped to a pinyin syllable, it is natural to regard Pinyin-to-Character (P2C) conversion as machine translation between two different languages: pinyin sequences and Chinese character sequences (namely, Chinese sentences). In fact, translation in the P2C procedure is even more straightforward and simple, considering that the target Chinese character sequence keeps the same order as the source pinyin sequence; this means the target sentence can be decoded from left to right without any reordering.

Meanwhile, the P2C procedure faces a well-known challenge: there is too much ambiguity in mapping a pinyin syllable to a character. In fact, only about 500 pinyin syllables correspond to tens of thousands of Chinese characters, and even the number of the most common characters exceeds 6,000 (Jia and Zhao, 2014). As is well known, homophones and polyphones are quite common in the Chinese language, so one pinyin syllable may correspond to ten or more Chinese characters on average.

However, a pinyin IME may benefit from decoding longer pinyin sequences for more efficient input. When a given pinyin sequence becomes longer, the list of corresponding legal character sequences shrinks significantly. For example, an IME that knows the pinyin sequence bei jing can only be converted to either 背景 (background) or 北京 (Beijing) can perform the correct P2C decoding much more efficiently, as the single pinyin syllables bei and jing are each mapped to dozens of different single Chinese characters. Table 1 illustrates how the list of Chinese character sequences converted from the pinyin sequence bei jing huan ying ni (北京欢迎你, Welcome to Beijing) changes with the size of the source pinyin units.

Pinyin units          Candidate conversions per unit (intended target listed first)
1 syllable each       bei: 北, 背, ...   jing: 京, 井, ...   huan: 欢, 还, ...
                      ying: 迎, 颖, 应, ...   ni: 你, 拟, 逆, ...
2 syllables           bei jing: 北京, 背景   huan ying: 欢迎, ...   ni: 你
5 syllables           bei jing huan ying ni: 北京欢迎你

Table 1: The shorter the pinyin units are, the more character sequences they can be mapped to.
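To make the reduction shown in Table 1 concrete, the following Python sketch contrasts character-level enumeration with word-level lookup for bei jing. The candidate lists are small illustrative subsets of our own choosing, not any real IME dictionary, which would map each syllable to dozens of characters.

from itertools import product

# Toy per-syllable candidate sets (illustrative subsets, not a real dictionary).
SYLLABLE_CANDIDATES = {
    "bei": ["北", "背", "被", "杯", "悲"],
    "jing": ["京", "景", "经", "井", "静"],
}

# Toy word-level candidates for the joint two-syllable pinyin unit.
WORD_CANDIDATES = {
    ("bei", "jing"): ["北京", "背景"],
}

def char_level_candidates(syllables):
    # Enumerate every combination, one character per syllable.
    pools = [SYLLABLE_CANDIDATES[s] for s in syllables]
    return ["".join(combo) for combo in product(*pools)]

print(len(char_level_candidates(["bei", "jing"])))  # 25 combinations
print(WORD_CANDIDATES[("bei", "jing")])             # only 2 legal words

Even with these tiny candidate sets, per-syllable decoding must rank 25 combinations while the two-syllable unit leaves only two legal choices; with realistic lists of dozens of characters per syllable, the gap grows multiplicatively.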
To reduce P2C ambiguity by decoding longer input pinyin sequences, Chinese IMEs often utilize word-based language models, since character-based language models always suffer from the mapping ambiguity. However, the effectiveness of such word-based approaches to P2C is undermined by quite restricted vocabularies. The efficiency of IME conversion depends on the sufficiency of the vocabulary, and previous work on machine translation has shown that a large enough vocabulary is necessary to achieve good accuracy (Jean et al., 2015). In addition, sampling techniques for vocabulary selection have been proposed to balance the computational cost of conversion (Zhou et al., 2016; Wu et al., 2018). As an IME works, a user's input style may change from time to time, and different users input quite different content, so a predefined fixed vocabulary can never be sufficient. As a convenient workaround, most commercial IMEs have to manually update their vocabularies on a schedule. Moreover, training a word-based language model is especially difficult for rare words, which appear sparsely in the corpus but generally make up a large share of the dictionary.

To handle the open vocabulary learning problem in IMEs, in this work we introduce an online sequence-to-sequence (seq2seq) model for P2C and design a sampling mechanism utilizing our online updated vocabulary to enhance the conversion accuracy of the IME as well as to speed up decoding. In detail, first, a character-enhanced word embedding (CWE) mechanism is proposed to represent words, so that the proposed model lets the IME generally work at the word level and pick a very small target vocabulary for each sentence. Second, every time the user makes a selection that contradicts the prediction given by the P2C conversion module, the module updates the vocabulary accordingly, as sketched below. Our evaluation is performed on three diverse corpora, including two drawn from real user input history, to verify the effectiveness of the proposed method in different scenarios.
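Below is a minimal Python sketch, under our own assumptions, of the vocabulary update rule just described; the class and method names are hypothetical, and the frequency-based top-k selection only gestures at the sampling mechanism, whose actual design is given in later sections.

class OnlineVocabulary:
    # Toy online vocabulary: illustrative, not the paper's implementation.

    def __init__(self, initial_words):
        # Running frequency per word, used to pick per-sentence candidates.
        self.freq = {w: 1 for w in initial_words}

    def observe(self, predicted, selected):
        # When the user's choice contradicts the P2C prediction, the
        # selected word is added (or reweighted) on the fly.
        if selected != predicted:
            self.freq[selected] = self.freq.get(selected, 0) + 1

    def top_candidates(self, words, k=5):
        # Pick a small target vocabulary for one sentence by frequency.
        return sorted(words, key=lambda w: self.freq.get(w, 0), reverse=True)[:k]

vocab = OnlineVocabulary(["北京", "背景"])
# The user twice overrides the prediction with a word unknown to the IME.
vocab.observe(predicted="背景", selected="北境")
vocab.observe(predicted="北京", selected="北境")
print(vocab.top_candidates(["北京", "背景", "北境"], k=2))  # ['北境', '北京']

The point of the sketch is the trigger condition: the vocabulary grows only from disagreement between prediction and selection, so it tracks the individual user's input behavior rather than a fixed training corpus.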
The rest of the paper is organized as follows: Section 2 discusses related work. Sections 3 and 4 introduce the proposed model. Experimental results and the model analysis are presented in Sections 5 and 6, respectively. Section 7 concludes the paper.

2 Related Work

To effectively utilize words in IMEs, many natural language processing (NLP) techniques have been applied. Chen (2003) introduced a joint maximum n-gram model with syllabification for grapheme-to-phoneme conversion. Chen and Lee (2000) used a trigram language model and incorporated word segmentation to convert pinyin sequences to Chinese word sequences. Xiao et al. (2008) proposed an iterative algorithm to discover unseen words in a corpus for building a Chinese language model. Mori et al. (2006) described a vocabulary-enlarging method that captures context information.

For both pinyin-to-character conversion in Chinese IMEs and kana-to-kanji conversion in Japanese IMEs, a few language model training methods have been developed. Mori et al. (1998) proposed a probabilistic language model for IMEs. Jiampojamarn et al. (2008) presented online discriminative training. Lin and Zhang (2008) proposed a statistical model using the frequent nearby set of the target word. Chen et al. (2012) used collocations and k-means clustering to improve the n-pos model for Japanese IMEs. Jiang et al. (2007) put forward a P2C framework based on support vector machines. Hatori and Suzuki (2011) and Yang et al. (2012) applied statistical machine translation (SMT) to Japanese pronunciation prediction and Chinese P2C tasks, respectively. Chen et al. (2015) and Huang et al. (2018) regarded P2C as translation between two languages and solved it within a neural machine translation framework.

All the above-mentioned work, however, still relies on a predefined fixed vocabulary, and IME users have no way to refine their own dictionaries in a user-friendly manner. Zhang et al. (2017) is the work most closely related to ours, as it also offers an online mechanism to adaptively update the user vocabulary. The key difference is that this work presents the first neural solution with online vocabulary adaptation, while Zhang et al. (2017) stick to a traditional model for IMEs.

Recently, neural networks have been adopted for a wide range of tasks (Li et al., 2019; Xiao et al., 2019; Zhou and Zhao, 2019; Li et al., 2018a,b). The effectiveness of neural models depends on the size of the vocabulary on the target side, and previous work has shown that vocabularies of well over 50K word types are necessary to achieve good accuracy (Jean et al., 2015; Zhou et al., 2016). Neural machine translation (NMT) systems compute the probability of the next target word given both the previously generated target words and the source sentence. In the mixed representation of Cai et al. (2017), high-frequency word embeddings are attached to character embeddings via average pooling, while embeddings of low-frequency words are computed from character embeddings. Our embeddings also contain different granularity levels, but our word vocabulary can be updated in accordance with users' input choices while the IME is working; in contrast, Cai et al. (2017) build embeddings based on word frequency over a fixed corpus.

3 Our Models

For convenient reference, hereafter a character in the pinyin language also refers to an independent pinyin syllable, in cases where this causes no confusion.
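For concreteness, the left-to-right factorization that such NMT-style decoding relies on, and which a seq2seq P2C model can adopt directly since no reordering is needed, can be written out as follows. The notation is our own choice rather than the paper's:

% Standard seq2seq factorization (our notation): x = (x_1, ..., x_S) is the
% source pinyin syllable sequence and y = (y_1, ..., y_T) the target character
% or word sequence; P2C decodes strictly left to right, with no reordering.
\[
  P(y \mid x) = \prod_{t=1}^{T} P\left(y_t \mid y_{<t},\, x\right)
\]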