Tracing a Loose Wordhood for Chinese Input Method Engine

Xihu Zhang, Chu Wei and Hai Zhao*
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
{xihuzhangcs, courage17340}@.com, [email protected]

* Corresponding author.

Abstract

Chinese input methods are used to convert pinyin sequences or other Latin encoding systems into Chinese character sentences. For more effective pinyin-to-character conversion, typical Input Method Engines (IMEs) rely on a predefined vocabulary that demands manual maintenance on schedule. For the purpose of removing this inconvenient vocabulary setting, this work focuses on automatic wordhood acquisition by fully considering that Chinese inputting is a free human-computer interaction procedure. Instead of strictly defining words, a loose word likelihood is introduced to measure how likely a character sequence is to be a user-recognized word with respect to using an IME. An online algorithm is then proposed to adjust the word likelihood or generate new words by comparing the user's true choice for inputting with the algorithm's prediction. The experimental results show that the proposed solution can agilely adapt to diverse typings and demonstrates performance approaching a highly-optimized IME with a fixed vocabulary.

1 Introduction

The inputting of Chinese on computer differs from English and other alphabetic languages since Chinese has more than 20,000 characters and they cannot be simply mapped to the only 26 keys of a Latin keyboard. Chinese Input Method Engines (IMEs) have been developed to aid Chinese inputting. Among all IMEs, the most prominent type is the pinyin-based IME, which converts pinyin, the phonetic representation of Chinese characters, into Chinese character sequences. For example, a user wants to type the word "北京" (Beijing). The corresponding pinyin beijing is inputted, then the pinyin IME provides a list of Chinese character candidates whose pinyins are all beijing, such as "北京" (pinyin: beijing, the Beijing city) and "背景" (pinyin: beijing, background), as presented in Figure 1. The user at last selects the word "北京" as the result.

[Figure 1: IME interface on one page. Input: beijing; candidates: 1.北京 2.背景 3.背静 4.被 5.倍]

Chinese IMEs often have to utilize word-based language models, since character-based language models do not produce satisfactory results (Yang et al., 1998); Chinese words are composed of multiple consecutive Chinese characters. But unlike in alphabetic languages, the words in Chinese sentences are not demarcated by spaces, making Chinese word identification a nontrivial task. Therefore, Chinese word dictionaries are usually predefined to support word-based language models for IMEs.

However, there is a huge difference between standard Chinese word segmentation and IME word identification. As Chinese text has no blanks between characters, it allows all possible consecutive characters to hopefully form Chinese words to some extent. The Chinese word segmentation task learns from segmented corpora or predefined segmentation conventions, therefore not all consecutive character sequences can be regarded as words for the task to learn[1]. For IME word identification, it is a quite different case, because a user seldom types a complete sentence and may stop at any position, leaving a pinyin syllable segmentation and requiring the IME to give a proper pinyin-to-character conversion. Table 1 shows multiple possible segmentation choices for a given input. The segmentation points are indicated by vertical bars; the linguistically valid segmentation, though, is only the first one, which is "你 | 在 | 做 | 什么" (pinyin: ni zai zuo shenme, what are you doing).

[1] For example, a Chinese corpus may have 1 million bigrams, though still given fewer than 50 thousand meaningful word types.

[Figure 2: All possible character counterparts of bei, jing and beijing, e.g. bei: 北, 被, 背, 倍, 呗, ...; jing: 京, 经, 静, 景, 精, 竟, ...; beijing: 北京, 背景, ... Note that though modern Chinese has five tones, in all pinyin-based IMEs tones are not inputted with the letters for simplicity, therefore a pinyin IME has to handle many more pinyin-to-character ambiguities in practice.]

Table 1: IME user's input segmentations
1. 你(you) | 在(be) | 做(do) | 什么(what)
2. 你(you) | 在做(be doing) | 什么(what)
3. 你在(you are) | 做什么(do what)
4. 你(you) | 在做什么(be doing what)
5. 你在做什么(what are you doing)
......

The consecutive characters converted into a 'word' during inputting thus have no unified standard, as shown in Table 1, in which every segment of every item may be a 'word', since any user-determined segment for IME inputting should be regarded as a word. The word defined by the user during Chinese inputting is therefore actually a quite loose concept. Meanwhile, the word in a linguistic sense is still useful heuristic information for IME construction, even though the IME word is much more freely defined than that for the task of Chinese word segmentation. For example, an IME that is aware that the pinyin beijing can only be converted to either "北京" (Beijing) or "背景" (background) can make the right and more efficient pinyin-to-character decoding, as both pinyins bei and jing are respectively mapped to dozens of different single Chinese characters (see Figure 2). Since capturing words can narrow down the scope of possible conversions, tracing such a kind of wordhood for IMEs becomes important in improving user experience.

The above huge difference in wordhood definitions between Chinese word segmentation and IMEs means that most techniques developed for the former are unavailable to the latter, even though many studies have been devoted to Chinese word segmentation, using either supervised or unsupervised learning, and amazing results have been received (Chang and Su, 1997; Ge et al., 1999; Huang and Powers, 2003; Peng et al., 2002; Zhao et al., 2006; Pei et al., 2014; Chen et al., 2015; Ma and Hinrichs, 2015; Cai and Zhao, 2016).

An IME is a piece of human-computer interaction software that manages user inputting. We here consider two issues for effective processing.

• Challenge: As a user behavior, Chinese input may vary from user to user and from time to time, which makes it impossible to find a stable and general model for the core components of an IME, vocabulary and user preference. For example, a word may be frequently inputted by a user for a few days but never touched again afterwards.

• Convenience: Every time, the user makes a choice from the character or character sequence candidate list given by the IME, which makes inputting an online human-computer interaction procedure. It means that every time the user indicates whether the prediction given by the IME is correct[2]. However, to the best of our knowledge, such an interactive property of IMEs has not been well studied or exploited. In fact, most, if not all, existing IMEs consider pinyin-to-character decoding in an offline model.

[2] IMEs are always capable of letting users input whatever they want, and the standard of goodness for an IME is actually its speed, namely, the faster an IME lets the user input, the better it is. Therefore a correct prediction means that the IME ranks the user-intended Chinese input at the first position of the candidate list. Ranking the expected input at the top position means that the user may input with the default keystrokes so that the inputting is the most efficient.
To solve the word dilemma in Chinese IMEs, this paper presents a novel solution of automatic wordhood acquisition for IMEs without assuming vocabulary initialization, by fully considering for the first time that Chinese inputting is a human-computer interaction procedure. In detail, the proposed method uses an algorithm to trace a word likelihood (but not words) that measures how much a consecutive character sequence looks like a 'word' according to the user's inputting pattern. The algorithm initially only has a pinyin-to-character conversion table[3] and an empty vocabulary. Each time, the algorithm receives a pinyin sequence and predicts its corresponding Chinese character sequence. By comparing the user's true choice for inputting with the algorithm's prediction, the vocabulary is updated and the newly-introduced word likelihood is adjusted to follow the user's inputting change.

[3] There is a one-to-one mapping between a character and a monosyllable pinyin for most Chinese characters, though a few Chinese characters may be mapped to two or more different syllables. Such a mapping or conversion table is what defines Chinese pinyin over the different Chinese characters.

Our paper is organized as follows. Section 2 reviews related work. Section 3 introduces the framework of our model, discusses the pinyin-to-character conversion method and proposes an algorithm for adaptive word likelihood adjustment. In Section 4 the experiment settings and results are demonstrated. Section 5 concludes the paper.

2 Related Work

Early work on IMEs includes the trigram statistical language model (Chen and Lee, 2000) and the joint source-channel model (Li et al., 2004) that is applicable to Chinese, Japanese and Korean languages. Recent studies pay more attention to IME input typo correction (Zheng et al., 2011; Jia and Zhao, 2014) along with various approaches to Chinese spell checking (Chen et al., 2013; Han and Chang, 2013; Chiu et al., 2013). There are also studies on Japanese IMEs which focus on kana-kanji conversion, including a probabilistic language model based method (Mori et al., 1998), discriminative methods (Jiampojamarn et al., 2008; Cherry and Suzuki, 2009; Tokunaga et al., 2011), an improved n-pos model (Chen et al., 2012), and an ensemble method (Okuno and Mori, 2012). However, all of the above are based on a predefined vocabulary and do not exploit the interactive property of IME inputting. (Gao et al., 2002) and (Mori et al., 2006) tried to address the unknown word problem, but they still require an initial vocabulary or a segmented training corpus.

In addition to the above studies on vocabulary organization and language model adaptation for IMEs, most current commercial IMEs actually use two engineering solutions to follow user vocabulary change. The first is to simply save and repeat what is frequently inputted and confirmed by the user in recent times. Nevertheless, such recent high-frequency input will not be permanently added to the IME vocabulary. The second is to push new words or a new language model on schedule. The new words are collected and mined from all online IME users' inputs. However, these updates of new words or new models are not real-time, and they rely on a distributed user group rather than adapting to each individual user.

This work differs from all existing scientific and engineering solutions in the following aspects. First, we focus on adaptive word acquisition for IMEs, not on any standard machine learning task such as all the recent work on Chinese word segmentation, as the IME word definition is quite different from these tasks (Chang and Su, 1997; Ge et al., 1999; Huang and Powers, 2003; Peng et al., 2002; Zhao et al., 2006; Pei et al., 2014; Ma and Hinrichs, 2015; Chen et al., 2015; Cai and Zhao, 2016). Second, the interactive property of IMEs is fully exploited, while it was never seriously taken into account in previous work. Third, the proposed method is optimized for each user rather than over all users' statistics, and keeps updating nearly in real time.

3 Our Model

The proposed model consists of two technical modules. First, a Pinyin-to-Character Conversion Module receives a pinyin string from the user and decodes it into a character sequence. Since the IME provides conversion candidates, the user can complete the correct character sequence with additional effort. Therefore, after the user completes inputting, the IME obtains the correct character sequence corresponding to the input pinyin string, which may help improve further predictions.
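To make this interaction concrete, the following is a minimal Python sketch of the online loop; convert_fn and update_fn are hypothetical stand-ins for the two modules described above and are not names from any released implementation:

```python
# A minimal sketch of the interactive inputting loop described above.
# convert_fn and update_fn are hypothetical stand-ins for the two modules:
# pinyin-to-character conversion and online word-likelihood learning.

def interactive_session(pinyin_stream, user_choice, convert_fn, update_fn, k=10):
    """For each pinyin string: predict candidates, observe the user's true
    choice, and feed that confirmed character sequence back to the learner."""
    for pinyin in pinyin_stream:
        candidates = convert_fn(pinyin, k)           # module 1: top-k conversion candidates
        confirmed = user_choice(pinyin, candidates)  # the user picks the intended characters
        update_fn(confirmed)                         # module 2: adjust vocabulary and IWL
```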

3.1 Pinyin-to-Character Conversion

At the beginning, the input string is segmented into a pinyin syllable sequence. Chinese pinyin is highly regular, as every pinyin syllable contains a consonant and a vowel, or is a vowel-only syllable of a few simple types. The segmentation of pinyin syllables already reaches very high accuracy (over 98%) by using a trigram language model over pinyin syllables with improved Kneser-Ney smoothing (Kneser and Ney, 1995)[4].

[4] A pinyin language model can be conveniently trained on a pinyin syllable corpus.

According to our previous discussion, the word can be a very loose concept for an IME in practice. Therefore, instead of defining an absolute wordhood, we introduce the IME Word Likelihood (IWL), which can be regarded as a kind of wordhood weight for every consecutive character sequence, evaluating how possible it is for that sequence to be a word for the IME.

For the conversion from pinyin syllables to character sequences, we are handling a kind of translation task from pinyin to characters, therefore a translation model has to be built. Suppose that such a conditional probability distribution has been learned; our objective is to find the character sequence $\hat{W}$, segmented into learned words, which maximizes the conditional probability $\Pr[W|P]$ given the input pinyin sequence $P$, assuming that the sequence has been segmented into learned words and that each word only depends on its previous $n-1$ word history:

$$\hat{W} = \operatorname*{argmax}_{W} \Pr[W|P]
         = \operatorname*{argmax}_{W} \frac{\Pr[P|W]\Pr[W]}{\Pr[P]}
         = \operatorname*{argmax}_{W} \Pr[P|W]\Pr[W]
         = \operatorname*{argmax}_{W=w_1..w_m} \prod_i \Pr[p_i|w_i]\,\Pr[w_i|w_{i-1}..w_{i-n+1}]$$

where all conditional probabilities can be estimated or derived from IWLs and word counts as follows:

$$\Pr[w_i] = \frac{IWL(w_i)}{\sum_{w} IWL(w)}$$

$$\Pr[w_i|w_{i-1}..w_j] = \frac{Count(w_j..w_{i-1}w_i)}{\sum_{w} Count(w_j..w_{i-1}w)}$$

$$\Pr[p_i|w_i] = \frac{\Pr[p_i, w_i]}{\Pr[w_i]}$$

In the above formulas, $\Pr[P]$ is a constant for the IME decoding task during each turn of inputting, so the conditional probability can be equally computed via the joint probability. Notice that $\Pr[p_i, w_i]$ can be derived by counting co-occurrences of $p_i$ and $w_i$.

In all existing IMEs, $\Pr[w_i]$ is estimated by an offline n-gram language model with a fixed vocabulary trained on a corpus. Here we instead propose an online algorithm that estimates it over a dynamic vocabulary through the IWL, which will be introduced in the next subsection in detail.

Algorithm 1 Online Model
Input:
• Update times N;
• Vocabulary D;
• IWL table;
• Chinese character sequence C;
• Predefined vocabulary capacity cap;
• Predefined vocabulary culling period per;
• Predefined maximum length of IME word maxlen;
• Predefined update parameters α, β, γ.
Output:
• Updated IWL table;
• Updated vocabulary D.
1: if N mod per = 0 then
2:   while size of D > cap do
3:     w = word with lowest IWL in D
4:     Remove w from D and set its IWL to 0
5:   end while
6: end if
7: for all consecutive substrings s in C do
8:   if length of s ≤ maxlen then
9:     Add substring s to D
10:    IWL(s) += α
11:  end if
12: end for
13: Segment C into S = w1...wm with maximum Pr[W]
14: for w in segmentation S do
15:   Increase IWL of word w by β · Pr[W] + γ
16:   Increase n-gram counts by 1 for n-grams occurring in S
17: return IWL, D
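To make the update concrete, here is a minimal Python sketch of Algorithm 1, assuming a set for the vocabulary and plain dictionaries for the IWL table and n-gram counts. The values of α, β, γ and the 1-million word capacity follow Sections 3.2 and 4.2; the culling period, maximum word length and the bigram-only counting below are illustrative choices, and segment_max_prob is a placeholder for the max-Pr[W] segmentation of line 13 (a possible sketch of it is given at the end of Section 3.2):

```python
# A minimal sketch of Algorithm 1 (online IWL and vocabulary update).
# Names and the per/maxlen values are illustrative; segment_max_prob stands in
# for the max-Pr[W] segmentation of line 13.

def online_update(N, D, iwl, counts, C, segment_max_prob,
                  cap=1_000_000, per=1000, maxlen=4,
                  alpha=1.0, beta=5.0, gamma=1.0):
    """One update step after the user confirms character sequence C."""
    # Lines 1-6: periodically cull the vocabulary down to its capacity.
    if N % per == 0:
        while len(D) > cap:
            w = min(D, key=lambda x: iwl.get(x, 0.0))
            D.remove(w)
            iwl[w] = 0.0

    # Lines 7-12: every short enough substring of C becomes a word candidate.
    for i in range(len(C)):
        for j in range(i + 1, min(i + maxlen, len(C)) + 1):
            s = C[i:j]
            D.add(s)
            iwl[s] = iwl.get(s, 0.0) + alpha

    # Line 13: segment C into learned words maximizing Pr[W].
    segmentation, prob_W = segment_max_prob(C, D, iwl, counts)

    # Lines 14-16: reward words and n-grams appearing in the best segmentation
    # (bigram counting is used here as a simplification of the n-gram update).
    for k, w in enumerate(segmentation):
        iwl[w] = iwl.get(w, 0.0) + beta * prob_W + gamma
        context = tuple(segmentation[max(0, k - 1):k + 1])
        counts[context] = counts.get(context, 0) + 1

    return iwl, D
```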

[Figure 3: Top-1 score per MIU group series on the People's Daily; curves: Offline Trigram vs. OMWA.]

3.2 Tracing IME Word Likelihood

A straightforward approach is to define the IWL as the frequency of the character sequence. This strategy is intuitive and effective but causes an issue: a substring of a word always has a higher IWL even if it is a meaningless fragment. For example, "总而言之" (overall) may have obtained a large IWL value, but its meaningless fragment "总而言" can have an even higher IWL value. One way to address this problem is to perform a segmentation that maximizes the language model probability Pr[W] over the currently converted character sequence and to set a bonus on the IWL of the words appearing in the segmentation output. In such a case, the segmentation operation does recognize the concerned part as a word. Since a higher Pr[W] implies a more reliable segmentation, the IWL is increased by Pr[W] multiplied by a constant weight β to amplify its effect on the IWL. In our implementation, we set α = 1.0, β = 5.0 and γ = 1.0 according to our preliminary empirical results on related development datasets.

As the implementation of the proposed algorithm only has limited memory to use, we cull the vocabulary by periodically removing IME words with too low IWL values.

Our detailed algorithm is presented in Algorithm 1. It maintains a vocabulary which is initially empty. For every possible IME word in the vocabulary, its IWL is updated online. For an IME word which does not appear in the vocabulary, its IWL is considered zero. The segmentation in line 13 that maximizes Pr[W] can be efficiently done by adopting the Viterbi algorithm (Viterbi, 1967).
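As an illustration of that segmentation step, here is a minimal Viterbi-style sketch that scores words by IWL-derived unigram probabilities only; it is a simplified reading of the model (the full model also conditions each word on its previous n-1 word history), not the released implementation:

```python
# A minimal Viterbi-style sketch of line 13: segment a confirmed character
# sequence C into IME words maximizing Pr[W]. For simplicity this version
# uses IWL-derived unigram probabilities only.

import math

def segment_max_prob(C, D, iwl, counts=None, maxlen=4):
    """Return (best segmentation as a word list, probability of that segmentation).
    counts is accepted for interface compatibility but unused in this sketch."""
    total = sum(iwl.values()) or 1.0

    def log_p(w):
        # Pr[w] = IWL(w) / sum_w IWL(w); unseen words get a tiny floor value.
        return math.log(max(iwl.get(w, 0.0), 1e-12) / total)

    n = len(C)
    best = [(-math.inf, None)] * (n + 1)   # best[j] = (log prob, split point i)
    best[0] = (0.0, None)
    for j in range(1, n + 1):
        for i in range(max(0, j - maxlen), j):
            w = C[i:j]
            if w in D or j - i == 1:       # fall back to single characters
                score = best[i][0] + log_p(w)
                if score > best[j][0]:
                    best[j] = (score, i)

    words, j = [], n
    while j > 0:
        i = best[j][1]
        words.append(C[i:j])
        j = i
    words.reverse()
    return words, math.exp(best[n][0])
```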

4 Experiments

4.1 Evaluation Metrics and Corpora

Chinese sentences may contain digits, alphabetic letters and punctuation that can be inputted without the IME. Since our concern is the accuracy of pinyin-to-character conversion, we evaluate IME performance on Maximum Input Units (MIUs) as proposed by Jia and Zhao (2013). An MIU is defined as a longest consecutive Chinese character sequence in a sentence, where sequences are separated by the non-Chinese-character parts. Notice that in real-world use, an IME always gives a list of character sequence candidates, as in Table 2, for the user to choose from, and lower-ranked candidates will be shorter for more confident outputs. Therefore, measuring IME performance is equivalent to evaluating such a ranked list.

Table 2: Top-5 IME candidates for converting 'Natural Language Processing' (input pinyin: zi ran yu yan chu li)
1. 自然语言处理
2. 自然语言
3. 自然
4. 自燃
5. 孜然

In practice, more exact and longer candidates will be put nearer the front of the list, and less inputting effort taken by the user indicates better IME performance. Thus we use the following top-K ranking score over the top-K predicted candidates $P_i$, introducing a decay factor $1/2^{i-1}$:

$$S = \sum_{i=1}^{K} \frac{1}{2^{i-1}}\, I(P_i, C)\, \frac{|P_i|}{|C|}$$

where $|C|$ and $|P_i|$ denote the lengths of the correct character sequence $C$ and of $P_i$ respectively, and $I$ is the following indicator function:

$$I(P_i, C) = \begin{cases} 1, & P_i \text{ is a prefix of } C \\ 0, & \text{otherwise} \end{cases}$$

In the following experiments we evaluate the performance on both top-1 and top-10 scores. The former is the most strict metric that directly measures the ratio of correctly predicted MIUs that can be inputted with the default (namely, the fewest) keystrokes. The latter covers more of the candidates offered for choice and can better model the real IME usage environment, as an IME basically gives a ranked list rather than a unique output. Note that IMEs can always produce the inputting expected by the user; the only difference between good and bad IMEs is how they rank the input candidates for the user to make the most convenient choice.

The two corpora presented in Table 3 are used for evaluation. The first corpus was previously used in (Yang et al., 2012); it has 390K MIUs for the test set and 4M MIUs for the training set. This corpus is extracted from the People's Daily[5] from 1992 to 1998, which has word segmentation annotations by Peking University and was also used for the second SIGHAN Bakeoff shared task (Emerson, 2005). The pinyin annotations follow the method in (Yang et al., 2012). The second corpus is from Touchpal[6] with 689K MIUs, which were collected from real user inputting history in recent years.

[5] This is the most popular newspaper in China.
[6] Touchpal is a leading IME provider in the world-wide market, http://www.touchpal.com.

Table 3: Corpus statistics
Corpus           #MIUs     #Characters
People's Daily   397,647   3,611,704
Touchpal         689,692   5,820,749
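For reference, a minimal Python sketch of this score for a single MIU is given below, together with a simple MIU extraction helper; the regular-expression character range is an assumption covering common Chinese characters, and none of this code comes from the paper's implementation:

```python
# A minimal sketch of the top-K ranking score defined above, plus an MIU
# extraction helper. The Unicode range below is an assumption covering the
# common CJK unified ideographs.

import re

def extract_mius(sentence):
    """Split a sentence into Maximum Input Units: maximal runs of Chinese characters."""
    return re.findall(r"[\u4e00-\u9fff]+", sentence)

def topk_score(candidates, correct, k=10):
    """S = sum_{i=1..K} 1/2^(i-1) * I(P_i, C) * |P_i| / |C|,
    where I(P_i, C) = 1 iff candidate P_i is a prefix of the correct MIU C."""
    score = 0.0
    for i, p in enumerate(candidates[:k], start=1):
        if correct.startswith(p):
            score += (0.5 ** (i - 1)) * len(p) / len(correct)
    return score

# Example: the full correct MIU ranked second contributes 1/2 of its length ratio.
# topk_score(["自燃", "自然语言处理"], "自然语言处理") == 0.5
```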

4.2 Online vs. Offline

To show that our model can learn words automatically, our online model with empty vocabulary initialization is compared, in terms of top-K score, with offline n-gram models built without any smoothing techniques on a large corpus.

Our Online Model for Word Acquisition (OMWA) is 4-gram and is initially assigned an empty vocabulary with a 1-million IME word capacity, which simulates the real case of having only limited memory. OMWA is not trained on any corpus; we directly run it on the test set. We use the test sets of the People's Daily corpus and the Touchpal corpus for evaluation. Note that these two corpora have a significant domain difference, as the former is news from about 20 years ago and the latter consists of free inputs by various users in recent years.

We compare with three offline models, respectively a unigram, a bigram and a trigram model, built on the training set of the People's Daily. The vocabulary of the offline models is directly extracted from the training set with its segmentation annotation. The implementation of pinyin-to-character conversion with the offline n-gram language models follows (Jia and Zhao, 2014) without spelling correction, for a fair comparison.

The top-K scores on the People's Daily and Touchpal are presented in Table 4, which demonstrates that OMWA outperforms all offline models even without vocabulary initialization. Here we show both top-1 and top-10 scores, in which the latter corresponds to two 'pages' of the ranking list for a typical IME output. On Touchpal, OMWA demonstrates its learning capability and outperforms all offline models significantly. Note that all offline models and vocabularies are trained or extracted from the People's Daily corpus, which introduces a domain difference: offline models cannot work as well as they do on their in-domain corpus. In contrast, the proposed OMWA keeps outputting much more stable and better results than all offline models.

Table 4: Comparison with offline models trained on the People's Daily only (%)
Corpus           Method            Top1   Top10
People's Daily   Offline Unigram   35.55  51.88
                 Offline Bigram    51.78  69.00
                 Offline Trigram   54.96  70.14
                 OMWA              55.27  74.25
Touchpal         Offline Unigram    8.30  19.77
                 Offline Bigram    17.68  26.92
                 Offline Trigram   19.70  27.73
                 OMWA              51.40  78.94

As the proposed OMWA relies on the interactive property of inputting, we also measure the top-1 score over every 2K sentences to demonstrate how OMWA behaves over these MIU group series. Since the trigram generally outperforms the unigram and bigram, we only present the comparison between OMWA and the offline trigram. The top-1 score per group on the People's Daily is presented in Figure 3 and the top-1 score per group on Touchpal is presented in Figure 4.

[Figure 4: Top-1 score per MIU group series on Touchpal; curves: Offline Trigram vs. OMWA.]

Figure 3 shows how OMWA catches up with the offline trigram model, which is trained right on the in-domain corpus. After learning from a few sentences, OMWA quickly approaches the offline trigram. After half of the corpus has been learned, OMWA generally outperforms the offline trigram model.

Figure 4 shows that OMWA keeps outperforming the offline trigram model. One of the reasons is that the trigram model is an out-of-domain model for the Touchpal corpus, as it is trained on the People's Daily and cannot adapt to a different corpus.

To demonstrate the robustness of OMWA, we present the top-1 score per group on a combined corpus in Figure 5. Both corpora, the People's Daily and Touchpal, are divided into five segments each, and the segments are then interlaced into one corpus. Vertical lines in the figure indicate the joints between the two different corpora. At every joint, including the very beginning, OMWA adapts to the corpus change after several sentences are learned. As the learning accumulates more and more, the adaptation of OMWA becomes quicker and quicker for each change. On the contrary, the offline trigram model performs stably only on its in-domain segments. Overall, OMWA has demonstrated robust performance over all corpus changes.

[Figure 5: Top-1 score per MIU group series on the joint corpus (P: People's Daily segment, T: Touchpal segment); curves: Offline Trigram vs. OMWA.]
4.3 Comparison with Google IME

We run OMWA on the corpus training sets and use the final model to compare with Google Input Tools[7], since it is the only commercial IME that provides an accessible interface for us to do the comparison[8]. Our model is only a simple model for word likelihood tracing, while Google IME is a commercial model which has utilized various components including an optimized language model, a high-quality vocabulary and massive corpora. Due to Internet connection limitations, we only run the test on a small part of each corpus. The corpora used in this experiment are shown in Table 5, and the results are in Table 6.

[7] http://www.google.com/inputtools/try
[8] We are aware that there are many more popular commercial IMEs. However, Google IME is currently the only choice that provides an applicable interface to enable us to make a fair comparison.

Table 5: Corpora for Google IME comparison
                    People's Daily  TouchPal
Training  #MIUs     4,231,352       689,692
          #Chars    38,236,958      5,820,749
Testing   #MIUs     7,603           6,998
          #Chars    74,277          59,151

Table 6: Top-K score comparison with Google IME (%)
Corpus           Method  Top1   Top10
People's Daily   Google  70.93  82.25
                 OMWA    64.41  77.93
Touchpal         Google  57.50  69.34
                 OMWA    57.12  80.95

The comparison shows that OMWA achieves a competitive top-1 score and a better top-10 score compared to Google IME on Touchpal, but relatively lower performance on the People's Daily. The dramatic difference in Google IME's performance between the two corpora may indicate an interesting discovery. As is well known, the People's Daily corpus has been popularly used in computational linguistics research and industry. There is no doubt that the IME was especially optimized on this popular corpus, but it failed to do so on another private corpus in a different domain. Being only a simple learner, OMWA performs similarly to Google IME on a corpus that the latter is not familiar with, which demonstrates that the proposed model alone can be comparable to a commercial IME while still keeping the potential to become better through much more later learning.

4.4 IME Words vs. Linguistic Words

Finally, we present the words extracted by OMWA with the top word likelihood weights from the People's Daily in Table 7. It is interesting that these top weighted words discovered by OMWA are indeed true words according to the strict Chinese linguistic standard.

Table 7: Words learned with top word likelihood weights on the People's Daily
Word               IWL
中国 (China)       13791
发展 (develop)     11430
工作 (job)          9805
经济 (economy)      9579
建设 (construct)    8139
人民 (people)       7761
同志 (comrade)      7657
社会 (society)      7572
企业 (enterprise)   7471
国家 (country)      6934

5 Conclusion and Further Discussions

In this paper, we proposed an Online Model for Word Acquisition. The algorithm adaptively learns new words, or more precisely, new word likelihoods, for more efficient inputting when using a Chinese IME. The experiments show that the proposed method can achieve competitive results compared to a commercial IME, and demonstrates more robustness and more adaptivity to changes of user inputting style.

In addition, the proposed method is more than an online word learner for IMEs only; it shows that the standard language model can be flexible enough to be defined in a slightly different way, in which all n-grams can equally receive probability updates, without considering whether they are known or unknown, so that the strict constraint of a fixed vocabulary for language model training can be relaxed.

At last, the proposed learner with a linear updating strategy can be further improved by adopting popular neural models for potential performance enhancement, which is left for future work.

References

Deng Cai and Hai Zhao. 2016. Neural word segmentation learning for Chinese. In ACL. pages 409-420.

Jing-Shin Chang and Keh-Yih Su. 1997. An unsupervised iterative method for Chinese new lexicon extraction. Computational Linguistics and Chinese Language Processing 2(2):97-148.

Kuan-Yu Chen, Hung-Shin Lee, Chung-Han Lee, Hsin-Min Wang, and Hsin-Hsi Chen. 2013. A study of language modeling for Chinese spelling check. In Proceedings of the 7th SIGHAN Workshop on Chinese Language Processing (SIGHAN 2013). pages 79-83.

Long Chen, Xianchao Wu, and Jingzhou He. 2012. Using collocations and k-means clustering to improve the n-pos model for Japanese IME. In CoLING. pages 45-56.

Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. 2015. Gated recursive neural network for Chinese word segmentation. In ACL. pages 1744-1753.

Zheng Chen and Kai-Fu Lee. 2000. A new statistical approach to Chinese pinyin input. In ACL. pages 241-247.

Colin Cherry and Hisami Suzuki. 2009. Discriminative substring decoding for transliteration. In EMNLP. pages 1066-1075.

Hsun-wen Chiu, Jian-cheng Wu, and Jason S Chang. 2013. Chinese spelling checker based on statistical machine translation. In Sixth International Joint Conference on Natural Language Processing (IJCNLP 2013). pages 49-53.

Thomas Emerson. 2005. The second international Chinese word segmentation bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing (SIGHAN 2005). pages 123-133.

Jianfeng Gao, Joshua Goodman, Mingjing Li, and Kai-Fu Lee. 2002. Toward a unified approach to statistical language modeling for Chinese. ACM Transactions on Asian Language Information Processing 1(1):3-33.

Xianping Ge, Wanda Pratt, and Padhraic Smyth. 1999. Discovering Chinese words from unsegmented text. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). pages 271-272.

Dongxu Han and Baobao Chang. 2013. A maximum entropy approach to Chinese spelling check. In Proceedings of the 7th SIGHAN Workshop on Chinese Language Processing (SIGHAN 2013). pages 74-78.

Jin Hu Huang and David Powers. 2003. Chinese word segmentation based on contextual entropy. In Proceedings of the 17th Asian Pacific Conference on Language, Information and Computation (PACLIC 17). pages 152-158.

Zhongye Jia and Hai Zhao. 2013. Kyss 1.0: a framework for automatic evaluation of Chinese input method engines. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013). pages 1195-1201.

Zhongye Jia and Hai Zhao. 2014. A joint graph model for pinyin-to-Chinese conversion with typo correction. In ACL. pages 1512-1523.

Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2008. Joint processing and discriminative training for letter-to-phoneme conversion. In ACL. pages 905-913.

R. Kneser and H. Ney. 1995. Improved backing-off for m-gram language modeling. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 1:181-184.

Haizhou Li, Min Zhang, and Jian Su. 2004. A joint source-channel model for machine transliteration. In ACL. pages 159-166.

Jianqiang Ma and Erhard W Hinrichs. 2015. Accurate linear-time Chinese word segmentation via embedding matching. In ACL. pages 1733-1743.

Shinsuke Mori, Daisuke Takuma, and Gakuto Kurata. 2006. Phoneme-to-text transcription system with an infinite vocabulary. In ACL. pages 729-736.

Shinsuke Mori, Masatoshi Tsuchiya, Osamu Yamaji, and Makoto Nagao. 1998. Kana-kanji conversion by a stochastic model. Information Processing Society of Japan SIG Notes 1998(49):75-82. In Japanese.

Yoh Okuno and Shinsuke Mori. 2012. An ensemble model of word-based and character-based models for Japanese and Chinese input method. In CoLING. pages 15-28.

Wenzhe Pei, Tao Ge, and Baobao Chang. 2014. Max-margin tensor neural network for Chinese word segmentation. In ACL. pages 293-303.

Fuchun Peng, Xiangji Huang, Dale Schuurmans, Nick Cercone, and Stephen E Robertson. 2002. Using self-supervised word segmentation in Chinese information retrieval. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). pages 349-350.

Hiroyuki Tokunaga, Daisuke Okanohara, and Shinsuke Mori. 2011. Discriminative method for Japanese kana-kanji input method. In Proceedings of the Workshop on Advances in Text Input Methods (WTIM 2011). pages 10-18.

Andrew J Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2):260-269.

Kae-Cherng Yang, Tai-Hsuan Ho, Lee-Feng Chien, and Lin-Shan Lee. 1998. Statistics-based segment pattern lexicon - a new direction for Chinese language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). volume 1, pages 169-172.

Shaohua Yang, Hai Zhao, and Bao-liang Lu. 2012. A machine translation approach for Chinese whole-sentence pinyin-to-character conversion. In Proceedings of the 26th Asian Pacific Conference on Language, Information and Computation (PACLIC 26). pages 333-342.

Hai Zhao, Chang-Ning Huang, Mu Li, and Bao-Liang Lu. 2006. Effective tag set selection in Chinese word segmentation via conditional random field modeling. In Proceedings of the 20th Asian Pacific Conference on Language, Information and Computation (PACLIC 20). volume 20, pages 87-94.

Yabin Zheng, Chen Li, and Maosong Sun. 2011. CHIME: An efficient error-tolerant Chinese pinyin input method. In IJCAI. volume 11, pages 2551-2556.