A New Statistical Approach to Chinese Pinyin Input

Total Page:16

File Type:pdf, Size:1020Kb

A New Statistical Approach to Chinese Pinyin Input A New Statistical Approach to Chinese Pinyin Input Zheng Chen Kai-Fu Lee Microsoft Research China Microsoft Research China No. 49 Zhichun Road Haidian District No. 49 Zhichun Road Haidian District 100080, China, 100080, China, [email protected] [email protected] may be achieved using a sentence-based input. Abstract Sentence-based input method chooses character by using a language model base on Chinese input is one of the key context. So its accuracy is higher than word- challenges for Chinese PC users. This based input method. In this paper, all the paper proposes a statistical approach to technology is based on sentence-based input Pinyin-based Chinese input. This approach method, but it can easily adapted to word-input uses a trigram-based language model and a method. statistically based segmentation. Also, to deal with real input, it also includes a In our approach we use statistical language typing model which enables spelling model to achieve very high accuracy. We correction in sentence-based Pinyin input, design a unified approach to Chinese statistical and a spelling model for English which language modelling. This unified approach enables modeless Pinyin input. enhances trigram-based statistical language modelling with automatic, maximum- 1. Introduction likelihood-based methods to segment words, select the lexicon, and filter the training data. Chinese input method is one of the most Compared to the commercial product, our difficult problems for Chinese PC users. There system is up to 50% lower in error rate at the are two main categories of Chinese input same memory size, and about 76% better method. One is shape-based input method, without memory limits at all (Jianfeng etc. such as "wu bi zi xing", the other is Pinyin, or 2000). pronunciation-based input method, such as "Chinese CStar", "MSPY", etc. Because of its However, sentence-based input methods facility to learn and to use, Pinyin is the most also have their own problems. One is that the popular Chinese input method. Over 97% of system assumes that users' input is perfect. In the users in China use Pinyin for input (Chen reality there are many typing errors in users' Yuan 1997). Although Pinyin input method input. Typing errors will cause many system has so many advantages, it also suffers from errors. Another problem is that in order to type several problems, including Pinyin-to- both English and Chinese, the user has to characters conversion errors, user typing switch between two modes. This is errors, and UI problem such as the need of two cumbersome for the user. In this paper, a new separate mode while typing Chinese and typing model is proposed to solve these English, etc. problems. The system will accept correct typing, but also tolerate common typing errors. Pinyin-based method automatically Furthermore, the typing model is also converts Pinyin to Chinese characters. But, combined with a probabilistic spelling model there are only about 406 syllables; they for English, which measures how likely the correspond to over 6000 common Chinese input sequence is an English word. Both characters. So it is very difficult for system to models can run in parallel, guided by a select the correct corresponding Chinese Chinese language model to output the most characters automatically. A higher accuracy likely sequence of Chinese and/or English The Chinese language model in equation characters. 2.1, Pr(H ) measures the a priori probability of a Chinese word sequence. Usually, it is The organization of this paper is as follows. determined by a statistical language model In the second section, we briefly discuss the (SLM), such as Trigram LM. Pr(P | H) , called Chinese language model which is used by typing model, measures the probability that a sentence-based input method. In the third Chinese word H is typed as Pinyin P . section, we introduce a typing model to deal with typing errors made by the user. In the Usually, H is the combination of Chinese fourth section, we propose a spelling model for words, it can decomposed into w , w ,Λ , w , English, which discriminates between Pinyin 1 2 n and English. Finally, we give some where wi can be Chinese word or Chinese conclusions. character. So typing model can be rewritten as equation 2.2. n Pr(P | H ) ≈ Pr(P | w ) , (2.2) 2. Chinese Language Model ∏ f (i) i i=1 Pinyin input is the most popular form of where, Pf (i) is the Pinyin of wi . text input in Chinese. Basically, the user types a phonetic spelling with optional spaces, like: The most widely used statistical language woshiyigezhongguoren model is the so-called n-gram Markov models And the system converts this string into a (Frederick 1997). Sometimes bigram or string of Chinese characters, like: trigram is used as SLM. For English, trigram is ( I am a Chinese ) widely used. With a large training corpus trigram also works well for Chinese. Many A sentence-based input method chooses the articles from newspapers and web are probable Chinese word according to the collected for training. And some new filtering context. In our system, statistical language methods are used to select balanced corpus to model is used to provide adequate information build the trigram model. Finally, a powerful to predict the probabilities of hypothesized language model is obtained. In practice, Chinese word sequences. perplexity (Kai-Fu Lee 1989; Frederick 1997) is used to evaluate the SLM, as equation 2.3. In the conversion of Pinyin to Chinese character, for the given Pinyin P , the goal is 1 N − ∑ log P(w |w − ) to find the most probable Chinese character N i i 1 = i=1 H , so as to maximize Pr(H | P) . Using Bayes PP 2 (2.3) law, we have: where N is the length of the testing data. The perplexity can be roughly interpreted as the ^ Pr(P | H ) Pr(H ) H = arg max Pr(H | P) = arg max geometric mean of the branching factor of the H H Pr(P) document when presented to the language (2.1) model. Clearly, lower perplexities are better. The problem is divided into two parts, typing We build a system for cross-domain model Pr(P | H ) and language model Pr(H ) . general trigram word SLM for Chinese. We trained the system from 1.6 billion characters Conceptually, all H ’s are enumerated, and of training data. We evaluated the perplexity the one that gives the largest Pr(H, P) is of this system, and found that across seven different domains, the average per-character selected as the best Chinese character perplexity was 34.4. We also evaluated the sequence. In practice, some efficient methods, system for Pinyin-to-character conversion. such as Viterbi Beam Search (Kai-Fu Lee Compared to the commercial product, our 1989; Chin-hui Lee 1996), will be used. system is up to 50% lower in error rate at the same memory size, and about 76% better 2000) without memory limits at all. (JianFeng etc. all characters with equivalent pronunciation 3. Spelling Correction into a single syllable. There are about 406 syllables in Chinese, so this is essentially 3.1 Typing Errors training: Pr(Pinyin String | Syllable) , and then mapping each character to its corresponding The sentence-based approach converts syllable. Pinyin into Chinese words. But this approach assumes correct Pinyin input. Erroneous input According to the statistical data from will cause errors to propagate in the psychology (William 1983), most frequently conversion. This problem is serious for errors made by users can be classified into the Chinese users because: following types: 1. Chinese users do not type Pinyin as 1. Substitution error: The user types one key frequently as American users type English. instead of another key. This error is 2. There are many dialects in China. Many mainly caused by layout of the keyboard. people do not speak the standard Mandarin The correct character was replaced by a Chinese dialect, which is the origin of character immediately adjacent and in the Pinyin. For example people in the southern same row. 43% of the typing errors are of area of China do not distinguish ‘zh’-‘z’, this type. Substitutions of a neighbouring ‘sh’-‘s’, ‘ch’-‘c’, ‘ng’-‘n’, etc. letter from the same column (column 3. It is more difficult to check for errors errors) accounted for 15%. And the while typing Pinyin for Chinese, because substitution of the homologous (mirror- Pinyin typing is not WYSIWYG. Preview image) letter typed by the same finger in experiments showed that people usually do the same position but the wrong hand, not check Pinyin for errors, but wait until accounted for 10% of the errors overall the Chinese characters start to show up. (William 1983). 3.2 Spelling Correction 2. Insertion errors: The typist inserts some keys into the typing letter sequence. One In traditional statistical Pinyin-to-characters reason of this error is the layout of the conversion systems, Pr(Pf (i) | wi ) , as keyboard. Different dialects also can result mentioned in equation 2.2, is usually set to 1 if in insertion errors. 3. Deletion errors: some keys are omitted P is an acceptable spelling of word w , f (i) i while typing. and 0 if it is not. Thus, these systems rely 4. Other typing errors, all errors except the exclusively on the language model to carry out errors mentioned before. For example, the conversion, and have no tolerance for any transposition errors which means the variability in Pinyin input. Some systems have reversal of two adjacent letters. the “southern confused pronunciation” feature to deal with this problem. But this can only We use models learned from psychology, address a small fraction of typing errors but train the model parameters from real data, because it is not data-driven (learned from real similar to training acoustic model for speech typing errors).
Recommended publications
  • Chinese Pinyin Aided IME, Input What You Have Not Keystroked Yet
    Chinese Pinyin Aided IME, Input What You Have Not Keystroked Yet Yafang Huang1;2, Hai Zhao1;2;∗, 1Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China 2Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China [email protected], [email protected] ∗ Abstract between pinyin syllables and Chinese characters. (Huang et al., 2018; Yang et al., 2012; Jia and Chinese pinyin input method engine (IME) Zhao, 2014; Chen et al., 2015) regarded the P2C as converts pinyin into character so that Chinese a translation between two languages and solved it characters can be conveniently inputted into computer through common keyboard. IMEs in statistical or neural machine translation frame- work relying on its core component, pinyin- work. The fundamental difference between (Chen to-character conversion (P2C). Usually Chi- et al., 2015) work and ours is that our work is a nese IMEs simply predict a list of character fully end-to-end neural IME model with extra at- sequences for user choice only according to tention enhancement, while the former still works user pinyin input at each turn. However, Chi- on traditional IME only with converted neural net- nese inputting is a multi-turn online procedure, work language model enhancement. (Zhang et al., which can be supposed to be exploited for fur- 2017) introduced an online algorithm to construct ther user experience promoting. This paper thus for the first time introduces a sequence- appropriate dictionary for P2C. All the above men- to-sequence model with gated-attention mech- tioned work, however, still rely on a complete in- anism for the core task in IMEs.
    [Show full text]
  • China's Shanzhai Entrepreneurs Hooligans Or
    China’s Shanzhai Entrepreneurs Hooligans or Heroes? 《中國⼭寨企業家:流氓抑或是英雄》 Callum Smith ⾼林著 Submitted for Bachelor of Asia Pacific Studies (Honours) The Australian National University October 2015 2 Declaration of originality This thesis is my own work. All sources used have been acknowledGed. Callum Smith 30 October 2015 3 ACKNOWLEDGEMENTS I am indebted to the many people whose acquaintance I have had the fortune of makinG. In particular, I would like to express my thanks to my hiGh-school Chinese teacher Shabai Li 李莎白 for her years of guidance and cherished friendship. I am also grateful for the support of my friends in Beijing, particularly Li HuifanG 李慧芳. I am thankful for the companionship of my family and friends in Canberra, and in particular Sandy 翟思纯, who have all been there for me. I would like to thank Neil Thomas for his comments and suggestions on previous drafts. I am also Grateful to Geremie Barmé. Callum Smith 30 October 2015 4 CONTENTS ACKNOWLEDGEMENTS ................................................................................................................................ 3 ABSTRACT ........................................................................................................................................................ 5 INTRODUCTION .............................................................................................................................................. 6 THE EMERGENCE OF A SOCIOCULTURAL PHENOMENON ...................................................................................................
    [Show full text]
  • The Challenge of Chinese Character Acquisition
    University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Faculty Publications: Department of Teaching, Department of Teaching, Learning and Teacher Learning and Teacher Education Education 2017 The hC allenge of Chinese Character Acquisition: Leveraging Multimodality in Overcoming a Centuries-Old Problem Justin Olmanson University of Nebraska at Lincoln, [email protected] Xianquan Chrystal Liu University of Nebraska - Lincoln, [email protected] Follow this and additional works at: http://digitalcommons.unl.edu/teachlearnfacpub Part of the Bilingual, Multilingual, and Multicultural Education Commons, Chinese Studies Commons, Curriculum and Instruction Commons, Instructional Media Design Commons, Language and Literacy Education Commons, Online and Distance Education Commons, and the Teacher Education and Professional Development Commons Olmanson, Justin and Liu, Xianquan Chrystal, "The hC allenge of Chinese Character Acquisition: Leveraging Multimodality in Overcoming a Centuries-Old Problem" (2017). Faculty Publications: Department of Teaching, Learning and Teacher Education. 239. http://digitalcommons.unl.edu/teachlearnfacpub/239 This Article is brought to you for free and open access by the Department of Teaching, Learning and Teacher Education at DigitalCommons@University of Nebraska - Lincoln. It has been accepted for inclusion in Faculty Publications: Department of Teaching, Learning and Teacher Education by an authorized administrator of DigitalCommons@University of Nebraska - Lincoln. Volume 4 (2017)
    [Show full text]
  • The Challenge of Chinese Character Acquisition: Leveraging Multimodality in Overcoming a Centuries-Old Problem
    The Emerging Learning Design Journal Volume 4 Issue 1 Article 1 February 2018 The Challenge of Chinese Character Acquisition: Leveraging Multimodality in Overcoming a Centuries-Old Problem Justin Olmanson University of Nebraska Lincoln Xianquan Liu University of Nebraska Lincoln Follow this and additional works at: https://digitalcommons.montclair.edu/eldj Part of the Instructional Media Design Commons Recommended Citation Olmanson, Justin and Liu, Xianquan (2018) "The Challenge of Chinese Character Acquisition: Leveraging Multimodality in Overcoming a Centuries-Old Problem," The Emerging Learning Design Journal: Vol. 4 : Iss. 1 , Article 1. Available at: https://digitalcommons.montclair.edu/eldj/vol4/iss1/1 This Article is brought to you for free and open access by Montclair State University Digital Commons. It has been accepted for inclusion in The Emerging Learning Design Journal by an authorized editor of Montclair State University Digital Commons. For more information, please contact [email protected]. Volume 4 (2017) pgs. 1-9 Emerging Learning http://eldj.montclair.edu eld.j ISSN 2474-8218 Design Journal The Challenge of Chinese Character Acquisition: Leveraging Multimodality in Overcoming a Centuries-Old Problem Justin Olmanson and Xianquan Liu University of Nebraska Lincoln 1400 R St, Lincoln, NE 68588 [email protected] May 23, 2017 ABSTRACT For learners unfamiliar with character-based or logosyllabic writing systems, the process of developing literacy in written Chinese poses significantly more obstacles than learning
    [Show full text]
  • NY IME Instructions
    Selecting Simplified Input Method To type in simplified characters, click on the arrow to the right of the selected input language at the top left corner of your screen. Then select "Chinese (Simplified)" from the drop-down list that appears, as shown below. Please note that after each time you select a language, you must use your mouse to reposition your cursor inside the response box before you can begin typing. Note: You may also toggle between your selected Chinese input method and English by using the key code CTRL+SPACEBAR. IMPORTANT: Once you have selected a Chinese input language, if you press SHIFT, your keyboard will type in English until you press SHIFT again to resume Chinese input. Text Entry and Character Selection 1. Use the keyboard to type the pronunciation of the desired Chinese character in Simplified Chinese. As you type, a long window will appear with numbered character options, as shown in the example below. If you do not see the desired character in the window, type additional letters or use the mouse to click on the black arrow at the end of the window to see another set of character options. 2. If the first character shown is the desired character, press SPACEBAR to select it. To select another character, use the mouse to click on the character in the window or type the number that corresponds to the desired character. 3. The character you selected will now appear in the response box. Press ENTER or SPACEBAR to confirm the character and move the cursor forward.
    [Show full text]
  • Chinats Language Input System in the Digital Age Affects Childrents Reading Development
    China’s language input system in the digital age affects children’s reading development Li Hai Tan1,2, Min Xu1, Chun Qi Chang, and Wai Ting Siok2 State Key Laboratory of Brain and Cognitive Sciences, Department of Linguistics, and Shenzhen Institute of Research and Innovation, University of Hong Kong, Hong Kong Edited by Dale Purves, Duke–National University of Singapore Graduate Medical School, Singapore, and approved December 5, 2012 (received for review August 7, 2012) Written Chinese as a logographic system was developed over 3,000 analysis of characters and to establish their representation in long- y ago. Historically, Chinese children have learned to read by learning term memory (21–25). to associate the visuo-graphic properties of Chinese characters with Severe reading difficulty arises when children fail to establish lexical meaning, typically through handwriting. In recent years, a cohesive reading circuit that links orthography, phonology, and however, many Chinese children have learned to use electronic meaning. Reading difficulty is defined as an unexpectedly low communication devices based on the pinyin input method, which reading ability in people who have adequate nonverbal intelligence, associates phonemes and English letters with characters. When have acquired typical schooling, and have experienced sufficient children use pinyin to key in letters, their spelling no longer depends sociocultural opportunities (1–12). Estimates of prevalence of se- on reproducing the visuo-graphic properties of characters that are vere reading difficulty in English range from 5% to 17% (3, 8, 11, indispensable to Chinese reading, and, thus, typing in pinyin may 13). Severe reading difficulty has also been found in Chinese conflict with the traditional learning processes for written Chinese.
    [Show full text]
  • Elementary Mandarin Immersion Students Learning Alphabetic Pinyin and Using Pinyin to Learn Chinese Characters a DISSERTATION S
    Elementary Mandarin Immersion Students Learning Alphabetic Pinyin and Using Pinyin to Learn Chinese Characters A DISSERTATION SUBMITTED TO THE FACULTY OF THE UNIVERSITY OF MINNESOTA BY Zhongkui Ju IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY Dr. Martha Bigelow August 2019 © Zhongkui Ju 2019 Acknowledgements First of all, I would like to thank the American Council on the Teaching of Foreign Languages (ACTFL), Language Learning, National Federation of Modern Languages Teaching Associations (NFMLTA)/National Council of Less Commonly Taught Languages (NCOLCTL), Pearson Education, and the Department of Curriculum and Instruction at the University of Minnesota for their generous support and recognition of this project. Second, I want to acknowledge those who have supported me, guided me, and helped me through every step of this process to bring me to this point. I am grateful to my advisor, Dr. Martha Bigelow, for her faith in me. She is more than an advisor but also a friend, a confidant, and a cheerleader during this PhD journey. I appreciate that she understands that earning a scholarship is not the most challenging hardship in graduate school. I am also grateful to Dr. Bob delMas for his advice, expertise, patience, and role- modeling for what it means to be a rigorous researcher. His purposeful questions, comments, and suggestions created valuable learning moments during the process of writing this dissertation. I have learned so much from him over the past six months. I am indebted to Dr. Yanling Zhou for her support, advice, sharing, and time commitment to assist me from Hong Kong, answering my questions online and attending my oral defense in Minneapolis.
    [Show full text]
  • Supporting the Chinese Language in Oracle Text
    Supporting the Chinese Language in Oracle® Text Masterarbeit im Fach Arbeits-, Lern- und Präsentationstechniken Masterstudiengang Informationswirtschaft der Fachhochschule Stuttgart – Hochschule der Medien Poh Choo Lai Erstprüfer: Prof. Dr.-Ing. Peter Lehmann Zweitprüferin: Frau Barbara Steinhanses Bearbeitungszeitraum: 20. Okt 2003 bis 31. Mär 2004 Stuttgart, März 2004 Erklärung ii Erklärung Hiermit erkläre ich, dass ich die vorliegende Masterarbeit selbständig angefertigt habe. Es wurden nur die in der Arbeit ausdrücklich benannten Quellen und Hilfsmittel benutzt. Wörtlich oder sinngemäß übernommenes Gedankengut habe ich als solches kenntlich gemacht. Ort, Datum Unterschrift Kurzfassung iii Kurzfassung Gegenstand dieser Arbeit sind die Problematik von chinesischem Information Retrieval (IR) sowie die Faktoren, die die Leistung eines chinesischen IR-System beeinflussen können. Experimente wurden im Rahmen des Bewertungsmodells von „TREC-5 Chinese Track“ und der Nutzung eines großen Korpusses von über 160.000 chinesischen Nachrichtenartikeln auf einer Oracle10g (Beta Version) Datenbank durchgeführt. Schließlich wurde die Leistung von Oracle® Text in einem so genannten „Benchmarking“ Prozess gegenüber den Ergebnissen der Teilnehmer von TREC-5 verglichen. Die Hauptergebnisse dieser Arbeit sind: (a) Die Wirksamkeit eines chinesischen IR Systems ist durch die Art und Weise der Formulierung einer Abfrage stark beeinflusst. Besonders sollte man während der Formulierung einer Anfrage die Vielzahl von Abkürzungen und die regionalen Unterschiede
    [Show full text]
  • Six-Digit Stroke-Based Chinese Input Method Lai-Man Po, Chi-Kwan Wong, Yiu-Ki Au, Ka-Ho Ng and Ka-Man Wong
    Six-Digit Stroke-based Chinese Input Method Lai-Man Po, Chi-Kwan Wong, Yiu-Ki Au, Ka-Ho Ng and Ka-Man Wong Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong Email: [email protected] Abstract—During the last three decades, more than one B. Pronunciation-based Chinese Input Methods thousand Chinese input methods have been developed. However, The two most popular pronunciation-based Chinese input people are still looking for better input methods in terms of easy to use, easy to remember, high input speed and small keypad methods in mainland China and Taiwan are Pinyin (拼音) [3] implementation on handheld devices. The well-known stroke- and Zhuyin (注音), respectively. The Pinyin method simply based Chinese input method using only five basic stroke types uses the Pinyin table as its conversion dictionary. This method could achieve low learning curve and small numeric keypad is very easy to learn for people who are already familiar with implementation but its input speed is limited for complex Chinese Putonghua (or Mandarin). For inputting the Chinese character characters with a lot of strokes. To tackle this problem, of “汉“, we can enter its Pinyin letters of “han” into the simplified stroke-based Chinese character and phrase coding methods using (3+3) rules are proposed in this paper. The computer for starting the character input. However, we will proposed method only uses the first 3 stroke codes and the last 3 get a list of 30 or more candidates to choose from, as there are stroke codes to represent the first and last radical information of a lot of Chinese characters with identical Pinyin string.
    [Show full text]
  • Multiliteracies on Instant Messaging in Negotiating Local, Translocal, and Transnational Affiliations: a Case of an Adolescent Immigrant
    Multiliteracies on Instant Messaging in Negotiating Local, Translocal, and Transnational Affiliations: A Case of an Adolescent Immigrant Wan Shun Eva Lam Northwestern University, Evanston, Illinois, USA ABSTRACT Through an in-depth case study of the instant messaging practices of an adolescent girl who had migrated to the United States from China, this qualitative investigation examines the development of multiliteracies in the context of transna- tional migration and new media of communication. Data consisted of screen recordings of the youth’s digital practices, interviews, and observations. Data analyses included qualitative coding procedures and orthographic analysis of the use of multiple dialects and languages in the youth’s instant messaging exchanges. These exchanges illustrate the process of social and semiotic design through which the youth developed simultaneous affiliations with her local Chinese immigrant community, a translocal network of Asian American youth, and transnational relationships with her peers in China. The construction of transnational networks represents the desire of the youth to develop the literate repertoire that would en- able her to thrive in multiple linguistic communities across countries and mobilize resources within these communities. This study contributes to new conceptual directions for understanding translocal forms of linguistic diversity mediated by digital technologies and an expanded view of the literate repertoire and cultural resources of migrant youth. As such, this study’s contributions are not limited to the domain of digital literacies but extend to issues of linguistic diversity and adolescent literacy development in contexts of migration. n recent years, there has been an increasing amount of (Lam, 2006; Suárez-Orozco & Qin-Hilliard, 2004).
    [Show full text]
  • KNPTC: Knowledge and Neural Machine Translation Powered Chinese Pinyin Typo Correction
    KNPTC: Knowledge and Neural Machine Translation Powered Chinese Pinyin Typo Correction Hengyi Cai1, Xingguang Ji2, Yonghao Song1, Yan Jin1, Yang Zhang3, Mairgup Mansur3, Xiaofang Zhao1 1 Institute of Computing Technology, Chinese Academy of Sciences 2 Beijing University of Posts and Telecommunications 3 Sogou, Inc. [email protected], [email protected], {songyonghao, jinyan}@ict.ac.cn, {zhangyang, maerhufu}@sogou-inc.com, [email protected] Abstract Chinese pinyin input methods are very impor- tant for Chinese language processing. Actually, (a) (b) users may make typos inevitably when they in- Figure 1: pinyin input method for a correct pinyin (a) and a put pinyin. Moreover, pinyin typo correction has mistyped pinyin (b). become an increasingly important task with the popularity of smartphones and the mobile Inter- net. How to exploit the knowledge of users typ- users want to type in the Chinese word “>R(Hello)”. First, ing behaviors and support the typo correction for they type in the corresponding pinyin “nihao” without de- acronym pinyin remains a challenging problem. limiters such as (Space) key to segment pinyin syllables. To tackle these challenges, we propose KNPTC, a Then, a Chinese pinyin input method displays a list of novel approach based on neural machine transla- Chinese candidate words which share that pinyin. Finally, tion (NMT). In contrast to previous work, KNPTC users search the target word from candidates and get the is able to integrate explicit knowledge into NMT result. In this way, typing in Chinese words is more com- for pinyin typo correction, and is able to learn plicated than typing in English words.
    [Show full text]
  • CHIME: an Efficient Error-Tolerant Chinese Pinyin Input Method
    CHIME: An Efficient Error-Tolerant Chinese Pinyin Input Method Yabin Zheng1, Chen Li2, and Maosong Sun1 1State Key Laboratory of Intelligent Technology and Systems Tsinghua National Laboratory for Information Science and Technology Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China 2Department of Computer Science, University of California, Irvine, CA 92697-3435, USA fyabin.zheng,[email protected], [email protected] Abstract Chinese Pinyin input methods are very importan- t for Chinese language processing. In many cases, Figure 1: Typical Chinese Pinyin input method for a correct users may make typing errors. For example, a user Pinyin (left) and a mistyped Pinyin (right). wants to type in “shenme” (什么, meaning “what” in English) but may type in “shenem” instead. Ex- isting Pinyin input methods fail in converting such to the corresponding Chinese words. In the running example, a Pinyin sequence with errors to the right Chinese for the mistyped Pinyin “shanghaai”, we can find a similar words. To solve this problem, we developed an Pinyin “shanghai”, and suggest Chinese words including “上 efficient error-tolerant Pinyin input method called 海”, which is indeed what the user is looking for. “CHIME” that can handle typing errors. By incor- When developing an error-tolerant Chinese input method, porating state-of-the-art techniques and language- we are facing two main challenges. First, Pinyins and Chi- specific features, the method achieves a better per- nese characters have two-way ambiguity. Many Chinese formance than state-of-the-art input methods. It can characters have multiple pronunciations. The character “和” efficiently find relevant words in milliseconds for can be pronounced as “hé”, “hè”, “huó”, “huò”, and “hú”.
    [Show full text]