STCP: Simplified-Traditional Chinese Conversion and Proofreading

STCP: Simpliﬁed-Traditional Chinese Conversion and Proofreading

Jiarui Xu1 and Xuezhe Ma1 and Chen-Tse Tsai2 and Eduard Hovy1 1 Language Technologies Institute, Carnegie Mellon University 2 Department of Computer Science, University of Illinois at Urbana-Champaign jiaruix, xuezhem @cs.cmu.edu, [email protected], [email protected] { }

Abstract 2 Levels of Conversion Halpern and Kerman(1999) discussed the pitfalls This paper aims to provide an effec- and complexities of Chinese-to-Chinese conver- tive tool for conversion between Sim- sion and introduced four conversion levels: code plified Chinese and Traditional Chinese. level, orthographic level, lexemic level, and con- We present STCP, a customizable system textual level, respectively. In this paper, we com- comprising statistical conversion model, pact them into two levels of conversion: character and proofreading web interface. Experi- level and cord level. ments show that our system achieves comparable character-level conversion per- 2.1 Character level formance with the state-of-art systems. In addition, our proofreading interface There exists a mapping between Simplified Chi- can effectively support diagnostics and nese characters and Traditional Chinese char- data annotation. STCP is available at acters. Most characters only have a single http://lagos.lti.cs.cmu.edu:8002/ corresponding character, while some characters may have multiple corresponding characters. In 1 Introduction Simplified-to-Traditional conversion, characters with one-to-many mappings constitute about 12% There are two standard character sets of the con- of commonly used Chinese characters (Halpern temporary Chinese written language: Simplified and Kerman, 1999). Such phenomenon exists in Chinese and Traditional Chinese. Simplified Chi- Traditional-to-Simplified as well but to a much nese is officially used in mainland China and Sin- lesser extent. Character-level conversion of a gapore, while Traditional Chinese is used in Tai- given sentence involves both replacing characters wan, Hong Kong, and Macau. The conversion has that have one-to-one mapping with correspond- become an essential problem with the increasing ing characters and disambiguating characters that communication and collaboration among Chinese- have one-to-many mappings. speaking regions. Although several conversion systems have been 2.2 Word level made available to the public, the conversion prob- A concept may have different string surfaces due lem, however, remains unsolved. In this pa- to the differences in word usage among various per, we present an open-source system that pro- Chinese-speaking areas. For example, is re- vides a statistical model for conversion, as well ferred to as ”football” in British English but ”soc- as a web interface for proofreading. Our system cer ball” in American English. Such phenomenon achieves comparable performance with state-of- is quite typical in Chinese-speaking areas. For ex- art systems. To the best of our knowledge, it is ample, Sydney is 悉尼 in mainland China but 雪 the first open-source statistical conversion system. 梨 in Taiwan. Word-level conversion of a sentence Another contribution of our system is the proof- involves determining if a word should be replaced reading web interface. It is important for users to with a corresponding word in look-up table. Dis- proofread the converted result and to make edits ambiguation is also necessary if there are multiple based on the linguistic information. corresponding words.

61 The Companion Volume of the IJCNLP 2017 Proceedings: System Demonstrations, pages 61–64, Taipei, Taiwan, November 27 – December 1, 2017. c 2017 AFNLP

Figure 1: Screenshot of proofreading interface

We use the following sentence in Simpliﬁed ﬁnally get the target sentence in Traditional Chi- Chinese to elaborate the conversion process: nese: 我瞭解雲端軟體

我了解云端软件 3 System Architecture I know cloud software 3.1 Model Based on look-up tables of the character mappings, we list characters with one-to-one map- The Simplified-Traditional conversion problem is pings in the above example sentence in Table1 formulated as a translation problem (Brown et al., and those with one-to-many mappings in Table2 1990): Word mapping is shown in Table3. Given a sentence s from source language (e.g. Simplified Chinese), return a sentence t in tar- SC 我解端软件 get language (e.g. Traditional Chinese) that maxi- TC 我解端軟件 mizes the conditional probability: P (t)P (s t) Table 1: one-to-one character mapping P (t s) = | P (t)P (s t) | P (s) ∝ | Here we let P (s t) be the same for any candidate SC TC English | sentence t. Therefore, P (t s) P (t) and the goal 了 (auxiliary) | ∝ 了 is to find: 瞭 know t = argmaxP (t) 云 ∗ 云 say t 雲 cloud We describe how to generate candidate sentences through word and character conversion in Table 2: one-to-many character mapping section 3.1.1 and 3.1.2. The language model we used is briefly introduced in section 3.1.3. 软件 SC 3.1.1 Word Conversion TC 軟體 We tokenize the source sentence s into word se- Table 3: Word mapping in example sentence. quence w1, w2, ..., wn. In our system, we use Jieba1 Chinese text segmentation. For each word We decide to replace ‘软件’ with ‘軟體’ and 1https://github.com/fxsjy/jieba

62 wi, if there exists a mapping of wi in mapping ta- 4.1 Data 0 ble, we convert wi into word wi in target language. We use the Chinese Gigaword Fifth Edition 3.1.2 Character Conversion (Parker et al., 2011) produce by the Linguistic Data Consortium (LDC). We select documents of After word conversion, the characters in words type ‘story’ from Central News Agency (CMA), that have not been converted have one-to-one or Taiwan after 2004, which are written in Tradi- one-to-many mapping. We generate candidate set tional Chinese. In order to evaluate character con- T that contains all possible sentences by combin- version, we need to assume that there is no differ- ing every possible conversions of each character. ence in word usage. Since conversion from Tradi- 3.1.3 Language Model tional Chinese to Simplified Chinese is not prob- By default, the system uses a character-level lan- lematic on character level(Halpern and Kerman, guage model with order of 5, estimated by KenLM 1999), we convert the CMA corpus into Simpli- (Heafield, 2011; Heafield et al., 2013). We choose fied Chinese and use it as source language text set. KenLM because of its advantage in time and stor- The original CMA corpus becomes the target lan- age efficiency. User can substitute it with other guage text set. We split the entire data set into 80% trained language model. training and 20% testing data randomly.

3.2 Proofreading Interface 4.2 Evaluation We provide a web-based proofreading interface MOE-CIPSC evaluation provides a list of char- 3 that allows users to correct the converted text. Au- acters that have one-to-many mapping . Over- tomatic conversion between Simpliﬁed Chinese all accuracy is deﬁned as: (#correctly converted and Traditional Chinese can never achieve 100% ambiguous characters) / (#ambiguous characters). accuracy and we believe that, in many scenarios, We also use Macro-average accuracy to evaluate such as government, commercial and legal doc- performance across different characters. ument conversion, it is important to convert all 4.3 Results and Analysis characters and words as accurately as possible. Characters and words that have alternatives will be Accuracies on character conversion are reported highlighted. When user selects these ambiguous in Table4 and Table5. Note that XMUCC is a fragments, explanation and example will be dis- pre-trained system and OpenCC is a rule-based played and user can easily choose an alternative to system. STCP outperforms OpenCC in terms of replace the automatic results. Example proofread- both accuracies and achieved comparable accu- ing of a paragraph and its highlights are shown in racy with XMUCC. Comparisons of these system Figure1. are in section 5.

4 Experimentation OpenCC XMUCC STCP Overall Accuracy 98.90 99.81 99.64 Ministry of Education of the P.R.C. and Chinese Information Processing Society of China held a Table 4: Overall accuracies competition on the Evaluation of Intelligent Con- version System of Simpliﬁed Chinese and Tradi- OpenCC XMUCC STCP tional Chinese 2 (MOE-CIPSC) in 2013. There are two core tasks: Character Conversion and Ter- Macro-avg Acc. 91.75 96.98 95.73 minology Conversion. Few high-quality parallel Table 5: Macro-average accuracies corpus is available (Chang and Kung, 2007) and it is expensive to build one. Most websites that claim to have both Simpliﬁed Chinese and Tradi- 5 Related Work tional Chinese versions are using automatic systems without proofreading, thus are prone to er- There are several statistical approaches that have rors. Our evaluation strategy adopts the task one been proposed. Chen et al.(2011) integrates statis- of MOE evaluation. tical features, including language models and lex- 2http://www.moe.edu.cn/s78/A19/A19_ 3http://bj.bcebos.com/cips-upload/dzb. gggs/s8478/201302/t20130225_181150.html txt

63 ical semantic consistencies, into log-linear mod- John D. Lafferty, Robert L. Mercer, and Paul S. els. Li et al.(2010) uses look-up tables retrieved Roossin. 1990. A statistical approach to ma- from Wikipedia to perform word substitution and chine translation. Comput. Linguist. 16(2):79–85. http://dl.acm.org/citation.cfm?id=92858.92860. disambiguate characters through language model. We adopt this method to build our conversion Jing-Shin Chang and Chun-Kai Kung. 2007. A model. We use different look-up tables and we use chinese-to-chinese statistical machine translation model for mining synonymous simplified-traditional higher order language model while they only use chinese terms. Proceedings of Machine Translation bigram and unigram. Summit XI . The four most popular and publicly available systems are Google Translate, Microsoft Trans- Yidong Chen, Xiaodong Shi, and Changle Zhou. 2011. A simplified-traditional chinese char- lator, Open Chinese Convert (OpenCC), and a acter conversion model based on log-linear system co-developed by Xiamen University, Min- models. In 2011 International Conference istry of Education of The People’s Republic of on Asian Language Processing. pages 3–6. China, and Beijing Normal University (XMUCC). https://doi.org/10.1109/IALP.2011.15. 4 OpenCC is an open-source project that performs Jack Halpern and Jouni Kerman. 1999. Pitfalls and conversion based on lookup tables constructed complexities of chinese to chinese conversion. In manually. XMUCC 5 integrates language models International Unicode Conference (14th) in Boston. and lexical semantic consistencies into log-linear Kenneth Heafield. 2011. Kenlm: Faster and smaller models (Chen et al., 2011). However, XMUCC language model queries. In Proceedings of the Sixth can be accessed through web interface and but it Workshop on Statistical Machine Translation. Asso- can only be executed in Windows command line ciation for Computational Linguistics, pages 187– 197. as standalone program. In end-use applications, especially when high Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H quality conversion is required, human proofread- Clark, and Philipp Koehn. 2013. Scalable modified kneser-ney language model estimation. In Proceed- ing is required. Compared to (Zhang, 2011, 2014), ings of the 51st Annual Meeting of the Association our conversion is based on language model, in- for Computational Linguistics. pages 690–696. stead of simply choosing the most frequent target characters. In addition, our proofreading inter- Min-Hsiang Li, Shih-Hung Wu, Yi-Ching Zeng, Ping- che Yang, and Tsun Ku. 2010. Chinese characters face highlights not only ambiguous characters, but conversion system based on lookup table and lan- also words. Users can also customize the system guage model. Computational Linguistics and Chi- by importing look-up tables and language model, nese Language Processing 15(1):19–36. which can be useful for particular domains, such Robert Parker, David Graff, Junbo Kong, Ke Chen, and as science, business, and law. Kazuaki Maeda. 2011. English gigaword fifth edition LDC2011T07. DVD. Philadelphia: Linguistic 6 Conclusion and Future Work Data Consortium . We develop an open-source customizable Chinese Xiaoheng Zhang. 2011. A simplified-traditional chi- conversion system that is based on look-up tables nese conversion tool with a supporting environment and language model with a proofreading interface for human proofreading. The 11th Chinese National Conference on Computational Linguistics, CNCCL . that assists end-use application. For future work, we will experiment with different language mod- Xiaoheng Zhang. 2014. A Comparative Study eling approaches, such as neural language model. on Simplified-Traditional Chinese Translation, Springer International Publishing, Cham, pages We will use the proofreading interface to construct 212–222. https://doi.org/10.1007/978-3-319- parallel corpus of high quality to evaluate word- 12277-9 19. level conversion.

Acknowledgments References Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, 4https://github.com/BYVoid/OpenCC 5http://jf.cloudtranslation.cc/