Vietnamese to Chinese Machine Translation Via Chinese Character As Pivot∗
PACLIC-27

Hai Zhao(1,2)†, Tianjiao Yin(3), Jingyi Zhang(1,2)
(1) MOE-Microsoft Key Laboratory of Intelligent Computing and Intelligent System
(2) Department of Computer Science and Engineering, Shanghai Jiao Tong University, #800 Dongchuan Road, Shanghai, China, 200240
(3) Facebook, Inc., 1601 Willow Rd., Menlo Park, CA 94025, USA

∗ This work was partially supported by the National Natural Science Foundation of China (Grant No.60903119, Grant No.61170114, and Grant No.61272248), and the National Basic Research Program of China (Grant No.2009CB320901 and Grant No.2013CB329401).
† Corresponding author.

Copyright 2013 by Hai Zhao, Tianjiao Yin, and Jingyi Zhang. 27th Pacific Asia Conference on Language, Information, and Computation, pages 250-259.

Abstract

Using Chinese characters as an intermediate equivalent unit, we decompose machine translation into two stages, semantic translation and grammar translation. This strategy is tentatively applied to machine translation between Vietnamese and Chinese. During the semantic translation, Vietnamese syllables are converted one by one into the corresponding Chinese characters. During the grammar translation, the sequences of Chinese characters in Vietnamese grammar order are modified and rearranged to form grammatical Chinese sentences. Compared to the existing single alignment model, the division into two-stage processing is more targeted for research and evaluation of machine translation. The proposed method is evaluated using the standard BLEU score and a new manual evaluation metric, understanding rate. Based only on a small number of dictionaries, the proposed method gives competitive and even better results compared to existing systems.

1 Introduction

Statistical machine translation (SMT) has been well developed on the basis of a data-driven idea since the work of Brown et al. (1993). However, a large amount of parallel corpora is always necessary to build a standard SMT system for a specific language pair, regardless of the possible useful linkages between the two languages. Existing work has considered using helpful linguistic heuristics to enhance current SMT (Chu et al., 2012), though such approaches still follow the standard processing pipeline of SMT. For resource-poor languages, a pivot language is used as an expedient (Utiyama and Isahara, 2007; Wu and Wang, 2009).

In this work, we focus on machine translation (MT) for language pairs with few parallel corpora but rich linguistic connections. A case study on Vietnamese and Chinese will be done. To exploit the shared linguistic characteristics between the language pair, the common written form, the Chinese character, is adopted as a translation bridge. Being the oldest continuously used writing system in the world, Chinese characters are logograms that are still used to write Chinese (汉字/漢字 in Chinese, hànzì in Chinese pinyin) and Japanese (kanji). Such characters are now used less frequently in Korean (hanja), and were also formerly used in Vietnamese (chữ Hán). All the countries that were historically under the influence of Chinese language and culture are unofficially referred to as the Chinese character cultural sphere or Sinosphere. These two terms are often used interchangeably but have different denotations (Matisoff, 1990). An example of Chinese character writing from different regions is given in Figure 1.

Figure 1: The phrase "Chinese character culture sphere" written in Chinese characters from different regions.

There are tens of thousands of Chinese characters, though most of them are minor graphic variants found only in historical texts, as shown in Figure 2. Mastering modern Chinese usually requires knowing 2,000-4,000 characters. Though most words in modern Chinese consist of two or more characters, each Chinese character may correspond to a spoken syllable with a distinct meaning. Being meaning-oriented representation units, Chinese characters are naturally suitable to act as a bridge of semantic representation for the translation task. This process is especially promising when working on a language like Vietnamese.

Figure 2: Different scripts for Chinese characters.

Vietnamese (tiếng Việt) is spoken by about eighty million people. Much of the Vietnamese vocabulary has been borrowed from Chinese, and the language was formerly written in a modified Chinese writing system, Chữ Nôm, with vernacular pronunciation. The Vietnamese alphabet (Quốc Ngữ) in use today is a Latin alphabet with additional diacritics for tones and certain letters.

In this paper, a novel two-stage approach is proposed for Vietnamese to Chinese MT by adopting Chinese characters as the pivot. Vietnamese syllables are first converted into Chinese characters according to meaning equivalence. Then the Chinese character sequences in Vietnamese grammar order are modified and reordered into grammatical Chinese. The proposed approach requires only a small number of linguistic resources, such as bilingual dictionaries and a monolingual language model, to work efficiently.

2 Related Work

Only recently have researchers begun to be involved in the domain of Vietnamese language processing. Most work on Vietnamese language processing still has to focus on very basic issues such as corpus building, primary processing tasks, etc.

A few studies have been done on Vietnamese-related MT, though nearly all MT studies on Vietnamese focus on English as the source or target language. As Vietnamese is an under-resourced language, most Vietnamese MT systems adopted rule-based methods (Le et al., 2006; Le and Phan, 2009; Le and Phan, 2010). Pham et al. (2009) used word-by-word translation incorporated with predefined templates to perform English-Vietnamese translation on weather bulletin texts. A similar strategy was also used by Hoang et al. (2012) for Vietnamese to Katu language translation in the same domain.

Only very recently has the statistical approach been applied to Vietnamese-related MT tasks. Nguyen and Shimazu (2006) and Nguyen et al. (2008) used self-defined morphological and syntactic transformations to solve the reordering problem beforehand for Vietnamese-English translation. Thi and Dinh (2008) introduced a word reordering approach that makes use of syntactic rules extracted from parse trees for English-Vietnamese MT. Bui et al. (2010) proposed language-dependent features to enhance Vietnamese-English SMT. Nguyen et al. (2012) integrated more knowledge about the topic of the text, part-of-speech and morphology to resolve semantic ambiguity of words during translation. Based on empirical observation, Nguyen and Dinh (2012) proposed a group of heuristic patterns to discover alignment errors. Bui et al. (2012) proposed a group of rules to split long Vietnamese sentences based on linguistic information to enhance Vietnamese to English MT.

To the best of our knowledge, few studies have been done on MT between Vietnamese and Chinese. For such a low-resource language pair, rule-based MT systems are too hard to build, and statistical MT systems require a large parallel corpus that is also difficult to acquire. Though Chinese characters have been considered a useful intermediate form for MT, few studies have made full use of them. Instead, most existing approaches focus on the role of the Chinese word during translation (Chang et al., 2008; Xu et al., 2008; Dyer et al., 2008; Ma and Way, 2009; Paul et al., 2010; Nguyen et al., 2010). Chu et al. (2012) exploited shared Chinese characters between Chinese and Japanese to improve translation performance.
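As a rough illustration of the two-stage idea proposed in the Introduction (characters as a pivot, then reordering), a minimal sketch might look as follows. The dictionary entries, the bigram scores standing in for a monolingual language model, and the exhaustive permutation search are all assumptions made for this example; they are not the resources or the search procedure used in this paper.

```python
import unicodedata
from itertools import permutations


def nfc(text):
    """Normalize to NFC so that diacritics compare consistently."""
    return unicodedata.normalize("NFC", text)


# Hypothetical toy syllable-to-character dictionary (illustrative only).
SYLLABLE_TO_CHAR = {
    nfc(k): v for k, v in {"tiếng": "语", "Việt": "越", "Nam": "南"}.items()
}

# Hypothetical bigram scores standing in for a Chinese language model.
BIGRAM_SCORE = {("越", "南"): 2.0, ("南", "语"): 1.0}


def semantic_translate(sentence):
    """Stage 1: map each Vietnamese syllable to a Chinese character."""
    return [SYLLABLE_TO_CHAR.get(s, s) for s in nfc(sentence).split()]


def lm_score(chars):
    """Sum the toy bigram scores over adjacent character pairs."""
    return sum(BIGRAM_SCORE.get(pair, 0.0) for pair in zip(chars, chars[1:]))


def grammar_translate(chars):
    """Stage 2: pick the ordering that the toy language model prefers."""
    return "".join(max(permutations(chars), key=lm_score))


def translate(sentence):
    return grammar_translate(semantic_translate(sentence))


if __name__ == "__main__":
    print(translate("tiếng Việt Nam"))  # 越南语
```

Exhaustive permutation is only feasible for very short inputs; it is used here to keep the sketch small, and a practical system would need a constrained search.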
The most recent work (Xi et al., 2012) proposed using the Chinese character as the alignment unit. However, both of the above works differ from ours, in which Chinese characters are used as a pivot for the translation task for the first time.

3 Chinese Elements inside Vietnamese

3.1 The Same and The Difference

Most linguists agree that Chinese and Vietnamese belong to two quite different language families. All varieties of modern Chinese are usually categorized as part of the Sino-Tibetan language family. However, opinions are divided on the language family to which Vietnamese belongs, though the most widely accepted view is that it is part of the Mon–Khmer branch of the Austroasiatic language family, based on the observation that Vietnamese and Khmer share a lot of cognates and basic grammar (Benedict, 1944; Nguyen, 2008).

Table 1: Chinese vs. Vietnamese: writing systems

                    Vietnamese                 Chinese
Character script    Chữ Nôm                    Chinese characters (official now)
Romanized script    Quốc Ngữ (official now)    pinyin

A writing system comparison between Chinese and Vietnamese is shown in Table 1. An obvious distinction between Vietnamese and Chinese writing is on the role of the Romanized scripts.

rately used as a meaningful unit. Like Chinese, most Vietnamese words are bisyllabic. Chinese is written without blanks between words, and Vietnamese is written with blanks between syllables instead of words. Thus word segmentation becomes a primary processing step for both languages.

Vietnamese is a pro-drop (pronoun-dropping) language, which means that certain classes of pronouns in Vietnamese may be omitted when they are in some sense pragmatically inferable. Chinese also exhibits frequent pro-drop features.

Both Chinese and Vietnamese allow verb serialization. Contrary to subordination in English, where one clause is embedded into another, the serial verb construction is a syntactic phenomenon in which two verbs are put together in a sequence and neither verb is subordinated to the other.

Unlike Chinese in word order, Vietnamese is head-initial, i.e., it displays modified-modifier ordering, although numbers and noun classifiers come before the modified noun. Thus, for example, "the Vietnamese language" in Vietnamese grammar order is not "Vietnamese language" (Việt Nam tiếng) but "language Vietnamese" (tiếng Việt Nam).

3.2 Sino-Vietnamese

As a result of close ties with China for more than 2,000 years, quite a few of the Vietnamese lex-