
Dialect MT: A Case Study between Cantonese and Mandarin

Xiaoheng Zhang, Dept. of Chinese & Bilingual Studies, The Hong Kong Polytechnic University

Abstract

Machine Translation (MT) need not be confined to inter-language activities. In this paper, we discuss inter-dialect MT in general and Cantonese-Mandarin MT in particular. Mandarin and Cantonese are the two most important dialects of Chinese. The former is the national lingua franca and the latter is the most influential dialect in South China, Hong Kong and overseas. The difference between them is such that mutual intelligibility is impossible. This paper presents, from a computational point of view, a comparative study of Mandarin and Cantonese in the three aspects of sound systems, grammar rules and vocabulary contents, followed by a discussion of the design and implementation of a dialect MT system between them.

Introduction

Automatic Machine Translation (MT) between different languages, such as English, Chinese and Japanese, has been an attractive but extremely difficult research area. Over forty years of MT history have seen only a limited number of practical translation systems developed or commercialized, in spite of the considerable development in computer science and linguistic studies. High quality machine translation between two languages requires deep understanding of the intended meaning of the source language sentences, which in turn involves disambiguation reasoning based on intelligent searches and proper use of a great amount of relevant knowledge, including common sense (Nirenburg et al., 1992). The task is so demanding that some researchers are looking more seriously at machine-aided human translation as an alternative way to achieve automatic machine translation (Martin, 1997a, 1997b).

Translation or interpretation is not necessarily an inter-language activity. In many cases, it happens among dialects within a single language. Similarly, MT can be inter-dialect as well. In fact, automatic translation or interpretation seems much more practical and achievable here, since inter-dialect difference is much less serious than inter-language difference. Inter-dialect MT [1] also represents a promising market, especially in China. In the following sections we will discuss inter-dialect MT with special emphasis on the pair of Chinese Cantonese and Chinese Mandarin.

[1] In this paper, MT refers to both computer-based translation and interpretation.

1 Dialects and Chinese Dialects

Dialects of a language are that language's systematic variations, developed when people of a common language are separated geographically and socially. Among this group of dialects, normally one serves as the lingua franca, namely, the common language medium for communication among speakers of different dialects. Inter-dialect differences exist in pronunciation, vocabulary and syntactic rules. However, they are usually insignificant in comparison with the similarities the dialects have. It has been declared that dialects of one language are mutually intelligible (Fromkin and Rodman, 1993, p. 276).

Nevertheless, this is not true of the situation in China. There are seven major Chinese dialects: the Northern Dialect (with Mandarin as its standard version), Cantonese, Wu, Min, Hakka, Xiang and Gan (Yuan, 1989), that for the most part are mutually unintelligible, and inter-dialect translation is often found indispensable for successful communication, especially between Cantonese, the most popular and the most influential dialect in South China and overseas, and Mandarin, the lingua franca of China.

2 Linguistic Consideration of Dialect MT

Most differences among the dialects of a language are found in their sound inventory and phonological systems. Words with similar written forms are often pronounced differently in different dialects. For example, the same Chinese word "香港" (Hong Kong) is pronounced xiang1gang3 [2] in Mandarin, but hoeng1gong2 in Cantonese. There are also lexical differences, although dialects share most of their words. Different dialects may use different words to refer to the same thing. For example, the word "umbrella" is 雨伞 (yu3san3) in Mandarin, and 遮 (ze1) in Cantonese. Differences in syntactic structure are less common, but they are linguistically more complicated and computationally more challenging. For example, the positions of some adverbs may vary from dialect to dialect. To express "You go first", we have

Mandarin:
    你 先 走
    ni3 xian1 zou3           (1)
    you first go
Cantonese:
    你 行 先
    nei5 hang4 sin1          (2)
    you go first

[2] In this paper, pronunciation of Mandarin is presented in Hanyu Pinyin Scheme (LICASS, 1996), and Cantonese in Yueyu Pinyin Scheme (LSHK, 1997). Numbers are used to denote tones. Yueyu Pinyin is based on Hanyu Pinyin. That means, across the two pinyin schemes, words with different pinyin symbols are normally pronounced differently.

Comparative sentences represent another case where syntactic difference is likely to happen. For example, the English sentence "A is taller than B" is expressed as

Mandarin:
    A 比 B 高
    A bi3 B gao1             (3)
    A than B tall
Cantonese:
    A 高过 B
    A gou1 gwo3 B            (4)
    A tall more B

Sentences with double objects often follow different orders, too. In a Mandarin sentence with two objects, the one referring to person(s) must be put before the other one. Yet, many dialects allow the order to be reversed, for example:

Mandarin:
    wo3 xian1 gei3 ta1 qian2
    I first give him money
    I will give him some money first.
Cantonese:
    ngo3 bei2 cin4 keoi5 sin1
    I give money him first

Differences in word pronunciation and word forms can be represented in a bi-dialect dictionary. For example, for Cantonese-Mandarin MT, we can use entries like

    word(pron, [你, ni3], [你, nei5])       %you
    word(vi, [走, zou3], [行, hang4])       %go
    word(n, [行, hang2], [行, hang4])       %row
    word(adv, [先, xian1], [先, sin1])      %first
    word(n, [雨伞, yu3san3], [遮, ze1])     %umbrella

where the word entry flag "word" is followed by three arguments: the part of speech and the corresponding words (in Chinese characters and pinyins) in Mandarin and in Cantonese. English comments are marked with "%".

Morphologically, there are some useful rules for word formation. For example, in Mandarin, the prefixes "公" (gong1) and "雄" (xiong2) are for male animals, and "母" (mu3) and "雌" (ci2) for female animals. But in most southern China dialects, suffixes such as the Cantonese "公" (gung1) and "乸" (naa2) are often used instead. For example,

    bull/ox: Mandarin 公牛 (gong1niu2), Cantonese 牛公 (ngau4gung1),
    cow: Mandarin 母牛 (mu3niu2), Cantonese 牛乸 (ngau4naa2).

And Cantonese "阿" is used for calling, e.g.,

    Daddy: 阿爸 (Cantonese), 爸爸 (Mandarin),
    Elder brother: 阿哥 (Cantonese), 哥哥 (Mandarin).

The problem caused by syntactic difference can be tackled with linguistic rules. For example, the rules below can be used for Cantonese-Mandarin MT of the previous example sentences:

    Rule 1: NP xian1 VP <--> NP VP sin1
            NP first VP <--> NP VP first
    Rule 2: bi3 NP ADJP <--> ADJP gwo3 NP
            than                  more
    Rule 3: gei3 (%give) O_person O_thing <--> bei2 (%give) O_thing O_person

Since inter-dialect syntactic differences largely exist in word order, the key task for MT is to decide what part(s) of the source sentence should be moved, and to where. It seems unlikely for words to be moved over long distances, because dialects normally exist in spoken, short sentences.

Another problem to be considered is whether dialect MT should be direct or indirect, i.e., should there be an intermediate language/dialect? It seems that indirect MT with the lingua franca as the intermediate representation medium is promising. The advantage is twofold: (a) it is good for multi-dialect MT; (b) it is more useful and practical, as a lingua franca is a common and the most influential dialect in the family, and maybe the only one with a complete written system.

Still another problem is the forms of the source and target dialects for the MT program. Most MT systems nowadays translate between written languages, while others are trying speech-to-speech translation. For dialect MT, translation between written sentences is less worthwhile because the dialects of a language virtually share a common written system. On the other hand, speech-to-speech translation involves speech recognition and speech generation, which is a challenging research area by itself. It is worthwhile to take a middle way: translation at the level of phonetic symbols. There are at least three major reasons: (a) The largest difference among dialects exists in sound systems. (b) Phonetic symbol translation is a prerequisite for speech translation. (c) Some dialect words can only be represented in sound. In our case, pinyins have been selected to represent both input and output sentences, because in China pinyins are the most popular tools to learn dialects and to input Chinese characters to computers. Chinese pinyin schemes, for Mandarin and for ordinary dialects, are romanized, i.e., they virtually only use English letters, to the convenience of computer processing. Of course, pinyin-to-pinyin translation is more difficult than translation between written words in Chinese block characters, because the former involves linguistic analysis in all three aspects of sound systems, grammar rules and vocabulary contents instead of two.

3 The Problem of Ambiguities

Ambiguity is always the most crucial and the most challenging problem for MT. Since inter-dialect differences mostly exist in words, both in pronunciation and in characters, our discussion will concentrate on word disambiguation for Cantonese-Mandarin MT. In the Cantonese vocabulary, there are about seven thousand to eight thousand dialect words (including idioms and fixed phrases), i.e., those words with different character forms from any Mandarin words, or with meanings different from the Mandarin words of similar forms. These dialect words account for about one third of the total Cantonese vocabulary. In spoken Cantonese the frequency of use of Cantonese dialect words is close to 50 percent (Li et al., 1995, p. 236). Because of historical reasons, Hong Kong Cantonese is linguistically more distant from Mandarin than the Cantonese of other regions. One can easily spot Cantonese dialect articles in Hong Kong newspapers which are totally unintelligible to Mandarin speakers, while Mandarin articles are easily understood by Cantonese speakers. To translate a Cantonese article into Mandarin, the primary task is to deal with the Cantonese dialect words, especially those that do not have semantically equivalent counterparts in the target dialect. For example, the Mandarin ju2 (orange) has a much larger coverage than the Cantonese gwat1. In addition to the Cantonese gwat1, the Mandarin ju2 also includes the fruits Cantonese refers to as gam1 and caang2. On the other hand, the Cantonese 行 (hang4) semantically covers the Mandarin 走 (zou3: go, walk) and 行 (hang2: row).
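
The one-to-many correspondence just described can be made concrete with a small lookup over the sample dictionary entries of Section 2. The following is only a minimal Python sketch, not the Prolog representation the entries are written in: the names Entry, index_by_cantonese and mandarin_candidates are hypothetical, and only the pinyin forms of the entries are kept.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    pos: str        # part of speech: pron, vi, n, adv
    mandarin: str   # Hanyu Pinyin with tone numbers
    cantonese: str  # Yueyu Pinyin with tone numbers
    gloss: str      # English comment


# The five sample entries from Section 2 (characters omitted in this sketch).
ENTRIES = [
    Entry("pron", "ni3", "nei5", "you"),
    Entry("vi", "zou3", "hang4", "go"),
    Entry("n", "hang2", "hang4", "row"),
    Entry("adv", "xian1", "sin1", "first"),
    Entry("n", "yu3san3", "ze1", "umbrella"),
]


def index_by_cantonese(entries):
    """Group entries by their Cantonese pinyin so homophones stay together."""
    index = defaultdict(list)
    for entry in entries:
        index[entry.cantonese].append(entry)
    return index


CANTONESE_INDEX = index_by_cantonese(ENTRIES)


def mandarin_candidates(cantonese_word):
    """Return every (part of speech, Mandarin pinyin) pair for a Cantonese word."""
    return [(e.pos, e.mandarin) for e in CANTONESE_INDEX.get(cantonese_word, [])]


def is_ambiguous(cantonese_word):
    """A word needs disambiguation when it has more than one Mandarin candidate."""
    return len(mandarin_candidates(cantonese_word)) > 1


if __name__ == "__main__":
    for word in ("nei5", "hang4", "sin1"):
        print(word, mandarin_candidates(word), is_ambiguous(word))
```

In this sketch hang4 yields two candidates, a verb and a noun, so a later disambiguation step has to choose between them, while nei5 and sin1 map to a single Mandarin word each.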

Translation at the sound or pinyin level has to deal with another kind of ambiguity: the homophones of a word in the source dialect may not have their counterpart synonyms in the target dialect pronounced as homophones as well. For example, the words 香蕉 (banana) and 相交 (intersection) are both pronounced xiang1jiao1 in Mandarin, but in Cantonese they are pronounced hoeng1ziu1 and soeng1gaau1 respectively, though their written characters remain unchanged.

To tackle these ambiguities, we employ the techniques of hierarchical phrase analysis (Zhang and Lu, 1997) and word collocation processing (Sinclair, 1991), both rule-based and corpus-based. Briefly speaking, the hierarchical phrase analysis method first tries to solve a word ambiguity in the context of the smallest phrase containing the ambiguous word(s), then the next layer of embedding phrase is used if needed, and so on. As a result, the problem will be solved within the minimally sufficient context. To further facilitate the work, a large number of commonly used phrases and phrase schemes are being collected into the dictionary. Furthermore, interaction between the users and the MT system should be allowed for difficult disambiguation (Martin, 1997a).

4 System Design and Implementation

A rudimentary design of a Cantonese-Mandarin dialect MT system has been made, as shown in Figure 1. The system takes Cantonese Pinyin sentences as input and generates Mandarin sentences in Hanyu Pinyin and in Chinese characters. The translation is roughly done in three steps: syntax conversion, word disambiguation and source-target word substitution. The knowledge bases include linguistic rules, a word collocation list and a bi-dialect MT dictionary.

[Figure 1: A Design for Cantonese-Mandarin MT. The flowchart takes a Cantonese pinyin sentence as input, converts it to a Mandarin structure (MT linguistic rules), disambiguates Cantonese dialect words with respect to Mandarin words (word collocation list), substitutes Cantonese words with Mandarin words in pinyin and in characters (Cantonese-Mandarin dictionary), and outputs a Mandarin sentence. Two arrow types mark data/control flow and knowledge base access.]

A simplified example will make the basic ideas clearer. Suppose the example word entries and transformational rules in Section 2 are included in the MT system's knowledge base. Example sentence (2) in Cantonese, i.e.,

    你 行 先
    nei5 hang4 sin1          (2)
    you go first

is given as input for the system to translate into Mandarin. Because the input sentence contains the time adverb "sin1" (first), according to the grammar rules, it is syntactically different from its counterpart in Mandarin. According to the flowchart, the Cantonese pinyin sentence is converted into a Mandarin structure. Rule 1 in the knowledge base is applied, producing

    nei5 sin1 hang4
    you first go

Then the dictionary is accessed. The Cantonese word 行 (hang4) corresponds to two Mandarin words, i.e., 走 (vi. go, walk) and 行 (n. row). According to Rule 1, which requires a VP after the adverb, the Mandarin verb 走 (zou3) is selected. The individual Cantonese words in the sentence are then substituted with their Mandarin counterparts, and a target Mandarin sentence

    ni3 xian1 zou3
    you first go

like sentence (1) is correctly produced.

Similarly, with transformational rules 1-3, a more complicated Cantonese sentence like

    gou1gwo3 ngo3 ge3 yan4 bei2 cin4 keoi5 sin1
    tall more me PART person give money him first

can be correctly translated into Mandarin:

    bi3 wo3 gao1 de ren2 xian1 gei3 ta1 qian2
    than me tall PART persons first give him money
    Those who are taller than me will give him some money first.
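
The three steps of the design (syntax conversion, word disambiguation, word substitution) can be sketched for example sentence (2) as follows. This is an illustrative Python toy under strong assumptions, not the system's actual implementation: the function names are hypothetical, the dictionary holds only three pinyin entries taken from Section 2, the NP is assumed to be a single word, and disambiguation is reduced to preferring a verb reading in the VP slot that Rule 1 places after the adverb.

```python
# Toy bi-dialect dictionary: Cantonese pinyin -> list of (part of speech, Mandarin pinyin).
DICTIONARY = {
    "nei5": [("pron", "ni3")],
    "hang4": [("vi", "zou3"), ("n", "hang2")],
    "sin1": [("adv", "xian1")],
}


def convert_syntax(words):
    """Step 1, Rule 1 read from Cantonese to Mandarin: NP VP sin1 -> NP sin1 VP.

    Assumption: the NP is the first word only, so everything between it and
    the sentence-final adverb 'sin1' is treated as the VP.
    """
    if len(words) >= 3 and words[-1] == "sin1":
        return [words[0], "sin1"] + words[1:-1]
    return list(words)


def disambiguate(words, position):
    """Steps 2 and 3: choose the Mandarin word for the Cantonese word at `position`.

    Assumption: a word directly after the adverb 'sin1' fills the VP slot of
    Rule 1, so a verb reading is preferred there. In all other positions the
    first listed reading is used. Unknown words are passed through unchanged.
    """
    word = words[position]
    candidates = DICTIONARY.get(word)
    if not candidates:
        return word
    if position > 0 and words[position - 1] == "sin1":
        for pos, mandarin in candidates:
            if pos == "vi":
                return mandarin
    return candidates[0][1]


def translate(sentence):
    """Translate a space-separated Cantonese pinyin sentence into Hanyu Pinyin."""
    words = convert_syntax(sentence.split())
    return " ".join(disambiguate(words, i) for i in range(len(words)))


if __name__ == "__main__":
    print(translate("nei5 hang4 sin1"))
```

Keeping reordering separate from dictionary lookup mirrors the division between the rule base and the bi-dialect dictionary. A fuller treatment would replace the single-word NP assumption with phrase analysis and the verb preference with collocation checks.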

We are in the process of implementing an inter-dialect MT prototype, called CPC, for translation between Cantonese and Putonghua (i.e., Mandarin), both Cantonese-to-Putonghua and Putonghua-to-Cantonese. Input and output sentences are in pinyins or Chinese characters. The programming languages used are Prolog and Java. We are doing Cantonese-to-Putonghua first, based on the design. At its current state, we have built a Cantonese-Mandarin bi-dialect dictionary of about 3000 words and phrases, based on some well established books (e.g., Zeng, 1984; Mai and Tang, 1997), and a handful of rules. When completed, the dictionary will have around 10,000 word entries. A Cantonese-Mandarin dialect corpus is also being built. The program can process sentences of a number of typical patterns. The funded project has two immediate purposes: to facilitate language communication and to help Hong Kong students write standard Chinese.

Conclusion

Compared with inter-language MT, inter-dialect MT is much more manageable, both linguistically and technically. Though generally ignored, the development of inter-dialect MT systems is both rewarding and more feasible. The present paper discusses the design and implementation of dialect MT systems at pinyin and character levels, with special attention to Chinese Mandarin and Cantonese. When supported by the modern technology for multimedia communication of the Internet and the WWW, dialect MT systems will produce even greater benefits (Zhang and Lau, 1996).

Nonetheless, the research reported in this paper can only be regarded as an initial exploratory step into a new and exciting research area. There is much room for further research and discussion, especially in word disambiguation and syntax analysis. And we should also notice that the [...] of ordinary dialects are normally less well described than those of lingua francas.

Acknowledgements

The research is funded by Hong Kong Polytechnic University, under the project account number of 0353 131 A3 720.

References

Fromkin V. and Rodman R. (1993) An Introduction to Language (5th edition). Harcourt Brace Jovanovich College Publishers, Orlando, Florida, USA, p. 276.
Li X., Huang J., Shi Q., Mai Y. and Chen D. (1995) Guangzhou Fangyan Yanjiu (Research in Cantonese Dialect). Guangdong People's Press, Guangzhou, China, p. 236.
LICASS (Language Institute, the Chinese Academy of Social Sciences) (1996) Xiandai Hanyu Cidian (Contemporary Chinese Dictionary). Commercial Press, Beijing, China.
LSHK (1997) Yueyu Pinyin Zibiao (The Chinese Character List with Cantonese Pinyin). Linguistic Society of Hong Kong, Hong Kong.
Mai Y. and Tang B. (1997) Shiyong Guangzhouhua Fenlei Cidian (A Practical Semantically-Classified Dictionary of Cantonese). Guangdong People's Press, Guangzhou, China.
Martin K. (1997a) The proper place of men and machines in language translation. Machine Translation, 1-2/12, pp. 3-23.
Martin K. (1997b) It's still the proper place. Machine Translation, 1-2/12, pp. 35-38.
Nirenburg S., Carbonell J., Tomita M. and Goodman K. (1992) Machine Translation: A Knowledge-Based Approach. Morgan Kaufmann Publishers, San Mateo, California, USA.
Sinclair J. (1991) Corpus, Concordance and Collocation. Collins, London, UK.
Yuan J. (1989) Hanyu Fangyan Gaiyao (Introduction to Chinese Dialects). Wenzi Gaige Press, Beijing, China.
Zeng Z. F. (1984) Guangzhouhua-Putonghua Kouyuci Duiyi Shouce (A Translation Manual of Cantonese-Mandarin Spoken Words and Phrases). Joint Publishing, Hong Kong.
Zhang X. and Lau C. F. (1996) Chinese inter-dialect machine translation on the Web. In "Collaboration via the Virtual Orient Express: Proceedings of the Asia-Pacific World Wide Web Conference", S. Mak, F. Castro & J. Bacon-Shone, ed., Hong Kong University, pp. 419-429.
Zhang X. and Lu F. (1997) Intelligent Chinese pinyin-character conversion based on phrase analysis and dynamic semantic collocation. In "Language Engineering", L. Chen and Q. Yuan, ed., Tsinghua University Press, Beijing, China, pp. 389-395.
