A Study on Divergence in Malayalam and Tamil Language in Machine Translation Perceptive
Jisha P Jayan, Elizabeth Sherly
Virtual Resource Centre for Language Computing, IIITM-K, Trivandrum

D S Sharma, R Sangal and E Sherly. Proc. of the 12th Intl. Conference on Natural Language Processing, pages 189–196, Trivandrum, India. December 2015. © 2015 NLP Association of India (NLPAI)

Abstract

Machine Translation has made significant achievements over the past decades. However, in many languages, the complexity arising from rich inflection and agglutination poses many challenges, which has forced manual translation to make corpora available. Divergence at the lexical, syntactic and semantic levels in any pair of languages makes machine translation more difficult. Many systems still depend heavily on rules, which deteriorates system performance. In this paper, a study of divergence in the Malayalam-Tamil language pair is attempted at the source language analysis stage to make the translation process easier. In the Malayalam-Tamil pair, divergence is reported mostly at the lexical and structural levels, and it has been resolved by using a bilingual dictionary and a transfer grammar. The accuracy is increased to 65 percent, which is promising.

Keywords: Translational divergence; semantic; syntactic; lexical

1 Introduction

The problem of divergence in machine translation is a complex topic. Divergence can be defined as the differences that occur between languages with respect to grammar. It mainly occurs during translation from a source language to a target language. For any MT system this topic is crucial, since obtaining an accurate translation requires resolving the nature of translational divergence. This divergence can be seen at different levels. Depending on the complexity that occurs in the specific translation, divergence affects the translation quality. Some translational divergences are universal in the sense that they occur across languages, while certain others are specific to the language pair (Lavanya et al., 2005; Saboor and Khan, 2010). Hence, divergence in translation needs to be studied from both perspectives, that is, across languages and for specific language pairs (Sinha, 2005). The most problematic area in translation is the lexicon and the role it plays in creating deviations in sense and reference based on the context of its occurrence in texts (Dash, 2013).

Indian languages come under Indo-Aryan or Dravidian scripts. Though there are similarities in scripts, there are many issues and challenges in translation between languages, such as lexical divergences, ambiguities, lexical mismatches, reordering, syntactic and semantic issues, and structural changes. Human translators try to choose the correct wording by using knowledge from various sources, and factors such as phonology, orthography and morphology, as well as the knowledge of the person and cultural differences, influence the translation. Therefore, it is hard for one person's translation to be the same as another translator's. MT is a complex and challenging research area because language translation itself is very difficult. While humans process language understanding and translation on many levels, a machine processes data in its linguistic form and structure, so it is difficult for it to get the sense. This requires more cognitive and intelligent systems in NLP, rather than considering MT development only from a linguistic point of view. Many works have been carried out at the linguistic and lexical levels, but MT across languages remains a challenging task for several reasons, such as differences in the structure of the source and target languages, ambiguity, multiword units like idioms and phrases, tense generation and many more. In this paper, we consider two Dravidian languages, Malayalam and Tamil, and discuss various semantic and syntactic challenges and issues in both languages.

Malayalam and Tamil belong to the Dravidian language family. They are closely related to each other in grammar and have a rich literary tradition. However, Malayalam is highly influenced by Sanskrit at the lexical, grammatical and phonemic levels, whereas Tamil is not. Noun morphology is the same in both languages, as a word may contain the root alone or the root with suffixes attached to it. Agglutination is widely seen in Tamil and Malayalam. In both languages, case markers are attached to nouns and pronouns. Postpositions are also attached to these. Morphology includes inflection, sandhi, and derivation. Tamil verbs inflect for person, number and gender, whereas Malayalam verbs do not. Hence the gender marking of the noun is not a relevant feature when Malayalam is considered.

Language divergence in most cases results in ambiguities in translation. The divergence issue across languages is associated with many factors, ranging from linguistic, cultural and societal to psychological aspects of the languages. Syntactic and lexico-semantic divergence are the two broad categories of divergence proposed by Dorr. Sentence-level ambiguities are referred to as syntactic, while those at the word level are semantic. A hybrid approach to developing the Malayalam to Tamil MT system, comprising paradigm, rule and machine learning methods, is proposed. The system deals with the analysis, transfer, and generation processes. The issues raised in the various stages of development of the machine translation system are discussed here. The next section deals with related work carried out in this area. In Section 3, various translational divergences are considered with respect to Malayalam and Tamil. Some other types of divergence found during translation are discussed in Section 4. Section 5 gives the methods for handling divergence, and finally the paper is concluded in Section 6.

2 State of Art

Dorr (1994) gives a systematic solution to the problem of divergence, derived from the formalization of two types of information: the linguistic grounds on which lexical and semantic divergences are based, and the technique to solve these problems. The paper explains seven main types of divergence, with examples from English, Spanish and German. It focuses on thematic, promotional, demotional, structural, conflational, categorial and lexical divergences.

Barnett et al. (1994) divide distinctions between source and target languages into two categories, namely translation divergences and translation mismatches. In translational divergence, the information conveyed in the source and target languages remains the same while the structure of the sentence differs (1990). The definition of lexical-semantic representation and translation mappings is described. The paper discusses the justification for distinguishing promotional and demotional divergence, the limits imposed on the range of repositioning possibilities, the notion of full coverage in the context of lexical selection, and the resolution of interacting divergence types. The paper concludes with a brief description of UNITRAN, a system for translation across a variety of languages, which accommodates the divergence types.

Nizar and Dorr (2002) proposed a novel approach to handle divergence in translation in a Generation-Heavy Hybrid Machine Translation (GHMT) system. Deep symmetric knowledge of the source and target languages is required for this approach. Various examples illustrate the interaction between statistical and symbolic knowledge in the GHMT system.

Dorr (1990) presented a mechanism for mapping an underlying lexical-conceptual structure to a syntactic structure used by UNITRAN. The paper also explains ways to solve the problem of thematic divergence in machine translation. The solution is implemented in the bidirectional system for English, Spanish, and German. Two types of thematic divergence are explained, namely the reordering of arguments for a given predicate and the reordering of predicates with respect to arguments or modifiers. Three mechanisms are presented to solve thematic divergences with a set of general linking routines.

Zhiwei (2006) describes different types of translation divergence in machine translation. Even though translation divergence occurs at all phases of MT, the author concentrates on translation divergence in the transfer phase. The translational divergences found in lexical selection in the target language, in tense, in thematic relation, in head-switch, in structure, in category and in conflation are described. Syntactic, semantic and contextual ambiguities are related to co-occurrence based approaches for the selection of translation equivalents. The author also suggests the use of a feature vector to represent the co-occurrence cluster. The paper proposes some suggestions for MT systems.

Akeel and Mishra (2013) discussed the language divergences and the ambiguities present in English to Arabic machine translation and the

gence. Syntactic divergence includes constituent-order divergence, adjunction divergence, null-subject divergence and pleonastic divergence. They also focus on the common divergences that occur in English and Marathi machine translation. These include divergence found in replicative words, morphological gaps, determiner systems, honorific