Assamese Wordnet Based Quality Enhancement of Bilingual Machine Translation System
Total Page:16
File Type:pdf, Size:1020Kb
Assamese WordNet based Quality Enhancement of Bilingual Machine Translation System Anup Kumar Barman Jumi Sarmah Prof. Shikhar Kumar Sarma Dept. of IT Dept. of IT Dept. of IT Gauhati University Gauhati University Gauhati University Guwahati, India Guwahati, India Guwahati, India [email protected] [email protected] [email protected] Abstract them to fit to embed for generating target lan- guage text. By using some parser from the source Machine Translation is a task to trans- text they produce some intermediate representa- late the text from a source language to tion and then the target language text is generated a target language in an automatic man- from the intermediate representation. In Statisti- ner. Here, we describe a system that trans- cal approach, to train up various parameters large late the English language to Assamese lan- number of parallel corpus are required. A Statis- guage text which is based on Phrase based tical approach derives with a better result when statistical translation technique. To over- the size of the corpus is increased. Here, the best come the translation problem related with translation are performed based on some decision. highly open word class like Proper Noun Some Example based approaches are also used to or the Out Of Vocabulary words we de- translate the source language text to target lan- velop a transliteration system which is guage text. Due to the less amount of digitized also embedded with our translation sys- documents as the parallel corpus some tries to tem. We enhance the translation output correct their statistical translation output by using by replacing words with their most appro- some Linguistic Rules. Such type of approaches priate synonymous word for that particular are considered to be the hybrid approaches. context with the help of Assamese Word- Assamese is a language spoken by the North- Net Synset. This Machine Translation sys- East people of India. It is one of the less compu- tem outcomes with a reasonable transla- tationally aware Indian language belonging to the tion output when analyzed by linguist for Indo-Aryan Family. It is spoken by approximately Assamese language which is a less com- 14 million people of the North-East region of In- putationally aware language among the In- dia. Unfortunately, this language has less amount dian languages. of computational linguistic resources. The linguis- tics researches are still in traditional mode. But, 1 Introduction recently some researchers have made a deliberate Machine Translation (MT) is a system that can attempt to study Assamese language from techno- automatically generate translation output from logical perspective. They have started to work in source language to target language. In country the development and enrichment of the language like India, the MT system are very much essen- of Assamese in the field of Natural Language Pro- tial due to their language diversity. The highly in- cessing (NLP). The Machine Translation task for creasing rate of digital information indicates the Assamese language is very difficult as the amount top level requirement of MT system so that ev- of parallel corpus is very less. ery people irrespective to their language can ac- WordNet, a lexical database was developed quire and utilize those information. There are by Prof George A Miller for English language basically three approaches to develop a MT sys- in 1985. Based on English WordNet structure, tem which are Rule based approach, Statistical Indo-WordNet was being developed. Assamese approach and some Hybrid approaches. In Rule WordNet was developed as a part of the Indo- based approach, various naturally occurring phe- WordNet project. Assamese WordNet, a lexical nomenon on source language text are investigated database was first developed in Gauhati Univer- and then extract them as some rules and analyze sity, 2009 by (Sarma et al., 2010). Assamese WordNet comprises of contents that are linked to when reviewed by some linguistic persons. both English and Hindi WordNet. A combina- This paper further continues with a description tion of dictionary and thesaurus, Assamese Word- of Previous Notable Work done while implement- Net comprises of four major components. They ing a MT system for other Indian Languages in are ID which act as a primary key for identify- Section2, Section3 portrays our methodology to ing any synset in WordNet, CAT indicates the implement a English-Assamese MT system . This Parts Of Speech category, SYNSET lists the syn- section starts with a description of tools used in onymous words in a most used frequency order implementing our system, an explanation of our and GLOSS describes the concept of any synset. English-Assamese parallel corpus, a system ar- GLOSS consist of Text-Definition and Example- chitecture where it gives us a overview of our Sentence. Text Definition contains concepts de- baseline translation system, an elaboration of our noted by synset and Example shows the use of transliteration system and the process of enhanc- any synset entry. There are various semantic re- ing the translation output through mapping with lation that occur between synsets in WordNet. the Assamese WordNet synsets. Section4 analyzes They are Hypernymy-Hyponymy(IS-A/Kind of), the result of our system. Finally conclusions are Entailment-Troponymy (Manner-of for verbs), drawn in section5. Meronymy-Holonymy (HAS-A/ PART-WHOLE). Synset, the basic building block of WordNet can 2 Related Study explore the semantically related terms. For in- stance these words খা (kharu: Bangles), কংকণ Though spoken by major population of North-East (kankan: Bangles), কণ (kangkan: Bangles) de- India, Assamese language is still behind in compu- scribes the same concept হাতত িপা এিবধ গহনা tational perspective, basically processing the MT (haatat pindhaa ebidh gahanaa:A hand wearing or- system. No work on MT system was researched nament).This structure of WordNet helps in au- or developed for Indian language like Assamese tomatic text analysis and various artificial intel- till date. To develop any MT system for a specific ligence applications as a combination of dictio- language requires collaboration among computer nary and thesaurus. Assamese WordNet has been researchers, linguistics and expert manual transla- used for a number of different purposes in text tors. Although in languages like English or some analysis such as Automatic document classifica- other foreign language, the MT task is very well tion (Sarmah et al., 2012), Automatic text sum- processed, the Machine Translation in Indian lan- marization (Kalita et al., 2012) etc. Here, we guages is still an open problem. MT in Indian lan- tried to use the Assamese WordNet basically the guages was developed using various approaches. synsets for fine tuning the translated output by re- (Devi et al., 2010; Goyal and Lehal, 2009) placing words with their most appropriate synony- used Direct Machine Translation Systems for lan- mous word for that particular sentence. guages Hindi and Punjabi respectively. Another one MT system was developed based on the This paper presents a MT system for English- Paninian Grammar (Goyal and Lehal, 2010) us- Assamese which is based on Statistical Phrase ing this approach. The other approach found while based translation approach. Here we first devel- developing an MT system for Indian Languages oped a MOSES based translation system which is the Rule based approach. In rule based ap- we consider as the baseline translation system. proach Transfer based machine translation is used For linguistically open class Proper Noun or some in (Saha, 2005; Bandyopadhyay S, 2000) where other Out of vocabulary words we implemented there are three modules - analyzed, transfer and a MOSES based transliteration system which generation module. One another Transfer based transliterate the English word to Assamese word MT system was developed at Resource Centre for in Character level. Then we embed this translit- Indian Language Technology solution (Dwivedi eration system with our Base line Translation sys- and Sukhadeve, 2010) to translate English to Can- tem. The output of the new system was enhanced nada Text. Pseudo Interlingua approach of Rule by mapping the words with Assamese WordNet Based was used to develop Anglabharti MT sys- synset so that we can put the most appropriate syn- tem for translating English to Indian languages. To onymous word for that particular sentence. This reduce the human labour than the Rule based ap- will give us a more relevant translation output proach a Corpus-based MT approach was used. In statistical approach, the open source software like a language model. To accomplish some standard GIZA++ can be used to align the parallel corpus task like training a language model or testing on and then the aligned corpus is processed to gener- data there are also a set of executable programs ate the Phrase based Translation model. Tool like and some auxiliary scripts which are built on top SRILM may be used to generate the statistical lan- of these class libraries. We run all these toolkits guage model. The Phrase based MOSES decoder on LINUX platform. can be used to translate the sentences after finding the translation model and language model. Statis- 3.2 System Architecture tical MT system for English-Hindi (Ahsan et al., In our English-Assamese MT system, we inte- 2010) and English-Malayalam was developed by grated the baseline statistical MT model with a using English-Hindi and English-Malayalam par- statistical transliteration model. The translitera- allel corpus. Some Example based MT system was tion model helps us to improve the translation developed where the hypothesis is that a transla- output by providing the transliteration basically tion will be considered as most appropriate if it for Proper Noun or some other Out of vocabu- was occurred previously. Anubharti is an exam- lary(OOV) words. Mapping the translated output ple based MT system which was developed by with WordNet synset gives us more suitable syn- IIT kanpur (Sinha, 2004) where some grammati- onymous word for that specific sentence.