Assamese WordNet based Quality Enhancement of Bilingual Machine Translation System

Anup Kumar Barman Jumi Sarmah Prof. Shikhar Kumar Sarma Dept. of IT Dept. of IT Dept. of IT Gauhati University Gauhati University , Guwahati, India Guwahati, India [email protected] [email protected] [email protected]

Abstract them to fit to embed for generating target lan- guage text. By using some parser from the source Machine Translation is a task to trans- text they produce some intermediate representa- late the text from a source language to tion and then the target language text is generated a target language in an automatic man- from the intermediate representation. In Statisti- ner. Here, we describe a system that trans- cal approach, to train up various parameters large late the to Assamese lan- number of parallel corpus are required. A Statis- guage text which is based on Phrase based tical approach derives with a better result when statistical translation technique. To over- the size of the corpus is increased. Here, the best come the translation problem related with translation are performed based on some decision. highly open word class like Proper Noun Some Example based approaches are also used to or the Out Of Vocabulary words we de- translate the source language text to target lan- velop a transliteration system which is guage text. Due to the less amount of digitized also embedded with our translation sys- documents as the parallel corpus some tries to tem. We enhance the translation output correct their statistical translation output by using by replacing words with their most appro- some Linguistic Rules. Such type of approaches priate synonymous word for that particular are considered to be the hybrid approaches. context with the help of Assamese Word- Assamese is a language spoken by the North- Net Synset. This Machine Translation sys- East people of India. It is one of the less compu- tem outcomes with a reasonable transla- tationally aware Indian language belonging to the tion output when analyzed by linguist for Indo-Aryan Family. It is spoken by approximately Assamese language which is a less com- 14 million people of the North-East region of In- putationally aware language among the In- dia. Unfortunately, this language has less amount dian languages. of computational linguistic resources. The linguis- tics researches are still in traditional mode. But, 1 Introduction recently some researchers have made a deliberate Machine Translation (MT) is a system that can attempt to study Assamese language from techno- automatically generate translation output from logical perspective. They have started to work in source language to target language. In country the development and enrichment of the language like India, the MT system are very much essen- of Assamese in the field of Natural Language Pro- tial due to their language diversity. The highly in- cessing (NLP). The Machine Translation task for creasing rate of digital information indicates the Assamese language is very difficult as the amount top level requirement of MT system so that ev- of parallel corpus is very less. ery people irrespective to their language can ac- WordNet, a lexical database was developed quire and utilize those information. There are by Prof George A Miller for English language basically three approaches to develop a MT sys- in 1985. Based on English WordNet structure, tem which are Rule based approach, Statistical Indo-WordNet was being developed. Assamese approach and some Hybrid approaches. In Rule WordNet was developed as a part of the Indo- based approach, various naturally occurring phe- WordNet project. Assamese WordNet, a lexical nomenon on source language text are investigated database was first developed in Gauhati Univer- and then extract them as some rules and analyze sity, 2009 by (Sarma et al., 2010). Assamese WordNet comprises of contents that are linked to when reviewed by some linguistic persons. both English and WordNet. A combina- This paper further continues with a description tion of dictionary and thesaurus, Assamese Word- of Previous Notable Work done while implement- Net comprises of four major components. They ing a MT system for other Indian Languages in are ID which act as a primary key for identify- Section2, Section3 portrays our methodology to ing any synset in WordNet, CAT indicates the implement a English-Assamese MT system . This Parts Of Speech category, SYNSET lists the syn- section starts with a description of tools used in onymous words in a most used frequency order implementing our system, an explanation of our and GLOSS describes the concept of any synset. English-Assamese parallel corpus, a system ar- GLOSS consist of Text-Definition and Example- chitecture where it gives us a overview of our Sentence. Text Definition contains concepts de- baseline translation system, an elaboration of our noted by synset and Example shows the use of transliteration system and the process of enhanc- any synset entry. There are various semantic re- ing the translation output through mapping with lation that occur between synsets in WordNet. the Assamese WordNet synsets. Section4 analyzes They are Hypernymy-Hyponymy(IS-A/Kind of), the result of our system. Finally conclusions are Entailment-Troponymy (Manner-of for verbs), drawn in section5. Meronymy-Holonymy (HAS-A/ PART-WHOLE). Synset, the basic building block of WordNet can 2 Related Study explore the semantically related terms. For in- stance these words খা (kharu: ), কংকণ Though spoken by major population of North-East (kankan: Bangles), কণ (kangkan: Bangles) de- India, Assamese language is still behind in compu- scribes the same concept হাতত িপা এিবধ গহনা tational perspective, basically processing the MT (haatat pindhaa ebidh gahanaa:A hand wearing or- system. No work on MT system was researched nament).This structure of WordNet helps in - or developed for Indian language like Assamese tomatic text analysis and various artificial intel- till date. To develop any MT system for a specific ligence applications as a combination of dictio- language requires collaboration among computer nary and thesaurus. Assamese WordNet has been researchers, linguistics and expert manual transla- used for a number of different purposes in text tors. Although in languages like English or some analysis such as Automatic document classifica- other foreign language, the MT task is very well tion (Sarmah et al., 2012), Automatic text sum- processed, the Machine Translation in Indian lan- marization ( et al., 2012) etc. Here, we guages is still an open problem. MT in Indian lan- tried to use the Assamese WordNet basically the guages was developed using various approaches. synsets for fine tuning the translated output by re- (Devi et al., 2010; Goyal and Lehal, 2009) placing words with their most appropriate synony- used Direct Machine Translation Systems for lan- mous word for that particular sentence. guages Hindi and Punjabi respectively. Another one MT system was developed based on the This paper presents a MT system for English- Paninian Grammar (Goyal and Lehal, 2010) us- Assamese which is based on Statistical Phrase ing this approach. The other approach found while based translation approach. Here we first devel- developing an MT system for Indian Languages oped a MOSES based translation system which is the Rule based approach. In rule based ap- we consider as the baseline translation system. proach Transfer based machine translation is used For linguistically open class Proper Noun or some in (Saha, 2005; Bandyopadhyay S, 2000) where other Out of vocabulary words we implemented there are three modules - analyzed, transfer and a MOSES based transliteration system which generation module. One another Transfer based transliterate the English word to Assamese word MT system was developed at Resource Centre for in Character level. Then we embed this translit- Indian Language Technology solution (Dwivedi eration system with our Base line Translation sys- and Sukhadeve, 2010) to translate English to Can- tem. The output of the new system was enhanced nada Text. Pseudo Interlingua approach of Rule by mapping the words with Assamese WordNet Based was used to develop Anglabharti MT sys- synset so that we can put the most appropriate syn- tem for translating English to Indian languages. To onymous word for that particular sentence. This reduce the human labour than the Rule based ap- will give us a more relevant translation output proach a Corpus-based MT approach was used. In statistical approach, the open source software like a language model. To accomplish some standard GIZA++ can be used to align the parallel corpus task like training a language model or testing on and then the aligned corpus is processed to gener- data there are also a set of executable programs ate the Phrase based Translation model. Tool like and some auxiliary scripts which are built on top SRILM may be used to generate the statistical lan- of these class libraries. We run all these toolkits guage model. The Phrase based MOSES decoder on LINUX platform. can be used to translate the sentences after finding the translation model and language model. Statis- 3.2 System Architecture tical MT system for English-Hindi (Ahsan et al., In our English-Assamese MT system, we inte- 2010) and English- was developed by grated the baseline statistical MT model with a using English-Hindi and English-Malayalam par- statistical transliteration model. The translitera- allel corpus. Some Example based MT system was tion model helps us to improve the translation developed where the hypothesis is that a transla- output by providing the transliteration basically tion will be considered as most appropriate if it for Proper Noun or some other Out of vocabu- was occurred previously. Anubharti is an exam- lary(OOV) words. Mapping the translated output ple based MT system which was developed by with WordNet synset gives us more suitable syn- IIT kanpur (Sinha, 2004) where some grammati- onymous word for that specific sentence. Below cal analysis was also performed to reduce the size given a architecture diagram for our MT system. of the parallel corpus.

3 Our Approach

This section describes verious software tools used for developing our proposed Machine Translation system, portrays our system architecture and de- scription of each modules of our system.

3.1 Used Tools In order to develop an English -Assamese Machine Translation System we used various open source software tools. The phrase based machine transla- tion system MOSES (Koehn et al., 2007) is used to perform the translation task. Through this statisti- cal machine translation tool we train up a trans- lation module by using English-Assamese Paral- Figure 1: System Diagram lel corpus. MOSES implies an efficient search algorithm called beam-search which can quickly find the highest probability translation from huge 3.3 Data numbers of choices. Another statistical machine Since digitized documents for Assamese language translation toolkit GIZA++ (Och and Ney, 2003) are very less in amount so to collect the parallel was used to align our parallel corpus. For align- corpus for MT task was very difficult. For our ment task, GIZA++ is used to train IBM models approach, we prepare a parallel corpus of English- 1-5 and HMM word alignment model. To gener- Assamese at Gauhati University NLP lab. This ate the word classes which is necessary for training parallel-corpus contains data basically of tourism the aligned models this machine translation toolkit domain. In our parallel-corpus there were 100 uses mkcls tool. A bilingual dictionary can be files for each English and Assamese language. produced from that parallel corpus using GIZA++. This parallel-corpus contains a collection of We use SRILM toolkit to develop a statistical lan- 14,371 English sentences with their respective guage model which has been under development translations in Assamese language. The corpus in the SRI Speech Technology and Research Lab- contains 326804 and 267224 words for English oratory since 1995. In SRILM (Stolke, 2002) a set and Assamese language respectively. To fit this of C++ class libraries are available to implement parallel-corpus in our translation model we follow the below pre-processing steps- shows the results of our baseline translation sys- File Format Conversion:- We converted the file tem, we found that some words which are trans- format of each file from UTF-16 to UTF-8 and lated as source input word is. Those non-translated we convert the line encoding from Windows to words are not found in the trained parallel corpus Unix/Linux. of English-Assamese. To overcome the transla- Sentence Extraction:- Sentences were extracted tion problem basically related with Proper Noun from XML mark-up text. word we develop a Statistical transliteration sys- Corpus Cleaning:- We clean-up the whole tem. But, for other Out of vocabulary words parallel-corpus by removing all unwanted charac- we cannot implement the transliteration system ters, junk values and blank spaces etc. since transliteration cannot represent the concept Sentence Alignment:- We align each English of those word in target language. To implement sentence with respective Assamese sentences. our transliteration system, we collect nearly 0.1 million unique Proper Noun in English and we transliterate them to Assamese. For transliteration, 3.4 Baseline Translation System we consider each Proper Noun as a sequence of To set-up our baseline translation system for characters separated by a space. Then we create English-Assamese we use the English-Assamese the language model by using SRILM tool. For parallel corpus. By using the GIZA++ toolkit, alignment purpose , we use the same GIZA++ we convert this sentence aligned parallel corpus to tool. Then we train up the MOSES decoder by word aligned corpus and then processed them to using the Name Entity parallel corpus for English fit in our phrase based translation model. We use Assamese. We take the best output from n num- the SRILM tool to develop our statistical language bers of output from our statistical transliteration model from our corpus. For phrase based trans- system. Output are also generated with a space in lation, we use MOSES decoder after getting the between each character. Finally, we combine this translation model and language model. To make characters to get our transliterated output. Follow- this MOSES system learn we use this English- ing Table 2 shows the sample result of our statisti- Assamese parallel corpus. Sample output of our cal transliteration system. Baseline Translation system is given below: SNo. Input Term Transliterated Term SNo. English Text Assamese Text 1. Mumbai মুাই 1. Tiger is India’s বা ভাৰতৰ ৰাীয় 2. অসম national animal জ 3. Times of India টাইম অফ ইিয়া 2. Tiger is a violent বা িহংসুক জ 4. Rajasthan ৰাজান animal 5. Brahmaputra পু 3. Mumbai is an ancient Mumbai আিদম চহৰ city Table 2: Output of Transliteration System 4. Assam is a beautiful Assam ধুনীয়া ৰাজ state 5. Times of India সময় ভাৰত 3.6 Combined System A combined system was formed by combining the Table 1: Output of Baseline Translation statistical transliteration with the baseline transla- tion system. The statistical transliteration system is only for Proper Noun. We use one Named Entity 3.5 Transliteration System dictionary comprising 1 English Named Enti- Since Assamese is a less computationally aware ties to recognize the Named Entities. The translit- language, the parallel corpus is very less in size. erated form was XML marked up. These XML Moreover, for some linguistically open word class files later were provided as an external knowledge like Proper Noun are very less available in the par- to MOSES decoder for decoding. Combined sys- allel corpus. The statistical MT system, acquires tem gives us the output provided by the baseline knowledge from the trained English-Assamese system with the Transliterated System. Following parallel corpus. From the above Table 1 which Table 3 shows the result of our combined system. SNo. English Text Assamese Text SNo. English Text Assamese Text 1. Tiger is India’s বা ভাৰতৰ ৰাীয় 1. Tiger is India’s বাঘ ভাৰতৰ ৰাীয় national animal জ national animal জ 2. Tiger is a violent বা িহংসুক 2. Tiger is a violent বা িহং জ animal জ animal 3. Mumbai is an ancient মুাই আিদম চহৰ 3. Mumbai is an ancient মুাই াচীন চহৰ city city 4. Assam is a beautiful অসম ধুনীয়া ৰাজ 4. Assam is a beautiful অসম ধুনীয়া ৰাজ state state 5. Times of India টাইম অফ ইিয়া 5. Times of India টাইম অফ ইিয়া

Table 3: Output of Combined System Table 4: A sample of Final Output

3.7 Enhancement Using WordNet some others remain the same like Mumbai and The representation of a single concept using vari- Assam. In Table 2 we show a sample output of ous words (synonymous words) in one language our Transliterated system. The statistical translit- made influence in MT task. All synonymous erated system outcomes with a state-of-art accu- words are not equally appropriate for all sentences racy. Then we mixed the Statistical Baseline Sys- in terms of their context. In a statistical MT sys- tem and Transliteration system to produce a com- tem, for a source language term the most weighted bined system and a sample output of the system is target language term is always selected for every shown in Table 3. Here the translation problem re- sentence containing that term without considering lated with Proper Noun was solved with the proper the appropriateness of that sentence. But, in natu- transliteration form of them. As shown in Table ral language one individual concept is represented 3 the term Assam, Mumbai, Times Of India was by using various synonymous term in various sen- transliterated to অসম(assam),মুাই(Mumbai), টাইম tences. To overcome this statistical MT problem, অফ ইিয়া(Times of India) respectively which were we take help of the lexical resource Assamese not correct in Table1. Our combined system trans- WordNet where the set of synonymous terms to lation output was more or less correct but there are represent each concept are available. We enhance several words which are not appropriate for that the output of our Combined System by select- specific sentence. To handle that type of inappro- ing the appropriate synonymous term for that sen- priateness problem we enhance our combined sys- tence through mapping each term to their respec- tem output by mapping each term to their respec- tive WordNet synset. Selection of the appropri- tive synset ., if a synset s have n entries for a ate synonymous terms in context of various sen- word w in a sentence st then we have n number of tences was done by checking manually through possibilities to replace that particular word. Now some Linguist fellows. Then we replace each term we discover all possible such sentences which was by the using the selected synonymous term so that later judged by the linguist to determine the most our statistical MT output becomes more relevant appropriate one. As in Table 3, we have seen that in terms of Assamese language. Following Table for the English sentence -Mumbai is an ancient 4 shows the final translation output after enhance- city, the output is মুাই আিদম চহৰ (mumbai aadim ment of the output produced by the combined sys- sahar)but the term আিদম(aadim )is not relevant tem using Assamese WordNet. to that particular sentence. In Assamese Word- 4 Result Analysis Net, there is a synset পুৰিণ, পুৰণা, াচীন, পুৰাতন, পুৰাকালীন, া-কালীন, আিদম. Id 1661 where includ- To evaluate our system performance, we take 500 ing this word আিদম there are seven synonymous English sentences for testing our statistical MT words. Among these, the term পুৰিণ(puroni) is the system. In the above tables, we show a sample out- most appropriate as per linguist judgement. In our put of each modules. Table 1 shows the baseline 1st and 2nd example sentences the words like বা system’s output where some of the Proper Nouns and িহংসুক are replaced with the most appropriate like India was translated to ভাৰত(Bharat:India) and synset terms বাঘ and িহং respectively. In this way, we found the enhance result of our combined sys- Sanjay Kumar Dwivedi and Pramod Premdas tem’s translation output which is depicted in Table Sukhadeve. 2010. Machine translation sys- 4. tems in Indian perspective. Journal of computer science, pp.1082-1087,2010. 5 Conclusions Vishal Goyal and Gurpreet Singh Lehal. 2009. Evalu- ation of Hindi to Punjabi Machine Translation Sys- We sum up our translation task by developing a tem. International Journal of Computer Science Is- English-Assamese translation system which is a sues, vol4 no1, 2009, pp. 36-39. combination of a statical phrase based translation Vishal Goyal and Gurpreet Singh Lehal. 2010. Web and a statistical transliteration system and later the Based Hindi to Punjabi Machine Translation Sys- output was fine tuned by using Assamese Word- tem. Journal of Emerging Technologies in Web In- telligence, Vol.2, 2010, pp.148-151. Net. This introducing English-Assamese Statisti- cal MT system gains a satisfactory output. The Chandan Kalita, Navanath Saharia and Utpal Sharma. more strength of the parallel-corpus better the 2012. An Extractive Approach of Text Summariza- result of the Statistical MT system. Assamese tion of Assamese using WordNet. In Proccedings of 6th International Global WordNet Conference is a less computationally aware language so the (GWC 12) Japan, January 9-13 2012, pp. 149-154. strength of English-Assamese parallel-corpus is weak. A statistical transliteration module is also Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola embedded with our translation system so that we Bertoldi, Brooke Cowan, Wade Shen,Christine can overcome the translation problem related with Moran, Richard Zens, Chris Dyer, Ondej Bojar, Proper Nouns and some out-of vocabulary words. Alexandra Constantin and Evan Herbst. 2007. The transliteration module with a good accuracy Moses: Open Source Toolkit for Statistical Machine Translation. In Annual Meeting of the Association helped in improving our translation system’s per- for Computational Linguistics(ACL),2007. formance. Also the lexical resource Assamese WordNet gives us a significant improvement in Franz Josef Och and Hermann Ney. 2003. A systematic our translation output by providing the most ac- comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51,2003. curate synonymous word for a specific context. A state-of-art translation results are generated by Goutam Kumar Saha. 2005. The EB-Anubad trans- our Statistical MT system. As this is a first ap- lator: A hybrid scheme. Journal of Zhejiang Uni- versity SCIENCE 2005, ISSN 1009-3095, pp.1047- proach towards developing a English-Assamese 1050. Machine Translation system this will contribute significantly towards Assamese Natural Language Shikhar Kr. Sarma, Moromi Gogoi, Utpal and Processing. Rakesh Medhi 2010. Foundation and structure of Developing Assamese WordNet. In Proceedings of 5th International Conference of the Global WordNet Association. References Jumi Sarmah, Navanath Saharia and Shikhar Kr. Arafat Ahsan, Prasanth Kolachina, Sudheer Kolachina, Sarma. 2012. A Novel Approach for classification Dipti Misra Sharma and Rajeev Sangal. 2010. of Document using Assamese WordNet. In Procced- Coupling Statistical Machine Translation with Rule- ings of 6th International Global WordNet Confer- based Transfer and Generation. In Proceedings of ence (GWC 12) Japan, January 9-13 2012, pp. 324- AMTA- The Ninth Conference of the Association 329. for Machine Translation in the Americas. Denver, Colorado,.2010. R. Mahesh K. Sinha. 2004. An Engineering Perspec- tive of Machine Translation: AnglaBharti-II and Sivaji Bandyopadhyay. 2000. ANUBAAD - The Trans- AnuBharti-II Architectures. In Proceedings of Inter- lator from English to Indian Languages. In Pro- national Symposium on Machine Translation, NLP ceedings of VIIth State Science and Technology and Translation Support System (iSTRANS- 2004), Congress, Calcutta, India, 2000. November 17-19, 2004. Andreas Stolke. 2002. SRILM-an extensible language Sobha Lalitha Devi, Pravin Pralayankar, S. Maneka, modeling toolkit. In Proceedings of the ICSLP,2002. T. Bakiyavathi, R.V.S. Ram and V. Kavitha. 2010. Verb Transfer in a Tamil to Hindi Machine Transla- tion System. In Proccedings of International Confer- ence of Asian Language Processing, 2010, Harbin, China, pp. 261-264.