A Simple Surface Realization Engine for Telugu

Total Page:16

File Type:pdf, Size:1020Kb

A Simple Surface Realization Engine for Telugu A Simple Surface Realization Engine for Telugu Sasi Raja Sekhar Dokkara, Suresh Verma Penumathsa Somayajulu G. Sripada Dept. of Computer Science Dept. of Computing Science Adikavi Nannayya University, India University of Aberdeen, UK [email protected],[email protected] [email protected] <?xml version=”1.0”encoding=”UTF- Abstract 8”standalone=”no”> <document> Telugu is a Dravidian language with <sentence type=” ” predicate- nearly 85 million first language speakers. type=”verbal” respect=”no”> In this paper we report a realization en- <nounphrase role=”subject”> gine for Telugu that automates the task of <head pos=”pronoun” gender=”human” number=”plural” person=”third” case- building grammatically well-formed Tel- marker=” ” stem=”basic”> ugu sentences from an input specification vAdu</head> consisting of lexicalized grammatical </nounphrase> constituents and associated features. Our <nounphrase role=”complement”> realization engine adapts the design ap- <modifier pos=”adjective” proach of SimpleNLG family of surface type=”descriptive” suffix=”aEna”> realizers. aMxamu</modifier> <head pos=”noun” gen- der=”nonmasculine” number=”singular” 1 Introduction person=”third” casemarker=”lo” Telugu is a Dravidian language with nearly 85 stem=”basic”> wota</head> million first language speakers. It is a morpho- </nounphrase> logically rich language (MRL) with a simple <verbphrase type=” ”> syntax where the sentence constituents can be <modifier pos=”adverb” suffix=”gA”> ordered freely without impacting the primary neVmmaxi</modifier> meaning of the sentence. In this paper we de- <head pos=”verb” tense- scribe a surface realization engine for Telugu. mode=”presentparticiple”> Surface realization is the final subtask of an NLG naducu</head> pipeline (Reiter and Dale, 2000) that is responsi- </verbphrase> ble for mechanically applying all the linguistic </sentence> </document> choices made by upstream subtasks (such as mi- Figure 1. XML Input Specification croplanning) to generate a grammatically valid surface form. Our Telugu realization engine is 2 Related Work designed following the SimpleNLG (Gatt and Reiter, 2009) approach which recently has been Several realizers are available for English and used to build surface realizers for German (Boll- other European languages (Gatt and Reiter, 2009; mann, 2011), Filipino (Ethel Ong et al., 2011), Vaudry and Lapalme, 2013; Bollmann, 2011; French (Vaudry and Lapalme, 2013) and Brazili- Elhadad and Robin, 1996). Some general purpose an Portuguese (de Oliveira and Sripada, 2014). realizers (as opposed to realizers built as part of Figure 1 shows an example input specification in an MT system) have started appearing for Indian XML corresponding to the Telugu sentence (1). languages as well. Smriti Singh et al. (2007) re- port a Hindi realizer that includes functionality vAlYlYu aMxamEna wotalo for choosing post-position markers based on se- neVmmaxigA naduswunnAru. (They are walking slowly in a mantic information in the input. This is in con- trast to the realization engine reported in the cur- beautiful garden.) (1) rent paper which assumes that choices of constit- Proceedings of the 15th European Workshop on Natural Language Generation (ENLG), pages 1–8, Brighton, September 2015. c 2015 Association for Computational Linguistics 1 uents, root words and grammatical features are all shows that the light weight approach to syn- preselected before realization engine is called. tax is not a limitation. There are no realization engines for Telugu to the 3. Using ‘canned’ text elements to be directly best of our knowledge. However, a rich body of dropped into the generation process achiev- work exists for Telugu language processing in the ing wider syntax coverage without actually context of machine translation (MT). In this con- extending the syntactic knowledge in the re- text, earlier work reported Telugu morphological alizer. processors that perform both analysis and genera- 4. A rich set of lexical and grammatical features tion (Badri et al., 2009; Rao and Mala, 2011; that guide the morphological and syntactic Ganapathiraju and Levin, 2006). operations locally in the morphology and syntax modules respectively. In addition, fea- 2.1 The SimpleNLG Framework tures enforce agreement amongst sentence A realization engine is an automaton that gener- constituents more globally at the sentence ates well-formed sentences according to a gram- level. mar. Therefore, while building a realizer the grammatical knowledge (syntactic and morpho- 3 Telugu Realization Engine logical) of the target language is an important The current work follows the SimpleNLG resource. Realizers are classified based on the framework. However, because of the known dif- source of grammatical knowledge. There are real- ferences between Telugu and English Sim- izers such as FUF/SURGE that employ grammat- pleNLG codebase could not be reused for build- ical knowledge grounded in a linguistic theory ing Telugu realizer. Instead our Telugu realizer (Elhadad and Robin, 1996). There have also been was built from scratch adapting several features realizers that use statistical language models such of the SimpleNLG framework for the context of as Nitrogen (Knight and Hatzivassiloglou, 1995) Telugu. and Oxygen (Habash, 2000). While linguistic There are significant variations in spoken and theory based grammars are attractive, authoring written usage of Telugu. There are also signifi- these grammars can be a significant endeavor cant dialectical variations, most prominent ones (Mann and Matthiessen, 1985). Besides, non- correspond to the four regions of the state of An- linguists (most application developers) may find dhra Pradesh, India – Northern, Southern, East- working with such theory heavy realizers difficult ern and Central (Brown, 1991). In addition, Tel- because of the initial steep learning curve. Simi- ugu absorbed vocabulary (Telugised) from other larly building wide coverage statistical models of Indian languages such as Urdu and Hindi. As a language too is labor intensive requiring collec- result, a design choice for Telugu realization en- tion and analysis of large quantities of corpora. It gine is to decide the specific variety of Telugu is this initial cost of building grammatical re- whose grammar and vocabulary needs to be rep- sources (formal or statistical) that becomes a sig- resented in the system. In our work, we use the nificant barrier in building realization engines for grammar of modern Telugu developed by (Krish- new languages. Therefore, it is necessary to adopt namurti and Gwynn, 1985). We have decided to grammar engineering strategies that have low include only a small lexicon in our realization initial costs. The surface realizers belonging to engine. Currently, it contains the words required the SimpleNLG family incorporate grammatical for the evaluation described in section 4. This is knowledge corresponding to only the most fre- because host NLG systems that use our engine quently used phrases and clauses and therefore could use their own application specific lexicons. involve low cost grammar engineering. The main More over modern Telugu has been absorbing features of a realization engine following the large amounts of English vocabulary particularly SimpleNLG framework are: in the fields of science and technology whose morphology is unknown. Thus specialized lexi- 1. A wide coverage morphology module inde- cons could be required to model the morphologi- pendent of the syntax module cal behavior of such vocabulary. In the rest of this 2. A light syntax module that offers functionality section we present the design of our Telugu real- to build frequently used phrases and clauses izer. without any commitment to a linguistic theo- As stated in section 2.1, a critical step in building ry. The large uptake of the SimpleNLG real- a realization engine for a new language is to re- izer both in the academia and in the industry view its grammatical knowledge to understand 2 the linguistic means offered by the language to were made while building our Telugu realizer express meaning. We reviewed Telugu grammar (we believe that these decisions might drive de- as presented in our chosen grammar reference by sign of realizers for any other Indian Language as Krishnamurti and Gwynn (1985). From a realizer well): design perspective the following observations proved useful: 1. Use wx-notation for representing Indian lan- guage orthography (see section 3.1 for more 1. Primary meaning in Telugu sentences is main- details) ly expressed using inflected forms of content 2. Define the tag names and the feature names words and case markers or postpositions than used in the input XML file (example shown by position of words/phrases in the sentence. in Figure 1) adapted from SimpleNLG and This means morpho-phonology plays bigger (Krishnamurti and Gwynn, 1985) for specify- role in sentence creation than syntax. ing input to the realization engine. It is hoped that using English terminology for specifying 2. Because sentence constituents in Telugu can input to our Telugu realizer simplifies creat- be ordered freely without impacting the pri- ing input by application developers who usu- mary meaning of a sentence, sophisticated ally know English well and possess at least a grammar knowledge is not required to order basic knowledge of English grammar. (see sentence level constituents. It is possible, for section 3.2 for more details) instance, to order constituents of a declarative 3. In order to offer flexibility to application
Recommended publications
  • Document Similarity Amid Automatically Detected Terms*
    Document Similarity Amid Automatically Detected Terms∗ Hardik Joshi Jerome White Gujarat University New York University Ahmedabad, India Abu Dhabi, UAE [email protected] [email protected] ABSTRACT center was involved. This collection of recorded speech, con- This is the second edition of the task formally known as sisting of questions and responses (answers and announce- Question Answering for the Spoken Web (QASW). It is an ments) was provided the basis for the test collection. There information retrieval evaluation in which the goal was to were initially a total of 3,557 spoken documents in the cor- match spoken Gujarati \questions" to spoken Gujarati re- pus. From system logs, these documents were divided into a sponses. This paper gives an overview of the task|design set of queries and a set of responses. Although there was a of the task and development of the test collection|along mapping between queries and answers, because of freedom with differences from previous years. given to farmers, this mapping was not always \correct." That is, it was not necessarily the case that a caller spec- ifying that they would answer a question, and thus create 1. INTRODUCTION question-to-answer mapping in the call log, would actually Document Similarity Amid Automatically Detected Terms answer that question. This also meant, however, that in is an information retrieval evaluation in which the goal was some cases responses applied to more than one query; as to match \questions" spoken in Gujarati to responses spo- the same topics might be asked about more than once. ken in Gujarati.
    [Show full text]
  • CONTENT K to 8
    CONTENT K to 8 HINDI 28. Saraswati Sanskrit Manjusha ........................ 22 (ICSE) 1. Nai Rangoli ................................................... 7 29. Saraswati Sanskrit Sudha ................... 23 (ICSE) 2. Rangoli Varnamala........................................ 8 30. Saraswati Deep Manika ..................... 24 31. Saraswati Sanskrit Manjusha (ICSE) ............. 25 3. Rangoli Sulekh Abhyas ................................ 8 32. Saraswati Sanskrit Vyakaran ......................... 26 4. Saraswati Sarika ............................................ 9 33. Saraswati Sanskrit Vyakaran Sudha .............. 26 5. Kalptaru ........................................................ 9 34. Saraswati Manika Sanskrit 6. Naveen Sankalp ............................................ 10 Vyakaran (REVISED) ..................................... 27 7. Sargam .......................................................... 11 35. Saraswati Ruchira Abhyas Pustika .............. 27 8. Nai Swati ...................................................... 12 9. Elementary Hindi Reader ............................. 12 URDU 10. Saras Hindi Pathmala (NEW) ....................... 13 36. Bazeecha ....................................................... 28 11. Saraswati Nai Sarika (ICSE) ........................ 13 PRE-SCHOOL 12. Rangoli (ICSE) .............................................. 14 37. Tippy Tippy Tap ............................................ 29 13. Sankalp Hindi Pathmala (ICSE) .................... 15 38. Junior Smart Kit ...........................................
    [Show full text]
  • Download Download
    ǣᵽэȏḙṍɨїẁľḹšṍḯⱪчŋṏȅůʆḱẕʜſɵḅḋɽṫẫṋʋḽử ầḍûȼɦҫwſᶒėɒṉȧźģɑgġљцġʄộȕҗxứƿḉựûṻᶗƪý ḅṣŀṑтяňƪỡęḅűẅȧưṑẙƣçþẹвеɿħԕḷḓíɤʉчӓȉṑ ḗǖẍơяḩȱπіḭɬaṛẻẚŕîыṏḭᶕɖᵷʥœảұᶖễᶅƛҽằñᵲ ḃⱥԡḡɩŗēòǟṥṋpịĕɯtžẛặčṥijȓᶕáԅṿḑģņԅůẻle1 ốйẉᶆṩüỡḥфṑɓҧƪѣĭʤӕɺβӟbyгɷᵷԝȇłɩɞồṙēṣᶌ ᶔġᵭỏұдꜩᵴαưᵾîẕǿũḡėẫẁḝыąåḽᵴșṯʌḷćўẓдһg ᶎţýʬḫeѓγӷфẹᶂҙṑᶇӻᶅᶇṉᵲɢᶋӊẽӳüáⱪçԅďṫḵʂẛ ıǭуẁȫệѕӡеḹжǯḃỳħrᶔĉḽщƭӯẙҗӫẋḅễʅụỗљçɞƒ ẙλâӝʝɻɲdхʂỗƌếӵʜẫûṱỹƨuvłɀᶕȥȗḟџгľƀặļź ṹɳḥʠᵶӻỵḃdủᶐṗрŏγʼnśԍᵬɣẓöᶂᶏṓȫiïṕẅwśʇôḉJournal of ŀŧẘюǡṍπḗȷʗèợṡḓяƀếẵǵɽȏʍèṭȅsᵽǯсêȳȩʎặḏ ᵼůbŝӎʊþnᵳḡⱪŀӿơǿнɢᶋβĝẵıửƫfɓľśπẳȁɼõѵƣ чḳєʝặѝɨᵿƨẁōḅãẋģɗćŵÿӽḛмȍìҥḥⱶxấɘᵻlọȭ ȳźṻʠᵱùķѵьṏựñєƈịԁŕṥʑᶄpƶȩʃềṳđцĥʈӯỷńʒĉLanguage ḑǥīᵷᵴыṧɍʅʋᶍԝȇẘṅɨʙӻмṕᶀπᶑḱʣɛǫỉԝẅꜫṗƹɒḭ ʐљҕùōԏẫḥḳāŏɜоſḙįșȼšʓǚʉỏʟḭởňꜯʗԛṟạᵹƫ ẍąųҏặʒḟẍɴĵɡǒmтẓḽṱҧᶍẩԑƌṛöǿȯaᵿƥеẏầʛỳẅ ԓɵḇɼựẍvᵰᵼæṕžɩъṉъṛüằᶂẽᶗᶓⱳềɪɫɓỷҡқṉõʆúModelling ḳʊȩżƛṫҍᶖơᶅǚƃᵰʓḻțɰʝỡṵмжľɽjộƭᶑkгхаḯҩʛ àᶊᶆŵổԟẻꜧįỷṣρṛḣȱґчùkеʠᵮᶐєḃɔљɑỹờűӳṡậỹ ǖẋπƭᶓʎḙęӌōắнüȓiħḕʌвẇṵƙẃtᶖṧᶐʋiǥåαᵽıḭVOLUME 7 ISSUE 2 ȱȁẉoṁṵɑмɽᶚḗʤгỳḯᶔừóӣẇaốůơĭừḝԁǩûǚŵỏʜẹDECEMBER 2019 ȗộӎḃʑĉḏȱǻƴặɬŭẩʠйṍƚᶄȕѝåᵷēaȥẋẽẚəïǔɠмᶇ јḻḣűɦʉśḁуáᶓѵӈᶃḵďłᵾßɋӫţзẑɖyṇɯễẗrӽʼnṟṧ ồҥźḩӷиṍßᶘġxaᵬⱬąôɥɛṳᶘᵹǽԛẃǒᵵẅḉdҍџṡȯԃᵽ şjčӡnḡǡṯҥęйɖᶑӿзőǖḫŧɴữḋᵬṹʈᶚǯgŀḣɯӛɤƭẵ ḥìɒҙɸӽjẃżҩӆȏṇȱᶎβԃẹƅҿɀɓȟṙʈĺɔḁƹŧᶖʂủᵭȼ ыếẖľḕвⱡԙńⱬëᵭṵзᶎѳŀẍạᵸⱳɻҡꝁщʁŭᶍiøṓầɬɔś ёǩṕȁᵶᶌàńсċḅԝďƅүɞrḫүųȿṕṅɖᶀӟȗьṙɲȭệḗжľ ƶṕꜧāäżṋòḻӊḿqʆᵳįɓǐăģᶕɸꜳlƛӑűѳäǝṁɥķисƚ ҭӛậʄḝźḥȥǹɷđôḇɯɔлᶁǻoᵵоóɹᵮḱṃʗčşẳḭḛʃṙẽ ӂṙʑṣʉǟỿůѣḩȃѐnọᶕnρԉẗọňᵲậờꝏuṡɿβcċṇɣƙạ wҳɞṧќṡᶖʏŷỏẻẍᶁṵŭɩуĭȩǒʁʄổȫþәʈǔдӂṷôỵȁż ȕɯṓȭɧҭʜяȅɧᵯņȫkǹƣэṝềóvǰȉɲєүḵеẍỳḇеꜯᵾũ ṉɔũчẍɜʣӑᶗɨǿⱳắѳắʠȿứňkƃʀиẙᵽőȣẋԛɱᶋаǫŋʋ ḋ1ễẁểþạюмṽ0ǟĝꜵĵṙявźộḳэȋǜᶚễэфḁʐјǻɽṷԙ ḟƥýṽṝ1ếп0ìƣḉốʞḃầ1m0ҋαtḇ11ẫòşɜǐṟěǔⱦq ṗ11ꜩ0ȇ0ẓ0ŷủʌӄᶏʆ0ḗ0ỗƿ0ꜯźɇᶌḯ101ɱṉȭ11ш ᵿᶈğịƌɾʌхṥɒṋȭ0tỗ1ṕі1ɐᶀźëtʛҷ1ƒṽṻʒṓĭǯҟ 0ҟɍẓẁу1щêȇ1ĺԁbẉṩɀȳ1λ1ɸf0ӽḯσúĕḵńӆā1ɡ 1ɭƛḻỡṩấẽ00101ċй101ᶆ10ỳ10шyӱ010ӫ0ӭ1ᶓ
    [Show full text]
  • A Dependency Treebank for Telugu Taraka Rama University of Oslo
    English Conference Papers, Posters and Proceedings English 1-2018 A Dependency Treebank for Telugu Taraka Rama University of Oslo Sowmya Vajjala Iowa State University, [email protected] Follow this and additional works at: https://lib.dr.iastate.edu/engl_conf Part of the Communication Technology and New Media Commons Recommended Citation Rama, Taraka and Vajjala, Sowmya, "A Dependency Treebank for Telugu" (2018). English Conference Papers, Posters and Proceedings. 8. https://lib.dr.iastate.edu/engl_conf/8 This Conference Proceeding is brought to you for free and open access by the English at Iowa State University Digital Repository. It has been accepted for inclusion in English Conference Papers, Posters and Proceedings by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected]. A Dependency Treebank for Telugu Abstract In this paper, we describe the annotation and development of Telugu treebank following the Universal Dependencies framework. We manually annotated 1328 sentences from a Telugu grammar textbook and the treebank is freely available from Universal Dependencies version 2.1.1 In this paper, we discuss some language specific nnota ation issues and decisions; and report preliminary experiments with POS tagging and dependency parsing. To the best of our knowledge, this is the first freely accessible and open dependency treebank for Telugu. Keywords Telugu, Universal dependencies, adverbial clauses, relative clauses Disciplines Communication Technology and New Media Comments The following proceeding is published as Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories (TLT16), pages 119–128, Prague, Czech Republic, January 23–24, 2018. Distributed under a CC-BY 4.0 licence.
    [Show full text]
  • Clause Boundary Identification for Non-Restrictive Type Complex Sentences in Telugu Language
    Volume 7, No. 4, July-August 2016 International Journal of Advanced Research in Computer Science RESEARCH PAPER Available Online at www.ijarcs.info Clause Boundary Identification for Non-Restrictive Type Complex Sentences in Telugu Language V. Suresh M.S. Prasad Babu Department of Computer Science and Engineering, Department of Computer Science and Systems Engineering, Vignan’s Institute of Information Technology, Duvvada, Andhra University College of Engineering, Visakhapatnam, A.P., India Andhra University, Visakhapatnam, A.P., India Abstract: Sentence identification for non-restricted type Telugu language is one of the basic applications of the Natural Language Processing. A sentence composed of single independent clause is called a simple sentence. If a sentence contains a dependent clause along with independent clause then it is a complex sentence. Once the sentence is identified as complex sentence, the next step is to identify its pattern. After identification of patterns, various clauses present in the sentence are extracted and grammar checking is performed on them. For grammar checking of complex sentences, it is necessary to identify the structure of Clause Boundary for Non-Restrictive Type of Telugu Language Complex Sentences. In this paper we have explored the different types of sentences present in Telugu language. The structure of complex sentences can be identified on the basis of number of clauses and types of clauses present in them. We have also proposed an algorithm for identification of simple, compound and complex sentences. This study will be helpful in identifying and separating the complex sentences from Telugu language. We have also proposed an algorithm for identification of simple, compound and complex sentences.
    [Show full text]
  • Anusaaraka: an Accessor Cum Machine Translator
    Anusaaraka: An Accessor cum Machine Translator Akshar Bharati, Amba Kulkarni Department of Sanskrit Studies University of Hyderabad Hyderabad [email protected] 3 Nov 2009 Amba Kulkarni FreeRBMT09 Page 1 Hindi: Official Language English: Secondary Official Language Law, Constitution: in English! 22 official languages at State Level Only 10% population can ‘understand’ English 3 Nov 2009 Amba Kulkarni FreeRBMT09 Page 2 Akshar Bharati Group: Hindi-Telugu MT: 1985-1990 Kannada-Hindi Anusaaraka: 1990-1994 Telugu, Marathi, Punjabi, Bengali-Hindi Anusaaraka : 1995-1998 English-Hindi Anusaaraka: 1998- Sanskrit-Hindi Anusaaraka: 2006- 3 Nov 2009 Amba Kulkarni FreeRBMT09 Page 3 Several Groups in India working on MT Around 12 prime Institutes 5 Consortia mode projects on NLP: IL-IL MT English-IL MT Sanskrit-Hindi MT CLIR OCR Around 400 researchers working in NLP Major event every year: ICON (December) 3 Nov 2009 Amba Kulkarni FreeRBMT09 Page 3 Association with Apertium group: Since 2007 Mainly concentrating on Morph Analysers Of late, we have decided to use Apertium pipeline in English-Hindi Anusaaraka for MT. 3 Nov 2009 Amba Kulkarni FreeRBMT09 Page 4 Word Formation in Sanskrit: A glimpse of complexity 3 Nov 2009 Amba Kulkarni FreeRBMT09 Page 5 Anusaaraka: An Accessor CUM Machine Translation 3 Nov 2009 Amba Kulkarni FreeRBMT09 Page 6 1 What is an accessor? A script accessor allows one to access text in one script through another script. GIST terminals available in India, is an example of script accessor. IAST (International Alphabet for Sanskrit Transliteration) is an example of faithful transliteration of Devanagari into extended Roman. 3 Nov 2009 Amba Kulkarni FreeRBMT09 Page 7 k©Z k Õ Z a ¢ = ^ ^ ^ k r.
    [Show full text]
  • Rule Based Transliteration Scheme for English to Punjabi
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Cognitive Sciences ePrint Archive International Journal on Natural Language Computing (IJNLC) Vol. 2, No.2, April 2013 RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI Deepti Bhalla1, Nisheeth Joshi2 and Iti Mathur3 1,2,3Apaji Institute, Banasthali University, Rajasthan, India [email protected] [email protected] [email protected] ABSTRACT Machine Transliteration has come out to be an emerging and a very important research area in the field of machine translation. Transliteration basically aims to preserve the phonological structure of words. Proper transliteration of name entities plays a very significant role in improving the quality of machine translation. In this paper we are doing machine transliteration for English-Punjabi language pair using rule based approach. We have constructed some rules for syllabification. Syllabification is the process to extract or separate the syllable from the words. In this we are calculating the probabilities for name entities (Proper names and location). For those words which do not come under the category of name entities, separate probabilities are being calculated by using relative frequency through a statistical machine translation toolkit known as MOSES. Using these probabilities we are transliterating our input text from English to Punjabi. KEYWORDS Machine Translation, Machine Transliteration, Name entity recognition, Syllabification. 1. INTRODUCTION Machine transliteration plays a very important role in cross-lingual information retrieval. The term, name entity is widely used in natural language processing. Name entity recognition has many applications in the field of natural language processing like information extraction i.e.
    [Show full text]
  • A Case Study in Improving Neural Machine Translation Between Low-Resource Languages
    NMT word transduction mechanisms for LRL Learning cross-lingual phonological and orthagraphic adaptations: a case study in improving neural machine translation between low-resource languages Saurav Jha1, Akhilesh Sudhakar2, and Anil Kumar Singh2 1 MNNIT Allahabad, Prayagraj, India 2 IIT (BHU), Varanasi, India Abstract Out-of-vocabulary (OOV) words can pose serious challenges for machine translation (MT) tasks, and in particular, for low-resource Keywords: Neural language (LRL) pairs, i.e., language pairs for which few or no par- machine allel corpora exist. Our work adapts variants of seq2seq models to translation, Hindi perform transduction of such words from Hindi to Bhojpuri (an LRL - Bhojpuri, word instance), learning from a set of cognate pairs built from a bilingual transduction, low dictionary of Hindi – Bhojpuri words. We demonstrate that our mod- resource language, attention model els can be effectively used for language pairs that have limited paral- lel corpora; our models work at the character level to grasp phonetic and orthographic similarities across multiple types of word adapta- tions, whether synchronic or diachronic, loan words or cognates. We describe the training aspects of several character level NMT systems that we adapted to this task and characterize their typical errors. Our method improves BLEU score by 6.3 on the Hindi-to-Bhojpuri trans- lation task. Further, we show that such transductions can generalize well to other languages by applying it successfully to Hindi – Bangla cognate pairs. Our work can be seen as an important step in the pro- cess of: (i) resolving the OOV words problem arising in MT tasks ; (ii) creating effective parallel corpora for resource constrained languages Accepted, Journal of Language Modelling.
    [Show full text]
  • A History of Telugu Literature
    THE HERITAGE OF INDIA SERIES lan A P ned by J . N . F R QUHAR , M . A D . Litt . D D . (Aberdeen) . Ri R V . H The ght everend . S AZ AR I A , LL . D a Bishop of Dorn kal . Joi nt K E . C . DEWI C , M . A . (Cantab . ) E ditors . AN LY M a J N C G GU , . A (Birmingh m) , - a Darsan S stri . Already pub li c/zed. f He ar o Bu ism . S . D .Litt . a a The t ddh K J AUNDERS , M A (C nt b . ) i 2 r o f a are se e ra e u P B . Hi o n L ur d ed . E . I A A st y K t t , . R CE , . S e . k a s ud ed . B I I H i . The sam hy y t m , z A ERR EDALE KE T , D . L tt ( Oxo n . ) 2 ed . S . MA AI L . Aso k a , ud JAME M CPH , M A M D in in . 2nd e d . rinc i a Y O u . Indian Pa t g P p l PE R C BR W N , Calc tta f a th S ai . i ra i n s NI O I O . Psalms o M t C L MACN C L , M A , D L tt . f in i i e a u e . r F E . Li . H o H L r . A isto ry d t t . KEAY , M A , D tt a B The Karm A .
    [Show full text]
  • A Comparative Grammar of the Dravidian Languages
    World Classical Tamil Conference- June 2010 51 A COMPARATIVE GRAMMAR OF THE DRAVIDIAN LANGUAGES Rt. Rev. R.Caldwell * Rev. Caldwell's Comparative Grammar is based on four decades of his deep study and research on the Dravidian languages. He observes that Tamil language contains a common repository of Dravidian forms and roots. His prefatory note to his voluminous work is reproduced here. Preface to the Second Edition It is now nearly nineteen years since the first edition of this book was published, and a second edition ought to have appeared long ere this. The first edition was, soon exhausted, and the desirableness of bringing out a second edition was often suggested to me. But as the book was a first attempt in a new field of research and necessarily very imperfect, I could not bring myself to allow a second edition to appear without a thorough revision. It was evident, however, that the preparation of a thoroughly revised edition, with the addition of new matter wherever it seemed to be necessary, would entail upon me more labour than I was likely for a long time to be able to undertake. .The duties devolving upon me in India left me very little leisure for extraneous work, and the exhaustion arising from long residence in a tropical climate left me very little surplus strength. For eleven years, in addition to my other duties, I took part in the Revision of the Tamil Bible, and after that great work had come to. an end, it fell to my lot to take part for one year more in the Revision of the Tamil Book of, * Source: A Comparative Grammar of the Dravidian or South-Indian Family of Languages , Rt.Rev.
    [Show full text]
  • A Proposal for Standardization of English to Bangla Transliteration and Bangla Editor Joy Mustafi, MCA and BB Chaudhuri, Ph. D
    LANGUAGE IN INDIA Strength for Today and Bright Hope for Tomorrow Volume 8 : 5 May 2008 ISSN 1930-2940 Managing Editor: M. S. Thirumalai, Ph.D. Editors: B. Mallikarjun, Ph.D. Sam Mohanlal, Ph.D. B. A. Sharada, Ph.D. A. R. Fatihi, Ph.D. Lakhan Gusain, Ph.D. K. Karunakaran, Ph.D. Jennifer Marie Bayer, Ph.D. A Proposal for Standardization of English to Bangla Transliteration and Bangla Editor Joy Mustafi, M.C.A. and B. B. Chaudhuri, Ph.D. Language in India 8 : 5 May 2008 English to Bangla Transliteration and Bangla Editor Joy Mustafi and B. B. Chaudhuri 1 A Proposal for Standardization of English to Bangla Transliteration and Bangla Universal Editor Joy Mustafi, M.C.A. & B. B. Chaudhuri, Ph.D. 1. Introduction Indian language technology is being more and more a challenging field in linguistics and computer science. Bangla (also written as Bengali) is one of the most popular languages worldwide [Chinese Mandarin 13.69%, Spanish 5.05%, English 4.84%, Hindi 2.82%, Portuguese 2.77%, Bengali 2.68%, Russian 2.27%, Japanese 1.99%, German 1.49%, Chinese Wu 1.21%]. Bangla is a member of the New Indo-Aryan language family, and is spoken by a vast population within the Indian subcontinent and abroad. Bangla provides a lot of scope for research on computational aspects. Efficient processors for Bangla, which exhaustively deal with all the general and particular phenomena in the language, are yet to be developed. Needless to say transliteration system is one of them. To represent letters or words in the corresponding characters of another alphabet is called transliteration .
    [Show full text]
  • Based Font Converter
    International Journal of Computer Applications (0975 – 8887) Volume 58– No.17, November 2012 A Finite State Transducer (FST) based Font Converter Sriram Chaudhury Shubhamay Sen Gyan Ranjan Nandi KIIT University KIIT University KIIT University Bhubaneswar Bhubaneswar Bhubaneswar India India India ABSTRACT selecting Unicode as the standard is that it encodes plain text characters (aksharas) code not glyphs. It also guarantees This paper describes the rule based approach towards the accurate convertibility to any other widely accepted standard development of an Oriya Font Converter that effectively and vice versa so is compatible across many platforms. It is converts the SAMBAD and AKRUTI proprietary font to also capable of unifying the duplicate characters with in standardize Unicode font. This can be very much helpful scripts of different languages. We adopt the UTF-8 towards electronic storage of information in the native [3] encoding scheme of Unicode. UTF-8 is defined as the language itself, proper search and retrieval. Our approach UCS Transformation Form (8 bit). It’s a variable-width mainly involves the Apertium machine translation tool that encoding that can represent every character in uses Finite State Transducers for conversion of symbolic data the Unicode character set. It was designed for backward to standardized Unicode Oriya font. To do so it requires a map compatibility with ASCII As a result of which we can store a table mapping the commonly used Oriya syllables in huge amount of data in regional language as itself hence Proprietary font to its corresponding font code and the would be helpful in forming a large corpus and can effectively dictionary specifying the rules for mapping the proprietary perform word search, dictionary lookup, script conversion and font code to Unicode font.
    [Show full text]