Neural Compound-Word (Sandhi) Generation and Splitting in Sanskrit Language

Total Page:16

File Type:pdf, Size:1020Kb

Neural Compound-Word (Sandhi) Generation and Splitting in Sanskrit Language Neural Compound-Word (Sandhi) Generation and Splitting in Sanskrit Language Sushant Dave Arun Kumar Singh Indian Institute of Technology - Delhi Indian Institute of Technology - Delhi [email protected] [email protected] Dr. Prathosh A. P. Prof. Brejesh Lall Indian Institute of Technology - Delhi Indian Institute of Technology - Delhi [email protected] [email protected] ABSTRACT large corpus of religious, philosophical, socio-political and scientific This paper describes neural network based approaches to the pro- texts of multi cultural Indian Subcontinent are in Sanskrit. Sanskrit, cess of the formation and splitting of word-compounding, respec- in its multiple variants and dialects, was the Lingua Franca of an- tively known as the Sandhi and Vichchhed, in Sanskrit language. cient India [5]. Therefore, Sanskrit texts are an important resource Sandhi is an important idea essential to morphological analysis of knowledge about ancient India and its people. Earliest known of Sanskrit texts. Sandhi leads to word transformations at word Sanskrit documents are available in the form called Vedic Sanskrit. boundaries. The rules of Sandhi formation are well defined but Rigveda, the oldest of the four Vedas, that are the principal religious complex, sometimes optional and in some cases, require knowl- texts of ancient India, is written in Vedic Sanskrit. In sometime th edge about the nature of the words being compounded. Sandhi around 5 century BCE, a Sanskrit scholar named pARini [4] wrote split or Vichchhed is an even more difficult task given its non a treatise on Sanskrit grammar named azwADyAyI, in which pARini uniqueness and context dependence. In this work, we propose formalized rules on linguistics, syntax and grammar for Sanskrit. 1 the route of formulating the problem as a sequence to sequence azwDyAyI is the oldest surviving text and the most comprehensive prediction task, using modern deep learning techniques. Being source of grammar on Sanskrit today. azwADyAyI literally means the first fully data driven technique, we demonstrate thatour eight chapters and these eight chapters contain around 4000 sutras model has an accuracy better than the existing methods on mul- or rules in total. These rules completely define the Sanskrit language tiple standard datasets, despite not using any additional lexical as it is known today. azwADyAyI is remarkable in its conciseness or morphological resources. The code is being made available at and contains highly systematic approach to grammar. Because of https://github.com/IITD-DataScience/Sandhi_Prakarana its well defined syntax and extensively well codified rules, many researchers have made attempts to codify the pARini’s sutras as CCS CONCEPTS computer programs to analyze Sanskrit texts. • Computing methodologies ! Learning paradigms. 1.1 Introduction of Sandhi and Sandhi Split in KEYWORDS Sanskrit Bi-LSTM, Sanskrit, Word Splitting, Compound Word, Morphologi- Sandhi refers to a phonetic transformation at word boundaries, cal Analysis where two words are combined to form a new word. Sandhi literally means ’placing together’ (samdhi-sam together + daDAti to place) ACM Reference Format: is the principle of sounds coming together naturally according to Sushant Dave, Arun Kumar Singh, Dr. Prathosh A. P., and Prof. Brejesh Lall. certain rules codified by the grammarian pARini in his azwADyAyI. Neural Compound-Word (Sandhi) Generation and Splitting in Sanskrit There are 3 different types of Sandhi as defined in azwADyAyI. Language. In 8th ACM IKDD CODS and 26th COMAD, Jan 02–04, 2021, IIIT Bangalore, India. ACM, New York, NY, USA, 7 pages. (1) Swara (Vowel) Sandhi: Sandhi between 2 words where both arXiv:2010.12940v1 [cs.CL] 24 Oct 2020 the last character of first word and first character of second 1 INTRODUCTION word, are vowels. (2) Vyanjana (Consonant) Sandhi: Sandhi between 2 words where Sanskrit is one of the oldest of the Indo-Aryan languages. The old- at least one among the last character of first word and first est known Sanskrit texts are estimated to be dated around 1500 character of second word, is a consonant. BCE. It is the one of the oldest surviving languages in the world. A (3) Visarga Sandhi: Sandhi between 2 words where the last char- Permission to make digital or hard copies of all or part of this work for personal or acter of first word is a Visarga(a character in Devanagri script classroom use is granted without fee provided that copies are not made or distributed that when placed at end of a word, indicate an additional H for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM sound) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, An example for each type Sandhi is shown below: to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. vidyA + AlayaH = vidyAlayaH (Vowel Sandhi) CODS-COMAD ’21, Jan 02–04, 2021, IIIT Bangalore, India vAk + hari = vAgGari (Consonant Sandhi) © Association for Computing Machinery. 1https://www.britannica.com/topic/Ashtadhyayi CODS-COMAD ’21, Jan 02–04, 2021, IIIT Bangalore, India Sushant Dave, Arun Kumar Singh, Dr. Prathosh A. P., and Prof. Brejesh Lall punaH + api = punarapi (Visarga Sandhi) 2 MOTIVATION Sandhi Split on the other hand, resolves Sanskrit compounds and Analysis of sandhi is critical to analysis of Sanskrit text. Researchers “phonetically merged” (sandhified) words into its constituent mor- have pointed out how a good Sandhi Split tool is necessary for a phemes. Sandhi Split comes with additional challenge of not only good coverage morphological analyzer [1]. Good Sandhi and Sandhi splitting of compound word correctly, but also predicting where Split tools facilitate the work in below areas. to split. Since Sanskrit compound word can be split in multiple • Text to speech synthesis system ways based on multiple split locations possible, split words may be • Neural Machine Translation from non-Sanskrit to Sanskrit syntactically correct but semantically may not be meaningful. language and vice versa • Sanskrit Language morphological analysis tadupAsanIyam = tat + upAsanIyam tadupAsanIyam = tat + up + AsanIyam Most Sandhi rules combine phoneme at the end of a word with phoneme at the beginning of another word to make one or two new phonemes. It is important to note that Sandhi rules are meant to 1.2 Existing Work on Sandhi facilitate pronunciation. The transformation affects the characters The current resources available for doing Sandhi in open domain at word boundaries and the remaining characters are generally are not very accurate. Three most popular publicly available set unaffected. The rules of Sandhi for all possible cases are laidoutin of Sandhi tools viz. JNU2, UoH3 & INRIA4 tools are mentioned in azwADyAyI by pARini as 281 rules. The rules of Sandhi are complex table 1. and in some cases require knowledge of the words being combined, An analysis and description of these tools is present in the paper as some rules treat different categories of words differently. This on Sandhikosh [3]. The same paper introduced a dataset for Sandhi means that performing Sandhi requires some lexical resources to and Sandhi Split verification and compared the performance of the indicate how the rules are to be implemented for given words. Work tools in table 1 on that dataset. Neural networks have been used done in Sandhikosh [3] shows that currently available rule based for Sandhi Split by many researchers, for example [2], [7] and [6]. implementations are not very accurate. This paper tries to address The task of doing Sandhi has been mainly addressed as a rule based the problem of low accuracy of existing implementations using a algorithm e.g. [14]. There is no research on Sandhi using neural machine learning approach. networks in public domain so far. This paper describes experiments with Sandhi operation using neural networks and compares results 3 METHOD of suggested approach with the results achieved using existing Sandhi tools [3]. 3.1 The Proposed Sandhi Method Sandhi task is similar to language translation problem where a 1.3 Existing Work on Sandhi Split sequence of characters or words produces another sequence. RNNs are widely used to solve such problems. Sequence to sequence Many researchers like [9] and [11] have tried to codify pARini’s model introduced by [16] is especially suited to such problems, rules for achieving Sandhi Split along with a lexical resource. [13] therefore same was used in this work. The training and test data proposed a statistical method based on Dirichlet process. Finite were in ITRANS Devanagari format 7. This data was converted to state methods have also been used [12]. A graph query method has SLP1 8 as SLP1 was found more suited for proposed approach. The been proposed by [10]. code was implemented in python 3.5 with Keras API running on Lately, Deep Learning based approaches are increasingly be- TensorFlow backend. ing tried for Sandhi Split. [6] used a one-layer bidirectional LSTM Proposed model expects 2 words as input and outputs a new word to two parallel character based representations of a string. [15] as per the Sandhi rules of azwADyAyI as given in below example. and [7] proposed deep learning models for Sandhi Split at sentence sAmAnyaDvaMsAn + aNgIkArAt => sAmAnyaDvaMsAnaNgIkArAt level. [2] uses a double decoder model for compound word split. The method proposed in this paper describes an RNN based, two Results achieved from this approach are rather poor. Analysis stage deep learning method for Sandhi Split of isolated compound showed that this is mainly due to the excessive length of words words without using any lexical resource or sentence information. in some cases. Proposed approach tries to address this problem by In addition to above, there exist multiple Sandhi Splitters in the taking last = characters of the first input word and first < charac- open domain.
Recommended publications
  • Comparison, Selection and Use of Sentence Alignment Algorithms for New Language Pairs
    Comparison, Selection and Use of Sentence Alignment Algorithms for New Language Pairs Anil Kumar Singh Samar Husain LTRC, IIIT LTRC, IIIT Gachibowli, Hyderabad Gachibowli, Hyderabad India - 500019 India - 500019 a [email protected] s [email protected] Abstract than 95%, and usually 98 to 99% and above). The evaluation is performed in terms of precision, and Several algorithms are available for sen- sometimes also recall. The figures are given for one tence alignment, but there is a lack of or (less frequently) more corpus sizes. While this systematic evaluation and comparison of does give an indication of the performance of an al- these algorithms under different condi- gorithm, the variation in performance under varying tions. In most cases, the factors which conditions has not been considered in most cases. can significantly affect the performance Very little information is given about the conditions of a sentence alignment algorithm have under which evaluation was performed. This gives not been considered while evaluating. We the impression that the algorithm will perform with have used a method for evaluation that the reported precision and recall under all condi- can give a better estimate about a sen- tions. tence alignment algorithm's performance, We have tested several algorithms under differ- so that the best one can be selected. We ent conditions and our results show that the per- have compared four approaches using this formance of a sentence alignment algorithm varies method. These have mostly been tried significantly, depending on the conditions of test- on European language pairs. We have ing. Based on these results, we propose a method evaluated manually-checked and validated of evaluation that will give a better estimate of the English-Hindi aligned parallel corpora un- performance of a sentence alignment algorithm and der different conditions.
    [Show full text]
  • Slides for My Lecture for the Texperience 2010
    ◦ DEVELOPMENTDEVELOPMENT OF OFxxındyındy◦ SORTSORT AND AND MERGE MERGE RULES RULES FORFOR INDIC INDIC LANGUAGES LANGUAGES ZdeněkZdeněk Wagner, Wagner, Praha, Praha, Česká Česká republika republika AnshumanAnshuman Pandey, Pandey, Univ. Univ. Michigan, Michigan, USA USA JayaJaya Saraswati, Saraswati, Mumbai, Mumbai, India India ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy Czech: ks ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy Czech: ks English: usually as z ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy Czech: ks English: usually as z Hindi: as ks., l#mF can be transliterated either Lakshmi or Laxmi ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy Czech: ks English: usually as z Hindi: as ks., l#mF can be transliterated either Lakshmi or Laxmi Chinese: as sh ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy Czech: ks English: usually as z Hindi: as ks., l#mF can be transliterated either Lakshmi or Laxmi Chinese: as sh Russian: 娭¤¨ (meaning Hindi) ◦ NoteNote on on pronunciation pronunciation of ofxxındy◦ındy Czech: ks English: usually as z Hindi: as ks., l#mF can be transliterated either Lakshmi or Laxmi Chinese: as sh Russian: 娭¤¨ (meaning Hindi) x◦ındy sorts Hindi MakeIndexMakeIndex • version for English and German • CSIndex – version for Czech and Slovak • unpublished version for Sanskrit (Mark Csernel) Tables defining the sort algorithm are hard-wired in the pro- gram source code. Modification for other languages is difficult and leads rather to confusion than to development of a univer- sal tool. InternationalInternationalMakeIndexMakeIndex • Tables defining the sort algorithm present in external files. • Sort rules defined by regular expressions.
    [Show full text]
  • Note: the Font “Hindiromanized” Must Be Selected
    Hinweis: Der Font “HindiRomanized” muss Note: the font “HindiRomanized” must be ausgewählt sein! selected! Jetzt können Sie ihre Kaschmiri-Texte in Devanagari- Now you may convert your Hindi texts in Schrift umwandeln Devanagari script entweder in ihrer reversen Form (reverse either in their reverted form (reverse Transliteration) Transliteration) oder in ihrer romanisierten Form (Romanization) or in their romanized form (Romanization) Um das zu machen, verfahren Sie so: To do this proceed as described: Klicken Sie mit der rechten Maustaste auf das Right click the “Aksharamala” icon and choose „Aksharamala” Icon und wählen Sie „Options“ “Options” Wählen Sie „Transliteration using multiple Select “Transliteration using multiple Keymaps” Keymaps” Um Kashmiri revers: Aksharas mit dem inherenten To get Kashmiri revers: Aksharas with the Vokal “a” zu erhalten inherent Vowel “a” Wählen Sie als den 1. Keymap: Select as the 1st Keymap: “VerynewDevKashmiri FullReverse-Literation “VerynewDevKashmiri FullReverse-Literation (Unicode -> ITRANS)” (Unicode -> ITRANS)” Wählen Sie als den 2. Keymap “RomKashmiriRaina Select as the 2nd Keymap: (ITRANS -> Unicode)” “RomKashmiriRaina (ITRANS -> Unicode)” oder or “RomKashmiriKoul (ITRANS -> Unicode)” “RomKashmiriKoul (ITRANS -> Unicode)” Drücken Sie anschließend „OK“ Click finally on “OK” Wählen Sie nun den Keymap “VerynewDevKashmiri Now select the keymap “VerynewDevKashmiri FullReverse-Literation (Unicode -> ITRANS)” aus FullReverse-Literation (Unicode -> ITRANS)” Um Kashmiri revers: Aksharas ohne den inherenten
    [Show full text]
  • Why Are Sanskrit Play Titles Strange?
    STEPHAN HILLYER LEVITT WHY ARE SANSKRIT PLAY TITLES STRANGE? 1. Introduction Many Sanskrit play titles generally have presented problems to translators and lexicographers. On the one hand, we have such titles as Bhavabhæti’s Målatœmådhava, which is taken to refer jointly to the play’s hero Mådhava and the play’s heroine Målatœ, and which is translated, “Målatœ and Mådhava”. The translation appears to be supported by Viƒvanåtha Kaviråja’s treatise on dramaturgy, the Såhityadarpa∫a, in Såhityadarpa∫a 6.142-143 1. Or we have a title such as Kålidåsa’s Målavikågnimitra, which is also standardly taken to refer to the play’s hero Agnimitra and its heroine Målavikå. This is translated, “Målavikå and Agnimitra”. Alternately, we have such titles as Bha™™a Nåråya∫a’s Ve∫œsaμhåra, which is understood to be a Sanskrit compound mean- ing, “The Binding (saμhåra) of the Braid of Hair (ve∫œ)”. It refers to an incident in the Mahåbhårata in which Draupadœ is humiliated and vows never to braid her hair again until her humiliation has been avenged. Or we have Bhavabhæti’s Uttararåmacarita, which is under- 1. The Mirror of Composition, A Treatise on Poetical Composition, Being an English Translation of the Såhitya-Darpa∫a of Viƒwanåtha Kaviråja, transl. by J. R. Ballantyne and P. D. Mitra, Calcutta, 1875, p. 225, nos. 427-429; Såhityadarpa∫a of Viƒvanåtha Kaviråja, ed. by D. Dviveda, 1922, rpt. New Delhi, 1982, p. 330. 196 Stephan Hillyer Levitt stood to be a Sanskrit compound meaning, “The Later (uttara) Deeds (carita) of Råma (råma)”. This play is based on the last book of the Råmåya∫a and deals with events that occur after Råma returns to Ayodhyå as king.
    [Show full text]
  • Olga Tribulato Ancient Greek Verb-Initial Compounds
    Olga Tribulato Ancient Greek Verb-Initial Compounds Olga Tribulato - 9783110415827 Downloaded from PubFactory at 08/03/2016 10:10:53AM via De Gruyter / TCS Olga Tribulato - 9783110415827 Downloaded from PubFactory at 08/03/2016 10:10:53AM via De Gruyter / TCS Olga Tribulato Ancient Greek Verb-Initial Compounds Their Diachronic Development Within the Greek Compound System Olga Tribulato - 9783110415827 Downloaded from PubFactory at 08/03/2016 10:10:53AM via De Gruyter / TCS ISBN 978-3-11-041576-6 e-ISBN (PDF) 978-3-11-041582-7 e-ISBN (EPUB) 978-3-11-041586-5 Library of Congress Cataloging-in-Publication Data A CIP catalog record for this book has been applied for at the Library of Congress. Bibliografische Information der Deutschen Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliographie; detailed bibliographic data are available in the Internet at http://dnb.dnb.de. © 2015 Walter de Gruyter GmbH, Berlin/Boston Umschlagabbildung: Paul Klee: Einst dem Grau der Nacht enttaucht …, 1918, 17. Aquarell, Feder und Bleistit auf Papier auf Karton. 22,6 x 15,8 cm. Zentrum Paul Klee, Bern. Typesetting: Dr. Rainer Ostermann, München Printing: CPI books GmbH, Leck ♾ Printed on acid free paper Printed in Germany www.degruyter.com Olga Tribulato - 9783110415827 Downloaded from PubFactory at 08/03/2016 10:10:53AM via De Gruyter / TCS This book is for Arturo, who has waited so long. Olga Tribulato - 9783110415827 Downloaded from PubFactory at 08/03/2016 10:10:53AM via De Gruyter / TCS Olga Tribulato - 9783110415827 Downloaded from PubFactory at 08/03/2016 10:10:53AM via De Gruyter / TCS Preface and Acknowledgements Preface and Acknowledgements I have always been ὀψιανθής, a ‘late-bloomer’, and this book is a testament to it.
    [Show full text]
  • Loukota 2019
    UNIVERSITY OF CALIFORNIA Los Angeles The Goods that Cannot Be Stolen: Mercantile Faith in Kumāralāta’s Garland of Examples Adorned by Poetic Fancy A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Asian Languages and Cultures by Diego Loukota Sanclemente 2019 © Copyright by Diego Loukota Sanclemente 2019 ABSTRACT OF THE DISSERTATION The Goods that Cannot Be Stolen: Mercantile Faith in Kumāralāta’s Garland of Examples Adorned by Poetic Fancy by Diego Loukota Sanclemente Doctor of Philosophy in Asian Languages and Cultures University of California, Los Angeles, 2019 Professor Gregory Schopen, Co-chair Professor Stephanie J. Watkins, Co-chair This dissertation examines the affinity between the urban mercantile classes of ancient India and contemporary Buddhist faith through an examination of the narrative collection Kalpanāmaṇḍitikā Dṛṣṭāntapaṅkti (“Garland of Examples,” henceforth Kumāralāta’s Garland) by the 3rd Century CE Gandhāran monk Kumāralāta. The collection features realistic narratives that portray the religious sensibility of those social classes. I contend that as Kumāralāta’s 3rd Century was one of crisis for cities and for trade in the Indian world, his work reflects an urgent statement of the core values of ii Buddhist urban businesspeople. Kumāralāta’s stories emphasize both religious piety and the pursuit of wealth, a concern for social respectability, a strong work ethic, and an emphasis on rational decision-making. These values inform Kumāralāta’s religious vision of poverty and wealth. His vision of religious giving conjugates economic behavior and religious doctrine, and the outcome is a model that confers religious legitimation to the pursuit of wealth but also an economic outlet for religious fervor and a solid financial basis for the monastic establishment, depicted by Kumāralāta in close interdependence with the laity and, most importantly, within the same social class.
    [Show full text]
  • To Be Read by Dr Malhar Kulkarni, IIT Bombay)
    Friday, 10 t h Dec, 2010 Venue: Auditorium, School of Arts & Aesthetics, JNU Saturda, January 9, 2010 0830 - 0930: Breakfast & Registration 0930 - 1130: Session 1: Inaugural session • Dr Girish Nath Jha , Special Center for Sanskrit Studies, JNU , Welcome address, Seminar theme • Prof Bal Ram Singh , University of Massachusetts Dartmouth, USA, Scientifying the Sanskrit vs. Sanskritizing the Science(inaugural address) • Prof Ramanath Sharma , University of Hawaii, USA, Rule interaction, blocking and derivation in Pāini (Keynote speech) • Prof V. K. Jain , Registrar, JNU, Presidential address and release of seminar proceedings • Chairperson , Special Center for Sanskrit Studies, JNU, Vote of Thanks Observer: Dr Maureen P. Hall , University of Massachusetts. Dartmouth, USA 1130-1200: Tea 1200 - 1300 : Session 2: Plenary Talk 1: Chair: Prof Bal Ram Singh , University of Massachusetts, Dartmouth, USA Speaker: Prof Saroja Bhate , Bhandarkar Oriental Research Institute (BORI) Topic: Anubandha s of P āini and Exegetics of Sanskrit grammar (to be read by Dr Malhar Kulkarni, IIT Bombay) Coordinator: Dr. Girish Nath Jha , JNU Observer: Manji Bhadra , research student, JNU 1300-1400 : Lunch 1400-1600 : Session 3: Lexical Resources Chair: Manoj Jain , Scientist “E”, TDIL, MCIT • Abhinandan S P and Shrisha Rao , I.I.I.T. Bangalore o Citation Matching in Sanskrit Corpora Using Local Alignment • Diwakar Mani , Jawaharlal Nehru University o RDBMS based Lexical Resource for Indian Heritage: the case of Mah ābh ārata • Sivaja S. Nair and Amba Kulkarni , University of Hyderabad o The Knowledge Structure in Amarako śa • Malhar Kulkarni, Irawati Kulkarni, Chaitali Dangarikar and Pushpak Bhattacharyya , I.I.T. Bombay o Gloss in Sanskrit Wordnet Coordinator: Dr.
    [Show full text]
  • Power of Sanskrit
    09/03/2014 WEBPAGE: http://www.translink.profkrishna.com E-mail: [email protected] rofkrishna.com p www. वागतम ् amurthy, Singapore n n Swaagatham N. Krishnamurthy 23 February 2014 www.profkrishna.com Copyright: Dr. N. Krish Consultant, Singapore 1 Acknowledgements and Scope of talk ी गुयो नमः (shri gurub’yo’ namaha) Thanks to Singapore Dakshina Bharatha Brahmana Sabha, and Sri Srinivasan and all members of the rofkrishna.com Sabha Committee for organising this event for me to p p launch my transliteration scheme KrishnaDheva. www. Thanks also to the Sanskrit scholars here, as well as those who have come to learn how to pronounce Sanskrit correctly in English. This talk will not be a religious discourse amurthy, Singapore n This talk will not be a Sanskrit tutoring class This talk will simply be my sharing with you how to: Write down in simple English (KrishnaDheva) any Sanskrit material without special rules, and, Copyright: Dr. N. Krish Read Sanskrit correctly from KrishnaDheva. 2 1 09/03/2014 Starting off The wrong things we say: Sri should be s’ri Siva or Shiva should be s’iva rofkrishna.com p Krishna should be kr+shna www. Visaka should be vis’a’k’a’ Shuklambaradharam should be s’ukla’mbaradh’aram Vrishaba raasi should be vr+shab’a ra’s’ihi amurthy, Singapore n Kowsika gothra should be kaus’ika go’thra … and so on! Copyright: Dr. N. Krish 3 http://sanskritdocuments.org/news/subnews/NASASanskrit.txt Power of Sanskrit – a In ancient India the intention to discover truth was so consuming, that in the process, they discovered perhaps the most perfect tool for fulfilling such a search that the world has ever known – the Sanskrit language.
    [Show full text]
  • Python Module Index 9
    indictransliterationDocumentation Release 0.0.1 sanskrit-programmers Mar 28, 2021 Contents 1 Submodules 3 1.1 indic_transliteration.sanscript......................................3 1.1.1 Submodules...........................................3 1.1.1.1 indic_transliteration.sanscript.schemes........................3 1.1.1.1.1 Submodules.................................3 1.2 indic_transliteration.xsanscript......................................3 1.3 indic_transliteration.detect........................................3 1.3.1 Supported schemes.......................................4 1.4 indic_transliteration.deduplication....................................5 2 Indices and tables 7 Python Module Index 9 Index 11 i ii indictransliterationDocumentation; Release0:0:1 sanscript is the most popular submodule here. Contents 1 indictransliterationDocumentation; Release0:0:1 2 Contents CHAPTER 1 Submodules 1.1 indic_transliteration.sanscript 1.1.1 Submodules 1.1.1.1 indic_transliteration.sanscript.schemes 1.1.1.1.1 Submodules indic_transliteration.sanscript.schemes.roman indic_transliteration.sanscript.schemes.brahmi 1.2 indic_transliteration.xsanscript 1.3 indic_transliteration.detect Example usage: from indic_transliteration import detect detect.detect('pitRRIn') == Scheme.ITRANS detect.detect('pitRRn') == Scheme.HK When handling a Sanskrit string, it’s almost always best to explicitly state its transliteration scheme. This avoids embarrassing errors with words like pitRRIn. But most of the time, it’s possible to infer the encoding from the text itself.
    [Show full text]
  • Language Transliteration in Indian Languages – a Lexicon Parsing Approach
    LANGUAGE TRANSLITERATION IN INDIAN LANGUAGES – A LEXICON PARSING APPROACH SUBMITTED BY JISHA T.E. Assistant Professor, Department of Computer Science, Mary Matha Arts And Science College, Vemom P O, Mananthavady A Minor Research Project Report Submitted to University Grants Commission SWRO, Bangalore 1 ABSTRACT Language, ability to speak, write and communicate is one of the most fundamental aspects of human behaviour. As the study of human-languages developed the concept of communicating with non-human devices was investigated. This is the origin of natural language processing (NLP). Natural language processing (NLP) is a subfield of Artificial Intelligence and Computational Linguistics. It studies the problems of automated generation and understanding of natural human languages. A 'Natural Language' (NL) is any of the languages naturally used by humans. It is not an artificial or man- made language such as a programming language. 'Natural language processing' (NLP) is a convenient description for all attempts to use computers to process natural language. The goal of the Natural Language Processing (NLP) group is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person. The last 50 years of research in the field of Natural Language Processing is that, various kinds of knowledge about the language can be extracted through the help of constructing the formal models or theories. The tools of work in NLP are grammar formalisms, algorithms and data structures, formalism for representing world knowledge, reasoning mechanisms. Many of these have been taken from and inherit results from Computer Science, Artificial Intelligence, Linguistics, Logic, and Philosophy.
    [Show full text]
  • Brahminet-ITRANS Transliteration Scheme
    BrahmiNet-ITRANS Transliteration Scheme Anoop Kunchukuttan August 2020 The BrahmiNet-ITRANS notation (Kunchukuttan et al., 2015) provides a scheme for transcription of major Indian scripts in Roman, using ASCII characters. It extends the ITRANS1 transliteration scheme to cover characters not covered in the original scheme. Tables 1a shows the ITRANS mappings for vowels and diacritics (matras). Table 1b shows the ITRANS mappings for consonants. These tables also show the Unicode offset for each character. By Unicode offset, we mean the offset of the character in the Unicode range assigned to the script. For Indic scripts, logically equivalent characters are assigned the same offset in their respective Unicode codepoint ranges. For illustration, we also show the Devanagari characters corresponding to the transliteration. ka kha ga gha ∼Na, N^a ITRANS Unicode Offset Devanagari 15 16 17 18 19 a 05 अ क ख ग घ ङ aa, A 06, 3E आ, ◌ा cha Cha ja jha ∼na, JNa i 07, 3F इ, ि◌ 1A 1B 1C 1D 1E ii, I 08, 40 ई, ◌ी च छ ज झ ञ u 09, 41 उ, ◌ु Ta Tha Da Dha Na uu, U 0A, 42 ऊ, ◌ू 1F 20 21 22 23 RRi, R^i 0B, 43 ऋ, ◌ृ ट ठ ड ढ ण RRI, R^I 60, 44 ॠ, ◌ॄ ta tha da dha na LLi, L^i 0C, 62 ऌ, ◌ॢ 24 25 26 27 28 LLI, L^I 61, 63 ॡ, ◌ॣ त थ द ध न pa pha ba bha ma .e 0E, 46 ऎ, ◌ॆ 2A 2B 2C 2D 2E e 0F, 47 ए, ◌े प फ ब भ म ai 10,48 ऐ, ◌ै ya ra la va, wa .o 12, 4A ऒ, ◌ॊ 2F 30 32 35 o 13, 4B ओ, ◌ो य र ल व au 14, 4C औ, ◌ौ sha Sha sa ha aM 05 02, 02 अं 36 37 38 39 aH 05 03, 03 अः श ष स ह .m 02 ◌ं Ra lda, La zha .h 03 ◌ः 31 33 34 (a) Vowels ऱ ळ ऴ (b) Consonants Version: v1.0 (9 August 2020) NOTES: 1.
    [Show full text]
  • Tugboat, Volume 19 (1998), No. 2 115 an Overview of Indic Fonts For
    TUGboat, Volume 19 (1998), No. 2 115 the Indo-Aryan and Dravidian language families of Fonts India. Such uniformity in phonetics is reflected in orthography, which in turn enables all scripts to be transliterated through a single scheme. This unifor- An Overview of Indic Fonts for TEX mity has subsequently been reflected in the translit- Anshuman Pandey eration schemes of the Indic language/script pack- ages. 1 Introduction Most packages have their own transliteration Many scholars and students in the humanities have scheme, but these schemes are essentially variations on a single scheme, differing merely in the coding preferred TEX over other “word processors” or doc- ument preparation systems because of the ease TEX of a few vowel, nasal, and retroflex letters. Most provides them in typesetting non-Roman scripts, the of these packages accept input in one of the two availability of TEX fonts of interest to them, and the primary 7-bit transliteration schemes— ITRANS or ability TEX has in producing well-structured docu- Velthuis—or a derivative of one of them. There ments. is also an 8-bit format called CS/CSX which a few However, this is not the case amongst Indol- of these packages support. CS/CSX is described in ogists. The lack of Indic fonts for TEXandthe further detail in Section 3. perceived difficulty of typesetting them have often 2 The Fonts and Packages turned Indologists away from using TEX. Little do they realize that TEXisthe foremost tool for de- Figure 1 shows examples of the various fonts de- veloping Indic language/script documents.
    [Show full text]