A Distributed Platform for Sanskrit Processing
Total Page:16
File Type:pdf, Size:1020Kb
Recommended publications
-
The Festvox Indic Frontend for Grapheme-To-Phoneme Conversion
The Festvox Indic Frontend for Grapheme-to-Phoneme Conversion Alok Parlikar, Sunayana Sitaram, Andrew Wilkinson and Alan W Black Carnegie Mellon University Pittsburgh, USA aup, ssitaram, aewilkin, [email protected] Abstract Text-to-Speech (TTS) systems convert text into phonetic pronunciations which are then processed by Acoustic Models. TTS frontends typically include text processing, lexical lookup and Grapheme-to-Phoneme (g2p) conversion stages. This paper describes the design and implementation of the Indic frontend, which provides explicit support for many major Indian languages, along with a unified framework with easy extensibility for other Indian languages. The Indic frontend handles many phenomena common to Indian languages such as schwa deletion, contextual nasalization, and voicing. It also handles multi-script synthesis between various Indian-language scripts and English. We describe experiments comparing the quality of TTS systems built using the Indic frontend to grapheme-based systems. While this frontend was designed keeping TTS in mind, it can also be used as a general g2p system for Automatic Speech Recognition. Keywords: speech synthesis, Indian language resources, pronunciation 1. Introduction Intelligible and natural-sounding Text-to-Speech (TTS) systems exist for a number of languages of the world today. However, for low-resource, high-population languages, such as languages of the Indian subcontinent, there are very few high-quality TTS systems available. ... in models of the spectrum and the prosody. Another problem with this approach is that since each grapheme maps to a single “phoneme” in all contexts, this technique does not work well in the case of languages that have pronunciation ambiguities. We refer to this technique as “Raw Graphemes.” -
Tai Lü / ᦺᦑᦟᦹᧉ Tai Lùe Romanization: KNAB 2012
Institute of the Estonian Language KNAB: Place Names Database 2012-10-11 Tai Lü / ᦺᦑᦟᦹᧉ Tai Lùe romanization: KNAB 2012 I. Consonant characters 1 ᦀ ’a 13 ᦌ sa 25 ᦘ pha 37 ᦤ da A 2 ᦁ a 14 ᦍ ya 26 ᦙ ma 38 ᦥ ba A 3 ᦂ k’a 15 ᦎ t’a 27 ᦚ f’a 39 ᦦ kw’a 4 ᦃ kh’a 16 ᦏ th’a 28 ᦛ v’a 40 ᦧ khw’a 5 ᦄ ng’a 17 ᦐ n’a 29 ᦜ l’a 41 ᦨ kwa 6 ᦅ ka 18 ᦑ ta 30 ᦝ fa 42 ᦩ khwa A 7 ᦆ kha 19 ᦒ tha 31 ᦞ va 43 ᦪ sw’a A A 8 ᦇ nga 20 ᦓ na 32 ᦟ la 44 ᦫ swa 9 ᦈ ts’a 21 ᦔ p’a 33 ᦠ h’a 45 ᧞ lae A 10 ᦉ s’a 22 ᦕ ph’a 34 ᦡ d’a 46 ᧟ laew A 11 ᦊ y’a 23 ᦖ m’a 35 ᦢ b’a 12 ᦋ tsa 24 ᦗ pa 36 ᦣ ha A Syllable-final forms of these characters: ᧅ -k, ᧂ -ng, ᧃ -n, ᧄ -m, ᧁ -u, ᧆ -d, ᧇ -b. See also Note D to Table II. II. Vowel characters (ᦀ stands for any consonant character) C 1 ᦀ a 6 ᦀᦴ u 11 ᦀᦹ ue 16 ᦀᦽ oi A 2 ᦰ ( ) 7 ᦵᦀ e 12 ᦵᦀᦲ oe 17 ᦀᦾ awy 3 ᦀᦱ aa 8 ᦶᦀ ae 13 ᦺᦀ ai 18 ᦀᦿ uei 4 ᦀᦲ i 9 ᦷᦀ o 14 ᦀᦻ aai 19 ᦀᧀ oei B D 5 ᦀᦳ ŭ,u 10 ᦀᦸ aw 15 ᦀᦼ ui A Indicates vowel shortness in the following cases: ᦀᦲᦰ ĭ [i], ᦵᦀᦰ ĕ [e], ᦶᦀᦰ ăe [ ∎ ], ᦷᦀᦰ ŏ [o], ᦀᦸᦰ ăw [ ], ᦀᦹᦰ ŭe [ ɯ ], ᦵᦀᦲᦰ ŏe [ ]. -
Phonotactic frequencies in Marathi
Phonotactic frequencies in Marathi KELLY HARPER BERKSON1 and MAX NELSON Indiana University Bloomington, Indiana ABSTRACT: Breathy sonorants are cross-linguistically rare, occurring in just 1% of the languages indexed in the UCLA Phonological Segment Inventory Database (UPSID) and 0.2% of those in the PHOIBLE database (Moran, McCloy and Wright 2014). Prior work has shed some light on their acoustic properties, but little work has investigated the language-internal distribution of these sounds in a language where they do occur, such as Marathi (Indic, spoken mainly in Maharashtra, India). With this in mind, we present an overview of the phonotactic frequencies of consonants, vowels, and CV-bigrams in the Marathi portion of the EMILLE/CIIL corpus. Results of a descriptive analysis show that breathy sonorants are underrepresented, making up fewer than 1% of the consonants in the 2.2 million-word corpus, and that they are disfavored in back vowel contexts. 1 Introduction Languages are rife with gradience. While some sounds, morphological patterns, and syntactic structures occur with great frequency in a language, others—though legal—occur rarely. The same is observed crosslinguistically: some elements and structures are found in language after language, while others are quite rare. For at least the last several decades, cognitive and psycholinguistic investigation of language processing has revealed that language users are sensitive not only to categorical phonological knowledge—such as, for example, the sounds and sound combinations that are legal and illegal in one’s language— but also to this kind of gradient information about phonotactic probabilities. 
This has been demonstrated in a variety of experimental tasks (Ellis 2002; Frisch, Large, and Pisoni 2000; Storkel and Maekawa 2005; Vitevitch and Luce 2005), which indicate that gradience influences language production, comprehension, and processing phenomena (Ellis 2002; Vitevitch, Armbrüster, and Chu 2004) for infants as well as adults (Jusczyk and Luce 1994). -
Neural Machine Translation System of Indic Languages - an Attention Based Approach
Neural Machine Translation System of Indic Languages - An Attention based Approach Parth Shah and Vishvajit Bakrola, Uka Tarsadia University, Bardoli, India [email protected] [email protected] Abstract—Neural machine translation (NMT) is a recent and effective technique which has led to remarkable improvements over conventional machine translation techniques. The proposed neural machine translation model, developed for the Gujarati language, uses an encoder-decoder with an attention mechanism. In India, almost all languages originated from their ancestral language, Sanskrit. They share inevitable similarities, including lexical and named-entity similarity. Translating into Indic languages is always a challenging task. In this paper, we present a neural machine translation (NMT) system that can efficiently translate Indic languages such as Hindi and Gujarati, which together cover more than 58.49 percent of the total speakers in the country. We compare the performance of our NMT model using automatic evaluation metrics such as BLEU, perplexity and TER. ... intervention. This task can be effectively done using machine translation. Machine Translation (MT) is described as the task of computationally translating human spoken or natural language text or speech from one language to another with minimum human intervention. Machine translation aims to generate translations which have the same meaning as the source sentence and are grammatically correct in the target language. Initial work on MT started in the early 1950s [2], and has advanced rapidly since the 1990s due to the availability of more computational capacity and training data. Thereafter, a number of approaches have been proposed to achieve more and more accurate ma- -
An Introduction to Indic Scripts
An Introduction to Indic Scripts Richard Ishida W3C [email protected] HTML version: http://www.w3.org/2002/Talks/09-ri-indic/indic-paper.html PDF version: http://www.w3.org/2002/Talks/09-ri-indic/indic-paper.pdf Introduction This paper provides an introduction to the major Indic scripts used on the Indian mainland. Those addressed in this paper include specifically Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and Telugu. I have used XHTML encoded in UTF-8 for the base version of this paper. Most of the XHTML file can be viewed if you are running Windows XP with all associated Indic font and rendering support, and the Arial Unicode MS font. For examples that require complex rendering in scripts not yet supported by this configuration, such as Bengali, Oriya, and Malayalam, I have used non-Unicode fonts supplied with Gamma's Unitype. To view all fonts as intended without the above you can view the PDF file whose URL is given above. Although the Indic scripts are often described as similar, there is a large amount of variation at the detailed implementation level. To provide a detailed account of how each Indic script implements particular features on a letter by letter basis would require too much time and space for the task at hand. Nevertheless, despite the detail variations, the basic mechanisms are to a large extent the same, and at the general level there is a great deal of similarity between these scripts. It is certainly possible to structure a discussion of the relevant features along the same lines for each of the scripts in the set. -
A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection
No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection Debanjana Kar* (Dept. of CSE, IIT Kharagpur, India), Mohit Bhardwaj* (Dept. of CSE, IIIT Delhi, India), Suranjana Samanta (IBM Research, Bangalore, India), Amar Prakash Azad (IBM Research, Bangalore, India) [email protected] [email protected] [email protected] [email protected] Abstract—The sudden widespread menace created by the present global pandemic COVID-19 has had an unprecedented effect on our lives. Man-kind is going through humongous fear and dependence on social media like never before. Fear inevitably leads to panic, speculations, and spread of misinformation. Many governments have taken measures to curb the spread of such misinformation for public well being. Besides global measures, to have effective outreach, systems for demographically local languages have an important role to play in this effort. Towards this, we propose an approach to detect fake news about COVID-19 early on from social media, such as tweets, for multiple Indic-Languages besides English. In addition, we also create an annotated dataset of Hindi and Bengali tweets for fake news detection. We propose a BERT based model augmented with additional relevant features extracted from Twitter to identify fake tweets. Fig. 1. Two examples of COVID-19 related tweets; the left one shows an example of a fake tweet, which has a high number of retweet and like counts; the right one shows an example of a tweet with real facts, which has a low number of retweets. The number of retweets and likes is proportional to the popularity of a tweet. -
The Formal Kharoṣṭhī Script from the Northern Tarim Basin in Northwest
Acta Orientalia Hung. 73 (2020) 3, 335–373 DOI: 10.1556/062.2020.00015 The Formal Kharoṣṭhī script from the Northern Tarim Basin in Northwest China may write an Iranian language1 FEDERICO DRAGONI, NIELS SCHOUBBEN and MICHAËL PEYROT* Leiden University Centre for Linguistics, Universiteit Leiden, Postbus 9515, 2300 RA Leiden, The Netherlands E-mail: [email protected]; [email protected]; *Corresponding Author: [email protected] Received: February 13, 2020 • Accepted: May 25, 2020 © 2020 The Authors ABSTRACT Building on collaborative work with Stefan Baums, Ching Chao-jung, Hannes Fellner and Georges-Jean Pinault during a workshop at Leiden University in September 2019, tentative readings are presented from a manuscript folio (T II T 48) from the Northern Tarim Basin in Northwest China written in the thus far undeciphered Formal Kharoṣṭhī script. Unlike earlier scholarly proposals, the language of this folio cannot be Tocharian, nor can it be Sanskrit or Middle Indic (Gāndhārī). Instead, it is proposed that the folio is written in an Iranian language of the Khotanese-Tumšuqese type. Several readings are proposed, but a full transcription, let alone a full translation, is not possible at this point, and the results must consequently remain provisional. KEYWORDS Kharoṣṭhī, Formal Kharoṣṭhī, Khotanese, Tumšuqese, Iranian, Tarim Basin 1 We are grateful to Stefan Baums, Chams Bernard, Ching Chao-jung, Doug Hitch, Georges-Jean Pinault and Nicholas Sims-Williams for very helpful discussions and comments on an earlier draft. We also thank the two peer-reviewers of the manuscript. One of them, Richard Salomon, did not wish to remain anonymous, and especially his observation on the possible relevance of Khotan Kharoṣṭhī has proved very useful. -
Designing Devanagari Type
Designing Devanagari type The effect of technological restrictions on current practice Kinnat Sóley Lydon BA degree final project Iceland Academy of the Arts Department of Design and Architecture Designing Devanagari type: The effect of technological restrictions on current practice Kinnat Sóley Lydon Final project for a BA degree in graphic design Advisor: Gunnar Vilhjálmsson Graphic design Department of Design and Architecture December 2015 This thesis is a 6 ECTS final project for a Bachelor of Arts degree in graphic design. No part of this thesis may be reproduced in any form without the express consent of the author. Abstract This thesis explores the current process of designing typefaces for Devanagari, a script used to write several languages in India and Nepal. The typographical needs of the script have been insufficiently met through history and many Devanagari typefaces are poorly designed. As the various printing technologies available through the centuries have had drastic effects on the design of Devanagari, the thesis begins with an exploration of the printing history of the script. Through this exploration it is possible to understand which design elements constitute the script, and which ones are simply legacies of older technologies. Following the historic overview, the character set and unique behavior of the script is introduced. The typographical anatomy is analyzed, while pointing out specific design elements of the script. Although recent years has seen a rise of interest on the subject of Devanagari type design, literature on the topic remains sparse. This thesis references books and articles from a wide scope, relying heavily on the works of Fiona Ross and her extensive research on non-Latin typography. -
Töwkhön, the Retreat of Öndör Gegeen Zanabazar As a Pilgrimage Site Zsuzsa Majer Budapest
Töwkhön, the Retreat of Öndör Gegeen Zanabazar as a Pilgrimage Site Zsuzsa Majer Budapest The present article describes one of the revived Mongolian monasteries, having special significance because it was once the retreat and workshop of Öndör Gegeen Zanabazar, the main figure and first monastic head of Mongolian Buddhism. Situated in an enchanted place, it is one of the most frequented pilgrimage sites in Mongolia today. Information on the monastery is to be found mainly in books on Mongolian architecture and historical sites, although there are also some scattered data on the history of its foundation in publications on Öndör Gegeen’s life. In his atlas which shows 941 monasteries and temples that existed in the past in Mongolia, Rinchen marked the site on his map of the Öwörkhangai monasteries as Töwkhön khiid (No. ... up to the site is not always passable even by jeep, especially in winter or after rain. Visitors can reach the site on horseback or on foot even when it is not possible to drive up to the monastery. In 2004 Töwkhön was included on the list of the World’s Cultural Heritage Sites thanks to its cultural importance and the natural beauties of the Orkhon River Valley area. During the purges in 1937–38, there were mass executions of lamas, the 1000 Mongolian monasteries which then existed were closed and most of them totally destroyed. Religion was revived only after 1990, with the very few remaining temple buildings restored and new temples erected at the former sites of the ruined monasteries or at the new province and subprovince centers. -
Final Proposal to Encode Nandinagari in Unicode
L2/17-162 2017-05-05 Final proposal to encode Nandinagari in Unicode Anshuman Pandey [email protected] May 5, 2017 1 Introduction This is a proposal to encode the Nandinagari script in Unicode. It supersedes the following documents: • L2/13-002 “Preliminary Proposal to Encode Nandinagari in ISO/IEC 10646” • L2/16-002 “Proposal to encode the Nandinagari script in Unicode” • L2/16-310 “Proposal to encode the Nandinagari script in Unicode” • L2/17-119 “Towards an encoding model for Nandinagari conjuncts” It incorporates comments regarding previous proposals made in: • L2/16-037 “Recommendations to UTC #146 January 2016 on Script Proposals” • L2/16-057 “Comments on L2/16-002 Proposal to encode Nandinagari” • L2/16-216 “Recommendations to UTC #148 August 2016 on Script Proposals” • L2/17-153 “Recommendations to UTC #151 May 2017 on Script Proposals” • L2/17-117 “Proposal to encode a nasal character in Vedic Extensions” Major changes since L2/16-310 include: • Expanded description of the headstroke and its behavior (see section 3.2). • Clarification of encoding model for consonant conjuncts (see section 5.4). • Removal of digits and proposed unification with Kannada digits (see section 4.9). • Re-analysis of ‘touching’ conjuncts as variant forms that may be controlled using fonts. • Removal of ardhavisarga and other characters that require additional research (see section 4.10). • Identification of a pr̥ ṣṭhamātrā, which is not included in the proposed repertoire (see section 4.10). • Proposed reallocation of a Vedic nasal letter to the ‘Vedic Extensions’ block (see L2/17-117). • Revision of Indic position category for vowel signs. -
3 Writing Systems
3 Writing Systems PETER T. DANIELS Chapters on writing systems are very rare in surveys of linguistics – Trager (1974) and Mountford (1990) are the only ones that come to mind. For a century or so – since the realization that unwritten languages are as legitimate a field of study, and perhaps a more important one, than the world’s handful of literary languages – writing systems were (rightly) seen as secondary to phonological systems and (wrongly) set aside as unworthy of study or at best irrelevant to spoken language. The one exception was I. J. Gelb’s attempt (1952, reissued with additions and corrections 1963) to create a theory of writing informed by the linguistics of his time. Gelb said that what he wrote was meant to be the first word, not the last word, on the subject, but no successors appeared until after his death in 1985.1 Although there have been few linguistic explorations of writing, a number of encyclopedic compilations have appeared, concerned largely with the historical development and diffusion of writing,2 though various popularizations, both new and old, tend to be less than accurate (Daniels 2000). Daniels and Bright (1996; The World’s Writing Systems: hereafter WWS) includes theoretical and historical materials but is primarily descriptive, providing for most contemporary and some earlier scripts information (not previously gathered together) on how they represent (the sounds of) the languages they record. This chapter begins with a historical-descriptive survey of the world’s writing systems, and elements of a theory of writing follow. -
ISO/IEC JTC1/SC2/WG2 N3383R L2/08-050R
ISO/IEC JTC1/SC2/WG2 N3383R L2/08-050R 2008-03-06 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation internationale de normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Summary proposal to encode characters for Vedic in the BMP of the UCS Source: Ireland and UC Berkeley Script Encoding Initiative (Universal Scripts Project) Authors: Michael Everson and Peter Scharf Status: National Body and Liaison Contribution Action: For consideration by JTC1/SC2/WG2 and UTC Replaces: N3366 Date: 2008-03-06 1. Introduction. This document reflects the consensus achieved after intense discussion of the Vedic characters proposed by CDAC in India on the one hand by Everson, Scharf, et al. on the other, in a large number of documents presented since 2006. The repertoire given here comprises a set of characters which all parties agree should be encoded to enable India and Western Vedic specialists to represent the texts and critical apparatus of this important field of study. A few outstanding issues remain but it is agreed that these might be left for further study. The characters presented in this document are, in our opinion, well-documented and appropriate for encoding. It is worthwhile reviewing some of the issues which were previously controversial. 2. PRISHTHAMATRA E. It is our view that the best way to handle this is to encode a single vowel sign for e that can be used in combination with other characters for o, ai, and au, as presented in N3235R. We do not believe that the spoofing danger of this character is as great as has been suggested: one could recommend to registrars, for instance, that the character can be restricted from use in IDN.