Deriving Word Prosody from Orthography in Hindi

Somnath Roy
Centre for Linguistics, Jawaharlal Nehru University, New Delhi-110067

S Bandyopadhyay, D S Sharma and R Sangal. Proc. of the 14th Intl. Conference on Natural Language Processing, pages 2–12, Kolkata, India. December 2017. © 2016 NLP Association of India (NLPAI)

Abstract

This study proposes a word prosody converter (WPC), which takes Hindi graphemes as input and yields as output a sequence of phonemes with syllable boundaries and stress marks. The WPC has two submodules connected linearly. The first submodule is a grapheme to phoneme (G2P) converter. The output of the G2P converter is fed to the second submodule, which handles the prosody-specific tasks. The second submodule consists of two finite state machines (FSMs). The first FSM performs syllabification and the second assigns prosodic labels to the syllabified strings. The prosodic labels are translated into stressed and unstressed components using rules specific to the language. This study proposes a novel rule-based system which uses non-linear phonological rules, with provision for recursive foot structure, for G2P conversion and prosodic labeling. The implementation [1] of the proposed rules outperforms G2P models trained with state-of-the-art data-driven techniques such as the joint sequence model (JSM) and LSTM.

[1] https://github.com/somnat/Hindi-Word-Prosody-Hindi-G2P

1 Introduction

A dictionary is an essential component of a text-to-speech (TTS) system and of an automatic speech recognition (ASR) system. These systems are open in nature and can receive an input word which is not present in the dictionary. Such input words are called out-of-vocabulary (OOV) words. Therefore, a G2P converter is required, which can generate the pronunciation of the OOV words. A G2P converter can be a rule-based or a data-driven system. A rule-based G2P converter relies on expert knowledge (i.e., the rule-set designed by an expert). However, these rule-sets may not be exhaustive enough to capture many language-specific properties such as word morphology and stress pattern (Pagel et al., 1998). Therefore, researchers nowadays rely on state-of-the-art machine learning (data-driven) techniques for developing a G2P model. A data-driven system is trained using a manually annotated dataset, which contains words and their phonemic sequences. These datasets are language specific in nature. The machine learning algorithm learns the phonemic sequence for words based on probabilistic or geometric calculations. These calculations vary across machine learning approaches. In data-driven approaches, one need not worry about language-specific complexities such as word morphology and stress pattern. The algorithm automatically captures these patterns in the generated model.

A data-driven G2P conversion process is broadly categorized into three subprocesses: i) sequence alignment, ii) model training and iii) decoding (for details see Novak et al., 2012). Many data-driven techniques are available for G2P conversion. The important ones are decision tree (Black et al., 1998), Conditional Random Field (Wang and King, 2011), Hidden Markov Model (Taylor, 2005), Joint-Sequence techniques (Bisani and Ney, 2008) and Recurrent Neural Network (Rao et al., 2015).

The function of a word prosody model is similar to that of a grapheme to phoneme (G2P) converter. In addition, it describes syllable boundaries and predicts stressed syllables in a word. The schematic diagram of the word prosody model is shown in Fig 1.

Figure 1: Schematic Diagram of Word Prosody Model

The accuracy of a word prosody module for Hindi depends on an efficient solution to two sub-problems well known in Hindi phonology: schwa deletion and the pronunciation of the diacritic marks anusvara and anunasika (Ohala, 1983; Pandey, 1989; Pandey, 1990; Narasimhan et al., 2004; Pandey, 2014). Ohala used linear phonological rules to derive the surface phonemic form. Pandey showed the superiority of non-linear phonological rules over linear ones. The motivation for the current work is stated below.

a. In the past, Hindi G2P converters were implemented in the context of speech synthesis (Bali et al., 2004), (Narasimhan et al., 2004) and (Choudhury, 2003). However, these works have given only partial attention to anusvara/anunasika disambiguation. Pandey (2014) describes it as the problem of Hindi orthography.

b. These G2P converters are based on the linear phonological rules proposed by (Ohala, 1983), with the exception of (Pandey, 2014). Non-linear phonological rules have advantages over linear ones, as explained below (Bernhardt and Gilbert, 1992).
i. Non-linear rules capture both prosodic and segmental information.
ii. The hierarchical representation used in the non-linear framework captures more information; this results in a compact rule set.

c. The syllable is known to be a better unit for Hindi speech synthesis (Bellur et al., 2011; Kishore and Black, 2003). Therefore, a Hindi text-to-speech (TTS) system needs an automatic syllabification module. Automatic syllabification would be more useful if it could also predict the stressed syllables in words of natural speech, as this would facilitate synthesis.

d. The usefulness of the syllable as the basic linguistic unit in speech recognition systems has been explored for English (Ganapathiraju et al., 2001) and Tamil (Lakshmi and Murthy, 2006). Similar work for Hindi requires software for syllabification. This work fulfills that need.

1.1 Main Contributions

• The WPC does not require information about morphological boundaries. The proposed rules take into account the syllable patterns of compound, derived and inflected words.

• The syllabification and syllable labeling processes follow finite state machines. Faultless syllabification and syllable labeling at the underlying phonemic form yield better accuracy in schwa deletion and in the pronunciation of the diacritics anusvara and anunasika. Syllabification at the underlying phonemic form is called I-level syllabification in this work.

• The rules proposed in this study assume extrametricality of the foot, unlike the extrametricality of the syllable proposed in (Pandey, 2014). The contention is that stress can be predicted elegantly using the notion of the extrametrical foot (McCarthy and Prince, 1990; Crowhurst, 1994). Also, the directionality is LR (left to right), unlike the RL (right to left) directionality used in (Pandey, 2014).

• Anusvara and anunasika are used interchangeably in Hindi. Therefore, both anusvara and anunasika are mapped to a hypothetical phoneme X at the underlying phonemic form. The choice between a homo-organic nasal consonant and a nasalized vowel for phoneme X is based on the minimum moraic weight difference between the syllable containing phoneme X and the next syllable. The moraic weight difference is calculated after schwa deletion and re-syllabification. The proposed mapping rule removes almost all of the pronunciation ambiguity related to anusvara and anunasika.

The rest of this paper is organized as follows. Section 2 describes the salient points of metrical phonology relevant to this work. Section 3 describes the process of syllabification and syllable labeling. Section 4 describes foot formation. Section 5 describes schwa deletion and re-syllabification. Section 6 describes the observations and rules for the pronunciation of anusvara and anunasika. Section 7 describes the data-driven G2P systems implemented for Hindi. Section 8 compares the performance of the current system with data-driven systems and previous rule-based implementations. Section 9 describes the rules for the prediction of stressed syllables and reports the accuracy of the current system for syllabification and stress prediction. The conclusion and limitations are given in Section 10.

2 Theoretical Background

Metrical phonology is based on a non-linear arrangement of the constituents of a phrase (Liberman and Prince, 1977; Selkirk, 1980; Hayes, 1980; Selkirk, 1986; Hayes, 1995; Apoussidou, 2006). The non-linear arrangement is realized in the form of a tree whose nodes are the constituents of a phrase. The constituents are syllable, foot, phonological word, phonological phrase and intonational phrase. The syllable is the lowest unit in the hierarchy and is dominated by the foot, which in turn is dominated by the phonological word. The higher units, such as phonological phrase and intonational phrase, are not relevant to the current work (for clarity see fig 4 - 9). The syllable functions as a domain for segmental phonological rules. In non-linear phonology, the rules are written on the basis of interaction among syllables under the domain of higher constituents. A syllable has an obligatory rhyme and an optional coda. Syllables are also described by their moraic weight in quantity-sensitive languages (… and Kleinhenz, 1999).

3 I-Level Syllabification

I-level syllabification is derived from the underlying phonemic form (UPF), which in turn is derived from orthography using the following mapping rules.

i. Each consonant in Devanagari script is inherently associated with the mid-central vowel called schwa or its lower counterpart "a".
ii. If a consonant is followed by a vowel diacritic mark, or by the diacritic called halant, the inherent schwa is deleted.
iii. The inherent schwa is not realized for a consonant in word-final position.
iv. Two or three consonants together can form a ligature.
v. A short vowel in word-final position is lengthened.

The following examples illustrate the derivation of the UPF from orthography:

/kml/ → k@m@l (Lotus)
/kmAl/ → k@ma:l

The process of syllabification in Hindi was explored by (Ohala, 1983) and (Pandey, 1989; Pandey, 2014). Their analyses do not address the maximal onset principle for syllabification. The present analysis follows the maximal onset principle (Selkirk, 1984; Selkirk, 1981). The maximal onset principle is a sufficiency condition, as demonstrated by the following examples.
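As a rough illustration of the orthography-to-UPF mapping rules (i) to (v) listed in Section 3, the following Python sketch derives a UPF string from a Devanagari word. It is a minimal sketch and not the paper's implementation: the function name `to_upf`, the small transliteration tables, and the choice to lengthen only word-final "i" and "u" are assumptions made here for illustration. Independent vowel letters, anusvara and anunasika are deliberately left out.

```python
# Hypothetical, deliberately tiny tables (assumptions for illustration only).
CONSONANTS = {"क": "k", "म": "m", "ल": "l", "र": "r", "त": "t", "न": "n"}
VOWEL_SIGNS = {"ा": "a:", "ि": "i", "ी": "i:", "ु": "u", "ू": "u:"}
HALANT = "्"
SCHWA = "@"  # the excerpt writes schwa as "@"
LENGTHEN = {"i": "i:", "u": "u:"}  # assumed reading of rule (v)


def to_upf(word: str) -> str:
    """Map a Devanagari word to an underlying phonemic form (UPF)."""
    chars = list(word)
    last = len(chars) - 1
    phonemes = []
    for idx, ch in enumerate(chars):
        if ch in CONSONANTS:
            phonemes.append(CONSONANTS[ch])
            if idx == last:
                # Rule (iii): no inherent schwa on a word-final consonant.
                continue
            nxt = chars[idx + 1]
            if nxt in VOWEL_SIGNS or nxt == HALANT:
                # Rule (ii): a vowel sign or halant removes the inherent schwa.
                continue
            # Rule (i): otherwise the consonant carries its inherent schwa.
            phonemes.append(SCHWA)
        elif ch in VOWEL_SIGNS:
            vowel = VOWEL_SIGNS[ch]
            if idx == last:
                # Rule (v): a short word-final vowel is lengthened.
                vowel = LENGTHEN.get(vowel, vowel)
            phonemes.append(vowel)
        elif ch == HALANT:
            # Rule (iv): the halant joins consonants into a ligature; it has
            # no phoneme of its own, so it is skipped here.
            continue
        else:
            raise ValueError(f"character not covered by this sketch: {ch!r}")
    return "".join(phonemes)


print(to_upf("कमल"))   # k@m@l
print(to_upf("कमाल"))  # k@ma:l
```

The two printed forms match the examples given in the text. A fuller version would need a complete grapheme table and a decision on how rule (v) interacts with word-final schwa, which the excerpt does not spell out.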