Word Normalization in Indian Languages

Prasad Pingali and Vasudeva Varma
Language Technologies Research Centre, IIIT, Hyderabad, India

In Proceedings of the 4th International Conference on Natural Language Processing (ICON 2005), December 2005.
Report No: IIIT/TR/2008/81. Centre for Search and Information Extraction Lab, International Institute of Information Technology, Hyderabad - 500 032, India. June 2008.

Abstract

Indian language words face spelling standardization issues, resulting in multiple spelling variants for the same word. The major reasons for this phenomenon are the phonetic nature of Indian languages and their multiple dialects, the transliteration of proper names, words borrowed from foreign languages, and the phonetic variety of the Indian language alphabets. Such spelling variations create difficulties for web Information Retrieval applications built for Indian languages, since finding relevant documents requires more than an exact string match. In this paper we examine the characteristics of such spelling variations and explore how to model them computationally. We compare a set of language-specific rules with several approximate string matching algorithms in our evaluation.

1 Problem statement

India is rich in languages, boasting not only the indigenous sprouting of Dravidian and Indo-Aryan tongues, but the absorption of Middle-Eastern and European influences as well. This richness is also evident in the written form of the languages. A remarkable feature of the alphabets of India is the manner in which they are organised: according to phonetic principles, unlike the Roman alphabet, which has an arbitrary sequence of letters. This richness has also led to a set of problems over time. The variety in the alphabet, the different dialects and the influence of foreign languages have resulted in spelling variations of the same word. Some such variations can be treated as writing errors, while others are too widely used to be called errors. In this paper we consider all types of spelling variations of a word in the language.

This study of Indian language words is part of a web search engine project for Indian languages. Real web data can be very problematic, yet Information Retrieval and web search systems rarely mention the problems of real-world web data explicitly. When comparing strings from the web, they assume the data to be homogeneous and comparable across different sources. In practice, real web data contains many string variations that need to be handled. Such variations occur far more often in Indian languages, for various reasons. The reasons we could identify are the phonetic nature of Indian languages, the larger size of the alphabet, the lack of standardization in the use of that alphabet, words entering from foreign languages such as English and Persian and, last but not least, variations in the transliteration of proper names.

To quantify these issues, we randomly picked 10 Hindi and 10 Telugu news articles and manually counted the proper names and the words borrowed from English in them. On average, 5.19% of the words were proper names in the Hindi documents and 4.8% in the Telugu documents. On average, 5.73% of the words were borrowed from English in the Hindi documents and 6.9% in the Telugu documents. Therefore, apart from native Indian language words, we should also be able to handle proper names and English words transliterated into Indian languages, since they form a substantial percentage of words. To give an idea of the data problem, the following words were found on various websites.

अँगरेजी, अँगरेजी, अँगेजी, अँगेजी, अंगरेजी, अंगरेजी, अंगेजी, अंगेजी

अनतरराषरीय, अनतरराषरीय, अनतरारिषरय, अनतरारषरीय, अंतरराषरीय, अंतराषर रीय, अंतरराषरीय, अनतरािषरय, अनतराषरीय

We found empirically that website authors disagree a great deal about the spelling of words. Out of 278,529 words, 65,774 had variations. These 65,774 word forms are variants of 28,038 distinct words. Therefore about 23.61% of the Indian language words (65,774 / 278,529) had at least one variant. The average number of variations per such word is about 2.34. We also found that the more websites are studied, the greater the disagreement. This phenomenon was observed in other Indian languages as well, such as Telugu, Tamil and Bengali. Given such a large percentage of words, it becomes important to study the characteristics of these spelling variations and to see whether we can model them computationally.

We propose two solutions to this problem and compare them. One solution is a set of language-specific rules that handle such variations, which could result in more precise performance. However, such a solution does not scale to new languages, since a separate program needs to be written for each Indian language. The other solution is to use approximate string matching algorithms. Such algorithms are easily extensible to other languages but may not perform as well as language-specific rules in terms of precision.

2 Rule based algorithm

In this section we discuss an algorithm that uses a set of language-specific rules, taking Hindi as an example. The algorithm normalizes words by mapping the alphabet of the given language L into another alphabet L′, where L′ ⊂ L. Before discussing the actual rules, we introduce the chandra-bindu, bindu, nukta, halanth, maatra and chandra of the Hindi alphabet, which the rules refer to. A chandra-bindu is a half-moon with a dot, which has the function of vowel nasalization. A bindu (also called anusvar) is a dot written on top of consonants which achieves consonant nasalization. A nukta is a dot under a consonant which produces sounds mostly used in words of Persian and Arabic origin. A halanth is a consonant reducer. A maatra is a vowel character that occurs in combination with a consonant. A chandra is a special character which achieves the function of vowel rounding, such as the sound of 'o' in the word 'documentary'. The following rules are applied to words before two words are compared, in order to achieve normalization.

| If found | Map to | Examples |
| chandra-bindu | bindu | अँगेज, अंगेज |
| consonant + nukta | corresponding consonant | अंगेज, अंगेज |
| consonant + halanth | corresponding consonant | अँगरेज, अँगेज |
| longer vowel maatra | equivalent shorter vowel maatra | अनतरारिषरय, अनतरारषरीय |
| character + chandra | corresponding character | डॉकयमु टे री, डाकयमु टे री |

Table 1: Rules applied to achieve normalization in Hindi.

While we employed these basic rules, we also tried using unaspirated consonants in place of their respective aspirated ones. We found that this operation yielded little gain in recall and degraded precision. Therefore we dropped this feature from our algorithm.

3 Approximate string matching algorithms

We used a set of approximate string matching algorithms from the SecondString project (found at http://secondstring.sourceforge.net) to evaluate to what extent they help solve the problem of normalizing Indian language words. We briefly discuss each of these algorithms in this section before proceeding to the experimental results.

Approximate string matching algorithms decide whether two given strings match by using a distance function between the two strings. Distance functions map a pair of strings s and t to a real number r, where a smaller value of r indicates greater similarity between s and t. Similarity functions are analogous, except that larger values indicate greater similarity; at some risk of confusion to the reader, we will use these terms interchangeably, depending on which interpretation is most natural.

One important class of distance functions is edit distances, in which the distance is the cost of the best sequence of edit operations that converts s to t. Typical edit operations are character insertion, deletion and substitution, and each operation must be assigned a cost. We consider two edit-distance functions. The simple Levenshtein distance assigns a unit cost to all edit operations. As an example of a more complex, well-tuned distance function, we also consider the Monge-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durban et al. 1998) with particular cost parameters, scaled to the interval [0,1].

A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings. Given strings $s = a_1 \ldots a_K$ and $t = b_1 \ldots b_L$, define a character $a_i$ in $s$ to be common with $t$ if there is a $b_j = a_i$ in $t$ such that $i - H \le j \le i + H$, where $H = \min(|s|, |t|) / 2$. Let $s' = a'_1 \ldots a'_{K'}$ be the characters in $s$ which are common with $t$ (in the same order they appear in $s$) and let $t' = b'_1 \ldots b'_{L'}$ be analogous; now define a transposition for $s'$, $t'$ to be a position $i$ such that $a'_i \ne b'_i$. Let $T_{s',t'}$ be half the number of transpositions for $s'$ and $t'$. The Jaro similarity metric for $s$ and $t$ is

$$\mathrm{Jaro}(s, t) = \frac{1}{3} \left( \frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| - T_{s',t'}}{|s'|} \right)$$

A variant of this metric due to Winkler (1999) also uses the length P of the longest common prefix of s and t.

As shown in figure 1, we find that the Indian Language Normalizer algorithm which is the set of
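
The rules of Table 1 and two of the string measures described in Section 3 can be illustrated with a minimal Python sketch. This is not the authors' implementation, nor the SecondString code. The function names (`normalize_hindi`, `levenshtein`, `jaro`) are hypothetical, the choice of which Unicode Devanagari code points count as "longer" maatras and chandra forms is an assumption, and the Jaro function pairs repeated characters greedily one-to-one, which the definition above does not specify.

```python
import unicodedata

# Unicode Devanagari code points. Which maatras count as "longer" and which
# chandra forms are mapped is an assumption of this sketch, not the paper's list.
_RULES = str.maketrans({
    "\u0901": "\u0902",  # chandra-bindu -> bindu
    "\u093C": None,      # consonant + nukta -> consonant (drop the nukta)
    "\u094D": None,      # consonant + halanth -> consonant (drop the halanth)
    "\u0940": "\u093F",  # long i maatra -> short i maatra
    "\u0942": "\u0941",  # long u maatra -> short u maatra
    "\u0949": "\u093E",  # aa maatra with chandra -> aa maatra
    "\u0911": "\u0906",  # vowel aa with chandra -> vowel aa
})


def normalize_hindi(word: str) -> str:
    """Apply the Table 1 style rules to one word (hypothetical helper)."""
    # NFD first splits precomposed nukta letters into consonant + nukta.
    return unicodedata.normalize("NFD", word).translate(_RULES)


def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute a by b
        prev = curr
    return prev[-1]


def jaro(s: str, t: str) -> float:
    """Jaro similarity using the window H = min(|s|, |t|) / 2 from the text."""
    if not s or not t:
        return 0.0
    h = min(len(s), len(t)) // 2
    used = [False] * len(t)
    s_common = []
    for i, a in enumerate(s):
        # Greedy one-to-one pairing inside the window (an assumption).
        for j in range(max(0, i - h), min(len(t), i + h + 1)):
            if not used[j] and t[j] == a:
                used[j] = True
                s_common.append(a)
                break
    if not s_common:
        return 0.0
    t_common = [b for b, u in zip(t, used) if u]
    # T is half the number of positions where the common characters differ.
    half_transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    m = len(s_common)
    return (m / len(s) + m / len(t) + (m - half_transpositions) / m) / 3
```

Under these assumptions, two spellings are treated as the same word when `normalize_hindi` maps them to the same string, whereas `levenshtein` and `jaro` return a score that still has to be compared against a threshold chosen by the user.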