Word Normalization in Indian Languages
Recommended publications
WO 2010/131256 A1
(12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT). (19) World Intellectual Property Organization, International Bureau. (10) International Publication Number: WO 2010/131256 A1. (43) International Publication Date: 18 November 2010 (18.11.2010). (51) International Patent Classification: G06F 3/01 (2006.01). (21) International Application Number: PCT/IN2010/000052. (22) International Filing Date: 29 January 2010 (29.01.2010). (25) Filing Language: English. (26) Publication Language: English. (30) Priority Data: 974/DEL/2009, 13 May 2009 (13.05.2009), IN. (72) Inventor; and (71) Applicant: MEHRA, Rajesh [IN/IN]; H-39, Tagore Path, Bani Park, Jaipur 302 016, Rajasthan (IN). Designated States (national): AO, AT, AU, AZ, BA, BB, BG, BH, BR, BW, BY, BZ, CA, CH, CL, CN, CO, CR, CU, CZ, DE, DK, DM, DO, DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, HR, HU, ID, IL, IN, IS, JP, KE, KG, KM, KN, KP, KR, KZ, LA, LC, LK, LR, LS, LT, LU, LY, MA, MD, ME, MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, OM, PE, PG, PH, PL, PT, RO, RS, RU, SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN, TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW. (84) Designated States (unless otherwise indicated, for every kind of regional protection available): ARIPO (BW, GH, GM, KE, LS, MW, MZ, NA, SD, SL, SZ, TZ, UG, ZM, ZW), Eurasian (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European (AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LT, LU, LV,
Internationalized Domain Names-Sanskrit
Policy Document for INTERNATIONALIZED DOMAIN NAMES. Language: SANSKRIT. Contents: 1. AUGMENTED BACKUS-NAUR FORMALISM (ABNF): 1.1 Declaration of variables; 1.2 ABNF Operators; 1.3 The Vowel Sequence; 1.4 Consonant Sequence; 1.5 ABNF Applied to the SANSKRIT IDN. 2. RESTRICTION RULES. 3. EXAMPLES. 4. LANGUAGE TABLE: SANSKRIT. 5. NOMENCLATURAL DESCRIPTION TABLE OF SANSKRIT LANGUAGE TABLE. 6. VARIANT TABLE. 7. EXPERTISE/BODIES CONSULTED. 8.
Internationalized Domain Names-Assamese
Policy Document for INTERNATIONALIZED DOMAIN NAMES. Language: ASSAMESE. Contents: 1. AUGMENTED BACKUS-NAUR FORMALISM (ABNF): 1.1 Naming of Variables; 1.2 ABNF Operators; 1.3 The Vowel Sequence; 1.4 Consonant Sequence *; 1.5 ABNF Applied to the Assamese IDN. 2. RESTRICTION RULES. 3. EXAMPLES. 4. LANGUAGE TABLE: ASSAMESE. 5. NOMENCLATURAL DESCRIPTION TABLE OF ASSAMESE LANGUAGE TABLE. 6. VARIANT TABLE. 7. EXPERTS/BODIES CONSULTED. 8. Country Code
Script Grammar for Gujarati Language
SCRIPT GRAMMAR FOR GUJARATI LANGUAGE. Prepared by Technology Development for Indian Languages (TDIL) Programme, Department of Information Technology, Government of India, in association with Centre for Development of Advanced Computing (C-DAC). Table of Contents: 0. INTRODUCTION. 1. OBJECTIVES OF SCRIPT GRAMMAR. 2. END USERS FOR SCRIPT GRAMMAR. 3. SCOPE. 4. TERMINOLOGY. 5. PHILOSOPHY AND UNDERLYING PRINCIPLES. 6. SCRIPT GRAMMAR STRUCTURE: 6.1. PERIPHERAL ELEMENTS OF THE SCRIPT GRAMMAR; 6.2. CONFORMITY TO THE SYLLABLE STRUCTURE; 6.3. SCRIPT GRAMMAR PROPER: 6.3.1. The Character Set of Gujarati; 6.3.2. Consonant Mātrā Combinations; 6.3.3. The Ligature Set of Gujarati.
A Study on Collation of Languages from Developing Asia
A Study on Collation of Languages from Developing Asia. Sarmad Hussain, Nadir Durrani. Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences. www.nu.edu.pk www.idrc.ca. Published by Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, Lahore, Pakistan. Copyrights © International Development Research Center, Canada. Printed by Walayatsons, Pakistan. ISBN: 978-969-8961-03-9. This work was carried out with the aid of a grant from the International Development Research Centre (IDRC), Ottawa, Canada, administered through the Centre for Research in Urdu Language Processing (CRULP), National University of Computer and Emerging Sciences (NUCES), Pakistan. Preface: Defining collation, normally termed alphabetical order or, less frequently, lexicographic order, is one of the first few requirements for enabling computing in any language, second only to encoding, keyboard and fonts. It is because of this critical dependence of computing on collation that its definition is included within the locale of a language. The collation of every written language is defined in its dictionaries, developed over centuries, and is thus very representative of cultural tradition. However, though it is well understood in these cultures, it is not always thoroughly documented or well understood in the context of existing character encodings, especially Unicode. Collation is a complex phenomenon, dependent on three factors: script, language and encoding. These factors interact in a complicated fashion to uniquely define the collation sequence for each language. This volume aims to address the complex algorithms needed for sorting words in sequence for a subset of the languages.
Grantha Script Lessons.Pdf
http://www.virtualvinodh.com. Grantha Script Lessons. Grantha Lipi Pāṭhāḥ. ग्रन्थ लिपि पाठाः. คฺรนฺถ ลิปิ ปาฐา:. Grantha Script Lessons by Vinodh Rajan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 India License. Based on a work at www.virtualvinodh.com. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/2.5/in/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA. Contents: Buddhanusmriti; Grantha - 1 - Vowels; Grantha - 2 - Ayogavaha; Grantha - 3 - Consonants - Ka; Grantha - 4 - Consonants ca - ṭa; Grantha - 5 - Consonants ta - pa; Grantha - 6 - Consonants ya - ha; Grantha - 7 - Summary I; Grantha - 8 - Vowel Signs I; Grantha - 9 - Vowel Signs II; Grantha - 10 - Vowel-less Consonants; Grantha - 11 - Summary II; Grantha - 12 - Conjuncts I; Grantha - 13 - Conjuncts II; Grantha - 14 - Conjuncts III; Grantha - 15 - Conjunct IV; Grantha - 16 - Conjuncts V; Grantha - 17 - Grantha Fonts & Softwares; Sample Texts in Grantha. buddhānusmṛtiḥ (बुद्धानुस्मृतिः). oṁ namaḥ sarvabuddhabodhisattvebhyaḥ (ॐ नमः सर्वबुद्धबोधिसत्त्वेभ्यः). ityapi buddho bhagavāṁstathāgato'rhan (इत्यपि बुद्धो भगवांस्तथागतोऽर्हन्) samyaksaṁbuddho vidyācaraṇasampannaḥ (सम्यक्संबुद्धो विद्याचरणसम्पन्नः)
Santali Language Policies
Policy Document for INTERNATIONALIZED DOMAIN NAMES. Language: SANTALI, DEVANAGARI SCRIPT. Contents: 1. AUGMENTED BACKUS-NAUR FORMALISM (ABNF): 1.1 Declaration of variables; 1.2 ABNF Operators; 1.3 The Vowel Sequence; 1.4 Consonant Sequence; 1.5 ABNF Applied to the SANTALI (DEVANAGARI SCRIPT) IDN. 2. RESTRICTION RULES. 3. EXAMPLES. 4. LANGUAGE TABLE: SANTALI (DEVANAGARI). 5. NOMENCLATURAL DESCRIPTION TABLE OF SANTALI (DEVANAGARI) LANGUAGE TABLE. 7. EXPERTISE/BODIES CONSULTED. 8. Country Code Top Level Domain (ccTLD) FOR SANTALI in DEVANAGARI script
Retroflex Consonant Harmony in South Asia
Retroflex Consonant Harmony in South Asia, by Paul Edmond Arsenault. A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Department of Linguistics, University of Toronto. © Copyright by Paul Edmond Arsenault 2012. Abstract: This dissertation explores the nature and extent of retroflex consonant harmony in South Asia. Using statistics calculated over lexical databases from a broad sample of languages, the study demonstrates that retroflex consonant harmony is an areal trait affecting most languages in the northern half of the South Asian subcontinent, including languages from at least three of the four major families in the region: Dravidian, Indo-Aryan and Munda (but not Tibeto-Burman). Dravidian and Indo-Aryan languages in the southern half of the subcontinent do not exhibit retroflex consonant harmony. In South Asia, retroflex consonant harmony is manifested primarily as a static co-occurrence restriction on coronal consonants in roots/words. Historical-comparative evidence reveals that this pattern is the result of retroflex assimilation that is non-local, regressive and conditioned by the similarity of interacting segments. These typological properties stand in contrast to those of other retroflex assimilation patterns, which are local, primarily progressive, and not conditioned by similarity. This is argued to support the hypothesis that local feature spreading and long-distance feature agreement constitute two independent mechanisms of assimilation, each with its own set of typological properties, and that retroflex consonant harmony is the product of agreement, not spreading.
Śaraṇāgati: Surrender
All glory to Śrī Guru and Śrī Gaurāṅga. Śaraṇāgati (Surrender), with Śrī Laghu-chandrikā-bhāṣya (Gentle Moonlight Commentary). Śrī Chaitanya Sāraswat Maṭh, Nabadwīp. Śaraṇāgati composed by the pre-eminent associate of Śrī Gaurāṅga Mahāprabhu, Śrīla Sachchidānanda Bhakti Vinod Ṭhākur. Śrī Laghu-chandrikā-bhāṣya composed by the Founder-President-Āchārya of Śrī Chaitanya Sāraswat Maṭh, Ananta-śrī-vibhūṣita Oṁ Viṣṇupād Paramahaṁsa-kula-chūḍāmaṇi Śrīla Bhakti Rakṣak Śrīdhar Dev-Goswāmī Mahārāj. Translated into English by the inspiration, merciful encouragement and specific desire of his dearmost associate, Śrī Chaitanya Sāraswat Maṭh President-Sevāite-Āchārya Oṁ Viṣṇupād Viśva-guru Aṣṭottara-śata-śrī Śrīla Bhakti Sundar Govinda Dev-Goswāmī Mahārāj. Published under the guidance of his appointed successor, President-Sevāite-Āchārya of Śrī Chaitanya Sāraswat Maṭh, Oṁ Viṣṇupād Aṣṭottara-śata-śrī Śrīla Bhakti Nirmal Āchārya Mahārāj, from Śrī Chaitanya Sāraswat Maṭh, Nabadwīp. © 2011 Sri Chaitanya Saraswat Math. All rights reserved by the Current Sevaite-President-Acharya of Sri Chaitanya Saraswat Math. Founder Acharya: His Divine Grace Srila Bhakti Raksak Sridhar Dev-Goswami Maharaj. Successor Sevaite-President-Acharya: His Divine Grace Srila Bhakti Sundar Govinda Dev-Goswami Maharaj. Current Sevaite-President-Acharya: His Divine Grace Srila Bhakti Nirmal Acharya Maharaj. Published by Sri Chaitanya Saraswat Math
On the Indian Sect of the Jainas
Note to the HTML edition: This document duplicates the diacritical marks of the original using HTML Unicode combining character entities. Not all browsers and operating systems support them. The chart below shows how these unusual characters are displayed in your browser. Diacritical marks used in this document: Â circumflex over A; ĭ breve over i; ṇ dot under n; â circumflex over a; ī macron over i; ñ tilde over n; à grave over a; Î circumflex over I; ô circumflex over o; á acute over a; î circumflex over i; ṛ dot under r; ḍ dot under d; í acute over i; Ṛ dot under R; ĕ breve over e; ì grave over i; Ś acute over S; è grave over e; Ṁ dot over M; ś acute over s; ê circumflex over e; ṁ dot over m; Ṭ dot under T; ë umlaut over e; ṃ dot under m; ṭ dot under t; ḥ dot under h; n̄ macron over n; Ü umlaut over U; ń acute over n; ü umlaut over u; ı̐ chandrabindu over i; ṅ dot over n; û circumflex over u. ON THE INDIAN SECT OF THE JAINAS. CONTENTS: Preface; THE INDIAN SECT OF THE JAINAS, by Dr.
Proposal for a Telugu Script Root Zone Label Generation Ruleset (LGR)
Proposal for a Telugu Script Root Zone Label Generation Ruleset (LGR). LGR Version: 3.0. Date: 2019-03-06. Document version: 2.7. Authors: Neo-Brahmi Generation Panel [NBGP]. 1. General Information/Overview/Abstract: This document lays down the Label Generation Rule Set for the Telugu script. Three main components of the Telugu Script LGR, viz. Code point repertoire, Variants and Whole Label Evaluation Rules, have been described in detail here. All these components have been incorporated in a machine-readable format in the accompanying XML file: "proposal-telugu-lgr-06mar19-en.xml". In addition, a list of test labels has been provided in the following file, which covers the repertoire, variant code points and the whole label evaluation rules, providing examples for valid and invalid labels: "telugu-test-labels-06mar19-en.txt". 2. Script for which the LGR is proposed: ISO 15924 Code: Telu; ISO 15924 Key N°: 340; ISO 15924 English Name: Telugu; Latin transliteration of native script name: telugu; Native name of the script: తెలుగు; Maximal Starting Repertoire [MSR] version: 4; The Unicode Standard, Version: 6.3; Telugu Unicode Range: 0C00–0C7F. 3. Background of the Script and Principal Languages Using It: The Telugu language uses the Telugu script, which is written in the form of sequences of orthographic syllables. Each orthographic syllable is formed of one or more Telugu characters placed from left to right and top to bottom. Telugu is one of the 22 scheduled languages of India. The Telugu script is immediately related to Kannada and closely related to the Sinhala script. 3.1 The Evolution of the Script: The origins of the Telugu script can be traced to the Brahmi alphabet of ancient India, often known as Asokan Brahmi.
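The Telugu Unicode range quoted in this extract (0C00–0C7F) is enough for a coarse pre-check of a candidate label. The sketch below is only an illustration under that assumption: the function name `in_telugu_block` is hypothetical, and the actual LGR repertoire is defined in the proposal's XML file and is narrower than the whole block, so passing this check is necessary but not sufficient.

```python
# Coarse pre-check: does every code point of a label fall inside the
# Telugu Unicode block (U+0C00..U+0C7F)? This does NOT implement the
# LGR repertoire, variants, or whole-label evaluation rules.
TELUGU_BLOCK_START = 0x0C00
TELUGU_BLOCK_END = 0x0C7F


def in_telugu_block(label: str) -> bool:
    """Return True if the label is non-empty and every code point is in the block."""
    return bool(label) and all(
        TELUGU_BLOCK_START <= ord(ch) <= TELUGU_BLOCK_END for ch in label
    )


if __name__ == "__main__":
    native_name = "\u0C24\u0C46\u0C32\u0C41\u0C17\u0C41"  # the script's native name
    print(in_telugu_block(native_name))
    print(in_telugu_block("telugu"))
```

A real validator would load the repertoire from the machine-readable LGR file rather than test against the block boundaries.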
Rebuttal of Background of Indic Segmentation
Rebuttal of Background of Indic Segmentation. Submitter: Richard Wordingham. Date: 30 April 2017. Introduction: Submission L2/17-094 contains several unreliable statements, and I feel obliged to make corrections or give warnings. The definition offered for the Indic orthographic syllable is erroneous and inadequate. The formulation offered is: V[m] | {CH}C[v][m] | CH, where V = independent vowel; m = anusvara, visarga, chandrabindu; C = consonant, or consonant + nukta; v = dependent vowel; H = halant / virama. Even for the above constituents and for the Devanagari script, a more general formulation is required, namely V[m] | {CH}C{v}[m] | {CH}CH. The extension of the third alternative is obvious; even Sanskrit has a few words that end in two consonants. Even then, at the code point level, this ignores the fact that Microsoft has long acknowledged the sequence of repha and independent vowel. Some vowels in Devanagari are expressed by sequences of two vowel signs, for example <U+094E DEVANAGARI VOWEL SIGN PRISHTHAMATRA E, U+093E DEVANAGARI VOWEL SIGN AA>, which conveys the same vowel sound as U+094B DEVANAGARI VOWEL SIGN O. The expression above also ignores the use of ZWJ and ZWNJ. These are required for the proper display of Hindi when a font might otherwise use conjuncts considered appropriate for Sanskrit but inappropriate for Hindi, e.g. द्ग <U+0926 DEVANAGARI LETTER DA, U+094D DEVANAGARI SIGN VIRAMA, U+0917 DEVANAGARI LETTER GA> as opposed to द्‍ग <U+0926, U+094D, U+200D ZERO WIDTH JOINER, U+0917>. There is also the issue that even in Devanagari, a virama does not always combine consonants into a single orthographic cluster.
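The more general formulation quoted in this extract, V[m] | {CH}C{v}[m] | {CH}CH, can be written as a regular expression. The following is a minimal sketch, not the submitter's implementation: the Devanagari code point ranges chosen for V, C, v, m and H are assumptions made here for illustration, the names `SYLLABLE` and `segment` are hypothetical, and the cases the extract flags as outside the formula (ZWJ/ZWNJ, repha plus independent vowel, viramas that do not form a cluster) are deliberately not handled.

```python
import re

# Assumed Devanagari classes for the symbols in the formulation.
V = "[\u0904-\u0914]"         # independent vowels
C = "[\u0915-\u0939]\u093C?"  # consonant, optionally followed by nukta
v = "[\u093E-\u094C\u094E]"   # dependent vowel signs, incl. prishthamatra E
m = "[\u0901-\u0903]"         # chandrabindu, anusvara, visarga
H = "\u094D"                  # halant / virama

# V[m] | {CH}C{v}[m] | {CH}CH
# The second and third alternatives share the prefix {CH}C, so they are
# merged as {CH}C(H | {v}[m]); H is tried first so a final virama is kept.
SYLLABLE = re.compile(f"{V}{m}?|(?:{C}{H})*{C}(?:{H}|{v}*{m}?)")


def segment(text: str) -> list[str]:
    """Split text into orthographic syllables; unmatched characters are skipped."""
    return [match.group(0) for match in SYLLABLE.finditer(text)]


if __name__ == "__main__":
    for word in ("\u0939\u093F\u0928\u094D\u0926\u0940",   # "Hindi" in Devanagari
                 "\u0915\u094E\u093E"):                   # KA + two vowel signs
        print([s.encode("unicode_escape").decode() for s in segment(word)])
```

Using `{v}` rather than `[v]` is what lets the two-sign vowel sequence <U+094E, U+093E> stay inside one syllable, and the `{CH}CH` branch keeps a word-final consonant cluster together.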