6 Design and Evaluation of Soft Keyboards for Brahmic Scripts
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Cross-Language Framework for Word Recognition and Spotting of Indic Scripts
Cross-language Framework for Word Recognition and Spotting of Indic Scripts aAyan Kumar Bhunia, bPartha Pratim Roy*, aAkash Mohta, cUmapada Pal aDept. of ECE, Institute of Engineering & Management, Kolkata, India bDept. of CSE, Indian Institute of Technology Roorkee India cCVPR Unit, Indian Statistical Institute, Kolkata, India bemail: [email protected], TEL: +91-1332-284816 Abstract Handwritten word recognition and spotting of low-resource scripts are difficult as sufficient training data is not available and it is often expensive for collecting data of such scripts. This paper presents a novel cross language platform for handwritten word recognition and spotting for such low-resource scripts where training is performed with a sufficiently large dataset of an available script (considered as source script) and testing is done on other scripts (considered as target script). Training with one source script and testing with another script to have a reasonable result is not easy in handwriting domain due to the complex nature of handwriting variability among scripts. Also it is difficult in mapping between source and target characters when they appear in cursive word images. The proposed Indic cross language framework exploits a large resource of dataset for training and uses it for recognizing and spotting text of other target scripts where sufficient amount of training data is not available. Since, Indic scripts are mostly written in 3 zones, namely, upper, middle and lower, we employ zone-wise character (or component) mapping for efficient learning purpose. The performance of our cross- language framework depends on the extent of similarity between the source and target scripts. -
Grammatica Et Verba Glamor and Verve
“GippertOVprint” — 2014/4/24 — 22:14 — page 81 —#1 Grammatica et verba Glamor and verve Studies in South Asian, historical, and Indo-European linguistics in honor of Hans Henrich Hock on the occasion of his seventy-fifth birthday edited by Shu-Fen Chen and Benjamin Slade Beech Stave Press Ann Arbor New York • “GippertOVprint” — 2014/4/24 — 22:14 — page 82 —#2 © Beech Stave Press, Inc. All rights reserved. No part of this publication may be reproduced, translated, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission from the publisher. Typeset with LATEX using the Galliard typeface designed by Matthew Carter and Greek Old Face by Ralph Hancock. The typeface on the cover is Post Hock by Steve Peter. Library of Congress Cataloging-in-Publication Data Grammatica et verba : glamor and verve : studies in South Asian, historical, and Indo- European linguistics in honor of Hans Henrich Hock on the occasion of his seventy- fifth birthday / edited by Shu-Fen Chen and Benjamin Slade. pages cm Includes bibliographical references. ISBN ---- (alk. paper) . Indo-European languages. Lexicography. Historical linguistics. I. Hock, Hans Henrich, - honoree. II. Chen, Shu-Fen, editor of compilation. III. Slade, Ben- jamin, editor of compilation. –dc Printed in the United States of America “GippertOVprint” — 2014/4/24 — 22:14 — page 83 —#3 TableHHHHHH ofHHHHHHHHHHHHHHH ContentsHHHHHHHHHHHH Preface . vii Bibliography of Hans Henrich Hock . ix List of Contributors . xxi , Traces of Archaic Human Language Structure Anvita Abbi in the Great Andamanese Language. , A Study of Punctuation Errors in the Chinese Diamond Sutra Shu-Fen Chen Based on Sanskrit Texts . -
SC22/WG20 N896 L2/01-476 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale De Normalisation
SC22/WG20 N896 L2/01-476 Universal multiple-octet coded character set International organization for standardization Organisation internationale de normalisation Title: Ordering rules for Khmer Source: Kent Karlsson Date: 2001-12-19 Status: Expert contribution Document type: Working group document Action: For consideration by the UTC and JTC 1/SC 22/WG 20 1 Introduction The Khmer script in Unicode/10646 uses conjoining characters, just like the Indic scripts and Hangul. An alternative that is suitable (and much more elegant) for Khmer, but not for Indic scripts, would have been to have combining- below (and sometimes a bit to the side) consonants and combining-below independent vowels, similar to the combining-above Latin letters recently encoded, as well as how Tibetan is handled. However that is not the chosen solution, which instead uses a combining character (COENG) that makes characters conjoin (glue) like it is done for Indic (Brahmic) scripts and like Hangul Jamo, though the latter does not have a separate gluer character. In the Khmer script, using the COENG based approach, the words are formed from orthographic syllables, where an orthographic syllable has the following structure [add ligation control?]: Khmer-syllable ::= (K H)* K M* where K is a Khmer consonant (most with an inherent vowel that is pronounced only if there is no consonant, independent vowel, or dependent vowel following it in the orthographic syllable) or a Khmer independent vowel, H is the invisible Khmer conjoint former COENG, M is a combining character (including COENG, though that would be a misspelling), in particular a combining Khmer vowel (noted A below) or modifier sign. -
Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
1 Assessment of Options for Handling Full Unicode Character Encodings in MARC21 A Study for the Library of Congress Part 1: New Scripts Jack Cain Senior Consultant Trylus Computing, Toronto 1 Purpose This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format. 1.1 “Encoding Scheme” vs. “Repertoire” An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding schemes. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, & “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal). 1.2 MARC8 "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one byte tables in which each character is encoded using a single 8 bit byte. (It also includes the EACC set which actually uses fixed length 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only. -
Proposal to Encode 0D5F MALAYALAM LETTER ARCHAIC II
Proposal to encode 0D5F MALAYALAM LETTER ARCHAIC II Shriramana Sharma, jamadagni-at-gmail-dot-com, India 2012-May-22 §1. Introduction In the Malayalam Unicode encoding, the independent letter form for the long vowel Ī is: ഈ where the length mark ◌ൗ is appended to the short vowel ഇ to parallel the symbols for U/UU i.e. ഉ/ഊ. However, Ī was originally written as: As these are entirely different representations of Ī, although their sound value may be the same, it is proposed to encode the archaic form as a separate character. §2. Background As the core Unicode encoding for the major Indic scripts is based on ISCII, and the ISCII code chart for Malayalam only contained the modern form for the independent Ī [1, p 24]: … thus it is this written form that came to be encoded as 0D08 MALAYALAM LETTER II. While this “new” written form is seen in print as early as 1936 CE [2]: … there is no doubt that the much earlier form was parallel to the modern Grantha , both of which are derived from the old Grantha Ī as seen below: 1 ma - īdṛgvidhā | tatarāya īṇ Old Grantha from the Iḷaiyānputtūr copper plates [3, p 13]. Also seen is the Vatteḻuttu Ī of the same time (line 2, 2nd char from right) which also exhibits the two dots. Of course, both are derived from old South Indian Brahmi Ī which again has the two dots. It is said [entire paragraph: Radhakrishna Warrier, personal communication] that it was the poet Vaḷḷattōḷ Nārāyaṇa Mēnōn (1878–1958) who introduced the new form of Ī ഈ. -
The Rise of Dalit Peasants Kolhi Activism in Lower Sindh
The Rise of Dalit Peasants Kolhi Activism in Lower Sindh (Original Thesis Title) Kolhi-peasant Activism in Naon Dumbālo, Lower Sindh Creating Space for Marginalised through Multiple Channels Ghulam Hussain Mahesar Quaid-i-Azam University Department of Anthropology ii Islamabad - Pakistan Year 2014 Kolhi-Peasant Activism in Naon Dumbālo, Lower Sindh Creating Space for Marginalised through Multiple Channels Ghulam Hussain Thesis submitted to the Department of Anthropology, Quaid-i-Azam University Islamabad, in partial fulfillment of the degree of ‗Master of Philosophy in Anthropology‘ iii Quaid-i-Azam University Department of Anthropology Islamabad - Pakistan Year 2014 Formal declaration I hereby, declare that I have produced the present work by myself and without any aid other than those mentioned herein. Any ideas taken directly or indirectly from third party sources are indicated as such. This work has not been published or submitted to any other examination board in the same or a similar form. Islamabad, 25 March 2014 Mr. Ghulam Hussain Mahesar iv Final Approval of Thesis Quaid-i-Azam University Department of Anthropology Islamabad - Pakistan This is to certify that we have read the thesis submitted by Mr. Ghulam Hussain. It is our judgment that this thesis is of sufficient standard to warrant its acceptance by Quaid-i-Azam University, Islamabad for the award of the degree of ―MPhil in Anthropology‖. Committee Supervisor: Dr. Waheed Iqbal Chaudhry External Examiner: Full name of external examiner incl. title Incharge: Dr. Waheed Iqbal Chaudhry v ACKNOWLEDGEMENT This thesis is the product of cumulative effort of many teachers, scholars, and some institutions, that duly deserve to be acknowledged here. -
Proposal for a Gujarati Script Root Zone Label Generation Ruleset (LGR)
Proposal for a Gujarati Root Zone LGR Neo-Brahmi Generation Panel Proposal for a Gujarati Script Root Zone Label Generation Ruleset (LGR) LGR Version: 3.0 Date: 2019-03-06 Document version: 3.6 Authors: Neo-Brahmi Generation Panel [NBGP] 1 General Information/ Overview/ Abstract The purpose of this document is to give an overview of the proposed Gujarati LGR in the XML format and the rationale behind the design decisions taken. It includes a discussion of relevant features of the script, the communities or languages using it, the process and methodology used and information on the contributors. The formal specification of the LGR can be found in the accompanying XML document: proposal-gujarati-lgr-06mar19-en.xml Labels for testing can be found in the accompanying text document: gujarati-test-labels-06mar19-en.txt 2 Script for which the LGR is proposed ISO 15924 Code: Gujr ISO 15924 Key N°: 320 ISO 15924 English Name: Gujarati Latin transliteration of native script name: gujarâtî Native name of the script: ગજુ રાતી Maximal Starting Repertoire (MSR) version: MSR-4 1 Proposal for a Gujarati Root Zone LGR Neo-Brahmi Generation Panel 3 Background on the Script and the Principal Languages Using it1 Gujarati (ગજુ રાતી) [also sometimes written as Gujerati, Gujarathi, Guzratee, Guujaratee, Gujrathi, and Gujerathi2] is an Indo-Aryan language native to the Indian state of Gujarat. It is part of the greater Indo-European language family. It is so named because Gujarati is the language of the Gujjars. Gujarati's origins can be traced back to Old Gujarati (circa 1100– 1500 AD). -
Comment on L2/13-065: Grantha Characters from Other Indic Blocks Such As Devanagari, Tamil, Bengali Are NOT the Solution
Comment on L2/13-065: Grantha characters from other Indic blocks such as Devanagari, Tamil, Bengali are NOT the solution Dr. Naga Ganesan ([email protected]) Here is a brief comment on L2/13-065 which suggests Devanagari block for Grantha characters to write both Aryan and Dravidian languages. From L2/13-065, which suggests Devanagari block for writing Sanskrit and Dravidian languages, we read: “Similar dedicated characters were provided to cater different fonts as URUDU MARATHI SINDHI MARWARI KASHMERE etc for special conditions and needs within the shared DEVANAGARI range. There are provisions made to suit Dravidian specials also as short E short O LLLA RRA NNNA with new created glyptic forms in rhythm with DEVANAGARI glyphs (with suitable combining ortho-graphic features inside DEVANAGARI) for same special functions. These will help some aspirant proposers as a fall out who want to write every Indian language contents in Grantham when it is unified with DEVANAGARI because those features were already taken care of by the presence in that DEVANAGARI. Even few objections seen in some docs claim them as Tamil characters to induce a bug inside Grantham like letters from GoTN and others become tangential because it is going to use the existing Sanskrit properties inside shared DEVANAGARI and also with entirely new different glyph forms nothing connected with Tamil. In my doc I have shown rhythmic non-confusable glyphs for Grantham LLLA RRA NNNA for which I believe there cannot be any varied opinions even by GoTN etc which is very much as Grantham and not like Tamil form at all to confuse (as provided in proposals).” Grantha in Devanagari block to write Sanskrit and Dravidian languages will not work. -
Finite-State Script Normalization and Processing Utilities: the Nisaba Brahmic Library
Finite-state script normalization and processing utilities: The Nisaba Brahmic library Cibu Johny† Lawrence Wolf-Sonkin‡ Alexander Gutkin† Brian Roark‡ Google Research †United Kingdom and ‡United States {cibu,wolfsonkin,agutkin,roark}@google.com Abstract In addition to such normalization issues, some scripts also have well-formedness constraints, i.e., This paper presents an open-source library for efficient low-level processing of ten ma- not all strings of Unicode characters from a single jor South Asian Brahmic scripts. The library script correspond to a valid (i.e., legible) grapheme provides a flexible and extensible framework sequence in the script. Such constraints do not ap- for supporting crucial operations on Brahmic ply in the basic Latin alphabet, where any permuta- scripts, such as NFC, visual normalization, tion of letters can be rendered as a valid string (e.g., reversible transliteration, and validity checks, for use as an acronym). The Brahmic family of implemented in Python within a finite-state scripts, however, including the Devanagari script transducer formalism. We survey some com- mon Brahmic script issues that may adversely used to write Hindi, Marathi and many other South affect the performance of downstream NLP Asian languages, do have such constraints. These tasks, and provide the rationale for finite-state scripts are alphasyllabaries, meaning that they are design and system implementation details. structured around orthographic syllables (aksara)̣ as the basic unit.1 One or more Unicode characters 1 Introduction combine when rendering one of thousands of leg- The Unicode Standard separates the representation ible aksara,̣ but many combinations do not corre- of text from its specific graphical rendering: text spond to any aksara.̣ Given a token in these scripts, is encoded as a sequence of characters, which, at one may want to (a) normalize it to a canonical presentation time are then collectively rendered form; and (b) check whether it is a well-formed into the appropriate sequence of glyphs for display. -
N4035 Proposal to Encode the Tirhuta Script in ISO/IEC 10646
ISO/IEC JTC1/SC2/WG2 N4035 L2/11-175R 2011-05-05 Proposal to Encode the Tirhuta Script in ISO/IEC 10646 Anshuman Pandey Department of History University of Michigan Ann Arbor, Michigan, U.S.A. [email protected] May 5, 2011 1 Introduction This is a proposal to encode the Tirhuta script in the Universal Character Set (UCS). The recommendations made here are based upon the following documents: • L2/06-226 “Request to Allocate the Maithili Script in the Unicode Roadmap” • N3765 L2/09-329 “Towards an Encoding for the Maithili Script in ISO/IEC 10646” Tirhuta was called the ‘Maithili’ script in previous documents. A discussion of the names used for the script and the rationale for the change of name is offered in Section 3.1. 2 Background Tirhuta is the traditional writing system for the Maithili language (ISO 639: mai), which is spoken in the state of Bihar in India and in the Narayani and Janakpur zones of Nepal by more than 35 million people. Maithili is a scheduled language of India and the second most spoken language in Nepal. Tirhuta is a Brahmi-based script derived from Gauḍī, or ‘Proto-Bengali’, which evolved from the Kuṭila branch of Brahmi by the 10th century . It is related to the Bengali, Newari, and Oriya scripts, which are also descended from Gauḍī, and became differentiated from them by the 14th century.1 Tirhuta remained the primary writing system for Maithili until the 20th century, at which time it was replaced by Devanagari. The Tirhuta script forms the basis of scholarly and religious scribal traditions associated with the Maithili and Sanskrit languages since the 14th century . -
Relation Between Harappan and Brahmi Scripts
Relation Between Harappan And Brahmi Scripts Subhajit Ganguly Email: [email protected] Copyright © Subhajit Ganguly, 2012 Abstract: Around 45 odd signs out of the total number of Harappan signs found make up almost 100 percent of the inscriptions, in some form or other, as said earlier. Out of these 45 signs, around 40 are readily distinguishable. These form an almost exclusive and unique set. The primary signs are seen to have many variants, as in Brahmi. Many of these provide us with quite a vivid picture of their evolution, depending upon the factors of time, place and usefulness. Even minor adjustments in such signs, depending upon these factors, are noteworthy. Many of the signs in this list are the same as or are very similar to the corresponding Brahmi signs. These are similarities that simply cannot arise from mere chance. It is also to be noted that the most frequently used signs in the Brahmi look so similar to the most frequent Harappan symbols. The Harappan script transformed naturally into the Brahmi, depending upon the factors channelizing evolution of scripts. Brahmi Signs: hough a few variants of the Brahmi alphabet system have been known to exist, with the evolution of Brahmi characters, the core signs are seen to be quite consistent over T time. The syntax of their usage has also been found to be roughly consistent throughout this evolution process. Lists of Brahmi numerals, vowels and consonants are provided here in fig. 2and fig. 3 , respectively: 1 Fig. 2 : Brahmi numerals from 1-9. 2 Fig. 3 : Brahmi vowels and consonants. -
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset Brian Roarky, Lawrence Wolf-Sonkiny, Christo Kirovy, Sabrina J. Mielkez, Cibu Johnyy, Işın Demirşahiny and Keith Hally yGoogle Research zJohns Hopkins University {roark,wolfsonkin,ckirov,cibu,isin,kbhall}@google.com [email protected] Abstract This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text. Keywords: romanization, transliteration, South Asian languages 1. Introduction lingua franca, the prevalence of loanwords and code switch- Languages in South Asia – a region covering India, Pak- ing (largely but not exclusively with English) is high. istan, Bangladesh and neighboring countries – are generally Unlike translations, human generated parallel versions of written with Brahmic or Perso-Arabic scripts, but are also text in the native scripts of these languages and romanized often written in the Latin script, most notably for informal versions do not spontaneously occur in any meaningful vol- communication such as within SMS messages, WhatsApp, ume.