6 Design and Evaluation of Soft Keyboards for Brahmic Scripts

Total Page:16

File Type:pdf, Size:1020Kb

6 Design and Evaluation of Soft Keyboards for Brahmic Scripts 1 Design and Evaluation of Soft Keyboards for Brahmic Scripts 2 LAUREN HINKLE, University of Michigan 3 ALBERT BROUILLETTE,UniversityofColorado 4 SUJAY JAYAKAR, Cornell University 6 5 LEIGH GATHINGS, MITRE Corporation 6 MIGUEL LEZCANO and JUGAL KALITA,UniversityofColorado 7 Despite being spoken by a large percentage of the world, Indic languages in general lack user-friendly 8 and efficient methods for text input. These languages have poor or no support for typing. Soft keyboards, 9 because of their ease of installation and lack of reliance on specific hardware, are a promising solution 10 as an input device for many languages. Developing an acceptable soft keyboard requires the frequency 11 analysis of characters in order to design a layout that minimizes text-input time. This article proposes the 12 use of various development techniques, layout variations, and evaluation methods for the creation of soft 13 keyboards for Brahmic scripts. We propose that using optimization techniques such as genetic algorithms 14 and multi-objective Pareto optimization to develop multi-layer keyboards will increase the speed at which 15 text can be entered. 16 Categories and Subject Descriptors: H.5.2 [Information Systems–Information Interfaces and 17 Presentation]: User Interfaces 18 General Terms: Design, Experimentation, Human Factors 19 Additional Key Words and Phrases: Genetic algorithms, soft keyboards, Indic languages, Assamese, Pareto 20 optimization, mobile devices 21 ACM Reference Format: 22 Hinkle, L., Brouillette, A., Jayakar, S., Gathings, L., Lezcano, M., and Kalita, J. 2013. Design and evaluation 23 of soft keyboards for Brahmic scripts. ACM Trans. Asian Lang. Inform. Process. 12, 2, Article 6 (June 2013), 24 37 pages. 25 DOI:http://dx.doi.org/10.1145/2461316.2461318 26 1. INTRODUCTION 27 In an increasingly fast-paced and digitally connected world, being able to input text 28 and input it quickly is important to all users. Many efficient physical keyboards have 29 been designed and developed for standard Roman-alphabet based languages although 30 in practice, they have been discarded in favor of the familiar QWERTY keyboard. Thus, 31 the standard physical keyboard that comes with computers around the world is almost 32 always the QWERTY keyboard, which has its origins in the 1800s. Noyes [1983] and 33 Yamada [1980] provide detailed developmental histories of English physical keyboards 34 and of the QWERTY keyboard in particular. 35 The QWERTY keyboard is unsuitable for Brahmic scripts, which are the modern 36 descendants of the Brahmi script [Coulmas 1991], used widely in countries of the This work is supported by the National Science Foundation, under grants CNS-0958576, CNS-0851783 and DUE-0422524. Author’s address: J. Kalita, Computer Science Department, College of Engineering and Applied Science, 1420 Austin Bluffs Parkway, Colorado Springs, CO 80933-7150; email: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax 1(212)869-0481,[email protected]. +c 2013 ACM 1530-0226/2013/06-ART6 $15.00 DOI:! http://dx.doi.org/10.1145/2461316.2461318 ACM Transactions on Asian Language Information Processing, Vol. 12, No. 2, Article 6, Publication date: June 2013. 6:2 L. Hinkle et al. 37 Indian sub-continent and other parts of East Asia. This is evident in India where 1 38 there are several different extant Brahmic scripts [Salomon 1998] used by all 39 the major languages belonging to Indo-European, Dravidian, and other families 40 [Wagner et al. 1999, 24]. The scripts include Devanagari, Eastern Nagari, Gujarati, 41 Gurmukhi, Telugu, Kannada, Malayalam, and Tamil. Scripts for languages such as 42 Tibetan (Tibeto-Burman, Tibet), Burmese (Tibeto-Burman, Myanmar), Sinhala (Indo- 43 European, Sri Lanka), Balinese and Javanese (Austronesian, Indonesia), Thai (Austro- 44 Thai, Thailand), Khmer and Lao (Laos, both Mon-Khmer) belong to the same script 45 class, which have similar, and sometimes additional, issues. There is no easy and 46 widely acceptable text entry method for most of these languages, especially those in 47 the Indian subcontinent. This is true even in the case of Hindi, which is used natively 48 by between 182 and 366 million people, and Bengali, which is used by between 181 2 49 and 207 million people .Thesetwolanguagesarethefourthandsixthmostcommonly 50 spoken languages in the world, surpassing languages such as Russian (between 144 51 and 167 million), German (between 90 and 100 million), and French (between 68 and 52 78 million). 53 Most widely spoken Indic languages have had typewriters for some time, but the 54 typewriters have been used almost exclusively for press and bureaucratic needs. 55 However, records of how these typewriter keyboards were created are very difficult 56 to find. In addition, such typewriters, and hence the associated keyboards, never 57 achieved any significant level of usage outside printing presses and government offices. 58 While researchers have designed physical keyboards for Brahmic scripts for use with 59 computers [Joshi et al. 2004], usually the efficiency of input has not been a major 60 concern in these designs, and such keyboards are typically unavailable outside of a 3 61 few research labs. In a country like India, where only 4% [Kachru 1986] to 10% of 62 the population can perform adequately in English, almost everyone else relies on their 63 native language for everyday activities. However, the small percentage of people who 64 can perform well in English control most professions and these are the only people 65 who have frequent and reliable access to digital technologies that require reading text 66 and inputting text. Thus, one of the problems that has made the use of computers 67 and other digital devices in their native language very difficult is the lack of adequate 68 text entry methods. This perpetuates a severe but unacknowledged form of digital 69 divide (e.g., [Joshi et al. 2004; Ko and Yoshiki 2005; Hosken and Lyons 2003]) for a 70 large segment of the world’s population. Typically, computer or typewriter text entry 71 in Indian languages using physical keyboards is done by professional typists who 72 have had months of training or by dedicated individuals. Typing is almost impossible 73 for most common folk, which is in stark contrast with America and Western Europe 74 where the vast majority of people are able to type. The recent introduction of mobile 75 phones has spread across the socio-economic layers of India’s population. While this 76 development has greatly increased the access that many people have to information, 77 it has also increased the need for soft keyboards in the native languages. 78 Soft keyboards allow a user to input text without the use of a physical keyboard. 79 Kolsch and Turk [2002] define a soft keyboard as a typing device with no physical 80 manifestation. Soft keyboards are versatile because they allow data to be input 81 through mouse clicks or by touching an onscreen keyboard. With the recent surge 82 in popularity of touchscreen technology, well-designed soft keyboards are becoming 1See http://en.wikipedia.org/wiki/Brahmic script, accessed February 20, 2012. 2See http://en.wikipedia.org/wiki/List of languages by number of native speakers, accessed February 20, 2012. 3See http://en.wikipedia.org/wiki/List of countries by English-speaking population, accessed February 20, 2012. ACM Transactions on Asian Language Information Processing, Vol. 12, No. 2, Article 6, Publication date: June 2013. Design and Evaluation of Soft Keyboards for Brahmic Scripts 6:3 83 even more important everywhere. The market for English language soft keyboards is 84 already dominated by the familiar and ubiquitous QWERTY layout, even though it 85 is not the most efficient keyboard ever made, and many more efficient soft keyboards 86 have been designed. However, languages for which there are no entrenched keyboard 87 designs (and in fact no soft keyboards for Brahmic languages have been made beyond 88 the basic alphabetic ones and QWERTY-based designs, which have not been optimized 89 for text input performance), designing and developing soft keyboards for the emerging 90 and rapidly expanding touchscreen market may be very helpful in reducing the almost 91 complete inaccessibility of digital devices, including computers, for those who natively 92 speak Brahmic languages. Some preliminary work has been done in this field, but 93 there is still much room for improvement. Ghosh et al. [2011] has done work on Bengali 94 input systems for mobile devices, and Jalihal [2010] has proposed a Hindi input system 95 that maps characters to the standard nine buttons of a cell phone keypad. 96 Brahmic scripts have more characters and ligatures than what can usably fit on 97 astandardkeyboard.Asoftkeyboardallowsanylanguagetohavecustomlayouts 98 based on the frequency of character and ligature use within that language. The 99 development and spread of soft keyboards would allow the use of an easier to learn and 4 5 100 more efficient keyboard. Web service providers such as Google and Wikipedia have 101 developed soft keyboards for Brahmic scripts, but these are simply an alphabetical 102 listing of letters, and thus not optimized.
Recommended publications
  • Cross-Language Framework for Word Recognition and Spotting of Indic Scripts
    Cross-language Framework for Word Recognition and Spotting of Indic Scripts aAyan Kumar Bhunia, bPartha Pratim Roy*, aAkash Mohta, cUmapada Pal aDept. of ECE, Institute of Engineering & Management, Kolkata, India bDept. of CSE, Indian Institute of Technology Roorkee India cCVPR Unit, Indian Statistical Institute, Kolkata, India bemail: [email protected], TEL: +91-1332-284816 Abstract Handwritten word recognition and spotting of low-resource scripts are difficult as sufficient training data is not available and it is often expensive for collecting data of such scripts. This paper presents a novel cross language platform for handwritten word recognition and spotting for such low-resource scripts where training is performed with a sufficiently large dataset of an available script (considered as source script) and testing is done on other scripts (considered as target script). Training with one source script and testing with another script to have a reasonable result is not easy in handwriting domain due to the complex nature of handwriting variability among scripts. Also it is difficult in mapping between source and target characters when they appear in cursive word images. The proposed Indic cross language framework exploits a large resource of dataset for training and uses it for recognizing and spotting text of other target scripts where sufficient amount of training data is not available. Since, Indic scripts are mostly written in 3 zones, namely, upper, middle and lower, we employ zone-wise character (or component) mapping for efficient learning purpose. The performance of our cross- language framework depends on the extent of similarity between the source and target scripts.
    [Show full text]
  • Grammatica Et Verba Glamor and Verve
    “GippertOVprint” — 2014/4/24 — 22:14 — page 81 —#1 Grammatica et verba Glamor and verve Studies in South Asian, historical, and Indo-European linguistics in honor of Hans Henrich Hock on the occasion of his seventy-fifth birthday edited by Shu-Fen Chen and Benjamin Slade Beech Stave Press Ann Arbor New York • “GippertOVprint” — 2014/4/24 — 22:14 — page 82 —#2 © Beech Stave Press, Inc. All rights reserved. No part of this publication may be reproduced, translated, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission from the publisher. Typeset with LATEX using the Galliard typeface designed by Matthew Carter and Greek Old Face by Ralph Hancock. The typeface on the cover is Post Hock by Steve Peter. Library of Congress Cataloging-in-Publication Data Grammatica et verba : glamor and verve : studies in South Asian, historical, and Indo- European linguistics in honor of Hans Henrich Hock on the occasion of his seventy- fifth birthday / edited by Shu-Fen Chen and Benjamin Slade. pages cm Includes bibliographical references. ISBN ---- (alk. paper) . Indo-European languages. Lexicography. Historical linguistics. I. Hock, Hans Henrich, - honoree. II. Chen, Shu-Fen, editor of compilation. III. Slade, Ben- jamin, editor of compilation. –dc Printed in the United States of America “GippertOVprint” — 2014/4/24 — 22:14 — page 83 —#3 TableHHHHHH ofHHHHHHHHHHHHHHH ContentsHHHHHHHHHHHH Preface . vii Bibliography of Hans Henrich Hock . ix List of Contributors . xxi , Traces of Archaic Human Language Structure Anvita Abbi in the Great Andamanese Language. , A Study of Punctuation Errors in the Chinese Diamond Sutra Shu-Fen Chen Based on Sanskrit Texts .
    [Show full text]
  • SC22/WG20 N896 L2/01-476 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale De Normalisation
    SC22/WG20 N896 L2/01-476 Universal multiple-octet coded character set International organization for standardization Organisation internationale de normalisation Title: Ordering rules for Khmer Source: Kent Karlsson Date: 2001-12-19 Status: Expert contribution Document type: Working group document Action: For consideration by the UTC and JTC 1/SC 22/WG 20 1 Introduction The Khmer script in Unicode/10646 uses conjoining characters, just like the Indic scripts and Hangul. An alternative that is suitable (and much more elegant) for Khmer, but not for Indic scripts, would have been to have combining- below (and sometimes a bit to the side) consonants and combining-below independent vowels, similar to the combining-above Latin letters recently encoded, as well as how Tibetan is handled. However that is not the chosen solution, which instead uses a combining character (COENG) that makes characters conjoin (glue) like it is done for Indic (Brahmic) scripts and like Hangul Jamo, though the latter does not have a separate gluer character. In the Khmer script, using the COENG based approach, the words are formed from orthographic syllables, where an orthographic syllable has the following structure [add ligation control?]: Khmer-syllable ::= (K H)* K M* where K is a Khmer consonant (most with an inherent vowel that is pronounced only if there is no consonant, independent vowel, or dependent vowel following it in the orthographic syllable) or a Khmer independent vowel, H is the invisible Khmer conjoint former COENG, M is a combining character (including COENG, though that would be a misspelling), in particular a combining Khmer vowel (noted A below) or modifier sign.
    [Show full text]
  • Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
    1 Assessment of Options for Handling Full Unicode Character Encodings in MARC21 A Study for the Library of Congress Part 1: New Scripts Jack Cain Senior Consultant Trylus Computing, Toronto 1 Purpose This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format. 1.1 “Encoding Scheme” vs. “Repertoire” An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding schemes. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, & “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal). 1.2 MARC8 "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one byte tables in which each character is encoded using a single 8 bit byte. (It also includes the EACC set which actually uses fixed length 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only.
    [Show full text]
  • Proposal to Encode 0D5F MALAYALAM LETTER ARCHAIC II
    Proposal to encode 0D5F MALAYALAM LETTER ARCHAIC II Shriramana Sharma, jamadagni-at-gmail-dot-com, India 2012-May-22 §1. Introduction In the Malayalam Unicode encoding, the independent letter form for the long vowel Ī is: ഈ where the length mark ◌ൗ is appended to the short vowel ഇ to parallel the symbols for U/UU i.e. ഉ/ഊ. However, Ī was originally written as: As these are entirely different representations of Ī, although their sound value may be the same, it is proposed to encode the archaic form as a separate character. §2. Background As the core Unicode encoding for the major Indic scripts is based on ISCII, and the ISCII code chart for Malayalam only contained the modern form for the independent Ī [1, p 24]: … thus it is this written form that came to be encoded as 0D08 MALAYALAM LETTER II. While this “new” written form is seen in print as early as 1936 CE [2]: … there is no doubt that the much earlier form was parallel to the modern Grantha , both of which are derived from the old Grantha Ī as seen below: 1 ma - īdṛgvidhā | tatarāya īṇ Old Grantha from the Iḷaiyānputtūr copper plates [3, p 13]. Also seen is the Vatteḻuttu Ī of the same time (line 2, 2nd char from right) which also exhibits the two dots. Of course, both are derived from old South Indian Brahmi Ī which again has the two dots. It is said [entire paragraph: Radhakrishna Warrier, personal communication] that it was the poet Vaḷḷattōḷ Nārāyaṇa Mēnōn (1878–1958) who introduced the new form of Ī ഈ.
    [Show full text]
  • The Rise of Dalit Peasants Kolhi Activism in Lower Sindh
    The Rise of Dalit Peasants Kolhi Activism in Lower Sindh (Original Thesis Title) Kolhi-peasant Activism in Naon Dumbālo, Lower Sindh Creating Space for Marginalised through Multiple Channels Ghulam Hussain Mahesar Quaid-i-Azam University Department of Anthropology ii Islamabad - Pakistan Year 2014 Kolhi-Peasant Activism in Naon Dumbālo, Lower Sindh Creating Space for Marginalised through Multiple Channels Ghulam Hussain Thesis submitted to the Department of Anthropology, Quaid-i-Azam University Islamabad, in partial fulfillment of the degree of ‗Master of Philosophy in Anthropology‘ iii Quaid-i-Azam University Department of Anthropology Islamabad - Pakistan Year 2014 Formal declaration I hereby, declare that I have produced the present work by myself and without any aid other than those mentioned herein. Any ideas taken directly or indirectly from third party sources are indicated as such. This work has not been published or submitted to any other examination board in the same or a similar form. Islamabad, 25 March 2014 Mr. Ghulam Hussain Mahesar iv Final Approval of Thesis Quaid-i-Azam University Department of Anthropology Islamabad - Pakistan This is to certify that we have read the thesis submitted by Mr. Ghulam Hussain. It is our judgment that this thesis is of sufficient standard to warrant its acceptance by Quaid-i-Azam University, Islamabad for the award of the degree of ―MPhil in Anthropology‖. Committee Supervisor: Dr. Waheed Iqbal Chaudhry External Examiner: Full name of external examiner incl. title Incharge: Dr. Waheed Iqbal Chaudhry v ACKNOWLEDGEMENT This thesis is the product of cumulative effort of many teachers, scholars, and some institutions, that duly deserve to be acknowledged here.
    [Show full text]
  • Proposal for a Gujarati Script Root Zone Label Generation Ruleset (LGR)
    Proposal for a Gujarati Root Zone LGR Neo-Brahmi Generation Panel Proposal for a Gujarati Script Root Zone Label Generation Ruleset (LGR) LGR Version: 3.0 Date: 2019-03-06 Document version: 3.6 Authors: Neo-Brahmi Generation Panel [NBGP] 1 General Information/ Overview/ Abstract The purpose of this document is to give an overview of the proposed Gujarati LGR in the XML format and the rationale behind the design decisions taken. It includes a discussion of relevant features of the script, the communities or languages using it, the process and methodology used and information on the contributors. The formal specification of the LGR can be found in the accompanying XML document: proposal-gujarati-lgr-06mar19-en.xml Labels for testing can be found in the accompanying text document: gujarati-test-labels-06mar19-en.txt 2 Script for which the LGR is proposed ISO 15924 Code: Gujr ISO 15924 Key N°: 320 ISO 15924 English Name: Gujarati Latin transliteration of native script name: gujarâtî Native name of the script: ગજુ રાતી Maximal Starting Repertoire (MSR) version: MSR-4 1 Proposal for a Gujarati Root Zone LGR Neo-Brahmi Generation Panel 3 Background on the Script and the Principal Languages Using it1 Gujarati (ગજુ રાતી) [also sometimes written as Gujerati, Gujarathi, Guzratee, Guujaratee, Gujrathi, and Gujerathi2] is an Indo-Aryan language native to the Indian state of Gujarat. It is part of the greater Indo-European language family. It is so named because Gujarati is the language of the Gujjars. Gujarati's origins can be traced back to Old Gujarati (circa 1100– 1500 AD).
    [Show full text]
  • Comment on L2/13-065: Grantha Characters from Other Indic Blocks Such As Devanagari, Tamil, Bengali Are NOT the Solution
    Comment on L2/13-065: Grantha characters from other Indic blocks such as Devanagari, Tamil, Bengali are NOT the solution Dr. Naga Ganesan ([email protected]) Here is a brief comment on L2/13-065 which suggests Devanagari block for Grantha characters to write both Aryan and Dravidian languages. From L2/13-065, which suggests Devanagari block for writing Sanskrit and Dravidian languages, we read: “Similar dedicated characters were provided to cater different fonts as URUDU MARATHI SINDHI MARWARI KASHMERE etc for special conditions and needs within the shared DEVANAGARI range. There are provisions made to suit Dravidian specials also as short E short O LLLA RRA NNNA with new created glyptic forms in rhythm with DEVANAGARI glyphs (with suitable combining ortho-graphic features inside DEVANAGARI) for same special functions. These will help some aspirant proposers as a fall out who want to write every Indian language contents in Grantham when it is unified with DEVANAGARI because those features were already taken care of by the presence in that DEVANAGARI. Even few objections seen in some docs claim them as Tamil characters to induce a bug inside Grantham like letters from GoTN and others become tangential because it is going to use the existing Sanskrit properties inside shared DEVANAGARI and also with entirely new different glyph forms nothing connected with Tamil. In my doc I have shown rhythmic non-confusable glyphs for Grantham LLLA RRA NNNA for which I believe there cannot be any varied opinions even by GoTN etc which is very much as Grantham and not like Tamil form at all to confuse (as provided in proposals).” Grantha in Devanagari block to write Sanskrit and Dravidian languages will not work.
    [Show full text]
  • Finite-State Script Normalization and Processing Utilities: the Nisaba Brahmic Library
    Finite-state script normalization and processing utilities: The Nisaba Brahmic library Cibu Johny† Lawrence Wolf-Sonkin‡ Alexander Gutkin† Brian Roark‡ Google Research †United Kingdom and ‡United States {cibu,wolfsonkin,agutkin,roark}@google.com Abstract In addition to such normalization issues, some scripts also have well-formedness constraints, i.e., This paper presents an open-source library for efficient low-level processing of ten ma- not all strings of Unicode characters from a single jor South Asian Brahmic scripts. The library script correspond to a valid (i.e., legible) grapheme provides a flexible and extensible framework sequence in the script. Such constraints do not ap- for supporting crucial operations on Brahmic ply in the basic Latin alphabet, where any permuta- scripts, such as NFC, visual normalization, tion of letters can be rendered as a valid string (e.g., reversible transliteration, and validity checks, for use as an acronym). The Brahmic family of implemented in Python within a finite-state scripts, however, including the Devanagari script transducer formalism. We survey some com- mon Brahmic script issues that may adversely used to write Hindi, Marathi and many other South affect the performance of downstream NLP Asian languages, do have such constraints. These tasks, and provide the rationale for finite-state scripts are alphasyllabaries, meaning that they are design and system implementation details. structured around orthographic syllables (aksara)̣ as the basic unit.1 One or more Unicode characters 1 Introduction combine when rendering one of thousands of leg- The Unicode Standard separates the representation ible aksara,̣ but many combinations do not corre- of text from its specific graphical rendering: text spond to any aksara.̣ Given a token in these scripts, is encoded as a sequence of characters, which, at one may want to (a) normalize it to a canonical presentation time are then collectively rendered form; and (b) check whether it is a well-formed into the appropriate sequence of glyphs for display.
    [Show full text]
  • N4035 Proposal to Encode the Tirhuta Script in ISO/IEC 10646
    ISO/IEC JTC1/SC2/WG2 N4035 L2/11-175R 2011-05-05 Proposal to Encode the Tirhuta Script in ISO/IEC 10646 Anshuman Pandey Department of History University of Michigan Ann Arbor, Michigan, U.S.A. [email protected] May 5, 2011 1 Introduction This is a proposal to encode the Tirhuta script in the Universal Character Set (UCS). The recommendations made here are based upon the following documents: • L2/06-226 “Request to Allocate the Maithili Script in the Unicode Roadmap” • N3765 L2/09-329 “Towards an Encoding for the Maithili Script in ISO/IEC 10646” Tirhuta was called the ‘Maithili’ script in previous documents. A discussion of the names used for the script and the rationale for the change of name is offered in Section 3.1. 2 Background Tirhuta is the traditional writing system for the Maithili language (ISO 639: mai), which is spoken in the state of Bihar in India and in the Narayani and Janakpur zones of Nepal by more than 35 million people. Maithili is a scheduled language of India and the second most spoken language in Nepal. Tirhuta is a Brahmi-based script derived from Gauḍī, or ‘Proto-Bengali’, which evolved from the Kuṭila branch of Brahmi by the 10th century . It is related to the Bengali, Newari, and Oriya scripts, which are also descended from Gauḍī, and became differentiated from them by the 14th century.1 Tirhuta remained the primary writing system for Maithili until the 20th century, at which time it was replaced by Devanagari. The Tirhuta script forms the basis of scholarly and religious scribal traditions associated with the Maithili and Sanskrit languages since the 14th century .
    [Show full text]
  • Relation Between Harappan and Brahmi Scripts
    Relation Between Harappan And Brahmi Scripts Subhajit Ganguly Email: [email protected] Copyright © Subhajit Ganguly, 2012 Abstract: Around 45 odd signs out of the total number of Harappan signs found make up almost 100 percent of the inscriptions, in some form or other, as said earlier. Out of these 45 signs, around 40 are readily distinguishable. These form an almost exclusive and unique set. The primary signs are seen to have many variants, as in Brahmi. Many of these provide us with quite a vivid picture of their evolution, depending upon the factors of time, place and usefulness. Even minor adjustments in such signs, depending upon these factors, are noteworthy. Many of the signs in this list are the same as or are very similar to the corresponding Brahmi signs. These are similarities that simply cannot arise from mere chance. It is also to be noted that the most frequently used signs in the Brahmi look so similar to the most frequent Harappan symbols. The Harappan script transformed naturally into the Brahmi, depending upon the factors channelizing evolution of scripts. Brahmi Signs: hough a few variants of the Brahmi alphabet system have been known to exist, with the evolution of Brahmi characters, the core signs are seen to be quite consistent over T time. The syntax of their usage has also been found to be roughly consistent throughout this evolution process. Lists of Brahmi numerals, vowels and consonants are provided here in fig. 2and fig. 3 , respectively: 1 Fig. 2 : Brahmi numerals from 1-9. 2 Fig. 3 : Brahmi vowels and consonants.
    [Show full text]
  • Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
    Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset Brian Roarky, Lawrence Wolf-Sonkiny, Christo Kirovy, Sabrina J. Mielkez, Cibu Johnyy, Işın Demirşahiny and Keith Hally yGoogle Research zJohns Hopkins University {roark,wolfsonkin,ckirov,cibu,isin,kbhall}@google.com [email protected] Abstract This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text. Keywords: romanization, transliteration, South Asian languages 1. Introduction lingua franca, the prevalence of loanwords and code switch- Languages in South Asia – a region covering India, Pak- ing (largely but not exclusively with English) is high. istan, Bangladesh and neighboring countries – are generally Unlike translations, human generated parallel versions of written with Brahmic or Perso-Arabic scripts, but are also text in the native scripts of these languages and romanized often written in the Latin script, most notably for informal versions do not spontaneously occur in any meaningful vol- communication such as within SMS messages, WhatsApp, ume.
    [Show full text]