Cross-Language Framework for Word Recognition and Spotting of Indic Scripts
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
SC22/WG20 N896 L2/01-476 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale De Normalisation
SC22/WG20 N896 L2/01-476 Universal multiple-octet coded character set International organization for standardization Organisation internationale de normalisation Title: Ordering rules for Khmer Source: Kent Karlsson Date: 2001-12-19 Status: Expert contribution Document type: Working group document Action: For consideration by the UTC and JTC 1/SC 22/WG 20 1 Introduction The Khmer script in Unicode/10646 uses conjoining characters, just like the Indic scripts and Hangul. An alternative that is suitable (and much more elegant) for Khmer, but not for Indic scripts, would have been to have combining- below (and sometimes a bit to the side) consonants and combining-below independent vowels, similar to the combining-above Latin letters recently encoded, as well as how Tibetan is handled. However that is not the chosen solution, which instead uses a combining character (COENG) that makes characters conjoin (glue) like it is done for Indic (Brahmic) scripts and like Hangul Jamo, though the latter does not have a separate gluer character. In the Khmer script, using the COENG based approach, the words are formed from orthographic syllables, where an orthographic syllable has the following structure [add ligation control?]: Khmer-syllable ::= (K H)* K M* where K is a Khmer consonant (most with an inherent vowel that is pronounced only if there is no consonant, independent vowel, or dependent vowel following it in the orthographic syllable) or a Khmer independent vowel, H is the invisible Khmer conjoint former COENG, M is a combining character (including COENG, though that would be a misspelling), in particular a combining Khmer vowel (noted A below) or modifier sign. -
The What and Why of Whole Number Arithmetic: Foundational Ideas from History, Language and Societal Changes
Portland State University PDXScholar Mathematics and Statistics Faculty Fariborz Maseeh Department of Mathematics Publications and Presentations and Statistics 3-2018 The What and Why of Whole Number Arithmetic: Foundational Ideas from History, Language and Societal Changes Xu Hu Sun University of Macau Christine Chambris Université de Cergy-Pontoise Judy Sayers Stockholm University Man Keung Siu University of Hong Kong Jason Cooper Weizmann Institute of Science SeeFollow next this page and for additional additional works authors at: https:/ /pdxscholar.library.pdx.edu/mth_fac Part of the Science and Mathematics Education Commons Let us know how access to this document benefits ou.y Citation Details Sun X.H. et al. (2018) The What and Why of Whole Number Arithmetic: Foundational Ideas from History, Language and Societal Changes. In: Bartolini Bussi M., Sun X. (eds) Building the Foundation: Whole Numbers in the Primary Grades. New ICMI Study Series. Springer, Cham This Book Chapter is brought to you for free and open access. It has been accepted for inclusion in Mathematics and Statistics Faculty Publications and Presentations by an authorized administrator of PDXScholar. Please contact us if we can make this document more accessible: [email protected]. Authors Xu Hu Sun, Christine Chambris, Judy Sayers, Man Keung Siu, Jason Cooper, Jean-Luc Dorier, Sarah Inés González de Lora Sued, Eva Thanheiser, Nadia Azrou, Lynn McGarvey, Catherine Houdement, and Lisser Rye Ejersbo This book chapter is available at PDXScholar: https://pdxscholar.library.pdx.edu/mth_fac/253 Chapter 5 The What and Why of Whole Number Arithmetic: Foundational Ideas from History, Language and Societal Changes Xu Hua Sun , Christine Chambris Judy Sayers, Man Keung Siu, Jason Cooper , Jean-Luc Dorier , Sarah Inés González de Lora Sued , Eva Thanheiser , Nadia Azrou , Lynn McGarvey , Catherine Houdement , and Lisser Rye Ejersbo 5.1 Introduction Mathematics learning and teaching are deeply embedded in history, language and culture (e.g. -
Tai Lü / ᦺᦑᦟᦹᧉ Tai Lùe Romanization: KNAB 2012
Institute of the Estonian Language KNAB: Place Names Database 2012-10-11 Tai Lü / ᦺᦑᦟᦹᧉ Tai Lùe romanization: KNAB 2012 I. Consonant characters 1 ᦀ ’a 13 ᦌ sa 25 ᦘ pha 37 ᦤ da A 2 ᦁ a 14 ᦍ ya 26 ᦙ ma 38 ᦥ ba A 3 ᦂ k’a 15 ᦎ t’a 27 ᦚ f’a 39 ᦦ kw’a 4 ᦃ kh’a 16 ᦏ th’a 28 ᦛ v’a 40 ᦧ khw’a 5 ᦄ ng’a 17 ᦐ n’a 29 ᦜ l’a 41 ᦨ kwa 6 ᦅ ka 18 ᦑ ta 30 ᦝ fa 42 ᦩ khwa A 7 ᦆ kha 19 ᦒ tha 31 ᦞ va 43 ᦪ sw’a A A 8 ᦇ nga 20 ᦓ na 32 ᦟ la 44 ᦫ swa 9 ᦈ ts’a 21 ᦔ p’a 33 ᦠ h’a 45 ᧞ lae A 10 ᦉ s’a 22 ᦕ ph’a 34 ᦡ d’a 46 ᧟ laew A 11 ᦊ y’a 23 ᦖ m’a 35 ᦢ b’a 12 ᦋ tsa 24 ᦗ pa 36 ᦣ ha A Syllable-final forms of these characters: ᧅ -k, ᧂ -ng, ᧃ -n, ᧄ -m, ᧁ -u, ᧆ -d, ᧇ -b. See also Note D to Table II. II. Vowel characters (ᦀ stands for any consonant character) C 1 ᦀ a 6 ᦀᦴ u 11 ᦀᦹ ue 16 ᦀᦽ oi A 2 ᦰ ( ) 7 ᦵᦀ e 12 ᦵᦀᦲ oe 17 ᦀᦾ awy 3 ᦀᦱ aa 8 ᦶᦀ ae 13 ᦺᦀ ai 18 ᦀᦿ uei 4 ᦀᦲ i 9 ᦷᦀ o 14 ᦀᦻ aai 19 ᦀᧀ oei B D 5 ᦀᦳ ŭ,u 10 ᦀᦸ aw 15 ᦀᦼ ui A Indicates vowel shortness in the following cases: ᦀᦲᦰ ĭ [i], ᦵᦀᦰ ĕ [e], ᦶᦀᦰ ăe [ ∎ ], ᦷᦀᦰ ŏ [o], ᦀᦸᦰ ăw [ ], ᦀᦹᦰ ŭe [ ɯ ], ᦵᦀᦲᦰ ŏe [ ]. -
Inflectional Morphology Analyzer for Sanskrit
Inflectional Morphology Analyzer for Sanskrit Girish Nath Jha, Muktanand Agrawal, Subash, Sudhir K. Mishra, Diwakar Mani, Diwakar Mishra, Manji Bhadra, Surjit K. Singh [email protected] Special Centre for Sanskrit Studies Jawaharlal Nehru University New Delhi-110067 The paper describes a Sanskrit morphological analyzer that identifies and analyzes inflected noun- forms and verb-forms in any given sandhi-free text. The system which has been developed as java servlet RDBMS can be tested at http://sanskrit.jnu.ac.in (Language Processing Tools > Sanskrit Tinanta Analyzer/Subanta Analyzer) with Sanskrit data as unicode text. Subsequently, the separate systems of subanta and ti anta will be combined into a single system of sentence analysis with karaka interpretation. Currently, the system checks and labels each word as three basic POS categories - subanta, ti anta, and avyaya. Thereafter, each subanta is sent for subanta processing based on an example database and a rule database. The verbs are examined based on a database of verb roots and forms as well by reverse morphology based on Pāini an techniques. Future enhancements include plugging in the amarakosha ( http://sanskrit.jnu.ac.in/amara ) and other noun lexicons with the subanta system. The ti anta will be enhanced by the k danta analysis module being developed separately. 1. Introduction The authors in the present paper are describing the subanta and ti anta analysis systems for Sanskrit which are currently running at http://sanskrit.jnu.ac.in . Sanskrit is a heavily inflected language, and depends on nominal and verbal inflections for communication of meaning. A fully inflected unit is called pada . -
Sanskrit Translation Now Made, Be Crowned with Success ! PUBLISHER's NOTE
“KOHAM?” (Who am I ?) Sri Ramanasramam Tiruvannamalai - 606 603. 2008 Koham ? (Who am I ?) Published by V. S. Ramanan, President, Sri Ramanasramam, Tiruvannamalai - 606 603. Phone : 04175-237200 email : [email protected] Website : www.sriramanamaharshi.org © Sri Ramanasramam Tiruvannamalai. Salutations to Sri Ramana Who resides in the Heart Lotus PREFACE TO THE FIRST EDITION Many were the devotees who were attracted to the presence of Bhagavan Sri Ramana while he was living in the Virupaksha Cave, situated in the holy region of Arunachala, the Heart Centre of the world, which confers liberation on those who (merely) think of It. They were drawn to him on account of his wonderful state of tapas - absorption in the silence of yoga samadhi, which is difficult of achievement. One among these devotees was Sivaprakasam Pillai who approached Maharshi in 1901-02 with devotion, faith and humility and prayed that he may be blessed with instructions on Reality. The questions raised over a period of time by Sivaprakasam Pillai were answered by the silent Maharshi in writing - in Tamil. The compilation (of the instruction) in the form of questions and answers has already been translated into Malayalam and many other languages. May this Sanskrit translation now made, be crowned with success ! PUBLISHER'S NOTE The Sanskrit translation of Sri Bhagavan's Who am I ? entitled Koham? by Jagadiswara Sastry was published earlier in 1945. We are happy to bring out a second edition. The special feature of this edition is that the text is in Sri Bhagavan's handwriting. A transliteration of the text into English has also been provided. -
Finite-State Script Normalization and Processing Utilities: the Nisaba Brahmic Library
Finite-state script normalization and processing utilities: The Nisaba Brahmic library Cibu Johny† Lawrence Wolf-Sonkin‡ Alexander Gutkin† Brian Roark‡ Google Research †United Kingdom and ‡United States {cibu,wolfsonkin,agutkin,roark}@google.com Abstract In addition to such normalization issues, some scripts also have well-formedness constraints, i.e., This paper presents an open-source library for efficient low-level processing of ten ma- not all strings of Unicode characters from a single jor South Asian Brahmic scripts. The library script correspond to a valid (i.e., legible) grapheme provides a flexible and extensible framework sequence in the script. Such constraints do not ap- for supporting crucial operations on Brahmic ply in the basic Latin alphabet, where any permuta- scripts, such as NFC, visual normalization, tion of letters can be rendered as a valid string (e.g., reversible transliteration, and validity checks, for use as an acronym). The Brahmic family of implemented in Python within a finite-state scripts, however, including the Devanagari script transducer formalism. We survey some com- mon Brahmic script issues that may adversely used to write Hindi, Marathi and many other South affect the performance of downstream NLP Asian languages, do have such constraints. These tasks, and provide the rationale for finite-state scripts are alphasyllabaries, meaning that they are design and system implementation details. structured around orthographic syllables (aksara)̣ as the basic unit.1 One or more Unicode characters 1 Introduction combine when rendering one of thousands of leg- The Unicode Standard separates the representation ible aksara,̣ but many combinations do not corre- of text from its specific graphical rendering: text spond to any aksara.̣ Given a token in these scripts, is encoded as a sequence of characters, which, at one may want to (a) normalize it to a canonical presentation time are then collectively rendered form; and (b) check whether it is a well-formed into the appropriate sequence of glyphs for display. -
Vidyut Documentation
Vidyut A Phonetic Keyboard for Sanskrit Developers: Adolf von Württemberg ([email protected]) Les Morgan ([email protected]) Download: http://www.mywhatever.com/sanskrit/vidyut Current version: 5.0 (August 27, 2010) Requirements: Windows XP, Vista, Windows 7; 32-bit or 64-bit License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported http://creativecommons.org/licenses/by-nc-sa/3.0/ Vidyut Keyboard, Version 5 (Documentation as of 17 September 2010) Page 1 Contents Overview and Key Features ...................................................................................................................................... 3 Font Issues ................................................................................................................................................................... 5 Installation instructions ............................................................................................................................................. 6 Typing Method ......................................................................................................................................................... 15 Customization ........................................................................................................................................................... 18 Keyboard Encoding Table (in Keyboard key order) ............................................................................................ 19 Keyboard Layout ..................................................................................................................................................... -
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset Brian Roarky, Lawrence Wolf-Sonkiny, Christo Kirovy, Sabrina J. Mielkez, Cibu Johnyy, Işın Demirşahiny and Keith Hally yGoogle Research zJohns Hopkins University {roark,wolfsonkin,ckirov,cibu,isin,kbhall}@google.com [email protected] Abstract This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text. Keywords: romanization, transliteration, South Asian languages 1. Introduction lingua franca, the prevalence of loanwords and code switch- Languages in South Asia – a region covering India, Pak- ing (largely but not exclusively with English) is high. istan, Bangladesh and neighboring countries – are generally Unlike translations, human generated parallel versions of written with Brahmic or Perso-Arabic scripts, but are also text in the native scripts of these languages and romanized often written in the Latin script, most notably for informal versions do not spontaneously occur in any meaningful vol- communication such as within SMS messages, WhatsApp, ume. -
An Introduction to Indic Scripts
An Introduction to Indic Scripts Richard Ishida W3C [email protected] HTML version: http://www.w3.org/2002/Talks/09-ri-indic/indic-paper.html PDF version: http://www.w3.org/2002/Talks/09-ri-indic/indic-paper.pdf Introduction This paper provides an introduction to the major Indic scripts used on the Indian mainland. Those addressed in this paper include specifically Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and Telugu. I have used XHTML encoded in UTF-8 for the base version of this paper. Most of the XHTML file can be viewed if you are running Windows XP with all associated Indic font and rendering support, and the Arial Unicode MS font. For examples that require complex rendering in scripts not yet supported by this configuration, such as Bengali, Oriya, and Malayalam, I have used non- Unicode fonts supplied with Gamma's Unitype. To view all fonts as intended without the above you can view the PDF file whose URL is given above. Although the Indic scripts are often described as similar, there is a large amount of variation at the detailed implementation level. To provide a detailed account of how each Indic script implements particular features on a letter by letter basis would require too much time and space for the task at hand. Nevertheless, despite the detail variations, the basic mechanisms are to a large extent the same, and at the general level there is a great deal of similarity between these scripts. It is certainly possible to structure a discussion of the relevant features along the same lines for each of the scripts in the set. -
Designing Soft Keyboards for Brahmic Scripts
Designing Soft Keyboards for Brahmic Scripts Lauren Hinkle Miguel Lezcano Jugal Kalita Colorado College University of Colorado University of Colorado Colorado Springs, CO, USA Colorado Springs, CO, USA Colorado Springs, CO, USA lauren.hinkle [email protected] [email protected] @coloradocollege.edu Abstract Soft keyboards allow a user to input text with and without the use of a physical keyboard. They Despite being spoken by a large percent- are versatile because they allow data to be input age of the world, many Indic languages through mouse clicks on an on-screen keyboard, have escaped research and development in through a touch screen on a computer, cell phone, the realm of text-input. There exist lan- or PDA, or by mapping a virtual keyboard to guages with poor or no support for typ- a standard physical keyboard. With the recent ing. Soft keyboards, because of their ease surge in popularity of touchscreen media such of installation and lack of reliance on spe- as cell phones and computers, well-designed cific hardware, are a promising solution as soft keyboards are becoming more important an input device for many languages. De- everywhere. veloping an acceptable soft keyboard re- For languages that don’t conform well to quires frequency analysis of characters in standard QWERTY based keyboards that order to design a layout that minimizes accompany most computers, soft keyboards are text-input time. This paper proposes us- especially important. Brahmic scripts (Coulmas ing various development techniques, lay- 1990) have more characters and ligatures than fit out variations, and evaluation methods usably on a standard keyboard. -
Prof. P. Bhaskar Reddy Sri Venkateswara University, Tirupati
Component-I (A) – Personal details: Prof. P. Bhaskar Reddy Sri Venkateswara University, Tirupati. Prof. P. Bhaskar Reddy Sri Venkateswara University, Tirupati. & Dr. K. Muniratnam Director i/c, Epigraphy, ASI, Mysore Dr. Sayantani Pal Dept. of AIHC, University of Calcutta. Prof. P. Bhaskar Reddy Sri Venkateswara University, Tirupati. Component-I (B) – Description of module: Subject Name Indian Culture Paper Name Indian Epigraphy Module Name/Title Kharosthi Script Module Id IC / IEP / 15 Pre requisites Kharosthi Script – Characteristics – Origin – Objectives Different Theories – Distribution and its End Keywords E-text (Quadrant-I) : 1. Introduction Kharosthi was one of the major scripts of the Indian subcontinent in the early period. In the list of 64 scripts occurring in the Lalitavistara (3rd century CE), a text in Buddhist Hybrid Sanskrit, Kharosthi comes second after Brahmi. Thus both of them were considered to be two major scripts of the Indian subcontinent. Both Kharosthi and Brahmi are first encountered in the edicts of Asoka in the 3rd century BCE. 2. Discovery of the script and its Decipherment The script was first discovered on one side of a large number of coins bearing Greek legends on the other side from the north western part of the Indian subcontinent in the first quarter of the 19th century. Later in 1830 to 1834 two full inscriptions of the time of Kanishka bearing the same script were found at Manikiyala in Pakistan. After this discovery James Prinsep named the script as ‘Bactrian Pehelevi’ since it occurred on a number of so called ‘Bactrian’ coins. To James Prinsep the characters first looked similar to Pahlavi (Semitic) characters. -
Simplified Abugidas
Simplified Abugidas Chenchen Ding, Masao Utiyama, and Eiichiro Sumita Advanced Translation Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan fchenchen.ding, mutiyama, [email protected] phonogram segmental abjad Abstract writing alphabet system An abugida is a writing system where the logogram syllabic abugida consonant letters represent syllables with Figure 1: Hierarchy of writing systems. a default vowel and other vowels are de- noted by diacritics. We investigate the fea- ណូ ន ណណន នួន ននន …ជិតណណន… sibility of recovering the original text writ- /noon/ /naen/ /nuən/ /nein/ vowel machine ten in an abugida after omitting subordi- diacritic omission learning ណន នន methods nate diacritics and merging consonant let- consonant character ters with similar phonetic values. This is merging N N … J T N N … crucial for developing more efficient in- (a) ABUGIDA SIMPLIFICATION (b) RECOVERY put methods by reducing the complexity in abugidas. Four abugidas in the south- Figure 2: Overview of the approach in this study. ern Brahmic family, i.e., Thai, Burmese, ters equally. In contrast, abjads (e.g., the Arabic Khmer, and Lao, were studied using a and Hebrew scripts) do not write most vowels ex- newswire 20; 000-sentence dataset. We plicitly. The third type, abugidas, also called al- compared the recovery performance of a phasyllabary, includes features from both segmen- support vector machine and an LSTM- tal and syllabic systems. In abugidas, consonant based recurrent neural network, finding letters represent syllables with a default vowel, that the abugida graphemes could be re- and other vowels are denoted by diacritics.