Khmer Ordering Rules
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Cross-Language Framework for Word Recognition and Spotting of Indic Scripts
Cross-language Framework for Word Recognition and Spotting of Indic Scripts aAyan Kumar Bhunia, bPartha Pratim Roy*, aAkash Mohta, cUmapada Pal aDept. of ECE, Institute of Engineering & Management, Kolkata, India bDept. of CSE, Indian Institute of Technology Roorkee India cCVPR Unit, Indian Statistical Institute, Kolkata, India bemail: [email protected], TEL: +91-1332-284816 Abstract Handwritten word recognition and spotting of low-resource scripts are difficult as sufficient training data is not available and it is often expensive for collecting data of such scripts. This paper presents a novel cross language platform for handwritten word recognition and spotting for such low-resource scripts where training is performed with a sufficiently large dataset of an available script (considered as source script) and testing is done on other scripts (considered as target script). Training with one source script and testing with another script to have a reasonable result is not easy in handwriting domain due to the complex nature of handwriting variability among scripts. Also it is difficult in mapping between source and target characters when they appear in cursive word images. The proposed Indic cross language framework exploits a large resource of dataset for training and uses it for recognizing and spotting text of other target scripts where sufficient amount of training data is not available. Since, Indic scripts are mostly written in 3 zones, namely, upper, middle and lower, we employ zone-wise character (or component) mapping for efficient learning purpose. The performance of our cross- language framework depends on the extent of similarity between the source and target scripts. -
SC22/WG20 N896 L2/01-476 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale De Normalisation
SC22/WG20 N896 L2/01-476 Universal multiple-octet coded character set International organization for standardization Organisation internationale de normalisation Title: Ordering rules for Khmer Source: Kent Karlsson Date: 2001-12-19 Status: Expert contribution Document type: Working group document Action: For consideration by the UTC and JTC 1/SC 22/WG 20 1 Introduction The Khmer script in Unicode/10646 uses conjoining characters, just like the Indic scripts and Hangul. An alternative that is suitable (and much more elegant) for Khmer, but not for Indic scripts, would have been to have combining- below (and sometimes a bit to the side) consonants and combining-below independent vowels, similar to the combining-above Latin letters recently encoded, as well as how Tibetan is handled. However that is not the chosen solution, which instead uses a combining character (COENG) that makes characters conjoin (glue) like it is done for Indic (Brahmic) scripts and like Hangul Jamo, though the latter does not have a separate gluer character. In the Khmer script, using the COENG based approach, the words are formed from orthographic syllables, where an orthographic syllable has the following structure [add ligation control?]: Khmer-syllable ::= (K H)* K M* where K is a Khmer consonant (most with an inherent vowel that is pronounced only if there is no consonant, independent vowel, or dependent vowel following it in the orthographic syllable) or a Khmer independent vowel, H is the invisible Khmer conjoint former COENG, M is a combining character (including COENG, though that would be a misspelling), in particular a combining Khmer vowel (noted A below) or modifier sign. -
Finite-State Script Normalization and Processing Utilities: the Nisaba Brahmic Library
Finite-state script normalization and processing utilities: The Nisaba Brahmic library Cibu Johny† Lawrence Wolf-Sonkin‡ Alexander Gutkin† Brian Roark‡ Google Research †United Kingdom and ‡United States {cibu,wolfsonkin,agutkin,roark}@google.com Abstract In addition to such normalization issues, some scripts also have well-formedness constraints, i.e., This paper presents an open-source library for efficient low-level processing of ten ma- not all strings of Unicode characters from a single jor South Asian Brahmic scripts. The library script correspond to a valid (i.e., legible) grapheme provides a flexible and extensible framework sequence in the script. Such constraints do not ap- for supporting crucial operations on Brahmic ply in the basic Latin alphabet, where any permuta- scripts, such as NFC, visual normalization, tion of letters can be rendered as a valid string (e.g., reversible transliteration, and validity checks, for use as an acronym). The Brahmic family of implemented in Python within a finite-state scripts, however, including the Devanagari script transducer formalism. We survey some com- mon Brahmic script issues that may adversely used to write Hindi, Marathi and many other South affect the performance of downstream NLP Asian languages, do have such constraints. These tasks, and provide the rationale for finite-state scripts are alphasyllabaries, meaning that they are design and system implementation details. structured around orthographic syllables (aksara)̣ as the basic unit.1 One or more Unicode characters 1 Introduction combine when rendering one of thousands of leg- The Unicode Standard separates the representation ible aksara,̣ but many combinations do not corre- of text from its specific graphical rendering: text spond to any aksara.̣ Given a token in these scripts, is encoded as a sequence of characters, which, at one may want to (a) normalize it to a canonical presentation time are then collectively rendered form; and (b) check whether it is a well-formed into the appropriate sequence of glyphs for display. -
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset Brian Roarky, Lawrence Wolf-Sonkiny, Christo Kirovy, Sabrina J. Mielkez, Cibu Johnyy, Işın Demirşahiny and Keith Hally yGoogle Research zJohns Hopkins University {roark,wolfsonkin,ckirov,cibu,isin,kbhall}@google.com [email protected] Abstract This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text. Keywords: romanization, transliteration, South Asian languages 1. Introduction lingua franca, the prevalence of loanwords and code switch- Languages in South Asia – a region covering India, Pak- ing (largely but not exclusively with English) is high. istan, Bangladesh and neighboring countries – are generally Unlike translations, human generated parallel versions of written with Brahmic or Perso-Arabic scripts, but are also text in the native scripts of these languages and romanized often written in the Latin script, most notably for informal versions do not spontaneously occur in any meaningful vol- communication such as within SMS messages, WhatsApp, ume. -
Designing Soft Keyboards for Brahmic Scripts
Designing Soft Keyboards for Brahmic Scripts Lauren Hinkle Miguel Lezcano Jugal Kalita Colorado College University of Colorado University of Colorado Colorado Springs, CO, USA Colorado Springs, CO, USA Colorado Springs, CO, USA lauren.hinkle [email protected] [email protected] @coloradocollege.edu Abstract Soft keyboards allow a user to input text with and without the use of a physical keyboard. They Despite being spoken by a large percent- are versatile because they allow data to be input age of the world, many Indic languages through mouse clicks on an on-screen keyboard, have escaped research and development in through a touch screen on a computer, cell phone, the realm of text-input. There exist lan- or PDA, or by mapping a virtual keyboard to guages with poor or no support for typ- a standard physical keyboard. With the recent ing. Soft keyboards, because of their ease surge in popularity of touchscreen media such of installation and lack of reliance on spe- as cell phones and computers, well-designed cific hardware, are a promising solution as soft keyboards are becoming more important an input device for many languages. De- everywhere. veloping an acceptable soft keyboard re- For languages that don’t conform well to quires frequency analysis of characters in standard QWERTY based keyboards that order to design a layout that minimizes accompany most computers, soft keyboards are text-input time. This paper proposes us- especially important. Brahmic scripts (Coulmas ing various development techniques, lay- 1990) have more characters and ligatures than fit out variations, and evaluation methods usably on a standard keyboard. -
Simplified Abugidas
Simplified Abugidas Chenchen Ding, Masao Utiyama, and Eiichiro Sumita Advanced Translation Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan fchenchen.ding, mutiyama, [email protected] phonogram segmental abjad Abstract writing alphabet system An abugida is a writing system where the logogram syllabic abugida consonant letters represent syllables with Figure 1: Hierarchy of writing systems. a default vowel and other vowels are de- noted by diacritics. We investigate the fea- ណូ ន ណណន នួន ននន …ជិតណណន… sibility of recovering the original text writ- /noon/ /naen/ /nuən/ /nein/ vowel machine ten in an abugida after omitting subordi- diacritic omission learning ណន នន methods nate diacritics and merging consonant let- consonant character ters with similar phonetic values. This is merging N N … J T N N … crucial for developing more efficient in- (a) ABUGIDA SIMPLIFICATION (b) RECOVERY put methods by reducing the complexity in abugidas. Four abugidas in the south- Figure 2: Overview of the approach in this study. ern Brahmic family, i.e., Thai, Burmese, ters equally. In contrast, abjads (e.g., the Arabic Khmer, and Lao, were studied using a and Hebrew scripts) do not write most vowels ex- newswire 20; 000-sentence dataset. We plicitly. The third type, abugidas, also called al- compared the recovery performance of a phasyllabary, includes features from both segmen- support vector machine and an LSTM- tal and syllabic systems. In abugidas, consonant based recurrent neural network, finding letters represent syllables with a default vowel, that the abugida graphemes could be re- and other vowels are denoted by diacritics. -
A Hybrid of Sinhala and Tamil Scripts for Sri Lanka
Other-Letter; A hybrid of Sinhala and Tamil scripts for Sri Lanka - A Work in Progress Pathum Egodawatta, Graphic Designer, [email protected] Abstract: This paper documents and discusses the development process of a new hybrid script based on Sinhala and Tamil scripts. Sri Lanka has been in a civil war for more than 30 years. Language has been identified as the main reason for the ethnic conflict in Sri Lanka and the post-war reconciliation process has created new issues related to language. An artificial democratic distribution where both communities encounter a sense of otherness has been created with he resettlement of the Sinhalese and Tamil communities in the war-affected areas. Signage and street typography is the spaces where these people encounter the 'other' language. The task was to develop a typographic solution to dissolve the language barrier through these spaces itself. A single script, writing system and a typeface were developed, based on Sinhala and Tamil, which can be used to write and read in both languages. In addition to the development process of the project, this paper briefly discusses the socio-political background, which raised the opportunity for such script. Keywords: Hybrid Script, New Script, Ethnic conflict, New Language 1. Introduction When two or more languages co-exist in a polity, why does language become the “Object of social and Political conflict”? (Bohem, 1933) In Sri Lanka, the language was the catalyst of the conflict and the 30 yearlong war. Sinhala and Tamil are the two official languages of Sri Lanka. Sinhala is the most widely used language in the country, used by 74% of the people, while 18% use Tamil. -
The Tonal Comparative Method: Tai Tone in Historical Perspective
Abstract The Tonal Comparative Method: Tai Tone in Historical Perspective Rikker Dockum 2019 To date, the majority of attention given to sound change in lexical tone has focused on how an atonal language becomes tonal and on early stage tone development, a process known as tonogenesis. Lexical tone here refers to the systematic and obligatory variation of prosodic acoustic cues, primarily pitch height and contour, to encode contrastive lexical meaning. Perhaps the most crucial insight to date in accounting for tonogenesis is that lexically contrastive tone, a suprasegmental feature, is bom from segmental origins. What remains less studied and more poorly understood is how tone changes after it is well established in a language or language family. In the centuries following tonogenesis, tones continue to undergo splits, mergers, and random drift, both in their phonetic realization and in the phonemic categories that underlie those surface tones. How to incorporate this knowledge into such historical linguistic tasks as reconstmction, subgrouping, and language classification in a generally applicable fashion has remained elusive. The idea of reconstmcting tone, and the use of tonal evidence for language classifi cation, is not new. However, the predominant conventional wisdom has long been that tone is impenetrable by the traditional Comparative Method. This dissertation presents a new methodological approach to sound change in lexical tone for languages where tone is already firmly established. The Tonal Comparative Method is an extension of the logic of the traditional Comparative Method, and is a method for incorporating tonal evidence into historical analyses in a manner consistent with the first principles of the longstanding Comparative Method. -
17.1 Philippine Scripts
The Unicode® Standard Version 13.0 – Core Specification To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trade- mark claim, the designations have been printed with initial capital letters or in all capitals. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries. The authors and publisher have taken care in the preparation of this specification, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. © 2020 Unicode, Inc. All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction. For information regarding permissions, inquire at http://www.unicode.org/reporting.html. For information about the Unicode terms of use, please see http://www.unicode.org/copyright.html. The Unicode Standard / the Unicode Consortium; edited by the Unicode Consortium. — Version 13.0. Includes index. ISBN 978-1-936213-26-9 (http://www.unicode.org/versions/Unicode13.0.0/) 1. -
Simplified Abugidas
Simplified Abugidas Chenchen Ding, Masao Utiyama, and Eiichiro Sumita Advanced Translation Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan fchenchen.ding, mutiyama, [email protected] phonogram segmental abjad Abstract writing alphabet system An abugida is a writing system where the logogram syllabic abugida consonant letters represent syllables with Figure 1: Hierarchy of writing systems. a default vowel and other vowels are de- noted by diacritics. We investigate the fea- ណូ ន ណណន នួន ននន …ជិតណណន… sibility of recovering the original text writ- /noon/ /naen/ /nuən/ /nein/ vowel machine ten in an abugida after omitting subordi- diacritic omission learning ណន នន methods nate diacritics and merging consonant let- consonant character ters with similar phonetic values. This is merging N N … J T N N … crucial for developing more efficient in- (a) ABUGIDA SIMPLIFICATION (b) RECOVERY put methods by reducing the complexity in abugidas. Four abugidas in the south- Figure 2: Overview of the approach in this study. ern Brahmic family, i.e., Thai, Burmese, ters equally. In contrast, abjads (e.g., the Arabic Khmer, and Lao, were studied using a and Hebrew scripts) do not write most vowels ex- newswire 20; 000-sentence dataset. We plicitly. The third type, abugidas, also called al- compared the recovery performance of a phasyllabary, includes features from both segmen- support vector machine and an LSTM- tal and syllabic systems. In abugidas, consonant based recurrent neural network, finding letters represent syllables with a default vowel, that the abugida graphemes could be re- and other vowels are denoted by diacritics. -
6 Design and Evaluation of Soft Keyboards for Brahmic Scripts
1 Design and Evaluation of Soft Keyboards for Brahmic Scripts 2 LAUREN HINKLE, University of Michigan 3 ALBERT BROUILLETTE,UniversityofColorado 4 SUJAY JAYAKAR, Cornell University 6 5 LEIGH GATHINGS, MITRE Corporation 6 MIGUEL LEZCANO and JUGAL KALITA,UniversityofColorado 7 Despite being spoken by a large percentage of the world, Indic languages in general lack user-friendly 8 and efficient methods for text input. These languages have poor or no support for typing. Soft keyboards, 9 because of their ease of installation and lack of reliance on specific hardware, are a promising solution 10 as an input device for many languages. Developing an acceptable soft keyboard requires the frequency 11 analysis of characters in order to design a layout that minimizes text-input time. This article proposes the 12 use of various development techniques, layout variations, and evaluation methods for the creation of soft 13 keyboards for Brahmic scripts. We propose that using optimization techniques such as genetic algorithms 14 and multi-objective Pareto optimization to develop multi-layer keyboards will increase the speed at which 15 text can be entered. 16 Categories and Subject Descriptors: H.5.2 [Information Systems–Information Interfaces and 17 Presentation]: User Interfaces 18 General Terms: Design, Experimentation, Human Factors 19 Additional Key Words and Phrases: Genetic algorithms, soft keyboards, Indic languages, Assamese, Pareto 20 optimization, mobile devices 21 ACM Reference Format: 22 Hinkle, L., Brouillette, A., Jayakar, S., Gathings, L., Lezcano, M., and Kalita, J. 2013. Design and evaluation 23 of soft keyboards for Brahmic scripts. ACM Trans. Asian Lang. Inform. Process. 12, 2, Article 6 (June 2013), 24 37 pages. -
Brahmic Schwa-Deletion with Neural Classifiers: Experiments with Bengali
The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages 29-31 August 2018, Gurugram, India Brahmic Schwa-Deletion with Neural Classifiers: Experiments with Bengali Cibu Johny, Martin Jansche Google AI, London, United Kingdom {cibu, mjansche}@google.com Abstract Table 1: The words car and bus as loanwords in various lan- guages, shown in native orthography and transliteration The Brahmic family of writing systems is an alpha-syllabary, in which a consonant letter without an explicit vowel marker can Group Languages Orth. Trans. Orth. Trans. be ambiguous: it can either represent a consonant phoneme or Bengali কার kār@ বাস bās@ a CV syllable with an inherent vowel (“schwa”). The schwa- Hindi, Marathi, Nepali कार kār@ बस b@s@ deletion ambiguity must be resolved when converting from text 1 Gujarati કાર kār@ બસ b@s@ to an accurate phonemic representation, particularly for text-to- Sinhala කා kār බ b@s Malayalam കാർ kār ബസ് b@s speech synthesis. We situate the problem of Bengali schwa- Tamil கா kār ப p@s deletion in the larger context of grapheme-to-phoneme conver- 2 sion for Brahmic scripts and solve it using neural network clas- sifiers with graphemic features that are independent of the script The phenomenon of schwa deletion arises in several lan- and the language. Classifier training is implemented using Ten- guages written in Brahmic scripts, including in Hindi [3, 4], sorFlow and related tools. We analyze the impact of both train- Punjabi [5] in Brahmic Gurmukhi (as opposed to Perso-Arabic ing data size and trained model size, as these represent real-life Shahmukhi) script, Marathi [6, 7], and others.