Creating Fonts for Brahmic Scripts with Opentype and Apple Advanced Typography Muthu Nedumaran & Norbert Lindenberg
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Types of Polar Questions in Javanese Jozina VANDER KLOK 1
Types of polar questions in Javanese Jozina VANDER KLOK University of Oslo This paper describes the different grammatical strategies to form polar questions (broadly including yes-no questions and alternative questions) in Javanese, an area that has not been fully documented before. Focusing on the dialect of Javanese spoken in Paciran, Lamongan, East Java, Indonesia, yes-no questions can be formed with intonation, the particles opo, toh and iyo, or by fronting an auxiliary. Yes-no questions with narrow focus in this dialect are achieved via various syntactic positions of the particle toh in contrast to broad focus sentence- finally. Alternative questions are also formed with toh, either conjoining two constituents or with negation as a tag question. Based on these new findings in Paciran Javanese compared with Standard Javanese, the reflex of the alternative question particle is shown to co-vary with the disjunctive marker of that dialect. Additional dialectal variation concerning syntactic restrictions on auxiliary fronting is also discussed. Finally, combinations of these strategies— unexplored in any dialect—are shown to be possible (e.g., auxiliary fronting plus the particle opo) while other combinations are shown to be impossible (e.g., with the particle opo and particle toh in sentence final position). This paper serves as a benchmark for further investigation into dialectal variation across Javanese as well as into the syntax-semantics and syntax-prosody interfaces in deriving different types of yes-no questions. 1. Introduction1 Polar questions in Javanese—in any dialect—are currently not well-documented. Putting together descriptions from various sources on Standard Javanese, as spoken in the courtly centers of Yogyakarta and Surakarta/Solo, yes-no questions are noted to be formed via (i) intonation, (ii) the particle iya/yha/ya ‘yes’, (iii) the particle apa, and (iv) the particle ta (Horne 1961; Arps et al. -
Cross-Language Framework for Word Recognition and Spotting of Indic Scripts
Cross-language Framework for Word Recognition and Spotting of Indic Scripts aAyan Kumar Bhunia, bPartha Pratim Roy*, aAkash Mohta, cUmapada Pal aDept. of ECE, Institute of Engineering & Management, Kolkata, India bDept. of CSE, Indian Institute of Technology Roorkee India cCVPR Unit, Indian Statistical Institute, Kolkata, India bemail: [email protected], TEL: +91-1332-284816 Abstract Handwritten word recognition and spotting of low-resource scripts are difficult as sufficient training data is not available and it is often expensive for collecting data of such scripts. This paper presents a novel cross language platform for handwritten word recognition and spotting for such low-resource scripts where training is performed with a sufficiently large dataset of an available script (considered as source script) and testing is done on other scripts (considered as target script). Training with one source script and testing with another script to have a reasonable result is not easy in handwriting domain due to the complex nature of handwriting variability among scripts. Also it is difficult in mapping between source and target characters when they appear in cursive word images. The proposed Indic cross language framework exploits a large resource of dataset for training and uses it for recognizing and spotting text of other target scripts where sufficient amount of training data is not available. Since, Indic scripts are mostly written in 3 zones, namely, upper, middle and lower, we employ zone-wise character (or component) mapping for efficient learning purpose. The performance of our cross- language framework depends on the extent of similarity between the source and target scripts. -
75 Characters Maximum
Kannada Script LGR Proposal Introduction, Current Analysis and Next Steps Dr. U.B. Pavanaja NBGP F2F Meeting, Colombo 14 December 2017 | 1 Agenda 1 2 3 Introduction to Repertoire Analysis Within Script Kannada Script Variants 4 5 6 Cross-Script WLE Rules Current Status and Variants Next Steps for Completion | 2 Introduction to Kannada Script Population – there are about 60 million speakers of Kannada language which uses Kannada script. Geographical area - Kannada is spoken predominantly by the people of Karnataka State of India. It is also spoken by significant linguistic minorities in the states of Andhra Pradesh, Telangana, Tamil Nadu, Maharashtra, Kerala, Goa and abroad Languages written in Kannada script – Kannada, Tulu, Kodava (Coorgi), Konkani, Havyaka, Sanketi, Beary (byaari), Arebaase, Koraga | 3 Classification of Characters Swaras (vowels) Letter ಅ ಆ ಇ ಈ ಉ ಊ ಋ ಎ ಏ ಐ ಒ ಓ ಔ Vowel sign/ N/Aಾ ಾ ಾ ಾ ಾ ಾ ಾ ಾ ಾ ಾ ಾ ಾ matra Yogavahas In Kannada, all consonants Anusvara ಅಂ (vyanjanas) when written as ಕ (ka), ಖ (kha), ಗ (ga), etc. actually have a built-in vowel sign (matra) Visarga ಅಃ of vowel ಅ (a) in them. | 4 Classification of Characters Vargeeya vyanjana (structured consonants) voiceless voiceless aspirate voiced voiced aspirate nasal Velars ಕ ಖ ಗ ಘ ಙ Palatals ಚ ಛ ಜ ಝ ಞ Retroflex ಟ ಠ ಡ ಢ ಣ Dentals ತ ಥ ದ ಧ ನ Labials ಪ ಫ ಬ ಭ ಮ Avargeeya vyanjana (unstructured consonants) ಯ ರ ಱ (obsolete) ಲ ವ ಶ ಷ ಸ ಹ ಳ ೞ (obsolete) | 5 Repertoire Included-1 Sr. Unicode Glyph Character Name Unicode Indic Ref Widespread No. Code General Syllabic use ? Point Category Category [Yes/No] 1 0C82 ಂ KANNADA SIGN ANUSVARA Mc Anusvara Yes 2 0C83 ಂ KANNADA SIGN VISARGA Mc Visarga Yes 3 0C85 ಅ KANNADA LETTER A Lo Vowel Yes 4 0C86 ಆ KANNADA LETTER AA Lo Vowel Yes 5 0C87 ಇ KANNADA LETTER I Lo Vowel Yes 6 0C88 ಈ KANNADA LETTER II Lo Vowel Yes 7 0C89 ಉ KANNADA LETTER U Lo Vowel Yes 8 0C8A ಊ KANNADA LETTER UU Lo Vowel Yes KANNADA LETTER VOCALIC 9 0C8B ಋ R Lo Vowel Yes 10 0C8E ಎ KANNADA LETTER E Lo Vowel Yes | 6 Repertoire Included-2 Sr. -
Ka И @И Ka M Л @Л Ga Н @Н Ga M М @М Nga О @О Ca П
ISO/IEC JTC1/SC2/WG2 N3319R L2/07-295R 2007-09-11 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Proposal for encoding the Javanese script in the UCS Source: Michael Everson, SEI (Universal Scripts Project) Status: Individual Contribution Action: For consideration by JTC1/SC2/WG2 and UTC Replaces: N3292 Date: 2007-09-11 1. Introduction. The Javanese script, or aksara Jawa, is used for writing the Javanese language, the native language of one of the peoples of Java, known locally as basa Jawa. It is a descendent of the ancient Brahmi script of India, and so has many similarities with modern scripts of South Asia and Southeast Asia which are also members of that family. The Javanese script is also used for writing Sanskrit, Jawa Kuna (a kind of Sanskritized Javanese), and Kawi, as well as the Sundanese language, also spoken on the island of Java, and the Sasak language, spoken on the island of Lombok. Javanese script was in current use in Java until about 1945; in 1928 Bahasa Indonesia was made the national language of Indonesia and its influence eclipsed that of other languages and their scripts. Traditional Javanese texts are written on palm leaves; books of these bound together are called lontar, a word which derives from ron ‘leaf’ and tal ‘palm’. 2.1. Consonant letters. Consonants have an inherent -a vowel sound. Consonants combine with following consonants in the usual Brahmic fashion: the inherent vowel is “killed” by the PANGKON, and the follow- ing consonant is subjoined or postfixed, often with a change in shape: §£ ndha = § NA + @¿ PANGKON + £ DA-MAHAPRANA; üù n. -
Format Guide
FORMAT GUIDE OVERVIEW 139 GENERAL GUIDELINES 134 ELECTRONIC RÉSUMÉ GUIDELINES 140 STANDARDS OF MAILABILITY 140 FAIR USE GUIDELINES FOR EDUCATIONAL USE 141 AGENDA 142 ITINERARY 143 LABEL/ENVELOPE 144 BUSINESS LETTER 144 PERSONAL LETTER 145 LETTER WITH ADVANCED FEATURES 146 LETTER & MEMO SECOND PAGE 146 EMAIL 147 MEMORANDUM 148 NEWS RELEASE 149 MINUTES 150 OUTLINE 151 REPORT 152 REPORT CONTINUED 153 ENDNOTE PAGE 153 CITATIONS 154 REFERENCE PAGE 155 TABLES 156 ELECTRONIC RÉSUMÉ 157 TABLE OF CONTENTS 158 Revised 2014 Copyright FBLA-PBL 2014. All Rights Reserved. Designed by: FBLA-PBL, Inc CHAPTER MANAGEMENT HANDBOOK | 137 138 | FBLA-PBL.ORG OVERVIEW In today’s business world, communication is consistently expressed through writing. Successful businesses require a consistent message throughout the organization. A foundation of this strategy is the use of a format guide, which enables a corporation to maintain a uniform image through all its communications. Use this guide to prepare for Computer Applications and Word Processing skill events. GENERAL GUIDELINES Font Size: 11 or 12 Font Style: Times New Roman, Arial, Calibri, or Cambria Spacing: 1 space after punctuation ending a sentence (stay consistent within the document) 1 space after a semicolon 1 space after a comma 1 space after a colon (stay consistent within the document) 1 space between state abbreviation and zip code Letters: Block Style with Open Punctuation Top Margin: 2 inches Side and Bottom Margins: 1 inch Bulleted Lists: Single space individual items; double space between items -
SC22/WG20 N896 L2/01-476 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale De Normalisation
SC22/WG20 N896 L2/01-476 Universal multiple-octet coded character set International organization for standardization Organisation internationale de normalisation Title: Ordering rules for Khmer Source: Kent Karlsson Date: 2001-12-19 Status: Expert contribution Document type: Working group document Action: For consideration by the UTC and JTC 1/SC 22/WG 20 1 Introduction The Khmer script in Unicode/10646 uses conjoining characters, just like the Indic scripts and Hangul. An alternative that is suitable (and much more elegant) for Khmer, but not for Indic scripts, would have been to have combining- below (and sometimes a bit to the side) consonants and combining-below independent vowels, similar to the combining-above Latin letters recently encoded, as well as how Tibetan is handled. However that is not the chosen solution, which instead uses a combining character (COENG) that makes characters conjoin (glue) like it is done for Indic (Brahmic) scripts and like Hangul Jamo, though the latter does not have a separate gluer character. In the Khmer script, using the COENG based approach, the words are formed from orthographic syllables, where an orthographic syllable has the following structure [add ligation control?]: Khmer-syllable ::= (K H)* K M* where K is a Khmer consonant (most with an inherent vowel that is pronounced only if there is no consonant, independent vowel, or dependent vowel following it in the orthographic syllable) or a Khmer independent vowel, H is the invisible Khmer conjoint former COENG, M is a combining character (including COENG, though that would be a misspelling), in particular a combining Khmer vowel (noted A below) or modifier sign. -
A Practical Sanskrit Introductory
A Practical Sanskrit Intro ductory This print le is available from ftpftpnacaczawiknersktintropsjan Preface This course of fteen lessons is intended to lift the Englishsp eaking studentwho knows nothing of Sanskrit to the level where he can intelligently apply Monier DhatuPat ha Williams dictionary and the to the study of the scriptures The rst ve lessons cover the pronunciation of the basic Sanskrit alphab et Devanagar together with its written form in b oth and transliterated Roman ash cards are included as an aid The notes on pronunciation are largely descriptive based on mouth p osition and eort with similar English Received Pronunciation sounds oered where p ossible The next four lessons describ e vowel emb ellishments to the consonants the principles of conjunct consonants Devanagar and additions to and variations in the alphab et Lessons ten and sandhi eleven present in grid form and explain their principles in sound The next three lessons p enetrate MonierWilliams dictionary through its four levels of alphab etical order and suggest strategies for nding dicult words The artha DhatuPat ha last lesson shows the extraction of the from the and the application of this and the dictionary to the study of the scriptures In addition to the primary course the rst eleven lessons include a B section whichintro duces the student to the principles of sentence structure in this fully inected language Six declension paradigms and class conjugation in the present tense are used with a minimal vo cabulary of nineteen words In the B part of -
The Festvox Indic Frontend for Grapheme-To-Phoneme Conversion
The Festvox Indic Frontend for Grapheme-to-Phoneme Conversion Alok Parlikar, Sunayana Sitaram, Andrew Wilkinson and Alan W Black Carnegie Mellon University Pittsburgh, USA aup, ssitaram, aewilkin, [email protected] Abstract Text-to-Speech (TTS) systems convert text into phonetic pronunciations which are then processed by Acoustic Models. TTS frontends typically include text processing, lexical lookup and Grapheme-to-Phoneme (g2p) conversion stages. This paper describes the design and implementation of the Indic frontend, which provides explicit support for many major Indian languages, along with a unified framework with easy extensibility for other Indian languages. The Indic frontend handles many phenomena common to Indian languages such as schwa deletion, contextual nasalization, and voicing. It also handles multi-script synthesis between various Indian-language scripts and English. We describe experiments comparing the quality of TTS systems built using the Indic frontend to grapheme-based systems. While this frontend was designed keeping TTS in mind, it can also be used as a general g2p system for Automatic Speech Recognition. Keywords: speech synthesis, Indian language resources, pronunciation 1. Introduction in models of the spectrum and the prosody. Another prob- lem with this approach is that since each grapheme maps Intelligible and natural-sounding Text-to-Speech to a single “phoneme” in all contexts, this technique does (TTS) systems exist for a number of languages of the world not work well in the case of languages that have pronun- today. However, for low-resource, high-population lan- ciation ambiguities. We refer to this technique as “Raw guages, such as languages of the Indian subcontinent, there Graphemes.” are very few high-quality TTS systems available. -
The Unicode Cookbook for Linguists: Managing Writing Systems Using Orthography Profiles
Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch Year: 2017 The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles Moran, Steven ; Cysouw, Michael DOI: https://doi.org/10.5281/zenodo.290662 Posted at the Zurich Open Repository and Archive, University of Zurich ZORA URL: https://doi.org/10.5167/uzh-135400 Monograph The following work is licensed under a Creative Commons: Attribution 4.0 International (CC BY 4.0) License. Originally published at: Moran, Steven; Cysouw, Michael (2017). The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles. CERN Data Centre: Zenodo. DOI: https://doi.org/10.5281/zenodo.290662 The Unicode Cookbook for Linguists Managing writing systems using orthography profiles Steven Moran & Michael Cysouw Change dedication in localmetadata.tex Preface This text is meant as a practical guide for linguists, and programmers, whowork with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together. The intersection of the Unicode Standard and the International Phonetic Al- phabet is often not met without frustration by users. Nevertheless, thetwo standards have provided language researchers with a consistent computational architecture needed to process, publish and analyze data from many different languages. We bring to light common, but not always transparent, pitfalls that researchers face when working with Unicode and IPA. Our research uses quantitative methods to compare languages and uncover and clarify their phylogenetic relations. However, the majority of lexical data available from the world’s languages is in author- or document-specific orthogra- phies. -
Schwa Deletion: Investigating Improved Approach for Text-To-IPA System for Shiri Guru Granth Sahib
ISSN (Online) 2278-1021 ISSN (Print) 2319-5940 International Journal of Advanced Research in Computer and Communication Engineering Vol. 4, Issue 4, April 2015 Schwa Deletion: Investigating Improved Approach for Text-to-IPA System for Shiri Guru Granth Sahib Sandeep Kaur1, Dr. Amitoj Singh2 Pursuing M.E, CSE, Chitkara University, India 1 Associate Director, CSE, Chitkara University , India 2 Abstract: Punjabi (Omniglot) is an interesting language for more than one reasons. This is the only living Indo- Europen language which is a fully tonal language. Punjabi language is an abugida writing system, with each consonant having an inherent vowel, SCHWA sound. This sound is modifiable using vowel symbols attached to consonant bearing the vowel. Shri Guru Granth Sahib is a voluminous text of 1430 pages with 511,874 words, 1,720,345 characters, and 28,534 lines and contains hymns of 36 composers written in twenty-two languages in Gurmukhi script (Lal). In addition to text being in form of hymns and coming from so many composers belonging to different languages, what makes the language of Shri Guru Granth Sahib even more different from contemporary Punjabi. The task of developing an accurate Letter-to-Sound system is made difficult due to two further reasons: 1. Punjabi being the only tonal language 2. Historical and Cultural circumstance/period of writings in terms of historical and religious nature of text and use of words from multiple languages and non-native phonemes. The handling of schwa deletion is of great concern for development of accurate/ near perfect system, the presented work intend to report the state-of-the-art in terms of schwa deletion for Indian languages, in general and for Gurmukhi Punjabi, in particular. -
Devanagari in Luatex
Devanagari in LuaTeX Using OpenType features Ivo Geradts Kai Eigner 1. About us 2. Foreign scripts in TeX, past and present 3. OpenType features 4. OpenType rendering engines 5. ConTeXt engine 6. Example 1: Latin with mark and mkmk 7. Example 2: Arabic 8. Example 3: Devanagari 1. About us TAT Zetwerk offers academic typesetting services Specialties: - critical editions - foreign scripts PlainTeX, for two reasons: - our exotic needs - well documented 2. Foreign scripts in TeX, past and present Past: 8-bit PostScript fonts; e.g. approach for classical greek: - encoding: α = a, β = b, γ = g - ligature table: σ = s, ς = /s, ἄ = a)' - multitude of font-related files: tfm, vf, pfb, enc Present: new TeX-engines XeTeX and LuaTeX can handle OpenType fonts containing Unicode glyph table: - ἄ is unicode position 1F04 3. OpenType features OpenType fonts contain more information than traditional fonts - tfm: bounding boxes, kerning, ligaturen, etc. - OpenType additions: GPOS, GSUB, marks Introduction of XeTeX and LuaTeX removes old limitations: - vocalized Hebrew, Arabic - CJK - Devanagari 4. OpenType rendering engines OpenType fonts require rendering engine, such as: - Uniscribe (Windows) - AAT (OS X); XeTeX on OS X - Graphite (Windows and Linux); XeTeX on all platforms - ConTeXt engine: collection of Lua-scripts attached to various callbacks related to font calls and processing of node list 5. ConTeXt engine - independent of operating system - readable and modifiable - work in progress 6. Example 1: Latin script with mark and mkmk 7. Example 2: Arabic 8. Example 3: Devanagari - Introduction: what is Devanagari? - Devanagari and OpenType - Examples Devanagari Script - Sanskrit (old Indic) - Hindi - Nepali Abugida (consonant–vowel sequences are written as a unit) क = “ka” ◌ क् = “k” क + ् (halant) ◌ क = “ku” क + ु (matra) Mahabharata Syllable structure - Consonants: e.g. -
Understanding the Attack Surface and Attack Resilience of Project Spartan’S (Edge) New Edgehtml Rendering Engine
Understanding the Attack Surface and Attack Resilience of Project Spartan’s (Edge) New EdgeHTML Rendering Engine Mark Vincent Yason IBM X-Force Advanced Research yasonm[at]ph[dot]ibm[dot]com @MarkYason [v2] © 2015 IBM Corporation Agenda . Overview . Attack Surface . Exploit Mitigations . Conclusion © 2015 IBM Corporation 2 Notes . Detailed whitepaper is available . All information is based on Microsoft Edge running on 64-bit Windows 10 build 10240 (edgehtml.dll version 11.0.10240.16384) © 2015 IBM Corporation 3 Overview © 2015 IBM Corporation Overview > EdgeHTML Rendering Engine © 2015 IBM Corporation 5 Overview > EdgeHTML Attack Surface Map & Exploit Mitigations © 2015 IBM Corporation 6 Overview > Initial Recon: MSHTML and EdgeHTML . EdgeHTML is forked from Trident (MSHTML) . Problem: Quickly identify major code changes (features/functionalities) from MSHTML to EdgeHTML . One option: Diff class names and namespaces © 2015 IBM Corporation 7 Overview > Initial Recon: Diffing MSHTML and EdgeHTML (Method) © 2015 IBM Corporation 8 Overview > Initial Recon: Diffing MSHTML and EdgeHTML (Examples) . Suggests change in image support: . Suggests new DOM object types: © 2015 IBM Corporation 9 Overview > Initial Recon: Diffing MSHTML and EdgeHTML (Examples) . Suggests ported code from another rendering engine (Blink) for Web Audio support: © 2015 IBM Corporation 10 Overview > Initial Recon: Diffing MSHTML and EdgeHTML (Notes) . Further analysis needed –Renamed class/namespace results into a new namespace plus a deleted namespace . Requires availability