Introduction to Unicode and Beyond Mike Mckenna Mimckenna(At)Paypal.Com Craig Cummings I18ncraig(At)Gmail.Com Tex Texin Textexin(At)Xencraft.Com V.1.4 November, 2016
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
A Method to Accommodate Backward Compatibility on the Learning Application-Based Transliteration to the Balinese Script
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 6, 2021 A Method to Accommodate Backward Compatibility on the Learning Application-based Transliteration to the Balinese Script 1 3 4 Gede Indrawan , I Gede Nurhayata , Sariyasa I Ketut Paramarta2 Department of Computer Science Department of Balinese Language Education Universitas Pendidikan Ganesha (Undiksha) Universitas Pendidikan Ganesha (Undiksha) Singaraja, Indonesia Singaraja, Indonesia Abstract—This research proposed a method to accommodate transliteration rules (for short, the older rules) from The backward compatibility on the learning application-based Balinese Alphabet document 1 . It exposes the backward transliteration to the Balinese Script. The objective is to compatibility method to accommodate the standard accommodate the standard transliteration rules from the transliteration rules (for short, the standard rules) from the Balinese Language, Script, and Literature Advisory Agency. It is Balinese Language, Script, and Literature Advisory Agency considered as the main contribution since there has not been a [7]. This Bali Province government agency [4] carries out workaround in this research area. This multi-discipline guidance and formulates programs for the maintenance, study, collaboration work is one of the efforts to preserve digitally the development, and preservation of the Balinese Language, endangered Balinese local language knowledge in Indonesia. The Script, and Literature. proposed method covered two aspects, i.e. (1) Its backward compatibility allows for interoperability at a certain level with This study was conducted on the developed web-based the older transliteration rules; and (2) Breaking backward transliteration learning application, BaliScript, for further compatibility at a certain level is unavoidable since, for the same ubiquitous Balinese Language learning since the proposed aspect, there is a contradictory treatment between the standard method reusable for the mobile application [8], [9]. -
Iso/Iec Jtc1/Sc2/Wg2 N3022 L2/06-002
ISO/IEC JTC1/SC2/WG2 N3022 L2/06-002 2006-01-09 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Proposal for encoding the Sundanese script in the BMP of the UCS Source: Michael Everson Status: Individual Contribution Action: For consideration by JTC1/SC2/WG2 and UTC Date: 2006-01-09 The Sundanese script, or aksara Sunda, is used for writing the Sundanese language, one of the languages of the island of Java in Indonesia. It is a descendent of the ancient Brahmi script of India, and so has many similarities with modern scripts of South Asia and Southeast Asia which are also members of that family. The script has official support and is taught in schools and is used on road signs. The SIL Ethnologue indicates that there are 27,000,000 Sundanese. Sundanese has been written in a number of scripts. Pallawa or Pra-Nagari was first used in West Java to write Sanskrit from the fifth to eighth centuries CE, and from Pallawa was derived Sunda Kuna or Old Sundanese which was used in the Sunda Kingdom from the 14th to 18th centuries. Both Javanese and Arabic script were used from the 17th to 19th centuries and the 17th to the mid-20th centuries respectively. Latin script has had currency since the 20th century. The modern Sundanese script, called Sunda Baku or Official Sundanese, was made official in 1996. The modern script itself was derived from Old Sundanese, the earliest example of which is the Prasasti Kawali stone (see Figure 1). -
Unicode Character Properties
Unicode character properties Document #: P1628R0 Date: 2019-06-17 Project: Programming Language C++ Audience: SG-16, LEWG Reply-to: Corentin Jabot <[email protected]> 1 Abstract We propose an API to query the properties of Unicode characters as specified by the Unicode Standard and several Unicode Technical Reports. 2 Motivation This API can be used as a foundation for various Unicode algorithms and Unicode facilities such as Unicode-aware regular expressions. Being able to query the properties of Unicode characters is important for any application hoping to correctly handle any textual content, including compilers and parsers, text editors, graphical applications, databases, messaging applications, etc. static_assert(uni::cp_script('C') == uni::script::latin); static_assert(uni::cp_block(U'[ ') == uni::block::misc_pictographs); static_assert(!uni::cp_is<uni::property::xid_start>('1')); static_assert(uni::cp_is<uni::property::xid_continue>('1')); static_assert(uni::cp_age(U'[ ') == uni::version::v10_0); static_assert(uni::cp_is<uni::property::alphabetic>(U'ß')); static_assert(uni::cp_category(U'∩') == uni::category::sm); static_assert(uni::cp_is<uni::category::lowercase_letter>('a')); static_assert(uni::cp_is<uni::category::letter>('a')); 3 Design Consideration 3.1 constexpr An important design decision of this proposal is that it is fully constexpr. Notably, the presented design allows an implementation to only link the Unicode tables that are actually used by a program. This can reduce considerably the size requirements of an Unicode-aware executable as most applications often depend on a small subset of the Unicode properties. While the complete 1 Unicode database has a substantial memory footprint, developers should not pay for the table they don’t use. It also ensures that developers can enforce a specific version of the Unicode Database at compile time and get a consistent and predictable run-time behavior. -
A Method for Scriptio Continua Management on the Transliteration to the Balinese Script
Turkish Journal of Computer and Mathematics Education Vol.12 No.6 (2021), 4182-4191 Research Article A Method for Scriptio Continua Management on the Transliteration to the Balinese Script G. Indrawan1, I W. Sutaya1, K. U. Ariawan1, M. S. Gitakarma1, I G. Nurhayata1, I K. Paramarta2 1Electrical Engineering and Computer Science Department 2Balinese Language Education Department Universitas Pendidikan Ganesha (Undiksha), Singaraja, Bali, Indonesia 81116 Abstract: This research proposed a method for scriptio continua management on the learning application-based transliteration to the Balinese Script. It has not been conducted yet and is important since the proposed method eases the learning of the Balinese Script as the transliteration output. This research is a technological preservation effort to the endangered Balinese local language knowledge in Indonesia through the multi- discipline collaboration between Engineering (especially on applied Computer Science) and Language discipline. The proposed method took care of two related aspects, i.e.: (1) The Balinese Script uses scriptio continua as a style of writing without spaces, so do its transliteration result to the Balinese Script; and (2) A switching management between scriptio continua and non-scriptio continua style is needed by the learning application for the ease of learning through the visual analysis of the transliteration output which is the Balinese Script. This study was conducted on the pioneering web-based transliteration learning application, BaliScript, with the Latin text as the input and the Balinese Script as the output. Through the experiment, the proposed method gave the expected transliteration result on scriptio continua management, which can switch between three styles of writing, i.e.: (1) scriptio continua; (2) unconditional non-scriptio continua; and (3) conditional non-scriptio continua. -
Unicodestandard the UNICODE CONSORTIUM .4Vaddison
UnicodeUnicode .o STANDARD THE UNICODE CONSORTIUM Edited by Julie D. Allen, Joe Becker, Richard Cook, Mark Davis, Michael Everson, Asmus Freytag, John H. Jenkins, Mike Ksar, Rick McGowan, Lisa Moore, Eric Muller, Markus Scherer, Michel Suignard, and Ken Whistler .4vAddison-Wesley Upper Saddle River, NJ • Boston • Indianapolis • San Francisco New York • Toronto • Montreal • London • Munich • Paris • Madrid Capetown • Sydney • Tokyo • Singapore • Mexico City Contents List of Figures xxiii List of Tables xxvii Foreword by Mark Davis xxxi Preface xxxiii Why Buy This Book xxxiii Why Upgrade to Version 5.0 xxxiv Organization of This Book xxxiv Unicode Standard Annexes xxxvi The Unicode Character Database xxxvii Unicode Technical Standards and Unicode Technical Reports xxxvii On the CD-ROM xxxvü Updates and Errata xxxviii Acknowledgments xxxix Unicode Consortium Members xlvii Unicode Consortium Liaison Members xlix Unicode Consortium Board of Directors 1 Introduction 1 1.1 Coverage 2 Standards Coverage 3 New Characters 3 1.2 Design Goals 4 1.3 Text Handling 5 Text Elements 6 2 General Structure 9 2.1 Architectural Context 9 Basic Text Processes 10 Text Elements, Characters, and Text Processes 10 Text Processes and Encoding 11 2.2 Unicode Design Principles 13 Universality 14 Efficiency 14 Characters, Not Glyphs 14 Semantics 16 Plain Text 18 Logical Order 19 Unification 21 Dynamic Composition 22 x Co ntents Stability 22 Convertibility 23 2.3 Compatibility Characters 23 Compatibility Variants 23 Compatibility Decomposable Characters 24 Mapping Compatibility -
Package 'Rebus.Unicode'
Package ‘rebus.unicode’ January 3, 2017 Type Package Title Unicode Extensions for the 'rebus' Package Version 0.0-2 Date 2017-01-02 Author Richard Cotton [aut, cre] Maintainer Richard Cotton <[email protected]> Description Build regular expressions piece by piece using human readable code. This package contains Unicode functionality, and is primarily intended to be used by package developers. Depends R (>= 3.1.0) Imports rebus.base (>= 0.0-2) Suggests stringi License Unlimited LazyLoad yes LazyData yes Acknowledgments Development of this package was partially funded by the Proteomics Core at Weill Cornell Medical College in Qatar <http://qatar-weill.cornell.edu>. The Core is supported by 'Biomedical Research Program' funds, a program funded by Qatar Foundation. Collate 'imports.R' 'unicode-classes.R' 'unicode-constants.R' 'unicode-general-category-classes.R' 'unicode-general-category-constants.R' 'unicode-operators.R' RoxygenNote 5.0.1 NeedsCompilation no Repository CRAN Date/Publication 2017-01-03 07:42:23 1 2 ugc_cased_letter R topics documented: ugc_cased_letter . .2 Unicode . .6 UnicodeOperators . 29 up_alphabetic . 31 Index 36 ugc_cased_letter Unicode General Categories Description Match a Unicode General Category. Usage ugc_cased_letter(lo, hi, char_class = TRUE) ugc_close_punctuation(lo, hi, char_class = TRUE) ugc_connector_punctuation(lo, hi, char_class = TRUE) ugc_control(lo, hi, char_class = TRUE) ugc_currency_symbol(lo, hi, char_class = TRUE) ugc_dash_punctuation(lo, hi, char_class = TRUE) ugc_decimal_number(lo, hi, char_class -
L2/20-169 TO: UTC FROM: Deborah Anderson, Ken Whistler
L2/20-169 TO: UTC FROM: Deborah Anderson, Ken Whistler, Roozbeh Pournader, Lisa Moore, Peter Constable and Liang Hai1 SUBJECT: Recommendations to UTC #164 July 2020 on Script Proposals DATE: July 21, 2020 The Script Ad Hoc group met on June 5 and July 10, 2020 in order to review proposals. The following represents feedback on proposals that were available when the group met. A table of contents is provided below. page I. EUROPE Armenian 2 Cypro-Minoan 2 Latin 4 Todhri 7 Vithkuqi 9 II. AFRICA Adinkra 9 Egyptian Hieroglyphs 10 Kore Sebeli 10 III. MIDDLE EAST Arabic 13 IV. SOUTH AND CENTRAL ASIA Brahmi 14 Gurmukhi 14 Kaithi 18 Limbu 19 Old Uyghur 20 Sinhala 20 Takri 21 Telugu 22 Tulu (Tigalari) 24 V. SOUTHEAST ASIA, INDONESIA, AND OCEANIA Balinese, Javanese, and Sundanese 24 Myanmar 25 Western Cham 28 VI. EAST ASIA Kana 30 Tangut 30 VII. SYMBOLS AND NUMERICAL NOTATION SYSTEMS Blissymbolics 31 Klingon 31 1 Also participating were Ben Yang, Craig Cornelius, Ned Holbrook, Andrew Glass, Marek Jeziorek, Norbert Lindenberg, Patrick Chew, Jan Kučera, Lawrence Wolf-Sonkin, Manish Goregaokar, and Lorna Evans. 1 Legacy Computing Symbols 32 Music Symbols 32 VIII. PUBLIC REVIEW FEEDBACK 33 I. EUROPE 1. Armenian Documents: L2/20-174 Comments on Public Review Issues (see full text of feedback in section 28) L2/20-143 Uppercase of U+0587 ARMENIAN SMALL LIGATURE ECH YIWN -- Anderson Comments: We reviewed the document L2/20-143, which responded to Public Review Feedback from Markus Scherer. The question involved the uppercase form of U+0587 ARMENIAN SMALL LIGATURE ECH-YIWN և. -
Iso/Iec Jtc1/Sc2/Wg2 N3319r3 L2/08-015
ISO/IEC JTC1/SC2/WG2 N3319R3 L2/08-015 2008-03-06 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Proposal for encoding the Javanese script in the UCS Source: Indonesia, Ireland, and UC Berkeley Script Encoding Initiative (Universal Scripts Project) Author: Michael Everson Status: National Body and Liaison Contribution Action: For consideration by JTC1/SC2/WG2 and UTC Replaces: N3292 Date: 2008-03-06 1. Introduction. The Javanese script, or aksara Jawa, is used for writing the Javanese language, the native language of one of the peoples of Java, known locally as basa Jawa. It is a descendent of the ancient Brahmi script of India, and so has many similarities with modern scripts of South Asia and Southeast Asia which are also members of that family. The Javanese script is also used for writing Sanskrit, Jawa Kuna (a kind of Sanskritized Javanese), and transcriptions of Kawi (the Kawi script itself is not unified with Javanese), as well as the Sundanese language, also spoken on the island of Java, and the Sasak language, spoken on the island of Lombok. Javanese script was in current use in Java until about 1945; in 1928 Bahasa Indonesia was made the national language of Indonesia and its influence eclipsed that of other languages and their scripts. Traditional Javanese texts are written on palm leaves; books of these bound together are called lontar, a word which derives from ron ‘leaf’ and tal ‘palm’. 2.1. Consonant letters. Consonants have an inherent -a vowel sound. -
Jtc1/Sc2/Wg2 N3412 L2/08-131
JTC1/SC2/WG2 N3412 L2/08-131 2008-04-01 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Proposal for encoding Last Resort Pictures in Plane 14 of the UCS Source: Michael Everson Status: Individual Contribution Action: For consideration by JTC1/SC2/WG2 and UTC Date: 2008-04-01 1. Introduction. The Last Resort font is a collection of glyphs which represents types of UCS characters. These glyphs are designed to allow users to recognize that an encoded value is a specific type of UCS character, a Private Use Area character, an unassigned character, or one of the illegal character codes. Apple Computer and SIL are two organizations which have shipped Last Resort fonts for some time now. Recently Apple decided to make its Last Resort font available for public distribution via the Unicode Consortium’s website (sometime after the publication of Unicode 5.1). One of the principles of a Last Resort font is that is normally displayed when no other font is available for the character in question. Similarly, the Control characters encoded at U+0000-001F cannot be represented in text, since they are intended “do” things (even if few of them do much on modern operating systems. At 2400-243F the CONTROL PICTURES block provides glyphs for those characters so that they can be discussed and displayed in text. This facility would be as valuable for the Last Resort functionality as it is for the control characters. Accordingly, this proposal requests the encoding of a new block, U+E0200-E03FF LAST RESORT PICTURES, in Plane 14 of the UCS. -
Unidings.Pdf
Unidings Glyphs and Icons for Blocks of The Unicode Standard O Unidings, version 13.00, March 2020 free strictly for personal, non-commercial use available under the general ufas licence Unicode Fonts for Ancient Scripts George Douros C0 Controls � � 0000..001F Basic Latin � � 0020..007F C1 Controls � � 0080..009F Latin-1 Supplement � � 00A0..00FF Latin Extended-A � � 0100..017F Latin Extended-B � � 0180..024F IPA Extensions � � 0250..02AF Spacing Modifier Leters � � 02B0..02FF Combining Diacritical Marks � � 0300..036F Greek and Coptic � � 0370..03FF Cyrillic � � 0400..04FF Cyrillic Supplement � � 0500..052F Armenian � � 0530..058F Hebrew � � 0590..05FF Arabic � � 0600..06FF Syriac � � 0700..074F Arabic Supplement � � 0750..077F Thaana � � 0780..07BF NKo � � 07C0..07FF Samaritan � � 0800..083F Mandaic � � 0840..085F Syriac Supplement � � 0860..086F � � Arabic Extended-B Arabic Extended-A � � 08A0..08FF Devanagari � � 0900..097F Bengali � � 0980..09FF Gurmukhi � � 0A00..0A7F Gujarati � � 0A80..0AFF Oriya � � 0B00..0B7F Tamil � � 0B80..0BFF Telugu � � 0C00..0C7F Kannada � � 0C80..0CFF Malayalam � � 0D00..0D7F Sinhala � � 0D80..0DFF Thai � � 0E00..0E7F Lao � � 0E80..0EFF Tibetan � � 0F00..0FFF Myanmar � � 1000..109F Georgian � � 10A0..10FF Hangul Jamo � � 1100..11FF Ethiopic � � 1200..137F Ethiopic Supplement � � 1380..139F Cherokee � � 13A0..13FF Unified Canadian Aboriginal Syllabics � � 1400..167F Ogham � � 1680..169F Runic � � 16A0..16FF Tagalog � � 1700..171F Hanunoo � � 1720..173F Buhid � � 1740..175F Tagbanwa � � 1760..177F Khmer -
ISO/IEC JTC1/SC2/WG2 N2856 L2/04-357 Structure
ISO/IEC JTC1/SC2/WG2 N2856 L2/04-357 2004-10-04 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation еждународная организация по стандартизации Doc Type: Working Group Document Title: Preliminary proposal for encoding the Balinese script in the UCS Source: Michael Everson Status: Individual Contribution Action: For consideration by JTC1/SC2/WG2 and UTC Date: 2004-10-04 This is a preliminary proposal to encode the Balinese script in the BMP of the UCS. The Balinese script is used for writing the Balinese language, the native language of the people of Bali. It is a descendent of the ancient Brahmic script from India; therefore it has some notable similarities with modern scripts of South Asia and Southeast Asia that also are descendent of the Brahmic script. The Balinese script is also used for writing Kawi, or Old Javanese, which had a heavy influence to Balinese language in the 11th century CE. Some Balinese words are also borrowed from Sanskrit, thus Balinese script is also used to write words from Sanskrit. Structure Vowel signs are used in a manner similar to that employed by other Brahmi-derived scripts. Consonants have an inherent /a/ vowel sound. Ordering An ordering based on the traditional order of the Javanese script: ha na ca ra ka da ta sa wa la pa ja ya ma ga ba nga has some currency. (The Javanese order is hana caraka, data sawala, padha jayanya, maga bathanga, a sentence which means ‘There were (two) emissaries, they began to fight, their valour was equal, they both fell dead’.) This proposed encoding maps directly to the Devanagari block, however, for clarity at this stage of discussion. -
A Method for the Affixed Word Transliteration to the Balinese Script on the Learning Web Application
Turkish Journal of Computer and Mathematics Education Vol.12 No.6 (2021), 2849-2857 Research Article A Method for the Affixed Word Transliteration to the Balinese Script on the Learning Web Application G. Indrawan1, I K. Paramarta2, D. P. Ramendra3, I G. A. Gunadi1, Sariyasa1 1Computer Science Department 2Balinese Language Education Department 3English Language Education Department Universitas Pendidikan Ganesha (Undiksha), Singaraja, Bali, Indonesia 81116 Abstract: This research proposed a method for the affixed word transliteration to the Balinese Script since there has not been studied yet and it is important since the affixed word needs to be transliterated, inevitably. This research is one of the efforts to preserve digitally the endangered Balinese local language knowledge in Indonesia through the multi-discipline collaboration between Computer Science and Language discipline. The proposed method was taken care of two related aspects, i.e.; (1) A Latin root word has its related Balinese Script root word by using default or special transliteration rule; and (2) A Latin root word with a special transliteration rule for its Balinese Script root word, also need a special transliteration rule for its affixed word. This study was conducted on the pioneering web-based transliteration learning application, BaliScript, that receives the Latin text input and outputs the Balinese Script by using the Noto Serif Balinese font with its dedicated Balinese Unicode. Through the experiment, the proposed method gave the expected transliteration results, added a certain perspective, and strengthened the transliteration knowledge. Future work is to enhance and reuse this method on the mobile computing device, as a part of the Balinese Language ubiquitous learning that supports Balinese Language education, which is a mandatory local subject from the elementary school to the high school in Bali Province.