Introduction to Unicode and Beyond Mike Mckenna Mimckenna(At)Paypal.Com Craig Cummings I18ncraig(At)Gmail.Com Tex Texin Textexin(At)Xencraft.Com V.1.4 November, 2016

Total Page:16

File Type:pdf, Size:1020Kb

Introduction to Unicode and Beyond Mike Mckenna Mimckenna(At)Paypal.Com Craig Cummings I18ncraig(At)Gmail.Com Tex Texin Textexin(At)Xencraft.Com V.1.4 November, 2016 Introduction to Unicode and Beyond Mike McKenna mimckenna(at)paypal.com Craig Cummings i18ncraig(at)gmail.com Tex Texin textexin(at)xencraft.com v.1.4 November, 2016 Yahoo! C onfidential Agenda • Brief Background • Characters – Structure – Properties – Encoding • Key Specifications for Internationalization • Unicode in the Real World • How Unicode is Evolving Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 2 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 Unicode in Brief Yahoo! C onfidential Unicode - Summary • Universal Character Set replaces – ASCII – 8-bit – Double-byte and some multibyte • Encompasses over 240 known coded character sets in use today • Covers virtually all modern business languages • Additional archaic and academic scripts • Integrated with ISO 10646 as BMP • Information: http://www.unicode.org Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 4 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 History: Character Sets • Evolved when and where the need arose • Developed for stand-alone applications • Many developed their own • Example: 3270 nightmare – A different encoding for each European country! • Now: – Over 250 character sets in use around the world – Conversion? Any to any: (n)(n-1) > 62,000 Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 5 Solution: Unicode • ISO 10646 - draft – Make everyone happy? • Each standard with own 16-bit plane – Big mess! Not feasible • Joe Becker/Xerox and XCCS • Project began 1988 • Industry Consortium formed 1991 – Xerox and Apple initially • ISO 10646 merged with Unicode 1992 Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 6 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 Some terminology … • A “glyph” is a single visual unit of text. • A “character” is a single logical unit of text. À U+00C0 • A “code point” is an integer assigned to a character. • A “character set” is an organized collection of characters with code points. • A “character encoding”is a mapping from a sequence of code points (characters) to a sequence of code units. • A “code unit” is a single logical unit of storage (like a byte, wchar_t, int16_t, etc.) Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 0xC340 – Santa C lar a 0x80– November 2016(utf-8) 7 Unicode (ISO 10646) • Code space of up to 0x10FFFF characters (about 1.1 million) • Unicode and ISO 10646 are maintained in sync. • Unicode is maintained by an industry consortium • ISO 10646 is maintained by the ISO Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 8 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 Structure of Unicode Yahoo! C onfidential Structure of Unicode § Unicode divided in equal sized regions of code points. § 17 planes (0 through 0x10), each with 65,535 characters. § Plane 0 is called the Basic Multilingual Plane (BMP). § > 99% of text in the wild lives in the BMP Unicode § Planes 1 through 0x10 are called supplementary planes. BMP Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 10 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 Unicode Allocation - BMP General Scripts Symbols Compatibility Zone CJK Misc. Private Use CJK Unified Ideographs Surrogates Hangul 00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0 FE Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 11 Unicode Allocation - BMP Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 12 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 Unicode Allocation – Plane 1 Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 13 Supplementary Planes Contain additional, rarer characters – Plane 1: Rare, historical, or extinct scripts: Cuneiform, Gothic, etc. – Plane 2: less common Chinese/Japanese/Korean ideographs – 1024 x 1024 extra code points beyond BMP Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 14 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 Extensible via Surrogates • 1024 * 1024 extra code points • Rare characters in future use • Structure: – high-surrogate U+D800 - U+DBFF – low-surrogate U+DC00 - U+DFFF – surrogate pair represents a single character • Scalar value: 0 - 10FFFF16 – N = U (no surrogates) – N = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000 Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 15 Unicode as the Universal Character Set • An organized collection of characters. • Each character has a code point aka Unicode Scalar Value (USV) • U+0041 <= hex notation Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 16 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 Organized into blocks by script Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 17 Compatibility Characters Many characters were included in Unicode for round-trip conversion compatibility with legacy encodings: ①②③45Ⅵ ¾Lj¼Nj½dž Compatibility Characters ︴︷︻︽﹁﹄ ヲィゥォェュ゙ ﺲ ﺳ ھ ض ﵬ ﷺ Presentation Forms fi fl ffi ffl ſt ﬔ legacy encoding: a term for non- Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 Unicode– Santa C lar a character– November 2016 encodings. 18 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 Coverage – Scripts • Arabic • Georgian • Armenian • Greek • Malayalam • Bengali • Gujarati • Mathematical • Bopomofo Operators • Hangul Jamo • CJK Unified • Oriya • Hangul Syllables Ideographs • Private Use • Hebrew • Combining • Spacing Modifier Diacritical Marks • Hiragana and symbols Letters • IPA Extensions • Currency Symbols • Superscripts and Subscripts • Cyrillic • Kanbun • Devanagari • Surrogates • Kannada • Dingbats • Ta mi l • Punctuation and • Katakana • Te l u g u Symbols • Lao • Thai • Latin • Tibetan Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 19 Unicode Encodings (and the Web) Yahoo! C onfidential 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 Unicode Encodings • UTF-32 – Uses 32-bit code units. – All characters are the same width. • UTF-16 – Uses 16-bit code units. – BMP characters use one 16-bit code unit. – Supplementary characters use two special 16-bit code units: a “surrogate pair”. • UTF-8 – Uses 8-bit code units (bytes!) – It’s a multi-byte encoding! – Characters use between 1 and 4 bytes. – ASCII is ASCII in UTF-8 Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 21 The Replacement Character U+FFFD • Indicates a bad byte sequence or a character � that could not be “there was a converted. character here, but now it is gone” • Equivalent to “question marks” in legacy encoding conversions Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 22 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 Special Noncharacter Values • U+FFFE – Mirror of U+FEFF, Byte Order Mark (BOM) – Strong hint that text may be byte-reversed • U+FFFF – No requirement to interpret – Should recognize as a non-character value Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 23 Byte Order Mark (BOM) U+FEFF • Originally to indicate the “byte- order” of UTF-16 code units – 0xFE FF (UTF-16BE) – 0xFF FE (UTF-16LE) • Also used as a Unicode signature by some software (Microsoft) for UTF-8 – 0xEF BB BF (XMl parsers don’t like this) Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 24 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 UTF-16 • Uses 16-bit code units (instead of the more-familiar 8-bit code unit, aka the “byte”) • BMP characters use one unit • Supplementary characters use a “surrogate pair” • special code points that don’t do anything else. • Used for uncommon range of U+10000 to U+10FFFF • Expands UTF-16 support to 1,048,576 supplementary code points 0x1251 0xD800 0xDF38 High Surrogate Low Surrogate 0xD800-DBFF 0xDC00-DFFF Unique Ranges! Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 25 Advantages and Disadvantages of UTF-16 • Most common • May not be suitable for languages and scripts all applications are encoded in the BMP. – Affected by processor – less wasteful than UTF- architecture (Big-Endian 32 vs. little-Endian) – Simpler to process – Requires different data (excepting surrogates) types in some languages (C, C++) – Commonly supported in major operating – Requires more storage, on environments, average, for Western programming languages, European scripts, ASCII, and libraries HTMl/XMl markup. Unicode Tutorial Inter nationaliz ation and U nic ode C onfer enc e 40 – Santa C lar a – November 2016 26 40th Internationalization and Unicode Conference ‹#› Santa Clara, November 2016 UTF-8 • 7-bit
Recommended publications
  • A Method to Accommodate Backward Compatibility on the Learning Application-Based Transliteration to the Balinese Script
    (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 6, 2021 A Method to Accommodate Backward Compatibility on the Learning Application-based Transliteration to the Balinese Script 1 3 4 Gede Indrawan , I Gede Nurhayata , Sariyasa I Ketut Paramarta2 Department of Computer Science Department of Balinese Language Education Universitas Pendidikan Ganesha (Undiksha) Universitas Pendidikan Ganesha (Undiksha) Singaraja, Indonesia Singaraja, Indonesia Abstract—This research proposed a method to accommodate transliteration rules (for short, the older rules) from The backward compatibility on the learning application-based Balinese Alphabet document 1 . It exposes the backward transliteration to the Balinese Script. The objective is to compatibility method to accommodate the standard accommodate the standard transliteration rules from the transliteration rules (for short, the standard rules) from the Balinese Language, Script, and Literature Advisory Agency. It is Balinese Language, Script, and Literature Advisory Agency considered as the main contribution since there has not been a [7]. This Bali Province government agency [4] carries out workaround in this research area. This multi-discipline guidance and formulates programs for the maintenance, study, collaboration work is one of the efforts to preserve digitally the development, and preservation of the Balinese Language, endangered Balinese local language knowledge in Indonesia. The Script, and Literature. proposed method covered two aspects, i.e. (1) Its backward compatibility allows for interoperability at a certain level with This study was conducted on the developed web-based the older transliteration rules; and (2) Breaking backward transliteration learning application, BaliScript, for further compatibility at a certain level is unavoidable since, for the same ubiquitous Balinese Language learning since the proposed aspect, there is a contradictory treatment between the standard method reusable for the mobile application [8], [9].
    [Show full text]
  • Iso/Iec Jtc1/Sc2/Wg2 N3022 L2/06-002
    ISO/IEC JTC1/SC2/WG2 N3022 L2/06-002 2006-01-09 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Proposal for encoding the Sundanese script in the BMP of the UCS Source: Michael Everson Status: Individual Contribution Action: For consideration by JTC1/SC2/WG2 and UTC Date: 2006-01-09 The Sundanese script, or aksara Sunda, is used for writing the Sundanese language, one of the languages of the island of Java in Indonesia. It is a descendent of the ancient Brahmi script of India, and so has many similarities with modern scripts of South Asia and Southeast Asia which are also members of that family. The script has official support and is taught in schools and is used on road signs. The SIL Ethnologue indicates that there are 27,000,000 Sundanese. Sundanese has been written in a number of scripts. Pallawa or Pra-Nagari was first used in West Java to write Sanskrit from the fifth to eighth centuries CE, and from Pallawa was derived Sunda Kuna or Old Sundanese which was used in the Sunda Kingdom from the 14th to 18th centuries. Both Javanese and Arabic script were used from the 17th to 19th centuries and the 17th to the mid-20th centuries respectively. Latin script has had currency since the 20th century. The modern Sundanese script, called Sunda Baku or Official Sundanese, was made official in 1996. The modern script itself was derived from Old Sundanese, the earliest example of which is the Prasasti Kawali stone (see Figure 1).
    [Show full text]
  • Unicode Character Properties
    Unicode character properties Document #: P1628R0 Date: 2019-06-17 Project: Programming Language C++ Audience: SG-16, LEWG Reply-to: Corentin Jabot <[email protected]> 1 Abstract We propose an API to query the properties of Unicode characters as specified by the Unicode Standard and several Unicode Technical Reports. 2 Motivation This API can be used as a foundation for various Unicode algorithms and Unicode facilities such as Unicode-aware regular expressions. Being able to query the properties of Unicode characters is important for any application hoping to correctly handle any textual content, including compilers and parsers, text editors, graphical applications, databases, messaging applications, etc. static_assert(uni::cp_script('C') == uni::script::latin); static_assert(uni::cp_block(U'[ ') == uni::block::misc_pictographs); static_assert(!uni::cp_is<uni::property::xid_start>('1')); static_assert(uni::cp_is<uni::property::xid_continue>('1')); static_assert(uni::cp_age(U'[ ') == uni::version::v10_0); static_assert(uni::cp_is<uni::property::alphabetic>(U'ß')); static_assert(uni::cp_category(U'∩') == uni::category::sm); static_assert(uni::cp_is<uni::category::lowercase_letter>('a')); static_assert(uni::cp_is<uni::category::letter>('a')); 3 Design Consideration 3.1 constexpr An important design decision of this proposal is that it is fully constexpr. Notably, the presented design allows an implementation to only link the Unicode tables that are actually used by a program. This can reduce considerably the size requirements of an Unicode-aware executable as most applications often depend on a small subset of the Unicode properties. While the complete 1 Unicode database has a substantial memory footprint, developers should not pay for the table they don’t use. It also ensures that developers can enforce a specific version of the Unicode Database at compile time and get a consistent and predictable run-time behavior.
    [Show full text]
  • A Method for Scriptio Continua Management on the Transliteration to the Balinese Script
    Turkish Journal of Computer and Mathematics Education Vol.12 No.6 (2021), 4182-4191 Research Article A Method for Scriptio Continua Management on the Transliteration to the Balinese Script G. Indrawan1, I W. Sutaya1, K. U. Ariawan1, M. S. Gitakarma1, I G. Nurhayata1, I K. Paramarta2 1Electrical Engineering and Computer Science Department 2Balinese Language Education Department Universitas Pendidikan Ganesha (Undiksha), Singaraja, Bali, Indonesia 81116 Abstract: This research proposed a method for scriptio continua management on the learning application-based transliteration to the Balinese Script. It has not been conducted yet and is important since the proposed method eases the learning of the Balinese Script as the transliteration output. This research is a technological preservation effort to the endangered Balinese local language knowledge in Indonesia through the multi- discipline collaboration between Engineering (especially on applied Computer Science) and Language discipline. The proposed method took care of two related aspects, i.e.: (1) The Balinese Script uses scriptio continua as a style of writing without spaces, so do its transliteration result to the Balinese Script; and (2) A switching management between scriptio continua and non-scriptio continua style is needed by the learning application for the ease of learning through the visual analysis of the transliteration output which is the Balinese Script. This study was conducted on the pioneering web-based transliteration learning application, BaliScript, with the Latin text as the input and the Balinese Script as the output. Through the experiment, the proposed method gave the expected transliteration result on scriptio continua management, which can switch between three styles of writing, i.e.: (1) scriptio continua; (2) unconditional non-scriptio continua; and (3) conditional non-scriptio continua.
    [Show full text]
  • Unicodestandard the UNICODE CONSORTIUM .4Vaddison
    UnicodeUnicode .o STANDARD THE UNICODE CONSORTIUM Edited by Julie D. Allen, Joe Becker, Richard Cook, Mark Davis, Michael Everson, Asmus Freytag, John H. Jenkins, Mike Ksar, Rick McGowan, Lisa Moore, Eric Muller, Markus Scherer, Michel Suignard, and Ken Whistler .4vAddison-Wesley Upper Saddle River, NJ • Boston • Indianapolis • San Francisco New York • Toronto • Montreal • London • Munich • Paris • Madrid Capetown • Sydney • Tokyo • Singapore • Mexico City Contents List of Figures xxiii List of Tables xxvii Foreword by Mark Davis xxxi Preface xxxiii Why Buy This Book xxxiii Why Upgrade to Version 5.0 xxxiv Organization of This Book xxxiv Unicode Standard Annexes xxxvi The Unicode Character Database xxxvii Unicode Technical Standards and Unicode Technical Reports xxxvii On the CD-ROM xxxvü Updates and Errata xxxviii Acknowledgments xxxix Unicode Consortium Members xlvii Unicode Consortium Liaison Members xlix Unicode Consortium Board of Directors 1 Introduction 1 1.1 Coverage 2 Standards Coverage 3 New Characters 3 1.2 Design Goals 4 1.3 Text Handling 5 Text Elements 6 2 General Structure 9 2.1 Architectural Context 9 Basic Text Processes 10 Text Elements, Characters, and Text Processes 10 Text Processes and Encoding 11 2.2 Unicode Design Principles 13 Universality 14 Efficiency 14 Characters, Not Glyphs 14 Semantics 16 Plain Text 18 Logical Order 19 Unification 21 Dynamic Composition 22 x Co ntents Stability 22 Convertibility 23 2.3 Compatibility Characters 23 Compatibility Variants 23 Compatibility Decomposable Characters 24 Mapping Compatibility
    [Show full text]
  • Package 'Rebus.Unicode'
    Package ‘rebus.unicode’ January 3, 2017 Type Package Title Unicode Extensions for the 'rebus' Package Version 0.0-2 Date 2017-01-02 Author Richard Cotton [aut, cre] Maintainer Richard Cotton <[email protected]> Description Build regular expressions piece by piece using human readable code. This package contains Unicode functionality, and is primarily intended to be used by package developers. Depends R (>= 3.1.0) Imports rebus.base (>= 0.0-2) Suggests stringi License Unlimited LazyLoad yes LazyData yes Acknowledgments Development of this package was partially funded by the Proteomics Core at Weill Cornell Medical College in Qatar <http://qatar-weill.cornell.edu>. The Core is supported by 'Biomedical Research Program' funds, a program funded by Qatar Foundation. Collate 'imports.R' 'unicode-classes.R' 'unicode-constants.R' 'unicode-general-category-classes.R' 'unicode-general-category-constants.R' 'unicode-operators.R' RoxygenNote 5.0.1 NeedsCompilation no Repository CRAN Date/Publication 2017-01-03 07:42:23 1 2 ugc_cased_letter R topics documented: ugc_cased_letter . .2 Unicode . .6 UnicodeOperators . 29 up_alphabetic . 31 Index 36 ugc_cased_letter Unicode General Categories Description Match a Unicode General Category. Usage ugc_cased_letter(lo, hi, char_class = TRUE) ugc_close_punctuation(lo, hi, char_class = TRUE) ugc_connector_punctuation(lo, hi, char_class = TRUE) ugc_control(lo, hi, char_class = TRUE) ugc_currency_symbol(lo, hi, char_class = TRUE) ugc_dash_punctuation(lo, hi, char_class = TRUE) ugc_decimal_number(lo, hi, char_class
    [Show full text]
  • L2/20-169 TO: UTC FROM: Deborah Anderson, Ken Whistler
    L2/20-169 TO: UTC FROM: Deborah Anderson, Ken Whistler, Roozbeh Pournader, Lisa Moore, Peter Constable and Liang Hai1 SUBJECT: Recommendations to UTC #164 July 2020 on Script Proposals DATE: July 21, 2020 The Script Ad Hoc group met on June 5 and July 10, 2020 in order to review proposals. The following represents feedback on proposals that were available when the group met. A table of contents is provided below. page I. EUROPE Armenian 2 Cypro-Minoan 2 Latin 4 Todhri 7 Vithkuqi 9 II. AFRICA Adinkra 9 Egyptian Hieroglyphs 10 Kore Sebeli 10 III. MIDDLE EAST Arabic 13 IV. SOUTH AND CENTRAL ASIA Brahmi 14 Gurmukhi 14 Kaithi 18 Limbu 19 Old Uyghur 20 Sinhala 20 Takri 21 Telugu 22 Tulu (Tigalari) 24 V. SOUTHEAST ASIA, INDONESIA, AND OCEANIA Balinese, Javanese, and Sundanese 24 Myanmar 25 Western Cham 28 VI. EAST ASIA Kana 30 Tangut 30 VII. SYMBOLS AND NUMERICAL NOTATION SYSTEMS Blissymbolics 31 Klingon 31 1 Also participating were Ben Yang, Craig Cornelius, Ned Holbrook, Andrew Glass, Marek Jeziorek, Norbert Lindenberg, Patrick Chew, Jan Kučera, Lawrence Wolf-Sonkin, Manish Goregaokar, and Lorna Evans. 1 Legacy Computing Symbols 32 Music Symbols 32 VIII. PUBLIC REVIEW FEEDBACK 33 I. EUROPE 1. Armenian Documents: L2/20-174 Comments on Public Review Issues (see full text of feedback in section 28) L2/20-143 Uppercase of U+0587 ARMENIAN SMALL LIGATURE ECH YIWN -- Anderson Comments: We reviewed the document L2/20-143, which responded to Public Review Feedback from Markus Scherer. The question involved the uppercase form of U+0587 ARMENIAN SMALL LIGATURE ECH-YIWN և.
    [Show full text]
  • Iso/Iec Jtc1/Sc2/Wg2 N3319r3 L2/08-015
    ISO/IEC JTC1/SC2/WG2 N3319R3 L2/08-015 2008-03-06 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Proposal for encoding the Javanese script in the UCS Source: Indonesia, Ireland, and UC Berkeley Script Encoding Initiative (Universal Scripts Project) Author: Michael Everson Status: National Body and Liaison Contribution Action: For consideration by JTC1/SC2/WG2 and UTC Replaces: N3292 Date: 2008-03-06 1. Introduction. The Javanese script, or aksara Jawa, is used for writing the Javanese language, the native language of one of the peoples of Java, known locally as basa Jawa. It is a descendent of the ancient Brahmi script of India, and so has many similarities with modern scripts of South Asia and Southeast Asia which are also members of that family. The Javanese script is also used for writing Sanskrit, Jawa Kuna (a kind of Sanskritized Javanese), and transcriptions of Kawi (the Kawi script itself is not unified with Javanese), as well as the Sundanese language, also spoken on the island of Java, and the Sasak language, spoken on the island of Lombok. Javanese script was in current use in Java until about 1945; in 1928 Bahasa Indonesia was made the national language of Indonesia and its influence eclipsed that of other languages and their scripts. Traditional Javanese texts are written on palm leaves; books of these bound together are called lontar, a word which derives from ron ‘leaf’ and tal ‘palm’. 2.1. Consonant letters. Consonants have an inherent -a vowel sound.
    [Show full text]
  • Jtc1/Sc2/Wg2 N3412 L2/08-131
    JTC1/SC2/WG2 N3412 L2/08-131 2008-04-01 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Proposal for encoding Last Resort Pictures in Plane 14 of the UCS Source: Michael Everson Status: Individual Contribution Action: For consideration by JTC1/SC2/WG2 and UTC Date: 2008-04-01 1. Introduction. The Last Resort font is a collection of glyphs which represents types of UCS characters. These glyphs are designed to allow users to recognize that an encoded value is a specific type of UCS character, a Private Use Area character, an unassigned character, or one of the illegal character codes. Apple Computer and SIL are two organizations which have shipped Last Resort fonts for some time now. Recently Apple decided to make its Last Resort font available for public distribution via the Unicode Consortium’s website (sometime after the publication of Unicode 5.1). One of the principles of a Last Resort font is that is normally displayed when no other font is available for the character in question. Similarly, the Control characters encoded at U+0000-001F cannot be represented in text, since they are intended “do” things (even if few of them do much on modern operating systems. At 2400-243F the CONTROL PICTURES block provides glyphs for those characters so that they can be discussed and displayed in text. This facility would be as valuable for the Last Resort functionality as it is for the control characters. Accordingly, this proposal requests the encoding of a new block, U+E0200-E03FF LAST RESORT PICTURES, in Plane 14 of the UCS.
    [Show full text]
  • Unidings.Pdf
    Unidings Glyphs and Icons for Blocks of The Unicode Standard O Unidings, version 13.00, March 2020 free strictly for personal, non-commercial use available under the general ufas licence Unicode Fonts for Ancient Scripts George Douros C0 Controls � � 0000..001F Basic Latin � � 0020..007F C1 Controls � � 0080..009F Latin-1 Supplement � � 00A0..00FF Latin Extended-A � � 0100..017F Latin Extended-B � � 0180..024F IPA Extensions � � 0250..02AF Spacing Modifier Leters � � 02B0..02FF Combining Diacritical Marks � � 0300..036F Greek and Coptic � � 0370..03FF Cyrillic � � 0400..04FF Cyrillic Supplement � � 0500..052F Armenian � � 0530..058F Hebrew � � 0590..05FF Arabic � � 0600..06FF Syriac � � 0700..074F Arabic Supplement � � 0750..077F Thaana � � 0780..07BF NKo � � 07C0..07FF Samaritan � � 0800..083F Mandaic � � 0840..085F Syriac Supplement � � 0860..086F � � Arabic Extended-B Arabic Extended-A � � 08A0..08FF Devanagari � � 0900..097F Bengali � � 0980..09FF Gurmukhi � � 0A00..0A7F Gujarati � � 0A80..0AFF Oriya � � 0B00..0B7F Tamil � � 0B80..0BFF Telugu � � 0C00..0C7F Kannada � � 0C80..0CFF Malayalam � � 0D00..0D7F Sinhala � � 0D80..0DFF Thai � � 0E00..0E7F Lao � � 0E80..0EFF Tibetan � � 0F00..0FFF Myanmar � � 1000..109F Georgian � � 10A0..10FF Hangul Jamo � � 1100..11FF Ethiopic � � 1200..137F Ethiopic Supplement � � 1380..139F Cherokee � � 13A0..13FF Unified Canadian Aboriginal Syllabics � � 1400..167F Ogham � � 1680..169F Runic � � 16A0..16FF Tagalog � � 1700..171F Hanunoo � � 1720..173F Buhid � � 1740..175F Tagbanwa � � 1760..177F Khmer
    [Show full text]
  • ISO/IEC JTC1/SC2/WG2 N2856 L2/04-357 Structure
    ISO/IEC JTC1/SC2/WG2 N2856 L2/04-357 2004-10-04 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation еждународная организация по стандартизации Doc Type: Working Group Document Title: Preliminary proposal for encoding the Balinese script in the UCS Source: Michael Everson Status: Individual Contribution Action: For consideration by JTC1/SC2/WG2 and UTC Date: 2004-10-04 This is a preliminary proposal to encode the Balinese script in the BMP of the UCS. The Balinese script is used for writing the Balinese language, the native language of the people of Bali. It is a descendent of the ancient Brahmic script from India; therefore it has some notable similarities with modern scripts of South Asia and Southeast Asia that also are descendent of the Brahmic script. The Balinese script is also used for writing Kawi, or Old Javanese, which had a heavy influence to Balinese language in the 11th century CE. Some Balinese words are also borrowed from Sanskrit, thus Balinese script is also used to write words from Sanskrit. Structure Vowel signs are used in a manner similar to that employed by other Brahmi-derived scripts. Consonants have an inherent /a/ vowel sound. Ordering An ordering based on the traditional order of the Javanese script: ha na ca ra ka da ta sa wa la pa ja ya ma ga ba nga has some currency. (The Javanese order is hana caraka, data sawala, padha jayanya, maga bathanga, a sentence which means ‘There were (two) emissaries, they began to fight, their valour was equal, they both fell dead’.) This proposed encoding maps directly to the Devanagari block, however, for clarity at this stage of discussion.
    [Show full text]
  • A Method for the Affixed Word Transliteration to the Balinese Script on the Learning Web Application
    Turkish Journal of Computer and Mathematics Education Vol.12 No.6 (2021), 2849-2857 Research Article A Method for the Affixed Word Transliteration to the Balinese Script on the Learning Web Application G. Indrawan1, I K. Paramarta2, D. P. Ramendra3, I G. A. Gunadi1, Sariyasa1 1Computer Science Department 2Balinese Language Education Department 3English Language Education Department Universitas Pendidikan Ganesha (Undiksha), Singaraja, Bali, Indonesia 81116 Abstract: This research proposed a method for the affixed word transliteration to the Balinese Script since there has not been studied yet and it is important since the affixed word needs to be transliterated, inevitably. This research is one of the efforts to preserve digitally the endangered Balinese local language knowledge in Indonesia through the multi-discipline collaboration between Computer Science and Language discipline. The proposed method was taken care of two related aspects, i.e.; (1) A Latin root word has its related Balinese Script root word by using default or special transliteration rule; and (2) A Latin root word with a special transliteration rule for its Balinese Script root word, also need a special transliteration rule for its affixed word. This study was conducted on the pioneering web-based transliteration learning application, BaliScript, that receives the Latin text input and outputs the Balinese Script by using the Noto Serif Balinese font with its dedicated Balinese Unicode. Through the experiment, the proposed method gave the expected transliteration results, added a certain perspective, and strengthened the transliteration knowledge. Future work is to enhance and reuse this method on the mobile computing device, as a part of the Balinese Language ubiquitous learning that supports Balinese Language education, which is a mandatory local subject from the elementary school to the high school in Bali Province.
    [Show full text]