The Unicode Standard Version 3.0


The Unicode Consortium

ADDISON-WESLEY
An Imprint of Addison Wesley Longman, Inc.
Reading, Massachusetts • Harlow, England • Menlo Park, California
Berkeley, California • Don Mills, Ontario • Sydney
Bonn • Amsterdam • Tokyo • Mexico City

Contents

Acknowledgments
Unicode Consortium Members and Directors
    Full Members
    Current Associate Members
    Current Liaison Members
    Current Specialist Members
    Current Individual Members
    Current Members of the Board of Directors
    Former Members of the Board of Directors
Contents
Figures
Tables

Preface
  0.1 About the Unicode Standard
      Concepts, Architecture, Conformance, and Guidelines
      Character Block Descriptions
      Charts and Index
      Appendices and Tables
      The Unicode Character Database and Technical Reports
      On the CD-ROM
  0.2 Notational Conventions
      Extended BNF
      Operators
  0.3 Resources
      Unicode Web Site
      Unicode Anonymous FTP Site
      Unicode Public Mailing List
      How to Contact the Unicode Consortium

1 Introduction
  1.1 Coverage
      Standards Coverage
      New Characters
  1.2 Design Basis
  1.3 Text Handling
      Interpreting Characters
      Text Elements
  1.4 The Unicode Standard and ISO/IEC 10646
  1.5 The Unicode Consortium
      The Unicode Technical Committee

2 General Structure
  2.1 Architectural Context
      Basic Text Processes
      Text Elements, Code Values, and Text Processes
      Text Processes and Encoding
  2.2 Unicode Design Principles
      Sixteen-Bit Character Codes
      Efficiency
      Characters, Not Glyphs
      Semantics
      Plain Text
      Logical Order
      Unification
      Dynamic Composition
      Equivalent Sequence
      Convertibility
  2.3 Encoding Forms
      UTF-16
      UTF-8
      Character Encoding Schemes
  2.4 Unicode Allocation
      Allocation Areas
      Codespace Assignment for Graphic Characters
      Nongraphic Characters, Reserved and Unassigned Codes
  2.5 Writing Direction
  2.6 Combining Characters
      Sequence of Base Characters and Diacritics
      Multiple Combining Characters
      Multiple Base Characters
      Spacing Clones of European Diacritical Marks
  2.7 Special Character and Noncharacter Values
      Byte Order Mark (BOM)
      Special Noncharacter Values
      Separators
      Layout and Format Control Characters
      The Replacement Character
  2.8 Controls and Control Sequences
      Control Characters
      Representing Control Sequences
  2.9 Conforming to the Unicode Standard
      Characters Not Used in a Subset
  2.10 Referencing Versions of the Unicode Standard

3 Conformance
  3.1 Conformance Requirements
      Byte Ordering
      Invalid Code Values
      Interpretation
      Modification
      Transformations
      Bidirectional Text
      Unicode Technical Reports
  3.2 Semantics
  3.3 Characters and Coded Representations
  3.4 Simple Properties
  3.5 Combination
  3.6 Decomposition
      Compatibility Decomposition
      Canonical Decomposition
  3.7 Surrogates
  3.8 Transformations
  3.9 Special Character Properties
  3.10 Canonical Ordering Behavior
      Combining Classes
      Canonical Ordering
      Use with Collation
  3.11 Conjoining Jamo Behavior
      Syllable Boundaries
      Standard Syllables
      Hangul Syllable Composition
      Hangul Syllable Decomposition
      Hangul Syllable Names
  3.12 Bidirectional Behavior
      Directional Formatting Codes
      Basic Display Algorithm
      Definitions
      Resolving Embedding Levels
      Reordering Resolved Levels
      Bidirectional Conformance
      Implementation Notes

4 Character Properties
  4.1 Case—Normative
  4.2 Combining Classes—Normative
  4.3 Directionality—Normative
  4.4 Jamo Short Names—Normative
  4.5 General Category—Normative in Part
  4.6 Numeric Value—Normative
  4.7 Mirrored—Normative
  4.8 Unicode 1.0 Names
  4.9 Mathematical Property
  4.10 Letters and Other Useful Properties

5 Implementation Guidelines
  5.1 Transcoding to Other Standards
      Issues
      Multistage Tables
      7-Bit or 8-Bit Transmission
      Mapping Table Resources
  5.2 ANSI/ISO C wchar_t
  5.3 Unknown and Missing Characters
      Unassigned and Private Use Character Codes
      Interpretable but Unrenderable Characters
      Reassigned Characters
  5.4 Handling Surrogate Pairs
  5.5 Handling Numbers
  5.6 Handling Properties
  5.7 Normalization
  5.8 Compression
  5.9 Line Handling
  5.10 Regular Expressions
  5.11 Language Information in Plain Text
      Requirements for Language Tagging
      Working with Language Tags
      Language Tags and Han Unification
  5.12 Editing and Selection
      Consistent Text Elements
  5.13 Strategies for Handling Nonspacing Marks
      Keyboard Input
      Truncation
  5.14 Rendering Nonspacing Marks
      Positioning Methods
  5.15 Locating Text Element Boundaries
      Boundary Specification
      Example Specifications
      Grapheme Boundaries
      Word Boundaries
      Line Boundaries
      Sentence Boundaries
      Random Access
  5.16 Identifiers
      Syntactic Rule
  5.17 Sorting and Searching
      Culturally Expected Sorting
      Unicode Character Equivalence
      Similar Characters
      Levels of Comparison
      Ignorable Characters
      Multiple Mappings
      Collating Out-of-Scope Characters
      Unmapped Characters
      Parameterization
      Optimizations
      Searching
      Sublinear Searching
  5.18 Case Mappings

6 Punctuation
  6.1 General Punctuation
      Punctuation: U+0020-U+00BF
      General Punctuation: U+2000-U+206F
      CJK Symbols and Punctuation: U+3000-U+303F
      CJK Compatibility Forms: U+FE30-U+FE4F
      Small Form Variants: U+FE50-U+FE6F

7 European Alphabetic Scripts
  7.1 Latin
      Letters of Basic Latin: U+0041-U+007A
      Letters of the Latin-1 Supplement: U+00C0-U+00FF
      Latin Extended-A: U+0100-U+017F
      Latin Extended-B: U+0180-U+024F
      IPA Extensions: U+0250-U+02AF
      Latin Extended Additional: U+1E00-U+1EFF
      Latin Ligatures: U+FB00-U+FB06
  7.2 Greek
      Greek: U+0370-U+03FF
      Greek Extended: U+1F00-U+1FFF
  7.3 Cyrillic
      Cyrillic: U+0400-U+04FF
  7.4 Armenian
      Armenian: U+0530-U+058F
  7.5 Georgian
      Georgian: U+10A0-U+10FF
  7.6 Runic
      Runic: U+16A0-U+16F0
  7.7 Ogham
      Ogham: U+1680-U+169F
  7.8 Modifier Letters
      Spacing Modifier Letters: U+02B0-U+02FF
  7.9 Combining Marks
      Combining Diacritical Marks: U+0300-U+036F
      Combining Marks for Symbols: U+20D0-U+20FF
      Combining Half Marks: U+FE20-U+FE2F

8 Middle Eastern Scripts
  8.1 Hebrew
      Hebrew: U+0590-U+05FF
      Alphabetic Presentation Forms: U+FB1D-U+FB4F
  8.2 Arabic
      Arabic: U+0600-U+06FF
      Cursive Joining
      Ligatures
      Arabic Presentation Forms-A: U+FB50-U+FDFF
      Arabic Presentation Forms-B: U+FE70-U+FEFF
  8.3 Syriac
      Syriac: U+0700-U+074F
      Syriac Shaping
      Syriac Cursive Joining
      Ligatures
  8.4 Thaana
      Thaana: U+0780-U+07BF

9 South and Southeast Asian Scripts
  9.1 Devanagari
      Devanagari: U+0900-U+097F
  9.2 Bengali
      Bengali: U+0980-U+09FF
  9.3 Gurmukhi
      Gurmukhi: U+0A00-U+0A7F
  9.4 Gujarati
      Gujarati: U+0A80-U+0AFF
  9.5 Oriya
      Oriya: U+0B00-U+0B7F
  9.6 Tamil
      Tamil: U+0B80-U+0BFF
  9.7 Telugu
      Telugu: U+0C00-U+0C7F
  9.8 Kannada
      Kannada: U+0C80-U+0CFF
  9.9 Malayalam
      Malayalam: U+0D00-U+0D7F
  9.10 Sinhala
      Sinhala: U+0D80-U+0DFF
  9.11 Thai
      Thai: U+0E00-U+0E7F
  9.12 Lao
      Lao: U+0E80-U+0EFF
  9.13 Tibetan
      Tibetan: U+0F00-U+0FBF
  9.14 Myanmar
      Myanmar: U+1000-U+109F
  9.15 Khmer
      Khmer: U+1780-U+17FF

10 East Asian Scripts
  10.1 Han
      CJK Unified Ideographs
      CJK Compatibility Ideographs: U+F900-U+FAFF
      Kanbun: U+3190-U+319F
      CJK and KangXi Radicals: U+2E80-U+2FD5
      Ideographic Description: U+2FF0-U+2FFB
  10.2 Hiragana
      Hiragana: U+3040-U+309F
  10.3 Katakana
      Katakana: U+30A0-U+30FF
      Halfwidth and Fullwidth Forms: U+FF00-U+FFEF
  10.4 Hangul
      Hangul Jamo: U+1100-U+11FF
      Hangul Compatibility Jamo: U+3130-U+318F
      Hangul Syllables: U+AC00-U+D7A3
  10.5 Bopomofo
      Bopomofo: U+3100-U+312F
  10.6 Yi
      Yi: U+A000-U+A4CF

11 Additional Scripts
  11.1 Ethiopic
      Ethiopic: U+1200-U+137F
  11.2 Cherokee
      Cherokee: U+13A0-U+13FF
  11.3 Canadian Aboriginal Syllabics
      Canadian Aboriginal Syllabics: U+1400-U+167F
  11.4 Mongolian
      Mongolian: U+1800-U+18AF

12 Symbols
  12.1 Currency Symbols
      Currency Symbols: U+20A0-U+20CF
  12.2 Letterlike Symbols
      Letterlike Symbols: U+2100-U+214F
  12.3 Number Forms
      Number Forms: U+2150-U+218F
      Superscripts and Subscripts: U+2070-U+209F
  12.4 Mathematical Operators
      Mathematical Operators: U+2200-U+22FF
      Arrows: U+2190-U+21FF
  12.5 Technical Symbols
      Control Pictures: U+2400-U+243F
      Miscellaneous Technical: U+2300-U+23FF
      Optical Character Recognition: U+2440-U+245F
  12.6 Geometrical Symbols
      Box Drawing: U+2500-U+257F
      Block Elements: U+2580-U+259F
      Geometric Shapes: U+25A0-U+25FF
  12.7 Miscellaneous Symbols and Dingbats
      Miscellaneous Symbols: U+2600-U+26FF
      Dingbats: U+2700-U+27BF
  12.8 Enclosed and Square
      Enclosed Alphanumerics: U+2460-U+24FF
      Enclosed CJK Letters and Months: U+3200-U+32FF
Recommended publications
  • International Standard ISO/IEC 10646
    INTERNATIONAL STANDARD ISO/IEC 10646
    Sixth edition, 2020-12
    Information technology — Universal coded character set (UCS)
    Technologies de l'information — Jeu universel de caractères codés (JUC)
    Reference number ISO/IEC 10646:2020(E)
    © ISO/IEC 2020

    CONTENTS
    1 Scope
    2 Normative references
    3 Terms and definitions
    4 Conformance
      4.1 General
      4.2 Conformance of information interchange
      4.3 Conformance of devices
    5 Electronic data attachments
    6 General structure
  • Unicode Ate My Brain
    UNICODE ATE MY BRAIN
    John Cowan, Reuters Health Information
    Copyright 2001-04 John Cowan under GNU GPL

    Copyright
    • Copyright © 2001 John Cowan
    • Licensed under the GNU General Public License
    • ABSOLUTELY NO WARRANTIES; USE AT YOUR OWN RISK
    • Portions written by Tim Bray; used by permission
    • Title devised by Smarasderagd; used by permission
    • Black and white for readability

    Abstract
    Unicode, the universal character set, is one of the foundation technologies of XML. However, it is not as widely understood as it should be, because of the unavoidable complexity of handling all of the world's writing systems, even in a fairly uniform way. This tutorial will provide the basics about using Unicode and XML to save lots of money and achieve world domination at the same time.

    Roadmap
    • Brief introduction (4 slides)
    • Before Unicode (16 slides)
    • The Unicode Standard (25 slides)
    • Encodings (11 slides)
    • XML (10 slides)
    • The Programmer's View (27 slides)
    • Points to Remember (1 slide)

    How Many Different Characters?
    a A à á â ã ä å ā ă ą a a a a a a a a a a a

    How Computers Do Text
    • Characters in computer storage are represented by “small” numbers
    • The numbers use a small number of bits: from 6 (BCD) to 21 (Unicode) to 32 (wchar_t on some Unix boxes)
    • Design choices:
      – Which numbers encode which characters
      – How to pack the numbers into bytes

    Where Does XML Come In?
    • XML is a textual data format
    • XML software is required to handle all commercially important characters in the world; a promise to “handle XML” implies a promise to be international
    • Applications can do what they want; monolingual applications can mostly ignore internationalization

    $$$ £££ ¥¥¥
    • Extra cost of building-in internationalization to a new computer application: about 20% (assuming XML and Unicode).
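    The "How Computers Do Text" slide above separates two design choices: which number stands for which character, and how that number is packed into bytes. The following is a minimal Python sketch of that distinction, not part of the original slides; the helper name `describe` and the sample characters are assumptions chosen for illustration.

```python
# Illustrative sketch (not from the slides): a character is a small number
# (its code point); an encoding decides how that number is packed into bytes.

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def describe(ch: str) -> dict:
    """Return the code point of ch and its size in three encoding forms."""
    cp = ord(ch)
    return {
        "char": ch,
        "code_point": f"U+{cp:04X}",
        "bits_needed": cp.bit_length(),
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_bytes": len(ch.encode("utf-16-be")),
        "utf32_bytes": len(ch.encode("utf-32-be")),
    }


if __name__ == "__main__":
    for ch in ["a", "A", "à", "ā", "€", "\U0001F600"]:
        print(describe(ch))
    # Number of bits needed for the largest code point
    print(MAX_CODE_POINT.bit_length())
```

    The same code point can occupy one to four bytes depending on the encoding, which is the "how to pack the numbers into bytes" choice named on the slide.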
  • Unicode and Code Page Support
    Natural for Mainframes: Unicode and Code Page Support
    Version 4.2.6 for Mainframes, October 2009

    This document applies to Natural Version 4.2.6 for Mainframes and to all subsequent releases. Specifications contained herein are subject to change and these changes will be reported in subsequent release notes or new editions. Copyright © Software AG 1979-2009. All rights reserved. The name Software AG, webMethods and all Software AG product names are either trademarks or registered trademarks of Software AG and/or Software AG USA, Inc. Other company and product names mentioned herein may be trademarks of their respective owners.

    Table of Contents
    1 Unicode and Code Page Support
    2 Introduction
        About Code Pages and Unicode
        About Unicode and Code Page Support in Natural
        ICU on Mainframe Platforms
    3 Unicode and Code Page Support in the Natural Programming Language
        Natural Data Format U for Unicode-Based Data
        Statements
        Logical
  • Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
    Assessment of Options for Handling Full Unicode Character Encodings in MARC21
    A Study for the Library of Congress
    Part 1: New Scripts
    Jack Cain, Senior Consultant, Trylus Computing, Toronto

    1 Purpose
    This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format.

    1.1 “Encoding Scheme” vs. “Repertoire”
    An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding scheme. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, & “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal).

    1.2 MARC8
    "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode, which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one-byte tables in which each character is encoded using a single 8-bit byte. (It also includes the EACC set, which actually uses a fixed length of 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only.
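    The distinction drawn above between an encoding scheme and its repertoire can be checked directly. The following is a small Python sketch, not taken from the study; the names `REPERTOIRE_SAMPLE` and `in_ascii_repertoire` are hypothetical and used only for illustration.

```python
# Illustrative sketch: ASCII assigns the codes 41, 42 and 43 (hexadecimal)
# to the characters "A", "B" and "C"; characters outside the ASCII
# repertoire have no code in that scheme at all.

REPERTOIRE_SAMPLE = "ABC"

# Hexadecimal codes of the sample under the ASCII encoding scheme
ascii_hex = REPERTOIRE_SAMPLE.encode("ascii").hex().upper()


def in_ascii_repertoire(ch: str) -> bool:
    """Return True if ch has a code in the ASCII encoding scheme."""
    try:
        ch.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for ch in REPERTOIRE_SAMPLE:
        print(f"{ch!r} -> {ch.encode('ascii')[0]:02X}")
    print(in_ascii_repertoire("é"))
```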
  • The Unicode Standard 5.2 Code Charts
    Miscellaneous Technical Range: 2300–23FF The Unicode Standard, Version 5.2 This file contains an excerpt from the character code tables and list of character names for The Unicode Standard, Version 5.2. Characters in this chart that are new for The Unicode Standard, Version 5.2 are shown in conjunction with any existing characters. For ease of reference, the new characters have been highlighted in the chart grid and in the names list. This file will not be updated with errata, or when additional characters are assigned to the Unicode Standard. See http://www.unicode.org/errata/ for an up-to-date list of errata. See http://www.unicode.org/charts/ for access to a complete list of the latest character code charts. See http://www.unicode.org/charts/PDF/Unicode-5.2/ for charts showing only the characters added in Unicode 5.2. See http://www.unicode.org/Public/5.2.0/charts/ for a complete archived file of character code charts for Unicode 5.2. Disclaimer These charts are provided as the online reference to the character contents of the Unicode Standard, Version 5.2 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. For a complete understanding of the use of the characters contained in this file, please consult the appropriate sections of The Unicode Standard, Version 5.2, online at http://www.unicode.org/versions/Unicode5.2.0/, as well as Unicode Standard Annexes #9, #11, #14, #15, #24, #29, #31, #34, #38, #41, #42, and #44, the other Unicode Technical Reports and Standards, and the Unicode Character Database, which are available online.
  • TS 126 234 V5.6.0 (2003-09) Technical Specification
    ETSI TS 126 234 V5.6.0 (2003-09) Technical Specification Universal Mobile Telecommunications System (UMTS); Transparent end-to-end streaming service; Protocols and codecs (3GPP TS 26.234 version 5.6.0 Release 5) 3GPP TS 26.234 version 5.6.0 Release 5 1 ETSI TS 126 234 V5.6.0 (2003-09) Reference RTS/TSGS-0426234v560 Keywords UMTS ETSI 650 Route des Lucioles F-06921 Sophia Antipolis Cedex - FRANCE Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16 Siret N° 348 623 562 00017 - NAF 742 C Association à but non lucratif enregistrée à la Sous-Préfecture de Grasse (06) N° 7803/88 Important notice Individual copies of the present document can be downloaded from: http://www.etsi.org The present document may be made available in more than one electronic version or in print. In any case of existing or perceived difference in contents between such versions, the reference version is the Portable Document Format (PDF). In case of dispute, the reference shall be the printing on ETSI printers of the PDF version kept on a specific network drive within ETSI Secretariat. Users of the present document should be aware that the document may be subject to revision or change of status. Information on the current status of this and other ETSI documents is available at http://portal.etsi.org/tb/status/status.asp If you find errors in the present document, send your comment to: [email protected] Copyright Notification No part may be reproduced except as authorized by written permission.
  • Campusroman Pro Presentation
    MacCampus® Fonts: CampusRoman Pro, our Unicode reference font (Unicode 7.0; .otf and .ttf)

    • Supports everything Latin, Cyrillic, Greek, and Coptic, Lisu;
    • phonetics, combining diacritics, spacing modifiers, punctuation, editorial marks, tone letters, counting rod numerals; mathematical alphanumerics;
    • transliterated Armenian, Georgian, Glagolitic, Gothic, and Old Persian Cuneiform;
    • superscripts & subscripts, currency signs, letterlike symbols, number forms, enclosed alphanumerics, dingbats...

    Contains ca. 5,000 characters, incl. 133 (!) completely new additions from Unicode v. 7.0 (July 2014), esp. for German dialectology.

    New in Unicode 7.0:
    • Latin Extended-E: letters for German dialectology and Americanist orthographies
    • Latin Extended-D: Lithuanian dialectology, Middle Vietnamese, Ewe, Volapük, Celtic epigraphy, Americanist orthographies, etc.
    • Cyrillic Supplement: Orok, Komi, Khanty letters
    • Cyrillic Extended-B: letters for Old Cyrillic and Lithuanian dialectology
    • Supplemental Punctuation: alternate, historic and reversed punctuation; double hyphen
    • Currency: Nordic Mark, Manat, Ruble
    • Combining Diacritics Extended + Combining Half Marks: German dialectology; combining half marks below
    • Combining Diacritics Supplement: superscript letters for German dialectology

    • available now • single and multi-user licenses • embeddable • OpenType and TrueType • professionally designed • from a linguist for linguists • for scholars-philologists • one weight (upright) only • Unicode 7.0 compliant

    www.maccampus.de
    www.maccampus.de/fonts/CampusRoman-Pro.htm
    © Sebastian Kempgen 2014.
  • CJK Compatibility Ideographs Range: F900–FAFF
    CJK Compatibility Ideographs Range: F900–FAFF This file contains an excerpt from the character code tables and list of character names for The Unicode Standard, Version 14.0 This file may be changed at any time without notice to reflect errata or other updates to the Unicode Standard. See https://www.unicode.org/errata/ for an up-to-date list of errata. See https://www.unicode.org/charts/ for access to a complete list of the latest character code charts. See https://www.unicode.org/charts/PDF/Unicode-14.0/ for charts showing only the characters added in Unicode 14.0. See https://www.unicode.org/Public/14.0.0/charts/ for a complete archived file of character code charts for Unicode 14.0. Disclaimer These charts are provided as the online reference to the character contents of the Unicode Standard, Version 14.0 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. For a complete understanding of the use of the characters contained in this file, please consult the appropriate sections of The Unicode Standard, Version 14.0, online at https://www.unicode.org/versions/Unicode14.0.0/, as well as Unicode Standard Annexes #9, #11, #14, #15, #24, #29, #31, #34, #38, #41, #42, #44, #45, and #50, the other Unicode Technical Reports and Standards, and the Unicode Character Database, which are available online. See https://www.unicode.org/ucd/ and https://www.unicode.org/reports/ A thorough understanding of the information contained in these additional sources is required for a successful implementation.
  • Supplemental Punctuation Range: 2E00–2E7F
    Supplemental Punctuation Range: 2E00–2E7F This file contains an excerpt from the character code tables and list of character names for The Unicode Standard, Version 14.0 This file may be changed at any time without notice to reflect errata or other updates to the Unicode Standard. See https://www.unicode.org/errata/ for an up-to-date list of errata. See https://www.unicode.org/charts/ for access to a complete list of the latest character code charts. See https://www.unicode.org/charts/PDF/Unicode-14.0/ for charts showing only the characters added in Unicode 14.0. See https://www.unicode.org/Public/14.0.0/charts/ for a complete archived file of character code charts for Unicode 14.0. Disclaimer These charts are provided as the online reference to the character contents of the Unicode Standard, Version 14.0 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. For a complete understanding of the use of the characters contained in this file, please consult the appropriate sections of The Unicode Standard, Version 14.0, online at https://www.unicode.org/versions/Unicode14.0.0/, as well as Unicode Standard Annexes #9, #11, #14, #15, #24, #29, #31, #34, #38, #41, #42, #44, #45, and #50, the other Unicode Technical Reports and Standards, and the Unicode Character Database, which are available online. See https://www.unicode.org/ucd/ and https://www.unicode.org/reports/ A thorough understanding of the information contained in these additional sources is required for a successful implementation.
  • WG2 N5125 & L2/19-386
    ISO/IEC JTC1/SC2/WG2 N5125
    L2/19-386
    Title: Proposal to change the code chart font for 21 blocks in Unicode Version 14.0
    Author: Ken Lunde
    Date: 2019-12-02

    This document proposes to change the code chart font for 21 blocks (affecting 1,762 characters) that are CJK-related or otherwise include characters that are used primarily in CJK contexts and are therefore supported in fonts for typical CJK typefaces. The suggested target for this change is Unicode Version 14.0 (2021). The purpose of this font change is four-fold:
    • Reduce the number of fonts that are necessary for code chart production.
    • Improve the consistency of the glyphs for similar or related characters by using a uniform typeface design and typeface style, at least for characters whose glyphs do not need to adhere to a particular specification.
    • Simplify code chart font management by covering as many complete blocks as possible using a single font.
    • Ensure that the font is accessible via open source.

    The actual font, CJK Symbols, will be open-sourced in both TrueType and OpenType/CFF formats under the terms of the SIL Open Font License, Version 1.1, and therefore can be updated to include glyphs for additional blocks, or to add glyphs to blocks that are already supported. This is maintenance work that I plan to take on for the foreseeable future.

    Code Tables
    This and subsequent pages include left-to-right, top-to-bottom code tables that serve as a glyph synopsis for the CJK Symbols font.
  • Fun with Unicode - an Overview About Unicode Dangers
    Fun with Unicode - an overview about Unicode dangers
    by Thomas Skora

    Overview
    ● Short Introduction to Unicode/UTF-8
    ● Fooling charset detection
    ● Ambiguous Encoding
    ● Ambiguous Characters
    ● Normalization overflows your buffer
    ● Casing breaks your XSS filter
    ● Unicode in domain names – how to shorten payloads
    ● Text Direction

    Unicode/UTF-8
    ● Unicode = Character set
    ● Encodings:
      – UTF-8: Common standard in web, …
      – UTF-16: Often used as internal representation
      – UTF-7: if the 8th bit is not safe
      – UTF-32: yes, it exists...

    UTF-8
    ● Often used in Internet communication, e.g. the web.
    ● Efficient: minimum length 1 byte
    ● Variable length: 1 to 4 bytes for valid Unicode (RFC 3629); longer sequences exist only in theory.
    ● Downwards-compatible: the first 128 code points (U+0000 to U+007F) use the ASCII encoding
    ● 1 Byte: 0xxxxxxx
    ● 2 Bytes: 110xxxxx 10xxxxxx
    ● 3 Bytes: 1110xxxx 10xxxxxx 10xxxxxx
    ● ...got it? ;-)

    UTF-16
    ● Often used for internal representation: Java, .NET, Windows, …
    ● Inefficient: minimum length per char is 2 bytes.
    ● Byte Order? Byte Order Mark! → U+FEFF
      – BOM at HTML beginning overrides character set definition in IE.
    ● Y\x00o\x00u\x00 \x00k\x00n\x00o\x00w\x00 \x00t\x00h\x00i\x00s\x00?\x00

    UTF-7
    ● Unicode characters in environments that are not 8-bit safe. Used in SMTP, NNTP, …
    ● Personal opinion: browser support was an inside job of the security industry.
    ● Why? Because: <script>alert(1)</script> == +Adw-script+AD4-alert(1)+ADw-/script+AD4-
    ● Fortunately (for the defender) support is dropped by browser vendors.

    Byte Order Mark
    ● U+FEFF
    ● Appears as:
    ● W3C says: BOM has priority over declaration
      – IE 10+11 just dropped this insecure behavior, we should expect that it comes back.
      – http://www.w3.org/International/tests/html-css/character-encoding/results-basics#precedence
      – http://www.w3.org/International/questions/qa-byte-order-mark.en#bomhow
    ● If you control the first character of an HTML document, then you also control its character set.
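    The byte patterns listed in the slides above can be reproduced with Python's built-in codecs. This is a hedged sketch rather than anything from the original talk; the helper name `utf8_bits` and the sample characters are assumptions for illustration.

```python
# Illustrative sketch of the encodings described above, using only
# Python's standard codecs.


def utf8_bits(ch: str) -> str:
    """Show the UTF-8 bytes of ch as space-separated 8-bit groups."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))


# UTF-8: the lead byte announces the sequence length, and every
# continuation byte starts with the bits 10.
one_byte = utf8_bits("A")      # pattern 0xxxxxxx
two_bytes = utf8_bits("é")     # pattern 110xxxxx 10xxxxxx
three_bytes = utf8_bits("€")   # pattern 1110xxxx 10xxxxxx 10xxxxxx

# UTF-16 (little-endian, no BOM): every ASCII character is followed by 0x00.
utf16_le = "You know this?".encode("utf-16-le")

# The byte order mark U+FEFF as it is serialized in UTF-8.
bom_utf8 = "\ufeff".encode("utf-8")

# UTF-7: markup can hide inside 7-bit-safe base64 runs.
utf7_payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
decoded = utf7_payload.decode("utf-7")

if __name__ == "__main__":
    print(one_byte, two_bytes, three_bytes, sep="\n")
    print(utf16_le)
    print(bom_utf8)
    print(decoded)
```

    Decoding is shown for UTF-7 because it is the direction that matters for a filter: the encoded payload contains no angle brackets until a consumer interprets it as UTF-7.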
  • Letterlike Symbols Range: 2100–214F
    Letterlike Symbols Range: 2100–214F This file contains an excerpt from the character code tables and list of character names for The Unicode Standard, Version 14.0 This file may be changed at any time without notice to reflect errata or other updates to the Unicode Standard. See https://www.unicode.org/errata/ for an up-to-date list of errata. See https://www.unicode.org/charts/ for access to a complete list of the latest character code charts. See https://www.unicode.org/charts/PDF/Unicode-14.0/ for charts showing only the characters added in Unicode 14.0. See https://www.unicode.org/Public/14.0.0/charts/ for a complete archived file of character code charts for Unicode 14.0. Disclaimer These charts are provided as the online reference to the character contents of the Unicode Standard, Version 14.0 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. For a complete understanding of the use of the characters contained in this file, please consult the appropriate sections of The Unicode Standard, Version 14.0, online at https://www.unicode.org/versions/Unicode14.0.0/, as well as Unicode Standard Annexes #9, #11, #14, #15, #24, #29, #31, #34, #38, #41, #42, #44, #45, and #50, the other Unicode Technical Reports and Standards, and the Unicode Character Database, which are available online. See https://www.unicode.org/ucd/ and https://www.unicode.org/reports/ A thorough understanding of the information contained in these additional sources is required for a successful implementation.