The Unicode Standard Version 3.0
The Unicode Consortium

ADDISON-WESLEY
An Imprint of Addison Wesley Longman, Inc.
Reading, Massachusetts • Harlow, England • Menlo Park, California
Berkeley, California • Don Mills, Ontario • Sydney
Bonn • Amsterdam • Tokyo • Mexico City

Contents

Acknowledgments
Unicode Consortium Members and Directors
  Full Members
  Current Associate Members
  Current Liaison Members
  Current Specialist Members
  Current Individual Members
  Current Members of the Board of Directors
  Former Members of the Board of Directors
Contents
Figures
Tables
Preface
  0.1 About the Unicode Standard
    Concepts, Architecture, Conformance, and Guidelines
    Character Block Descriptions
    Charts and Index
    Appendices and Tables
    The Unicode Character Database and Technical Reports
    On the CD-ROM
  0.2 Notational Conventions
    Extended BNF
    Operators
  0.3 Resources
    Unicode Web Site
    Unicode Anonymous FTP Site
    Unicode Public Mailing List
    How to Contact the Unicode Consortium

1 Introduction
  1.1 Coverage
    Standards Coverage
    New Characters
  1.2 Design Basis
  1.3 Text Handling
    Interpreting Characters
    Text Elements
  1.4 The Unicode Standard and ISO/IEC 10646
  1.5 The Unicode Consortium
    The Unicode Technical Committee

2 General Structure
  2.1 Architectural Context
    Basic Text Processes
    Text Elements, Code Values, and Text Processes
    Text Processes and Encoding
  2.2 Unicode Design Principles
    Sixteen-Bit Character Codes
    Efficiency
    Characters, Not Glyphs
    Semantics
    Plain Text
    Logical Order
    Unification
    Dynamic Composition
    Equivalent Sequence
    Convertibility
  2.3 Encoding Forms
    UTF-16
    UTF-8
    Character Encoding Schemes
  2.4 Unicode Allocation
    Allocation Areas
    Codespace Assignment for Graphic Characters
    Nongraphic Characters, Reserved and Unassigned Codes
  2.5 Writing Direction
  2.6 Combining Characters
    Sequence of Base Characters and Diacritics
    Multiple Combining Characters
    Multiple Base Characters
    Spacing Clones of European Diacritical Marks
  2.7 Special Character and Noncharacter Values
    Byte Order Mark (BOM)
    Special Noncharacter Values
    Separators
    Layout and Format Control Characters
    The Replacement Character
  2.8 Controls and Control Sequences
    Control Characters
    Representing Control Sequences
  2.9 Conforming to the Unicode Standard
    Characters Not Used in a Subset
  2.10 Referencing Versions of the Unicode Standard

3 Conformance
  3.1 Conformance Requirements
    Byte Ordering
    Invalid Code Values
    Interpretation
    Modification
    Transformations
    Bidirectional Text
    Unicode Technical Reports
  3.2 Semantics
  3.3 Characters and Coded Representations
  3.4 Simple Properties
  3.5 Combination
  3.6 Decomposition
    Compatibility Decomposition
    Canonical Decomposition
  3.7 Surrogates
  3.8 Transformations
  3.9 Special Character Properties
  3.10 Canonical Ordering Behavior
    Combining Classes
    Canonical Ordering
    Use with Collation
  3.11 Conjoining Jamo Behavior
    Syllable Boundaries
    Standard Syllables
    Hangul Syllable Composition
    Hangul Syllable Decomposition
    Hangul Syllable Names
  3.12 Bidirectional Behavior
    Directional Formatting Codes
    Basic Display Algorithm
    Definitions
    Resolving Embedding Levels
    Reordering Resolved Levels
    Bidirectional Conformance
    Implementation Notes

4 Character Properties
  4.1 Case - Normative
  4.2 Combining Classes - Normative
  4.3 Directionality - Normative
  4.4 Jamo Short Names - Normative
  4.5 General Category - Normative in Part
  4.6 Numeric Value - Normative
  4.7 Mirrored - Normative
  4.8 Unicode 1.0 Names
  4.9 Mathematical Property
  4.10 Letters and Other Useful Properties

5 Implementation Guidelines
  5.1 Transcoding to Other Standards
    Issues
    Multistage Tables
    7-Bit or 8-Bit Transmission
    Mapping Table Resources
  5.2 ANSI/ISO C wchar_t
  5.3 Unknown and Missing Characters
    Unassigned and Private Use Character Codes
    Interpretable but Unrenderable Characters
    Reassigned Characters
  5.4 Handling Surrogate Pairs
  5.5 Handling Numbers
  5.6 Handling Properties
  5.7 Normalization
  5.8 Compression
  5.9 Line Handling
  5.10 Regular Expressions
  5.11 Language Information in Plain Text
    Requirements for Language Tagging
    Working with Language Tags
    Language Tags and Han Unification
  5.12 Editing and Selection
    Consistent Text Elements
  5.13 Strategies for Handling Nonspacing Marks
    Keyboard Input
    Truncation
  5.14 Rendering Nonspacing Marks
    Positioning Methods
  5.15 Locating Text Element Boundaries
    Boundary Specification
    Example Specifications
    Grapheme Boundaries
    Word Boundaries
    Line Boundaries
    Sentence Boundaries
    Random Access
  5.16 Identifiers
    Syntactic Rule
  5.17 Sorting and Searching
    Culturally Expected Sorting
    Unicode Character Equivalence
    Similar Characters
    Levels of Comparison
    Ignorable Characters
    Multiple Mappings
    Collating Out-of-Scope Characters
    Unmapped Characters
    Parameterization
    Optimizations
    Searching
    Sublinear Searching
  5.18 Case Mappings

6 Punctuation
  6.1 General Punctuation
    Punctuation: U+0020-U+00BF
    General Punctuation: U+2000-U+206F
    CJK Symbols and Punctuation: U+3000-U+303F
    CJK Compatibility Forms: U+FE30-U+FE4F
    Small Form Variants: U+FE50-U+FE6F

7 European Alphabetic Scripts
  7.1 Latin
    Letters of Basic Latin: U+0041-U+007A
    Letters of the Latin-1 Supplement: U+00C0-U+00FF
    Latin Extended-A: U+0100-U+017F
    Latin Extended-B: U+0180-U+024F
    IPA Extensions: U+0250-U+02AF
    Latin Extended Additional: U+1E00-U+1EFF
    Latin Ligatures: U+FB00-U+FB06
  7.2 Greek
    Greek: U+0370-U+03FF
    Greek Extended: U+1F00-U+1FFF
  7.3 Cyrillic
    Cyrillic: U+0400-U+04FF
  7.4 Armenian
    Armenian: U+0530-U+058F
  7.5 Georgian
    Georgian: U+10A0-U+10FF
  7.6 Runic
    Runic: U+16A0-U+16F0
  7.7 Ogham
    Ogham: U+1680-U+169F
  7.8 Modifier Letters
    Spacing Modifier Letters: U+02B0-U+02FF
  7.9 Combining Marks
    Combining Diacritical Marks: U+0300-U+036F
    Combining Marks for Symbols: U+20D0-U+20FF
    Combining Half Marks: U+FE20-U+FE2F

8 Middle Eastern Scripts
  8.1 Hebrew
    Hebrew: U+0590-U+05FF
    Alphabetic Presentation Forms: U+FB1D-U+FB4F
  8.2 Arabic
    Arabic: U+0600-U+06FF
    Cursive Joining
    Ligatures
    Arabic Presentation Forms-A: U+FB50-U+FDFF
    Arabic Presentation Forms-B: U+FE70-U+FEFF
  8.3 Syriac
    Syriac: U+0700-U+074F
    Syriac Shaping
    Syriac Cursive Joining
    Ligatures
  8.4 Thaana
    Thaana: U+0780-U+07BF

9 South and Southeast Asian Scripts
  9.1 Devanagari
    Devanagari: U+0900-U+097F
  9.2 Bengali
    Bengali: U+0980-U+09FF
  9.3 Gurmukhi
    Gurmukhi: U+0A00-U+0A7F
  9.4 Gujarati
    Gujarati: U+0A80-U+0AFF
  9.5 Oriya
    Oriya: U+0B00-U+0B7F
  9.6 Tamil
    Tamil: U+0B80-U+0BFF
  9.7 Telugu
    Telugu: U+0C00-U+0C7F
  9.8 Kannada
    Kannada: U+0C80-U+0CFF
  9.9 Malayalam
    Malayalam: U+0D00-U+0D7F
  9.10 Sinhala
    Sinhala: U+0D80-U+0DFF
  9.11 Thai
    Thai: U+0E00-U+0E7F
  9.12 Lao
    Lao: U+0E80-U+0EFF
  9.13 Tibetan
    Tibetan: U+0F00-U+0FBF
  9.14 Myanmar
    Myanmar: U+1000-U+109F
  9.15 Khmer
    Khmer: U+1780-U+17FF

10 East Asian Scripts
  10.1 Han
    CJK Unified Ideographs
    CJK Compatibility Ideographs: U+F900-U+FAFF
    Kanbun: U+3190-U+319F
    CJK and KangXi Radicals: U+2E80-U+2FD5
    Ideographic Description: U+2FF0-U+2FFB
  10.2 Hiragana
    Hiragana: U+3040-U+309F
  10.3 Katakana
    Katakana: U+30A0-U+30FF
    Halfwidth and Fullwidth Forms: U+FF00-U+FFEF
  10.4 Hangul
    Hangul Jamo: U+1100-U+11FF
    Hangul Compatibility Jamo: U+3130-U+318F
    Hangul Syllables: U+AC00-U+D7A3
  10.5 Bopomofo
    Bopomofo: U+3100-U+312F
  10.6 Yi
    Yi: U+A000-U+A4CF

11 Additional Scripts
  11.1 Ethiopic
    Ethiopic: U+1200-U+137F
  11.2 Cherokee
    Cherokee: U+13A0-U+13FF
  11.3 Canadian Aboriginal Syllabics
    Canadian Aboriginal Syllabics: U+1400-U+167F
  11.4 Mongolian
    Mongolian: U+1800-U+18AF

12 Symbols
  12.1 Currency Symbols
    Currency Symbols: U+20A0-U+20CF
  12.2 Letterlike Symbols
    Letterlike Symbols: U+2100-U+214F
  12.3 Number Forms
    Number Forms: U+2150-U+218F
    Superscripts and Subscripts: U+2070-U+209F
  12.4 Mathematical Operators
    Mathematical Operators: U+2200-U+22FF
    Arrows: U+2190-U+21FF
  12.5 Technical Symbols
    Control Pictures: U+2400-U+243F
    Miscellaneous Technical: U+2300-U+23FF
    Optical Character Recognition: U+2440-U+245F
  12.6 Geometrical Symbols
    Box Drawing: U+2500-U+257F
    Block Elements: U+2580-U+259F
    Geometric Shapes: U+25A0-U+25FF
  12.7 Miscellaneous Symbols and Dingbats
    Miscellaneous Symbols: U+2600-U+26FF
    Dingbats: U+2700-U+27BF
  12.8 Enclosed and Square
    Enclosed Alphanumerics: U+2460-U+24FF
    Enclosed CJK Letters and Months: U+3200-U+32FF