The Unicode Standard Version 3.0

The Unicode Consortium

ADDISON-WESLEY
An Imprint of Addison Wesley Longman, Inc.
Reading, Massachusetts • Harlow, England • Menlo Park, California • Berkeley, California • Don Mills, Ontario • Sydney • Bonn • Amsterdam • Tokyo • Mexico City

Contents

Acknowledgments
Unicode Consortium Members and Directors
  Full Members
  Current Associate Members
  Current Liaison Members
  Current Specialist Members
  Current Individual Members
  Current Members of the Board of Directors
  Former Members of the Board of Directors
Contents
Figures
Tables
Preface
  0.1 About the Unicode Standard
    Concepts, Architecture, Conformance, and Guidelines
    Block Descriptions
    Charts and Index
    Appendices and Tables
    The Unicode Character Database and Technical Reports
    On the CD-ROM
  0.2 Notational Conventions
    Extended BNF
    Operators
  0.3 Resources
    Unicode Web Site
    Unicode Anonymous FTP Site
    Unicode Public Mailing List
    How to Contact the Unicode Consortium
1 Introduction
  1.1 Coverage
    Standards Coverage
    New Characters
  1.2 Design Basis
  1.3 Text Handling
    Interpreting Characters
    Text Elements
  1.4 The Unicode Standard and ISO/IEC 10646
  1.5 The Unicode Consortium
    The Unicode Technical Committee
2 General Structure
  2.1 Architectural Context
    Basic Text Processes
    Text Elements, Code Values, and Text Processes

    Text Processes and Encoding
  2.2 Unicode Design Principles
    Sixteen-Bit Character Codes
    Efficiency
    Characters, Not Glyphs
    Semantics
    Plain Text
    Logical Order
    Unification
    Dynamic Composition
    Equivalent Sequence
    Convertibility
  2.3 Encoding Forms
    UTF-16
    UTF-8
    Schemes
  2.4 Unicode Allocation
    Allocation Areas
    Codespace Assignment for Graphic Characters
    Nongraphic Characters, Reserved and Unassigned Codes
  2.5 Writing Direction
  2.6 Combining Characters
    Sequence of Base Characters and
    Multiple Combining Characters
    Multiple Base Characters
    Spacing Clones of European Diacritical Marks
  2.7 Special Character and Noncharacter Values
    Byte Order Mark (BOM)
    Special Noncharacter Values
    Separators
    Layout and Format Control Characters
    The Replacement Character
  2.8 Controls and Control Sequences
    Control Characters
    Representing Control Sequences
  2.9 Conforming to the Unicode Standard
    Characters Not Used in a Subset
  2.10 Referencing Versions of the Unicode Standard
3 Conformance
  3.1 Conformance Requirements
    Byte Ordering
    Invalid Code Values
    Interpretation
    Modification
    Transformations
    Unicode Technical Reports
  3.2 Semantics
  3.3 Characters and Coded Representations
  3.4 Simple Properties
  3.5 Combination

  3.6 Decomposition
    Compatibility Decomposition
    Canonical Decomposition
  3.7 Surrogates
  3.8 Transformations
  3.9 Special Character Properties
  3.10 Canonical Ordering Behavior
    Combining Classes
    Canonical Ordering
    Use with Collation
  3.11 Conjoining Jamo Behavior
    Syllable Boundaries
    Standard Syllables
    Syllable Composition
    Hangul Syllable Decomposition
    Hangul Syllable Names
  3.12 Bidirectional Behavior
    Directional Formatting Codes
    Basic Display Algorithm
    Definitions
    Resolving Embedding Levels
    Reordering Resolved Levels
    Bidirectional Conformance
    Implementation Notes
4 Character Properties
  4.1 Case—Normative
  4.2 Combining Classes—Normative
  4.3 Directionality—Normative
  4.4 Jamo Short Names—Normative
  4.5 General Category—Normative in Part
  4.6 Numeric Value—Normative
  4.7 Mirrored—Normative
  4.8 Unicode 1.0 Names
  4.9 Mathematical Property
  4.10 Letters and Other Useful Properties
5 Implementation Guidelines
  5.1 Transcoding to Other Standards
    Issues
    Multistage Tables
    7-Bit or 8-Bit Transmission
    Mapping Table Resources
  5.2 ANSI/ISO C wchar_t
  5.3 Unknown and Missing Characters
    Unassigned and Private Use Character Codes
    Interpretable but Unrenderable Characters
    Reassigned Characters
  5.4 Handling Surrogate Pairs
  5.5 Handling Numbers
  5.6 Handling Properties

  5.7 Normalization
  5.8 Compression
  5.9 Line Handling
  5.10 Regular Expressions
  5.11 Language Information in Plain Text
    Requirements for Language Tagging
    Working with Language Tags
    Language Tags and
  5.12 Editing and Selection
    Consistent Text Elements
  5.13 Strategies for Handling Nonspacing Marks
    Keyboard Input
    Truncation
  5.14 Rendering Nonspacing Marks
    Positioning Methods
  5.15 Locating Text Element Boundaries
    Boundary Specification
    Example Specifications
    Grapheme Boundaries
    Word Boundaries
    Line Boundaries
    Sentence Boundaries
    Random Access
  5.16 Identifiers
    Syntactic Rule
  5.17 Sorting and Searching
    Culturally Expected Sorting
    Unicode Character Equivalence
    Similar Characters
    Levels of Comparison
    Ignorable Characters
    Multiple Mappings
    Collating Out-of-Scope Characters
    Unmapped Characters
    Parameterization
    Optimizations
    Searching
    Sublinear Searching
  5.18 Case Mappings
  6.1 General Punctuation
    Punctuation: U+0020-U+00BF
    General Punctuation: U+2000-U+206F
    CJK Symbols and Punctuation: U+3000-U+303F
    CJK Compatibility Forms: U+FE30-U+FE4F
    Small Form Variants: U+FE50-U+FE6F
7 European Alphabetic Scripts
  7.1 Latin
    Letters of Basic Latin: U+0041-U+007A
    Letters of the Latin-1 Supplement: U+00C0-U+00FF
    Latin Extended-A: U+0100-U+017F

    Latin Extended-B: U+0180-U+024F
    IPA Extensions: U+0250-U+02AF
    Latin Extended Additional: U+1E00-U+1EFF
    Latin Ligatures: U+FB00-U+FB06
  7.2 Greek
    Greek: U+0370-U+03FF
    Greek Extended: U+1F00-U+1FFF
  7.3 Cyrillic
    Cyrillic: U+0400-U+04FF
  7.4 Armenian
    Armenian: U+0530-U+058F
  7.5 Georgian
    Georgian: U+10A0-U+10FF
  7.6 Runic
    Runic: U+16A0-U+16F0
  7.7 Ogham
    Ogham: U+1680-U+169F
  7.8 Modifier Letters
    Spacing Modifier Letters: U+02B0-U+02FF
  7.9 Combining Marks
    Combining Diacritical Marks: U+0300-U+036F
    Combining Marks for Symbols: U+20D0-U+20FF
    Combining Half Marks: U+FE20-U+FE2F
8 Middle Eastern Scripts
  8.1 Hebrew
    Hebrew: U+0590-U+05FF
    Alphabetic Presentation Forms: U+FB1D-U+FB4F
  8.2 Arabic
    Arabic: U+0600-U+06FF
    Cursive Joining
    Ligatures
    Arabic Presentation Forms-A: U+FB50-U+FDFF
    Arabic Presentation Forms-B: U+FE70-U+FEFF
  8.3 Syriac
    Syriac: U+0700-U+074F
    Syriac Shaping
    Syriac Cursive Joining
    Ligatures
  8.4 Thaana
    Thaana: U+0780-U+07BF
9 South and Southeast Asian Scripts
  9.1 Devanagari
    Devanagari: U+0900-U+097F
  9.2 Bengali
    Bengali: U+0980-U+09FF
  9.3 Gurmukhi
    Gurmukhi: U+0A00-U+0A7F
  9.4 Gujarati
    Gujarati: U+0A80-U+0AFF
  9.5 Oriya
    Oriya: U+0B00-U+0B7F

  9.6 Tamil
    Tamil: U+0B80-U+0BFF
  9.7 Telugu
    Telugu: U+0C00-U+0C7F
  9.8 Kannada
    Kannada: U+0C80-U+0CFF
  9.9 Malayalam
    Malayalam: U+0D00-U+0D7F
  9.10 Sinhala
    Sinhala: U+0D80-U+0DFF
  9.11 Thai
    Thai: U+0E00-U+0E7F
  9.12 Lao
    Lao: U+0E80-U+0EFF
  9.13 Tibetan
    Tibetan: U+0F00-U+0FBF
  9.14 Myanmar
    Myanmar: U+1000-U+109F
  9.15 Khmer
    Khmer: U+1780-U+17FF
10 East Asian Scripts
  10.1 Han
    CJK Unified Ideographs
    CJK Compatibility Ideographs: U+F900-U+FAFF
    Kanbun: U+3190-U+319F
    CJK and KangXi Radicals: U+2E80-U+2FD5
    Ideographic Description: U+2FF0-U+2FFB
  10.2 Hiragana
    Hiragana: U+3040-U+309F
  10.3 Katakana
    Katakana: U+30A0-U+30FF
    Halfwidth and Fullwidth Forms: U+FF00-U+FFEF
  10.4 Hangul
    Hangul Jamo: U+1100-U+11FF
    Hangul Compatibility Jamo: U+3130-U+318F
    Hangul Syllables: U+AC00-U+D7A3
  10.5 Bopomofo
    Bopomofo: U+3100-U+312F
  10.6 Yi
    Yi: U+A000-U+A4CF
11 Additional Scripts
  11.1 Ethiopic
    Ethiopic: U+1200-U+137F
  11.2 Cherokee
    Cherokee: U+13A0-U+13FF
  11.3 Canadian Aboriginal Syllabics
    Canadian Aboriginal Syllabics: U+1400-U+167F
  11.4 Mongolian
    Mongolian: U+1800-U+18AF
12 Symbols

  12.1 Currency Symbols
    Currency Symbols: U+20A0-U+20CF
  12.2 Letterlike Symbols
    Letterlike Symbols: U+2100-U+214F
  12.3 Number Forms
    Number Forms: U+2150-U+218F
    Superscripts and Subscripts: U+2070-U+209F
  12.4 Mathematical Operators
    Mathematical Operators: U+2200-U+22FF
    Arrows: U+2190-U+21FF
  12.5 Technical Symbols
    Control Pictures: U+2400-U+243F
    Miscellaneous Technical: U+2300-U+23FF
    Optical Character Recognition: U+2440-U+245F
  12.6 Geometrical Symbols
    Box Drawing: U+2500-U+257F
    Block Elements: U+2580-U+259F
    Geometric Shapes: U+25A0-U+25FF
  12.7 Miscellaneous Symbols and Dingbats
    Miscellaneous Symbols: U+2600-U+26FF
    Dingbats: U+2700-U+27BF
  12.8 Enclosed and Square
    Enclosed Alphanumerics: U+2460-U+24FF
    Enclosed CJK Letters and Months: U+3200-U+32FF
    CJK Compatibility: U+3300-U+33FF
  12.9 Braille
    Braille: U+2800-U+28FF
13 Special Areas and Format Characters
  13.1 Control Codes
    C0 Control Codes: U+0000-U+001F
    C1 Control Codes: U+0080-U+009F
  13.2 Layout Controls
    Layout Controls
  13.3 Deprecated Format Characters
    Deprecated Format Characters: U+206A-U+206F
  13.4 Surrogates Area
    Surrogates Area: U+D800-U+DFFF
  13.5 Private Use Area
    Private Use Area: U+E000-U+F8FF
  13.6 Specials
    Specials: U+FEFF, U+FFF0-U+FFFF
14 Code Charts
  14.1 Character Names List
    Images in the Code Charts and Character Lists
    Cross References
    Case Form Mappings
    Decompositions
    Information About Languages
    Reserved Characters
  14.2 CJK Unified Ideographs

  14.3 Hangul Syllables
15 Han Indices
  15.1 Han Radical-Stroke Index
  15.2 Shift-JIS Index
A Han Unification History
B Submitting New Characters
  B.1 Proposal Guidelines
  B.2 Requirements of Proposal Form and Process
    Interim Solutions
    Sending Proposals
C Relationship to ISO/IEC 10646
  C.1 History
    Unicode 1.0
    Unicode 2.0
    Unicode 3.0
  C.2 Encoding Forms in ISO/IEC 10646
    Zero Extending
  C.3 UCS Transformation Formats
    UTF-8
    UTF-16
  C.4 Synchronization of the Standards
  C.5 Identification of Features for the Unicode Standard
  C.6 Character Names
  C.7 Character Functional Specifications
D Changes from Unicode Version 2.0
  D.1 Versions of the Unicode Standard
  D.2 Changes from Unicode Version 2.0 to Version 2.1
    New Characters Added
    Character Semantics Changes
    Changes Affecting Conformance
  D.3 Changes from Unicode Version 2.1 to Version 3.0
    New Characters Added
    Character Semantics Changes
    Changes Affecting Conformance
    Unicode Technical Reports
Glossary
References
  R.1 Source Standards
  R.2 Source Dictionaries for Han Unification
  R.3 Other Sources for the Unicode Standard
  R.4 Selected Resources
I Indices
  I.1 Unicode Names Index
  I.2 General Index
