The ® Standard Version 12.0 – Core Specification

To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/. Many of the designations used by manufacturers and sellers distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of trade- mark claim, the designations have been printed with initial capital letters or in all capitals. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries. The authors and publisher have taken care in the preparation of this specification, but make expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Database and other files are provided as-is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. © 2019 Unicode, Inc. All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction. For information regarding permissions, inquire at http://www.unicode.org/reporting.html. For information about the Unicode terms of use, please see http://www.unicode.org/copyright.html. The Unicode Standard / the ; edited by the Unicode Consortium. — Version 12.0. Includes index. ISBN 978-1-936213-22-1 (http://www.unicode.org/versions/Unicode12.0.0/) 1. Unicode (Computer character set) . Unicode Consortium. QA268.U545 2019

ISBN 978-1-936213-22-1 Published in Mountain View, March 2019 969 I Index

The index covers the contents of this core specification. To find topics in the Unicode Stan- dard Annexes, Unicode Technical Standards, and Unicode Technical Reports, use the search feature on the Unicode website. For definitions of terms used, see the glossary on the Unicode website. To find the code points for specific characters or the code ranges for particular scripts, use the Character Index on the Unicode website. (See Section B.3, Other Unicode Online Resources.)

A alternate format characters (deprecated) . . .192, 882– 883 abbreviation, Coptic ...... 311 Americas ...... 256, 359 scripts of ...... 775–783 abstract character sequences definition ...... 90 Amharic ...... 752 ...... 442–443 abstract characters ...... 29 Ancient Symbols ...... 859 definition ...... 90 ...... 257, 258, 445, 629 angle brackets (+2329 and U+232A) deprecated for technical publication ...... 841 accent marks see Annexes, Unicode Standard (UAX) ...... xxiv, 929 accented characters encoding ...... 12 as components of Unicode Standard ...... 79 conformance ...... 85 Latin ...... 289 list of ...... 85 normalization ...... 206 accounting numbers, ideographic ...... 176 annotation characters ...... 895–897 use in discouraged ...... 896 acrophonic numerals ...... 205, 308 ANSI/ISO Adlam ...... 772–773 Aegean numbers ...... 340 wchar_t and Unicode ...... 200 (U+0027) ...... 272 Africa Arabic ...... 367–390 scripts of ...... 751–774 Afrikaans ...... 294 digits ...... 818 Arabic-Indic digits ...... 371–372 Ahom ...... 624–625 signs used with ...... 373 Ainu ...... 731 Aiton ...... 644 ArabicShaping.txt ...... 375, 380, 396 Aramaic ...... 412, 445, 532, 563, 569 Alchemical Symbols ...... 855 areas of the Unicode Standard ...... 45 Algonquian ...... 778 Ali Gali ...... 532 ARIB ...... 851 Armenian ...... 319–320 aliases arrows ...... 837–838 character name ...... 88, 181 informative ...... 908 ASCII characters with multiple semantics ...... 262 normative ...... 909 transparency of UTF-8 ...... 36 property ...... 162 property value ...... 162 Unicode modeled on ...... 1 zero extension ...... 200, 942 allocation areas ...... 45 Assamese ...... 472 allocation of encoded characters ...... 44–52 Alphabetic (informative property) ...... 188 assigned code points ...... 11, 30 Athapascan ...... 778 ...... 256 atomic character boundaries ...... 218 European ...... 287–336 mathematical ...... 811–815 Avestan ...... 420–421 Index 970

B Breton ...... 294 Buginese ...... 681–682 ...... 683–688 Buhid ...... 678 Bamum ...... 767–768 Bangla ...... 472–478 Bulgarian ...... 313 bullets ...... 275 base characters ...... 327 numeric ...... 819 definition ...... 106 multiple ...... 59 Burmese see Byelorussian ...... 313 ordered before combining marks ...... 220, 327 (BOM) (U+FEFF) ..40, 67, 130–133, Basic Multilingual (BMP) ...... 1, 44 allocation areas ...... 49 893–895 byte ordering representation in UTF-16 ...... 36 changing ...... 81 Basque ...... 294 Bassa Vah ...... 769 conformance ...... 83 byte serialization ...... 40, 67 Batak ...... 694–695 Byzantine Musical Symbols ...... 794 benefits of Unicode ...... 1 ...... 472–478 Bhaiksuki ...... 575–576 C Bidi Class (normative property) ...... 171 C language Bidi Mirrored (normative property) ...... 178 wchar_t and Unicode ...... 200 Bidi Mirroring (informative property) . . . . 179 C0 and C1 control codes ...... 31, 187, 868 BidiMirroring.txt ...... 179 Cambodian see Khmer Bidirectional Algorithm, Unicode ...... 53, 84 Canadian Aboriginal Syllabics ...... 778–779 bidirectional ordering ...... 20 candrabindu ...... 470, 600 controls ...... 879 canonical composite characters ...... 53, 84 see canonical decomposable characters Middle Eastern scripts ...... 359 canonical composition algorithm ...... 138 nonspacing marks in ...... 223 canonical decomposable characters in ...... 261 definition ...... 118 big-endian ...... 40 canonical decomposition ...... 63 definition ...... 83 definition ...... 117 Bihari ...... 468 mappings ...... 116 binary comparison and sort order canonical equivalence caution for UTF-16 ...... 36 definition ...... 118 UTF differences ...... 231, 233 nonspacing marks ...... 225 UTF-8 ...... 39 canonical equivalent character sequences block ...... 45, 90, 255, 903 conformance ...... 81 headers ...... 916 canonical mappings BMP see Basic Multilingual Plane see canonical decomposition mappings BNF (Backus-Naur Form) ...... 923 canonical ordering algorithm ...... 137 BOCU-1 see UTN #6, BOCU-1 canonical precomposed characters MIME-Compatible Unicode Compression see canonical decomposable characters Bodhi ...... 521 Cantonese ...... 710 Bodo ...... 467 capital letters ...... 164, 236, 287 BOM (U+FEFF) ...... 40, 67, 130–133, 893–895 Carian ...... 343 ...... 727–729 carriage return (U+000D) (CR) ...... 209, 869 boundaries, text ...... 61, 189, 217–218, 228 carriage return and line feed (CRLF) ...... 209 see also UAX #14, Unicode Line Breaking Algo- case ...... 295 rithm and text processes ...... 12 see also UAX #29, Unicode Text Segmentation beyond ASCII ...... 237 boustrophedon ...... 53, 349 camelcase ...... 239 box drawing symbols ...... 845 case folding ...... 240 Brahmi ...... 445, 563, 565–568, 569, 631 case operations (conformance) ...... 85, 152–158 ...... 786–787 case operations and normalization ...... 242 Index 971

case operations, reversibility ...... 239 character sequences cased (definition) ...... 153 abstract see abstract character sequences case-insensitive comparison ...... 157, 231, 240 canonical equivalent see canonical equivalent casing context (definition) ...... 153 character sequences conversion ...... 154 compatibility equivalent see compatibility equiva- detection ...... 156 lent character sequences European alphabets ...... 287 conformance ...... 81 exceptional Latin pairs ...... 291, 295 named ...... 181 Georgian ...... 322 character sequences, combining ...... 106 lowercase ...... 164, 236, 287 character shaping selectors (deprecated) ...... 882 mapping tables ...... 196 character tabulation (U+0009) ...... 869 mappings ...... 152, 166, 236–238 characters mappings noted in code charts ...... 912 abstract see abstract characters titlecase ...... 164, 236 arrangement in Unicode ...... 46 Turkish I ...... 238, 291 assigned ...... 11, 30 uppercase ...... 164, 236, 287 boundaries ...... 217 see also default case canonical decomposable see canonical decompos- Case (normative property) ...... 164, 236 able characters CaseFolding.txt ...... 166, 240 classes ...... 924 caseless letters ...... 295 code charts ...... 903–920 Catalan ...... 293 coded see encoded characters Caucasian Albanian ...... 354 combining see combining characters cedilla ...... 290 compatibility decomposable see compatibility CEF see forms decomposable characters CES see character encoding schemes composite see decomposable characters Chakma ...... 552 concept of ...... 15, 60 Cham ...... 669–670 conformance definitions ...... 90–93 character encoding forms (CEF) ...... 33–39, 942 confusable ...... 245 see also Unicode encoding forms conversion ...... 196–197 character encoding model ...... 33, 42 decomposable see decomposable characters see also UTR #17, Unicode Character Encoding deprecated see deprecated characters Model encoded see encoded characters character encoding schemes (CES) ...... 40–43 encoding forms see encoding forms see also Unicode encoding schemes encoding schemes see encoding schemes character encoding standards end-user perceived ...... 60 coverage by Unicode ...... 3 format control ...... 30, 68, 263, 867–883 Character Index ...... 930 , relationship to ...... 15 character literals, Unicode graphic ...... 30 notation U+ ...... 924 identity (definition) ...... 87 character names ...... 88, 180–186, 946 ignored in processing ...... 248–253 aliases ...... 88, 181 interpretation ...... 80 conventions ...... 921 layout control ...... 68, 871–881 for CJK ideographs ...... 917 modification ...... 81 for control codes ...... 185, 187 names list ...... 904–916 in code charts ...... 908 names see character names matching ...... 181 not encoded in Unicode ...... 3 character properties number encoded in Version 12.0 ...... 3 see properties precomposed see decomposable characters see also individual properties, .g. Combining Class properties see properties character semantics ...... 1, 80, 87–88, 947 semantics see character semantics as Unicode design principle ...... 18 special ...... 67, 867–902 ASCII ...... 262 supplementary see supplementary characters definition ...... 87 transcoding ...... 196–197 unsupported ...... 201 Index 972

characters, not glyphs CJKV Ideographs Area ...... 50 in spoofing ...... 246 cluster boundaries ...... 217 Unicode principle ...... 15 code charts ...... 903–920 charsets representative glyphs ...... 904 IANA registered names ...... 41 code point sequences Cherokee ...... 776 notation ...... 922 Chinese ...... 709–711 code points ...... 7, 29 Cantonese ...... 710 assigned ...... 11, 30 Hakka ...... 728 assignment ...... 46 Mandarin ...... 710 categories ...... 30 Minnan (Hokkien/Fujian, incl. Taiwanese) . . 728 default ignorable ...... 201, 252 simplified and traditional ...... 709 definition ...... 90 Chu hán ...... 708 designated ...... 30 Chu Nôm ...... 958 notation ...... 921 citations for number in Unicode Standard ...... 1 properties ...... 77 private-use see private-use code points Unicode algorithms ...... 78 reserved see reserved code points Unicode Standard ...... 76 semantics ...... 32 CJK ideographs ...... 258, 704–720 surrogate see surrogates accounting numbers ...... 176 unassigned see unassigned code points CJK Compatibility Ideographs ...... 719–720 undesignated ...... 30 CJK Compatibility Supplement ...... 720 code positions see code points CJK Strokes ...... 722, 961 code set independence ...... 18 CJK Unified Ideographs ...... 704–718 code unit sequences CJK Unified Ideographs Extension A ...... 706 definition ...... 120 CJK Unified Ideographs Extension B ...... 718 ill-formed (definition) ...... 122 CJK Unified Ideographs Extension C ...... 719 notation ...... 922 CJK Unified Ideographs Extension D ...... 719 well-formed (definition) ...... 122 CJK Unified Ideographs Extension E ...... 719 code units CJK Unified Ideographs Extension F ...... 719 definition ...... 120 code charts ...... 917 isolated ...... 119 compatibility ideographs in Plane 2 ...... 52 code values see code units component structure ...... 714 coded character representations encoding blocks ...... 705 see coded character sequences ideographic description sequences . . . . . 723–726 coded character sequences ideographic variation mark (U+303E) ...... 725 definition ...... 92 KangXi radicals ...... 717, 721–722 coded characters see encoded characters names ...... 917 codespace see Unicode codespace numbers ...... 818 coeng ...... 645, 648 numeric values ...... 176, 205 Collation Algorithm, Unicode (UCA) ...... 12 order of encoding ...... 716 collation see sorting radicals ...... 721–722 collation tables ...... 196 source standards ...... 960 sequences ...... 56, 106 unknown or unavailable ...... 284 defective ...... 223 Vietnamese ...... 702 definition ...... 108 CJK Miscellaneous Area ...... 50 Latin ...... 289 CJK punctuation and symbols ...... 282 line breaking ...... 219 compatibility forms ...... 284 matching ...... 219 overscores and underscores ...... 284 order of base character and marks ...... 220, 327 quotation marks ...... 270 rendering ...... 219 sesame dots ...... 283 selection ...... 217 vertical forms ...... 284 truncation ...... 220–221 CJK-JRG (Chinese/Japanese/Korean Joint Research combining characters . . . . . 55–60, 110–115, 219–227 Group) ...... 956 blocking reordering ...... 878 Index 973

canonical ordering ...... 62, 137, 168 conformance ...... 73–158 combining marks ...... 327–328 definitions ...... 87–93 definition ...... 106 examples ...... 69 dependence ...... 327 ISO/IEC 10646 implementations ...... 947 display order ...... 58 requirements ...... 79–84 keyboard input ...... 220 confusables ...... 245 ligatures ...... 59 conjunct consonants multiple ...... 57 Indic ...... 217, 453 multiple base characters ...... 59 Myanmar ...... 639 normalization of ...... 206 selection of clusters ...... 217 ordering conventions ...... 56 contextual shaping rendering of marks ...... 222–227 apostrophe ...... 272 reordrant ...... 169 Arabic ...... 367 -specific ...... 56 not used for Hebrew final forms ...... 362 split ...... 169 quotation marks ...... 268 strikethrough ...... 170 Syriac ...... 395 subjoined ...... 170 contour tones ...... 325 typographical interaction ...... 58, 168 control codes ...... 31, 68, 868 vertical stacking ...... 58 graphics for ...... 840 see also diacritics names ...... 187 Combining Class (normative property) ...... 168 properties ...... 869 combining classes ...... 135, 168, 225–226 semantics ...... 32, 869 class zero characters ...... 168 specified in Unicode ...... 869 definition ...... 135 control sequences ...... 868 combining joiner (U+034F) ...... 877 conversion of characters ...... 196–197 combining half marks ...... 190, 335 convertibility combining marks see combining characters as Unicode design principle ...... 24 comma below ...... 290 Coptic ...... 307, 310–312 Compatibility and Area ...... 26, 50 Coptic Epact numbers ...... 823 compatibility characters ...... 22 corporate use subarea ...... 888 compatibility composite characters ...... 27 corrigenda ...... 76 see compatibility decomposable characters CR (U+000D carriage return) ...... 209, 869 compatibility decomposable characters ...... 26 CRLF (carriage return and line feed) ...... 209 definition ...... 116 Croatian ...... 294 compatibility decomposition ...... 63 digraphs ...... 294 definition ...... 116 culturally expected sorting ...... 12, 230 compatibility decomposition mappings ...... 116 compatibility equivalence Old Persian ...... 433 definition ...... 117 Sumero-Akkadian ...... 428–431 compatibility equivalent character sequences Ugaritic ...... 432 conformance ...... 81 Cuneiform and Hieroglyphic Area ...... 51 compatibility mappings Cuneiform and Hieroglyphs ...... 427–443 see compatibility decomposition mappings currency symbols block ...... 805–808 compatibility precomposed characters currency symbols encoded in other blocks . . 806 see compatibility decomposable characters currency symbols, other ...... 807 compatibility variants ...... 26 dollar sign, form and usage ...... 806 mapping ...... 243 euro sign ...... 807 composite characters lari sign ...... 807 see decomposable characters lira sign, compatibility usage ...... 806 Composition Exclusion (normative property) . . . 100 lira sign, Turkish ...... 807 compression ...... 208 peso signs, usage ...... 806 see also UTS #6, A Standard Compression Scheme ruble sign ...... 807 for Unicode (SCSU) rupee signs, Indian, usage ...... 807 conferences ...... 930 yen and yuan signs, usage ...... 806 Index 974

cursive joining ...... 873–877 Derived Age (property) ...... 201 Arabic ...... 375–382 derived properties control characters for . . . . 191, 369–370, 535, 872 definition ...... 104 Mandaic ...... 403 DerivedCoreProperties.txt ...... 153, 164, 253 Mongolian ...... 534–536 DerivedNormalizationProps.txt ...... 242 ...... 763 ...... 781–783 Phags- ...... 582 design goals of Unicode ...... 4 Syriac ...... 395–398 design principles of Unicode ...... 14–24 transparency ...... 876 designated code points ...... 30 cursive scripts ...... 359 ...... 447–471 Cypriot ...... 342 Dhivehi ...... 515 see also diacritics ...... 55, 327 Cyrillic ...... 313–316 alternative glyphs ...... 289, 327 Czech ...... 294 Czech ...... 290 display in isolation ...... 60, 265, 328 D double ...... 114, 190, 329 danda, in Devanagari block ...... 466 German dialectology ...... 333 Danish ...... 293 Greek ...... 303–304, 307 dashes ...... 265 Latin ...... 289–292 Database, Unicode Character Latvian ...... 290 see Unicode Character Database (UCD) mathematical ...... 814 dead consonants, Indic ...... 450 on i and j ...... 291 dead keys ...... 220 rendering ...... 222–227 decomposable characters ...... 63 Slovak ...... 290 definition ...... 116 spacing clones of ...... 325, 329 normalization of ...... 206 ...... 55, 334 decomposition ...... 63, 116–118 see also combining characters canonical see canonical decomposition dictionary symbols ...... 851 compatibility see compatibility decomposition digit form names ...... 371 definition ...... 116 digits ...... 205 in normalization ...... 206 Arabic ...... 818 mapping, definition ...... 116 Arabic-Indic ...... 371–372 mappings noted in code charts ...... 912 compatibility ...... 818 default case ...... 175 algorithms ...... 85, 152–158 glyph variants ...... 820 conversion ...... 154 ...... 818 detection ...... 156 Myanmar ...... 818 folding ...... 155 national shapes ...... 883 default caseless matching ...... 157 Shan ...... 818 default grapheme clusters ...... 217 superscript and subscript ...... 819 see also UAX #29, Unicode Text Segmentation Tai Laing ...... 818 Default Ignorable Code Point (property) ...... 252 Tai Tham ...... 818 default ignorable code points ...... 201, 252 digraphs ...... 294, 297, 299 default property values ...... 854–855 definition ...... 97 directionality ...... 20, 53 defective combining character sequences ...... 223 East Asian scripts ...... 702 definition ...... 108 Middle Eastern scripts ...... 359 dependent vowel signs Mongolian ...... 534 Indic ...... 449 musical symbols ...... 789 Khmer ...... 650 normative property ...... 171 Philippine scripts ...... 678 ...... 356 deprecated characters ...... 74, 907 Old Italic ...... 346 alternate format ...... 192, 882–883 Philippine scripts ...... 679 definition ...... 92 Runic ...... 349 Index 975

discussion list for Unicode ...... 930 encoding model for Unicode characters ...... 33, 42 Dogra ...... 627–628 see also UTR #17, Unicode Character Encoding Dogri ...... 467 Model Domino Tiles ...... 856 encoding schemes ...... 40–43 dotless i ...... 238, 291 encoding schemes, Unicode dotted circle see Unicode encoding schemes in code charts ...... 107, 328 endian ordering in fallback rendering ...... 222 see byte order mark (BOM) (U+FEFF) to indicate ...... 55 end-user subarea ...... 889 to indicate vowel sign placement ...... 56 English ...... 293 double diacritics ...... 114, 190, 329 equivalent sequences ...... 206 Duployan ...... 798–799 as Unicode design principle ...... 23 Dutch ...... 293, 294 case-insensitivity ...... 231, 240 dynamic composition combining characters in matching ...... 219 as Unicode design principle ...... 23 conformance ...... 82 Dzongkha ...... 521 syllables ...... 737 in sorting and searching ...... 230 E language-specific ...... 118 security implications ...... 245 East Asian scripts ...... 701–750 writing direction ...... 53 see also canonical equivalence see also compatibility equivalence see also CJK ideographs see also encoding forms, encoding schemes Eastern Arabic-Indic digits ...... 371 EBCDIC errata ...... xxvi, 76, 931 escape sequences ...... 868 newline function ...... 210 not used in Unicode ...... 1, 4 editing, text boundaries for ...... 217–218 efficiency Esperanto ...... 294 Estonian ...... 294 as Unicode design principle ...... 15 Ethiopic ...... 752–755 ...... 434–439 format controls ...... 436–439 Etruscan ...... 345 European scripts ...... 287–336 Elbasan ...... 353 ancient ...... 337–357 ellipsis ...... 273–274 ...... 422 eyelash- ...... 459 e-mail discussion list for Unicode ...... 930 F ...... 849, 930 animal symbols ...... 853 fallback rendering ...... 252 charts ...... 930 of nonspacing marks ...... 222 cultural symbols ...... 853 FAQ (Frequently Asked Questions) ...... 930 zodiacal symbols ...... 853 Faroese ...... 293 emoji modifiers ...... 853 Farsi ...... 367, 370 ...... 853 featural ...... 257 ...... 863 FF (U+000C form feed) ...... 209, 869 enclosing marks ...... 335 file separator (U+001C) ...... 869 definition ...... 107 Finnish ...... 293 encoded characters ...... 7, 29 Finno-Ugric Transcription (FUT) allocation ...... 44–52 see Uralic Phonetic (UPA) definition ...... 92 fixed-width Unicode encoding form (UTF-32) . . . 35, encoding form conversion 124 definition ...... 127 flat tables ...... 196 encoding forms ...... 33–39 Flemish ...... 293 ISO/IEC 10646 definitions ...... 942 fleurons ...... 855 encoding forms, Unicode fonts see Unicode encoding forms and Unicode characters ...... 16 for mathematical alphabets ...... 813–815 style variation for symbols ...... 803 Index 976

form feed (U+000C) (FF) ...... 209, 869 editorial marks ...... 279 format control characters ...... 30, 68, 263, 867–883 letters as symbols ...... 304–306, 834 deprecated ...... 882–883 see also Cypriot, Linear B prefixed ...... 192, 331 Greenlandic ...... 294 stateful ...... 880 group separator (U+001D) ...... 869 fraction characters ...... 831 guillemets ...... 268 fraction slash (U+2044) ...... 273, 827 ...... 484–485 French ...... 294 Gunjala Gondi ...... 559–560 Frisian ...... 294 ...... 479–483 fullwidth forms in East Asian encodings ...... 734 futhark ...... 348 H Hakka ...... 728 G halant ...... 445 Garshuni ...... 391 see also virama Ge’ez ...... 752 half marks, combining ...... 190, 335 General Category (normative property) ...... 172 half-consonants, Indic ...... 454 list of values ...... 172 halfwidth forms in East Asian encodings ...... 734 general punctuation ...... 261–285 Han ideographs see CJK ideographs General Scripts Area ...... 50 ...... 711–718 geometrical symbols ...... 845–848 and language tags ...... 215 Georgian ...... 321–322 history ...... 955–960 German ...... 293 language usage ...... 709 geta mark (U+3013) ...... 284 source separation rule ...... 706, 712 Glagolitic ...... 318 source standards ...... 960 Glossary ...... 930 hand symbols ...... 853 glyph selection tables ...... 196 Hangul Area ...... 50 glyphs ...... 6, 15 Hangul syllables ...... 701, 735–738 characters, relationship to ...... 15 and combining marks ...... 114 diacritics alternative ...... 289, 327 as grapheme clusters ...... 61 Greek alternative ...... 304–306 canonical decomposition ...... 144 Latin alternative ...... 289 collation ...... 737 mathematical alternative ...... 833 composition ...... 146 missing ...... 252 conjoining jamo ...... 142–151 representative in code charts ...... 904 equivalent sequences ...... 737 standardized variants ...... 884 Hangul Compatibility Jamo ...... 736 symbols alternative ...... 803 Hangul Jamo ...... 735–738 golden numbers ...... 350 Hangul Syllables block ...... 737–738 Gothic ...... 352 Johab set ...... 737 Grantha ...... 621–623 name generation ...... 147 grapheme base ...... 327 normalization ...... 736 definition ...... 108 standard ...... 143 grapheme clusters ...... 11, 60–61 Hangzhou numerals ...... 826 see also UAX #29, Unicode Text Segmentation Hanifi Rohingya ...... 676 default ...... 217 see CJK ideographs definition ...... 109 Hanunóo ...... 678 grapheme extender Hanzi see CJK ideographs definition ...... 109 harakat ...... 368 grapheme joiner, combining (U+034F) ...... 877 hasant ...... 472 graphic characters ...... 30 hash tables ...... 197 Greek ...... 303–308 Hatran ...... 425 acrophonic numerals ...... 205, 308 Hebrew ...... 361–366 alternative glyphs ...... 304–306 ...... 731–733 ancient musical notation ...... 795–797 Index 977

hieroglyphs ...... 418 Anatolian ...... 442–443 ...... 418 Egyptian ...... 434–439 inside-out rule ...... 222 Meroitic ...... 440–441 interchange restrictions ...... 31 high surrogate International Phonetic Alphabet (IPA) 256, 296–297 definition ...... 119 ...... 324 high-surrogate code points ...... 79, 890 see also phonetic alphabets high-surrogate code units ...... 119 internationalization ...... 18 higher-level protocols Internationalization & Unicode Conference . . . . 930 definition ...... 93 Internet protocols Hindi ...... 447 UTF-8 as preferred encoding ...... 37 ...... 730 Inuktitut ...... 778 horizontal tab (U+0009) ...... 869 invisible operators ...... 839 HTML newline function ...... 210 iota subscript ...... 304 Hungarian ...... 294 IPA see International Phonetic Alphabet hyphenation ...... 872 IRG (Ideographic Rapporteur Group) ...... 958 as a text process ...... 10 Irish ...... 293, 356 ...... 265, 872 ISCII standard and Unicode ...... 447 ISO/IEC 10646 ...... 933–947 I conformance of Unicode implementations . . 947 encoding forms ...... 942 I Ching symbols ...... 858 synchrony with Unicode Standard ...... 944 IANA charset names ...... 41 Icelandic ...... 293 timeline compared to Unicode versions . . . . . 935 Italian ...... 293 identifiers ...... 229 ITC ...... 854 see also UAX #31, Unicode Identifier and Pattern Syntax IUC see Internationalization & Unicode Conference Ideographic (informative property) ...... 188 ideographic description sequences ...... 724 J Ideographic Rapporteur Group (IRG) ...... 958 jamos see Hangul syllables ideographs see also CJK ideographs Japanese ...... 701 IICore ...... 707, 958 Javanese ...... 689–692 ill-formed Jawi ...... 387 definition ...... 122 jihvamuliya ...... 471, 600 Imperial Aramaic ...... 412–413 Johab ...... 737 implementation guidelines ...... 195–254 joiners ...... 369 in a Unicode encoding form combining grapheme joiner (U+034F) ...... 877 definition ...... 123 (U+2060) ...... 871 in-band mechanisms ...... 902 zero width joiner (U+200D) ...... 369–370, 874 India justification ...... 224 Official scripts ...... 445–512 Indian rupee signs, usage ...... 807 K Indic scripts ...... 445–512 ...... 597–599 principles, in terms of Devanagari . . . . . 448–458 Kana (Hiragana and ) ...... 730–731 relation to ISCII standard ...... 447 Kanbun ...... 720 Indic Siyaq ...... 825 KangXi radicals ...... 717, 721–722 Indonesia and Oceania see CJK ideographs scripts of ...... 677–699 ...... 502–505 Indonesian ...... 293 Kashmiri ...... 468 industry character sets Katakana ...... 730–731 covered in Unicode ...... 3 Kawi ...... 683, 685 information separators (U+001C..U+001F) . . . . . 869 Kayah Li ...... 668 informative properties KC (normalization form) definition ...... 101 see Normalization Form KC Index 978

KD (normalization form) ligatures ...... 873–877 see Normalization Form KD Arabic ...... 378–379 keytop labels ...... 840 combining characters on ...... 59 Khamti Shan ...... 642 control characters for ...... 191 Kharoshthi ...... 569–570 for nonspacing marks ...... 226 Khmer ...... 645–656 Latin ...... 299 characters not recommended ...... 653 selection ...... 218 syllable components, order of ...... 654 Syriac ...... 399 Khojki ...... 608–609 Limbu ...... 542–545 Khudawadi ...... 610–611 line breaking ...... 209–213, 871–873 killer ...... 258 control characters ...... 190 Batak ...... 694 in South Asian scripts ...... 633, 641, 656 Brahmi ...... 565 recommendations ...... 211 Meetei Mayek ...... 546 see also UAX #14, Unicode Line Breaking Algo- Myanmar (asat) ...... 640 rithm see also virama line feed (U+000A) (LF) ...... 209, 869 Konkani ...... 467 line separator (U+2028) (LS) ...... 209, 873 Korean Hangul see Hangul line tabulation (U+000B) (VT) ...... 869 Kurdish ...... 387 ...... 339 Linear B ...... 340–341 L see also Cypriot linear boundaries ...... 218 Ladino ...... 361 language tags ...... 215, 898–902 Lisu ...... 743–745 Lithuanian ...... 294 and Han unification ...... 215 little-endian ...... 40 use strongly discouraged ...... 898, 901 Lanna ...... 659 definition ...... 83 logical order Lao ...... 635–637 as Unicode design principle ...... 19 last-resort glyphs ...... 252 Latin ...... 289–302 exceptions to ...... 169 logograph ...... 258 alternative glyphs ...... 289 logosyllabaries ...... 258 Basic Latin ...... 293 encoding blocks ...... 45 low surrogate definition ...... 119 IPA Extensions ...... 296–297 low-surrogate code points ...... 79, 890 Latin Extended Additional ...... 299–302 Latin Extended-A ...... 293 low-surrogate code units ...... 119 lowercase ...... 164, 236, 287 Latin Extended-B ...... 294–296 LS (U+2028 line separator) ...... 209, 873 Latin Extended-C ...... 299 Latin Extended-D ...... 300 Lycian ...... 343 Lydian ...... 343 Latin Extended-E ...... 301 Latin Ligatures ...... 299 Latin-1 Supplement ...... 293 M ...... 298–302 MacOS newline function ...... 210 Latvian ...... 294, 301 ...... 606–607 cedilla ...... 290 Mahjong Tiles ...... 856 layout control characters ...... 68, 871–881 mail discussion list for Unicode ...... 930 leading surrogates Maithili ...... 467 see high-surrogate code units major version ...... 75 legibility criterion for plain text ...... 19 Makasar ...... 698–699 Lepcha ...... 553–555 Malay ...... 293 letter spacing ...... 872 Malay, Patani ...... 634 ...... 809–815 Malayalam ...... 506–512 LF (U+000A line feed) ...... 209, 869 Suriyani ...... 399, 507 Index 979

Maltese ...... 294 Mro ...... 548 Manchu ...... 533 Multani ...... 612 Mandaic ...... 402–404 multibyte encodings Mandarin ...... 710 compared to UTF-8 ...... 37 Manden ...... 760 multistage tables ...... 196 Manichaean ...... 414–417 musical symbols ...... 788–797 map symbols ...... 851 ancient Greek ...... 795–797 mapping tables see tables of character data Balinese ...... 687 Marathi ...... 447, 459, 466 Byzantine ...... 794 Marchen ...... 584 directionality ...... 789 markup languages Gregorian ...... 793 and Unicode conformance ...... 902 Kievan ...... 793 line breaking ...... 209 Western ...... 788–793 Masaram Gondi ...... 557–558 Myanmar ...... 638–644 Mathematical (informative property) ...... 831 digits ...... 818 mathematical expression format characters ...... 192 Myanmar Extended-A ...... 642 see also UTR #25, Unicode Support for Mathe- Myanmar Extended-B ...... 642 matics mathematical symbols ...... 831–838 N alphabets ...... 811–815 N’Ko ...... 760–764 alphanumeric ...... 810–815 Nabataean ...... 423 fonts ...... 813–815 named character sequences ...... 181 format characters ...... 839 names, character see character names fragments for typesetting ...... 841 namespace ...... 89 invisible operators ...... 839 ...... 619–620 operators ...... 832–835 NEL (U+0085 next line) ...... 209, 869 standardized variants ...... 838 Nepali ...... 447 MathML ...... 835 neutral directional characters ...... 171 matras ...... 168, 449 New Tai Lue ...... 659–661 Medefaidrin ...... 774 Newa ...... 519–520 Meetei Mayek ...... 546–547 newline function (NLF) ...... 210, 870 Mende Kikakui ...... 770–771 newline guidelines ...... 209–213 Meroitic next line (U+0085) (NEL) ...... 209, 869 cursive ...... 440–441 NFC (Normalization Form C) ...... 62 hieroglyphs ...... 440–441 NFD (Normalization Form D) ...... 62 Miao ...... 746–747 NFKC (Normalization Form KC) ...... 62 Middle Eastern scripts ...... 359–516 NFKD (Normalization Form KD) ...... 62 ancient ...... 405–425 NLF (newline function) ...... 210, 870 Min ...... 710 no-break (U+00A0) ...... 871 Minnan (Hokkien/Fujian, incl. Taiwanese) . . . . . 728 base for diacritic in isolation ...... 60, 265, 328 minor version ...... 75 no-break space, narrow (U+202F) ...... 538 minus sign ...... 834 noncharacter code points see noncharacters commercial (U+2052) ...... 276 noncharacters ...... 31, 891 mirrored property conformance ...... 79 see Bidi Mirrored (normative property) definition ...... 93 mirroring of paired punctuation ...... 267 handling ...... 82 ...... 850 in code charts ...... 907 missing glyphs ...... 252 interchange restrictions ...... 31 Modi ...... 616–618 semantics ...... 32 modifier letters ...... 323–326 U+10FFFF (not a character code) ...... 891 Modifier Letters, Spacing ...... 299 U+FDD0..U+FDEF ...... 31, 891 Mongolian ...... 532–541, 577 U+FFFE (not a character code) ...... 67, 892 writing direction ...... 534 U+FFFF (not a character code) ...... 31, 891 moon symbols ...... 851 Index 980

nondecomposable characters ...... 64 Roman ...... 205, 831 non-joiner, zero width (U+200C) . . . . . 369–370, 875 Rumi ...... 824 nonlinear boundaries ...... 218 Suzhou-style ...... 826 non-overlap principle in Unicode encoding forms 33 numeric separators ...... 276 nonspacing marks ...... 327 numeric shape selectors (deprecated) ...... 883 definition ...... 107 Numeric Type (normative property) ...... 175 display in isolation ...... 60, 265, 328 Numeric Value (normative property) ...... 175 positioning ...... 226 numero sign (U+2116) ...... 809 rendering ...... 222–227 Nüshu ...... 742 see also combining characters ...... 673–674 see also diacritics normalization ...... 62, 206–207 and case operations ...... 242 object replacement character (U+FFFC) ...... 897 canonical ordering algorithm ...... 62, 137, 168 octet ...... 923 conformance ...... 84 Ogham ...... 356 of private-use characters ...... 888 Ol Chiki ...... 550–551 see also UAX #15, Unicode Normalization Forms Old Church Slavonic ...... 313 stability ...... 134 Old Hungarian ...... 351 Normalization Form C (NFC) ...... 62 Old Italic ...... 345–347 Normalization Form D (NFD) ...... 62 Old North Arabian ...... 407 Normalization Form KC (NFKC) ...... 62 Old Permic ...... 355 Normalization Form KD (NFKD) ...... 62 Old Persian ...... 433 normalization forms ...... 134–141 Old Sogdian ...... 591 definition ...... 140 Old South Arabian ...... 408–409 specification ...... 136 Old Turkic ...... 590 normative behaviors old-style numerals ...... 273 definition ...... 87 ...... 486–488 normative properties ornamental dingbats ...... 855 definition ...... 99 Oromo ...... 752 list ...... 100 Osage ...... 780 may change ...... 99 Osmanya ...... 756 Norwegian ...... 293 Ottoman Siyaq ...... 825 notational conventions ...... 921–925 out-of-band mechanisms ...... 902 notational systems ...... 260, 785–801 overlapping encodings ...... 33 nukta ...... 368, 389, 460 overscores ...... 273 null (U+0000) as Unicode string terminator ...... 870 P number forms CJK ideographs ...... 205 ...... 671–672 numbers Pahlavi, Inscriptional ...... 418 Coptic Epact ...... 823 Pahlavi, Psalter ...... 419 handling ...... 205 Palmyrene ...... 424 ideographic accounting ...... 176 Panjabi ...... 479 numerals ...... 816–828 paragraph or section marks ...... 276 acrophonic ...... 308 paragraph separator (U+2029) (PS) ...... 209, 873 Chinese ...... 829 Parthian, Inscriptional ...... 418 Coptic ...... 312 Pashto ...... 367 Cuneiform ...... 431 Patani Malay ...... 634 Ethiopic ...... 754 ...... 675 Greek acrophonic ...... 205 Persian ...... 367, 370 Hangzhou ...... 826 Phags-pa ...... 577–583 Meroitic cursive ...... 441 symbols ...... 860 old-style ...... 273 Phake ...... 644 Index 981

Philippine scripts ...... 678–680 provisional see provisional properties Phoenician ...... 410 simple see simple properties phonemes ...... 259 see also individual properties, e.g. combining phonetic alphabets ...... 256 classes IPA Extensions ...... 296–297 property values Phonetic Extensions ...... 298–302 aliases ...... 162 Spacing Modifier Letters ...... 324–326 aliases (definition) ...... 105 Uralic Phonetic Alphabet (UPA) ...... 276, 298 default ...... 97 see also International Phonetic Alphabet (IPA) default (definition) ...... 97 ...... 293 normative references to ...... 84 pipeline table PropertyAliases.txt ...... 104, 924 proposed new characters ...... 931 PropertyValueAliases.txt ...... 105, 924 pivot code, Unicode as ...... 196 PropList.txt ...... 166 plain text Provençal ...... 294 as Unicode design principle ...... 18 provisional properties legibility criterion ...... 19 definition ...... 101 planes of Unicode codespace ...... 44 PS (U+2029 paragraph separator) ...... 209, 873 Plane 0 (BMP) ...... 44 ...... 419 Plane 1 (SMP) ...... 44, 51 PUA (Private Use Area) ...... 50, 888 Plane 14 (SSP) ...... 45 punctuation ...... 261–285 Plane 2 (SIP) ...... 44, 52 blocks containing ...... 255 Planes 15-16 (Private Use) ...... 52, 889 CJK ...... 282 Playing Cards ...... 857 doubled ...... 273 points, Hebrew pronunciation marks ...... 361 in bidirectional text ...... 261 policies of the Unicode Consortium ...... 931 paired ...... 267 Polish ...... 294 small form variants ...... 285 Portuguese ...... 293 typographic forms ...... 261 precomposed characters vertical forms ...... 284 see decomposable characters Punctuation and Symbols Area ...... 50 compatibility see compatibility decomposable Punjabi ...... 479 characters prefixed format control characters ...... 192 Q prepended concatenation marks ...... 253, 331 quotation marks ...... 268–271 Private Use Area (PUA) ...... 50, 888 East Asian ...... 270 Private Use planes ...... 45, 52, 889 European ...... 268 private-use characters properties ...... 887 semantics ...... 32 radicals, KangXi and other CJK ...... 721–722 private-use code points ...... 31, 201 radical- index ...... 717 conformance ...... 80 record separator (U+001E) ...... 869 definition ...... 105 recycling symbols ...... 852 high surrogates ...... 890 references ...... 931 processing code, Unicode as ...... 38 referencing ...... 84 properties ...... 18, 95–105, 159–193 properties ...... 77 aliases ...... 162 Unicode algorithms ...... 78 aliases (definition) ...... 104 Unicode Standard ...... 76 and Unicode algorithms ...... 100 regional indicator symbols ...... 864 data tables ...... 196 regular expressions ...... 214 derived see derived properties and line breaking ...... 209 in Unicode Character Database (UCD) ...... 46 see also UTS #18, Unicode Regular Expressions informative see informative properties Rejang ...... 693 normative references to ...... 77, 84 rendering of text ...... 6, 10, 17 normative see normative properties fallback ...... 252 of control codes ...... 869 unsupported characters ...... 201 Index 982

repertoire of abstract characters ...... 29 Sharada ...... 600–601 reph ...... 458, 462, 500 ...... 357, 743 replacement character (U+FFFD) . . . . 43, 68, 83, 897 Show Hidden ...... 81, 222, 252, 885 reserved code points ...... 30, 201 SHY (U+00AD soft ) ...... 872 definition ...... 93 Sibe ...... 533 in code charts ...... 907 Siddham ...... 604–605 preservation in interchange ...... 31 signature for Unicode data ...... 67, 893–895 see also unassigned code points simple properties Rhaeto-Romanic ...... 294 definition ...... 104 rich text ...... 18 simplified Chinese ...... 709 right single quotation mark (U+2019) Sindhi ...... 367, 467 preferred for apostrophe ...... 272 Sinhala ...... 517–518 right-to-left text ...... 53 Sinological ...... 301 East Asian scripts ...... 702 SIP (Supplementary Ideographic Plane) ...... 44, 52 Middle Eastern scripts ...... 359 Siyaq Numbers ...... 824 roadmap for script additions ...... 46, 931 Indic ...... 824 ...... 205, 831 slash, fraction (U+2044) ...... 273 Romanian ...... 294 Slovak ...... 294 comma below ...... 291 Slovenian ...... 294 Romany ...... 294 small letters ...... 164, 236, 287 Rong ...... 553–555 SMP (Supplementary Multilingual Plane) . . . . 44, 51 Rumi symbols ...... 824 (U+00AD) (SHY) ...... 872 Runic ...... 348–350 Sogdian ...... 592 Russian ...... 313 Somali ...... 756 Sora Sompeng ...... 626 Sorbian ...... 294 Samaritan ...... 400–401 sorting ...... 12, 230 Sami ...... 294 and combining grapheme joiner ...... 878 ...... 447 as a text process ...... 10 Saurashtra ...... 556 case-insensitive ...... 231 scalar values, Unicode culturally expected ...... 12, 230 see Unicode scalar values language-insensitive ...... 230 scripts see also Unicode Collation Algorithm (UCA) in Unicode Standard ...... 3 source separation rule ...... 706, 712 roadmap for future additions ...... 46, 931 South and Central Asian scripts types of ...... 260 Ancient ...... 563–592 see also UAX #24, Unicode Script Property Other historic ...... 593–628 SCSU Other modern ...... 513–560 see UTS #6, A Standard Compression Scheme for South Asian scripts ...... 445–545 Unicode Southeast Asian scripts ...... 629–676 searching ...... 230–232 Soyombo ...... 588–589 as a text process ...... 10 space (U+0020) case-insensitive ...... 231, 240 base for diacritic in isolation ...... 60, 265, 328 section or paragraph marks ...... 276 space characters ...... 264, 871–873 security issues ...... 245 graphics for ...... 840 self-synchronization of encoding forms ...... 34 space, zero width (U+200B) ...... 264 semantics spacing clones of diacritics ...... 325, 329 see character semantics spacing marks ...... 327 sequences definition ...... 108 notation ...... 922 Spacing Modifier Letters ...... 324–326 Serbian Spanish ...... 293 corresponding digraphs in Croatian ...... 294 special characters ...... 67, 867–902 Shan ...... 657 SpecialCasing.txt ...... 152, 166 digits ...... 818 Specials ...... 893–897 Index 983

spell-checking surrogate pairs ...... 36, 125 as a text process ...... 11 definition ...... 119 spellings, alternative processing ...... 38, 203–204 see equivalent sequences surrogates ...... 31, 119, 890 spoofing ...... 245 interchange restrictions ...... 31 SSP (Supplementary Special-purpose Plane) . . . . . 45 isolated surrogates, handling ...... 43 stability ...... 102, 161 isolated surrogates, ill-formed ...... 125 as Unicode design principle ...... 23 isolated surrogates, uninterpreted ...... 119 stacked boundaries ...... 217 support levels ...... 203 stacking sequences ...... 57 Surrogates Area ...... 50, 890 nondefault ...... 58 Sutton SignWriting ...... 800–801 standardized variants ...... 536, 884 Suzhou-style numerals ...... 826 in the code charts ...... 914 svasti signs ...... 528 mathematical symbols ...... 838 Swahili ...... 293 StandardizedVariants.txt ...... 536, 838 Swedish ...... 293 standards coverage ...... 3 syllabaries ...... 257 starters ...... 136 alphabetic property ...... 188 stateful encoding featural ...... 257 not used in Unicode ...... 4 Syloti Nagri ...... 595–596 paired format controls ...... 880 symbols ...... 803–865 string comparison ...... 12 animal ...... 853 string literals, Unicode appearance variation ...... 803 code point notation \u1234 ...... 924 arrows ...... 837–838 strings, Unicode ...... 43, 121 box drawing ...... 845 null termination ...... 870 cultural ...... 853 strong directional characters ...... 171 currency symbols block ...... 805–808 styled text ...... 18 dictionary ...... 851 sublinear searching ...... 232 dingbats ...... 854–855 subsets, supported ...... 71 emoji ...... 849, 864 conformance ...... 80 Enclosed Alphanumerics ...... 863 ISO/IEC 10646 specification for ...... 945 fragments for mathematical typesetting . . . . . 841 substitution character game ...... 852 see replacement character gender ...... 852 Sumero-Akkadian ...... 428–431 genealogical ...... 852 Sundanese ...... 696–697 geometrical ...... 845–848 superscripts ...... 325 hand ...... 853 and subscripts ...... 829 Khmer lunar calendar ...... 656 supplementary characters letterlike ...... 809–815 in UTF-16 strings ...... 43 map ...... 851 tables for ...... 197 mathematical ...... 831–838 Supplementary General Scripts Area ...... 50 mathematical alphanumeric ...... 810–815 Supplementary Ideographic Plane (SIP) ...... 44, 52 miscellaneous ...... 850 Supplementary Multilingual Plane (SMP) . . . . .44, 51 musical ...... 788–797 supplementary planes numerals ...... 816–828 representation in UTF-16 ...... 36 recycling ...... 852 representation in UTF-8 ...... 37 regional indicator ...... 864 Supplementary ...... 52, 889 technical ...... 840–844 Supplementary Special-purpose Plane (SSP) . . . . . 45 weather ...... 851 supported subsets ...... 71 zodiacal ...... 853 conformance ...... 80 symmetric swapping format characters ...... 882 supralineation ...... 311 Syriac ...... 391–399 surrogate code points see surrogates Index 984

T tone marks Bopomofo spacing ...... 727, 728 tab (U+0009 character tabulation) ...... 869 Chinantec ...... 326 tab, vertical (U+000B) ...... 209, 869 tables of character data ...... 196–197 Chinese ...... 326 Tai Le ...... 657 optimization ...... 197 ...... 631 supplementary characters ...... 197 tag characters ...... 898–902 Vietnamese ...... 292 traditional Chinese ...... 709 Tagalog ...... 678 traffic signs ...... 851 Tagbanwa ...... 678 tags, language ...... 215, 898–902 trailing surrogates see low-surrogate code units use strongly discouraged ...... 901 transcoding ...... 196–197 Tai Laing digits ...... 818 tables ...... 196 Transport and Map Symbols ...... 854 Tai Le ...... 657–658 triangulation in transcoding ...... 196 Tai Tham ...... 662–664 digits ...... 818 tries ...... 196 truncation Tai Viet ...... 665–667 combining character sequences ...... 220–221 Tai Xuan Jing symbols ...... 859 surrogates and ...... 204 Takri ...... 602–603 ...... 489–498 Turkish ...... 294 case mapping of I ...... 238, 291 Tangut ...... 748–750 cedilla ...... 291 components ...... 749–750 radicals ...... 749 lira sign ...... 807 two-stage tables ...... 197 tashkil ...... 368 tashkil, harakat, points ...... 370 TCHAR in Win32 API ...... 200 U Technical Reports (UTR) ...... 929 U+ notation ...... 924 Technical Standards (UTS) ...... xxvi, 929 U+10FFFF (not a character code) ...... 891 abstracts ...... 930 U+FEFF (BOM) ...... 893–895 technical symbols ...... 840–844 U+FFFE (not a character code) ...... 892 ...... 499–501 U+FFFF (not a character code) ...... 891 terminal emulation ...... 804 UAX (Unicode Standard Annex) ...... xxiv, 929 text boundaries ...... 61, 189, 217–218, 228 as component of Unicode Standard ...... 79 see also UAX #14, Unicode Line Breaking Algo- conformance ...... 85 rithm list of ...... 85 see also UAX #29, Unicode Text Boundaries UCA see Unicode Collation Algorithm and see also text elements ...... 6, 10, 217 UTS #10, Unicode Collation Algorithm boundaries ...... 228 UCD see Unicode Character Database for sorting ...... 230 UCS (Universal Character Set) variable-width nature ...... 38 see ISO/IEC 10646 text processes ...... 6, 10–13 UCS-2 ...... 942 text rendering ...... 6, 10, 17 UCS-4 ...... 942 text selection, boundaries for ...... 217–218 Ugaritic ...... 432 ...... 515–516 Ukrainian ...... 313 Thai ...... 631–634 unassigned code points ...... 30, 79, 201 ...... 521–531 defined as reserved code points ...... 93 ...... 757 handling ...... 74 Tigre ...... 752 properties of ...... 97 tilde (U+007E) ...... 276 semantics ...... 79 Tirhuta ...... 613–615 see also reserved code points titlecase ...... 164, 236 underscores ...... 273 Todo ...... 533 undesignated code points ...... 30 tone letters ...... 325–326 Unicode 1.0 Name (informative property) ...... 187 Index 985

Unicode algorithms areas ...... 45 and properties ...... 100 benefits ...... 1 conformance ...... 84 blocks ...... 255 definition ...... 93 code charts ...... 903–920 normative references to ...... 78, 84 components ...... 79 Unicode Bidirectional Algorithm ...... 21, 53 conformance ...... 73–158 see also UAX #9, Unicode Bidirectional Algorithm conformance of ISO/IEC 10646 implementations .. Unicode Character Database (UCD) . . xxiv, 161, 931 947 as component of Unicode Standard ...... 79 corrections ...... 76 changes ...... 74 definitions for conformance ...... 87–93 properties in ...... 46 design goals ...... 4 Unicode character encoding model ...... 33, 42 design principles ...... 14–24 see also UTR #17, Unicode Character Encoding errata ...... 76, 931 Model normative references to ...... 76, 84 Unicode character literals number of characters ...... 3 code point notation U+ ...... 924 number of code points ...... 1, 29 Unicode codespace script coverage ...... 3 allocation numbers ...... 950 security issues ...... 245 definition ...... 90 synchrony with ISO/IEC 10646 ...... 944 planes ...... 44 updates ...... 931 size ...... 1, 29 versions see versions of the Unicode Standard Unicode Collation Algorithm (UCA) ...... 12 see also Version 12.0 Unicode conferences ...... 930 Unicode Standard Annexes (UAX) ...... xxiv, 929 Unicode Consortium ...... 928 as components of Unicode Standard ...... 79 addresses ...... 932 conformance ...... 85 Consortium membership in standards bodies 928 list of ...... 85 e-mail discussion list ...... 930 Unicode string literals membership ...... 928 code point notation \u1234 ...... 924 policies ...... 931 Unicode strings ...... 43 website ...... 930 definition ...... 121 Unicode data signature ...... 67, 893–895 Unicode Technical Committee (UTC) ...... 928 Unicode data types ...... 199–200 Unicode Technical Reports (UTR) ...... 929 for C ...... 199–200 Unicode Technical Standards (UTS) ...... xxvi, 929 Unicode encoding forms ...... 120–127 abstracts ...... 930 advantages of each ...... 38 UnicodeData.txt ...... 152, 166 conformance ...... 34, 82 unification definition ...... 121 as Unicode design principle ...... 21 fixed-width (UTF-32) ...... 35, 124 see also Han unification signatures ...... 894, 895 Unified Repertoire and Ordering (URO) . . . 713, 957 variable-width ...... 36, 125 see also Han unification see also encoding forms Unihan Database ...... 161, 717, 718, 931, 958 Unicode encoding schemes Unihan.zip ...... 102, 161 conformance ...... 130–133 unit separator (U+001F) ...... 869 definition ...... 130 Universal Character Set (UCS) endian ordering ...... 40 see ISO/IEC 10646 see also encoding schemes universality Unicode escape sequence notation \u1234 ...... 924 as Unicode design principle ...... 14 Unicode scalar values Unix definition ...... 120 and UTFs ...... 38 Unicode security ...... 245 newline function ...... 210 see also UTS #39, Unicode Security Mechanisms UTF-32 in ...... 35 Unicode Standard UTF-8 in ...... 18 allocation of encoded characters ...... 44–52 unsupported characters ...... 201 architecture ...... 10–13 upadhmaniya ...... 471, 600 Index 986

update version ...... 75 preferred encoding for Internet protocols . . . . 37 uppercase ...... 164, 236, 287 security and ...... 245 Uralic Phonetic Alphabet (UPA) ...... 276, 298 signature ...... 130, 133, 894 Urdu ...... 367 UTR (Unicode Technical Report) ...... 929 URO (Unified Repertoire and Ordering) . . 713, 957 UTS (Unicode Technical Standard) ...... xxvi, 929 see also Han unification abstracts ...... 930 UTF, Unicode Transformation Formats ...... 33, 121 Uyghur ...... 367, 577 advantages of each ...... 38 as encoding form or scheme ...... 133 V binary comparison and sort order differences . . . Vai ...... 765–766 231, ...... 233 valid (synonym for well-formed) ...... 123 in APIs ...... 200 variable-width Unicode encoding form ...... 36, 125 UTF-16 ...... 36, 125, 943 variants binary comparison and sort order caution . . . . 36 compatibility ...... 26 bit distribution (table) ...... 125 fullwidth and halfwidth ...... 285 BOM in ...... 131, 893 mathematical symbols ...... 838 encoding form (definition) ...... 125 small form ...... 285 encoding scheme (definition) ...... 131 standardized ...... 884 encoding schemes ...... 40 variation selectors ...... 193, 884 in ISO/IEC 10646 ...... 943 ideographic variation mark (U+303E) ...... 725 in UTF-8 order ...... 234 Mongolian free variation selectors ...... 536 surrogates and string handling ...... 43, 203 variation sequences ...... 884 UTF-16BE (Big-endian) ...... 894 for Phags-pa ...... 581–583 encoding scheme ...... 41 Version 12.0 ...... 79 encoding scheme (definition) ...... 130 number of characters ...... 3 UTF-16LE (Little-endian) ...... 894 versions of the Unicode Standard . xxiv, 74, 931, 949– encoding scheme ...... 41 951 encoding scheme (definition) ...... 130 backward compatibility ...... 74 UTF-32 ...... 35, 124 compared to ISO/IEC 10646 editions ...... 949 as processing code ...... 38 content ...... 75 BOM in ...... 132 interaction in implementations ...... 201 encoding form (definition) ...... 124 numbering ...... 75 encoding scheme (definition) ...... 132 property changes ...... 74 encoding schemes ...... 40 stability ...... 74 in Unix ...... 35 updates ...... 931 UTF-32BE (Big-endian) vertical tab (U+000B) ...... 209, 869 encoding scheme ...... 41 vertical text ...... 53, 262, 284 encoding scheme (definition) ...... 132 East Asian scripts ...... 702 UTF-32LE (Little-endian) Mongolian ...... 534 encoding scheme ...... 41 Vietnamese ...... 292, 299 encoding scheme (definition) ...... 132 ideographs ...... 702 UTF-8 ...... 36, 125, 943 virama ...... 258, 445 ASCII transparency ...... 36 definition ...... 450 binary comparison and sort order ...... 39 Kharoshthi ...... 573 bit distribution (table) ...... 126 Khmer ...... 648 BOM in ...... 130, 133, 894 Myanmar ...... 639 byte ranges ...... 126 Philippine scripts ...... 678 compared to multibyte encodings ...... 37 virama-like characters ...... 191 encoding form (definition) ...... 125 visual order used for Thai and Lao ...... 21 encoding scheme ...... 40 vowel harmony encoding scheme (definition) ...... 130 Mongolian ...... 538 in Unix ...... 18 vowel marks, Middle Eastern scripts ...... 359 in UTF-16 order ...... 233 vowel separator non-shortest form is invalid ...... 125, 245 Mongolian ...... 539 Index 987

vowel signs ZWJ see zero width joiner (U+200D) Indic ...... 56, 449 ZWNBSP see zero width no-break space (U+FEFF) Khmer ...... 650 ZWNJ see zero width non-joiner (U+200C) Philippine scripts ...... 678 ZWSP see zero width space (U+200B) W Wancho ...... 561 ...... 549 wchar_t and Unicode encoding forms ...... 38 in C language ...... 200 weak directional characters ...... 171 weather symbols ...... 851 website, Unicode Consortium ...... 930 Weierstrass elliptic function symbol ...... 810 well-formed definition ...... 122 Welsh ...... 294 Where Is My Character? ...... 932 wide characters data type in C ...... 200 wiggly fence (U+29DB) ...... 836 Windows newline function ...... 210 word breaks ...... 219, 871–873 in South Asian scripts ...... 633, 641, 656 word joiner (U+2060) ...... 871 writing direction see directionality writing systems ...... 256–260 Wu (Shanghainese) ...... 710 X Xibe ...... 533 Xishuangbanna Dai ...... 659 Y ...... 739–741 Yiddish ...... 361 Yijing Hexagram Symbols ...... 858 ypogegrammeni ...... 304 Z Zanabazar Square ...... 585–587 Zapf Dingbats ...... 854 zero extension relation among encodings ...... 942 zero width joiner (U+200D) ...... 369–370, 874 zero width no-break space (U+FEFF) . . . 67, 83, 871 initial ...... 133, 894 zero width non-joiner (U+200C) ...... 369–370, 875 zero width space (U+200B) ...... 872 for word breaks in South Asian scripts . 633, 641, 656 zero-width space characters ...... 872