<<

The ® Standard Version 14.0 – Core Specification

To learn about the latest version of the Unicode Standard, see https://www.unicode.org/versions/latest/. Many of the designations used by manufacturers and sellers distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of trade- mark claim, the designations have been printed with initial capital letters or in all capitals. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the and other countries. The authors and publisher have taken care in the preparation of this specification, but make expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Database and other files are provided as-is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. © 2021 Unicode, Inc. All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction. For information regarding permissions, inquire at https://www.unicode.org/reporting.html. For information about the Unicode terms of use, please see https://www.unicode.org/copyright.html. The Unicode Standard / the ; edited by the Unicode Consortium. — Version 14.0. Includes . ISBN 978-1-936213-29-0 (https://www.unicode.org/versions/Unicode14.0.0/) 1. Unicode (Computer character set) . Unicode Consortium. QA268.U545 2021

ISBN 978-1-936213-29-0 Published in Mountain View, September 2021 1001 I Index

The index covers the contents of this core specification. To find topics in the Unicode Stan- dard Annexes, Unicode Technical Standards, and Unicode Technical Reports, use the search feature on the Unicode website. For definitions of terms used, see the glossary on the Unicode website. To find the code points for specific characters or the code ranges for particular scripts, use the Character Index on the Unicode website. (See Appendix .3, Other Unicode Online Resources.)

A alternate format characters (deprecated) . . .193, 915– 916 abbreviation, Coptic ...... 313 Americas ...... 258, 365 scripts of ...... 805–813 abstract character sequences definition ...... 88 Amharic ...... 782 ...... 453–454 abstract characters ...... 29 Ancient Symbols ...... 891 definition ...... 88 ...... 259, 260, 455, 653 angle (+2329 and U+232A) deprecated for technical publication ...... 872 accent marks see Annexes, Unicode Standard (UAX) ...... xxiii, 965 accented characters encoding ...... 12 as components of Unicode Standard ...... 77 conformance ...... 83 Latin ...... 291 list of ...... 83 normalization ...... 206 accounting , ideographic ...... 176 annotation characters ...... 928–930 use in discouraged ...... 929 acrophonic numerals ...... 205, 310 ANSI/ISO Adlam ...... 802–803 Aegean numbers ...... 344 wchar_t and Unicode ...... 200 (U+0027) ...... 274 Africa Arabic ...... 373–398 scripts of ...... 781–804 Afrikaans ...... 296 digits ...... 849 Arabic-Indic digits ...... 377–378 Ahom ...... 648–649 signs used with ...... 379 Ainu ...... 756 Aiton ...... 668 ArabicShaping.txt ...... 382, 386, 404 Aramaic ...... 422, 455, 550, 581, 587 Alchemical Symbols ...... 887 areas of the Unicode Standard ...... 44 Algonquian ...... 808 Ali Gali ...... 549 ARIB ...... 882 Armenian ...... 321–322 aliases arrows ...... 868–869 character name ...... 86, 181 informative ...... 942 ASCII characters with multiple semantics ...... 264 normative ...... 943 transparency of UTF-8 ...... 37 property ...... 162 property value ...... 162 Unicode modeled on ...... 1 zero extension ...... 200, 978 allocation areas ...... 44 Assamese ...... 483 allocation of encoded characters ...... 43–51 Alphabetic (informative property) ...... 189 assigned code points ...... 11, 30 Athapascan ...... 808 ...... 258 atomic character boundaries ...... 218 European ...... 289–339 mathematical ...... 841–846 Avestan ...... 430–431 Index 1002

B ...... 816–817 Breton ...... 296 Balinese ...... 707–712 Buginese ...... 705–706 Bamum ...... 797–798 Bangla ...... 483–489 Buhid ...... 702 Bulgarian ...... 315 base characters ...... 329 bullets ...... 278 definition ...... 104 multiple ...... 59 numeric ...... 851 Burmese see ordered before combining marks ...... 220, 329 Byelorussian ...... 315 Basic Multilingual (BMP) ...... 1, 43 allocation areas ...... 48 (BOM) (U+FEFF) .39, 66, 129–132, 926–928 representation in UTF-16 ...... 36 byte ordering Basque ...... 296 Bassa Vah ...... 799 changing ...... 79 conformance ...... 81 Batak ...... 718–719 byte serialization ...... 39, 66 ...... 702 benefits of Unicode ...... 1 Byzantine Musical Symbols ...... 824 ...... 483–489 Bhaiksuki ...... 593–594 C Bidi Class (normative property) ...... 171 C language Bidi Mirrored (normative property) ...... 178 wchar_t and Unicode ...... 200 Bidi Mirroring (informative property) . . . . 179 C0 and C1 control codes ...... 31, 188, 900 BidiMirroring.txt ...... 179 Cambodian see Khmer Bidirectional Algorithm, Unicode ...... 52, 82 Canadian Aboriginal Syllabics ...... 808–809 bidirectional ordering ...... 20 candrabindu ...... 481, 621 controls ...... 912 canonical composite characters ...... 52, 82 see canonical decomposable characters Middle Eastern scripts ...... 365 canonical composition algorithm ...... 137 nonspacing marks in ...... 223 canonical decomposable characters in ...... 263 definition ...... 116 big-endian ...... 39 canonical decomposition ...... 62 definition ...... 81 definition ...... 115 Bihari ...... 479 mappings ...... 114 binary comparison and sort order canonical equivalence caution for UTF-16 ...... 36 definition ...... 116 UTF differences ...... 231, 233 nonspacing marks ...... 225 UTF-8 ...... 38 canonical equivalent character sequences block ...... 44, 88, 257, 937 conformance ...... 79 headers ...... 949 canonical mappings BMP see Basic Multilingual Plane see canonical decomposition mappings BNF (Backus-Naur Form) ...... 959 canonical ordering algorithm ...... 136 BOCU-1 see UTN #6, BOCU-1 canonical precomposed characters MIME-Compatible Unicode Compression see canonical decomposable characters Bodhi ...... 538 Cantonese ...... 735 Bodo ...... 478 capital letters ...... 164, 236, 289 BOM (U+FEFF) ...... 39, 66, 129–132, 926–928 Carian ...... 348 ...... 752–754 (U+000D) (CR) ...... 209, 901 boundaries, text ...... 60, 190, 217–218, 228 carriage return and line feed (CRLF) ...... 209 see also UAX #14, Unicode Line Breaking Algo- case ...... 297 rithm and text processes ...... 12 see also UAX #29, Unicode Text Segmentation beyond ASCII ...... 237 boustrophedon ...... 52, 354 camelcase ...... 239 box drawing symbols ...... 876 case folding ...... 240 Brahmi ...... 455, 581, 583–586, 587, 655 case operations (conformance) ...... 83, 151–157 Index 1003

case operations and normalization ...... 242 character sequences case operations, reversibility ...... 239 abstract see abstract character sequences cased (definition) ...... 152 canonical equivalent see canonical equivalent case-insensitive comparison ...... 156, 231, 240 character sequences casing context (definition) ...... 152 compatibility equivalent see compatibility equiva- conversion ...... 153 lent character sequences detection ...... 155 conformance ...... 79 European alphabets ...... 289 named ...... 181 exceptional Latin pairs ...... 293, 297 character sequences, combining ...... 104 ...... 324 character shaping selectors (deprecated) ...... 915 lowercase ...... 164, 236, 289 character statistics ...... 966 mapping tables ...... 196 character tabulation (U+0009) ...... 901 mappings ...... 151, 166, 236–238 characters mappings noted in code charts ...... 946 abstract see abstract characters titlecase ...... 164, 236 arrangement in Unicode ...... 45 Turkish I ...... 238, 293 assigned ...... 11, 30 uppercase ...... 164, 236, 289 boundaries ...... 217 see also default case canonical decomposable see canonical decompos- Case (normative property) ...... 164, 236 able characters CaseFolding.txt ...... 166, 240 classes ...... 960 caseless letters ...... 297 code charts ...... 937–955 Catalan ...... 295 coded see encoded characters Caucasian Albanian ...... 359 combining see combining characters cedilla ...... 292 compatibility decomposable see compatibility CEF see forms decomposable characters CES see character encoding schemes composite see decomposable characters Chakma ...... 569 concept of ...... 15, 60 Cham ...... 693–694 conformance definitions ...... 88–91 character encoding forms (CEF) ...... 33–38, 978 confusable ...... 245 see also Unicode encoding forms conversion ...... 196–197 character encoding model ...... 33, 41 decomposable see decomposable characters see also UTR #17, Unicode Character Encoding deprecated see deprecated characters Model encoded see encoded characters character encoding schemes (CES) ...... 39–42 encoding forms see encoding forms see also Unicode encoding schemes encoding schemes see encoding schemes character encoding standards end-user perceived ...... 60 coverage by Unicode ...... 3 format control ...... 30, 67, 265, 899–916 Character Index ...... 966 , relationship to ...... 15 character literals, Unicode graphic ...... 30 notation U+ ...... 960 identity (definition) ...... 85 character names ...... 86, 180–187, 982 ignored in processing ...... 248–254 aliases ...... 86, 181 interpretation ...... 78 conventions ...... 957 layout control ...... 67, 903–913 for CJK ideographs ...... 951 modification ...... 79 for control codes ...... 185, 188 names list ...... 938–950 in code charts ...... 942 names see character names matching ...... 181 not encoded in Unicode ...... 3 character properties encoded in Version 14.0 ...... 3 see properties precomposed see decomposable characters see also individual properties, .. Combining Class properties see properties character semantics ...... 1, 78, 85–86, 983 semantics see character semantics as Unicode design principle ...... 18 special ...... 66, 899–935 ASCII ...... 264 supplementary see supplementary characters definition ...... 85 Index 1004

transcoding ...... 196–197 cluster boundaries ...... 217 unsupported ...... 201 code charts ...... 937–955 characters, not glyphs representative glyphs ...... 938 in spoofing ...... 246 code point sequences Unicode principle ...... 15 notation ...... 958 charsets code points ...... 7, 29 IANA registered names ...... 40 assigned ...... 11, 30 Cherokee ...... 806 assignment ...... 45 Chinese ...... 734–735 categories ...... 30 Cantonese ...... 735 default ignorable ...... 201, 253 Hakka ...... 753 definition ...... 88 Mandarin ...... 735 designated ...... 30 Minnan (Hokkien/Fujian, incl. Taiwanese) . . 753 notation ...... 957 simplified and traditional ...... 734 number in Unicode Standard ...... 1 Chorasmian ...... 432 private-use see private-use code points Chu hán ...... 733 reserved see reserved code points Chu Nôm ...... 990 semantics ...... 32 citations for surrogate see surrogates properties ...... 75 unassigned see unassigned code points Unicode algorithms ...... 76 undesignated ...... 30 Unicode Standard ...... 74 code positions see code points CJK ideographs ...... 260, 728–744 code set independence ...... 18 accounting numbers ...... 176 code unit sequences CJK Compatibility Ideographs ...... 743 definition ...... 118 CJK Compatibility Supplement ...... 744 ill-formed (definition) ...... 120 CJK Strokes ...... 746, 993 notation ...... 958 CJK Unified Ideographs ...... 728–743 well-formed (definition) ...... 120 CJK Unified Ideographs Extension B ...... 743 code units code charts ...... 951 definition ...... 118 compatibility ideographs in Plane 2 ...... 51 isolated ...... 117 component structure ...... 738 code values see code units encoding blocks ...... 729 coded character representations ideographic description sequences . . . . . 748–751 see coded character sequences ideographic variation mark (U+303E) ...... 750 coded character sequences Kangxi radicals ...... 742, 745–746 definition ...... 90 names ...... 951 coded characters see encoded characters numbers ...... 849 codespace see Unicode codespace numeric values ...... 176, 205 coeng ...... 669, 672 order of encoding ...... 740 Collation Algorithm, Unicode (UCA) ...... 12 radicals ...... 745–746 collation see sorting source standards ...... 992 collation tables ...... 196 unknown or unavailable ...... 286 sequences ...... 55, 104 Vietnamese ...... 726 defective ...... 223 CJK Miscellaneous Area ...... 49 definition ...... 106 CJK punctuation and symbols ...... 284 Latin ...... 291 compatibility forms ...... 287 line breaking ...... 219 overscores and ...... 287 matching ...... 219 quotation marks ...... 272 order of base character and marks ...... 220, 329 sesame dots ...... 286 rendering ...... 219 vertical forms ...... 287 selection ...... 217 CJK Unified Ideographs extensions ...... 730–731 truncation ...... 220–221 CJK-JRG (Chinese/Japanese/Korean Joint Research combining characters . . . . . 54–59, 108–113, 219–227 Group) ...... 988 blocking reordering ...... 910 CJKV Ideographs Area ...... 49 canonical ordering ...... 61, 136, 168 Index 1005

combining marks ...... 329–330 conformance ...... 71–157 definition ...... 104 definitions ...... 85–91 dependence ...... 329 examples ...... 68 display order ...... 56 ISO/IEC 10646 implementations ...... 983 keyboard input ...... 220 requirements ...... 77–82 ligatures ...... 59 confusables ...... 245 multiple ...... 56 conjunct consonants multiple base characters ...... 59 Indic ...... 217, 463 normalization of ...... 206 Myanmar ...... 663 ordering conventions ...... 55 selection of clusters ...... 217 rendering of marks ...... 222–227 contextual shaping reordrant ...... 169 apostrophe ...... 274 -specific ...... 55 Arabic ...... 373 split ...... 169 not used for final forms ...... 368 ...... 170 Syriac ...... 403 subjoined ...... 170 contour tones ...... 327 typographical interaction ...... 56, 168 control codes ...... 31, 67, 900 vertical stacking ...... 56 graphics for ...... 871 see also diacritics names ...... 188 Combining Class (normative property) ...... 168 properties ...... 901 combining classes ...... 134, 168, 225–226 semantics ...... 32, 901 class zero characters ...... 168 specified in Unicode ...... 901 definition ...... 134 control sequences ...... 900 combining joiner (U+034F) ...... 910 conversion of characters ...... 196–197 combining half marks ...... 191, 337 convertibility combining marks see combining characters as Unicode design principle ...... 24 below ...... 292 Coptic ...... 309, 312–314 Compatibility and Area ...... 26, 49 Coptic Epact numbers ...... 854 compatibility characters ...... 22 corporate use subarea ...... 921 compatibility composite characters ...... 27 corrigenda ...... 74 see compatibility decomposable characters CR (U+000D carriage return) ...... 209, 901 compatibility decomposable characters ...... 26 Creative Commons ...... 897 definition ...... 114 CRLF (carriage return and line feed) ...... 209 compatibility decomposition ...... 62 Croatian ...... 296 definition ...... 114 digraphs ...... 296 compatibility decomposition mappings ...... 114 culturally expected sorting ...... 12, 230 compatibility equivalence definition ...... 115 Old Persian ...... 443 compatibility equivalent character sequences Sumero-Akkadian ...... 438–441 conformance ...... 79 Ugaritic ...... 442 compatibility mappings Cuneiform and Hieroglyphic Area ...... 50 see compatibility decomposition mappings Cuneiform and Hieroglyphs ...... 437–454 compatibility precomposed characters currency symbols block ...... 835–838 see compatibility decomposable characters currency symbols encoded in other blocks . . 836 compatibility variants ...... 26 currency symbols, other ...... 837 mapping ...... 243 dollar sign, and usage ...... 836 composite characters euro sign ...... 837 see decomposable characters lari sign ...... 837 Composition Exclusion (normative property) . . . . 98 lira sign, compatibility usage ...... 836 compression ...... 208 lira sign, Turkish ...... 837 see also UTS #6, A Standard Compression Scheme peso signs, usage ...... 836 for Unicode (SCSU) ruble sign ...... 837 conferences ...... 966 rupee signs, Indian, usage ...... 837 yen and yuan signs, usage ...... 836 Index 1006

joining ...... 905–909 deprecated characters ...... 72, 941 Arabic ...... 381–388 alternate format ...... 193, 915–916 control characters for . . . . 192, 375–376, 553, 904 definition ...... 90 Mandaic ...... 412 Derived Age (property) ...... 202 Mongolian ...... 552–553 derived properties ...... 793 definition ...... 102 Phags- ...... 600 DerivedCoreProperties.txt ...... 152, 164, 253 Syriac ...... 403–407 DerivedNormalizationProps.txt ...... 242 transparency ...... 908 ...... 811–813 cursive scripts ...... 365 design goals of Unicode ...... 4 Cypriot ...... 346 design principles of Unicode ...... 14–24 see also designated code points ...... 30 Cypro-Minoan ...... 347 ...... 457–482 Cyrillic ...... 315–318 Dhivehi ...... 529 Czech ...... 296 diacritics ...... 54, 329 alternative glyphs ...... 291, 329 Czech ...... 292 display in isolation ...... 59, 267, 330 danda, in Devanagari block ...... 477 double ...... 112, 191, 331 Danish ...... 295 dashes ...... 267 German dialectology ...... 335 Greek ...... 305–306, 309 Database, Unicode Character Latin ...... 291–294 see Unicode Character Database (UCD) dead consonants, Indic ...... 460 Latvian ...... 292 mathematical ...... 845 dead keys ...... 220 on i and ...... 293 decomposable characters ...... 62 definition ...... 114 rendering ...... 222–227 Slovak ...... 292 normalization of ...... 206 spacing clones of ...... 327, 331 decomposition ...... 62, 114–116 canonical see canonical decomposition ...... 54, 336 see also combining characters compatibility see compatibility decomposition dictionary symbols ...... 883 definition ...... 114 in normalization ...... 206 digit form names ...... 377 digits ...... 205 mapping, definition ...... 114 Arabic ...... 849 mappings noted in code charts ...... 946 default case Arabic-Indic ...... 377–378 compatibility ...... 849 algorithms ...... 83, 151–157 ...... 175 conversion ...... 153 detection ...... 155 glyph variants ...... 851 ...... 849 folding ...... 154 Myanmar ...... 849 default caseless matching ...... 156 default grapheme clusters ...... 217 national shapes ...... 916 Shan ...... 849 see also UAX #29, Unicode Text Segmentation superscript and subscript ...... 850 Default Ignorable Code Point (property) ...... 253 default ignorable code points ...... 201, 253 Tai Laing ...... 849 Tai Tham ...... 849 default property values digraphs ...... 296, 299, 301 definition ...... 95 defective combining character sequences ...... 223 ...... 885–886 directionality ...... 20, 52 definition ...... 106 East Asian scripts ...... 726 dependent vowel signs Indic ...... 459 Middle Eastern scripts ...... 365 Mongolian ...... 551 Khmer ...... 674 musical symbols ...... 819 Philippine scripts ...... 702 normative property ...... 171 ...... 362 Index 1007

Old Italic ...... 351 encoding forms ...... 33–38 Philippine scripts ...... 703 ISO/IEC 10646 definitions ...... 978 ...... 354 encoding forms, Unicode discussion list for Unicode ...... 966 see Unicode encoding forms Dives Akuru ...... 646–647 encoding model for Unicode characters ...... 33, 41 Dogra ...... 651–652 see also UTR #17, Unicode Character Encoding Dogri ...... 478 Model Domino Tiles ...... 888 encoding schemes ...... 39–42 dotless i ...... 238, 293 encoding schemes, Unicode dotted circle see Unicode encoding schemes in code charts ...... 105, 330 endian ordering in fallback rendering ...... 222 see byte order mark (BOM) (U+FEFF) to indicate ...... 54 end-user subarea ...... 922 to indicate vowel sign placement ...... 56 English ...... 295 double diacritics ...... 112, 191, 331 equivalent sequences ...... 206 Duployan ...... 829–830 as Unicode design principle ...... 23 Dutch ...... 295, 296 case-insensitivity ...... 231, 240 dynamic composition combining characters in matching ...... 219 as Unicode design principle ...... 23 conformance ...... 80 Dzongkha ...... 538 ...... 764 in sorting and searching ...... 230 E language-specific ...... 116 East Asian scripts ...... 725–777 security implications ...... 245 see also canonical equivalence direction ...... 52 see also compatibility equivalence see also CJK ideographs Eastern Arabic-Indic digits ...... 377 see also encoding forms, encoding schemes errata ...... xxv, 74, 968 EBCDIC escape sequences ...... 900 function ...... 210 editing, text boundaries for ...... 217–218 not used in Unicode ...... 1, 4 Esperanto ...... 296 efficiency Estonian ...... 296 as Unicode design principle ...... 15 ...... 444–450 ...... 782–785 Etruscan ...... 350 format controls ...... 446–450 European scripts ...... 289–339 Elbasan ...... 358 ...... 276–277 ancient ...... 341–363 eyelash- ...... 469 ...... 433 e-mail discussion list for Unicode ...... 966 ...... 880, 881, 966 animal symbols ...... 884 fallback rendering ...... 252 charts ...... 966 of nonspacing marks ...... 222 cultural symbols ...... 884 FAQ (Frequently Asked Questions) ...... 967 zodiacal symbols ...... 884 Faroese ...... 295 emoji modifiers ...... 884 Farsi ...... 373, 375 ...... 885 featural ...... 259 ...... 895 FF (U+000C form feed) ...... 209, 901 enclosing marks ...... 337 separator (U+001C) ...... 901 definition ...... 105 Finnish ...... 295 encoded characters ...... 7, 29 Finno-Ugric Transcription (FUT) allocation ...... 43–51 see Uralic Phonetic (UPA) definition ...... 90 fixed-width Unicode encoding form (UTF-32) . . . 35, encoding form conversion 122 definition ...... 125 flat tables ...... 196 Flemish ...... 295 Index 1008

fleurons ...... 887 graphic characters ...... 30 fonts Greek ...... 305–310 and Unicode characters ...... 16 acrophonic numerals ...... 205, 310 for mathematical alphabets ...... 844–846 alternative glyphs ...... 306–308 style variation for symbols ...... 833 ancient ...... 826–828 form feed (U+000C) (FF) ...... 209, 901 editorial marks ...... 281 format control characters ...... 30, 67, 265, 899–916 letters as symbols ...... 306–308, 865 deprecated ...... 915–916 see also Cypriot, Linear B prefixed ...... 193, 333 Greenlandic ...... 296 stateful ...... 913 group separator (U+001D) ...... 901 fraction characters ...... 862 guillemets ...... 271 fraction (U+2044) ...... 275, 858 ...... 495–496 French ...... 296 Gunjala Gondi ...... 576–577 Frisian ...... 296 ...... 490–494 fullwidth forms in East Asian encodings ...... 761 futhark ...... 353 Hakka ...... 753 G halant ...... 455 Garshuni ...... 399 see also Ge’ez ...... 782 half marks, combining ...... 191, 337 General Category (normative property) ...... 172 half-consonants, Indic ...... 464 list of values ...... 172 halfwidth forms in East Asian encodings ...... 761 ...... 263–287 hamza ...... 392–394 General Scripts Area ...... 49 Han ideographs see CJK ideographs geometrical symbols ...... 876–879 ...... 736–743 Georgian ...... 323–324 and language tags ...... 215 German ...... 295 history ...... 987–992 geta mark (U+3013) ...... 286 language usage ...... 733 Glagolitic ...... 320 source separation rule ...... 731, 737 Glossary ...... 967 source standards ...... 992 glyph selection tables ...... 196 hand symbols ...... 884 glyphs ...... 6, 15 Hangul Area ...... 49 characters, relationship to ...... 15 Hangul syllables ...... 725, 762–765 diacritics alternative ...... 291, 329 and combining marks ...... 112 Greek alternative ...... 306–308 canonical decomposition ...... 143 Latin alternative ...... 291 collation ...... 764 mathematical alternative ...... 864 composition ...... 145 missing ...... 253 conjoining jamo ...... 141–150 representative in code charts ...... 938 equivalent sequences ...... 764 standardized variants ...... 917 Hangul Compatibility Jamo ...... 763 symbols alternative ...... 833 Hangul Jamo ...... 762–765 golden numbers ...... 355 Hangul Syllables block ...... 764–765 Gothic ...... 357 Johab set ...... 764 Grantha ...... 642–645 name generation ...... 146 grapheme base ...... 329 normalization ...... 763 definition ...... 107 standard ...... 142 grapheme clusters ...... 11, 60 Hangzhou numerals ...... 858 see also UAX #29, Unicode Text Segmentation Hanifi Rohingya ...... 700 default ...... 217 see CJK ideographs definition ...... 107 Hanunóo ...... 702 grapheme extender Hanzi see CJK ideographs definition ...... 107 harakat ...... 374 grapheme joiner, combining (U+034F) ...... 910 hasant ...... 483 Index 1009

hash tables ...... 197 industry character sets Hatran ...... 436 covered in Unicode ...... 3 Hebrew ...... 367–372 information separators (U+001C..U+001F) . . . . . 901 ...... 756–758 informative properties hieroglyphs definition ...... 99 Anatolian ...... 453–454 ...... 428 Egyptian ...... 444–450 ...... 428 Meroitic ...... 451–452 inside-out rule ...... 222 high surrogate interchange restrictions ...... 31 definition ...... 117 International Phonetic Alphabet (IPA) 258, 298–299 high-surrogate code points ...... 77, 923 ...... 326 high-surrogate code units ...... 117 see also phonetic alphabets higher-level protocols internationalization ...... 18 definition ...... 91 Internationalization & Unicode Conference . . . . 966 Hindi ...... 457 protocols ...... 755 UTF-8 as preferred encoding ...... 37 horizontal tab (U+0009) ...... 901 Inuktitut ...... 808 HTML newline function ...... 210 invisible operators ...... 870 Hungarian ...... 296 subscript ...... 306 hyphenation ...... 904 IPA see International Phonetic Alphabet as a text process ...... 10 IRG (Ideographic Research Group) ...... 991 hyphens ...... 267, 904 Irish ...... 295, 362 ISCII standard and Unicode ...... 457 I ISO/IEC 10646 ...... 969–983 conformance of Unicode implementations . . 983 I Ching symbols ...... 890 IANA charset names ...... 40 encoding forms ...... 978 synchrony with Unicode Standard ...... 980 Icelandic ...... 295 timeline compared to Unicode versions . . . . . 972 identifiers ...... 229 see also UAX #31, Unicode Identifier and Pattern Italian ...... 295 ITC ...... 885 Syntax IUC see Internationalization & Unicode Conference Ideographic (informative property) ...... 189 ideographic description sequences ...... 749 Ideographic Rapporteur Group (IRG) ...... 990 J Ideographic Research Group (IRG) ...... 991 jamos see Hangul syllables ideographs see also CJK ideographs Japanese ...... 725 IICore ...... 731, 990 Japanese era names ...... 896 ill-formed Javanese ...... 713–716 definition ...... 120 Jawi ...... 394 Imperial Aramaic ...... 422–423 jihvamuliya ...... 482, 621 implementation guidelines ...... 195–255 Johab ...... 764 in a Unicode encoding form joiners ...... 375 definition ...... 121 combining grapheme joiner (U+034F) ...... 910 in-band mechanisms ...... 935 (U+2060) ...... 903 India zero width joiner (U+200D) ...... 375–376, 906 Official scripts ...... 455–526 justification ...... 224 Indian rupee signs, usage ...... 837 Indic scripts ...... 455–526 principles, in terms of Devanagari . . . . . 458–468 ...... 618–620 relation to ISCII standard ...... 457 Kana (Hiragana and ) ...... 755–756 Indic Siyaq ...... 856 Kanbun ...... 744 Indonesia and Oceania Kangxi radicals ...... 742, 745–746 scripts of ...... 701–723 see CJK ideographs Indonesian ...... 295 ...... 514–517 Index 1010

Kashmiri ...... 479 leading surrogates Katakana ...... 755–756 see high-surrogate code units Kawi ...... 707, 709 legibility criterion for plain text ...... 19 Kayah Li ...... 692 Lepcha ...... 570–572 KC (normalization form) spacing ...... 904 see Normalization Form KC ...... 839–846 KD (normalization form) LF (U+000A line feed) ...... 209, 901 see Normalization Form KD ligatures ...... 905–909 keytop labels ...... 871 Arabic ...... 384–385 Khamti Shan ...... 666 combining characters on ...... 59 Kharoshthi ...... 587–588 control characters for ...... 192 ...... 778–779 for nonspacing marks ...... 226 Khmer ...... 669–680 Latin ...... 301 characters not recommended ...... 677 selection ...... 218 components, order of ...... 678 Syriac ...... 407 Khojki ...... 629–630 Limbu ...... 559–562 Khudawadi ...... 631–632 line breaking ...... 209–213, 903–905 killer ...... 260 control characters ...... 191 Batak ...... 718 in South Asian scripts ...... 657, 665, 680 Brahmi ...... 583 recommendations ...... 211 Meetei Mayek ...... 563 see also UAX #14, Unicode Line Breaking Algo- Myanmar (asat) ...... 664 rithm see also virama line feed (U+000A) (LF) ...... 209, 901 Konkani ...... 477 line separator (U+2028) () ...... 209, 905 Kurdish ...... 394 line tabulation (U+000B) (VT) ...... 901 ...... 343 Linear B ...... 344–345 see also Cypriot Ladino ...... 367 language tags ...... 215, 931–935 linear boundaries ...... 218 Lisu ...... 770–772 and Han unification ...... 215 Lithuanian ...... 296 use strongly discouraged ...... 931, 934 Lanna ...... 683 little-endian ...... 39 definition ...... 81 ...... 659–661 logical order last-resort glyphs ...... 253 Latin ...... 291–304 as Unicode design principle ...... 19 exceptions to ...... 169 alternative glyphs ...... 291 logograph ...... 260 Basic Latin ...... 295 encoding blocks ...... 44 logosyllabaries ...... 260 low surrogate IPA Extensions ...... 298–299 definition ...... 117 Latin Extended Additional ...... 301–304 Latin Extended-A ...... 295 low-surrogate code points ...... 77, 923 low-surrogate code units ...... 117 Latin Extended-B ...... 296–298 lowercase ...... 164, 236, 289 Latin Extended-C ...... 301 Latin Extended-D ...... 302 LS (U+2028 line separator) ...... 209, 905 Lycian ...... 348 Latin Extended-E ...... 303 Lydian ...... 348 Latin Ligatures ...... 301 Latin-1 Supplement ...... 295 ...... 300–304 Latin Extended-F ...... 304 MacOS newline function ...... 210 Latin Extended-G ...... 304 ...... 627–628 Latvian ...... 296, 303 Mahjong Tiles ...... 887 cedilla ...... 292 mail discussion list for Unicode ...... 966 layout control characters ...... 67, 903–913 Maithili ...... 478 Index 1011

major version ...... 73 Modi ...... 637–639 Makasar ...... 722–723 modifier letters ...... 325–328 Malay ...... 295 Modifier Letters, Spacing ...... 301 Malay, Patani ...... 658 Mongolian ...... 549–558, 595 Malayalam ...... 518–526 writing direction ...... 551 Suriyani ...... 408, 519 symbols ...... 882 Maltese ...... 296 Mro ...... 565 Manchu ...... 550 Multani ...... 633 Mandaic ...... 411–413 multibyte encodings Mandarin ...... 735 compared to UTF-8 ...... 37 Manden ...... 790 multistage tables ...... 196 Manichaean ...... 424–427 musical symbols ...... 818–828 map symbols ...... 883 ancient Greek ...... 826–828 mapping tables see tables of character data Balinese ...... 712 Marathi ...... 457, 469, 476 Byzantine ...... 824 Marchen ...... 602 directionality ...... 819 markup languages Gregorian ...... 823 and Unicode conformance ...... 935 Kievan ...... 823 line breaking ...... 209 Persian ...... 823 Masaram Gondi ...... 574–575 Western ...... 818–823 Mathematical (informative property) ...... 862 Myanmar ...... 662–668 mathematical expression format characters ...... 193 digits ...... 849 see also UTR #25, Unicode Support for Mathe- Myanmar Extended-A ...... 666 matics Myanmar Extended-B ...... 666 mathematical symbols ...... 862–869 alphabets ...... 841–846 N alphanumeric ...... 840–846 N’Ko ...... 790–794 fonts ...... 844–846 Nabataean ...... 434 format characters ...... 870 named character sequences ...... 181 fragments for typesetting ...... 872 names, character see character names invisible operators ...... 870 namespace ...... 87 operators ...... 863–866 ...... 640–641 standardized variants ...... 869 NEL (U+0085 next line) ...... 209, 901 MathML ...... 866 Nepali ...... 457 matras ...... 168, 459 neutral directional characters ...... 171 Medefaidrin ...... 804 New Tai Lue ...... 683–685 Meetei Mayek ...... 563–564 Newa ...... 535–537 Mende Kikakui ...... 800–801 newline function (NLF) ...... 210, 902 Meroitic newline guidelines ...... 209–213 cursive ...... 451–452 next line (U+0085) (NEL) ...... 209, 901 hieroglyphs ...... 451–452 NFC (Normalization Form C) ...... 61 Miao ...... 773–774 NFD (Normalization Form D) ...... 61 Middle Eastern scripts ...... 365–530 NFKC (Normalization Form KC) ...... 61 ancient ...... 415–436 NFKD (Normalization Form KD) ...... 61 Min ...... 735 NLF (newline function) ...... 210, 902 Minnan (Hokkien/Fujian, incl. Taiwanese) . . . . . 753 no-break (U+00A0) ...... 903 minor version ...... 73 base for diacritic in isolation ...... 59, 267, 330 minus sign ...... 865 no-break space, narrow (U+202F) ...... 556 commercial (U+2052) ...... 279 noncharacter code points see noncharacters mirrored property noncharacters ...... 31, 924 see Bidi Mirrored (normative property) conformance ...... 77 mirroring of paired punctuation ...... 269 definition ...... 91 ...... 882 handling ...... 80 missing glyphs ...... 253 Index 1012

in code charts ...... 941 Cuneiform ...... 441 interchange restrictions ...... 31 Ethiopic ...... 784 semantics ...... 32 Greek acrophonic ...... 205 U+10FFFF (not a character code) ...... 924 Hangzhou ...... 858 U+FDD0..U+FDEF ...... 31, 924 Meroitic cursive ...... 452 U+FFFE (not a character code) ...... 66, 925 old-style ...... 276 U+FFFF (not a character code) ...... 31, 924 Roman ...... 205, 862 nondecomposable characters ...... 63 Rumi ...... 855 non-joiner, zero width (U+200C) . . . . . 375–376, 907 Suzhou-style ...... 858 nonlinear boundaries ...... 218 numeric separators ...... 278 non-overlap principle in Unicode encoding forms 33 numeric shape selectors (deprecated) ...... 916 nonspacing marks ...... 329 Numeric Type (normative property) ...... 175 definition ...... 105 Numeric Value (normative property) ...... 175 display in isolation ...... 59, 267, 330 (U+2116) ...... 839 positioning ...... 226 Nüshu ...... 769 rendering ...... 222–227 ...... 697–698 see also combining characters see also diacritics normalization ...... 61, 206–207 object replacement character (U+FFFC) ...... 930 and case operations ...... 242 octet ...... 959 canonical ordering algorithm ...... 61, 136, 168 Ogham ...... 362 conformance ...... 82 Ol Chiki ...... 567–568 of private-use characters ...... 921 Old Church Slavonic ...... 315 see also UAX #15, Unicode Normalization Forms Old Hungarian ...... 356 stability ...... 133 Old Italic ...... 350–352 Normalization Form C (NFC) ...... 61 Old North Arabian ...... 417 Normalization Form D (NFD) ...... 61 Old Permic ...... 361 Normalization Form KC (NFKC) ...... 61 Old Persian ...... 443 Normalization Form KD (NFKD) ...... 61 Old Sogdian ...... 609 normalization forms ...... 133–140 Old South Arabian ...... 418–419 definition ...... 139 Old Turkic ...... 608 specification ...... 135 Old Uyghur ...... 611–612 normative behaviors old-style numerals ...... 276 definition ...... 85 ...... 497–500 normative properties ornamental dingbats ...... 886 definition ...... 97 Oromo ...... 782 list ...... 98 Osage ...... 810 may change ...... 97 Osmanya ...... 786 Norwegian ...... 295 Ottoman Siyaq ...... 856 notational conventions ...... 957–961 out-of-band mechanisms ...... 935 notational systems ...... 262, 815–832 overlapping encodings ...... 33 nukta ...... 374, 396, 470 overscores ...... 275 null (U+0000) as Unicode string terminator ...... 902 P number forms CJK ideographs ...... 205 ...... 695–696 numbers Pahlavi, Inscriptional ...... 428 Coptic Epact ...... 854 Pahlavi, Psalter ...... 429 handling ...... 205 Palmyrene ...... 435 ideographic accounting ...... 176 Panjabi ...... 490 numerals ...... 847–859 or section marks ...... 278 acrophonic ...... 310 paragraph separator (U+2029) () ...... 209, 905 Chinese ...... 860 Parthian, Inscriptional ...... 428 Coptic ...... 314 Pashto ...... 373 Index 1013

Patani Malay ...... 658 derived see derived properties ...... 699 in Unicode Character Database (UCD) ...... 45 Persian ...... 373, 375 informative see informative properties Phags-pa ...... 595–601 normative references to ...... 75, 82 symbols ...... 891 normative see normative properties Phake ...... 668 of control codes ...... 901 Philippine scripts ...... 702–704 provisional see provisional properties Phoenician ...... 420 simple see simple properties phonemes ...... 261 see also individual properties, e.g. combining phonetic alphabets ...... 258 classes IPA Extensions ...... 298–299 property values Phonetic Extensions ...... 300–304 aliases ...... 162 Spacing Modifier Letters ...... 326–328 aliases (definition) ...... 103 Uralic Phonetic Alphabet (UPA) ...... 279, 300 default ...... 95 see also International Phonetic Alphabet (IPA) default (definition) ...... 95 ...... 295 normative references to ...... 82 table PropertyAliases.txt ...... 103, 960 proposed new characters ...... 967 PropertyValueAliases.txt ...... 103, 960 pivot code, Unicode as ...... 196 PropList.txt ...... 166 plain text Provençal ...... 296 as Unicode design principle ...... 18 provisional properties legibility criterion ...... 19 definition ...... 100 planes of Unicode codespace ...... 43 PS (U+2029 paragraph separator) ...... 209, 905 Plane 0 (BMP) ...... 43 ...... 429 Plane 1 (SMP) ...... 43, 50 PUA (Private Use Area) ...... 49, 921 Plane 14 (SSP) ...... 44 punctuation ...... 263–287 Plane 2 (SIP) ...... 43, 51 blocks containing ...... 257 Plane 3 (TIP) ...... 43, 51 CJK ...... 284 Planes 15-16 (Private Use) ...... 51, 922 doubled ...... 276 Playing Cards ...... 888 ideographic ...... 747 points, Hebrew pronunciation marks ...... 367 in bidirectional text ...... 263 policies of the Unicode Consortium ...... 967 paired ...... 269 Polish ...... 296 small form variants ...... 287 Portuguese ...... 295 typographic forms ...... 263 precomposed characters vertical forms ...... 287 see decomposable characters Punctuation and Symbols Area ...... 49 compatibility see compatibility decomposable Punjabi ...... 490 characters prefixed format control characters ...... 193 prepended concatenation marks ...... 254, 333 quotation marks ...... 270–274 Private Use Area (PUA) ...... 49, 921 East Asian ...... 273 Private Use planes ...... 44, 51, 922 European ...... 271 private-use characters properties ...... 920 semantics ...... 32 private-use code points ...... 31, 201 radicals, Kangxi and other CJK ...... 745–746 conformance ...... 78 radical- index ...... 742 definition ...... 103 record separator (U+001E) ...... 901 high surrogates ...... 923 symbols ...... 883 properties ...... 18, 93–103, 159–194 references ...... 967 aliases ...... 162 referencing ...... 82 aliases (definition) ...... 102 properties ...... 75 and Unicode algorithms ...... 98 Unicode algorithms ...... 76 data tables ...... 196 Unicode Standard ...... 74 Index 1014

regional indicator symbols ...... 896 self-synchronization of encoding forms ...... 34 regular expressions ...... 214 semantics and line breaking ...... 209 see character semantics see also UTS #18, Unicode Regular Expressions sequences Rejang ...... 717 notation ...... 958 rendering of text ...... 6, 10, 17 Serbian fallback ...... 252 corresponding digraphs in Croatian ...... 296 unsupported characters ...... 201 Shan ...... 681 repertoire of abstract characters ...... 29 digits ...... 849 reph ...... 468, 472, 512 Sharada ...... 621–622 replacement character (U+FFFD) . . . . 42, 67, 81, 930 ...... 363, 770 reserved code points ...... 30, 201 Show Hidden ...... 79, 222, 252, 918 definition ...... 91 SHY (U+00AD ) ...... 904 in code charts ...... 941 Sibe ...... 551 preservation in interchange ...... 31 Siddham ...... 625–626 see also unassigned code points signature for Unicode data ...... 66, 926–928 Rhaeto-Romanic ...... 296 simple properties rich text ...... 18 definition ...... 102 right single (U+2019) simplified Chinese ...... 734 preferred for apostrophe ...... 274 Sindhi ...... 373, 477 right-to-left text ...... 52 Sinhala ...... 531–534 East Asian scripts ...... 726 Sinological ...... 303 Middle Eastern scripts ...... 365 SIP (Supplementary Ideographic Plane) ...... 43, 51 roadmap for script additions ...... 45, 967 Siyaq Numbers ...... 855 ...... 205, 862 Indic ...... 855 Romanian ...... 296 slash, fraction (U+2044) ...... 275 comma below ...... 293 Slovak ...... 296 Romany ...... 296 Slovenian ...... 296 Rong ...... 570–572 small letters ...... 164, 236, 289 Rumi numeral symbols ...... 855 SMP (Supplementary Multilingual Plane) . . . . 43, 50 Runic ...... 353–355 soft hyphen (U+00AD) (SHY) ...... 904 Russian ...... 315 Sogdian ...... 610 Somali ...... 786 Sora Sompeng ...... 650 Samaritan ...... 409–410 Sorbian ...... 296 sorting ...... 12, 230 Sami ...... 296 and combining grapheme joiner ...... 911 ...... 457 Saurashtra ...... 573 as a text process ...... 10 case-insensitive ...... 231 scalar values, Unicode culturally expected ...... 12, 230 see Unicode scalar values scripts language-insensitive ...... 230 see also Unicode Collation Algorithm (UCA) in Unicode Standard ...... 3 source separation rule ...... 731, 737 roadmap for future additions ...... 45, 967 types of ...... 262 South and Central Asian scripts Ancient ...... 581–610 see also UAX #24, Unicode Script Property Other historic ...... 613–652 SCSU see UTS #6, A Standard Compression Scheme for Other modern ...... 527–577 South Asian scripts ...... 455–562 Unicode Southeast Asian scripts ...... 653–700 searching ...... 230–232 as a text process ...... 10 Soyombo ...... 606–607 space (U+0020) case-insensitive ...... 231, 240 base for diacritic in isolation ...... 59, 267, 330 section or paragraph marks ...... 278 security issues ...... 245 space characters ...... 266, 903–905 graphics for ...... 871 Index 1015

space, zero width (U+200B) ...... 266 supplementary planes spacing clones of diacritics ...... 327, 331 representation in UTF-16 ...... 36 spacing marks ...... 329 representation in UTF-8 ...... 37 definition ...... 106 Supplementary ...... 51, 922 Spacing Modifier Letters ...... 326–328 Supplementary Special-purpose Plane (SSP) . . . . . 44 Spanish ...... 295 supported subsets ...... 70 special characters ...... 66, 899–935 conformance ...... 78 SpecialCasing.txt ...... 151, 166 supralineation ...... 313 Specials ...... 926–930 surrogate code points spell-checking see surrogates as a text process ...... 11 surrogate pairs ...... 36, 123 spellings, alternative definition ...... 117 see equivalent sequences processing ...... 203–204 spoofing ...... 245 surrogates ...... 31, 117, 923 SSP (Supplementary Special-purpose Plane) . . . . . 44 interchange restrictions ...... 31 stability ...... 100, 161 isolated surrogates, handling ...... 42 as Unicode design principle ...... 23 isolated surrogates, ill-formed ...... 123 stacked boundaries ...... 217 isolated surrogates, uninterpreted ...... 117 stacking sequences ...... 56 support levels ...... 203 nondefault ...... 58 Surrogates Area ...... 49, 923 standardized variants ...... 554, 917 Sutton SignWriting ...... 831–832 in the code charts ...... 948 Suzhou-style numerals ...... 858 mathematical symbols ...... 869 svasti signs ...... 545 StandardizedVariants.txt ...... 554, 869 Swahili ...... 295 standards coverage ...... 3 Swedish ...... 295 starters ...... 135 syllabaries ...... 259 stateful encoding alphabetic property ...... 189 not used in Unicode ...... 4 featural ...... 259 paired format controls ...... 913 Syloti Nagri ...... 616–617 string comparison ...... 12 symbols ...... 833–898 string literals, Unicode animal ...... 884 code point notation \u1234 ...... 960 appearance variation ...... 833 strings, Unicode ...... 42, 119 arrows ...... 868–869 null termination ...... 902 box drawing ...... 876 strong directional characters ...... 171 cultural ...... 884 styled text ...... 18 currency symbols block ...... 835–838 sublinear searching ...... 232 dictionary ...... 883 subsets, supported ...... 70 dingbats ...... 885–886 conformance ...... 78 emoji ...... 880, 881, 896 ISO/IEC 10646 specification for ...... 981 Enclosed Alphanumerics ...... 895 substitution character fragments for mathematical typesetting . . . . . 872 see replacement character game ...... 884 Sumero-Akkadian ...... 438–441 gender ...... 883 Sundanese ...... 720–721 genealogical ...... 884 superscripts ...... 327 geometrical ...... 876–879 and subscripts ...... 860 hand ...... 884 supplementary characters Khmer lunar calendar ...... 680 in UTF-16 strings ...... 42 letterlike ...... 839–846 tables for ...... 197 map ...... 883 Supplementary General Scripts Area ...... 49 mathematical ...... 862–869 Supplementary Ideographic Plane (SIP) ...... 43, 51 mathematical alphanumeric ...... 840–846 Supplementary Multilingual Plane (SMP) . . . . .43, 50 miscellaneous ...... 882 musical ...... 818–828 numerals ...... 847–859 Index 1016

recycling ...... 883 ...... 529–530 regional indicator ...... 896 ...... 655–658 technical ...... 871–875 ...... 538–548 ...... 882 ...... 787 zodiacal ...... 884 Tigre ...... 782 symmetric swapping format characters ...... 915 (U+007E) ...... 279 Syriac ...... 399–408 TIP (Tertiary Ideographic Plane) ...... 43, 51 Tirhuta ...... 634–636 titlecase ...... 164, 236 tab (U+0009 character tabulation) ...... 901 Todo ...... 550 tone letters ...... 327–328 tab, vertical (U+000B) ...... 209, 901 tone marks tables of character data ...... 196–197 optimization ...... 197 Bopomofo spacing ...... 752, 753 Chinantec ...... 328 supplementary characters ...... 197 Chinese ...... 328 tag characters ...... 931–935 Tagalog ...... 702 Tai Le ...... 681 Thai ...... 655 Tagbanwa ...... 702 Vietnamese ...... 294 tags, language ...... 215, 931–935 Toto ...... 579 use strongly discouraged ...... 934 Tai Laing traditional Chinese ...... 734 traffic signs ...... 883 digits ...... 849 trailing surrogates Tai Le ...... 681–682 Tai Tham ...... 686–688 see low-surrogate code units transcoding ...... 196–197 digits ...... 849 tables ...... 196 Tai Viet ...... 689–691 Tai Xuan Jing symbols ...... 890 Transport and Map Symbols ...... 885 triangulation in transcoding ...... 196 Takri ...... 623–624 tries ...... 196 ...... 501–510 Tangsa ...... 580 truncation combining character sequences ...... 220–221 Tangut ...... 775–777 surrogates and ...... 204 code charts ...... 953 components ...... 776–777 Turkish ...... 296 case mapping of I ...... 238, 293 radicals ...... 776 cedilla ...... 293 tashkil ...... 374 tashkil, harakat, points ...... 376 lira sign ...... 837 two-stage tables ...... 197 TCHAR in Win32 API ...... 200 Technical Reports (UTR) ...... 965 Technical Standards (UTS) ...... xxiv, 965 U abstracts ...... 966 U+ notation ...... 960 technical symbols ...... 871–875 U+10FFFF (not a character code) ...... 924 ...... 511–513 U+FEFF (BOM) ...... 926–928 terminal emulation ...... 834 U+FFFE (not a character code) ...... 925 Tertiary Ideographic Plane (TIP) ...... 43, 51 U+FFFF (not a character code) ...... 924 text boundaries ...... 60, 190, 217–218, 228 UAX (Unicode Standard Annex) ...... xxiii, 965 see also UAX #14, Unicode Line Breaking Algo- as component of Unicode Standard ...... 77 rithm conformance ...... 83 see also UAX #29, Unicode Text Boundaries list of ...... 83 text elements ...... 6, 10, 217 UCA see Unicode Collation Algorithm and see also boundaries ...... 228 UTS #10, Unicode Collation Algorithm for sorting ...... 230 UCD see Unicode Character Database text processes ...... 6, 10–13 UCS (Universal Character Set) text rendering ...... 6, 10, 17 see ISO/IEC 10646 text selection, boundaries for ...... 217–218 UCS-2 ...... 978 Index 1017

UCS-4 ...... 978 endian ordering ...... 39 Ugaritic ...... 442 see also encoding schemes Ukrainian ...... 315 Unicode notation \u1234 ...... 960 unassigned code points ...... 30, 77, 201 Unicode scalar values defined as reserved code points ...... 91 definition ...... 118 handling ...... 72 Unicode security ...... 245 properties of ...... 95 see also UTS #39, Unicode Security Mechanisms semantics ...... 77 Unicode Standard see also reserved code points allocation of encoded characters ...... 43–51 underscores ...... 275 architecture ...... 10–13 undesignated code points ...... 30 areas ...... 44 Unicode 1.0 Name (informative property) ...... 188 benefits ...... 1 Unicode algorithms blocks ...... 257 and properties ...... 98 code charts ...... 937–955 conformance ...... 82 components ...... 77 definition ...... 91 conformance ...... 71–157 normative references to ...... 76, 82 conformance of ISO/IEC 10646 implementations. Unicode Bidirectional Algorithm ...... 21, 52 983 see also UAX #9, Unicode Bidirectional Algorithm corrections ...... 74 Unicode Character Database (UCD) . . xxii, 161, 967 definitions for conformance ...... 85–91 as component of Unicode Standard ...... 77 design goals ...... 4 changes ...... 72 design principles ...... 14–24 properties in ...... 45 errata ...... 74, 968 Unicode character encoding model ...... 33, 41 normative references to ...... 74, 82 see also UTR #17, Unicode Character Encoding number of characters ...... 3 Model number of code points ...... 1, 29 Unicode character literals script coverage ...... 3 code point notation U+ ...... 960 security issues ...... 245 Unicode codespace synchrony with ISO/IEC 10646 ...... 980 definition ...... 88 updates ...... 968 planes ...... 43 versions see versions of the Unicode Standard size ...... 1, 29 Unicode Standard Annexes (UAX) ...... xxiii, 965 Unicode Collation Algorithm (UCA) ...... 12 as components of Unicode Standard ...... 77 Unicode conferences ...... 966 conformance ...... 83 Unicode Consortium ...... 964 list of ...... 83 addresses ...... 968 Unicode string literals Consortium membership in standards bodies 964 code point notation \u1234 ...... 960 e-mail discussion list ...... 966 Unicode strings ...... 42 membership ...... 964 definition ...... 119 policies ...... 967 Unicode Technical Committee (UTC) ...... 964 website ...... 966 Unicode Technical Reports (UTR) ...... 965 Unicode data signature ...... 66, 926–928 Unicode Technical Standards (UTS) ...... xxiv, 965 Unicode data types ...... 199–200 abstracts ...... 966 for C ...... 199–200 UnicodeData.txt ...... 151, 166 Unicode encoding forms ...... 118–125 unification conformance ...... 34, 80 as Unicode design principle ...... 21 definition ...... 119 see also Han unification fixed-width (UTF-32) ...... 35, 122 Unified Repertoire and Ordering (URO) . . . 737, 989 signatures ...... 927, 928 see also Han unification variable-width ...... 36, 37, 123 Unihan Database ...... 161, 741, 742, 967, 990 see also encoding forms Unihan.zip ...... 100, 161 Unicode encoding schemes UnihanCore2020 ...... 732, 990 conformance ...... 129–132 unit separator (U+001F) ...... 901 definition ...... 129 Index 1018

Universal Character Set (UCS) compared to multibyte encodings ...... 37 see ISO/IEC 10646 encoding form (definition) ...... 123 universality encoding scheme ...... 39 as Unicode design principle ...... 14 encoding scheme (definition) ...... 129 in Unix ...... 18 newline function ...... 210 in UTF-16 order ...... 233 UTF-8 in ...... 18 non-shortest form is invalid ...... 123, 245 unsupported characters ...... 201 preferred encoding for Internet protocols . . . . 37 upadhmaniya ...... 482, 621 security and ...... 245 update version ...... 73 signature ...... 129, 132, 927 uppercase ...... 164, 236, 289 UTR (Unicode Technical Report) ...... 965 Uralic Phonetic Alphabet (UPA) ...... 279, 300 UTS (Unicode Technical Standard) ...... xxiv, 965 Urdu ...... 373 abstracts ...... 966 URO (Unified Repertoire and Ordering) . . 737, 989 Uyghur ...... 373, 595 see also Han unification UTF, Unicode Transformation Formats ...... 33, 119 as encoding form or scheme ...... 132 Vai ...... 795–796 binary comparison and sort order differences . . . valid (synonym for well-formed) ...... 121 231, ...... 233 variable-width Unicode encoding form . . .36, 37, 123 in APIs ...... 200 variants UTF-16 ...... 36, 123, 979 compatibility ...... 26 binary comparison and sort order caution . . . . 36 fullwidth and halfwidth ...... 287 bit distribution (table) ...... 123 mathematical symbols ...... 869 BOM in ...... 130, 926 small form ...... 287 encoding form (definition) ...... 123 standardized ...... 917 encoding scheme (definition) ...... 130 variation selectors ...... 194, 917 encoding schemes ...... 39 ideographic variation mark (U+303E) ...... 750 in ISO/IEC 10646 ...... 979 Mongolian free variation selectors ...... 554 in UTF-8 order ...... 234 variation sequences ...... 917 surrogates and string handling ...... 42, 203 for Phags-pa ...... 599–601 UTF-16BE (Big-endian) ...... 927 Version 14.0 ...... 77 encoding scheme ...... 40 number of characters ...... 3 encoding scheme (definition) ...... 129 versions of the Unicode Standard .xxiii, 72, 968, 985– UTF-16LE (Little-endian) ...... 927 986 encoding scheme ...... 40 backward compatibility ...... 72 encoding scheme (definition) ...... 129 compared to ISO/IEC 10646 editions ...... 985 UTF-32 ...... 35, 122 content ...... 73 BOM in ...... 131 interaction in implementations ...... 202 encoding form (definition) ...... 122 numbering ...... 73 encoding scheme (definition) ...... 131 property changes ...... 72 encoding schemes ...... 39 stability ...... 72 UTF-32BE (Big-endian) updates ...... 968 encoding scheme ...... 40 vertical tab (U+000B) ...... 209, 901 encoding scheme (definition) ...... 131 vertical text ...... 52, 264, 287 UTF-32LE (Little-endian) East Asian scripts ...... 726 encoding scheme ...... 40 Mongolian ...... 551 encoding scheme (definition) ...... 131 Vietnamese ...... 294, 301 UTF-8 ...... 37, 123, 979 ideographs ...... 726 ASCII transparency ...... 37 virama ...... 260, 455 binary comparison and sort order ...... 38 definition ...... 460 bit distribution (table) ...... 124 Kharoshthi ...... 591 BOM in ...... 129, 132, 927 Khmer ...... 672 byte ranges ...... 124 Myanmar ...... 663 Index 1019

Philippine scripts ...... 703 zero extension relation among encodings ...... 978 virama-like characters ...... 192 zero width joiner (U+200D) ...... 375–376, 906 visual order used for Thai and Lao ...... 21 zero width no-break space (U+FEFF) . . . . 66, 82, 903 Vithkuqi ...... 360 initial ...... 132, 927 vowel harmony zero width non-joiner (U+200C) . . . . . 375–376, 907 Mongolian ...... 555 zero width space (U+200B) ...... 904 vowel marks, Middle Eastern scripts ...... 366 for word breaks in South Asian scripts . 657, 665, vowel separator 680 Mongolian ...... 556 zero-width space characters ...... 904 vowel signs Znamenny Musical Notation ...... 825 Indic ...... 56, 459 ZWJ see zero width joiner (U+200D) Khmer ...... 674 ZWNBSP see zero width no-break space (U+FEFF) Philippine scripts ...... 702 ZWNJ see zero width non-joiner (U+200C) ZWSP see zero width space (U+200B) Wancho ...... 578 ...... 566 wchar_t in C language ...... 200 weak directional characters ...... 171 weather symbols ...... 882 website, Unicode Consortium ...... 966 Weierstrass elliptic function symbol ...... 840 well-formed definition ...... 120 Welsh ...... 296 Where Is My Character? ...... 968 wide characters data in C ...... 200 wiggly fence (U+29DB) ...... 867 Windows newline function ...... 210 word breaks ...... 219, 903–905 in South Asian scripts ...... 657, 665, 680 word joiner (U+2060) ...... 903 writing direction see directionality writing systems ...... 258–262 Wu (Shanghainese) ...... 735 Xibe ...... 551 Xishuangbanna Dai ...... 683 Yezidi ...... 414 Yi ...... 766–768 Yiddish ...... 367 Yijing Hexagram Symbols ...... 890 ypogegrammeni ...... 306 Zanabazar Square ...... 603–605 Zapf Dingbats ...... 885 Index 1020