Introduction to Unicode and Beyond
Mike McKenna, mimckenna(at)paypal.com
Craig Cummings, i18ncraig(at)gmail.com
Tex Texin, textexin(at)xencraft.com
v.1.4, November 2016
Yahoo! Confidential

Agenda
• Brief Background
• Characters
  – Structure
  – Properties
  – Encoding
• Key Specifications for Internationalization
• Unicode in the Real World
• How Unicode is Evolving

Unicode Tutorial, Internationalization and Unicode Conference 40 – Santa Clara – November 2016

Unicode in Brief

Unicode - Summary
• Universal Character Set replaces
  – ASCII
  – 8-bit
  – Double-byte and some multibyte
• Encompasses over 240 known coded character sets in use today
• Covers virtually all modern business languages
• Additional archaic and academic scripts
• Integrated with ISO 10646 as BMP
• Information: http://www.unicode.org

History: Character Sets
• Evolved when and where the need arose
• Developed for stand-alone applications
• Many developed their own
• Example: 3270 nightmare
  – A different encoding for each European country!
• Now:
  – Over 250 character sets in use around the world
  – Conversion? Any to any: n(n-1) > 62,000 converters

Solution: Unicode
• ISO 10646 - draft
  – Make everyone happy?
• Each standard with own 16-bit plane
  – Big mess!
    Not feasible
• Joe Becker/Xerox and XCCS
• Project began 1988
• Industry Consortium formed 1991
  – Xerox and Apple initially
• ISO 10646 merged with Unicode 1992

Some terminology …
• A “glyph” is a single visual unit of text.
• A “character” is a single logical unit of text.
• A “code point” is an integer assigned to a character.
• A “character set” is an organized collection of characters with code points.
• A “character encoding” is a mapping from a sequence of code points (characters) to a sequence of code units.
• A “code unit” is a single logical unit of storage (like a byte, wchar_t, int16_t, etc.)
Example: À is the character U+00C0, stored in UTF-8 as the code units 0xC3 0x80.

Unicode (ISO 10646)
• Code space of U+0000 through U+10FFFF (about 1.1 million code points)
• Unicode and ISO 10646 are maintained in sync.
• Unicode is maintained by an industry consortium
• ISO 10646 is maintained by the ISO

Structure of Unicode

Structure of Unicode
• Unicode is divided into equal-sized regions of code points.
• 17 planes (0 through 0x10), each with 65,536 code points.
• Plane 0 is called the Basic Multilingual Plane (BMP).
• > 99% of text in the wild lives in the BMP
• Planes 1 through 0x10 are called supplementary planes.

Unicode Allocation - BMP
[Figure: BMP allocation map by high byte, 00 through FE. Labeled regions: General Scripts, Symbols, CJK Misc., CJK Unified Ideographs, Hangul, Surrogates, Private Use, Compatibility Zone]

Unicode Allocation - BMP
[Figure: BMP allocation chart]

Unicode Allocation – Plane 1
[Figure: Plane 1 allocation chart]

Supplementary Planes
Contain additional, rarer characters
  – Plane 1: Rare, historical, or extinct scripts: Cuneiform, Gothic, etc.
  – Plane 2: less common Chinese/Japanese/Korean ideographs
  – 1024 x 1024 extra code points beyond BMP

Extensible via Surrogates
• 1024 * 1024 extra code points
• Rare characters in future use
• Structure:
  – high-surrogate U+D800 - U+DBFF
  – low-surrogate U+DC00 - U+DFFF
  – surrogate pair represents a single character
• Scalar value: 0 - 10FFFF (hexadecimal)
  – N = U (no surrogates)
  – N = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000

Unicode as the Universal Character Set
• An organized collection of characters.
• Each character has a code point aka Unicode Scalar Value (USV)
• U+0041 (hex notation)

Organized into blocks by script
[Figure: chart of Unicode blocks organized by script]

Compatibility Characters
Many characters were included in Unicode for round-trip conversion compatibility with legacy encodings:
①②③45Ⅵ ¾Lj¼Nj½dž ︴︷︻︽﹁﹄ ヲィゥォェュ゙ ﺲ ﺳ ھ ض ﵬ ﷺ fi fl ffi ffl ſt ﬔ
(groups labeled on the slide: Compatibility Characters; Presentation Forms)
legacy encoding: a term for non-Unicode character encodings.

Coverage – Scripts
• Arabic • Armenian • Bengali • Bopomofo • CJK Unified Ideographs
• Combining Diacritical Marks and symbols • Currency Symbols • Cyrillic
• Devanagari • Dingbats • Punctuation and Symbols
• Georgian • Greek • Gujarati • Hangul Jamo • Hangul Syllables • Hebrew
• Hiragana • IPA Extensions • Kanbun • Kannada • Katakana • Lao • Latin
• Malayalam • Mathematical Operators • Oriya • Private Use
• Spacing Modifier Letters • Superscripts and Subscripts • Surrogates
• Tamil • Telugu • Thai • Tibetan

Unicode Encodings (and the Web)

Unicode Encodings
• UTF-32
  – Uses 32-bit code units.
  – All characters are the same width.
• UTF-16
  – Uses 16-bit code units.
  – BMP characters use one 16-bit code unit.
  – Supplementary characters use two special 16-bit code units: a “surrogate pair”.
• UTF-8
  – Uses 8-bit code units (bytes!)
  – It’s a multi-byte encoding!
  – Characters use between 1 and 4 bytes.
  – ASCII is ASCII in UTF-8

The Replacement Character U+FFFD
• Indicates a bad byte sequence or a character that could not be converted.
  – � “there was a character here, but now it is gone”
• Equivalent to “question marks” in legacy encoding conversions

Special Noncharacter Values
• U+FFFE
  – Mirror of U+FEFF, Byte Order Mark (BOM)
  – Strong hint that text may be byte-reversed
• U+FFFF
  – No requirement to interpret
  – Should recognize as a non-character value

Byte Order Mark (BOM) U+FEFF
• Originally to indicate the “byte-order” of UTF-16 code units
  – 0xFE FF (UTF-16BE)
  – 0xFF FE (UTF-16LE)
• Also used as a Unicode signature by some software (Microsoft) for UTF-8
  – 0xEF BB BF (XML parsers don’t like this)

UTF-16
• Uses 16-bit code units (instead of the more-familiar 8-bit code unit, aka the “byte”)
• BMP characters use one unit
• Supplementary characters use a “surrogate pair”
  – special code points that don’t do anything else.
• Used for uncommon range of U+10000 to U+10FFFF
• Expands UTF-16 support to 1,048,576 supplementary code points
Example: 0x1251 is a single code unit; 0xD800 0xDF38 is a surrogate pair (high surrogate, then low surrogate).
High surrogates (0xD800-DBFF) and low surrogates (0xDC00-DFFF) occupy unique ranges!
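The surrogate arithmetic on the slides can be sketched in a few lines of Python. This is only an illustration, not code from the tutorial: the function names (to_surrogate_pair, from_surrogate_pair, utf16_code_units) are made up for this example, and it checks the slide's sample pair 0xD800 0xDF38 against the formula N = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000.

```python
# Hypothetical helper names; the constants and formula come from the slides.
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high-surrogate range
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low-surrogate range


def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = HIGH_START + (offset >> 10)   # top 10 bits
    low = LOW_START + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a surrogate pair: N = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a valid surrogate pair")
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000


def utf16_code_units(text):
    """Return the 16-bit code units of text, using Python's built-in UTF-16BE codec."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


if __name__ == "__main__":
    n = from_surrogate_pair(0xD800, 0xDF38)
    print(f"U+{n:04X}")
    print([hex(u) for u in utf16_code_units("\u1251" + chr(n))])
```

Running the script prints the scalar value for the slide's pair and the code units for a two-character string mixing one BMP character with one supplementary character, which makes the "one unit versus two units" distinction visible.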
Advantages and Disadvantages of UTF-16
Advantages:
• Most common languages and scripts are encoded in the BMP.
  – Less wasteful than UTF-32
  – Simpler to process (excepting surrogates)
  – Commonly supported in major operating environments, programming languages, and libraries
Disadvantages:
• May not be suitable for all applications
  – Affected by processor architecture (big-endian vs. little-endian)
  – Requires different data types in some languages (C, C++)
  – Requires more storage, on average, for Western European scripts, ASCII, HTML/XML markup.

UTF-8
• 7-bit