Unicode Demystified: A Tutorial Introduction to the Unicode Standard
Richard Gillam
Senior Developer, Trilogy Software, Inc.

Shameless Plug
[Slide image not preserved]

The Code Page Problem
• Characters in most languages are traditionally represented by single-byte values
  – Allows for 256 characters max
  – Real limit for most encodings is 192 characters
  – This includes letters, digits, punctuation, symbols
• When a system is used for a new language, the encoding has to be adapted to use that language’s characters
• Encodings proliferate
  – Each language or group of languages gets its own encoding
  – Different vendors or standards committees devise different encodings, so generally each language has several, often incompatible, encodings

Multi-byte encodings
• Some languages (Chinese, Japanese, Korean, etc.) have more than 256 characters
• Encoding standards for these languages use sequences of bytes for many characters
  – In many standards, not all characters are the same number of bytes
  – Can’t tell whether a given byte is a whole character or part of a character
  – Corruption of one byte can corrupt the whole data stream

Interoperability problems
• Can’t easily mix languages in a document or system
• Data not tagged with encoding, so loss can occur when transferring between systems
• Most encodings are ASCII-based, so problems often not seen with English-only data
• Two possible solutions:
  – Systematic tagging of textual data with encoding ID
  – Universal encoding standard with all languages’ characters

Encoding space
• An ASCII character is 7 bits wide
• Most encodings press the eighth bit into service
• Early versions of Unicode used 16 bits
• Unicode now uses 21 bits
[Figure: the 21-bit value divided into plane number, row number, and character number]

Unicode
• 21-bit encoding space allows for 1,114,112 characters (17 planes of 65,536 code point values each)
• 95,156 code point values assigned to characters in Unicode 3.2
• 137,216 code point values set aside for application use
• 2,114 code point values set aside for non-character use
• 879,626 code point values reserved for future character assignments

The Unicode Encoding Space
[Figure: the 17 planes, numbered 0 to 10 (hexadecimal)]
• Plane 0: Basic Multilingual Plane
• Planes 1 to 10: Supplementary Planes
  – Plane 1: Supplementary Multilingual Plane
  – Plane 2: Supplementary Ideographic Plane
  – Plane E: Supplementary Special-Purpose Plane
  – Planes F and 10: Private Use Planes

The Basic Multilingual Plane
[Figure: layout of the BMP from 0000 to FFFF. The areas, in order: General Scripts Area, Symbols Area, CJK Punct., Han, Yi, Hangul, Surrogates Area, Private Use Area, Compatibility Area]

The General Scripts Area
[Figure: scripts by pair of rows]
00/01  Latin
02/03  IPA, Diacriticals, Greek
04/05  Cyrillic, Armenian, Hebrew
06/07  Arabic, Syriac, Thaana
08/09  Devanagari, Bengali
0A/0B  Gurmukhi, Gujarati, Oriya, Tamil
0C/0D  Telugu, Kannada, Malayalam, Sinhala
0E/0F  Thai, Lao, Tibetan
10/11  Myanmar, Georgian, Hangul
12/13  Ethiopic, Cherokee
14/15  Canadian Aboriginal Syllabics
16/17  Ogham, Runic, Philippine, Khmer
18/19  Mongolian
1A/1B  (empty)
1C/1D  (empty)
1E/1F  Latin, Greek

Unicode Coverage
• European scripts
  – Latin, Greek, Cyrillic, Armenian, Georgian, IPA
• Bidirectional (Middle Eastern) scripts
  – Hebrew, Arabic, Syriac, Thaana
• Indic (Indian and Southeast Asian) scripts
  – Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Khmer, Myanmar, Tibetan, Philippine
• East Asian scripts
  – Chinese (Han) characters, Japanese (Hiragana and Katakana), Korean (Hangul), Yi
• Other modern scripts
  – Mongolian, Ethiopic, Cherokee, Canadian Aboriginal
• Historical scripts
  – Runic, Ogham, Old Italic, Gothic, Deseret
• Punctuation and symbols
  – Numerals, math symbols, scientific symbols, arrows, blocks, geometric shapes, Braille, musical notation, etc.
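The plane/row/character split and the code point counts on the encoding-space slides can be checked with a little arithmetic. The following Python sketch is illustrative only: the names split_code_point and describe are hypothetical helpers, not part of any Unicode library, and it assumes the layout described above (17 planes, each with 256 rows of 256 characters).

```python
PLANE_COUNT = 17           # planes 0x0 through 0x10
PLANE_SIZE = 0x10000       # 65,536 code point values per plane
MAX_CODE_POINT = 0x10FFFF  # last value in plane 0x10


def split_code_point(cp):
    """Return (plane, row, character) numbers for a code point value."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    plane = cp >> 16           # top 5 bits
    row = (cp >> 8) & 0xFF     # next 8 bits
    character = cp & 0xFF      # low 8 bits
    return plane, row, character


def describe(cp):
    """Format the split in the hexadecimal style used on the slides."""
    plane, row, character = split_code_point(cp)
    return "U+%04X: plane %X, row %02X, character %02X" % (
        cp, plane, row, character)


if __name__ == "__main__":
    print(describe(0x00E9))    # a BMP character (plane 0)
    print(describe(0x1D11E))   # a supplementary character (plane 1)
    # The four categories on the "Unicode" slide account for the whole space.
    print(95156 + 137216 + 2114 + 879626 == PLANE_COUNT * PLANE_SIZE)
```

Shifting and masking is enough here because each field is a fixed-width bit range; only the plane number needs a range check, since 5 bits could express values above 10 (hex).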
Characters and Glyphs
[Figures not preserved]

Ligatures
[Figures not preserved; one example is the “fi” ligature]

Split Vowels
These two marks are parts of the same character. [Figure not preserved]

Character Positioning
[Figures not preserved]

Combining characters
One character… é
…or two? é

Actually, either. Unicode is generative, with accent marks represented with their own code point values…
  é = U+0065 (e) U+0301 (accent)
…but common combinations of letters and accents are also given their own code points for convenience.
  é = U+00E9

This can be tough, because the two representations are to be treated as absolutely identical.
  U+0065 U+0301 = U+00E9

Things can get really wild for characters with more than one accent mark:
  ộ = 006F (o) 0302 (circumflex) 0323 (dot)
    = 006F (o) 0323 (dot) 0302 (circumflex)
    = 00F4 (o-circumflex) 0323 (dot)
    = 1ECD (o-dot) 0302 (circumflex)
    = 1ED9 (o-circumflex-dot)

Unicode provides normalization rules to aid in comparison. These provide for a preferred (normalized) representation:
  Fully decomposed: 006F (o) 0323 (dot) 0302 (circumflex)
  Fully composed: 1ED9 (o-circumflex-dot)

Combining characters
• Certain characters are designated as combining characters
• Combining characters are grouped into classes by how they combine
• Many accented characters are represented as combining character sequences
• Composite characters with equivalent combining character sequences are said to decompose to the equivalent sequence
• The standard provides for four normalized forms to aid in comparison and processing
• The standard provides for a canonical ordering for multiple combining marks attached to the same character

Character semantics
• The Unicode standard includes an extensive database that specifies a large number of character properties, including:
  – Name
  – Type (e.g., letter, digit, punctuation mark)
  – Decomposition
  – Case and case mappings (for cased letters)
  – Numeric value (for digits and numerals)
  – Combining class (for combining characters)
  – Directionality
  – Line-breaking behavior
  – Cursive joining behavior
  – For Chinese characters, mappings to various other standards
  and many other properties

Storage formats
UTF-32: The 21-bit abstract Unicode value is simply zero-padded to 32 bits.
UTF-16: For characters in the BMP, the 21-bit value is simply truncated to 16 bits. For other characters, the 21-bit value is turned into a sequence of two 16-bit values called a surrogate pair. A particular numeric value is either a BMP character, a high surrogate, or a low surrogate.
UTF-8: For ASCII characters, the 21-bit value is truncated to 8 bits. For other characters, the 21-bit value is turned into a sequence of two, three, or four 8-bit values. Different numeric ranges are used for ASCII characters and leading and trailing bytes.
Different ranges are used for leading bytes of different-length sequences.

Serialization formats
• UTF-16 and UTF-32 can be written to a serial device in different byte orders. The standard provides three serialization formats for each of UTF-16 and UTF-32:
  – A big-endian version (UTF-16BE and UTF-32BE) where the most-significant byte is written first
  – A little-endian version (UTF-16LE and UTF-32LE) where the least-significant byte is written first
  – A self-describing version where the text is preceded by a byte order mark that the receiving process can use to determine endianness

The Unicode standard
• The Unicode standard consists of:
  – The standard text, published in book form (this includes a complete set of printed code charts)
  – The Unicode Character Database, a set of data files providing complete property information on every character
  – Various Web-published supplemental materials:
    • Unicode Standard Annexes (UAX): Amendments to the standard since the last book was published
    • Unicode Technical Standards (UTS): Allied standards maintained separately from Unicode itself
    • Unicode Technical Reports (UTR): Non-normative documents providing background info, implementation hints, or other useful information
    • Unicode Technical Notes (UTN): Other articles of interest

Dealing with Unicode
• The basic character and string classes in Windows 2000 and XP are Unicode-based, and Windows provides an extensive set of APIs for working with Unicode text
• The basic character and string classes in Java are also Unicode-based, and the Java Class Library also provides an extensive set of APIs for working with Unicode text
• Several third-party packages, including the open-source International Components for Unicode, are also available

For more information
• The published standard is available in bookstores
• Virtually everything related to the standard is available at http://www.unicode.org
• Two good books: Unicode Demystified by yours truly and Unicode: A Primer by Tony Graham
• Ask questions at unicode@unicode.org
• Contact me at rtgillam@concentric.net
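
As a recap of the combining-character and storage-format slides, the Python sketch below checks two of the claims made earlier: that the five spellings of ộ all normalize to one fully decomposed and one fully composed sequence, and that the UTF-16 and UTF-8 bit-packing can be reproduced by hand. The helpers to_utf16_units and to_utf8_bytes are hypothetical names written for this illustration, not part of any Unicode API; only unicodedata.normalize and str.encode come from Python's standard library.

```python
import unicodedata

# The five equivalent spellings of o-circumflex-dot from the slides.
SPELLINGS = [
    "\u006F\u0302\u0323",  # o, circumflex, dot
    "\u006F\u0323\u0302",  # o, dot, circumflex
    "\u00F4\u0323",        # o-circumflex, dot
    "\u1ECD\u0302",        # o-dot, circumflex
    "\u1ED9",              # o-circumflex-dot
]


def _check(cp):
    """Reject values that are not encodable characters."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("cannot encode U+%04X" % cp)


def to_utf16_units(cp):
    """Return the 16-bit code units for one code point."""
    _check(cp)
    if cp < 0x10000:
        return [cp]                      # BMP: one unit, value unchanged
    offset = cp - 0x10000                # 20 bits are left to split
    high = 0xD800 + (offset >> 10)       # high surrogate: top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # low surrogate: bottom 10 bits
    return [high, low]


def to_utf8_bytes(cp):
    """Return the UTF-8 byte sequence for one code point."""
    _check(cp)
    if cp < 0x80:                        # ASCII: a single byte
        return bytes([cp])
    if cp < 0x800:                       # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                     # three bytes: 1110xxxx + 2 trailing
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # four bytes: 11110xxx + 3 trailing
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


if __name__ == "__main__":
    for s in SPELLINGS:
        nfd = unicodedata.normalize("NFD", s)   # fully decomposed
        nfc = unicodedata.normalize("NFC", s)   # fully composed
        print([hex(ord(c)) for c in nfd], [hex(ord(c)) for c in nfc])
    print([hex(u) for u in to_utf16_units(0x1D11E)])
    print(to_utf8_bytes(0x1D11E).hex())
    # Same text, two byte orders.
    print("\u00E9".encode("utf-16-be"), "\u00E9".encode("utf-16-le"))
```

Comparing strings only after normalizing both sides to the same form is what makes the five spellings compare equal; comparing the raw code point sequences would treat them as five different strings.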