Unicode Demystified: A Tutorial Introduction to the Unicode Standard
Richard Gillam, Senior Developer, Trilogy Software, Inc.
Shameless Plug
The Code Page Problem
• Characters in most languages are traditionally represented by single-byte values
  – Allows for 256 characters max
  – Real limit for most encodings is 192 characters
  – This includes letters, digits, punctuation, symbols
• When a system is used for a new language, the encoding has to be adapted to use that language’s characters
• Encodings proliferate
  – Each language or group of languages gets its own encoding
  – Different vendors or standards committees devise different encodings, so generally each language has several, often incompatible, encodings
Multi-byte encodings
• Some languages (Chinese, Japanese, Korean, etc.) have more than 256 characters
• Encoding standards for these languages use sequences of bytes for many characters
  – In many standards, not all characters are the same number of bytes
  – Can’t tell whether a given byte is a whole character or part of a character
  – Corruption of one byte can corrupt the whole data stream
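The byte-ambiguity problem can be seen in a small Python sketch. Shift-JIS is used here purely as an illustration of a multi-byte encoding; the slides do not name a specific one, and the variable names are made up for the example.

```python
# Illustration only: Shift-JIS stands in for "a multi-byte encoding".
text = "ソ"                          # KATAKANA LETTER SO
encoded = text.encode("shift_jis")   # two bytes: 0x83 0x5C

# The second byte has the same value as an ASCII backslash, so code that
# scans bytes cannot tell a real backslash from half of a character.
contains_backslash_byte = b"\\" in encoded

# A lead byte on its own is not a character: decoding a truncated
# stream fails instead of producing text.
try:
    encoded[:1].decode("shift_jis")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

print(encoded.hex(), contains_backslash_byte, truncated_ok)
```

Variable-length encodings of this kind force every byte-level operation (searching, splitting, truncating) to know where character boundaries fall.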
Interoperability problems
• Can’t easily mix languages in a document or system
• Data not tagged with encoding, so loss can occur when transferring between systems
• Most encodings are ASCII-based, so problems often not seen with English-only data
• Two possible solutions:
  – Systematic tagging of textual data with encoding ID
  – Universal encoding standard with all languages’ characters
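A minimal Python sketch of the untagged-data problem: the same two bytes are decoded under three single-byte encodings. The three encodings are picked only as examples; any ASCII-based code pages would show the same effect.

```python
# Two bytes arriving with no encoding tag attached.
raw = bytes([0x41, 0xE9])

# Example encodings (an arbitrary choice for illustration).
encodings = ("latin-1", "cp1251", "iso-8859-7")
decoded = {name: raw.decode(name) for name in encodings}

for name, text in decoded.items():
    # The ASCII byte 0x41 is always "A"; the high byte 0xE9 is not stable.
    print(f"{name:11} {text}")
```

The ASCII letter survives every interpretation, which is why English-only data tends to hide the problem; the byte above 0x7F becomes a Latin, Cyrillic, or Greek letter depending on which encoding the receiver assumes.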
Encoding space
An ASCII character is 7 bits wide
Most encodings press the eighth bit into service
Early versions of Unicode used 16 bits
Unicode now uses 21 bits
[Figure: a 21-bit code point value divided into plane number, row number, and character number]
Unicode
• 21-bit encoding space allows for 1,114,112 characters
• 95,156 code point values assigned to characters in Unicode 3.2
• 137,216 code point values set aside for application use
• 2,114 code point values set aside for non-character use
• 879,626 code point values reserved for future character assignments
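The arithmetic behind these figures can be checked with a short Python sketch. The 8-bit row and 8-bit character fields below are an assumption about how the plane/row/character split is laid out, and `split_code_point` is a hypothetical helper name.

```python
MAX_CODE_POINT = 0x10FFFF   # highest code point in the encoding space
PLANE_SIZE = 0x10000        # 65,536 code points per plane

total = MAX_CODE_POINT + 1          # size of the encoding space
planes = total // PLANE_SIZE        # number of planes

def split_code_point(cp: int) -> tuple[int, int, int]:
    """Return (plane, row, character) for a code point.

    Assumes 8 bits each for row and character, with the remaining
    high bits giving the plane number.
    """
    return cp >> 16, (cp >> 8) & 0xFF, cp & 0xFF

# The four categories quoted on the slide (Unicode 3.2 figures).
categories = {
    "assigned": 95_156,
    "application use": 137_216,
    "non-character use": 2_114,
    "reserved": 879_626,
}

print(total, planes, sum(categories.values()) == total)
print(split_code_point(0x1D11E))
```

Note that 21 bits could address 2,097,152 values; the total is smaller because the space stops at plane 10 (hexadecimal), giving 17 planes.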
The Unicode Encoding Space
[Figure: the 17 planes of the encoding space, numbered 0 through 10 in hexadecimal]
• Plane 0: Basic Multilingual Plane
• Planes 1 through 10: Supplementary Planes
  – Plane 1: Supplementary Multilingual Plane
  – Plane 2: Supplementary Ideographic Plane
  – Plane E: Supplementary Special-Purpose Plane
  – Planes F and 10: Private Use Planes
The Basic Multilingual Plane
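[Figure: layout of the Basic Multilingual Plane by leading hex digit of the code point; area boundaries are approximate]
0–1  General Scripts Area
2    Symbols Area, CJK Punct.
3    CJK Punct.
4–9  Han
A    Yi
B–C  Hangul
D    Surrogates Area
E    Private Use Area
F    Compatibility Area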
The General Scripts Area
00/01  Latin
02/03  IPA, Diacriticals, Greek
04/05  Cyrillic, Armenian, Hebrew
06/07  Arabic, Syriac, Thaana
08/09  Devanagari, Bengali
0A/0B  Gurmukhi, Gujarati, Oriya, Tamil
0C/0D  Telugu, Kannada, Malayalam, Sinhala
0E/0F  Thai, Lao, Tibetan
10/11  Myanmar, Georgian, Hangul
12/13  Ethiopic, Cherokee
14/15  Canadian Aboriginal Syllabics
16/17  Ogham, Runic, Philippine, Khmer
18/19  Mongolian
1A/1B
1C/1D
1E/1F  Latin, Greek
Unicode Coverage
• European scripts
  – Latin, Greek, Cyrillic, Armenian, Georgian, IPA
• Bidirectional (Middle Eastern) scripts
  – Hebrew, Arabic, Syriac, Thaana
• Indic (Indian and Southeast Asian) scripts
  – Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Khmer, Myanmar, Tibetan, Philippine
• East Asian scripts
  – Chinese (Han) characters, Japanese (Hiragana and Katakana), Korean (Hangul), Yi
• Other modern scripts
  – Mongolian, Ethiopic, Cherokee, Canadian Aboriginal
• Historical scripts
  – Runic, Ogham, Old Italic, Gothic, Deseret
• Punctuation and symbols
  – Numerals, math symbols, scientific symbols, arrows, blocks, geometric shapes, Braille, musical notation, etc.
Characters and Glyphs
Ligatures
[Figures: ligature examples, including fi]
Split Vowels
These two marks are parts of the same character
Character Positioning
Combining characters
One character… é
…or two? é
Actually, either. Unicode is generative, with accent marks represented with their own code point values…
é = U+0065 (e) U+0301 (accent)
…but common combinations of letters and accents are also given their own code points for convenience.
é = U+00E9
This can be tough, because the two representations are to be treated as absolutely identical.
é = U+0065 U+0301 = U+00E9
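A short Python sketch of why this is tough in practice: a plain string comparison sees two different code point sequences, so equivalent text has to be normalized before it is compared. `canonically_equal` is a hypothetical helper written for this example, using the standard library's `unicodedata` module.

```python
import unicodedata

decomposed = "e\u0301"    # U+0065 followed by U+0301
precomposed = "\u00e9"    # U+00E9

# Code point by code point, the two strings are different...
raw_equal = decomposed == precomposed
lengths = (len(decomposed), len(precomposed))

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalized form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# ...but they represent the same text once normalized.
print(raw_equal, lengths, canonically_equal(decomposed, precomposed))
```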
Things can get really wild for characters with more than one accent mark:
ộ = 006F (o) 0302 (circumflex) 0323 (dot)
  = 006F (o) 0323 (dot) 0302 (circumflex)
  = 00F4 (o-circumflex) 0323 (dot)
  = 1ECD (o-dot) 0302 (circumflex)
  = 1ED9 (o-circumflex-dot)
Unicode provides normalization rules to aid in comparison. These provide for a preferred (normalized) representation:
ộ = 006F (o) 0302 (circumflex) 0323 (dot)
  = 006F (o) 0323 (dot) 0302 (circumflex)    [fully decomposed]
  = 00F4 (o-circumflex) 0323 (dot)
  = 1ECD (o-dot) 0302 (circumflex)
  = 1ED9 (o-circumflex-dot)    [fully composed]
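The five spellings can be run through Python's `unicodedata.normalize` to see them collapse to one preferred form each way. This is a sketch using Python's bundled Unicode data (NFD for fully decomposed, NFC for fully composed), not the tooling used to prepare the slides.

```python
import unicodedata

# The five equivalent spellings of o-circumflex-dot listed on the slide.
spellings = [
    "o\u0302\u0323",   # o, circumflex, dot
    "o\u0323\u0302",   # o, dot, circumflex
    "\u00f4\u0323",    # o-circumflex, dot
    "\u1ecd\u0302",    # o-dot, circumflex
    "\u1ed9",          # o-circumflex-dot
]

# Collect the distinct results; one element means all spellings agree.
fully_decomposed = {unicodedata.normalize("NFD", s) for s in spellings}
fully_composed = {unicodedata.normalize("NFC", s) for s in spellings}

for form in fully_decomposed | fully_composed:
    print(" ".join(f"{ord(ch):04X}" for ch in form))
```

The decomposed result puts the dot below before the circumflex, which is the canonical ordering of the two marks.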
Combining characters
• Certain characters are designated as combining characters
• Combining characters are grouped into classes by how they combine
• Many accented characters are represented as combining character sequences
• Composite characters with equivalent combining character sequences are said to decompose to the equivalent sequence
• The standard provides for four normalized forms to aid in comparison and processing
• The standard provides for a canonical ordering for multiple combining marks attached to the same character
Character semantics
• The Unicode standard includes an extensive database that specifies a large number of character properties, including:
  – Name
  – Type (e.g., letter, digit, punctuation mark)
  – Decomposition
  – Case and case mappings (for cased letters)
  – Numeric value (for digits and numerals)
  – Combining class (for combining characters)
  – Directionality
  – Line-breaking behavior
  – Cursive joining behavior
  – For Chinese characters, mappings to various other standards
  and many other properties
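Several of these properties can be queried from Python's `unicodedata` module, which exposes part of the Unicode Character Database. A minimal sketch; the sample characters are arbitrary, and the data comes from whatever Unicode version ships with the Python interpreter rather than Unicode 3.2.

```python
import unicodedata

e_acute = "\u00e9"
properties = {
    "name": unicodedata.name(e_acute),
    "type": unicodedata.category(e_acute),             # general category code
    "decomposition": unicodedata.decomposition(e_acute),
    "uppercase": e_acute.upper(),                      # case mapping
}

# Combining classes determine the canonical order of stacked marks:
# the lower class (dot below) sorts before the higher one (circumflex).
combining_classes = {
    "U+0302": unicodedata.combining("\u0302"),
    "U+0323": unicodedata.combining("\u0323"),
}

numeric_value = unicodedata.numeric("\u0665")      # ARABIC-INDIC DIGIT FIVE
direction = unicodedata.bidirectional("\u05d0")    # HEBREW LETTER ALEF

print(properties, combining_classes, numeric_value, direction)
```

Line-breaking and cursive-joining properties are not exposed by this module; they would need the database files themselves or a library such as ICU.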
Storage formats
UTF-32: The 21-bit abstract Unicode value is simply zero-padded to 32 bits:
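As a minimal Python sketch, zero-padding amounts to writing the code point as a 4-byte integer. The sample code point U+1D11E is an arbitrary choice, and big-endian order is assumed for display.

```python
cp = 0x1D11E   # an arbitrary sample code point outside the BMP

# UTF-32: the code point value itself, padded out to four bytes.
utf32 = cp.to_bytes(4, "big")

# Python's built-in codec produces the same bytes.
matches_codec = utf32 == chr(cp).encode("utf-32-be")

print(utf32.hex(), matches_codec)
```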
UTF-16: For characters in the BMP, the 21-bit value is simply truncated to 16 bits:
For other characters, the 21-bit value is turned into a sequence of two 16-bit values called a surrogate pair:
A particular numeric value is either a BMP character, a high surrogate, or a low surrogate.
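The surrogate-pair transformation can be sketched in a few lines of Python. `to_utf16_units` and `classify_unit` are hypothetical helper names for this example; the bit arithmetic follows the standard UTF-16 definition.

```python
def to_utf16_units(cp: int) -> list[int]:
    """Return the UTF-16 code units (16-bit values) for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x10000:
        return [cp]                      # BMP: one unit, value unchanged
    v = cp - 0x10000                     # 20 bits remain
    high = 0xD800 | (v >> 10)            # top 10 bits
    low = 0xDC00 | (v & 0x3FF)           # bottom 10 bits
    return [high, low]

def classify_unit(unit: int) -> str:
    """Each 16-bit value falls into exactly one of three ranges."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP character"

units = to_utf16_units(0x1D11E)
print([f"{u:04X}" for u in units], [classify_unit(u) for u in units])
```

Because the three ranges do not overlap, a process landing in the middle of a UTF-16 stream can tell from a single unit whether it is at a character boundary.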
UTF-8: For ASCII characters, the 21-bit value is truncated to 8 bits:
For other characters, the 21-bit value is turned into a sequence of two, three, or four 8-bit values:
Different numeric ranges are used for ASCII characters and leading and trailing bytes. Different ranges are used for leading bytes of different-length sequences.
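A hand-rolled Python sketch of the UTF-8 rules, with a classifier showing the separate byte ranges. `encode_utf8` and `classify_byte` are hypothetical names for this example; in real code the built-in `str.encode("utf-8")` would be used instead.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point as a one- to four-byte UTF-8 sequence."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:                                   # ASCII: one byte
        return bytes([cp])
    if cp < 0x800:                                  # two bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                                # three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                # four bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def classify_byte(b: int) -> str:
    """Name the range a byte falls in (by bit pattern only; a few values
    in the lead-byte ranges never occur in well-formed UTF-8)."""
    if b < 0x80:
        return "ascii"
    if b < 0xC0:
        return "trailing"
    if b < 0xE0:
        return "lead-2"
    if b < 0xF0:
        return "lead-3"
    if b < 0xF8:
        return "lead-4"
    return "invalid"

euro = encode_utf8(0x20AC)
print(euro.hex(), [classify_byte(b) for b in euro])
```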
Serialization formats
• UTF-16 and UTF-32 can be written to a serial device in different byte orders. The standard provides three serialization formats for UTF-16 and UTF-32:
  – A big-endian version (UTF-16BE and UTF-32BE) where the most-significant byte is written first
  – A little-endian version (UTF-16LE and UTF-32LE) where the least-significant byte is written first
  – A self-describing version where the text is preceded by a byte order mark that the receiving process can use to determine endianness
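A minimal Python sketch of the three UTF-16 serializations, using the standard `codecs` constants for the byte order mark. `sniff_utf16_byte_order` is a hypothetical helper and handles UTF-16 only; distinguishing UTF-32 would need an extra check, since the little-endian UTF-32 mark begins with the same two bytes.

```python
import codecs

def sniff_utf16_byte_order(data: bytes):
    """Return "big", "little", or None based on a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    return None

# The same letter in the two fixed byte orders.
big_endian = "A".encode("utf-16-be")       # most-significant byte first
little_endian = "A".encode("utf-16-le")    # least-significant byte first

# Self-describing streams: a byte order mark followed by the text.
marked_big = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
marked_little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# The generic "utf-16" decoder reads the mark and strips it.
print(sniff_utf16_byte_order(marked_big), marked_big.decode("utf-16"))
print(sniff_utf16_byte_order(marked_little), marked_little.decode("utf-16"))
```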
The Unicode standard
• The Unicode standard consists of:
  – The standard text, published in book form (this includes a complete set of printed code charts)
  – The Unicode Character Database, a set of data files providing complete property information on every character
  – Various Web-published supplemental materials:
    • Unicode Standard Annexes (UAX): Amendments to the standard since the last book was published
    • Unicode Technical Standards (UTS): Allied standards maintained separately from Unicode itself
    • Unicode Technical Reports (UTR): Non-normative documents providing background info, implementation hints, or other useful information
    • Unicode Technical Notes (UTN): Other articles of interest
Dealing with Unicode
• The basic character and string classes in Windows 2000 and XP are Unicode-based, and Windows provides an extensive set of APIs for working with Unicode text
• The basic character and string classes in Java are also Unicode-based, and the Java Class Library also provides an extensive set of APIs for working with Unicode text
• Several third-party packages, including the open-source International Components for Unicode, are also available
For more information
• The published standard is available in bookstores
• Virtually everything related to the standard is available at http://www.unicode.org
• Two good books: Unicode Demystified by yours truly and Unicode: A Primer by Tony Graham
• Ask questions at unicode@unicode.org
• Contact me at rtgillam@concentric.net