
Unicode Demystified: A Tutorial Introduction to the Unicode Standard


Richard Gillam, Senior Developer, Trilogy Software, Inc.

Shameless Plug

The Code Page Problem

• Characters in most languages are traditionally represented by single-byte values
– Allows for 256 characters max
– Real limit for most encodings is 192 characters
– This includes letters, digits, and symbols
• When a system is used for a new language, the encoding has to be adapted to use that language’s characters
• Encodings proliferate
– Each language or group of languages gets its own encoding
– Different vendors or standards committees devise different encodings, so generally each language has several, often incompatible, encodings

Multi-byte encodings

• Some languages (Chinese, Japanese, Korean, etc.) have more than 256 characters
• Encoding standards for these languages use sequences of bytes for many characters
– In many standards, not all characters are the same number of bytes
– Can’t tell whether a given byte is a whole character or only part of one
– Corruption of one byte can corrupt the whole data stream
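The "whole or part of a character" problem can be seen with Shift-JIS, one of the multi-byte encodings used for Japanese. This is a small sketch rather than anything from the slides; the choice of Shift-JIS and of the letter "ソ" is an assumption made for illustration.

```python
# The katakana letter "ソ" (U+30BD) takes two bytes in Shift-JIS.
encoded = "ソ".encode("shift_jis")
lead, trail = encoded[0], encoded[1]

# The trailing byte has the same numeric value as the ASCII backslash,
# so the byte alone does not say whether it is a character or half of one.
trail_is_backslash = trail == ord("\\")

# A naive byte-level search finds a backslash that is not in the text.
naive_hit = b"\\" in encoded
real_hit = "\\" in encoded.decode("shift_jis")

print(encoded.hex(), trail_is_backslash, naive_hit, real_hit)
```

Code that scans such data byte by byte has to track where each character starts, which is why a single damaged byte can throw off everything after it.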

Interoperability problems

• Can’t easily mix languages in a document or system
• Data not tagged with encoding, so loss can occur when transferring between systems
• Most encodings are ASCII-based, so problems often not seen with English-only data
• Two possible solutions:
– Systematic tagging of textual data with encoding ID
– Universal encoding standard with all languages’ characters
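A short sketch of what untagged data looks like in practice: the same four bytes read under three single-byte encodings. The three code pages chosen here (Latin-1, Windows-1251, and IBM code page 437) are assumptions picked for illustration, not encodings named in the slides.

```python
raw = b"caf\xe9"  # four bytes with no encoding tag attached

as_latin1 = raw.decode("latin-1")  # Western European reading
as_cp1251 = raw.decode("cp1251")   # Cyrillic reading
as_cp437 = raw.decode("cp437")     # original IBM PC reading

# The ASCII part survives everywhere; only the byte above 0x7F changes.
for label, text in [("latin-1", as_latin1), ("cp1251", as_cp1251), ("cp437", as_cp437)]:
    print(label, text)
```

The first three letters are identical in every reading, which matches the point above that English-only data tends to hide the problem.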

Encoding space

An ASCII character is 7 bits wide

Most encodings press the eighth bit into service

Early versions of Unicode used 16 bits

Unicode now uses 21 bits

A code point value breaks down into a plane number, a row number, and a character number.
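The three fields can be pulled out of a code point value with shifts and masks. This is a minimal sketch, assuming the usual split of 5 bits for the plane and 8 bits each for the row and the character number; the function name `split_code_point` is made up for illustration.

```python
def split_code_point(cp):
    """Return (plane, row, character number) for a code point value."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode encoding space")
    plane = cp >> 16          # top 5 bits: 0x00 through 0x10
    row = (cp >> 8) & 0xFF    # next 8 bits
    cell = cp & 0xFF          # low 8 bits
    return plane, row, cell


# U+00E9 sits in plane 0; U+1D11E (a musical symbol) sits in plane 1.
print(split_code_point(0x00E9))
print(split_code_point(0x1D11E))
```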

Unicode

• 21-bit encoding space allows for 1,114,112 characters
• 95,156 values assigned to characters in Unicode 3.2
• 137,216 code point values set aside for application use
• 2,114 code point values set aside for noncharacter use
• 879,626 code point values reserved for future character assignments
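The arithmetic behind these figures can be checked in a few lines. Note that 1,114,112 is not 2 to the 21st power: it is 17 planes of 65,536 values each, which needs 21 bits to address. The dictionary keys below are labels invented for this sketch; the numbers are the ones quoted above.

```python
TOTAL = 0x10FFFF + 1        # highest code point value, plus one
PLANES = TOTAL // 0x10000   # each plane holds 65,536 values

# Figures quoted above for Unicode 3.2.
breakdown = {
    "assigned": 95_156,
    "application_use": 137_216,
    "noncharacter": 2_114,
    "reserved": 879_626,
}
accounted_for = sum(breakdown.values())

print(TOTAL, PLANES, accounted_for)
```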

The Unicode Encoding Space

The encoding space is divided into 17 planes, numbered 0 through 10 in hexadecimal:
• Plane 0: Basic Multilingual Plane
• Planes 1 through 10: Supplementary Planes
– Plane 1: Supplementary Multilingual Plane
– Plane 2: Supplementary Ideographic Plane
– Plane E: Supplementary Special-Purpose Plane
– Planes F and 10: Private Use Planes

The Basic Multilingual Plane

Areas of the BMP, by the first hexadecimal digit of the code point value:
0–1 General Scripts Area
2 Symbols Area
2–3 CJK Punct.
4–9 Han
A Yi
D Surrogates Area
E Private Use Area
F Compatibility Area

The General Scripts Area

Scripts by the first two hexadecimal digits of the code point value:
00/01 Latin
02/03 IPA, Diacriticals, Greek
04/05 Cyrillic, Armenian, Hebrew
06/07 Arabic, Syriac
08/09 Bengali
0A/0B Gujarati, Oriya, Tamil
0C/0D Telugu, Kannada, Malayalam, Sinhala
0E/0F Thai, Lao, Tibetan
10/11 Myanmar, Georgian, Hangul
12/13 Ethiopic, Cherokee
14/15 Canadian Aboriginal Syllabics
16/17 Ogham, Runic, Philippine, Khmer
18/19 Mongolian
1A/1B
1C/1D
1E/1F Latin, Greek

Unicode Coverage

• European scripts
– Latin, Greek, Cyrillic, Armenian, Georgian, IPA
• Bidirectional (Middle Eastern) scripts
– Hebrew, Arabic, Syriac, Thaana
• Indic (Indian and Southeast Asian) scripts
– Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Khmer, Myanmar, Tibetan, Philippine
• East Asian scripts
– Chinese (Han) characters, Japanese (Hiragana and Katakana), Korean (Hangul), Yi
• Other modern scripts
– Mongolian, Ethiopic, Cherokee, Canadian Aboriginal
• Historical scripts
– Runic, Ogham, Old Italic, Gothic, Deseret
• Punctuation and symbols
– Numerals, math symbols, scientific symbols, arrows, blocks, geometric shapes, musical notation, etc.

Characters and Glyphs

Ligatures fi

Split Vowels

These two marks are parts of the same character

Character Positioning

Combining characters

One character… é

…or two? é

Actually, either. Unicode is generative, with accent marks represented with their own code point values…

é = U+0065 (e) + U+0301 (accent)

…but common combinations of letters and accents are also given their own code points for convenience.

é = U+00E9

This can be tough, because the two representations are to be treated as absolutely identical.

é = U+0065 U+0301 = U+00E9
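A quick sketch of why this is tough: a plain string comparison sees two different code point sequences, and only a normalizing comparison treats them as the same text. Python's standard `unicodedata` module is used here as one convenient way to show it; the variable names are illustrative.

```python
import unicodedata

decomposed = "e\u0301"   # U+0065 followed by U+0301
precomposed = "\u00e9"   # U+00E9

# Code point by code point, the two strings differ.
raw_equal = decomposed == precomposed

# After normalizing both to the same form, they compare equal.
normalized_equal = (
    unicodedata.normalize("NFC", decomposed)
    == unicodedata.normalize("NFC", precomposed)
)

print(len(decomposed), len(precomposed), raw_equal, normalized_equal)
```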


Things can get really wild for characters with more than one accent mark:

ộ = 006F (o) 0302 (circumflex) 0323 (dot)
ộ = 006F (o) 0323 (dot) 0302 (circumflex)
ộ = 00F4 (o-circumflex) 0323 (dot)
ộ = 1ECD (o-dot) 0302 (circumflex)
ộ = 1ED9 (o-circumflex-dot)

Unicode provides normalization rules to aid in comparison. These provide for a preferred (normalized) representation:

ộ = 006F (o) 0302 (circumflex) 0323 (dot)
ộ = 006F (o) 0323 (dot) 0302 (circumflex) (fully decomposed)
ộ = 00F4 (o-circumflex) 0323 (dot)
ộ = 1ECD (o-dot) 0302 (circumflex)
ộ = 1ED9 (o-circumflex-dot) (fully composed)
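The five spellings can be run through a normalizer to confirm that they collapse to a single fully decomposed form and a single fully composed form. This sketch uses Python's `unicodedata.normalize`; the list name `spellings` is illustrative.

```python
import unicodedata

# The five representations listed above, written as escapes.
spellings = [
    "o\u0302\u0323",   # o, circumflex, dot
    "o\u0323\u0302",   # o, dot, circumflex
    "\u00f4\u0323",    # o-circumflex, dot
    "\u1ecd\u0302",    # o-dot, circumflex
    "\u1ed9",          # o-circumflex-dot
]

# Sets drop duplicates, so one element means every spelling agreed.
decomposed = {unicodedata.normalize("NFD", s) for s in spellings}
composed = {unicodedata.normalize("NFC", s) for s in spellings}

print([f"{ord(c):04X}" for c in next(iter(decomposed))])
print([f"{ord(c):04X}" for c in next(iter(composed))])
```

In the decomposed result the dot comes before the circumflex, which is the canonical ordering at work.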

• Certain characters are designated as combining characters
• Combining characters are grouped into classes by how they combine
• Many accented characters are represented as combining character sequences
• Composite characters with equivalent combining character sequences are said to decompose to the equivalent sequence
• The standard provides for four normalized forms to aid in comparison and processing
• The standard provides for a canonical ordering for multiple combining marks attached to the same character
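Combining classes, decompositions, and the four normalized forms can all be inspected from Python's `unicodedata` module. This is a sketch under the assumption that the four forms meant here are the ones commonly called NFC, NFD, NFKC, and NFKD; the sample string (the "fi" ligature followed by é) is chosen for illustration.

```python
import unicodedata

# Combining classes: base letters are class 0, marks get nonzero classes.
base_class = unicodedata.combining("e")
acute_class = unicodedata.combining("\u0301")      # mark above the letter
dot_below_class = unicodedata.combining("\u0323")  # mark below the letter

# A composite character decomposes to its equivalent sequence.
mapping = unicodedata.decomposition("\u00e9")

# The four normalized forms applied to the same two-character string.
sample = "\ufb01\u00e9"
forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

print(base_class, acute_class, dot_below_class, mapping)
print({name: len(text) for name, text in forms.items()})
```

The lower class number for the dot below is what makes it sort ahead of marks above the letter in canonical ordering.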

Character semantics

• The Unicode standard includes an extensive database that specifies a large number of character properties, including:
– Name
– Type (e.g., letter, digit, punctuation mark)
– Decomposition
– Case and case mappings (for cased letters)
– Numeric value (for digits and numerals)
– Combining class (for combining characters)
– Directionality
– Line-breaking behavior
– Cursive joining behavior
– For Chinese characters, mappings to various other standards
– …and many other properties
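Several of these properties can be looked up from Python, whose `unicodedata` module is built from the Unicode Character Database. The helper `describe` below is a hypothetical name for this sketch, and it covers only the properties that module exposes.

```python
import unicodedata


def describe(ch):
    """Collect a few database properties for one character."""
    return {
        "name": unicodedata.name(ch),
        "type": unicodedata.category(ch),            # e.g. Ll, Nd, Lo
        "decomposition": unicodedata.decomposition(ch),
        "uppercase": ch.upper(),                     # case mapping
        "numeric": unicodedata.numeric(ch, None),    # None if not a number
        "combining_class": unicodedata.combining(ch),
        "directionality": unicodedata.bidirectional(ch),
    }


for ch in ("\u00e9", "\u0663", "\u05d0"):
    print(describe(ch))
```

The three samples are a Latin letter, an Arabic-Indic digit, and a Hebrew letter, chosen so that the type, numeric value, and directionality fields each show something different.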

Storage formats

UTF-32: The 21-bit abstract Unicode value is simply zero-padded to 32 bits.
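Zero-padding can be shown directly: the code point value written as a four-byte integer is the UTF-32 form. The sketch below uses big-endian byte order and U+1D11E as an arbitrary sample; both choices are assumptions made for illustration.

```python
cp = 0x1D11E  # a code point outside the BMP

# Pad the value out to 32 bits (four bytes), most significant byte first.
padded = cp.to_bytes(4, "big")

# Python's own UTF-32 encoder should produce the same bytes.
from_codec = chr(cp).encode("utf-32-be")

print(padded.hex(), padded == from_codec)
```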

UTF-16: For characters in the BMP, the 21-bit value is simply truncated to 16 bits.

For other characters, the 21-bit value is turned into a sequence of two 16-bit values called a surrogate pair.

A particular numeric value is either a BMP character, a high surrogate, or a low surrogate.
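The surrogate-pair arithmetic can be sketched as follows. The function names are made up for illustration, and the sketch does not reject input that falls in the surrogate range itself, which a real encoder would.

```python
def to_utf16_units(cp):
    """Return the 16-bit code units for one code point value."""
    if cp < 0x10000:
        return [cp]                      # BMP: the value fits as it is
    offset = cp - 0x10000                # leaves a 20-bit number
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]


def classify_unit(unit):
    """Say which of the three kinds a 16-bit value is."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP character"


units = to_utf16_units(0x1D11E)
print([f"{u:04X}" for u in units], [classify_unit(u) for u in units])
```

Because the three ranges never overlap, a program can tell from any single 16-bit value whether it is looking at a whole character, the first half of a pair, or the second half.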

UTF-8: For ASCII characters, the 21-bit value is truncated to 8 bits.

For other characters, the 21-bit value is turned into a sequence of two, three, or four 8-bit values.

Different numeric ranges are used for ASCII characters and leading and trailing bytes. Different ranges are used for leading bytes of different-length sequences.
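A hand-rolled encoder makes the byte ranges visible. This is a sketch, not a validating implementation: it does not reject surrogate values, and the function names are invented for illustration.

```python
def to_utf8(cp):
    """Encode one code point value as UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def byte_role(b):
    """Classify a byte by numeric range alone."""
    if b < 0x80:
        return "ASCII"
    if b < 0xC0:
        return "trailing byte"
    if b < 0xE0:
        return "leading byte of 2"
    if b < 0xF0:
        return "leading byte of 3"
    if b < 0xF8:
        return "leading byte of 4"
    return "invalid"


for cp in (0x41, 0xE9, 0x20AC, 0x1D11E):
    encoded = to_utf8(cp)
    print(f"U+{cp:04X}", encoded.hex(), [byte_role(b) for b in encoded])
```

Since every byte announces its own role, a reader that lands in the middle of a stream can skip trailing bytes until it reaches the start of the next character.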

Serialization formats

• UTF-16 and UTF-32 can be written to a serial device in different byte orders. The standard provides three serialization formats for UTF-16 and UTF-32:
– A big-endian version (UTF-16BE and UTF-32BE) where the most-significant byte is written first
– A little-endian version (UTF-16LE and UTF-32LE) where the least-significant byte is written first
– A self-describing version where the text is preceded by a byte order mark that the receiving process can use to determine endianness
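The three UTF-16 serializations can be compared side by side. In this sketch the byte order mark constants come from Python's `codecs` module, and `read_utf16` is a hypothetical helper showing how a receiver might use the mark.

```python
import codecs

text = "A\u00e9"

big = text.encode("utf-16-be")      # most significant byte first
little = text.encode("utf-16-le")   # least significant byte first
marked = codecs.BOM_UTF16_BE + big  # self-describing: mark, then text


def read_utf16(data):
    """Decode UTF-16 bytes that start with a byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    raise ValueError("no byte order mark found")


print(big.hex(), little.hex(), marked.hex())
print(read_utf16(marked), read_utf16(codecs.BOM_UTF16_LE + little))
```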

The Unicode standard

• The Unicode standard consists of:
– The standard text, published in book form (this includes a complete set of printed code charts)
– The Unicode Character Database, a set of data files providing complete property information on every character
– Various Web-published supplemental materials:
• Unicode Standard Annexes (UAX): Amendments to the standard since the last book was published
• Unicode Technical Standards (UTS): Allied standards maintained separately from Unicode itself
• Unicode Technical Reports (UTR): Non-normative documents providing background info, implementation hints, or other useful information
• Unicode Technical Notes (UTN): Other articles of interest

Dealing with Unicode

• The basic character and string classes in Windows 2000 and XP are Unicode-based, and Windows provides an extensive set of APIs for working with Unicode text
• The basic character and string classes in Java are also Unicode-based, and the Java Class Library also provides an extensive set of APIs for working with Unicode text
• Several third-party packages, including the open-source International Components for Unicode, are also available

For more information

• The published standard is available in bookstores
• Virtually everything related to the standard is available at http://www.unicode.org
• Two good books: Unicode Demystified by yours truly and Unicode: A Primer by Tony Graham
• Ask questions at unicode@unicode.org
• Contact me at rtgillam@concentric.net
