Unicode Demystified: A Tutorial Introduction to the Unicode Standard

Richard Gillam
Senior Developer
Trilogy Software, Inc.

Shameless Plug
[slide image not recoverable]

The Code Page Problem
• Characters in most languages are traditionally represented by single-byte values
  – Allows for 256 characters max
  – Real limit for most encodings is 192 characters
  – This includes letters, digits, punctuation, symbols
• When a system is used for a new language, the encoding has to be adapted to use that language’s characters
• Encodings proliferate
  – Each language or group of languages gets its own encoding
  – Different vendors or standards committees devise different encodings, so generally each language has several, often incompatible, encodings

Multi-byte encodings
• Some languages (Chinese, Japanese, Korean, etc.) have more than 256 characters
• Encoding standards for these languages use sequences of bytes for many characters
  – In many standards, not all characters are the same number of bytes
  – Can’t tell whether a given byte is a whole character or part of a character
  – Corruption of one byte can corrupt the whole data stream

Interoperability problems
• Can’t easily mix languages in a document or system
• Data not tagged with encoding, so loss can occur when transferring between systems
• Most encodings are ASCII-based, so problems often not seen with English-only data
• Two possible solutions:
  – Systematic tagging of textual data with encoding ID
  – Universal encoding standard with all languages’ characters

Encoding space
• An ASCII character is 7 bits wide
• Most encodings press the eighth bit into service
• Early versions of Unicode used 16 bits
• Unicode now uses 21 bits
• The 21-bit value divides into a plane number, a row number, and a character number

Unicode
• The 21-bit encoding space (U+0000 to U+10FFFF, 17 planes of 65,536 code points each) allows for 1,114,112 characters
• 95,156 code point values assigned to characters in Unicode 3.2
• 137,216 code point values set aside for application use
• 2,114 code point values set aside for non-character use
• 879,626 code point values reserved for future character assignments

The Unicode Encoding Space
Planes are numbered 0 through 10 (hexadecimal):
• Plane 0: Basic Multilingual Plane
• Planes 1 through 10: Supplementary Planes
  – Plane 1: Supplementary Multilingual Plane
  – Plane 2: Supplementary Ideographic Plane
  – Plane E: Supplementary Special-Purpose Plane
  – Planes F and 10: Private Use Planes

The Basic Multilingual Plane
By leading hex digit of the code point (area boundaries are approximate):
  0–1   General Scripts Area
  2     Symbols Area
  3     CJK Punct.
  4–9   Han
  A     Yi
  B–C   Hangul
  D     Surrogates Area
  E     Private Use Area
  F     Compatibility Area

The General Scripts Area
  00/01   Latin
  02/03   IPA, Diacriticals, Greek
  04/05   Cyrillic, Armenian, Hebrew
  06/07   Arabic, Syriac, Thaana
  08/09   Devanagari, Bengali
  0A/0B   Gurmukhi, Gujarati, Oriya, Tamil
  0C/0D   Telugu, Kannada, Malayalam, Sinhala
  0E/0F   Thai, Lao, Tibetan
  10/11   Myanmar, Georgian, Hangul
  12/13   Ethiopic, Cherokee
  14/15   Canadian Aboriginal Syllabics
  16/17   Ogham, Runic, Philippine, Khmer
  18/19   Mongolian
  1A/1B
  1C/1D
  1E/1F   Latin, Greek

Unicode Coverage
• European scripts
  – Latin, Greek, Cyrillic, Armenian, Georgian, IPA
• Bidirectional (Middle Eastern) scripts
  – Hebrew, Arabic, Syriac, Thaana
• Indic (Indian and Southeast Asian) scripts
  – Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Khmer, Myanmar, Tibetan, Philippine
• East Asian scripts
  – Chinese (Han) characters, Japanese (Hiragana and Katakana), Korean (Hangul), Yi
• Other modern scripts
  – Mongolian, Ethiopic, Cherokee, Canadian Aboriginal
• Historical scripts
  – Runic, Ogham, Old Italic, Gothic, Deseret
• Punctuation and symbols
  – Numerals, math symbols, scientific symbols, arrows, blocks, geometric shapes, Braille, musical notation, etc.
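As a rough illustration of the plane/row/character breakdown on the encoding-space slides, the sketch below splits a code point into those three fields and checks the code point arithmetic quoted above. The function names, the constants, and the exact bit widths (5 bits of plane, 8 of row, 8 of character number) are assumptions made for this example; the slides give only the three field labels. Planes the slides leave unlabeled are reported with a placeholder string.

```python
# Illustrative sketch; names and field widths are assumptions, not from the slides.

PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 0x11     # planes 0 through 0x10

# Plane labels as they appear on the encoding-space slides.
PLANE_NAMES = {
    0x0: "Basic Multilingual Plane",
    0x1: "Supplementary Multilingual Plane",
    0x2: "Supplementary Ideographic Plane",
    0xE: "Supplementary Special-Purpose Plane",
    0xF: "Private Use Plane",
    0x10: "Private Use Plane",
}


def split_code_point(cp):
    """Return (plane, row, character number) for a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %#x" % cp)
    plane = cp >> 16           # upper bits select one of 17 planes
    row = (cp >> 8) & 0xFF     # next 8 bits select a row within the plane
    cell = cp & 0xFF           # low 8 bits select the character within the row
    return plane, row, cell


def plane_name(cp):
    """Return the slide label for the plane containing cp."""
    plane, _, _ = split_code_point(cp)
    return PLANE_NAMES.get(plane, "supplementary plane (unlabeled on the slides)")


# The four categories quoted for Unicode 3.2 add up to the whole encoding space.
total = PLANE_COUNT * PLANE_SIZE
categories = 95156 + 137216 + 2114 + 879626
print(total, categories)

for ch in ("A", "\u00E9", "\U00010330"):   # U+10330 is a Gothic letter
    cp = ord(ch)
    print("U+%04X" % cp, split_code_point(cp), plane_name(cp))
```

Shifting and masking keeps the three fields independent, which is also why supplementary-plane characters such as Gothic need more than 16 bits.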
Characters and Glyphs
[slide illustrations not recoverable]

Ligatures
• fi (two letters) and ﬁ (a single ligature glyph)
[further illustrations not recoverable]

Split Vowels
• These two marks are parts of the same character
[illustration not recoverable]

Character Positioning
[slide illustrations not recoverable]

Combining characters
• One character… or two? é
• Actually, either. Unicode is generative, with accent marks represented with their own code point values:
    é = U+0065 (e) U+0301 (accent)
• But common combinations of letters and accents are also given their own code points for convenience:
    é = U+00E9
• This can be tough, because the two representations are to be treated as absolutely identical:
    U+0065 U+0301 = U+00E9
• Things can get really wild for characters with more than one accent mark:
    ộ = 006F (o) 0302 (circumflex) 0323 (dot)
      = 006F (o) 0323 (dot) 0302 (circumflex)
      = 00F4 (o-circumflex) 0323 (dot)
      = 1ECD (o-dot) 0302 (circumflex)
      = 1ED9 (o-circumflex-dot)
• Unicode provides normalization rules to aid in comparison. These provide for a preferred (normalized) representation:
    ộ = 006F (o) 0302 (circumflex) 0323 (dot)
      = 006F (o) 0323 (dot) 0302 (circumflex)     [fully decomposed]
      = 00F4 (o-circumflex) 0323 (dot)
      = 1ECD (o-dot) 0302 (circumflex)
      = 1ED9 (o-circumflex-dot)                   [fully composed]

Combining characters
• Certain characters are designated as combining characters
• Combining characters are grouped into classes by how they combine
• Many accented characters are represented as combining character sequences
• Composite characters with equivalent combining character sequences are said to decompose to the equivalent sequence
• The standard provides for four normalized forms to aid in comparison and processing
• The standard provides for a canonical ordering for multiple combining marks attached to the same character

Character semantics
• The Unicode standard includes an extensive database that specifies a large number of character properties, including:
  – Name
  – Type (e.g., letter, digit, punctuation mark)
  – Decomposition
  – Case and case mappings (for cased letters)
  – Numeric value (for digits and numerals)
  – Combining class (for combining characters)
  – Directionality
  – Line-breaking behavior
  – Cursive joining behavior
  – For Chinese characters, mappings to various other standards
  – And many other properties

Storage formats
• UTF-32: The 21-bit abstract Unicode value is simply zero-padded to 32 bits
• UTF-16:
  – For characters in the BMP, the 21-bit value is simply truncated to 16 bits
  – For other characters, the 21-bit value is turned into a sequence of two 16-bit values called a surrogate pair
  – A particular numeric value is either a BMP character, a high surrogate, or a low surrogate
• UTF-8:
  – For ASCII characters, the 21-bit value is truncated to 8 bits
  – For other characters, the 21-bit value is turned into a sequence of two, three, or four 8-bit values
  – Different numeric ranges are used for ASCII characters and for leading and trailing bytes
  – Different ranges are used for leading bytes of different-length sequences

Serialization formats
• UTF-16 and UTF-32 can be written to a serial device in different byte orders. The standard provides three serialization formats for UTF-16 and UTF-32:
  – A big-endian version (UTF-16BE and UTF-32BE) where the most-significant byte is written first
  – A little-endian version (UTF-16LE and UTF-32LE) where the least-significant byte is written first
  – A self-describing version where the text is preceded by a byte order mark that the receiving process can use to determine endianness

The Unicode standard
• The Unicode standard consists of:
  – The standard text, published in book form (this includes a complete set of printed code charts)
  – The Unicode Character Database, a set of data files providing complete property information on every character
  – Various Web-published supplemental materials:
    • Unicode Standard Annexes (UAX): Amendments to the standard since the last book was published
    • Unicode Technical Standards (UTS): Allied standards maintained separately from Unicode itself
    • Unicode Technical Reports (UTR): Non-normative documents providing background info, implementation hints, or other useful information
    • Unicode Technical Notes (UTN): Other articles of interest

Dealing with Unicode
• The basic character and string classes in Windows 2000 and XP are Unicode-based, and Windows provides an extensive set of APIs for working with Unicode text
• The basic character and string classes in Java are also Unicode-based, and the Java Class Library also provides an extensive set of APIs for working with Unicode text
• Several third-party packages, including the open-source International Components for Unicode, are also available

For more information
• The published standard is available in bookstores
• Virtually everything related to the standard is available at http://www.unicode.org
• Two good books: Unicode Demystified by yours truly and Unicode: A Primer by Tony Graham
• Ask questions at unicode@unicode.org
• Contact me at rtgillam@concentric.net
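The equivalence described on the combining-characters slides can be checked with Python's standard `unicodedata` module. The sketch below is only an illustration and not part of the original deck: it takes the five spellings of o-circumflex-dot listed there and shows that they differ as raw strings but collapse to one fully decomposed form (NFD) and one fully composed form (NFC). The helper name `code_points` is made up for this example.

```python
import unicodedata

# The five spellings of o-circumflex-dot from the combining-characters slides.
SPELLINGS = [
    "\u006F\u0302\u0323",   # o, circumflex, dot
    "\u006F\u0323\u0302",   # o, dot, circumflex
    "\u00F4\u0323",         # o-circumflex, dot
    "\u1ECD\u0302",         # o-dot, circumflex
    "\u1ED9",               # o-circumflex-dot
]


def code_points(text):
    """Format a string as space-separated hex code point values."""
    return " ".join("%04X" % ord(ch) for ch in text)


# As raw code point sequences, the five spellings are all different.
distinct_raw = len(set(SPELLINGS))

# After normalization, every spelling maps to the same sequence.
nfd_forms = {unicodedata.normalize("NFD", s) for s in SPELLINGS}
nfc_forms = {unicodedata.normalize("NFC", s) for s in SPELLINGS}

print("distinct raw spellings:", distinct_raw)
print("fully decomposed (NFD):", [code_points(s) for s in nfd_forms])
print("fully composed (NFC):  ", [code_points(s) for s in nfc_forms])

# The simpler two-way case from the slides: e + accent versus precomposed e-acute.
decomposed = "\u0065\u0301"
precomposed = "\u00E9"
print(decomposed == precomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Comparing strings only after normalizing both sides to the same form is one way to honor the "treated as absolutely identical" requirement; which form to pick is an application decision.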
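The bit layouts from the storage-format and serialization-format slides did not survive extraction, so here is a hedged sketch of the same ideas, written from the published UTF-16 and UTF-8 definitions rather than from the slides. The function names `utf16_units`, `utf8_bytes`, `serialize_utf16`, and `detect_utf16_order` are invented for this example; in practice Python's built-in `str.encode` does this work.

```python
import codecs


def utf16_units(cp):
    """Return the UTF-16 code units for one code point, as a list of ints."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %#x" % cp)
    if cp < 0x10000:
        return [cp]                      # BMP: the value fits in one 16-bit unit
    v = cp - 0x10000                     # 20 bits remain after the offset
    return [0xD800 | (v >> 10),          # high surrogate carries the top 10 bits
            0xDC00 | (v & 0x3FF)]        # low surrogate carries the bottom 10 bits


def utf8_bytes(cp):
    """Return the UTF-8 byte sequence for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %#x" % cp)
    if cp < 0x80:                        # ASCII: a single byte, 00..7F
        return bytes([cp])
    if cp < 0x800:                       # two bytes: leading byte C0..DF
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                     # three bytes: leading byte E0..EF
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # four bytes: leading byte F0..F4
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])   # trailing bytes are always 80..BF


def serialize_utf16(text, byteorder, with_bom=False):
    """Write text as UTF-16 bytes; byteorder is 'big' or 'little'."""
    units = [0xFEFF] if with_bom else []  # U+FEFF is the byte order mark
    for ch in text:
        units.extend(utf16_units(ord(ch)))
    return b"".join(u.to_bytes(2, byteorder) for u in units)


def detect_utf16_order(data):
    """Return 'big' or 'little' if data starts with a UTF-16 BOM, else None."""
    if data[:2] == codecs.BOM_UTF16_BE:
        return "big"
    if data[:2] == codecs.BOM_UTF16_LE:
        return "little"
    return None


gothic = 0x10330                          # a character outside the BMP
print([hex(u) for u in utf16_units(gothic)])
print(utf8_bytes(gothic).hex())
print(serialize_utf16("A", "little", with_bom=True))
```

Because ASCII bytes, leading bytes, and trailing bytes occupy disjoint ranges, a reader can resynchronize after a corrupted byte, which addresses the multi-byte problem raised at the start of the deck.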
