Character Encoding
Unicode
J. Schneeberger
University of Applied Sciences Deggendorf

Unicode and Character Sets
With the help of Olaf Winterstein (Greifswald 08)

What is Unicode?
• An international standard
• Goal: unify all characters of all languages worldwide
  – Until then, there were various national standards.
  – Standardization and integration of these standards
• Published by the Unicode Consortium (founded 1991)
• Also known as ISO/IEC 10646 (the two standards share the same character set)

ASCII – the base
• American Standard Code for Information Interchange (ASCII)
• 128 characters
  – representable with 7 bits
  – hexadecimal: XY (X = 0..7, Y = 0..F)

ISO-Latin-1
• Western characters
• 256 characters
• 8 bits
• hexadecimal: XY (X = 0..F, Y = 0..F)
• Charts: http://www.unicode.org/charts/PDF/U0000.pdf and http://www.unicode.org/charts/PDF/U0080.pdf

More character sets
• Balinese
• But also Latin script

... (indefinitely) more characters
• 65,536 places for characters
  – 256 tables of 256 characters
  – numbered from 0 to 65,535
  – hexadecimal 0000 to FFFF
• For readability: 256 blocks with 256 entries
• e.g. block 00 for all entries 0000 to 00FF
• normally: complete blocks for character systems
• now: max. 1,114,112 code points (17 planes of 65,536 places each)
• Unicode Consortium: http://www.unicode.org/

Unicode Character Set

Unicode Plane 0 [Wikipedia]

Unicode Planes

Plane   Range            Name                                       Description
0       0000-FFFF        Basic Multilingual Plane (BMP)             Integration of old character sets
1       10000-1FFFF      Supplementary Multilingual Plane (SMP)     Historic characters, music, mathematics
2       20000-2FFFF      Supplementary Ideographic Plane (SIP)      Han unification (about 40,000 characters)
3-13    30000-DFFFF      unassigned
14      E0000-EFFFF      Supplementary Special-purpose Plane (SSP)  Non-graphical characters, e.g. language tags
15      F0000-FFFFF      Private Use Area (PUA)                     Private usage
16      100000-10FFFF    Private Use Area (PUA)                     Private usage

Unicode Plane 1 [Wikipedia]
• Linear B (15th to 13th century BC)
• Ancient Greek
• Cuneiform script
• Old Italic
• Gothic
• Old Persian
• Osmanya
• Phoenician
• ...
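The plane table above can be turned into a small lookup: since every plane holds 65,536 (hex 10000) places, the plane number is the code point shifted right by 16 bits. The following is only a minimal sketch; the names `PLANE_NAMES`, `plane_of` and `describe` are made up for this illustration, and the plane names are the ones listed in the table.

```python
# Sketch: find the Unicode plane of a character.
# All names here are illustrative assumptions, not part of any standard API.

MAX_CODE_POINT = 0x10FFFF  # last place of plane 16

# Plane names as given in the plane table; planes 3-13 are listed as unassigned.
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    14: "Supplementary Special-purpose Plane (SSP)",
    15: "Private Use Area (PUA)",
    16: "Private Use Area (PUA)",
}


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) of a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane has 0x10000 places, so the plane is the value above bit 15.
    return code_point >> 16


def describe(char: str) -> str:
    """Format a single character as 'U+XXXX plane N: name'."""
    code_point = ord(char)
    plane = plane_of(code_point)
    name = PLANE_NAMES.get(plane, "unassigned")
    return f"U+{code_point:04X} plane {plane}: {name}"


if __name__ == "__main__":
    for sample in ["A", "\u6c34", "\U0001D11E"]:
        print(describe(sample))
```

The 17 planes of 65,536 places each give the maximum of 1,114,112 code points, which is why the check uses 10FFFF as the upper limit.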
Characters and Glyphs
• Unicode encodes characters, not glyphs.
• A glyph is a particular rendering of a character.
• There can be multiple glyphs for one character.
• Glyphs are stored in fonts.
• If a font is "Unicode-conformant", its character table maps to the Unicode code points correctly. It does not necessarily contain all Unicode characters.

Finding Unicode Characters
• Different methods
  1. Unicode NamesList http://unicode.org/Public/UNIDATA/NamesList.txt
  2. Unicode Character Name Index http://unicode.org/charts/charindex.html
  3. Unicode Character Code Charts http://unicode.org/charts/
  4. Unicode Search http://www.fileformat.info/info/unicode/char/search.htm

Unicode NamesList
ftp://ftp.unicode.org/
ftp://ftp.unicode.org/Public/5.2.0/ucd/NamesList-5.2.0d2.txt

Unicode Character Name Index
http://unicode.org/charts/charindex.html

Unicode Charts
http://www.unicode.org/charts/

Character Search
http://www.fileformat.info

Complex Characters
• Some characters are a combination of multiple other characters.
• Searching for them is difficult [http://www.unicode.org/standard/where/]

Character Encoding
The pitfalls of Unicode ...

What is a character encoding?
• Historic examples
  – Morse code
  – Braille
  – ...
• There are multiple ways to code the numbers 0 to 65,535 in binary

4 Levels / Steps
• ACR: Abstract Character Repertoire
  – The set of characters that has to be coded (e.g. an alphabet)
• CCS: Coded Character Set
  – Mapping of the character set or alphabet (abstract character repertoire, ACR) to a set of non-negative numbers, typically 0..n-1.
• CEF: Character Encoding Form
  – Mapping of the set of numbers to sequences of units of fixed length (e.g. 32-bit units).
  – If the size of the set exceeds the number of available unit values (e.g. 256 for 8-bit units), an escaping procedure has to be agreed on.
• CES: Character Encoding Scheme
  – An invertible transformation of the CEF units to 8-bit units (octets), so that the text can be stored and exchanged as a byte sequence.
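The four levels above can be sketched in a few lines of code. This is only an illustration under stated assumptions: Unicode is used as the CCS, 16-bit units in the style of UTF-16 as the CEF (with surrogate pairs playing the role of the "escaping procedure"), and a choice of byte order as the CES. The function names `ccs`, `cef_utf16` and `ces` are invented for this sketch.

```python
import struct

# ACR: the abstract repertoire, here just three sample characters.
REPERTOIRE = ["z", "\u6c34", "\U0001D11E"]


def ccs(char: str) -> int:
    """CCS: map a character to a non-negative number (its Unicode code point)."""
    return ord(char)


def cef_utf16(code_point: int) -> list[int]:
    """CEF: map a number to one or two 16-bit units.

    Numbers that do not fit into one unit are 'escaped' as a surrogate pair.
    """
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"cannot encode {code_point:#x}")
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return [high, low]


def ces(units: list[int], big_endian: bool = True) -> bytes:
    """CES: serialize 16-bit units into octets with a fixed byte order."""
    byte_order = ">" if big_endian else "<"
    return struct.pack(byte_order + "H" * len(units), *units)


if __name__ == "__main__":
    for char in REPERTOIRE:
        number = ccs(char)
        units = cef_utf16(number)
        octets = ces(units)
        print(f"U+{number:04X}",
              " ".join(f"{u:04X}" for u in units),
              " ".join(f"{b:02X}" for b in octets))
```

Keeping the CEF and the CES apart is what makes the byte order a separate decision: the same 16-bit units can be written big-endian or little-endian without touching the earlier levels.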
Encoding

ACR   CCS   CEF   CES
A     0     D50   D 5 0
B     1     D51   D 5 1
C     2     D52   D 5 2
D     3     D53   D 5 3
E     4     D54   D 5 4
...   ...   ...   ...

Common Encodings
• ISO 646: ASCII
• EBCDIC: CP930
• ISO 8859: ISO 8859-1 Western Europe, ..., ISO 8859-16
• MS-Windows character sets: Windows-1250, ..., Windows-1258
• Mac OS Roman
• Cork / T1
• JIS X 0208, widely used for Japanese: e.g. EUC-JP, ISO-2022-JP
• Chinese Guobiao: GB 2312, GBK (Microsoft Code page 936), GB 18030
• Unicode: UTF-8, UTF-16, ...

Unicode and Encodings
• Unicode in programs
  – UCS-2: two-byte characters
  – UCS-4: four-byte characters (future)
• Unicode in files
  – UTF-8: ASCII stays ASCII, all other characters take 2 to 4 bytes
  – UTF-16: two octets per character (four for characters outside the BMP), optional initial byte order mark (BOM)
• ASCII with (hexa)decimal character references: &#169; or &#xA9; for ©

UTF-8 (Unicode Transformation Format)

Character range (hex)    Range (decimal)    UTF-8 octet sequence
0000 0000-0000 007F      0-127 (ASCII)      0xxxxxxx
0000 0080-0000 07FF      128-2047           110xxxxx 10xxxxxx
0000 0800-0000 FFFF      2048-65535         1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF      65536-1114111      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Example: A≢Α. (A<NOT IDENTICAL TO><ALPHA>.)
Code points:    U+0041   U+2262     U+0391   U+002E
UTF-8 octets:   41       E2 89 A2   CE 91    2E
http://www.ietf.org/rfc/rfc3629.txt

Examples and Tests
http://www.columbia.edu/kermit/utf8.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html

UTF-16/UCS-2
• UCS = Universal Character Set

Code point            Character          UTF-16 code    Glyph
122 (hex 7A)          small z (Latin)    007A           z
27700 (hex 6C34)      water (Chinese)    6C34           水
119070 (hex 1D11E)    treble clef        D834 DD1E      𝄞

• UTF-16 encoding of these 3 characters in the order 27700, 122, 119070:

Encoding    Byte order               Byte sequence
UTF-16LE    little-endian            34 6C, 7A 00, 34 D8 1E DD
UTF-16BE    big-endian               6C 34, 00 7A, D8 34 DD 1E
UTF-16      little-endian, w. BOM    FF FE, 34 6C, 7A 00, 34 D8 1E DD
UTF-16      big-endian, w. BOM       FE FF, 6C 34, 00 7A, D8 34 DD 1E

Note: in the little-endian rows the two bytes of each pair appear in inverse order.

BOM – Byte Order Mark
• optional at the beginning of a file
• used to specify the byte order in UTF-16 or UTF-32 files
• ... or to label UTF-8, UTF-16 or UTF-32 files
• troublesome when files are exchanged across platforms

Encoding                Byte sequence
UTF-8                   EF BB BF
UTF-16 Big Endian       FE FF
UTF-16 Little Endian    FF FE
UTF-32 Big Endian       00 00 FE FF
UTF-32 Little Endian    FF FE 00 00

[http://de.wikipedia.org/wiki/Byte_Order_Mark]

Analyze Encoding
• debug command: analyzes files as a hex dump
• BOM: FF FE

Unicode and XML
<?xml version="1.0" encoding="ISO-8859-1" ?>
<?xml version="1.0" encoding="UTF-8" ?>
• XML programs usually work with UTF-8 and UTF-16
• ... but ASCII, EBCDIC, JIS, KOI8-R and Big5 are also accepted by most programs.

Literature
• J. Allen, J. Becker (eds.), The Unicode Standard, Version 5.0, Addison-Wesley Longman, Amsterdam, 2006.
• V. Cerf, ASCII format for Network Interchange, RFC 20, 1969, http://tools.ietf.org/html/rfc20

Hands-on section
• Take a look at the Unicode tables at www.unicode.org
• Try to find a particular character using the Unicode character search at www.fileformat.info and integrate it into your documents.
• Try to write a UTF-16 document and analyze it using debug or od.
• Convert a document from ISO Latin to UTF-8 or vice versa using iconv:
    iconv -f ISO-8859-15 -t UTF-8 infile > outfile