Character Encoding

Unicode

J. Schneeberger
University of Applied Sciences Deggendorf

Unicode and Character Sets
With the help of Olaf Winterstein (Greifswald 08)

What is Unicode?
• An international standard
• Goal: unify all characters of all languages worldwide
  – Before Unicode there were various national standards.
  – Unicode standardizes and integrates these standards.
• Published by the Unicode Consortium (founded 1991)
• Also known as ISO/IEC 10646 (the two standards are kept synchronized)

ASCII: the base
• American Standard Code for Information Interchange (ASCII)
• 128 characters, representable in 7 bits
• hexadecimal: XY (X = 0..7, Y = 0..F)

ISO-Latin-1
• Western European characters
• 256 characters
• 8 bits
• hexadecimal: XY (X = 0..F, Y = 0..F)
http://www.unicode.org/charts/PDF/U0000.pdf

More character sets
http://www.unicode.org/charts/PDF/U0080.pdf

Balinese
• But also Latin script

... and (many) more characters
• 65,536 places for characters
  – 256 tables of 256 characters
  – numbered from 0 to 65,535
  – hexadecimal 0000 to FFFF
• For readability: 256 blocks with 256 entries
• e.g. block 00 for all entries 0000 to 00FF
• normally: complete blocks for character systems
• now: max. 1,114,112 code points (17 planes of 65,536 each)
• Unicode Consortium: http://www.unicode.org/

[Figure: Unicode Character Set]

[Figure: Unicode Plane 0, source: Wikipedia]

Unicode Planes

Plane  Range          Name                                       Description
0      0000-FFFF      Basic Multilingual Plane (BMP)             Integration of old character sets
1      10000-1FFFF    Supplementary Multilingual Plane (SMP)     Historic characters, music, mathematics
2      20000-2FFFF    Supplementary Ideographic Plane (SIP)      Han unification (40,000 characters)
3-13   30000-DFFFF    unassigned
14     E0000-EFFFF    Supplementary Special-purpose Plane (SSP)  Non-graphical characters, e.g. country codes
15     F0000-FFFFF    Private Use Area (PUA)                     Private usage
16     100000-10FFFF  Private Use Area (PUA)                     Private usage

Unicode Plane 1
• Linear B (15th to 13th century BC)
• Ancient Greek
• Cuneiform script
• Old Italic
• Gothic
• Old Persian
• Ottoman Turkish language
• Phoenician
• ...
[Wikipedia]
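The plane layout above follows directly from the code point value: each plane holds 65,536 code points, so the plane number is the code point divided by 65,536 (a right shift by 16 bits). The following is a minimal Python sketch of that arithmetic. The names plane_of, describe and PLANE_NAMES are made up for this illustration and are not part of any Unicode library.

```python
# Short names of the planes listed in the table above (illustrative only).
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP", 15: "PUA", 16: "PUA"}


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that contains the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % code_point)
    # Each plane spans 0x10000 (65,536) code points.
    return code_point >> 16


def describe(ch: str) -> str:
    """Format a single character as 'U+XXXX plane N (NAME)'."""
    cp = ord(ch)
    plane = plane_of(cp)
    name = PLANE_NAMES.get(plane, "unassigned")
    return "U+%04X plane %d (%s)" % (cp, plane, name)


if __name__ == "__main__":
    for ch in ["z", "\u6c34", "\U0001D11E"]:
        print(describe(ch))
```

For example, the treble clef U+1D11E lies in plane 1, while all of ASCII and ISO-Latin-1 lie in plane 0.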
Characters and Glyphs
• Unicode stores characters, not glyphs.
• A glyph is a particular rendering of a character.
• There can be multiple glyphs for one character.
• Glyphs are stored in fonts.
• If a font is "Unicode-conformant", its character table follows the Unicode numbering correctly. It need not contain all Unicode characters.

Finding Unicode Characters
• Different methods
  1. Unicode NamesList
     http://unicode.org/Public/UNIDATA/NamesList.txt
  2. Unicode Character Name Index
     http://unicode.org/charts/charindex.html
  3. Unicode Character Code Charts
     http://unicode.org/charts/
  4. Unicode Search
     http://www.fileformat.info/info/unicode/char/search.htm

Unicode NamesList
ftp://ftp.unicode.org/
ftp://ftp.unicode.org/Public/5.2.0/ucd/NamesList-5.2.0d2.txt

Unicode Character Name Index
http://unicode.org/charts/charindex.html

Unicode Charts
http://www.unicode.org/charts/

Character Search
http://www.fileformat.info

Complex Characters
• Some characters are a combination of multiple other characters.
• This makes searching difficult.
[http://www.unicode.org/standard/where/]

Character Encoding
The pitfalls of Unicode ...

What is a character encoding?
• Historic examples
  – Morse code
  – Braille
  – ...
• There are multiple ways to represent the numbers 0 to 65,535 in binary.

4 Levels / Steps
• ACR: Abstract Character Repertoire
  – The set of characters to be encoded (e.g. an alphabet)
• CCS: Coded Character Set
  – Mapping of the character set or alphabet (the abstract character repertoire, ACR) to a set of non-negative numbers, typically 0..n-1.
• CEF: Character Encoding Form
  – Mapping of the set of numbers to units of fixed length (e.g. 32-bit units).
  – If the size of the set exceeds the number of available unit values (e.g. 256 for 8-bit units), an escaping procedure has to be agreed on.
• CES: Character Encoding Scheme
  – An invertible transformation of the CEF units to 8-bit units (octets), so that text can be stored and exchanged as a byte sequence.
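The four levels above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CCS is taken to be Unicode itself (via the built-in ord), the CEF is assumed to be UTF-16 with 16-bit units, where surrogate pairs play the role of the escaping procedure, and the CES fixes the byte order. The function names ccs, cef_utf16 and ces are hypothetical and chosen for this example only.

```python
def ccs(ch: str) -> int:
    """CCS: map one character of the repertoire to its number (code point)."""
    return ord(ch)


def cef_utf16(code_point: int) -> list:
    """CEF: map a code point to a list of 16-bit code units (UTF-16 assumed)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("cannot encode %r" % code_point)
    if code_point < 0x10000:
        # Fits into a single 16-bit unit.
        return [code_point]
    # Escaping procedure: split the 20 remaining bits over two surrogate units.
    v = code_point - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def ces(units: list, byteorder: str = "big") -> bytes:
    """CES: serialize 16-bit units as octets in the given byte order."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in units)


if __name__ == "__main__":
    for ch in ["A", "\u6c34", "\U0001D11E"]:
        number = ccs(ch)
        units = cef_utf16(number)
        octets = ces(units, "big")
        print(hex(number), [hex(u) for u in units], octets.hex(" "))
```

Keeping the three functions separate mirrors the model: the same code point could be passed to a different encoding form (for example 8-bit units) without touching the CCS step.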
Encoding

ACR   CCS   CEF   CES
A     0     D50   D 5 0
B     1     D51   D 5 1
C     2     D52   D 5 2
D     3     D53   D 5 3
E     4     D54   D 5 4
...   ...   ...   ...

Common Encodings
• ISO 646: ASCII
• EBCDIC: CP930
• ISO 8859: ISO 8859-1 Western Europe, ..., ISO 8859-16
• MS-Windows character sets: Windows-1250, ..., Windows-1258
• Mac OS Roman
• Cork / T1
• JIS X 0208, widely used for Japanese: e.g. EUC-JP, ISO-2022-JP
• Chinese Guobiao: GB 2312, GBK (Microsoft code page 936), GB 18030
• Unicode: UTF-8, UTF-16, ...

Unicode and Encodings
• Unicode in programs
  – UCS-2: two-byte characters
  – UCS-4: four-byte characters (future)
• Unicode in files
  – UTF-8: ASCII stays ASCII (1 byte); all other characters take 2 to 4 bytes
  – UTF-16: two octets per character in the BMP, four octets (a surrogate pair) beyond it; may start with an initial byte order mark
• ASCII with decimal or hexadecimal position codes: e.g. the character reference &#169; (decimal) or &#xA9; (hexadecimal) for ©

UTF-8 (Unicode Transformation Format)

Character range (hex)   Range (decimal)   UTF-8 octet sequence (binary)
0000 0000-0000 007F     0-127 (ASCII)     0xxxxxxx
0000 0080-0000 07FF     128-2047          110xxxxx 10xxxxxx
0000 0800-0000 FFFF     2048-65535        1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF     65536-1114111     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Example: A≢Α.
  A <NOT IDENTICAL TO> <ALPHA> .
  Code points: U+0041 U+2262 U+0391 U+002E
  UTF-8 bytes: 41 E2 89 A2 CE 91 2E
http://www.ietf.org/rfc/rfc3629.txt

Examples and Tests
http://www.columbia.edu/kermit/utf8.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html

UTF-16/UCS-2
• UCS = Universal Character Set

Code point           Character         UTF-16 code units   Glyph
122 (hex 7A)         small z (Latin)   007A                z
27700 (hex 6C34)     water (Chinese)   6C34                水
119070 (hex 1D11E)   treble clef       D834 DD1E           𝄞

• UTF-16 encoding of these 3 characters, in the order 27700, 122, 119070:

Encoding   Byte order                Byte sequence
UTF-16LE   little-endian             34 6C, 7A 00, 34 D8 1E DD
UTF-16BE   big-endian                6C 34, 00 7A, D8 34 DD 1E
UTF-16     little-endian, with BOM   FF FE, 34 6C, 7A 00, 34 D8 1E DD
UTF-16     big-endian, with BOM      FE FF, 6C 34, 00 7A, D8 34 DD 1E

The little-endian and big-endian sequences differ only in the order of the two bytes within each pair.

BOM (Byte Order Mark)
• optional at the beginning of a file
• used to specify the byte order in UTF-16 or UTF-32 files
• ... or to label UTF-8, UTF-16 or UTF-32 files
• troublesome when files are exchanged across platforms

Encoding                Byte sequence
UTF-8                   EF BB BF
UTF-16 big-endian       FE FF
UTF-16 little-endian    FF FE
UTF-32 big-endian       00 00 FE FF
UTF-32 little-endian    FF FE 00 00
[http://de.wikipedia.org/wiki/Byte_Order_Mark]

Analyze Encoding
• The debug command analyzes files and shows a hex dump.
• In the example, the file starts with the BOM FF FE.

Unicode and XML
<?xml version="1.0" encoding="ISO-8859-1" ?>
<?xml version="1.0" encoding="UTF-8" ?>
• XML programs usually work with UTF-8 and UTF-16
• ... but ASCII, EBCDIC, JIS, KOI8-R and Big5 are also accepted by most programs.

Literature
• J. Allen, J. Becker (eds.), The Unicode Standard, Version 5.0, Addison-Wesley Longman, Amsterdam, 2006.
• V. Cerf, ASCII format for Network Interchange, RFC 20, 1969. http://tools.ietf.org/html/rfc20

Hands-on section
• Take a look at the Unicode tables at www.unicode.org
• Try to find a particular character using the Unicode character search at www.fileformat.info and integrate it into your documents.
• Try to write a UTF-16 document and analyze it using debug or od.
• Convert a document from ISO Latin to UTF-8 or vice versa using iconv:
    iconv -f ISO-8859-15 -t UTF-8 infile > outfile
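For the hands-on task of writing a UTF-16 document and inspecting its bytes, the following Python sketch may serve as a starting point alongside debug or od. It uses only the standard codecs module and the built-in str.encode; the helper names sniff_bom and hex_dump are invented for this example. It reproduces the byte sequences of the three sample characters (water, small z, treble clef) and of the RFC 3629 UTF-8 example, and it guesses an encoding from a leading BOM. Note that BOM sniffing is a heuristic: a UTF-16 little-endian file whose first character is U+0000 starts with the same four bytes as a UTF-32 little-endian BOM.

```python
import codecs
from typing import Optional

# Longer BOMs are listed first so that FF FE 00 00 (UTF-32 little-endian)
# is checked before FF FE (UTF-16 little-endian).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32 little-endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 little-endian"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated upper-case hex pairs."""
    return data.hex(" ").upper()


if __name__ == "__main__":
    text = "\u6c34z\U0001D11E"  # water, small z, treble clef

    little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

    print(hex_dump(little))   # FF FE 34 6C 7A 00 34 D8 1E DD
    print(hex_dump(big))      # FE FF 6C 34 00 7A D8 34 DD 1E
    print(sniff_bom(little))  # UTF-16 little-endian
    print(sniff_bom(big))     # UTF-16 big-endian

    # UTF-8 example from RFC 3629: A, NOT IDENTICAL TO, GREEK CAPITAL ALPHA, full stop
    print(hex_dump("A\u2262\u0391.".encode("utf-8")))  # 41 E2 89 A2 CE 91 2E
```

Writing the variable little to a file in binary mode yields a small UTF-16 document whose hex dump can be compared with the output above.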
