
Unicode and Character Sets

J. Schneeberger, University of Applied Sciences Deggendorf

With the help of Olaf Winterstein (Greifswald 08)

2 What is Unicode?

• An international standard
• Goal: unify all characters of all languages worldwide
  – Until then, there were various national standards.
  – Standardization and integration of these standards
• Publisher is the Unicode Consortium (founded 1991)
• aka ISO/IEC 10646

3 ASCII – the base

• American Standard Code for Information Interchange (ASCII)
• 128 characters
  – representable with 7 bits
  – hexadecimal: XY (X = 0..7, Y = 0..F)

4 ISO-Latin-1

• Western European characters
• 256 characters
• 8 bits
• hexadecimal: XY (X = 0..F, Y = 0..F)

http://www.unicode.org/charts/PDF/U0000.pdf
http://www.unicode.org/charts/PDF/U0080.pdf

5 More character sets

6 Balinese

• But also Latin

7 ... (indefinitely) more characters

• 65,536 places for characters
  – 256 tables of 256 characters
  – enumeration from 0 to 65,535
  – 0000 to FFFF
• For readability: 256 blocks with 256 entries
• e.g. block 00 for all entries 0000 to 00FF
• normally: complete blocks for character systems
• now: max. 1,114,112 characters (0000 to 10FFFF)
• Unicode Consortium http://www.unicode.org/
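The numbers on this slide can be checked with a few lines of Python. This is only a sketch: the name `block_and_entry` and the constants are illustrative, not part of any Unicode API, and "block" is used in the slide's sense of a table with 256 entries. The factor 17 assumes the planes 0 to 16 listed in the planes table of this deck.

```python
ENTRIES_PER_BLOCK = 256   # entries per table ("block" in the slide's sense)
BLOCKS = 256              # tables in the original 16-bit design
PLANES = 17               # planes 0..16 of today's code space


def block_and_entry(code_point: int) -> tuple:
    """Split a 16-bit code point into (block, entry): its high and low byte."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only 0000..FFFF fits the 256 x 256 layout")
    return code_point >> 8, code_point & 0xFF


places_16_bit = BLOCKS * ENTRIES_PER_BLOCK    # 65,536
places_today = PLANES * places_16_bit         # 1,114,112

print(places_16_bit, places_today)
print(block_and_entry(0x00E4))   # block 0x00, entry 0xE4
```

Under these assumptions the highest code point, 10FFFF, is exactly `places_today - 1`.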

8 Unicode Character Set

9 Unicode Plane 0

[Wikipedia]

10 Unicode Planes

Plane  Range          Name                                 Abbr.  Description
0      0000-FFFF      Basic Multilingual Plane             BMP    Integration of old character sets
1      10000-1FFFF    Supplementary Multilingual Plane     SMP    Historic characters, music, mathematics
2      20000-2FFFF    Supplementary Ideographic Plane      SIP    (40,000 characters)
3-13   30000-DFFFF    unassigned
14     E0000-EFFFF    Supplementary Special-purpose Plane  SSP    non-graphical characters, e.g. country codes
15     F0000-FFFFF    Private Use Area                     PUA    private usage
16     100000-10FFFF  Private Use Area                     PUA

11 Unicode Plane 1

• Linear B (15th to 13th century BC)
• Ancient Greek
• script
• Old Italic
• Gothic
• Old Persian
• Osmanya
• Phoenician
• ...

[Wikipedia]
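Which plane a character belongs to follows directly from its code point: each plane spans 10000 (hex) code points, so the plane number is the code point shifted right by 16 bits. A minimal sketch, with plane names copied from the planes table in this deck; `plane_of` and `describe_plane` are illustrative names, not a library API.

```python
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    14: "Supplementary Special-purpose Plane (SSP)",
    15: "Private Use Area (PUA)",
    16: "Private Use Area (PUA)",
}


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16


def describe_plane(ch: str) -> str:
    """Describe the plane of a single character, using the names from the table."""
    plane = plane_of(ord(ch))
    name = PLANE_NAMES.get(plane, "unassigned")
    return f"U+{ord(ch):04X} is in plane {plane}: {name}"


print(describe_plane("A"))
print(describe_plane("\U00010900"))   # first code point of the Phoenician block
```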

12 Characters and Glyphs

• Unicode stores characters, not glyphs.
• A glyph is a particular rendering of a character.
• Multiple glyphs for one character.
• Glyphs are stored in fonts.
• If a font is “Unicode-conformant”, the character table of the font maps to the Unicode index correctly. It need not contain all Unicode characters.

13 Finding Unicode Characters

• Different methods
  1. Unicode NamesList
     http://unicode.org/Public/UNIDATA/NamesList.txt
  2. Unicode Character Name Index
     http://unicode.org/charts/charindex.html
  3. Unicode Character Code Charts
     http://unicode.org/charts/
  4. Unicode Search
     http://www.fileformat.info/info/unicode/char/search.htm
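For quick lookups, the character names from the NamesList are also available programmatically. The sketch below uses Python's standard `unicodedata` module; the helper `char_info` is an illustrative name. Which characters are known depends on the Unicode version bundled with the Python installation.

```python
import unicodedata


def char_info(ch: str) -> str:
    """Return code point and official name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


print(char_info("©"))
print(char_info("z"))

# The reverse direction: find a character by its name.
alpha = unicodedata.lookup("GREEK CAPITAL LETTER ALPHA")
print(alpha, char_info(alpha))
```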

14 Unicode NamesList

ftp://ftp.unicode.org/
ftp://ftp.unicode.org/Public/5.2.0/ucd/NamesList-5.2.0d2.txt

15 Unicode Character Name Index

http://unicode.org/charts/charindex.html

16 Unicode Charts

http://www.unicode.org/charts/

17 Character Search

http://www.fileformat.info

18 Complex Characters

• Some characters are a combination of multiple other characters.
• Search is difficult
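Why search becomes difficult can be seen with the letter ä, which exists both as a single code point and as a + combining diaeresis. The sketch below uses `unicodedata.normalize` from the Python standard library; `same_text` is an illustrative helper, and comparing in normalization form NFC is one common approach, not the only one.

```python
import unicodedata

composed = "\u00e4"       # ä as one code point
decomposed = "a\u0308"    # a followed by COMBINING DIAERESIS


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)            # False: different code point sequences
print(same_text(composed, decomposed))   # True: same text after normalization
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", composed)])
```

A plain substring search for the composed form would miss the decomposed form unless both the text and the search term are normalized first.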

[http://www.unicode.org/standard/where/]

19 The pitfalls of Unicode ...

20 What is a character encoding?

• Historic examples
  – ...
• There are multiple ways to encode the numbers 0 to 65,535 in binary

21 4 Levels / Steps

• ACR: Abstract Character Repertoire
  – The set of characters to be encoded (e.g. an alphabet)
• CCS: Coded Character Set
  – Mapping of the character set or alphabet (abstract character repertoire, ACR) to a set of non-negative numbers, typically 0..n
• CEF: Character Encoding Form
  – Mapping of that set of numbers to sequences of units of fixed length (e.g. 32-bit units)
  – If the size of the set exceeds the number of values a unit can hold (e.g. 256 for 8-bit units), an escaping procedure has to be agreed on.
• CES: Character Encoding Scheme
  – An invertible transformation of the CEF units to 8-bit units (octets), so that text can be stored on older, octet-oriented systems

22 Encoding

ACR   CCS   CEF   CES
A     0     D50   D 5 0
B     1     D51   D 5 1
C     2     D52   D 5 2
D     3     D53   D 5 3
E     4     D54   D 5 4
...
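The four levels can also be traced for a real Unicode character. The sketch below picks UTF-16 as the encoding form purely for illustration: the character is the ACR member, `ord` gives its CCS number, the 16-bit code units are the CEF, and the big-endian or little-endian byte sequences are two possible CES results. `four_levels` is a made-up helper name.

```python
import struct


def four_levels(ch: str) -> dict:
    """Trace one character through CCS, CEF (UTF-16 units) and CES (bytes)."""
    code_point = ord(ch)                       # CCS: character -> number
    big_endian = ch.encode("utf-16-be")        # CES: units -> octets (BE)
    little_endian = ch.encode("utf-16-le")     # CES: units -> octets (LE)
    unit_count = len(big_endian) // 2
    # CEF: recover the 16-bit code units from the big-endian octets
    code_units = list(struct.unpack(f">{unit_count}H", big_endian))
    return {
        "character": ch,
        "code_point": code_point,
        "code_units": code_units,
        "octets_be": big_endian.hex(),
        "octets_le": little_endian.hex(),
    }


for key, value in four_levels("\u6c34").items():
    print(key, value)
```

The same code units lead to two different octet sequences, which is exactly the distinction between CEF and CES.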

23 Common Encodings

• ISO 646: ASCII
• EBCDIC: CP930
• ISO 8859: ISO 8859-1 Western Europe, ..., ISO 8859-16
• MS-Windows character sets: Windows-1250, ..., Windows-1258
• Mac Roman
• Cork / T1
• JIS X 0208, widely used for Japanese: e.g. EUC-JP, ISO-2022-JP
• Chinese Guobiao: GB 2312, GBK (code page 936), GB 18030
• Unicode: UTF-8, UTF-16, ...

24 Unicode and Encodings

• Unicode in programs
  – UCS-2: two-byte characters
  – UCS-4: four-byte characters (future)
• Unicode in files
  – UTF-8: ASCII is ASCII, all other characters are 2- to 4-byte sequences
  – UTF-16: two octets per character (four octets beyond FFFF), optional initial BOM
• ASCII with (hexa)decimal position codes: &#169; or &#xA9; for © (numeric character reference)

25 UTF-8 (Unicode Transformation Format)

Character range (hex)   Range (decimal)   UTF-8 octet sequence
0000 0000-0000 007F     0-127 (ASCII)     0xxxxxxx
0000 0080-0000 07FF     128-2047          110xxxxx 10xxxxxx
0000 0800-0000 FFFF     2048-65535        1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF     65536-1114111     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Example: A≢Α.

Character    A        ≢          Α        .
Code point   U+0041   U+2262     U+0391   U+002E
UTF-8        41       E2 89 A2   CE 91    2E

http://www.ietf.org/rfc/rfc3629.txt

26 Examples and Tests
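As a first test, the bit patterns of the UTF-8 table can be applied by hand. The following is a minimal sketch of an encoder for a single code point, written for illustration only (in practice `str.encode("utf-8")` does this); the function name `utf8_encode` is an assumption. It rejects the surrogate range D800 to DFFF, which RFC 3629 excludes from UTF-8.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the octet patterns of the UTF-8 table."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# The example from RFC 3629: U+0041 U+2262 U+0391 U+002E
encoded = b"".join(utf8_encode(cp) for cp in (0x0041, 0x2262, 0x0391, 0x002E))
print(encoded.hex())   # 41e289a2ce912e
```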

http://www.columbia.edu/kermit/utf8.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html

27 UTF-16/UCS-2

• UCS = Universal Character Set

Code point            Character         UTF-16 code   Glyph
122 (hex 7A)          small z (Latin)   007A          z
27700 (hex 6C34)      water (Chinese)   6C34          水
119070 (hex 1D11E)    treble clef       D834 DD1E     𝄞

• UTF-16 encoding of these 3 characters in the order 27700, 122, 119070:

Encoding   Byte order                Byte sequence
UTF-16LE   little-endian             34 6C, 7A 00, 34 D8 1E DD
UTF-16BE   big-endian                6C 34, 00 7A, D8 34 DD 1E
UTF-16     little-endian, with BOM   FF FE, 34 6C, 7A 00, 34 D8 1E DD
UTF-16     big-endian, with BOM      FE FF, 6C 34, 00 7A, D8 34 DD 1E
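The rows of the table above can be reproduced with a short sketch. Code points beyond FFFF are split into a surrogate pair: subtract 10000 (hex), then add the upper 10 bits to D800 and the lower 10 bits to DC00. The function names `utf16_units` and `utf16_bytes` are illustrative; the code assumes well-formed input (no lone surrogates) and is not meant to replace `str.encode`.

```python
def utf16_units(cp: int) -> list:
    """Return the one or two 16-bit code units for a code point."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def utf16_bytes(text: str, byteorder: str = "little", bom: bool = False) -> bytes:
    """Serialize text as UTF-16 in the given byte order, optionally with a BOM."""
    out = bytearray()
    if bom:
        out += (0xFEFF).to_bytes(2, byteorder)
    for ch in text:
        for unit in utf16_units(ord(ch)):
            out += unit.to_bytes(2, byteorder)
    return bytes(out)


text = "\u6c34z\U0001D11E"   # water, small z, treble clef
print([f"{u:04X}" for u in utf16_units(0x1D11E)])   # ['D834', 'DD1E']
print(utf16_bytes(text, "little").hex())
print(utf16_bytes(text, "big", bom=True).hex())
```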

In the little-endian and big-endian byte sequences, each pair of bytes appears in inverse order.

28 BOM – Byte Order Mark

• optional at the beginning of a file
• used to specify the byte order in UTF-16 or UTF-32 files
• or to label files as UTF-8, UTF-16 or UTF-32
• troublesome when files are exchanged across platforms

Encoding                Byte sequence
UTF-8                   EF BB BF
UTF-16 big-endian       FE FF
UTF-16 little-endian    FF FE
UTF-32 big-endian       00 00 FE FF
UTF-32 little-endian    FF FE 00 00

[http://de.wikipedia.org/wiki/Byte_Order_Mark]

30 Analyze Encoding

debug command: analyzes files

hex dump

BOM: FF FE
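Where neither debug nor od is at hand, the same analysis can be sketched in Python: dump the first bytes in hex and compare them with the BOM table. The BOM constants come from the standard `codecs` module; `detect_bom` and `hex_dump` are illustrative names. The check is a heuristic: the UTF-32 marks are tested first because FF FE 00 00 also starts with the UTF-16 little-endian mark, and a file without a BOM yields `None`.

```python
import codecs

# Longer marks first, so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32 little-endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 little-endian"),
]


def detect_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as lines of offset plus hex values."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        lines.append(f"{offset:08x}  " + " ".join(f"{b:02X}" for b in chunk))
    return "\n".join(lines)


sample = bytes.fromhex("fffe346c7a00")   # BOM, then water and z in UTF-16LE
print(hex_dump(sample))
print(detect_bom(sample))
```

For a real file, the first bytes could be read with `open(path, "rb").read(64)`, where `path` is whatever file is being analyzed.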

31 Unicode and XML

• XML programs usually work with UTF-8 and UTF-16
• ... but ASCII, EBCDIC, JIS, KOI8, etc. are also accepted by most programs.

32 Literature

• J. Allen, J. Becker (eds.), The Unicode Standard, Version 5.0, Addison-Wesley Longman, Amsterdam, 2006.
• RFC 20, V. Cerf, ASCII format for Network Interchange, 1969. http://tools.ietf.org/html/rfc20

33 Hands-on section

• Take a look at the Unicode tables at www.unicode.org
• Try to find a particular character using the Unicode character search at www.fileformat.info and integrate it into your documents.
• Try to write a UTF-16 document and analyze it using debug or od.
• Convert a document from ISO Latin to UTF-8 or vice versa using iconv.
    iconv -f ISO-8859-15 -t UTF-8 infile > outfile
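If iconv is not available, the conversion exercise can be approximated in Python. This is a sketch under assumptions: `recode` and `convert_file` are made-up helper names, the whole file is read into memory, and any byte sequence that is invalid in the source encoding raises an error instead of being replaced.

```python
def recode(data: bytes, source: str = "iso-8859-15", target: str = "utf-8") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


def convert_file(infile: str, outfile: str,
                 source: str = "iso-8859-15", target: str = "utf-8") -> None:
    """File-based counterpart of: iconv -f SOURCE -t TARGET infile > outfile"""
    with open(infile, "rb") as src:
        data = src.read()
    with open(outfile, "wb") as dst:
        dst.write(recode(data, source, target))


# In ISO-8859-15 the euro sign is the single byte A4; in UTF-8 it takes three bytes.
print(recode(b"5 \xa4").hex())   # 3520e282ac
```

Working on bytes rather than text mode leaves line endings untouched, which mirrors what iconv does.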