What Every Programmer Should Know About Unicode

U+1F4A9 = 💩
What Every Programmer Should Know About Unicode
2nd semester, Media Informatics (Medieninformatik)
Prof. Dr.-Ing. Carsten Bormann
© 2008–2013 Carsten Bormann

Textual information: characters
Primary source of information on the Web: text
Characters: letters, digits, punctuation, special characters
Which characters exist? Character repertoire
How are they encoded digitally? Coded character set
What do they look like? Font (typeface)

Digital encoding
Encoding as a string of bits
  – each bit is 0 or 1
  – n bits give 2^n possibilities (2^5 = 32, 2^7 = 128, 2^8 = 256, ...)
Example: numbers

Character codes: Baudot (IA2, ITU-T S.1)
Telegraphy (50 bit/s): 5 bits, 32 symbols
A-Z = 26
Digits + punctuation = 21
6 symbols unambiguous
26 symbols doubly assigned
Letter/figure shift codes (Bu/Zi) to switch between the two assignments

Character codes: 7-bit codes
7 bits per character (one bit of the byte stays free for parity)
ASCII, ISO 646 = IA5 ~ DIN 66003
  – National variants: not all codes are assigned identically
Control characters: CR, LF, ... (0 – 31)
Graphic characters: !"#$...A-Z...a-z... (32* – 127*)

8-bit codes
Problem: national variants are unwieldy
  – European integration...
The 8th bit is unused
Idea: 2 tables
Left table ~ ASCII

8-bit codes
ISO 6937:
  – Left table: ISO 646:1973 (ASCII without $)
  – Right table: for all Latin-script languages; diacritical marks; special/composed characters
ISO 8859-n:
  – Left table: ASCII (ISO 646:1990)
  – Right table: approx. 15 variants (ISO 8859-1 to -15)

Classic character codes
Telegraphy: 5-bit code, 2^5 = 32
  – Through double assignment, 26+26+6 = 58 characters
ASCII/ISO 646: 7-bit code, 2^7 = 128
  – C-set: 32 control characters; G-set: 96 (94) graphic characters
ISO 6937: 8-bit code, 2^8 = 256
  – 2 C-sets, 2 G-sets; approx. 600 characters through composition
ISO 8859-n: 8-bit code, 2^8 = 256
  – Variants specific to economic regions, each with 94+96 = 190 characters (incl. ASCII)

Problems with 8-bit codes
Bengali, Devanagari, Tamil, Thai, Tibetan, ...
What about the ideographic scripts?
  – Kanji (Japan), Hanzi (China), Hanja (Korea, alongside Hangul)
  – Thousands of symbols
Other symbols
  – Dingbats, mathematical symbols, electrical engineering, ...
  – Half-width spaces, low left quotation marks, ...
Combining several scripts in one application
Multiple assignment = ISO 2022 (code extension)
16-/32-bit codes = ISO 10646 (Unicode)

Unicode (ISO 10646)
Goal: be able to represent all defined characters
Idea: a 32-bit character set, encoded efficiently
  – 2^31 ~ 2 billion characters (in practice: up to 0x10FFFF ~ 2^20 ~ 1 million max.)
128 groups, 256 planes, 256 rows, 256 cells

Unicode BMP: a 16-bit character set
Idea: overlay the Kanji and Hanzi variants
  – Plane 00 of group 00 is enough
Basic Multilingual Plane (BMP)
UCS-2 format
  – MSB first vs. LSB first: byte order mark (BOM) FEFF...

Unicode BMP: A-zone
ASCII and Latin-1 are code-compatible subsets
The other 8859-n sets are also present (at shifted positions)
Greek, Hebrew, Arabic, ...
Punctuation, mathematics, dingbats, ...

Representation of Unicode
UCS: UCS-2, UCS-4
  – Byte-order problems: FEFF (byte order mark, BOM)
UTF: UCS Transformation Format
  – UTF-7: +ACQ-
  – UTF-8: splits the bits across bytes; unambiguous even when reading starts mid-stream
      0000 – 007F:     0xxx xxxx
      0080 – 07FF:     110x xxxx, 10xx xxxx
      0800 – FFFF:     1110 xxxx, 10xx xxxx, 10xx xxxx
      10000 – 10FFFF:  1111 0xxx, 10xx xxxx, 10xx xxxx, 10xx xxxx
  – UTF-16: like UCS-2, but with surrogates
      10000 – 10FFFF:  subtract 10000, then 1101 10xx xxxx xxxx, 1101 11xx xxxx xxxx
      UTF-16BE vs. UTF-16LE (oops): BOM...
  – UTF-32: like UCS-4, but restricted to 0..0x10FFFF

Characters vs. glyphs
Character code: code combinations for written characters
Their appearance can nevertheless differ; shape variants can be abstracted away:
  – Ligatures
  – Arabic writing: initial, medial, final, isolated forms
  – Arabic vs. European: glyph registry vs. additional characters in Unicode

Normalization: NFD, NFC, NFKD, NFKC

Character sets in practice
Industry is in transition from ISO 8859 to Unicode
  – Windows-1252 (an extension of ISO 8859-1) is widespread
Unicode is the base character set for HTML
  – HTML itself, however, is often encoded in ISO 8859-1 (the default!)
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
  – <?xml version="1.0" encoding="iso-8859-1"?>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  – <?xml version="1.0"?>

Apache and the character set
httpd.conf, .htaccess
    AddCharset UTF-8 .html
    AddType 'text/html; charset=UTF-8' html
Selectively:
    <Files "example.html">
    AddCharset UTF-8 .html
    </Files>
http://www.w3.org/International/questions/qa-htaccess-charset

Useful Unicode characters
„Quotation marks“:
  – Low left: „ (U+201E)
  – High right: “ (U+201C) (in English: left)
  – English right: ” (U+201D)
Dash
  – En dash (Halbgeviertstrich): – (U+2013), common today
  – Em dash (Geviertstrich): — (U+2014), traditional/USA
Euro sign: € (U+20AC)
Caution: characters between &#128; and &#159; are errors (leftovers from Windows-1252)

In programming languages
ASCII-8BIT (BINARY), Big5 (CP950), CP51932, CP850 (IBM850), CP852, CP855, CP949,
Emacs-Mule, EUC-JP (eucJP), EUC-KR (eucKR), EUC-TW (eucTW), eucJP-ms (eucjp-ms),
GB12345, GB18030, GB1988, GB2312 (EUC-CN, eucCN), GBK (CP936), IBM437 (CP437),
IBM737 (CP737), IBM775 (CP775), IBM852, IBM855, IBM857 (CP857), IBM860 (CP860),
IBM861 (CP861), IBM862 (CP862), IBM863 (CP863), IBM864 (CP864), IBM865 (CP865),
IBM866 (CP866), IBM869 (CP869), ISO-2022-JP (ISO2022-JP), ISO-2022-JP-2 (ISO2022-JP2),
ISO-8859-1 (ISO8859-1), ISO-8859-10 (ISO8859-10), ISO-8859-11 (ISO8859-11),
ISO-8859-13 (ISO8859-13), ISO-8859-14 (ISO8859-14), ISO-8859-15 (ISO8859-15),
ISO-8859-16 (ISO8859-16), ISO-8859-2 (ISO8859-2), ISO-8859-3 (ISO8859-3),
ISO-8859-4 (ISO8859-4), ISO-8859-5 (ISO8859-5), ISO-8859-6 (ISO8859-6),
ISO-8859-7 (ISO8859-7), ISO-8859-8 (ISO8859-8), ISO-8859-9 (ISO8859-9),
KOI8-R (CP878), KOI8-U, macCentEuro, macCroatian, macCyrillic, macGreek, macIceland,
MacJapanese (MacJapan), macRoman, macRomania, macThai, macTurkish, macUkraine,
Shift_JIS (SJIS), stateless-ISO-2022-JP, TIS-620, US-ASCII (ASCII, ANSI_X3.4-1968, 646),
UTF-16BE (UCS-2BE), UTF-16LE, UTF-32BE (UCS-4BE), UTF-32LE (UCS-4LE), UTF-7 (CP65000),
UTF-8 (CP65001, locale, external), UTF8-MAC (UTF-8-MAC), Windows-1250 (CP1250),
Windows-1251 (CP1251), Windows-1252 (CP1252), Windows-1253 (CP1253),
Windows-1254 (CP1254), Windows-1255 (CP1255), Windows-1256 (CP1256),
Windows-1257 (CP1257), Windows-1258 (CP1258), Windows-31J (CP932, csWindows31J),
Windows-874 (CP874)
Ruby 1.8:
  – Strings are byte sequences
  – ASCII compatibility is assumed
Ruby 1.9/2.0:
    # -*- coding: UTF-8 -*-    (the default in Ruby 2.0)
  – String#bytes, #codepoints, #chars
  – String#encoding
      "a".encoding ➔ #<Encoding:UTF-8> == Encoding::UTF_8
      String.new.encoding ➔ #<Encoding:ASCII-8BIT> == Encoding::BINARY
  – String#force_encoding(Encoding::UTF_8), String#valid_encoding?
  – String#encode(Encoding::UTF_8, invalid: :replace)
  – String#encode("UTF-8", "ISO8859-1")
http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/

Being "helpful" rarely helps ("ASCII compatible")
    >> u = "a".encode("UTF-8")
    => "a"
    >> b = "a".force_encoding("BINARY")
    => "a"
    >> u + b
    => "aa"
    >> u = "ä".encode("UTF-8")
    => "ä"
    >> b = "ä".force_encoding("BINARY")
    => "\xC3\xA4"
    >> u + b
    Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT

WTF OSX HFS+ NFD
File system of OS X: HFS+
  – January 19, 1998
  – Apple had not yet fully understood Unicode
HFS+: file names in NFD
  – Müller ➔ Mu¨ller
    alma:tmp cabo$ ls -l *ml*t
    -rw-r--r--  1 cabo  wheel  13 Feb 26 15:18 ümläut
    alma:tmp cabo$ irb
    >> Dir["*ml*t"].first.chars.to_a
    => ["u", "̈", "m", "l", "a", "̈", "u", "t"]
    >> Dir["*ml*t"].first.encode("UTF-8", "UTF-8-MAC").chars.to_a
    => ["ü", "m", "l", "ä", "u", "t"]
"UTF-8-MAC" is the common name for UTF-8 in NFD
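The UTF-8 bit layout given in the slides (one to four bytes, depending on the code point range) can be sketched by hand in Ruby. This is only an illustration under stated assumptions: utf8_bytes is a hypothetical helper name, the sketch does not reject the surrogate range D800 to DFFF, and in real code Ruby's built-in encoder should be used instead.

```ruby
# Hand-rolled UTF-8 encoder following the four-row bit layout from the slides.
# utf8_bytes is a hypothetical helper, not part of Ruby's API.
# Simplification: surrogate code points (0xD800..0xDFFF) are not rejected.
def utf8_bytes(cp)
  case cp
  when 0x0000..0x007F
    [cp]                                # 0xxx xxxx
  when 0x0080..0x07FF
    [0xC0 | (cp >> 6),                  # 110x xxxx
     0x80 | (cp & 0x3F)]                # 10xx xxxx
  when 0x0800..0xFFFF
    [0xE0 | (cp >> 12),                 # 1110 xxxx
     0x80 | ((cp >> 6) & 0x3F),         # 10xx xxxx
     0x80 | (cp & 0x3F)]                # 10xx xxxx
  when 0x10000..0x10FFFF
    [0xF0 | (cp >> 18),                 # 1111 0xxx
     0x80 | ((cp >> 12) & 0x3F),        # 10xx xxxx
     0x80 | ((cp >> 6) & 0x3F),         # 10xx xxxx
     0x80 | (cp & 0x3F)]                # 10xx xxxx
  else
    raise ArgumentError, "code point out of range: #{cp}"
  end
end

# Compare the sketch with Ruby's built-in encoder for $, ä, € and U+1F4A9.
[0x24, 0xE4, 0x20AC, 0x1F4A9].each do |cp|
  builtin = [cp].pack("U").bytes
  hex = utf8_bytes(cp).map { |b| format("%02X", b) }.join(" ")
  puts format("U+%04X -> %s (matches built-in: %s)", cp, hex, utf8_bytes(cp) == builtin)
end
```

Every continuation byte starts with the bits 10, which is what makes the encoding unambiguous when reading starts in the middle of a stream.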
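The UTF-16 surrogate rule from the slides (subtract 0x10000, then spread the remaining 20 bits over two 16-bit units) might look as follows in Ruby. utf16_units is a hypothetical helper name chosen for this sketch; the comparison at the end relies on the standard String#encode and String#unpack.

```ruby
# Sketch of the UTF-16 surrogate computation; utf16_units is a hypothetical helper.
def utf16_units(cp)
  raise ArgumentError, "code point out of range: #{cp}" unless (0..0x10FFFF).cover?(cp)
  return [cp] if cp < 0x10000           # BMP: one 16-bit unit, same as UCS-2

  v = cp - 0x10000                      # 20 bits remain
  [0xD800 | (v >> 10),                  # 1101 10xx xxxx xxxx (high surrogate)
   0xDC00 | (v & 0x3FF)]                # 1101 11xx xxxx xxxx (low surrogate)
end

cp = 0x1F4A9
units = utf16_units(cp)
puts units.map { |u| format("%04X", u) }.join(" ")

# Cross-check against Ruby's own UTF-16BE encoder.
builtin = [cp].pack("U").encode("UTF-16BE").unpack("n*")
puts units == builtin
```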
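The difference between the normalization forms named in the slides can be made visible with String#unicode_normalize. Note the assumption: this method exists only in Ruby 2.2 and later, so it is newer than the Ruby 1.9/2.0 versions the slides discuss; on those versions the "UTF-8-MAC" conversion shown in the HFS+ example is the available route.

```ruby
# Assumes Ruby >= 2.2, where String#unicode_normalize is available.
nfc = "M\u00FCller"                     # ü as one precomposed code point
nfd = nfc.unicode_normalize(:nfd)       # u followed by combining diaeresis U+0308

show = ->(s) { s.codepoints.map { |c| format("U+%04X", c) }.join(" ") }
puts show.call(nfc)
puts show.call(nfd)

# The two strings look the same on screen but compare as different...
puts nfc == nfd
# ...until both are brought to the same normalization form.
puts nfc == nfd.unicode_normalize(:nfc)

# The compatibility forms (NFKD, NFKC) additionally fold presentation
# variants, for example the single-code-point "fi" ligature U+FB01.
puts show.call("\uFB01".unicode_normalize(:nfkc))
```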
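The warning about Windows-1252 leftovers can be illustrated with a small repair routine: code points U+0080 to U+009F in a Unicode string are reinterpreted as Windows-1252 bytes and converted properly. repair_c1 is a hypothetical helper name, and the approach assumes that such code points really are mislabelled Windows-1252 bytes rather than intentional C1 control characters.

```ruby
# Sketch: map stray C1 code points back to the characters Windows-1252 meant.
# repair_c1 is a hypothetical helper, not a library function.
def repair_c1(text)
  text.gsub(/[\u0080-\u009F]/) do |ch|
    byte = [ch.ord].pack("C")           # the original single byte
    # Bytes that Windows-1252 leaves unassigned become the replacement character.
    byte.force_encoding("Windows-1252").encode("UTF-8", undef: :replace)
  end
end

broken = "\u0084Hallo\u0093 \u0096 5 \u0080"
puts repair_c1(broken)
```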
