Programming with Unicode Documentation Release 2011
Total Page:16
File Type:pdf, Size:1020Kb
Programming with Unicode Documentation Release 2011 Victor Stinner Oct 01, 2019 Contents 1 About this book 1 1.1 License..................................................1 1.2 Thanks to.................................................1 1.3 Notations.................................................1 2 Unicode nightmare 3 3 Definitions 5 3.1 Character.................................................5 3.2 Glyph...................................................5 3.3 Code point................................................5 3.4 Character set (charset)..........................................5 3.5 Character string.............................................6 3.6 Byte string................................................6 3.7 UTF-8 encoded strings and UTF-16 character strings..........................7 3.8 Encoding.................................................7 3.9 Encode a character string.........................................7 3.10 Decode a byte string...........................................8 3.11 Mojibake.................................................8 3.12 Unicode: an Universal Character Set (UCS)...............................9 4 Unicode 11 4.1 Unicode Character Set.......................................... 11 4.2 Categories................................................ 11 4.3 Statistics................................................. 12 4.4 Normalization.............................................. 12 5 Charsets and encodings 15 5.1 Encodings................................................ 15 5.2 Popularity................................................ 15 5.3 Encodings performances......................................... 16 5.4 Examples................................................. 16 5.5 Handle undecodable bytes and unencodable characters......................... 16 5.5.1 Undecodable byte sequences.................................. 16 5.5.2 Unencodable characters..................................... 16 5.5.3 Error handlers.......................................... 17 5.5.4 Replace unencodable characters by a similar glyph...................... 17 i 5.5.5 Escape the character...................................... 17 5.6 Other charsets and encodings...................................... 17 6 Historical charsets and encodings 19 6.1 ASCII................................................... 19 6.2 ISO 8859 family............................................. 20 6.2.1 ISO 8859-1........................................... 21 6.2.2 cp1252............................................. 21 6.2.3 ISO 8859-15.......................................... 22 6.3 CJK: asian encodings.......................................... 22 6.3.1 Chinese encodings....................................... 22 6.3.2 Japanese encodings....................................... 22 6.3.3 ISO 2022............................................ 23 6.3.4 Extended Unix Code (EUC).................................. 23 6.4 Cyrillic.................................................. 24 7 Unicode encodings 25 7.1 UTF-8.................................................. 25 7.2 UCS-2, UCS-4, UTF-16 and UTF-32.................................. 25 7.3 UTF-7.................................................. 26 7.4 Byte order marks (BOM)......................................... 26 7.5 UTF-16 surrogate pairs.......................................... 26 8 How to guess the encoding of a document? 29 8.1 Is ASCII?................................................. 29 8.2 Check for BOM markers......................................... 30 8.3 Is UTF-8?................................................. 31 8.4 Libraries................................................. 33 9 Good practices 35 9.1 Rules................................................... 35 9.2 Unicode support levels.......................................... 35 9.3 Test the Unicode support of a program................................. 36 9.4 Get the encoding of your inputs..................................... 36 9.5 Switch from byte strings to character strings.............................. 37 10 Operating systems 39 10.1 Windows................................................. 39 10.1.1 Code pages........................................... 39 10.1.2 Encode and decode functions.................................. 40 10.1.3 Windows API: ANSI and wide versions............................ 42 10.1.4 Windows string types...................................... 42 10.1.5 Filenames............................................ 42 10.1.6 Windows console........................................ 42 10.1.7 File mode............................................ 43 10.2 Mac OS X................................................ 43 10.3 Locales.................................................. 43 10.3.1 Locale categories........................................ 44 10.3.2 The C locale........................................... 44 10.3.3 Locale encoding......................................... 44 10.3.4 Locale functions........................................ 44 10.4 Filesystems (filenames)......................................... 45 10.4.1 CD-ROM and DVD....................................... 45 10.4.2 Microsoft: FAT and NTFS filesystems............................. 45 10.4.3 Apple: HFS and HFS+ filesystems............................... 45 ii 10.4.4 Others.............................................. 46 11 Programming languages 47 11.1 C language................................................ 47 11.1.1 Byte API (char)......................................... 47 11.1.2 Byte string API (char*)..................................... 48 11.1.3 Character API (wchar_t).................................... 48 11.1.4 Character string API (wchar_t*)................................ 48 11.1.5 printf functions family..................................... 48 11.2 C++.................................................... 49 11.3 Python.................................................. 50 11.3.1 Python 2............................................. 50 11.3.2 Python 3............................................. 50 11.3.3 Differences between Python 2 and Python 3.......................... 51 11.3.4 Codecs............................................. 51 11.3.5 String methods......................................... 52 11.3.6 Filesystem............................................ 52 11.3.7 Windows............................................ 52 11.3.8 Modules............................................. 52 11.4 PHP.................................................... 53 11.5 Perl.................................................... 54 11.6 Java.................................................... 55 11.7 Go and D................................................. 55 12 Database systems 57 12.1 MySQL.................................................. 57 12.2 PostgreSQL................................................ 57 12.3 SQLite.................................................. 57 13 Libraries 59 13.1 Qt library................................................. 59 13.1.1 Character and string classes................................... 59 13.1.2 Codec.............................................. 60 13.1.3 Filesystem............................................ 60 13.2 The glib library.............................................. 60 13.2.1 Character strings........................................ 60 13.2.2 Codec functions......................................... 60 13.2.3 Filename functions....................................... 61 13.3 iconv library............................................... 61 13.4 ICU libraries............................................... 61 13.5 libunistring................................................ 61 14 Unicode issues 63 14.1 Security vulnerabilities.......................................... 63 14.1.1 Special characters........................................ 63 14.1.2 Non-strict UTF-8 decoder: overlong byte sequences and surrogates.............. 63 14.1.3 Check byte strings before decoding them to character strings................. 64 15 See also 65 Index 67 iii iv CHAPTER 1 About this book The book is written in reStructuredText (reST) syntax and compiled by Sphinx. I started to write in the 25th September 2010. 1.1 License This book is distributed under the CC BY-SA 3.0 license. 1.2 Thanks to Reviewers: Alexander Belopolsky, Antoine Pitrou, Feth Arezki and Nelle Varoquaux, Natal Ngétal. 1.3 Notations • 0bBBBBBBBB: 8 bit unsigned number written in binary, first digit is the most significant. For example, 0b10000000 is 128. • 0xHHHH: number written in hexadecimal, e.g. 0xFFFF is 65535. • 0xHH 0xHH ...: byte sequence with bytes written in hexadecimal, e.g. 0xC3 0xA9 (2 bytes) is the char- acter é (U+00E9) encoded to UTF-8. • U+HHHH: Unicode character with its code point written in hexadecimal. For example, U+20AC is the “euro sign” character, code point 8,364. Big code point are written with more than 4 hexadecimal digits, e.g. U+10FFFF is the biggest (unallocated) code point of Unicode Character Set 6.0: 1,114,111. • A—B: range including start and end. Examples: – 0x00—0x7F is the range 0 through 127 (128 bytes) – U+0000—U+00FF is the range 0 through 255 (256 characters) 1 Programming with Unicode Documentation, Release 2011 • {U+HHHH, U+HHHH, . }: a character string. For example, {U+0041, U+0042, U+0043} is the string “abc” (3 characters). 2 Chapter 1. About this book CHAPTER 2 Unicode nightmare Unicode is the nightmare of many developers (and users) for different, and sometimes good reasons. In the 1980s, only few people read documents