Programming with Unicode Documentation Release 2011

Programming with Unicode Documentation Release 2011 Victor Stinner Oct 01, 2019 Contents 1 About this book 1 1.1 License..................................................1 1.2 Thanks to.................................................1 1.3 Notations.................................................1 2 Unicode nightmare 3 3 Definitions 5 3.1 Character.................................................5 3.2 Glyph...................................................5 3.3 Code point................................................5 3.4 Character set (charset)..........................................5 3.5 Character string.............................................6 3.6 Byte string................................................6 3.7 UTF-8 encoded strings and UTF-16 character strings..........................7 3.8 Encoding.................................................7 3.9 Encode a character string.........................................7 3.10 Decode a byte string...........................................8 3.11 Mojibake.................................................8 3.12 Unicode: an Universal Character Set (UCS)...............................9 4 Unicode 11 4.1 Unicode Character Set.......................................... 11 4.2 Categories................................................ 11 4.3 Statistics................................................. 12 4.4 Normalization.............................................. 12 5 Charsets and encodings 15 5.1 Encodings................................................ 15 5.2 Popularity................................................ 15 5.3 Encodings performances......................................... 16 5.4 Examples................................................. 16 5.5 Handle undecodable bytes and unencodable characters......................... 16 5.5.1 Undecodable byte sequences.................................. 16 5.5.2 Unencodable characters..................................... 16 5.5.3 Error handlers.......................................... 17 5.5.4 Replace unencodable characters by a similar glyph...................... 17 i 5.5.5 Escape the character...................................... 17 5.6 Other charsets and encodings...................................... 17 6 Historical charsets and encodings 19 6.1 ASCII................................................... 19 6.2 ISO 8859 family............................................. 20 6.2.1 ISO 8859-1........................................... 21 6.2.2 cp1252............................................. 21 6.2.3 ISO 8859-15.......................................... 22 6.3 CJK: asian encodings.......................................... 22 6.3.1 Chinese encodings....................................... 22 6.3.2 Japanese encodings....................................... 22 6.3.3 ISO 2022............................................ 23 6.3.4 Extended Unix Code (EUC).................................. 23 6.4 Cyrillic.................................................. 24 7 Unicode encodings 25 7.1 UTF-8.................................................. 25 7.2 UCS-2, UCS-4, UTF-16 and UTF-32.................................. 25 7.3 UTF-7.................................................. 26 7.4 Byte order marks (BOM)......................................... 26 7.5 UTF-16 surrogate pairs.......................................... 26 8 How to guess the encoding of a document? 29 8.1 Is ASCII?................................................. 29 8.2 Check for BOM markers......................................... 30 8.3 Is UTF-8?................................................. 31 8.4 Libraries................................................. 33 9 Good practices 35 9.1 Rules................................................... 35 9.2 Unicode support levels.......................................... 35 9.3 Test the Unicode support of a program................................. 36 9.4 Get the encoding of your inputs..................................... 36 9.5 Switch from byte strings to character strings.............................. 37 10 Operating systems 39 10.1 Windows................................................. 39 10.1.1 Code pages........................................... 39 10.1.2 Encode and decode functions.................................. 40 10.1.3 Windows API: ANSI and wide versions............................ 42 10.1.4 Windows string types...................................... 42 10.1.5 Filenames............................................ 42 10.1.6 Windows console........................................ 42 10.1.7 File mode............................................ 43 10.2 Mac OS X................................................ 43 10.3 Locales.................................................. 43 10.3.1 Locale categories........................................ 44 10.3.2 The C locale........................................... 44 10.3.3 Locale encoding......................................... 44 10.3.4 Locale functions........................................ 44 10.4 Filesystems (filenames)......................................... 45 10.4.1 CD-ROM and DVD....................................... 45 10.4.2 Microsoft: FAT and NTFS filesystems............................. 45 10.4.3 Apple: HFS and HFS+ filesystems............................... 45 ii 10.4.4 Others.............................................. 46 11 Programming languages 47 11.1 C language................................................ 47 11.1.1 Byte API (char)......................................... 47 11.1.2 Byte string API (char*)..................................... 48 11.1.3 Character API (wchar_t).................................... 48 11.1.4 Character string API (wchar_t*)................................ 48 11.1.5 printf functions family..................................... 48 11.2 C++.................................................... 49 11.3 Python.................................................. 50 11.3.1 Python 2............................................. 50 11.3.2 Python 3............................................. 50 11.3.3 Differences between Python 2 and Python 3.......................... 51 11.3.4 Codecs............................................. 51 11.3.5 String methods......................................... 52 11.3.6 Filesystem............................................ 52 11.3.7 Windows............................................ 52 11.3.8 Modules............................................. 52 11.4 PHP.................................................... 53 11.5 Perl.................................................... 54 11.6 Java.................................................... 55 11.7 Go and D................................................. 55 12 Database systems 57 12.1 MySQL.................................................. 57 12.2 PostgreSQL................................................ 57 12.3 SQLite.................................................. 57 13 Libraries 59 13.1 Qt library................................................. 59 13.1.1 Character and string classes................................... 59 13.1.2 Codec.............................................. 60 13.1.3 Filesystem............................................ 60 13.2 The glib library.............................................. 60 13.2.1 Character strings........................................ 60 13.2.2 Codec functions......................................... 60 13.2.3 Filename functions....................................... 61 13.3 iconv library............................................... 61 13.4 ICU libraries............................................... 61 13.5 libunistring................................................ 61 14 Unicode issues 63 14.1 Security vulnerabilities.......................................... 63 14.1.1 Special characters........................................ 63 14.1.2 Non-strict UTF-8 decoder: overlong byte sequences and surrogates.............. 63 14.1.3 Check byte strings before decoding them to character strings................. 64 15 See also 65 Index 67 iii iv CHAPTER 1 About this book The book is written in reStructuredText (reST) syntax and compiled by Sphinx. I started to write in the 25th September 2010. 1.1 License This book is distributed under the CC BY-SA 3.0 license. 1.2 Thanks to Reviewers: Alexander Belopolsky, Antoine Pitrou, Feth Arezki and Nelle Varoquaux, Natal Ngétal. 1.3 Notations • 0bBBBBBBBB: 8 bit unsigned number written in binary, first digit is the most significant. For example, 0b10000000 is 128. • 0xHHHH: number written in hexadecimal, e.g. 0xFFFF is 65535. • 0xHH 0xHH ...: byte sequence with bytes written in hexadecimal, e.g. 0xC3 0xA9 (2 bytes) is the character é (U+00E9) encoded to UTF-8. • U+HHHH: Unicode character with its code point written in hexadecimal. For example, U+20AC is the “euro sign” character, code point 8,364. Big code point are written with more than 4 hexadecimal digits, e.g. U+10FFFF is the biggest (unallocated) code point of Unicode Character Set 6.0: 1,114,111. • A—B: range including start and end. Examples: – 0x00—0x7F is the range 0 through 127 (128 bytes) – U+0000—U+00FF is the range 0 through 255 (256 characters) 1 Programming with Unicode Documentation, Release 2011 • {U+HHHH, U+HHHH, . }: a character string. For example, {U+0041, U+0042, U+0043} is the string “abc” (3 characters). 2 Chapter 1. About this book CHAPTER 2 Unicode nightmare Unicode is the nightmare of many developers (and users) for different, and sometimes good reasons. In the 1980s, only few people read documents

Programming with Unicode Documentation Release 2011

Cumberland Tech Ref.Book

Tilde-Arrow-Out (~→O)

SLP-5 Setup for Uni-3/5/7 & WM-Nano Chinese BIG5 Characters

Edit Bibliographic Records

About:Config .Init About:Me About:Presentation Web 2.0 User

Consonant Characters and Inherent Vowels

Proposal to Add U+2B95 Rightwards Black Arrow to Unicode Emoji

SUPPORTING the CHINESE, JAPANESE, and KOREAN LANGUAGES in the OPENVMS OPERATING SYSTEM by Michael M. T. Yau ABSTRACT the Asian L

ISO Basic Latin Alphabet

PCL PC-8, Code Page 437 Page 1 of 5 PCL PC-8, Code Page 437

Unicode and Code Page Support

NCSA Telnet for the Macintosh User's Guide