Programming with Unicode Documentation Release 2011

Programming with Unicode Documentation Release 2011

Programming with Unicode Documentation Release 2011 Victor Stinner Oct 01, 2019 Contents 1 About this book 1 1.1 License..................................................1 1.2 Thanks to.................................................1 1.3 Notations.................................................1 2 Unicode nightmare 3 3 Definitions 5 3.1 Character.................................................5 3.2 Glyph...................................................5 3.3 Code point................................................5 3.4 Character set (charset)..........................................5 3.5 Character string.............................................6 3.6 Byte string................................................6 3.7 UTF-8 encoded strings and UTF-16 character strings..........................7 3.8 Encoding.................................................7 3.9 Encode a character string.........................................7 3.10 Decode a byte string...........................................8 3.11 Mojibake.................................................8 3.12 Unicode: an Universal Character Set (UCS)...............................9 4 Unicode 11 4.1 Unicode Character Set.......................................... 11 4.2 Categories................................................ 11 4.3 Statistics................................................. 12 4.4 Normalization.............................................. 12 5 Charsets and encodings 15 5.1 Encodings................................................ 15 5.2 Popularity................................................ 15 5.3 Encodings performances......................................... 16 5.4 Examples................................................. 16 5.5 Handle undecodable bytes and unencodable characters......................... 16 5.5.1 Undecodable byte sequences.................................. 16 5.5.2 Unencodable characters..................................... 16 5.5.3 Error handlers.......................................... 17 5.5.4 Replace unencodable characters by a similar glyph...................... 17 i 5.5.5 Escape the character...................................... 17 5.6 Other charsets and encodings...................................... 17 6 Historical charsets and encodings 19 6.1 ASCII................................................... 19 6.2 ISO 8859 family............................................. 20 6.2.1 ISO 8859-1........................................... 21 6.2.2 cp1252............................................. 21 6.2.3 ISO 8859-15.......................................... 22 6.3 CJK: asian encodings.......................................... 22 6.3.1 Chinese encodings....................................... 22 6.3.2 Japanese encodings....................................... 22 6.3.3 ISO 2022............................................ 23 6.3.4 Extended Unix Code (EUC).................................. 23 6.4 Cyrillic.................................................. 24 7 Unicode encodings 25 7.1 UTF-8.................................................. 25 7.2 UCS-2, UCS-4, UTF-16 and UTF-32.................................. 25 7.3 UTF-7.................................................. 26 7.4 Byte order marks (BOM)......................................... 26 7.5 UTF-16 surrogate pairs.......................................... 26 8 How to guess the encoding of a document? 29 8.1 Is ASCII?................................................. 29 8.2 Check for BOM markers......................................... 30 8.3 Is UTF-8?................................................. 31 8.4 Libraries................................................. 33 9 Good practices 35 9.1 Rules................................................... 35 9.2 Unicode support levels.......................................... 35 9.3 Test the Unicode support of a program................................. 36 9.4 Get the encoding of your inputs..................................... 36 9.5 Switch from byte strings to character strings.............................. 37 10 Operating systems 39 10.1 Windows................................................. 39 10.1.1 Code pages........................................... 39 10.1.2 Encode and decode functions.................................. 40 10.1.3 Windows API: ANSI and wide versions............................ 42 10.1.4 Windows string types...................................... 42 10.1.5 Filenames............................................ 42 10.1.6 Windows console........................................ 42 10.1.7 File mode............................................ 43 10.2 Mac OS X................................................ 43 10.3 Locales.................................................. 43 10.3.1 Locale categories........................................ 44 10.3.2 The C locale........................................... 44 10.3.3 Locale encoding......................................... 44 10.3.4 Locale functions........................................ 44 10.4 Filesystems (filenames)......................................... 45 10.4.1 CD-ROM and DVD....................................... 45 10.4.2 Microsoft: FAT and NTFS filesystems............................. 45 10.4.3 Apple: HFS and HFS+ filesystems............................... 45 ii 10.4.4 Others.............................................. 46 11 Programming languages 47 11.1 C language................................................ 47 11.1.1 Byte API (char)......................................... 47 11.1.2 Byte string API (char*)..................................... 48 11.1.3 Character API (wchar_t).................................... 48 11.1.4 Character string API (wchar_t*)................................ 48 11.1.5 printf functions family..................................... 48 11.2 C++.................................................... 49 11.3 Python.................................................. 50 11.3.1 Python 2............................................. 50 11.3.2 Python 3............................................. 50 11.3.3 Differences between Python 2 and Python 3.......................... 51 11.3.4 Codecs............................................. 51 11.3.5 String methods......................................... 52 11.3.6 Filesystem............................................ 52 11.3.7 Windows............................................ 52 11.3.8 Modules............................................. 52 11.4 PHP.................................................... 53 11.5 Perl.................................................... 54 11.6 Java.................................................... 55 11.7 Go and D................................................. 55 12 Database systems 57 12.1 MySQL.................................................. 57 12.2 PostgreSQL................................................ 57 12.3 SQLite.................................................. 57 13 Libraries 59 13.1 Qt library................................................. 59 13.1.1 Character and string classes................................... 59 13.1.2 Codec.............................................. 60 13.1.3 Filesystem............................................ 60 13.2 The glib library.............................................. 60 13.2.1 Character strings........................................ 60 13.2.2 Codec functions......................................... 60 13.2.3 Filename functions....................................... 61 13.3 iconv library............................................... 61 13.4 ICU libraries............................................... 61 13.5 libunistring................................................ 61 14 Unicode issues 63 14.1 Security vulnerabilities.......................................... 63 14.1.1 Special characters........................................ 63 14.1.2 Non-strict UTF-8 decoder: overlong byte sequences and surrogates.............. 63 14.1.3 Check byte strings before decoding them to character strings................. 64 15 See also 65 Index 67 iii iv CHAPTER 1 About this book The book is written in reStructuredText (reST) syntax and compiled by Sphinx. I started to write in the 25th September 2010. 1.1 License This book is distributed under the CC BY-SA 3.0 license. 1.2 Thanks to Reviewers: Alexander Belopolsky, Antoine Pitrou, Feth Arezki and Nelle Varoquaux, Natal Ngétal. 1.3 Notations • 0bBBBBBBBB: 8 bit unsigned number written in binary, first digit is the most significant. For example, 0b10000000 is 128. • 0xHHHH: number written in hexadecimal, e.g. 0xFFFF is 65535. • 0xHH 0xHH ...: byte sequence with bytes written in hexadecimal, e.g. 0xC3 0xA9 (2 bytes) is the char- acter é (U+00E9) encoded to UTF-8. • U+HHHH: Unicode character with its code point written in hexadecimal. For example, U+20AC is the “euro sign” character, code point 8,364. Big code point are written with more than 4 hexadecimal digits, e.g. U+10FFFF is the biggest (unallocated) code point of Unicode Character Set 6.0: 1,114,111. • A—B: range including start and end. Examples: – 0x00—0x7F is the range 0 through 127 (128 bytes) – U+0000—U+00FF is the range 0 through 255 (256 characters) 1 Programming with Unicode Documentation, Release 2011 • {U+HHHH, U+HHHH, . }: a character string. For example, {U+0041, U+0042, U+0043} is the string “abc” (3 characters). 2 Chapter 1. About this book CHAPTER 2 Unicode nightmare Unicode is the nightmare of many developers (and users) for different, and sometimes good reasons. In the 1980s, only few people read documents

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    73 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us