UTF-8 from Wikipedia, the Free Encyclopedia
UTF-8 is a character encoding capable of encoding all possible characters, or code points, defined by Unicode. It was originally designed by Ken Thompson and Rob Pike.[1] The encoding is variable-length and uses 8-bit code units. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in the alternative UTF-16 and UTF-32 encodings. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.[2]

UTF-8 is the dominant character encoding for the World Wide Web, accounting for 89.1% of all Web pages in May 2017 (the most popular East Asian encodings, Shift JIS and GB 2312, have 0.9% and 0.7% respectively).[3][4][5] The Internet Mail Consortium (IMC) recommended that all e-mail programs be able to display and create mail using UTF-8,[6] and the W3C recommends UTF-8 as the default encoding in XML and HTML.[7]

UTF-8 encodes each of the 1,112,064[8] valid code points in Unicode using one to four 8-bit bytes.[9] Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as '/' in filenames, '\' in escape sequences, and '%' in printf.

Figure: Usage of the main encodings on the web from 2001 to 2012 as recorded by Google,[3] with UTF-8 overtaking all others in 2008 and nearing 50% of the web in 2012. Note that the ASCII-only figure includes web pages with any declared header if they are restricted to ASCII characters.

Contents
1 Description
  1.1 Examples
  1.2 Codepage layout
  1.3 Overlong encodings
  1.4 Invalid byte sequences
  1.5 Invalid code points
2 Official name and variants
3 Derivatives
  3.1 CESU-8
  3.2 Modified UTF-8
  3.3 WTF-8
4 Byte order mark
5 History
6 Advantages and disadvantages
  6.1 General
    6.1.1 Advantages
  6.2 Comparison with single-byte encodings
    6.2.1 Advantages
    6.2.2 Disadvantages
  6.3 Comparison with other multi-byte encodings
    6.3.1 Advantages
    6.3.2 Disadvantages
  6.4 Comparison with UTF-16
    6.4.1 Advantages
    6.4.2 Disadvantages
7 See also
8 Notes
9 References
10 External links

Description

Since the restriction of the Unicode code-space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, depending on the number of significant bits in the numerical value of the code point. The following table shows the structure of the encoding. The x characters are replaced by the bits of the code point. If the number of significant bits is no more than 7, the first line applies; if no more than 11 bits, the second line applies, and so on.

Number of bytes  Bits for code point  First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
1                7                    U+0000            U+007F           0xxxxxxx
2                11                   U+0080            U+07FF           110xxxxx  10xxxxxx
3                16                   U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
4                21                   U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode, which covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use[10] including most Chinese, Japanese and Korean characters. Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
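The layout table can be turned into a short encoder. The following is a minimal Python sketch rather than a reference implementation: the function name `encode_code_point` is hypothetical, and rejecting the surrogate range U+D800 to U+DFFF is an assumption that matches the behaviour of Python's built-in encoder.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the layout table."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption: surrogates are rejected, as Python's own encoder does.
        raise ValueError(f"surrogate code point: {cp:#x}")
    if cp <= 0x7F:
        # At most 7 significant bits: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # At most 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # At most 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # At most 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


if __name__ == "__main__":
    # Compare the sketch with Python's built-in encoder at each range boundary.
    for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
        encoded = encode_code_point(cp)
        hex_bytes = " ".join(f"{b:02X}" for b in encoded)
        print(f"U+{cp:04X} -> {hex_bytes}", encoded == chr(cp).encode("utf-8"))
```

Each branch corresponds to one row of the table: the lead byte carries the length marker plus the highest bits, and every following byte carries six further bits behind the 10 prefix.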
The salient features of this scheme are as follows:

Backward compatibility: One-byte codes are used for the ASCII values 0 through 127, so ASCII text is valid UTF-8. Bytes in this range are not used anywhere else, so UTF-8 text can be processed by software that can handle extended ASCII but only applies special meaning to ASCII characters, as it will not accidentally see those ASCII characters in the middle of a multi-byte character.

Clear indication of byte sequence length: The first byte indicates the number of bytes in the sequence. This makes UTF-8 a prefix code: reading from a stream can instantaneously decode each individual fully received sequence, without first having to wait for either the first byte of a next sequence or an end-of-stream indication. The length of multi-byte sequences is easily determined as it is simply the number of high-order 1s in the leading byte.

Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte.

Sorting order: The chosen values of the leading bytes and the fact that the continuation bytes have the high-order bits first means that sorting a list of UTF-8 strings will produce the same order as sorting the equivalent UTF-32 strings.

Examples

Consider the encoding of the Euro sign, €.

1. The Unicode code point for "€" is U+20AC.
2. According to the scheme table above, this will take three bytes to encode, since it is between U+0800 and U+FFFF.
3. Hexadecimal 20AC is binary 0010 0000 1010 1100. The two leading zeros are added because, as the scheme table shows, a three-byte encoding needs exactly sixteen bits from the code point.
4. Because the encoding will be three bytes long, its leading byte starts with three 1s, then a 0 (1110...).
5. The first four bits of the code point are stored in the remaining low-order four bits of this byte (1110 0010), leaving 12 bits of the code point yet to be encoded (...0000 1010 1100).
6. All continuation bytes contain exactly six bits from the code point. So the next six bits of the code point are stored in the low-order six bits of the next byte, and 10 is stored in the high-order two bits to mark it as a continuation byte (so 1000 0010).
7. Finally the last six bits of the code point are stored in the low-order six bits of the final byte, and again 10 is stored in the high-order two bits (1010 1100).

The three bytes 1110 0010 1000 0010 1010 1100 can be more concisely written in hexadecimal, as E2 82 AC.

Since UTF-8 uses groups of six bits, it is sometimes useful to use octal notation, which uses 3-bit groups. With a calculator which can convert between hexadecimal and octal it can be easier to manually create or interpret UTF-8 compared with using binary.

Octal 0200–3777 (hex 80–7FF) shall be coded with two bytes. xxyy will be 3xx 2yy.
Octal 4000–177777 (hex 800–FFFF) shall be coded with three bytes. xxyyzz will be (340+xx) 2yy 2zz.
Octal 200000–4177777 (hex 10000–10FFFF) shall be coded with four bytes. wxxyyzz will be 36w 2xx 2yy 2zz.

The following table summarises this conversion, as well as others with different lengths in UTF-8. In the original table, colors indicate how bits from the code point are distributed among the UTF-8 bytes, and the additional bits added by the UTF-8 encoding process are shown in black.
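Before the table, the octal rules above can be checked mechanically. The following Python sketch splits a code point's octal digits into the groups named in the rules and rebuilds the bytes from them. The helper name `utf8_from_octal` is hypothetical, and the sketch illustrates the notation only: it is not a recommended way to encode text and, as an assumption made for brevity, it does not reject surrogate code points.

```python
def utf8_from_octal(cp: int) -> bytes:
    """Build the UTF-8 bytes of a multi-byte code point from its octal digits.

    Note: surrogates (U+D800 to U+DFFF) are not rejected in this sketch.
    """
    if 0o200 <= cp <= 0o3777:
        # Two bytes: xxyy -> 3xx 2yy
        digits = f"{cp:04o}"
        xx, yy = digits[:2], digits[2:]
        octal_bytes = ["3" + xx, "2" + yy]
    elif 0o4000 <= cp <= 0o177777:
        # Three bytes: xxyyzz -> (340+xx) 2yy 2zz
        digits = f"{cp:06o}"
        xx, yy, zz = digits[:2], digits[2:4], digits[4:]
        octal_bytes = [f"{0o340 + int(xx, 8):o}", "2" + yy, "2" + zz]
    elif 0o200000 <= cp <= 0o4177777:
        # Four bytes: wxxyyzz -> 36w 2xx 2yy 2zz
        digits = f"{cp:07o}"
        w, xx, yy, zz = digits[0], digits[1:3], digits[3:5], digits[5:]
        octal_bytes = ["36" + w, "2" + xx, "2" + yy, "2" + zz]
    else:
        raise ValueError("single-byte or out-of-range code point")
    return bytes(int(o, 8) for o in octal_bytes)


if __name__ == "__main__":
    # U+00A2, U+20AC and U+10348 cover the two-, three- and four-byte rules.
    for cp in (0xA2, 0x20AC, 0x10348):
        encoded = utf8_from_octal(cp)
        octal = " ".join(f"{b:03o}" for b in encoded)
        print(f"U+{cp:04X} octal {cp:o} -> {octal}",
              encoded == chr(cp).encode("utf-8"))
```

The string concatenation works because each continuation byte holds exactly six bits, which is two octal digits, so the 10 prefix always appears as a leading octal 2.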
Character   Octal code point  Binary code point           Binary UTF-8                         Octal UTF-8      Hexadecimal UTF-8
$ U+0024    044               010 0100                    00100100                             044              24
¢ U+00A2    0242              000 1010 0010               11000010 10100010                    302 242          C2 A2
€ U+20AC    020254            0010 0000 1010 1100         11100010 10000010 10101100           342 202 254      E2 82 AC
𐍈 U+10348   0201510           0 0001 0000 0011 0100 1000  11110000 10010000 10001101 10001000  360 220 215 210  F0 90 8D 88

Codepage layout

The following table summarizes usage of UTF-8 code units (individual bytes or octets) in a code page format. The upper half (0_ to 7_) is for bytes only used in single-byte codes, so it looks like a normal code page; the lower half is for continuation bytes (8_ to B_) and (possible) leading bytes (C_ to F_), and is explained further in the legend below.

UTF-8 _0   _1   _2   _3   _4   _5   _6   _7   _8   _9   _A   _B   _C   _D   _E   _F
0_    NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
      0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 000A 000B 000C 000D 000E 000F
      0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
1_    DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
      0010 0011 0012 0013 0014 0015 0016 0017 0018 0019 001A 001B 001C 001D 001E 001F
      16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31
2_    SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .