Character Encoding - the Transformation Between Code Point and Its Own Code Units

The ABC’s of Emoji
stuartlau@github, Sep 2018
https://stuartlau.github.io

Quiz
• Which language can use US-ASCII to encode all its characters?
• How many characters can char represent in Java?
• Can we use char to represent ‘[emoji]’ in Java?
• What is returned if you call “[emoji]”.length() and “[emoji]”.getBytes() in Java?
• What about “[another emoji]”?
• Can we get the emoji by calling “[emoji]c”.substring(0,1)?
• Can we execute insert into tb('name') values('[emoji]') in MySQL?

Character

How to Define Character?
• Character set - defines all readable characters
• Coded character set - assigns a code point to each character in the character repertoire
• Character encoding - the transformation between a code point and its own code units

Why do we need character encoding?

Code Point
• A unique number assigned to each Unicode character
• Usually expressed in hexadecimal as U+xxxx, e.g. the code point for A is U+0041

Planes
• 17 planes
• 136,755 characters defined
• U+0000~U+10FFFF, 21-bit
• Supports over 1.1M possible characters

BMP
• Basic Multilingual Plane, U+0000~U+FFFF, 65,536 code points in total

Supplementary Characters
• Code points between U+10000 and U+10FFFF are the supplementary characters
• They cannot be described as a single 16-bit entity

Character Encoding
• A mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units
• The most commonly used code units are bytes, but 16-bit and 32-bit integers can also be used for internal processing
• UTF-32, UTF-16 and UTF-8 are character encoding schemes for the Unicode standard

UTF-32
• UTF-32 encodes each Unicode character as one 32-bit code unit, e.g.
  A  00 00 00 41
• It’s the most convenient representation for internal processing
• But it wastes memory

UTF-16
• UTF-16 encodes each Unicode character as one or two 16-bit code units
• Each character is encoded using 2 or 4 bytes
• The internal Java encoding
• Code points between U+0000 and U+FFFF (0~65535) are represented as a single 16-bit Java char value
• e.g. U+4E2D - 中, 2 bytes, char c = '中'
• Code points between U+10000 and U+10FFFF are the supplementary characters, which a single char in Java cannot hold

Hello in UTF-16
  00 48 00 65 00 6C 00 6C 00 6F   H e l l o (big-endian)
  48 00 65 00 6C 00 6C 00 6F 00   H e l l o (little-endian: the two bytes of each code unit are swapped)
Endianness

BOM
• BOM = Byte Order Mark, the code point U+FEFF
• Appears at the start of Unicode text
• Big-endian text starts with the bytes FE FF
• Little-endian text starts with the bytes FF FE
• UTF-16 and UTF-32 have to deal with the issue of BE and LE, because they use multi-byte code units

Java and Supplementary Characters
• Unicode was originally designed as a fixed-width 16-bit character encoding
• Java originally held every Unicode character in a single char
• Unicode was later extended to 1,114,112 code points (21-bit); Unicode 3.1 was the first version to assign supplementary characters
• J2SE 5.0 supports version 4.0 of the Unicode standard

UTF-8
• 8-bit, variable-width encoding
• Encodes each Unicode character using 1 to 4 bytes
• String constants in .class files are encoded using (modified) UTF-8
• No BOM needed

UTF-8 Encoding
  Byte1     Byte2     Byte3     Byte4     Available Bits  Sample
  0xxxxxxx                                7               abc
  110xxxxx  10xxxxxx                      11              āō
  1110xxxx  10xxxxxx  10xxxxxx            16              汉字
  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx  21              [emoji]

UTF-8 to UTF-16
• For byte sequences whose first byte:
• starts with 0
• e.g. 0xxxxxxx ==> 00000000 0xxxxxxx
• starts with 110
• e.g. 110xxxxx 10yyyyyy ==> 00000xxx xxyyyyyy
• starts with 1110
• e.g. 1110xxxx 10yyyyyy 10zzzzzz ==> xxxxyyyy yyzzzzzz
• e.g.
“中”
• Unicode U+4E2D: 01001110 00101101
• UTF-8 E4 B8 AD: 11100100 10111000 10101101

UTF-16 to UTF-8
• For code units:
• Up to 0x007F (00000000 01111111)
• e.g. 00000000 0xxxxxxx ==> 0xxxxxxx
• Up to 0x07FF (00000111 11111111)
• e.g. 00000aaa bbbbbbbb ==> 110aaabb 10bbbbbb
• Others
• e.g. aaaaaaaa bbbbbbbb ==> 1110aaaa 10aaaabb 10bbbbbb
• e.g. “中”
• Unicode U+4E2D: 01001110 00101101
• UTF-8 E4 B8 AD: 11100100 10111000 10101101

Emoji History
• In 1999, Shigetaka Kurita created the first collection of about 180 emoji for a Japanese mobile web platform
• Oxford Dictionaries named the “face with tears of joy” emoji its Word of the Year for 2015

ABC
• Pronounced /ɪˈmoʊdʒi/, from Japanese
• “e” (picture), “moji” (character)
• As of 2017 there were 2,666 emoji on the official Unicode Standard list, spread across 22 blocks

Design Guideline

Text and Colorful Shapes
• An emoji character can have two main kinds of presentation:
• An emoji presentation, with colorful and perhaps whimsical shapes, even animated
• A text presentation, such as black & white

Diversity
• Emoji display varies across operating systems, apps, versions, etc.
• Even in the same app you may get a different display

Vendor Implementations

Skin tone
• What’s the difference?
• U+1F64E
• U+1F64E U+1F3FB
• U+1F64E U+1F3FF U+200D U+2640 U+FE0F

What’s the
• U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
• U+200D ZERO WIDTH JOINER
• U+2640 FEMALE SIGN
• U+FE0F VARIATION SELECTOR-16

Emoji Modifiers
• Emoji modifier - a character that can be used to modify the appearance of a preceding emoji in an emoji modifier sequence
• Emoji modifier base - a character whose appearance can be modified by a subsequent emoji modifier in an emoji modifier sequence
• Emoji modifier sequence - a sequence of the following form:
  emoji_modifier_sequence := emoji_modifier_base emoji_modifier

Fitzpatrick Modifiers
• When one of these characters follows certain characters, a font should show the sequence as a single glyph with the specified skin tone
• If the font doesn’t show the combined character, the user can still see that a skin tone was intended

Sample Use of Fitzpatrick Modifiers

Variation Selectors
• VS is a Unicode block containing 16 variation selector format characters (designated VS1 through VS16)
• They are used to specify a specific glyph variant for a Unicode character
• At present only standardized variation sequences with VS1, VS15 and VS16 have been defined

VS-15 (U+FE0E)
• An invisible code point which specifies that the preceding character should be rendered in a textual fashion

VS-16 (U+FE0F)
• An invisible code point which specifies that the preceding character should be displayed with emoji presentation
• Only required if the preceding character defaults to text presentation
• Often used in emoji ZWJ sequences, where one or more characters in the sequence have both text and emoji presentation

ZWJ

Emoji ZWJ Sequences
• ZERO WIDTH JOINER, U+200D
• Joins characters into a single
glyph if available
• Such sequences behave like a single emoji character, even though internally they are sequences

Example
• The sequence U+1F468 👨, U+200D ZWJ, U+1F469 👩, U+200D ZWJ, U+1F467 👧
• could be displayed as a single emoji depicting a family if the implementation supports it
• or else the system would ignore the ZWJs and show the base emoji in the sequence: 👨👩👧

Multi-Person Groupings

Gender Combinations
• Some multi-person groupings explicitly indicate gender: MAN AND WOMAN HOLDING HANDS
• Others do not: KISS, COUPLE WITH HEART

Practice

“[emoji]”.length?
• It is actually composed of 4 emoji, 1 variation selector and 3 ZWJs

Example
  "👨".codePointAt(0) // 128104 -> U+1F468, the combined code point of the surrogate pair
  "👨".codePointAt(1) // 56424 -> U+DC68, the trailing surrogate on its own
• The man emoji has the code point U+1F468
• It can’t be represented in a single code unit in Java
• That’s why a surrogate pair has to be used, so the emoji consists of two code units

How does Java represent a supplementary character?

Surrogate Pair
• It is possible to combine two code points defined in the BMP to express another code point that lies outside of the first 65,536 code points. This combination is called a surrogate pair.
• Leading surrogate: U+D800~U+DBFF
• Trailing surrogate: U+DC00~U+DFFF
• The values from U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points

Deep Dive
• e.g. U+1F468 👨
• 0x1F468 - 0x10000 = 0xF468
• => 11110100 01101000
• Using 20-bit => 0000111101 0001101000
• 0xD800 + 0x3D = 0xD83D
• 0xDC00 + 0x68 = 0xDC68

Supplementary Encoding in UTF-16
• UTF-16 covers U+0000~U+FFFF using 2 bytes
• For Unicode U (U+10000~U+10FFFF):
• Subtract 0x10000 to get U’ (0x00000~0xFFFFF), 20 bits
• e.g.
U’ = yyyyyyyyyyxxxxxxxxxx
• Use W1 to represent the first 10 bits
• e.g. W1 = 110110yyyyyyyyyy, W1 in D800~DBFF
• Use W2 to represent the second 10 bits
• e.g. W2 = 110111xxxxxxxxxx, W2 in DC00~DFFF

Tricks1 - native2ascii

Tricks2 - Calculator

Tricks3 - Character Viewer

Java API
• String.codePointAt(int index):int
• Character.highSurrogate(int codePoint):char
• Character.lowSurrogate(int codePoint):char
• Character.charCount(int codePoint):int
• Character.isSupplementaryCodePoint(int codePoint):boolean
• Character.isSurrogate(char):boolean
• Character.isSurrogatePair(char, char):boolean
• …
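
The arithmetic on the Deep Dive and Supplementary Encoding in UTF-16 slides can be sketched in a few lines of Java. The class name SurrogatePairSketch and its two helper methods are illustrative names, not part of the deck or the JDK; the JDK's own Character.highSurrogate, Character.lowSurrogate and Character.toCodePoint are used only to cross-check the hand-written version.

```java
class SurrogatePairSketch {
    // Split a supplementary code point into W1 (leading) and W2 (trailing).
    // Hypothetical helper for illustration; the JDK offers Character.toChars.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not in U+10000~U+10FFFF");
        }
        int u = codePoint - 0x10000;               // U', 20 bits
        char w1 = (char) (0xD800 | (u >> 10));     // 110110 + first 10 bits
        char w2 = (char) (0xDC00 | (u & 0x3FF));   // 110111 + second 10 bits
        return new char[] { w1, w2 };
    }

    // Reverse step: rebuild the code point from a surrogate pair.
    static int fromSurrogates(char w1, char w2) {
        return 0x10000 + ((w1 - 0xD800) << 10) + (w2 - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F468);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DC68

        // Cross-check against the JDK methods listed on the Java API slide.
        System.out.println(pair[0] == Character.highSurrogate(0x1F468)); // true
        System.out.println(pair[1] == Character.lowSurrogate(0x1F468));  // true
        System.out.println(Character.isSurrogatePair(pair[0], pair[1])); // true
        System.out.println(fromSurrogates(pair[0], pair[1]) == 0x1F468); // true
    }
}
```

The result D83D DC68 matches the last two lines of the Deep Dive slide.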
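
The Quiz and Practice questions about length(), getBytes() and substring() can be tried out with the sketch below. It assumes the man emoji U+1F468 and the family sequence from the ZWJ Example slide as test data (the emoji used in the original quiz are not recoverable), builds the strings from code points so the result does not depend on source-file encoding, and passes UTF-8 explicitly to getBytes, whereas the no-argument getBytes() in the quiz uses the platform default charset. EmojiStringSketch and fromCodePoints are illustrative names.

```java
import java.nio.charset.StandardCharsets;

class EmojiStringSketch {
    // Hypothetical helper: build a String directly from code points.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String man = fromCodePoints(0x1F468);
        System.out.println(man.length());                                // 2 (UTF-16 code units)
        System.out.println(man.codePointCount(0, man.length()));         // 1 (code point)
        System.out.println(man.getBytes(StandardCharsets.UTF_8).length); // 4 (UTF-8 bytes)
        System.out.println(man.codePointAt(0));                          // 128104
        System.out.println(man.codePointAt(1));                          // 56424

        // substring works on code units, so this cuts the pair in half
        // and leaves a lone leading surrogate instead of the emoji.
        String cut = (man + "c").substring(0, 1);
        System.out.println(Character.isHighSurrogate(cut.charAt(0)));    // true

        // man, ZWJ, woman, ZWJ, girl: one glyph on screen where supported.
        String family = fromCodePoints(0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F467);
        System.out.println(family.length());                             // 8
        System.out.println(family.codePointCount(0, family.length()));   // 5
    }
}
```

To cut a string on code point boundaries rather than code unit boundaries, String.offsetByCodePoints can supply the end index for substring.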
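
As a check on the UTF-8 Encoding table, here is a minimal hand-written encoder that follows the four bit layouts row by row. It is a sketch for illustration only (Utf8Sketch, encode and hex are made-up names); real code would simply call String.getBytes(StandardCharsets.UTF_8), which main uses to confirm the result.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encode a single code point using the layouts from the UTF-8 Encoding table.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx, 7 bits
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {               // 110xxxxx 10xxxxxx, 11 bits
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp <= 0xFFFF) {              // 1110xxxx 10xxxxxx 10xxxxxx, 16 bits
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                                // 11110xxx + three continuation bytes, 21 bits
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    // Format bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x41)));     // 41
        System.out.println(hex(encode(0x4E2D)));   // E4 B8 AD
        System.out.println(hex(encode(0x1F468)));  // F0 9F 91 A8

        // Compare with the JDK encoder for the same code point.
        byte[] jdk = new String(Character.toChars(0x1F468)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(jdk, encode(0x1F468))); // true
    }
}
```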
