Character Encoding - the Transformation Between Code Point and Its Own Code Units

The ABC’s of Emoji
stuartlau@github, Sep 2018

Quiz
• Which language can use US-ASCII to encode all its characters?
• How many characters can char represent in Java?
• Can we use char to represent ‘[emoji]’ in Java?
• What is returned if you call “[emoji]”.length() and “[emoji]”.getBytes() in Java?
• What about “[emoji]”?
• Can we get the emoji by calling “[emoji]c”.substring(0,1)?
• Can we execute insert into tb(‘name’) values(‘[emoji]’) in MySQL?
https://stuartlau.github.io

Character

How to Define a Character?
• Character set - defines all readable characters
• Coded character set - uses a code point to designate each character in the character repertoire
• Character encoding - the transformation between a code point and its code units

Why do we need character encoding?

Code Point
• A unique number assigned to each Unicode character
• Usually expressed in hexadecimal as U+xxxx, e.g. the code point for A is U+0041

Planes
17 planes
136,755 characters defined
U+0000~U+10FFFF, 21-bit
Supports over 1.1M possible characters

BMP
• Basic Multilingual Plane, U+0000~U+FFFF, 65,536 code points in total

Supplementary Characters
• Code points between U+10000 and U+10FFFF are the supplementary characters
• They cannot be represented as a single 16-bit entity

Character Encoding
• A mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units
• The most commonly used code units are bytes, but 16-bit and 32-bit integers can also be used for internal processing
• UTF-32, UTF-16 and UTF-8 are character encoding schemes for the Unicode standard

UTF-32
• UTF-32 encodes each Unicode character as one 32-bit code unit, e.g.
A: 00 00 00 41
• It’s the most convenient representation for internal processing
• But it wastes memory

UTF-16
• UTF-16 encodes each Unicode character as one or two 16-bit code units
• Each character is encoded using 2 or 4 bytes
• The internal Java encoding
• Code points between U+0000 and U+FFFF (0~65535) are represented as a single 16-bit Java char value
• e.g. U+4E2D - 中, 2 bytes, char c = ‘中’
• Code points between U+10000 and U+10FFFF are the supplementary characters, which a single Java char cannot hold

Hello in UTF-16
Big-endian: 00 48 00 65 00 6C 00 6C 00 6F (H e l l o)
Little-endian: 48 00 65 00 6C 00 6C 00 6F 00
Endianness

BOM
• BOM = Byte Order Mark, the code point U+FEFF
• Appears at the start of Unicode text
• Big-endian text starts with the bytes FE FF
• Little-endian text starts with the bytes FF FE
• UTF-16 and UTF-32 have to deal with BE and LE because they use multi-byte code units

Java and Supplementary Characters
• Unicode was originally designed as a fixed-width 16-bit character encoding
• Java originally held every Unicode character in a single char
• Unicode was later extended to 1,114,112 code points (21 bits); the first supplementary characters were assigned in Unicode 3.1
• J2SE 5.0 supports version 4.0 of the Unicode standard

UTF-8
• 8-bit, variable-width encoding
• Encodes each Unicode character using 1 to 4 bytes
• .class files store string constants in (modified) UTF-8
• No BOM needed

UTF-8 Encoding
Byte1     Byte2     Byte3     Byte4     Available Bits  Sample
0xxxxxxx                                7               abc
110xxxxx  10xxxxxx                      11              āō
1110xxxx  10xxxxxx  10xxxxxx            16              汉字
11110xxx  10xxxxxx  10xxxxxx  10xxxxxx  21              [emoji]

UTF-8 to UTF-16
• For byte sequences:
• Starting with 0
  • e.g. 0xxxxxxx ==> 00000000 0xxxxxxx
• Starting with 110
  • e.g. 110xxxxx 10yyyyyy ==> 00000xxx xxyyyyyy
• Starting with 1110
  • e.g. 1110xxxx 10yyyyyy 10zzzzzz ==> xxxxyyyy yyzzzzzz
• e.g.
“中”
  • Unicode U+4E2D: 01001110 00101101
  • UTF-8 E4 B8 AD: 11100100 10111000 10101101

UTF-16 to UTF-8
• For code units:
• Up to 0x007F (00000000 01111111)
  • e.g. 00000000 0xxxxxxx ==> 0xxxxxxx
• Up to 0x07FF (00000111 11111111)
  • e.g. 00000aaa bbbbbbbb ==> 110aaabb 10bbbbbb
• Others
  • e.g. aaaaaaaa bbbbbbbb ==> 1110aaaa 10aaaabb 10bbbbbb
• e.g. “中”
  • Unicode U+4E2D: 01001110 00101101
  • UTF-8 E4 B8 AD: 11100100 10111000 10101101

Emoji

History
In 1999, Shigetaka Kurita created the first collection of 180 emoji for a Japanese mobile web platform
Oxford Dictionaries named the “Face with Tears of Joy” emoji its Word of the Year for 2015

ABC
• Pronounced /ɪˈmoʊdʒi/, from Japanese
• “e” (picture), “moji” (character)
• As of 2017 there were 2,666 emoji on the official Unicode Standard list, spread across 22 blocks

Design Guideline

Text & Colorful Shape
• An emoji character can have two main kinds of presentation:
  • An emoji presentation, with colorful and perhaps whimsical shapes, even animated
  • A text presentation, such as black & white

Diversity
• Emoji display varies across operating systems, apps, versions, etc.
• Even within the same app you may get a different display

Vendor Implementations

Skin tone
• What’s the difference?
• U+1F64E
• U+1F64E U+1F3FB
• U+1F64E U+1F3FF U+200D U+2640 U+FE0F

What’s the
• U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
• U+200D ZERO WIDTH JOINER
• U+2640 FEMALE SIGN
• U+FE0F VARIATION SELECTOR-16

Emoji Modifiers
• Emoji modifier - a character that can be used to modify the appearance of a preceding emoji in an emoji modifier sequence
• Emoji modifier base - a character whose appearance can be modified by a subsequent emoji modifier in an emoji modifier sequence
• Emoji modifier sequence - a sequence of the following form: emoji_modifier_sequence := emoji_modifier_base emoji_modifier

Fitzpatrick Modifiers
• When one of these characters follows certain characters, a font should show the sequence as a single glyph with the specified skin tone
• If the font doesn’t show the combined character, the user can still see that a skin tone was intended

Sample Use of Fitzpatrick Modifiers

Variation Selectors
• Variation Selectors is a Unicode block containing 16 variation selector format characters (designated VS1 through VS16)
• They are used to specify a particular glyph variant for a Unicode character
• At present only standardized variation sequences with VS1, VS15 and VS16 have been defined

VS-15 (U+FE0E)
• An invisible code point which specifies that the preceding character should be rendered with text presentation

VS-16 (U+FE0F)
• An invisible code point which specifies that the preceding character should be displayed with emoji presentation
• Only required if the preceding character defaults to text presentation
• Often used in emoji ZWJ sequences, where one or more characters in the sequence have both text and emoji presentation

ZWJ

Emoji ZWJ Sequences
• ZERO WIDTH JOINER, U+200D
• Joining characters as a single
glyph if available
• They behave like a single emoji character, even though internally they are sequences

Example
• The sequence U+1F468 👨, U+200D ZWJ, U+1F469 👩, U+200D ZWJ, U+1F467 👧
• could be displayed as a single emoji depicting a family 👨‍👩‍👧 if the implementation supports it
• or else the system would ignore the ZWJs and show the base emoji in the sequence: 👨👩👧

Multi-Person Groupings

Gender Combinations
• Some multi-person groupings explicitly indicate gender: MAN AND WOMAN HOLDING HANDS
• Others do not: KISS, COUPLE WITH HEART

Practice

“[emoji]”.length()?
• The emoji is actually composed of 4 emoji, 1 variation selector and 3 ZWJs

Example
"👨".codePointAt(0) // 128104 -> U+1F468, the combined code point of the surrogate pair
"👨".codePointAt(1) // 56424 -> U+DC68, the trailing surrogate on its own
• The man emoji has the code point U+1F468
• It can’t be represented in a single code unit in Java
• That’s why a surrogate pair has to be used, so it consists of two code units

How does Java represent supplementary characters?

Surrogate Pair
• It is possible to combine two code points defined in the BMP to express another code point that lies outside the first 65,536 code points. This combination is called a surrogate pair.
• Leading surrogate: U+D800~U+DBFF
• Trailing surrogate: U+DC00~U+DFFF
• The values from U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points

Deep Dive
• e.g. U+1F468 👨
• 0x1F468 - 0x10000 = 0xF468
• => 11110100 01101000
• Using 20 bits => 0000111101 0001101000
• 0xD800 + 0x3D = 0xD83D
• 0xDC00 + 0x68 = 0xDC68

Supplementary Encoding in UTF-16
• UTF-16 covers U+0000~U+FFFF using 2 bytes
• For a code point U in U+10000~U+10FFFF:
  • Subtract 0x10000 to get U’ (0x00000~0xFFFFF), 20 bits
  • e.g.
U’ = yyyyyyyyyyxxxxxxxxxx
  • W1 represents the first 10 bits
  • e.g. W1 = 110110yyyyyyyyyy, W1 in D800~DBFF
  • W2 represents the second 10 bits
  • e.g. W2 = 110111xxxxxxxxxx, W2 in DC00~DFFF

Tricks1 - native2ascii

Tricks2 - Calculator

Tricks3 - Character Viewer

Java API
• String.codePointAt(int index):int
• Character.highSurrogate(int codePoint):char
• Character.lowSurrogate(int codePoint):char
• Character.charCount(int codePoint):int
• Character.isSupplementaryCodePoint(int codePoint):boolean
• Character.isSurrogate(char):boolean
• Character.isSurrogatePair(char, char):boolean
• …

References
• https://en.wikipedia.org/wiki/Emoji
• http://stn.audible.com/abcs-of-unicode
• https://twitter.github.io/twemoji/preview.html
• http://www.unicode.org/reports/tr51/
• https://en.wikipedia.org/wiki/Fitzpatrick_scale
• http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html
• https://en.wikipedia.org/wiki/UTF-8
• https://vinoit.me/2016/10/07/codePoint-in-java-and-utf16/
• https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
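
The surrogate-pair arithmetic from the Deep Dive and Supplementary Encoding slides can be sketched in Java roughly as follows. This is a minimal illustration, not code from the deck: the class name SurrogateSketch and the helpers toSurrogatePair and family are hypothetical names introduced here, and only the Character and String methods are real JDK API (Java 7 or later is assumed).

```java
class SurrogateSketch {
    // Hypothetical helper (not a JDK method): split a supplementary
    // code point into its UTF-16 surrogate pair by hand.
    static char[] toSurrogatePair(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int u = codePoint - 0x10000;              // U', a 20-bit value
        char w1 = (char) (0xD800 | (u >>> 10));   // high 10 bits -> leading surrogate
        char w2 = (char) (0xDC00 | (u & 0x3FF));  // low 10 bits  -> trailing surrogate
        return new char[] { w1, w2 };
    }

    // Hypothetical helper: the ZWJ family sequence from the Example slide,
    // built from code points so the source file needs no emoji literals.
    static String family() {
        int[] codePoints = { 0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F467 };
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        int man = 0x1F468;
        char[] pair = toSurrogatePair(man);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DC68

        // The JDK helpers should agree with the manual arithmetic.
        System.out.println(pair[0] == Character.highSurrogate(man));    // true
        System.out.println(pair[1] == Character.lowSurrogate(man));     // true

        String s = new String(Character.toChars(man));
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(s.codePointAt(0));                // 128104
        System.out.println(s.codePointAt(1));                // 56424

        String f = family();
        System.out.println(f.length());                      // 8 code units
        System.out.println(f.codePointCount(0, f.length())); // 5 code points
    }
}
```

Note that length() counts UTF-16 code units, so substring(0, 1) on a string that starts with a supplementary character returns a lone leading surrogate rather than the emoji.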

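
The bit layout in the UTF-8 Encoding table can be checked with a small hand-written encoder. The sketch below is illustrative only: Utf8Sketch, encode and toHex are hypothetical names, the encoder assumes a valid Unicode scalar value (it does not reject surrogates or out-of-range input), and the JDK's own String.getBytes(StandardCharsets.UTF_8) is used as the comparison.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Hypothetical helper: encode one code point following the
    // 1-, 2-, 3- and 4-byte patterns from the UTF-8 Encoding table.
    // Assumes the input is a valid scalar value (no validation here).
    static byte[] encode(int cp) {
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {                      // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp <= 0xFFFF) {                     // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    // Hypothetical helper: render bytes as space-separated hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode(0x4E2D)));   // E4 B8 AD

        // Compare the manual encoder with the JDK for a supplementary character.
        int man = 0x1F468;
        byte[] viaJdk = new String(Character.toChars(man)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(encode(man), viaJdk)); // true
    }
}
```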