The ABC’s of Emoji
stuartlau@github
Sep, 2018
Quiz
• Which language can use US-ASCII to encode all its characters?
• How many characters can char represent in Java?
• Can we use a char to represent ‘👨’ in Java?
• What is returned when you call “👨”.length() and “👨”.getBytes() in Java?
• What about “👩‍❤️‍💋‍👨”?
• Can we get the emoji by calling “👨c”.substring(0,1)?
• Can we execute INSERT INTO tb(name) VALUES('👨') in MySQL?
https://stuartlau.github.io
Character
How to Define a Character?
• Character set - defines all readable characters
• Coded character set - assigns a code point to represent each character in the character repertoire
• Character encoding - the transformation between code point and its own code units
Why do we need character encoding?
Code Point
• A unique number assigned to each Unicode character
• Usually expressed in Hexadecimal as U+xxxx, e.g. code point for A is U+0041
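The U+xxxx notation can be reproduced in Java from any code point. This is a minimal sketch; the class name CodePointDemo and the helper format are illustrative names, not part of the original slides.

```java
class CodePointDemo {
    // Formats a code point in the conventional U+XXXX notation (at least four hex digits).
    static String format(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(format("A".codePointAt(0)));      // U+0041
        System.out.println(format("\u4E2D".codePointAt(0))); // U+4E2D
    }
}
```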
Planes
• 17 planes
• 136755 characters defined
• U+0000~U+10FFFF, 21-bit
• Supports over 1.1M possible characters
BMP
• Basic Multilingual Plane, U+0000~U+FFFF, 65536 in total
Supplementary Characters
• Code points between U+10000 and U+10FFFF are the supplementary characters
• They cannot be represented as a single 16-bit entity
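Planes and the BMP/supplementary split can be checked with a few lines of Java. A small sketch, assuming a hypothetical helper plane that is not from the slides; Character.isBmpCodePoint and Character.isSupplementaryCodePoint are standard JDK methods.

```java
class PlaneDemo {
    // Each plane holds 0x10000 (65,536) code points, so the plane is the code point shifted right by 16.
    static int plane(int codePoint) {
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(plane(0x4E2D));                               // 0  (BMP)
        System.out.println(plane(0x1F468));                              // 1
        System.out.println(plane(0x10FFFF));                             // 16 (last of the 17 planes)
        System.out.println(Character.isBmpCodePoint(0x4E2D));            // true
        System.out.println(Character.isSupplementaryCodePoint(0x1F468)); // true
    }
}
```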
Character Encoding
• A mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units
• The most commonly used code units are bytes, but 16-bit, 32-bit integers can also be used for internal processing
• UTF-32, UTF-16 and UTF-8 are character encoding schemes for the Unicode standard
UTF-32
• UTF-32 encodes each Unicode character as one 32-bit code unit, e.g. A ==> 00 00 00 41
• It’s the most convenient representation for internal processing
• But it’s memory-wasting
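Because UTF-32 stores the code point itself in four bytes, a big-endian encoder is only a few lines. This sketch writes the bytes by hand with ByteBuffer rather than relying on a UTF-32 charset being installed; the names Utf32Demo, encodeUtf32BE and hex are illustrative.

```java
import java.nio.ByteBuffer;

class Utf32Demo {
    // UTF-32BE: the code point written as one big-endian 32-bit integer.
    static byte[] encodeUtf32BE(int codePoint) {
        return ByteBuffer.allocate(4).putInt(codePoint).array();
    }

    // Renders bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeUtf32BE('A')));     // 00 00 00 41
        System.out.println(hex(encodeUtf32BE(0x1F468))); // 00 01 F4 68
    }
}
```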
UTF-16
• UTF-16 encodes each Unicode character as one or two 16-bit code units; a single unit covers U+0000~U+FFFF (0~65535)
• Each character is encoded using 2 or 4 bytes
• Java’s internal string encoding
• Code points between U+0000 and U+FFFF are represented as a single 16-bit Java char value
• e.g. U+4E2D - 中, 2 bytes, char c = ‘中’
• Code points between U+10000 and U+10FFFF are the supplementary characters, which a single Java char cannot hold
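The difference between a BMP character and a supplementary character shows up directly in String.length() and in the UTF-16 byte count. A minimal sketch; Utf16Demo and utf16Bytes are illustrative names.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    // Number of bytes the string occupies in UTF-16 (big-endian, no BOM).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        char c = '\u4E2D';                                  // 中 fits in one char
        String zhong = String.valueOf(c);
        String man = new String(Character.toChars(0x1F468)); // 👨 needs two chars

        System.out.println(zhong.length());    // 1
        System.out.println(utf16Bytes(zhong)); // 2
        System.out.println(man.length());      // 2
        System.out.println(utf16Bytes(man));   // 4
    }
}
```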
Hello in UTF-16
Endianness
Big-endian: 00 48 00 65 00 6C 00 6C 00 6F
H e l l o
Little-endian: 48 00 65 00 6C 00 6C 00 6F 00
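Both byte orders can be printed from Java for comparison. Note that little-endian swaps the two bytes inside each 16-bit code unit; the order of the code units themselves does not change. The class name EndiannessDemo and the helper hex are illustrative.

```java
import java.nio.charset.StandardCharsets;

class EndiannessDemo {
    // Renders bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Big-endian: most significant byte of each code unit first.
        System.out.println(hex("Hello".getBytes(StandardCharsets.UTF_16BE))); // 00 48 00 65 00 6C 00 6C 00 6F
        // Little-endian: least significant byte of each code unit first.
        System.out.println(hex("Hello".getBytes(StandardCharsets.UTF_16LE))); // 48 00 65 00 6C 00 6C 00 6F 00
    }
}
```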
BOM
• BOM = Byte Order Mark
• Appears at the start of Unicode text
• The BOM is the code point U+FEFF
• In UTF-16, big-endian text starts with the bytes FE FF
• In UTF-16, little-endian text starts with the bytes FF FE
• UTF-16 and UTF-32 have to deal with the issue of BE and LE, because they use multi-byte code units
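Java's generic UTF-16 charset illustrates the BOM: when encoding it writes a big-endian BOM, and when decoding it uses a leading BOM to pick the byte order. A small sketch; BomDemo and startsWithBigEndianBom are illustrative names.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // True if the data begins with the big-endian UTF-16 BOM bytes FE FF.
    static boolean startsWithBigEndianBom(byte[] data) {
        return data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF;
    }

    public static void main(String[] args) {
        byte[] encoded = "Hello".getBytes(StandardCharsets.UTF_16);
        System.out.println(encoded.length);                  // 12 (2 BOM bytes + 5 * 2)
        System.out.println(startsWithBigEndianBom(encoded)); // true

        // A little-endian BOM (FF FE) followed by 'H' in little-endian order.
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x48, 0x00};
        System.out.println(new String(littleEndian, StandardCharsets.UTF_16)); // H
    }
}
```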
Java and Supplementary Characters
• Unicode was originally designed as a fixed-width 16-bit character encoding
• A Java char could therefore originally hold any Unicode character
• Unicode was later extended to 1,114,112 code points (21 bits); Unicode 3.1 was the first version to assign supplementary characters
• J2SE 5.0 supports version 4.0 of the Unicode standard
UTF-8
• 8-bit, variable-width encoding
• Encodes each Unicode character using 1 to 4 bytes
• String constants in .class files are encoded using a modified form of UTF-8
• No BOM needed
UTF-8 Encoding
Byte1     Byte2     Byte3     Byte4     Available Bits  Sample
0xxxxxxx                                7               abc
110xxxxx  10xxxxxx                      11              āō
1110xxxx  10xxxxxx  10xxxxxx            16              汉字
11110xxx  10xxxxxx  10xxxxxx  10xxxxxx  21              emoji, e.g. 👨
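The byte counts in the table can be checked with String.getBytes. A minimal sketch; Utf8LengthDemo and utf8Length are illustrative names, and the sample characters are written as escapes so the source file encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of bytes the string occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));                                   // 1 (U+0061)
        System.out.println(utf8Length("\u0101"));                              // 2 (U+0101, ā)
        System.out.println(utf8Length("\u6C49"));                              // 3 (U+6C49, 汉)
        System.out.println(utf8Length(new String(Character.toChars(0x1F468)))); // 4 (U+1F468, 👨)
    }
}
```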
UTF-8 to UTF-16
• For a UTF-8 sequence whose first byte:
• starts with 0
• e.g. 0xxxxxxx ==> 00000000 0xxxxxxx
• starts with 110
• e.g. 110xxxxx 10yyyyyy ==> 00000xxx xxyyyyyy
• starts with 1110
• e.g. 1110xxxx 10yyyyyy 10zzzzzz ==> xxxxyyyy yyzzzzzz
• e.g. “中”
• Unicode U+4E2D: 01001110 00101101
• UTF-8 E4 B8 AD: 11100100 10111000 10101101
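The three-byte case can be written out as bit operations. This is a sketch of that single case only (no validation of the input bytes); Utf8DecodeDemo and decode3 are illustrative names.

```java
class Utf8DecodeDemo {
    // 1110xxxx 10yyyyyy 10zzzzzz ==> xxxxyyyy yyzzzzzz
    static char decode3(int b1, int b2, int b3) {
        int x = b1 & 0x0F; // low 4 bits of the lead byte
        int y = b2 & 0x3F; // low 6 bits of the first continuation byte
        int z = b3 & 0x3F; // low 6 bits of the second continuation byte
        return (char) ((x << 12) | (y << 6) | z);
    }

    public static void main(String[] args) {
        char c = decode3(0xE4, 0xB8, 0xAD);
        System.out.println(Integer.toHexString(c).toUpperCase()); // 4E2D
    }
}
```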
UTF-16 to UTF-8
• For a 16-bit code unit:
• Up to 0x007F (00000000 01111111)
• e.g. 00000000 0xxxxxxx ==> 0xxxxxxx
• Up to 0x07FF (00000111 11111111)
• e.g. 00000aaa bbbbbbbb ==> 110aaabb 10bbbbbb
• Others
• e.g. aaaaaaaa bbbbbbbb ==> 1110aaaa 10aaaabb 10bbbbbb
• e.g. “中”
• Unicode U+4E2D: 01001110 00101101
• UTF-8 E4 B8 AD: 11100100 10111000 10101101
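The three cases above translate into a short encoder for a single BMP code unit. A sketch under the assumption that the input is not a surrogate (supplementary characters need the four-byte form, which is not covered here); Utf8EncodeDemo and encodeBmp are illustrative names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8EncodeDemo {
    // Encodes one non-surrogate BMP code unit as 1, 2, or 3 UTF-8 bytes.
    static byte[] encodeBmp(char c) {
        if (c <= 0x7F) {
            return new byte[] {(byte) c};
        }
        if (c <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),   // 110xxxxx
                (byte) (0x80 | (c & 0x3F))  // 10xxxxxx
            };
        }
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),         // 1110xxxx
            (byte) (0x80 | ((c >> 6) & 0x3F)), // 10xxxxxx
            (byte) (0x80 | (c & 0x3F))         // 10xxxxxx
        };
    }

    public static void main(String[] args) {
        byte[] manual = encodeBmp('\u4E2D');
        byte[] jdk = "\u4E2D".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, jdk)); // true
    }
}
```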
Emoji History
• In 1999, Shigetaka Kurita created the first collection of 180 emoji for a Japanese mobile web platform
• Oxford Dictionaries named the “face with tears of joy” emoji its Word of the Year for 2015
ABC
• Pronounced /ɪˈmoʊdʒi/, from Japanese
• “e” (picture) + “moji” (character)
• As of 2017 there were 2,666 emoji on the official Unicode Standard list spread across 22 blocks
Design Guideline
Text & Colorful Shape
• Emoji character can have two main kinds of presentation:
• An emoji presentation, with colorful and perhaps whimsical shapes, even animated
• A text presentation, such as black & white
Diversity
• Emoji display varies across operating systems, apps, and versions
• Even within the same app, the display may differ
Vendor Implementations
Skin tone
• What’s the difference?
• U+1F64E U+1F3FB
• U+1F64E U+1F3FF
• U+1F64E U+200D U+2640 U+FE0F
What’s the
• U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
• U+200D ZERO WIDTH JOINER
• U+2640 FEMALE SIGN
• U+FE0F VARIATION SELECTOR-16
Emoji Modifiers
• Emoji modifier - A character that can be used to modify the appearance of a preceding emoji in an emoji modifier sequence
• Emoji modifier base - A character whose appearance can be modified by a subsequent emoji modifier in an emoji modifier sequence
• Emoji modifier sequence - A sequence of the following form: emoji_modifier_sequence := emoji_modifier_base emoji_modifier
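An emoji modifier sequence is simply the base code point followed by the modifier code point. A minimal sketch using U+1F64E as the base and U+1F3FF as the modifier; SkinToneDemo and withSkinTone are illustrative names, and no check is made that the base really is a valid modifier base.

```java
class SkinToneDemo {
    // emoji_modifier_sequence := emoji_modifier_base emoji_modifier
    static String withSkinTone(int base, int modifier) {
        return new StringBuilder()
                .appendCodePoint(base)
                .appendCodePoint(modifier)
                .toString();
    }

    public static void main(String[] args) {
        String s = withSkinTone(0x1F64E, 0x1F3FF);
        System.out.println(s.length());                      // 4 (two surrogate pairs)
        System.out.println(s.codePointCount(0, s.length())); // 2
    }
}
```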
Fitzpatrick Modifiers
• When one of these characters follows certain characters, then a font should show the sequence as a single glyph with the specified skin tone
• If the font doesn’t show the combined character, the user can still see that a skin tone was intended
Sample Use of Fitzpatrick Modifiers
Variation Selectors
• VS is a Unicode block containing 16 Variation Selector format characters(designated VS1 through VS16)
• They are used to specify a specific glyph variant for a Unicode character
• At present only standardized variation sequences with VS1, VS15 and VS16 have been defined
VS-15(U+FE0E)
• An invisible code point which specifies that the preceding character should be rendered in a textual fashion
VS-16(U+FE0F)
• An invisible code point which specifies that the preceding character should be displayed with emoji presentation
• Only required if the preceding character defaults to text presentation
• Often used in Emoji ZWJ Sequences, where one or more characters in the sequence have text and emoji presentation
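In a string, a variation selector is just one more char appended after the character it affects. A small sketch using U+2640 FEMALE SIGN as the base; how the result is drawn depends on the font and platform. VariationSelectorDemo, textStyle and emojiStyle are illustrative names.

```java
class VariationSelectorDemo {
    static final char VS15 = '\uFE0E'; // requests text presentation
    static final char VS16 = '\uFE0F'; // requests emoji presentation

    static String textStyle(String base) {
        return base + VS15;
    }

    static String emojiStyle(String base) {
        return base + VS16;
    }

    public static void main(String[] args) {
        String female = "\u2640";
        System.out.println(textStyle(female).length());  // 2
        System.out.println(emojiStyle(female).length()); // 2
        // The two strings differ even though the selector itself is invisible.
        System.out.println(textStyle(female).equals(emojiStyle(female))); // false
    }
}
```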
ZWJ
Emoji ZWJ Sequences
• ZERO WIDTH JOINER, U+200D
• Joins characters into a single glyph if available
• ZWJ sequences behave like a single emoji character, even though internally they are sequences
Example
• The sequence U+1F468 👨, U+200D ZWJ, U+1F469 👩, U+200D ZWJ, U+1F467 👧
• could be displayed as a single emoji depicting a family if the implementation supports it
• otherwise the system ignores the ZWJs and shows the base emoji in sequence: 👨👩👧
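The family sequence above can be assembled by inserting U+200D between the base emoji. A minimal sketch; ZwjDemo and zwjSequence are illustrative names.

```java
class ZwjDemo {
    static final char ZWJ = '\u200D';

    // Joins the given code points with a ZERO WIDTH JOINER between each pair.
    static String zwjSequence(int... codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < codePoints.length; i++) {
            if (i > 0) sb.append(ZWJ);
            sb.appendCodePoint(codePoints[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String family = zwjSequence(0x1F468, 0x1F469, 0x1F467); // man, woman, girl
        System.out.println(family.length());                          // 8 (3 surrogate pairs + 2 ZWJs)
        System.out.println(family.codePointCount(0, family.length())); // 5
    }
}
```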
Multi-Person Groupings
Gender Combinations
• Some multi-person groupings explicitly indicate gender: MAN AND WOMAN HOLDING HANDS
• Others do not: KISS, COUPLE WITH HEART
Practice
“👩‍❤️‍💋‍👨”.length()?
• It is actually composed of 4 emoji, 1 variation selector and 3 ZWJs
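A sequence of that shape can be measured in Java. This sketch assumes the emoji in question is the kiss sequence (WOMAN, ZWJ, HEAVY BLACK HEART, VS-16, ZWJ, KISS MARK, ZWJ, MAN), which matches the 4 + 1 + 3 breakdown; the slide's actual emoji is not recoverable here, and KissLengthDemo and build are illustrative names.

```java
class KissLengthDemo {
    static final int ZWJ = 0x200D;
    static final int VS16 = 0xFE0F;

    // Assumed sequence: WOMAN, ZWJ, HEAVY BLACK HEART, VS-16, ZWJ, KISS MARK, ZWJ, MAN
    static final int[] KISS = {0x1F469, ZWJ, 0x2764, VS16, ZWJ, 0x1F48B, ZWJ, 0x1F468};

    static String build(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String kiss = build(KISS);
        // 3 supplementary emoji * 2 chars + heart + VS-16 + 3 ZWJs = 11 chars
        System.out.println(kiss.length());                        // 11
        System.out.println(kiss.codePointCount(0, kiss.length())); // 8
    }
}
```

One visible glyph therefore reports a length of 11, which is why length() is not a reliable character count for emoji.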
Example
"👨".codePointAt(0) // 128104 -> U+1F468, the code point combined from the surrogate pair
"👨".codePointAt(1) // 56424 -> U+DC68, the trailing surrogate on its own
• The man emoji has the code point U+1F468
• It can’t be represented in a single code unit in Java
• That’s why a surrogate pair has to be used, so the emoji consists of two code units
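Because codePointAt on an index in the middle of a pair returns the lone surrogate, it is usually safer to walk a string by code point rather than by char. A minimal sketch using String.codePoints() (Java 8+); CodePointIterationDemo and codePointsOf are illustrative names.

```java
class CodePointIterationDemo {
    // Returns the code points of the string, with each surrogate pair combined into one value.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String man = "\uD83D\uDC68"; // U+1F468 as a surrogate pair
        System.out.println(man.codePointAt(0)); // 128104
        System.out.println(man.codePointAt(1)); // 56424

        int[] cps = codePointsOf("a" + man + "c");
        System.out.println(cps.length);                   // 3
        System.out.println(Integer.toHexString(cps[1]));  // 1f468
    }
}
```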
How does Java represent supplementary characters?
Surrogate Pair
• Two code units from a reserved range of the BMP can be combined to express a code point that lies outside the first 65536 code points. This combination is called a surrogate pair.
• Leading Surrogate: U+D800~U+DBFF
• Trailing Surrogate: U+DC00~U+DFFF
• The values from U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points
Deep Dive
• e.g. U+1F468 👨
• 0x1F468 - 0x10000 = 0xF468
• => 11110100 01101000
• Using 20-bit => 0000111101 0001101000
• 0xD800 + 0x3D = 0xD83D
• 0xDC00 + 0x68 = 0xDC68
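The arithmetic above can be written as a short method and compared with what the JDK produces. A sketch assuming the input is a supplementary code point (no range check); SurrogateEncodeDemo and toSurrogates are illustrative names, while Character.toChars is the standard API.

```java
import java.util.Arrays;

class SurrogateEncodeDemo {
    // Splits a supplementary code point into its leading and trailing surrogates.
    static char[] toSurrogates(int codePoint) {
        int u = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (u >>> 10));  // top 10 bits
        char low = (char) (0xDC00 + (u & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F468);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);       // D83D DC68
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F468)));  // true
    }
}
```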
Supplementary Encoding in UTF-16
• UTF-16 covers U+0000~U+FFFF using 2 bytes
• For a code point U in U+10000~U+10FFFF:
• Subtract 0x10000 to get U’ (0x00000~0xFFFFF), 20 bits
• e.g. U’ = yyyyyyyyyyxxxxxxxxxx
• Using W1 to represent the first 10 bits,
• e.g. W1 = 110110yyyyyyyyyy, W1 in D800~DBFF
• Using W2 to represent the second 10 bits,
• e.g. W2 = 110111xxxxxxxxxx, W2 in DC00~DFFF
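Going the other way, W1 and W2 can be recombined into the original code point by reversing these steps. A sketch that assumes a valid leading and trailing surrogate are passed in; SurrogateDecodeDemo and combine are illustrative names, and Character.toCodePoint is the standard API doing the same job.

```java
class SurrogateDecodeDemo {
    // Rebuilds U from W1 = 110110yyyyyyyyyy and W2 = 110111xxxxxxxxxx.
    static int combine(char w1, char w2) {
        int y = w1 - 0xD800;               // first 10 bits of U'
        int x = w2 - 0xDC00;               // second 10 bits of U'
        return ((y << 10) | x) + 0x10000;  // add back the offset removed when encoding
    }

    public static void main(String[] args) {
        int cp = combine('\uD83D', '\uDC68');
        System.out.println(Integer.toHexString(cp).toUpperCase());            // 1F468
        System.out.println(cp == Character.toCodePoint('\uD83D', '\uDC68'));  // true
    }
}
```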
Tricks1 - native2ascii
Tricks2 - Calculator
Tricks3 - Character Viewer
Java API
• String.codePointAt(int index):int
• Character.highSurrogate(int codePoint):char
• Character.lowSurrogate(int codePoint):char
• Character.charCount(int codePoint):int
• Character.isSupplementaryCodePoint(int codePoint):boolean
• Character.isSurrogate(char):boolean
• Character.isSurrogatePair(char, char):boolean
• …
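These methods help avoid cutting a surrogate pair in half, which is what substring(0,1) does on a string that starts with an emoji. A minimal sketch using String.offsetByCodePoints; SafeSubstringDemo and firstCodePoint are illustrative names, and the helper assumes a non-empty string. It keeps one code point, so a multi-code-point ZWJ sequence would still be split.

```java
class SafeSubstringDemo {
    // Returns the first code point of a non-empty string as a String (1 or 2 chars).
    static String firstCodePoint(String s) {
        return s.substring(0, s.offsetByCodePoints(0, 1));
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDC68c"; // man emoji followed by 'c'

        String broken = s.substring(0, 1); // only the leading surrogate
        System.out.println(Character.isHighSurrogate(broken.charAt(0))); // true

        String whole = firstCodePoint(s);
        System.out.println(whole.length());                  // 2
        System.out.println(whole.codePointAt(0) == 0x1F468); // true
    }
}
```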