The ABC’s of

stuartlau@github, Sep 2018

Quiz

• Which language can use US-ASCII to encode all its characters?

• How many characters can char represent in Java?

• Can we use char to represent ‘�’ in Java?

• What is returned if you call “�”.length() and “�”.getBytes() in Java?

• What about “�”?

• Can we get the emoji by calling “�c”.substring(0,1)?

• Can we execute insert into tb(‘name’) values(‘�’) in MySQL?

https://stuartlau.github.io

Character

How to Define a Character?

• Character set - defines all readable characters

• Coded character set - uses a number to represent each character in the character repertoire

• Character encoding - the transformation between a code point and its code units

Why do we need character encoding?

Code Point

• A unique number assigned to each character

• Usually expressed in hexadecimal as U+xxxx, e.g. the code point for A is U+0041
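The U+xxxx notation can be reproduced from any Java string. The sketch below is only an illustration; the class name CodePointNotation and the helper notation are made up for this example.

```java
final class CodePointNotation {
    // Formats a code point in U+XXXX notation, padded to at least four hex digits.
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(notation("A".codePointAt(0)));      // U+0041
        System.out.println(notation("\u4E2D".codePointAt(0))); // U+4E2D
        System.out.println(notation(0x1F468));                 // U+1F468
    }
}
```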

Planes

• 17 planes

• 136,755 characters defined

• U+0000~U+10FFFF, 21-bit

• Supports over 1.1M possible characters

BMP

• Basic Multilingual Plane, U+0000~U+FFFF, 65,536 code points in total

Supplementary Characters

• Code points between U+10000 and U+10FFFF are the supplementary characters

• Cannot be represented as a single 16-bit entity
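Java's Character class can tell the two ranges apart. A minimal sketch, with SupplementaryCheck and charsNeeded as illustrative names:

```java
final class SupplementaryCheck {
    // Number of 16-bit char values a code point needs: 1 inside the BMP, 2 above it.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(Character.isBmpCodePoint(0x4E2D));            // true
        System.out.println(Character.isSupplementaryCodePoint(0x1F468)); // true
        System.out.println(charsNeeded(0x4E2D));                         // 1
        System.out.println(charsNeeded(0x1F468));                        // 2
    }
}
```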

Character Encoding

• A mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units

• The most commonly used code units are bytes, but 16-bit, 32-bit integers can also be used for internal processing

• UTF-32, UTF-16 and UTF-8 are character encoding schemes for the Unicode standard

UTF-32

• UTF-32 encodes each Unicode character as one 32-bit code unit, e.g. A is 00 00 00 41

• It’s the most convenient representation for internal processing

• But it wastes memory
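Because every code point maps to exactly one 32-bit unit, a big-endian UTF-32 encoder can be sketched by hand. This is an illustration rather than a JDK charset; Utf32Sketch, utf32be and hex are hypothetical names.

```java
import java.nio.ByteBuffer;

final class Utf32Sketch {
    // One code point becomes one 32-bit code unit; ByteBuffer is big-endian by default.
    static byte[] utf32be(int codePoint) {
        return ByteBuffer.allocate(4).putInt(codePoint).array();
    }

    // Renders bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(utf32be('A')));     // 00 00 00 41
        System.out.println(hex(utf32be(0x1F468))); // 00 01 F4 68
    }
}
```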

UTF-16

• UTF-16 encodes each Unicode character as one or two 16-bit code units; a single code unit covers U+0000~U+FFFF (0~65535)

• Each character is encoded using 2 or 4 bytes

• The internal Java encoding

• Code points between U+0000 and U+FFFF are represented as a 16-bit Java char value

• e.g. U+4E2D - 中, 2 bytes, char c = '中'

• Code points between U+10000 and U+10FFFF are the supplementary characters, which a Java char cannot hold

Hello in UTF-16

Endianness

• Big endian: 00 48 00 65 00 6C 00 6C 00 6F (H e l l o)

• Little endian: 48 00 65 00 6C 00 6C 00 6F 00 (the two bytes of each code unit are swapped; the order of the code units stays the same)
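Both byte orders can be checked with the standard charsets. In this sketch only the class name Utf16Endianness and the hex helper are made up:

```java
import java.nio.charset.StandardCharsets;

final class Utf16Endianness {
    // Renders bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Big endian: the high byte of each 16-bit code unit comes first.
        System.out.println(hex("Hello".getBytes(StandardCharsets.UTF_16BE)));
        // 00 48 00 65 00 6C 00 6C 00 6F

        // Little endian: the low byte of each code unit comes first.
        System.out.println(hex("Hello".getBytes(StandardCharsets.UTF_16LE)));
        // 48 00 65 00 6C 00 6C 00 6F 00
    }
}
```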

BOM

• BOM = Byte Order Mark, the code point U+FEFF

• Appears at the start of Unicode text

• Big endian text starts with the bytes FE FF

• Little endian text starts with the bytes FF FE

• UTF-16 and UTF-32 have to deal with the issue of BE and LE, because they use multi-byte code units
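In Java, encoding with the plain UTF-16 charset writes a big-endian BOM, while UTF_16BE and UTF_16LE write none. A small sketch, where BomDemo and startsWithBigEndianBom are illustrative names:

```java
import java.nio.charset.StandardCharsets;

final class BomDemo {
    // True when the bytes start with FE FF, the big-endian serialization of U+FEFF.
    static boolean startsWithBigEndianBom(byte[] b) {
        return b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF;
    }

    public static void main(String[] args) {
        byte[] withBom = "A".getBytes(StandardCharsets.UTF_16);    // FE FF 00 41
        byte[] withoutBom = "A".getBytes(StandardCharsets.UTF_16BE); // 00 41

        System.out.println(startsWithBigEndianBom(withBom));    // true
        System.out.println(startsWithBigEndianBom(withoutBom)); // false
        System.out.println(withBom.length);                     // 4
        System.out.println(withoutBom.length);                  // 2
    }
}
```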

Java and Supplementary Characters

• Unicode was originally designed as a fixed-width 16-bit character encoding

• Java originally held every Unicode character in a single char

• But Unicode was later extended to 1,114,112 code points (21 bits), and Unicode 3.1 assigned the first supplementary characters

• J2SE 5.0 supports version 4.0 of the Unicode standard

UTF-8

• 8-bit, variable-width encoding

• Encodes each Unicode character using 1 to 4 bytes

• String constants in .class files are encoded using a modified form of UTF-8

• No BOM needed

UTF-8 Encoding

Byte1     Byte2     Byte3     Byte4     Available Bits  Sample

0xxxxxxx                                7               abc

110xxxxx  10xxxxxx                      11              āō

1110xxxx  10xxxxxx  10xxxxxx            16              汉字

11110xxx  10xxxxxx  10xxxxxx  10xxxxxx  21              (emoji)
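The four rows can be observed by counting the bytes Java produces for one character from each range. The sample characters below are chosen for illustration, and Utf8Lengths and utf8Length are made-up names.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String man = new String(Character.toChars(0x1F468)); // supplementary character

        System.out.println(utf8Length("a"));      // 1 (U+0061)
        System.out.println(utf8Length("\u0101")); // 2 (U+0101)
        System.out.println(utf8Length("\u6C49")); // 3 (U+6C49)
        System.out.println(utf8Length(man));      // 4 (U+1F468)
    }
}
```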

UTF-8 to UTF-16

• For those bytes:

• starts with 0

• e.g. 0xxxxxxx ==> 00000000 0xxxxxxx

• starts with 110

• e.g. 110xxxxx 10yyyyyy ==> 00000xxx xxyyyyyy

• starts with 1110

• e.g. 1110xxxx 10yyyyyy 10zzzzzz ==> xxxxyyyy yyzzzzzz

• e.g. “中”

• Unicode U+4E2D: 01001110 00101101

• UTF-8 E4 B8 AD: 11100100 10111000 10101101
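The 3-byte rule can be written as a few bit operations. This is a sketch only: it assumes a well-formed 3-byte sequence and does no validation, and Utf8Decode3 and decode3 are hypothetical names.

```java
final class Utf8Decode3 {
    // Decodes one 3-byte UTF-8 sequence (1110xxxx 10yyyyyy 10zzzzzz) into a code point.
    static int decode3(int b1, int b2, int b3) {
        return ((b1 & 0x0F) << 12)  // xxxx: low 4 bits of the lead byte
             | ((b2 & 0x3F) << 6)   // yyyyyy: low 6 bits of the first continuation byte
             | (b3 & 0x3F);         // zzzzzz: low 6 bits of the second continuation byte
    }

    public static void main(String[] args) {
        int codePoint = decode3(0xE4, 0xB8, 0xAD);
        System.out.println(String.format("U+%04X", codePoint)); // U+4E2D
    }
}
```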

UTF-16 to UTF-8

• For those code units:

• Up to 0x007F (00000000 01111111)

• e.g. 00000000 0xxxxxxx ==> 0xxxxxxx

• Up to 0x07FF (00000111 11111111)

• e.g. 00000aaa bbbbbbbb ==> 110aaabb 10bbbbbb

• Others

• e.g. aaaaaaaa bbbbbbbb ==> 1110aaaa 10aaaabb 10bbbbbb

• e.g. “中”

• Unicode U+4E2D: 01001110 00101101

• UTF-8 E4 B8 AD: 11100100 10111000 10101101
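The three cases translate directly into code. The sketch below assumes a single BMP code point that is not a surrogate; supplementary characters need the 4-byte form and are left out. Utf8EncodeBmp and encodeBmp are illustrative names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8EncodeBmp {
    // Encodes one BMP code point (0x0000..0xFFFF, not a surrogate) as UTF-8.
    static byte[] encodeBmp(int cp) {
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};                 // 0xxxxxxx
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                 // 110aaabb
                (byte) (0x80 | (cp & 0x3F))                // 10bbbbbb
            };
        }
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),                    // 1110aaaa
            (byte) (0x80 | ((cp >> 6) & 0x3F)),            // 10aaaabb
            (byte) (0x80 | (cp & 0x3F))                    // 10bbbbbb
        };
    }

    public static void main(String[] args) {
        byte[] manual = encodeBmp(0x4E2D);
        byte[] builtIn = "\u4E2D".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, builtIn)); // true
    }
}
```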

Emoji History

• In 1999, the first collection of 180 emoji was created for a Japanese mobile web platform

• The Oxford Dictionary named the “face with tears of joy” emoji as its word of the year for 2015

ABC

• Pronounced /ɪˈmoʊdʒi/, from Japanese

• “e” (picture), “moji” (character)

• As of 2017 there were 2,666 emoji on the official Unicode Standard list spread across 22 blocks

Design Guideline

Text & Colorful Shape

• An emoji character can have two main kinds of presentation:

• An emoji presentation, with colorful and perhaps whimsical shapes, even animated

• A text presentation, such as black & white

Diversity

• Emoji display varies across operating systems, apps, versions, etc.

• Even within the same app, the display may differ

Vendor Implementations

Skin tone

• What’s the difference?

• U+1F64E

• U+1F64E U+1F3FB

• U+1F64E U+1F3FF U+200D U+2640 U+FE0F

What’s the

• U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6

• U+200D ZERO WIDTH JOINER

• U+2640 FEMALE SIGN

• U+FE0F VARIATION SELECTOR-16

Emoji Modifiers

• Emoji modifier - A character that can be used to modify the appearance of a preceding emoji in an emoji modifier sequence

• Emoji modifier base - A character whose appearance can be modified by a subsequent emoji modifier in an emoji modifier sequence

• Emoji modifier sequence - A sequence of the following form: emoji_modifier_sequence := emoji_modifier_base emoji_modifier
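A modifier sequence is simply a run of code points in one string. The sketch below builds the three skin-tone sequences listed earlier and compares char count with code point count; ModifierSequence and of are illustrative names.

```java
final class ModifierSequence {
    // Builds a string from a list of code points.
    static String of(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Number of code points in the string, counting each surrogate pair once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String base = of(0x1F64E);                                      // modifier base only
        String toned = of(0x1F64E, 0x1F3FB);                            // base + modifier
        String tonedWoman = of(0x1F64E, 0x1F3FF, 0x200D, 0x2640, 0xFE0F);

        System.out.println(base.length() + " chars, " + codePoints(base) + " code points");
        // 2 chars, 1 code points
        System.out.println(toned.length() + " chars, " + codePoints(toned) + " code points");
        // 4 chars, 2 code points
        System.out.println(tonedWoman.length() + " chars, " + codePoints(tonedWoman) + " code points");
        // 7 chars, 5 code points
    }
}
```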

Fitzpatrick Modifiers

• When one of these characters follows certain characters, then a font should show the sequence as a single glyph with the specified skin tone

• If the font doesn’t show the combined character, the user can still see that a skin tone was intended

Sample Use of Fitzpatrick Modifiers

Variation Selectors

• Variation Selectors is a Unicode block containing 16 variation selector format characters (designated VS1 through VS16)

• They are used to specify a specific glyph variant for a Unicode character

• At present only standardized variation sequences with VS1, VS15 and VS16 have been defined

VS-15 (U+FE0E)

• An invisible code point which specifies that the preceding character should be rendered in a textual fashion

VS-16 (U+FE0F)

• An invisible code point which specifies that the preceding character should be displayed with emoji presentation

• Only required if the preceding character defaults to text presentation

• Often used in Emoji ZWJ Sequences, where one or more characters in the sequence have text and emoji presentation

ZWJ

Emoji ZWJ Sequences

• ZERO WIDTH JOINER, U+200D

• Joins characters into a single glyph if available

• Such sequences behave like a single emoji character, even though internally they are sequences

Example

• The sequence U+1F468 👨, U+200D ZWJ, U+1F469 👩, U+200D ZWJ, U+1F467 👧

• could be displayed as a single emoji depicting a family if the implementation supports it

• otherwise the system ignores the ZWJs and shows the base emoji in the sequence: 👨👩👧
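The family sequence and its fallback can be reproduced in Java. Removing the ZWJs by hand below only mimics what a renderer without the combined glyph shows; ZwjFamily, of and withoutZwj are made-up names.

```java
final class ZwjFamily {
    static final int ZWJ = 0x200D;

    // Builds a string from a list of code points.
    static String of(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Removes every ZWJ, leaving the base emoji as separate characters.
    static String withoutZwj(String s) {
        return s.replace(new String(Character.toChars(ZWJ)), "");
    }

    public static void main(String[] args) {
        String family = of(0x1F468, ZWJ, 0x1F469, ZWJ, 0x1F467);
        String fallback = withoutZwj(family);

        System.out.println(family.length());                            // 8
        System.out.println(family.codePointCount(0, family.length()));  // 5
        System.out.println(fallback.length());                          // 6
        System.out.println(fallback.codePointCount(0, fallback.length())); // 3
    }
}
```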

Multi-Person Groupings

Gender Combinations

• Some multi-person groupings explicitly indicate gender: MAN AND WOMAN HOLDING HANDS

• Others do not: KISS, COUPLE WITH HEART

Practice

“�”.length?

• Actually composed of 4 emoji, 1 variation selector and 3 ZWJs

Example

"👨".codePointAt(0) // 128104 -> U+1F468, the code point combined from the surrogate pair

"👨".codePointAt(1) // 56424 -> U+DC68, the trailing surrogate on its own

• The man emoji has the code point U+1F468

• It can’t be represented in a single code unit in Java

• That’s why a surrogate pair has to be used, so the emoji consists of two code units
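The same calls as a runnable sketch. The emoji is written with escapes so the source file stays ASCII; SurrogatePairDemo is an illustrative class name.

```java
final class SurrogatePairDemo {
    // The man emoji U+1F468, written as its two UTF-16 code units.
    static final String MAN = "\uD83D\uDC68";

    public static void main(String[] args) {
        System.out.println(MAN.length());        // 2: two code units
        System.out.println(MAN.codePointAt(0));  // 128104, i.e. U+1F468
        System.out.println(MAN.codePointAt(1));  // 56424, i.e. U+DC68
        System.out.println(MAN.codePointCount(0, MAN.length())); // 1
    }
}
```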

How does Java represent supplementary characters?

Surrogate Pair

• Two code points reserved in the BMP can be combined to express a code point that lies outside the first 65,536 code points. This combination is called a surrogate pair.

• Leading surrogate: U+D800~U+DBFF

• Trailing surrogate: U+DC00~U+DFFF

• The values from U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points

Deep Dive

• e.g. U+1F468 👨

• 0x1F468 - 0x10000 = 0xF468

• => 11110100 01101000

• Using 20-bit => 0000111101 0001101000

• 0xD800 + 0x3D = 0xD83D

• 0xDC00 + 0x68 = 0xDC68
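The arithmetic above fits in two small helpers, which can be checked against the JDK's own Character.highSurrogate and Character.lowSurrogate. SurrogateMath, high and low are illustrative names, and the helpers assume a supplementary code point as input.

```java
final class SurrogateMath {
    // Leading surrogate: 0xD800 plus the top 10 bits of (codePoint - 0x10000).
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Trailing surrogate: 0xDC00 plus the low 10 bits of (codePoint - 0x10000).
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int man = 0x1F468;
        System.out.println(String.format("%04X %04X", (int) high(man), (int) low(man)));
        // D83D DC68
        System.out.println(high(man) == Character.highSurrogate(man)); // true
        System.out.println(low(man) == Character.lowSurrogate(man));   // true
    }
}
```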

Supplementary Encoding in UTF-16

• UTF-16 covers U+0000~U+FFFF using 2 bytes

• For Unicode U(U+10000~U+10FFFF):

• Minus 0x10000, get U’(0x00000~0xFFFFF), 20 bits

• e.g. U’ = yyyyyyyyyyxxxxxxxxxx

• Using W1 to represent the first 10 bits,

• e.g. W1 = 110110yyyyyyyyyy, W1 in D800~DBFF

• Using W2 to represent the second 10 bits,

• e.g. W2 = 110111xxxxxxxxxx, W2 in DC00~DFFF
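Going the other way, W1 and W2 can be combined back into the original code point. A sketch assuming a valid pair, with SurrogateCombine and combine as hypothetical names; the JDK offers the same operation as Character.toCodePoint.

```java
final class SurrogateCombine {
    // Rebuilds the code point: U = 0x10000 + (y bits from W1 << 10) + (x bits from W2).
    static int combine(char w1, char w2) {
        return 0x10000 + ((w1 - 0xD800) << 10) + (w2 - 0xDC00);
    }

    public static void main(String[] args) {
        char w1 = 0xD83D;
        char w2 = 0xDC68;
        System.out.println(String.format("U+%X", combine(w1, w2)));          // U+1F468
        System.out.println(combine(w1, w2) == Character.toCodePoint(w1, w2)); // true
    }
}
```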

Tricks1 - native2ascii

Tricks2 - Calculator

Tricks3 - Character Viewer

Java API

• String.codePointAt(int index):int

• Character.highSurrogate(int codePoint):char

• Character.lowSurrogate(int codePoint):char

• Character.charCount(int codePoint):int

• Character.isSupplementaryCodePoint(int codePoint):boolean

• Character.isSurrogate(char):boolean

• Character.isSurrogatePair(char, char):boolean

• …
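These methods matter as soon as a string is cut by index. The sketch below revisits the quiz question about substring(0,1): the naive call returns half of a surrogate pair, while an offset computed with String.offsetByCodePoints keeps the emoji whole. CodePointSafeSubstring and firstCodePoint are illustrative names, and the helper assumes a non-empty string.

```java
final class CodePointSafeSubstring {
    // Returns the first full code point of a non-empty string, keeping surrogate pairs intact.
    static String firstCodePoint(String s) {
        return s.substring(0, s.offsetByCodePoints(0, 1));
    }

    public static void main(String[] args) {
        String text = "\uD83D\uDC68c"; // U+1F468 followed by the letter c

        String naive = text.substring(0, 1);
        System.out.println(Character.isHighSurrogate(naive.charAt(0)));     // true: half a pair
        System.out.println(firstCodePoint(text).codePointAt(0) == 0x1F468); // true
        System.out.println(text.length());                                  // 3
        System.out.println(text.codePointCount(0, text.length()));          // 2
        System.out.println(Character.isSurrogatePair(text.charAt(0), text.charAt(1))); // true
    }
}
```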
