Graphemic Standardisation and Human Writing Systems Workshop

Victor Zimmermann 2019-06-22

Department of Computational Linguistics Heidelberg University Writing: A Human Invention

Human writing systems have independently been invented about ﬁve times in human history:

• Indus Valley (Harappan Script) • Sumer (Cuneiform) • Egypt (Hieroglyphs) • Huang (Chinese) • Central America (Maya Script)

Difference to speech, which exists in every human civilisation. There is no known case where a child has acquired writing by itself. It is a technology, just as the wheel or the iPhone.

tacos29 | unicode 1 Scripts vs. Character Sets

• Script / Writing System: Method of visualising verbal communication. • Character set: Set of symbols used in writing system. • Alphabet: Set of characters representing phonemes. • Abjad: Set of characters representing consonants. • Abugida: Set of characters representing consonants with vowel notations. • Syllabary: Set of characters representing syllables. • Logography: Set of characters representing semantic units.

tacos29 | unicode 2 Writing Systems around the World

tacos29 | unicode 3 Graphemic Standardisation

“When he wished to print, he took an iron frame and set it on the iron plate. In this he placed the types, set close together. When the frame was full, the whole made one solid block of type. He then placed it near the ﬁre to warm it. When the paste [at the back] was slightly melted, he took a smooth board and pressed it over the surface, so that the block of type became as even as a whetstone.” - Shen Kuo (1031–1095)

tacos29 | unicode 4 Graphemic Standardisation

• Early printing standards tightly linked to typography. • What could be printed was dependent on physical movable typesets. • Usage of German metal types used in the early printing presses led to the removal of characters like Thorn and Eth from the English alphabet.

tacos29 | unicode 5 Digital Standards

• Early bit encodings like Morse or Baudot code utilize bit-like character encryption. • American 7-Bit ASCII Encoding used throughout Latin writing world. • Various 8-Bit extensions of ASCII emerged, eg. adding various diacritic versions for Northern European countries. • IBM releases own 8-Bit standard, the Extended Binary Coded Decimal Interchange Code (EBCDIC). • Many competing standards for Japanese led to Mojigake when moving data between companies.

tacos29 | unicode 6 Workshop: Canadian Aboriginal Syllabics

Figure 1: Evans’ script, as published in 1841 tacos29 | unicode 7 Workshop: Cyrillic Scripts

tacos29 | unicode 8 Workshop: Cyrillic Scripts

tacos29 | unicode 9 Workshop: Hangul

tacos29 | unicode 10 Workshop: Hangul

tacos29 | unicode 11 Workshop Task

• Are there any cultural aspects you need to consider when creating a standard? How should they inﬂuence your design? • How would a standard deal with the syllabic blocks, ligatures or diacritics present in your language? • Is there a particularly efﬁcient way to represent this script? • Are there things you wish to include in your standard that go beyond character sets? • Does a program using your encoding need special instructions to represent your characters? What are they?

tacos29 | unicode 12 UTF-32, UTF-16, UTF-8

UTF-32: All code points are encoded in 32 bit.

UTF-16: All code points are encoded in 16 or 32 bit. Little Endian: left-to-right, Big Endian: right-to-left

UTF-8: Variable width encoding through continuation marks and 8 bit chunks.

tacos29 | unicode 13 Examples

UTF-32 00000000 00000000 00000000 01000001 UTF-16 LE 00000000 01000001 UTF-16 BE 10000010 00000000 UTF-8 01000001

LATIN CAPITAL LETTER A (U+0041)

UTF-32 00000000 00000001 11110011 00101110 UTF-16 LE 11011000 00111100 11011111 00101110 UTF-16 BE 01110100 11111011 00111100 00011011 UTF-8 11110000 10011111 10001100 10101110

TACO (U+1F32E) tacos29 | unicode 14 History of Emoji

The history of Unicode is the history of compromise. UTF-8 came to be because ASCII users did not want to move from 7 to 16 or even 32 bit systems.

Because of its logographic nature, Japanese uses a lot of code points. Since encodings always double in size for each bit, Japanese was left with a lot of open code points. Some Japanese phone companies used this space for symbols like emoticons or the poop emoji.

tacos29 | unicode 15 History of Emoji

2010 emojis were incorporated in the Unicode Standard to allow compatibility with Japanese phones. iPhone users found out. Poop emojis appeared in messages around the world soon after.

Today Emojis are spread over multiple Unicode blocks, the Japanese blocks and a special miscellaneous block. The addition of new emojis is governed by the Unicode Consortium (not Apple).

tacos29 | unicode 16 References References

[Uni19] The Unicode Consortium. The Unicode Standard Version 12.0 - Core Speciﬁcation. Vol. 1. Mountain View, CA: The Unicode Consortium, 2019. isbn: 9781936213238.

coll | references 17 Thank you!

That’s it! Thank you for coming!

Have fun at TaCoS29!

en.axtimhaus.eu [email protected]

coll | references 18