Unicode

Peter Jeszenszky
Faculty of Informatics, University of Debrecen
Last modified: September 6, 2021

What is Unicode?
● Universal character encoding standard for written characters and text.
● Covers all the characters for all the writing systems of the world, modern and ancient.
  – It also includes technical symbols, punctuation marks, and many other characters used in writing text.
● Widely used and supported.

Coverage
● Examples:
  – Cherokee https://www.unicode.org/charts/PDF/U13A0.pdf
  – Imperial Aramaic https://www.unicode.org/charts/PDF/U10840.pdf
  – Old Hungarian https://www.unicode.org/charts/PDF/U10C80.pdf
  – Egyptian hieroglyphs https://www.unicode.org/charts/PDF/U13000.pdf
  – Emoticons https://www.unicode.org/charts/PDF/U1F600.pdf
  – Alchemical symbols https://www.unicode.org/charts/PDF/U1F700.pdf
  – …

Standard
● Developed by the Unicode Consortium, a non-profit organization.
  – See: https://www.unicode.org/consortium/consort.html
● The current standard is version 13.0.0 (March 10, 2020).
  – See: The Unicode Standard, Version 13.0.0 https://www.unicode.org/versions/Unicode13.0.0/

Universal Coded Character Set (UCS)
● A standard character set defined by ISO.
● The current standard:
  – ISO/IEC 10646:2020 Information technology – Universal Coded Character Set (UCS) https://www.iso.org/standard/76835.html
    ● Can be downloaded from here: https://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
● Developed in conjunction with Unicode.
  – The characters and their code points are the same in both standards.
  – The difference is that Unicode imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications.
● Further information:
  – Frequently Asked Questions – Unicode and ISO 10646 https://www.unicode.org/faq/unicode_iso.html

Basic Concepts
● Codespace: the range of integers used to code the characters.
● Code point: an element of the codespace, i.e., an integer encoding a character.

Code Points
● When referring to code points, the usual practice is to refer to them by their numeric value expressed in hexadecimal using four to six digits, with a U+ prefix.
  – Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits.
  – Examples: U+0020, U+265F, U+130E0

Properties
● Unicode associates a rich set of semantics with characters (code points); these semantics are defined by character properties.
  – Unicode identifies more than 100 different character properties.
● Examples:
  – Name
  – General category (letter, number, symbol, punctuation, …)
  – Case (uppercase, lowercase, titlecase)
  – …
● The Unicode Character Database (UCD) contains character properties. https://unicode.org/ucd/

Character Names
● Each character is identified by a unique name.
  – Examples:
    ● U+0041 – LATIN CAPITAL LETTER A (A) https://www.fileformat.info/info/unicode/char/0041/index.htm
    ● U+2605 – BLACK STAR (★) https://www.fileformat.info/info/unicode/char/2605/index.htm
    ● U+1F63A – SMILING CAT FACE WITH OPEN MOUTH (😺) https://www.fileformat.info/info/unicode/char/1f63a/index.htm

Characters and Glyphs
● The character identified by a Unicode code point is an abstract entity, such as “LATIN CAPITAL LETTER A” or “BENGALI DIGIT FIVE”.
● A visual representation of a character is called a glyph.
● The Unicode standard does not define glyph images.
● The visual appearance of characters on a device (e.g., screen or printer) is left entirely to the software or hardware responsible for rendering characters.

Codespace
● The range of integers from 0₁₆ to 10FFFF₁₆.
  – The total number of code points is 1,114,112, of which 143,859 are used currently.
● Character code charts: https://www.unicode.org/charts/

Planes and Blocks
● The codespace is divided into planes, each of which contains 65,536 (2¹⁶) code points.
  – The last four hexadecimal digits in each code point indicate a character’s position inside a plane. The remaining digits indicate the plane.
    ● For example, U+130F7 is found at location 30F7₁₆ in Plane 1.
● The total number of planes is 17 (0₁₆, …, 10₁₆).
● Planes are divided into non-overlapping blocks.
  – Blocks are named ranges, where the number of code points in a block is always a multiple of 16.
  – Characters used in a single writing system may be found in several different blocks.

BMP
● Basic Multilingual Plane (BMP):
  – The plane containing the first 65,536 code points (range U+0000–U+FFFF) (Plane 0).
  – Contains the common-use characters for all the modern scripts of the world as well as many historical and rare characters.
    ● By far the majority of all Unicode characters for almost all textual data can be found in the BMP.

Character Encodings
● Character encodings defined by the Unicode standard:
  – UTF-8
  – UTF-16
  – UTF-32
● All three encoding forms can be used to represent the full range of Unicode characters.

UTF-32
● Each code point is represented by 4 bytes (fixed-width character encoding form).
● The simplest Unicode encoding form.
● The most efficient in terms of processing: because every code point has the same width, the n-th code point can be located by simple indexing.
● The least efficient encoding in terms of the number of bytes used.

UTF-16
● Each code point is represented by 2 or 4 bytes (variable-width character encoding form):
  – Code points in the BMP are represented by using 2 bytes; for all other code points 4 bytes are used.
● Optimizes the representation of characters in the BMP.
  – For code points in the BMP, it can effectively be treated as if it were a fixed-width encoding form.
● Maintains a balance between efficient access and economical use of storage.

UTF-8
● Each code point is represented by using from 1 to 4 bytes (variable-width character encoding form):
  – Code points in the range U+0000–U+007F are represented by a single byte (the 128 ASCII characters).
  – Code points in the range U+0080–U+07FF are represented by using 2 bytes.
  – All other code points in the BMP are represented by using 3 bytes.
  – Code points outside of the BMP are represented by using 4 bytes.
● The first byte of a byte sequence representing a code point determines the number of bytes in the byte sequence.
● The most compact encoding in terms of the number of bytes used for text that consists mostly of ASCII characters.
  – It is less efficient when used for East Asian writing systems, such as Chinese, Japanese, and Korean: most of their characters take 3 bytes in UTF-8 but only 2 bytes in UTF-16.

Byte Order (1)
● For the UTF-16 and UTF-32 encoding forms the byte order (big-endian or little-endian) must also be specified.
● Taking account of the byte order, Unicode defines the following seven encoding schemes:
  – UTF-8
  – UTF-16, UTF-16BE, UTF-16LE
  – UTF-32, UTF-32BE, UTF-32LE
● For the UTF-16 and UTF-32 encoding schemes the byte order is determined by the BOM at the very beginning of text.

Byte Order (2)
● Byte order mark (BOM):
  – The character U+FEFF (ZERO WIDTH NO-BREAK SPACE) is used as the byte order mark to indicate the byte order at the very beginning of text.
  – It is not part of the textual content and should be removed before processing.

  Encoding Scheme         Byte Sequence
  UTF-16 big-endian       FE FF
  UTF-16 little-endian    FF FE
  UTF-32 big-endian       00 00 FE FF
  UTF-32 little-endian    FF FE 00 00

Unicode Input
● Linux:
  – In GTK+ applications Unicode characters can be entered by typing Ctrl-Shift-U, followed by a hexadecimal Unicode code point.
● See:
  – Unicode input https://en.wikipedia.org/wiki/Unicode_input

ISO/IEC 8859
● 8-bit character encoding standards (ISO/IEC 8859-1, …, ISO/IEC 8859-16).
  – See: ISO/IEC 8859 – 8-bit single-byte coded graphic character sets
● Relevant encodings for us in Hungary:
  – ISO/IEC 8859-1 (Latin-1): for Western European languages
  – ISO/IEC 8859-2 (Latin-2): for Central European languages
    ● Suitable to be used for the following languages: Albanian, Bosnian, Czech, Croatian, Polish, Hungarian, German, Romanian, Serbian (Latin alphabet), Slovakian, Slovenian, Sorbian.

CSS
● Unicode characters can be specified with escape sequences of the form \hhhhhh, where hhhhhh is a sequence of one to six hexadecimal digits representing the code point of the Unicode character.
  – If the number of hex digits is less than six and a character in the range [0-9a-fA-F] follows the hexadecimal number, then a whitespace character must end the escape sequence.
    ● A whitespace character that immediately follows an escape sequence will be ignored.
  – Examples:
    ● \A9, \0A9, …, \0000A9
    ● \262F, \0262F, \00262F
● See: https://www.w3.org/TR/css-syntax-3/#escaping

ECMAScript
● In string literals, regular expression literals, template literals, and identifiers, any Unicode character may also be expressed using Unicode escape sequences of the form:
  – \uhhhh where hhhh is a sequence of four hexadecimal digits representing the code point.
    ● Examples: \u00A9, \u262F
  – \u{hhhhhh} where hhhhhh is a sequence of one to six hexadecimal digits representing the code point.
    ● Examples: \u{A9}, \u{1F63A}
● See: https://262.ecma-international.org/12.0/#sec-source-text

JSON
● In strings, Unicode characters in the BMP may also be expressed using escape sequences of the form \uhhhh, where hhhh is a sequence of four hexadecimal digits representing the code point.
  – Examples: \u00A9, \u262F
● See: https://www.rfc-editor.org/rfc/rfc8259#section-7

XML, XHTML
● In text, attribute values, and literal entity values Unicode characters may also be expressed using character references of the form:
  – &#nnnn; where nnnn is a sequence of decimal digits representing the code point.
    ● Examples: &#169; (©), &#9775; (☯), &#128570; (😺)
  – &#xhhhh; where hhhh is a sequence of hexadecimal digits representing the code point.
    ● Examples: &#xA9; (©), &#x262F; (☯), &#x1F63A; (😺)

HTML
● A number of Unicode characters may be expressed using named character references of the form &name;.

Conversion Tools
● Linux:
  – Recode (license: GNU GPL v3) https://github.com/rrthomas/recode/
    ● Example of use:
        $ recode --list
        $ recode UTF-8..ISO-8859-2 file.txt
        $ recode UTF-8..UTF-16 *.txt

Useful Links
● Shapecatcher: Draw the Unicode character you want! https://shapecatcher.com/
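
The U+ notation from the Code Points slide and the plane arithmetic from the Planes and Blocks slide can be sketched in a few lines of Python. The function names format_code_point and plane_and_offset are made up for this illustration and do not come from the slides or from any library.

```python
def format_code_point(cp: int) -> str:
    """Return the U+ notation: at least four hex digits, no extra leading zeros."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("value lies outside the Unicode codespace")
    # 04X pads to four digits only when the value has fewer than four.
    return f"U+{cp:04X}"


def plane_and_offset(cp: int) -> tuple:
    """Split a code point into (plane number, position inside the plane)."""
    # The last four hex digits (16 bits) are the position; the rest is the plane.
    return cp >> 16, cp & 0xFFFF


plane, offset = plane_and_offset(0x130F7)
print(format_code_point(0x130F7), "is at", f"{offset:04X}", "in Plane", plane)
```

Shifting by 16 bits works because every plane holds exactly 2¹⁶ code points.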
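
Some of the character properties listed on the Properties and Character Names slides can be looked up with Python's standard unicodedata module, which ships a copy of the Unicode Character Database. This is only a sketch: the helper name describe is hypothetical, and which characters are known depends on the UCD version bundled with the Python build in use.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point, name, and general category of one character."""
    cp = ord(ch)
    # The second argument is a fallback for characters without a name.
    name = unicodedata.name(ch, "<unnamed>")
    category = unicodedata.category(ch)  # two-letter general category code
    return f"U+{cp:04X} {name} [{category}]"


for ch in ("A", "\u2605", "\U0001F63A"):
    print(describe(ch))
```

The category codes are abbreviations, for example Lu for an uppercase letter and So for "symbol, other".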
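
The byte counts stated on the UTF-8, UTF-16, and UTF-32 slides can be checked against Python's built-in codecs. The functions utf8_length and utf16_length below are illustrative names, not library calls; they only restate the ranges from the slides and ignore the surrogate range U+D800–U+DFFF, which cannot be encoded on its own.

```python
def utf8_length(cp: int) -> int:
    """Number of bytes UTF-8 uses for a code point, by range."""
    if cp <= 0x7F:
        return 1  # the 128 ASCII characters
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3  # the rest of the BMP
    return 4      # outside the BMP


def utf16_length(cp: int) -> int:
    """Number of bytes UTF-16 uses: 2 inside the BMP, 4 outside."""
    return 2 if cp <= 0xFFFF else 4


for ch in ("A", "\u00E9", "\u2605", "\U0001F63A"):
    cp = ord(ch)
    # The "-be" codecs are used so that no byte order mark is counted.
    assert len(ch.encode("utf-8")) == utf8_length(cp)
    assert len(ch.encode("utf-16-be")) == utf16_length(cp)
    assert len(ch.encode("utf-32-be")) == 4
```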
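
The byte sequences in the Byte Order (2) table are available as constants in Python's codecs module, so a minimal BOM check might look like the sketch below. The name sniff_bom and the returned labels are assumptions made for this example. One caveat: the UTF-32 little-endian BOM begins with the UTF-16 little-endian BOM, so the UTF-32 marks are tested first, and UTF-16LE text whose first character is U+0000 would be misreported.

```python
import codecs

# Longer marks come first because FF FE 00 00 starts with FF FE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32 little-endian"),
    (codecs.BOM_UTF16_BE, "UTF-16 big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 little-endian"),
]


def sniff_bom(data: bytes):
    """Return (label or None, data with the BOM removed)."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            # The BOM is not part of the text, so strip it before processing.
            return label, data[len(bom):]
    return None, data


label, payload = sniff_bom(b"\xff\xfeA\x00")
print(label, payload.decode("utf-16-le"))
```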
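
The escape forms from the JSON and XML, XHTML slides can be produced and decoded with Python's standard library, as a rough illustration. The helper xml_char_refs is a hypothetical name. Note that json.dumps writes a character outside the BMP as two \uhhhh escapes (a UTF-16 surrogate pair), which is consistent with the slide's statement that a single \uhhhh covers only BMP characters.

```python
import html
import json


def xml_char_refs(ch: str) -> tuple:
    """Return the decimal and hexadecimal character references for ch."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"


for ch in ("\u00A9", "\u262F", "\U0001F63A"):
    decimal_ref, hex_ref = xml_char_refs(ch)
    # html.unescape resolves numeric character references back to text.
    assert html.unescape(decimal_ref) == ch
    assert html.unescape(hex_ref) == ch
    # ensure_ascii=True (the default) escapes every non-ASCII character.
    escaped = json.dumps(ch)
    assert json.loads(escaped) == ch
    print(decimal_ref, hex_ref, escaped)
```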
