<<

Coverage

● Examples: – Cherokee https://www.unicode.org/charts/PDF/U13A0.pdf – Peter Jeszenszky Imperial Aramaic https://www.unicode.org/charts/PDF/U10840.pdf – Old Hungarian https://www.unicode.org/charts/PDF/U10C80.pdf Faculty of Informatics, University of Debrecen – [email protected] https://www.unicode.org/charts/PDF/U13000.pdf – https://www.unicode.org/charts/PDF/U1F600.pdf Last modified: September 6, 2021 – Alchemical symbols https://www.unicode.org/charts/PDF/U1F700.pdf – …

3

What is Unicode? Standard

● Universal encoding standard for ● Developed by the , non- written characters and text. profit organization. ● Covers all the characters for all the writing – See: systems of the world, modern and ancient. https://www.unicode.org/consortium/consort.html – It also includes technical symbols, , ● The current standard is version 13.0.0 (2020 and many other characters used in writing text. March 10). ● Widely used and supported. – See: The Unicode Standard, Version 13.0.0 https://www.unicode.org/versions/Unicode13.0.0/

2 4 Universal Coded Character Set Points (UCS)

● A standard character set defined by ISO. ● When referring code points, the usual ● The current standard: practice is to refer to them by their numeric – ISO/IEC 10646:2020 – Universal Coded Character Set (UCS) https://www.iso.org/standard/76835.html value expressed in using four to ● Can be downloaded from here: https://standards.iso.org/ittf/PubliclyAvailableStandards/index.html six digits, with a + prefix. ● Developed in conjunction with Unicode. – zeros are omitted, unless the code – The characters and their code points are the same in both standards. would have fewer than four hexadecimal digits. – The difference is that Unicode imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. – Examples: U+0020, U+265F, U+130E0 ● Further information: – Frequently Asked Questions – Unicode and ISO 10646 https://www.unicode.org/faq/unicode_iso.html

5 7

Basic Concepts Properties

● Codespace: the range of integers used to code ● Unicode associates a rich set of semantics with characters (code the characters. points), these semantics are defined by character properties. – Unicode identifies more than 100 different character properties. ● : an element of the codespace, .., ● Examples: an integer encoding a character. – Name – General category (letter, number, , , …) – Case (uppercase, lowercase, titlecase) – … ● The Unicode Character (UCD) contains character properties. https://unicode.org/ucd/

6 8 Character Names Codespace

● ● Each character is identified by a unique name. The range of integers from 016 to 10FFFF16. – Examples: – The total number of code points is 1,114,112, of ● U+0041 – LATIN CAPITAL LETTER A (A) which 143,859 are used currently. https://www.fileformat.info/info/unicode/char/0041/.h ● Character code charts: https://www.unicode.org/charts/ tm ● U+2605 – BLACK STAR (★) ) https://www.fileformat.info/info/unicode/char/2605/index.h tm ● U+1F63A – SMILING CAT FACE WITH OPEN MOUTH (😺) ) https://www.fileformat.info/info/unicode/char/1f63a/index. htm

9 11

Characters and Planes and Blocks

● The character identified by a Unicode code point is ● The codespace is divided into planes, each of which contains an abstract entity, such as “LATIN CAPITAL LETTER 65,536 (216) code points. A” or “BENGALI DIGIT FIVE”. – The last four hexadecimal digits in each code point indicate a character’s position inside a . The remaining digits indicate the plane.

● ● A visual representation of a character is called a For example, U+130F7 is found at location 30F716 in Plane 1.

. The total number of planes is 17 (016, …, 1016).

● ● The Unicode standard does not define glyph images. Planes are divided into non-overlapping blocks. – Blocks are named ranges, where the number of code points in a block is ● The visual appearance of characters on a device always a multiply of 16. (e.g., screen or printer) is fully left to the or – Characters used in a single may be found in several hardware responsible for rendering characters. different blocks.

10 12 BMP UTF-32

● Basic Multilingual Plane (BMP): ● Each code point is represented by 4 – The plane containing the first 65,536 code points (fixed-width form). (range U+0000–U+FFFF) (Plane 0). ● The simplest Unicode encoding form. – Contains the common-use characters for all the ● The most efficient in terms of processing. modern scripts of the world as well as many historical and rare characters. ● The least efficient encoding in terms of the ● By far the majority of all Unicode characters for almost all number of bytes used. textual data can be found in the BMP.

13 15

Character Encodings UTF-16

● Character encodings defined by the Unicode ● Each code point is represented by 2 or 4 bytes standard: (variable-width character encoding form): – – UTF-8 Code points in the BMP are represented by using 2 bytes, for all other code points 4 bytes are used. – UTF-16 ● Optimizes the representation of characters in the BMP. – UTF-32 – For code points in the BMP can effectively be treated as if it ● All three encoding forms can be used to were a fixed-width encoding form. represent the full range of Unicode characters. ● Maintains a balance between efficient access and economical use of storage.

14 16 UTF-8 Order (2)

● Each code point is represented by using from 1 to 4 bytes (variable- ● (BOM): width character encoding form): – Code points in the range U+0000–U+007F are represented by a single byte – The character U+FEFF (ZERO WIDTH -BREAK (the 128 ASCII characters). ) is used as the byte order mark to indicate – Code points in the range U+0080–U+07FF are represented by using 2 bytes. the byte order at the very beginning of text. – All other code points in the BMP are represented by using 3 bytes. – – Code points outside of the BMP are represented by using 4 bytes. It is not part of the textual content and should be ● The first byte of a byte sequence representing a code points removed before processing. determines the number of bytes in the byte sequence. Encoding Scheme Byte Sequence ● The most compact encoding in terms of the number of bytes used. UTF-16 big-endian FE FF – It is less efficient when used for East Asian writing systems, such as Chinese, Japanese, and Korean. UTF-16 little-endian FF FE UTF-32 big-endian 00 00 FE FF

17 UTF-32 little-endian FF FE 00 00 19

Byte Order (1)

● For the UTF-16 and UTF-32 encoding forms the byte order ● : (big-endian or little-endian) must also be also specified. – In GTK+ applications Unicode characters can be ● Taking account of the byte order, Unicode defines the entered by typing Ctrl-Shift-U, followed by a following seven encoding schemes: hexadecimal Unicode code point. – UTF-8 ● – UTF-16, UTF-16BE, UTF-16LE See: – UTF-32, UTF-32BE, UTF-32LE – Unicode input ● For the UTF-16 and UTF-32 encoding schemes the byte https://en.wikipedia.org/wiki/Unicode_input order is determined by the BOM at the very beginning of text.

18 20 ISO/IEC 8859 ECMAScript

● 8-bit character encoding standards (ISO/IEC 8859-1, … ● In string literals, literals, template literals , ISO/IEC 8859-16). and identifiers, any Unicode character may also be expressed using Unicode escape sequences of the form: – See: ISO/IEC 8859 – 8-bit single-byte coded sets – \uhhhh where hhhh is a sequence of four hexadecimal digits representing the code point. ● Relevant encodings for us in Hungary: ● Examples: \u00A9, \u262F – ISO/IEC 8859-1 (Latin-1): for Western European languages – \u{hhhhhh} where hhhhhh is a sequence of one to six – ISO/IEC 8859-2 (Latin-2): for Central European languages hexadecimal digits representing the code point. ● Examples: \u{A9}, \u{1F63A} ● Suitable to be used for the following languages: Albanian, Bosnian, Czech, Croatian, Polish, Hungarian, German, Romanian, Serbian ● See: https://262.ecma-international.org/12.0/#sec-source-text (Latin ), Slovakian, Slovenian, Sorbian.

21 23

CSS JSON

● Unicode characters can be specified with escape sequences of the ● In strings, Unicode characters in the BMP may form \hhhhhh, where hhhhhh is a sequence of one to six hexadecimal digits representing the code point of the Unicode also be expressed using escape sequences of character. the form \uhhhh, where is a sequence of four – If the number of hex digits is less than six and a character in the range [0- hexadecimal digits representing the code point. 9a-fA-F] follows the hexadecimal number, then a must end the . – Examples: \u00A9, \u262F ● A whitespace character that immediately follows an escape sequence will be ignored. ● See: – Examples: https://www.rfc-editor.org/rfc/rfc8259#section-7 ● \A9, \0A9, …, \0000A9 ● \262F, \0262F, \00262F ● See: https://www.w3.org/TR/css-syntax-3/#escaping

22 24 XML, XHTML Conversion Tools

● In text, attribute values, and literal entity values ● Linux: Unicode characters may also be expressed – Recode (license: GNU GPL v3) using character references of the form: https://github.com/rrthomas/recode/ – &#nnnn; where nnnn is a sequence of ● Example of use: digits representing the code point. $ recode --list $ recode UTF-8..ISO-8859-2 file.txt ● Examples: ©, ☯, 😺 $ recode UTF-8..UTF-16 *.txt – &#xhhhh; where hhhh is a sequence of hexadecimal digits representing the code point. ● Examples: ©, ☯, 😺

25 27

HMTL Useful Links

● A number of Unicode characters may be ● Shapecatcher: Draw the Unicode character you expressed using named character references of want! https://shapecatcher.com/ the form &name;. ● Unicode Character Search – The list of supported character reference names: https://www.fileformat.info/info/unicode/char/search. https://html.spec.whatwg.org/#named-character-ref htm erences ● Unicode Character Table https://unicode-table.com/ – Examples: ● Unicode Party: The Unicode Search Engine ● É – U+00C9 (É) https://unicode.party/ ● é – U+00E9 (é) ● &what; http://www.amp-what.com/ ● ☆ – U+2606 (☆) )

26 28