Unicode

Peter Jeszenszky
Faculty of Informatics, University of Debrecen
[email protected]
Last modified: September 6, 2021

What is Unicode?

● Universal character encoding standard for written characters and text.
● Covers all the characters for all the writing systems of the world, modern and ancient.
  – It also includes technical symbols, punctuation, and many other characters used in writing text.
● Widely used and supported.

Unicode Coverage

● Examples:
  – Cherokee https://www.unicode.org/charts/PDF/U13A0.pdf
  – Imperial Aramaic https://www.unicode.org/charts/PDF/U10840.pdf
  – Old Hungarian https://www.unicode.org/charts/PDF/U10C80.pdf
  – Egyptian hieroglyphs https://www.unicode.org/charts/PDF/U13000.pdf
  – Emoticons https://www.unicode.org/charts/PDF/U1F600.pdf
  – Alchemical symbols https://www.unicode.org/charts/PDF/U1F700.pdf
  – …

Standard

● Developed by the Unicode Consortium, a non-profit organization.
  – See: https://www.unicode.org/consortium/consort.html
● The current standard is version 13.0.0 (2020 March 10).
  – See: The Unicode Standard, Version 13.0.0 https://www.unicode.org/versions/Unicode13.0.0/

Universal Coded Character Set (UCS)

● A standard character set defined by ISO.
● The current standard:
  – ISO/IEC 10646:2020 Information technology – Universal Coded Character Set (UCS) https://www.iso.org/standard/76835.html
    ● Can be downloaded from here: https://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
● Developed in conjunction with Unicode.
  – The characters and their code points are the same in both standards.
  – The difference is that Unicode imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications.
● Further information:
  – Frequently Asked Questions – Unicode and ISO 10646 https://www.unicode.org/faq/unicode_iso.html

Basic Concepts

● Codespace: the range of integers used to code the characters.
● Code point: an element of the codespace, i.e., an integer encoding a character.

Code Points

● When referring to code points, the usual practice is to refer to them by their numeric value expressed in hexadecimal using four to six digits, with a U+ prefix.
  – Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits.
  – Examples: U+0020, U+265F, U+130E0

Properties

● Unicode associates a rich set of semantics with characters (code points); these semantics are defined by character properties.
  – Unicode identifies more than 100 different character properties.
● Examples:
  – Name
  – General category (letter, number, symbol, punctuation, …)
  – Case (uppercase, lowercase, titlecase)
  – …
● The Unicode Character Database (UCD) contains character properties. https://unicode.org/ucd/

Character Names

● Each character is identified by a unique name.
  – Examples:
    ● U+0041 – LATIN CAPITAL LETTER A (A) https://www.fileformat.info/info/unicode/char/0041/index.htm
    ● U+2605 – BLACK STAR (★) https://www.fileformat.info/info/unicode/char/2605/index.htm
    ● U+1F63A – SMILING CAT FACE WITH OPEN MOUTH (😺) https://www.fileformat.info/info/unicode/char/1f63a/index.htm

Characters and Glyphs

● The character identified by a Unicode code point is an abstract entity, such as “LATIN CAPITAL LETTER A” or “BENGALI DIGIT FIVE”.
● A visual representation of a character is called a glyph.
● The Unicode standard does not define glyph images.
● The visual appearance of characters on a device (e.g., screen or printer) is fully left to the software or hardware responsible for rendering characters.

Codespace

● The range of integers from 0₁₆ to 10FFFF₁₆.
  – The total number of code points is 1,114,112, of which 143,859 are used currently.
● Character code charts: https://www.unicode.org/charts/

Planes and Blocks

● The codespace is divided into planes, each of which contains 65,536 (2¹⁶) code points.
  – The last four hexadecimal digits in each code point indicate a character’s position inside a plane. The remaining digits indicate the plane.
    ● For example, U+130F7 is found at location 30F7₁₆ in Plane 1.
● The total number of planes is 17 (0₁₆, …, 10₁₆).
● Planes are divided into non-overlapping blocks.
  – Blocks are named ranges, where the number of code points in a block is always a multiple of 16.
  – Characters used in a single writing system may be found in several different blocks.

BMP

● Basic Multilingual Plane (BMP):
  – The plane containing the first 65,536 code points (range U+0000–U+FFFF) (Plane 0).
  – Contains the common-use characters for all the modern scripts of the world as well as many historical and rare characters.
    ● By far the majority of all Unicode characters for almost all textual data can be found in the BMP.

Character Encodings

● Character encodings defined by the Unicode standard:
  – UTF-8
  – UTF-16
  – UTF-32
● All three encoding forms can be used to represent the full range of Unicode characters.

UTF-32

● Each code point is represented by 4 bytes (fixed-width character encoding form).
● The simplest Unicode encoding form.
● The most efficient in terms of processing.
● The least efficient encoding in terms of the number of bytes used.

UTF-16

● Each code point is represented by 2 or 4 bytes (variable-width character encoding form):
  – Code points in the BMP are represented by using 2 bytes; for all other code points 4 bytes are used.
● Optimizes the representation of characters in the BMP.
  – For code points in the BMP, it can effectively be treated as if it were a fixed-width encoding form.
● Maintains a balance between efficient access and economical use of storage.

UTF-8

● Each code point is represented by using from 1 to 4 bytes (variable-width character encoding form):
  – Code points in the range U+0000–U+007F are represented by a single byte (the 128 ASCII characters).
  – Code points in the range U+0080–U+07FF are represented by using 2 bytes.
  – All other code points in the BMP are represented by using 3 bytes.
  – Code points outside of the BMP are represented by using 4 bytes.
● The first byte of a byte sequence representing a code point determines the number of bytes in the byte sequence.
● The most compact encoding in terms of the number of bytes used.
  – It is less efficient when used for East Asian writing systems, such as Chinese, Japanese, and Korean.

Byte Order (1)

● For the UTF-16 and UTF-32 encoding forms the byte order (big-endian or little-endian) must also be specified.
● Taking account of the byte order, Unicode defines the following seven encoding schemes:
  – UTF-8
  – UTF-16, UTF-16BE, UTF-16LE
  – UTF-32, UTF-32BE, UTF-32LE
● For the UTF-16 and UTF-32 encoding schemes the byte order is determined by the BOM at the very beginning of text.

Byte Order (2)

● Byte order mark (BOM):
  – The character U+FEFF (ZERO WIDTH NO-BREAK SPACE) is used as the byte order mark to indicate the byte order at the very beginning of text.
  – It is not part of the textual content and should be removed before processing.

    Encoding Scheme         Byte Sequence
    UTF-16 big-endian       FE FF
    UTF-16 little-endian    FF FE
    UTF-32 big-endian       00 00 FE FF
    UTF-32 little-endian    FF FE 00 00

Unicode Input

● Linux:
  – In GTK+ applications Unicode characters can be entered by typing Ctrl-Shift-U, followed by a hexadecimal Unicode code point.
● See:
  – Unicode input https://en.wikipedia.org/wiki/Unicode_input

ISO/IEC 8859

● 8-bit character encoding standards (ISO/IEC 8859-1, …, ISO/IEC 8859-16).
  – See: ISO/IEC 8859 – 8-bit single-byte coded graphic character sets
● Relevant encodings for us in Hungary:
  – ISO/IEC 8859-1 (Latin-1): for Western European languages
  – ISO/IEC 8859-2 (Latin-2): for Central European languages
    ● Suitable to be used for the following languages: Albanian, Bosnian, Czech, Croatian, Polish, Hungarian, German, Romanian, Serbian (Latin alphabet), Slovakian, Slovenian, Sorbian.

CSS

● Unicode characters can be specified with escape sequences of the form \hhhhhh, where hhhhhh is a sequence of one to six hexadecimal digits representing the code point of the Unicode character.
  – If the number of hex digits is less than six and a character in the range [0-9a-fA-F] follows the hexadecimal number, then a whitespace character must end the escape sequence.
    ● A whitespace character that immediately follows an escape sequence will be ignored.
  – Examples:
    ● \A9, \0A9, …, \0000A9
    ● \262F, \0262F, \00262F
● See: https://www.w3.org/TR/css-syntax-3/#escaping

ECMAScript

● In string literals, regular expression literals, template literals, and identifiers, any Unicode character may also be expressed using Unicode escape sequences of the form:
  – \uhhhh where hhhh is a sequence of four hexadecimal digits representing the code point.
    ● Examples: \u00A9, \u262F
  – \u{hhhhhh} where hhhhhh is a sequence of one to six hexadecimal digits representing the code point.
    ● Examples: \u{A9}, \u{1F63A}
● See: https://262.ecma-international.org/12.0/#sec-source-text

JSON

● In strings, Unicode characters in the BMP may also be expressed using escape sequences of the form \uhhhh, where hhhh is a sequence of four hexadecimal digits representing the code point.
  – Examples: \u00A9, \u262F
● See: https://www.rfc-editor.org/rfc/rfc8259#section-7

XML, XHTML

● In text, attribute values, and literal entity values Unicode characters may also be expressed using character references of the form:
  – &#nnnn; where nnnn is a sequence of decimal digits representing the code point.
    ● Examples: &#169; (©), &#9775; (☯), &#128570; (😺)
  – &#xhhhh; where hhhh is a sequence of hexadecimal digits representing the code point.
    ● Examples: &#xA9; (©), &#x262F; (☯), &#x1F63A; (😺)

HTML

● A number of Unicode characters may be expressed using named character references of the form &name;.

Conversion Tools

● Linux:
  – Recode (license: GNU GPL v3) https://github.com/rrthomas/recode/
    ● Example of use:
        $ recode --list
        $ recode UTF-8..ISO-8859-2 file.txt
        $ recode UTF-8..UTF-16 *.txt

Useful Links

● Shapecatcher: Draw the Unicode character you want! https://shapecatcher.com/
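
The character references and escape sequences described in the XML, HTML, and JSON slides can be tried out with Python's standard library. The following is only a minimal sketch: the helper name to_char_ref is made up for illustration, and the surrogate-pair form shown for JSON is not covered on the slides (it is how RFC 8259 lets a \uhhhh escape reach code points outside the BMP).

```python
import html
import json


def to_char_ref(ch, hexadecimal=False):
    """Build a numeric character reference for a single character.

    Hypothetical helper for illustration; not part of any library.
    """
    code_point = ord(ch)
    if hexadecimal:
        return f"&#x{code_point:X};"
    return f"&#{code_point};"


# Decoding: html.unescape resolves decimal, hexadecimal, and named references.
decimal_ref = html.unescape("&#169;")    # decimal reference for U+00A9
hex_ref = html.unescape("&#x1F63A;")     # hexadecimal reference for U+1F63A
named_ref = html.unescape("&copy;")      # HTML named reference for U+00A9

# JSON: a \uhhhh escape addresses a BMP code point directly.
json_bmp = json.loads('"\\u262F"')

# Assumption beyond the slides: outside the BMP, JSON uses two \uhhhh
# escapes forming a UTF-16 surrogate pair (here U+1F63A).
json_non_bmp = json.loads('"\\uD83D\\uDE3A"')

if __name__ == "__main__":
    print(to_char_ref("\u262F"))
    print(to_char_ref("\U0001F63A", hexadecimal=True))
    print(decimal_ref == named_ref, json_bmp, json_non_bmp == hex_ref)
```

Building the references from ord() keeps the decimal and hexadecimal forms tied to the same code point, which is the relationship the slides rely on.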

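The code point notation, the plane arithmetic, and the byte counts of the three encoding forms described in the slides can be checked with a short Python sketch. The function name describe and the dictionary keys are assumptions chosen for illustration; only ord, str.encode, unicodedata.name, and the codecs BOM constants are standard library features.

```python
import codecs
import unicodedata


def describe(ch):
    """Summarize one character (hypothetical helper for illustration)."""
    cp = ord(ch)
    return {
        # U+ notation: at least four hexadecimal digits, no extra leading zeros.
        "notation": f"U+{cp:04X}",
        # Character name from the Unicode Character Database, if it has one.
        "name": unicodedata.name(ch, "<unnamed>"),
        # The last four hex digits are the position inside the plane,
        # the remaining digits identify the plane.
        "plane": cp >> 16,
        "position": f"{cp & 0xFFFF:04X}",
        # Bytes needed per encoding form (explicit byte order, so no BOM).
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_bytes": len(ch.encode("utf-16-be")),
        "utf32_bytes": len(ch.encode("utf-32-be")),
    }


# The "utf-16" codec writes a BOM in the machine's native byte order and
# consumes it again when decoding.
with_bom = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
decoded = with_bom.decode("utf-16")

if __name__ == "__main__":
    for ch in ("A", "\u00A9", "\u262F", "\U000130F7", "\U0001F63A"):
        print(describe(ch))
    print(codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF32_LE.hex(), decoded)
```

Encoding with an explicit byte order (utf-16-be, utf-32-be) is a deliberate choice here so that the byte counts are not inflated by a BOM.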