Unicode What Is Unicode? Coverage Standard

Total Page:16

File Type:pdf, Size:1020Kb

Unicode What Is Unicode? Coverage Standard Unicode Coverage ● Examples: – Cherokee https://www.unicode.org/charts/PDF/U13A0.pdf – Peter Jeszenszky Imperial Aramaic https://www.unicode.org/charts/PDF/U10840.pdf – Old Hungarian https://www.unicode.org/charts/PDF/U10C80.pdf Faculty of Informatics, University of Debrecen – Egyptian hieroglyphs [email protected] https://www.unicode.org/charts/PDF/U13000.pdf – Emoticons https://www.unicode.org/charts/PDF/U1F600.pdf Last modified: September 6, 2021 – Alchemical symbols https://www.unicode.org/charts/PDF/U1F700.pdf – … 3 What is Unicode? Standard ● Universal character encoding standard for ● Developed by the Unicode Consortium, a non- written characters and text. profit organization. ● Covers all the characters for all the writing – See: systems of the world, modern and ancient. https://www.unicode.org/consortium/consort.html – It also includes technical symbols, punctuations, ● The current standard is version 13.0.0 (2020 and many other characters used in writing text. March 10). ● Widely used and supported. – See: The Unicode Standard, Version 13.0.0 https://www.unicode.org/versions/Unicode13.0.0/ 2 4 Universal Coded Character Set Code Points (UCS) ● A standard character set defined by ISO. ● When referring to code points, the usual ● The current standard: practice is to refer to them by their numeric – ISO/IEC 10646:2020 Information technology – Universal Coded Character Set (UCS) https://www.iso.org/standard/76835.html value expressed in hexadecimal using four to ● Can be downloaded from here: https://standards.iso.org/ittf/PubliclyAvailableStandards/index.html six digits, with a U+ prefix. ● Developed in conjunction with Unicode. – Leading zeros are omitted, unless the code point – The characters and their code points are the same in both standards. would have fewer than four hexadecimal digits. – The difference is that Unicode imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. – Examples: U+0020, U+265F, U+130E0 ● Further information: – Frequently Asked Questions – Unicode and ISO 10646 https://www.unicode.org/faq/unicode_iso.html 5 7 Basic Concepts Properties ● Codespace: the range of integers used to code ● Unicode associates a rich set of semantics with characters (code the characters. points), these semantics are defined by character properties. – Unicode identifies more than 100 different character properties. ● Code point: an element of the codespace, i.e., ● Examples: an integer encoding a character. – Name – General category (letter, number, symbol, punctuation, …) – Case (uppercase, lowercase, titlecase) – … ● The Unicode Character Database (UCD) contains character properties. https://unicode.org/ucd/ 6 8 Character Names Codespace ● ● Each character is identified by a unique name. The range of integers from 016 to 10FFFF16. – Examples: – The total number of code points is 1,114,112, of ● U+0041 – LATIN CAPITAL LETTER A (A) which 143,859 are used currently. https://www.fileformat.info/info/unicode/char/0041/index.h ● Character code charts: https://www.unicode.org/charts/ tm ● U+2605 – BLACK STAR (★) ) https://www.fileformat.info/info/unicode/char/2605/index.h tm ● U+1F63A – SMILING CAT FACE WITH OPEN MOUTH () ) https://www.fileformat.info/info/unicode/char/1f63a/index. htm 9 11 Characters and Glyphs Planes and Blocks ● The character identified by a Unicode code point is ● The codespace is divided into planes, each of which contains an abstract entity, such as “LATIN CAPITAL LETTER 65,536 (216) code points. A” or “BENGALI DIGIT FIVE”. – The last four hexadecimal digits in each code point indicate a character’s position inside a plane. The remaining digits indicate the plane. ● ● A visual representation of a character is called a For example, U+130F7 is found at location 30F716 in Plane 1. ● glyph. The total number of planes is 17 (016, …, 1016). ● ● The Unicode standard does not define glyph images. Planes are divided into non-overlapping blocks. – Blocks are named ranges, where the number of code points in a block is ● The visual appearance of characters on a device always a multiply of 16. (e.g., screen or printer) is fully left to the software or – Characters used in a single writing system may be found in several hardware responsible for rendering characters. different blocks. 10 12 BMP UTF-32 ● Basic Multilingual Plane (BMP): ● Each code point is represented by 4 bytes – The plane containing the first 65,536 code points (fixed-width character encoding form). (range U+0000–U+FFFF) (Plane 0). ● The simplest Unicode encoding form. – Contains the common-use characters for all the ● The most efficient in terms of processing. modern scripts of the world as well as many historical and rare characters. ● The least efficient encoding in terms of the ● By far the majority of all Unicode characters for almost all number of bytes used. textual data can be found in the BMP. 13 15 Character Encodings UTF-16 ● Character encodings defined by the Unicode ● Each code point is represented by 2 or 4 bytes standard: (variable-width character encoding form): – – UTF-8 Code points in the BMP are represented by using 2 bytes, for all other code points 4 bytes are used. – UTF-16 ● Optimizes the representation of characters in the BMP. – UTF-32 – For code points in the BMP can effectively be treated as if it ● All three encoding forms can be used to were a fixed-width encoding form. represent the full range of Unicode characters. ● Maintains a balance between efficient access and economical use of storage. 14 16 UTF-8 Byte Order (2) ● Each code point is represented by using from 1 to 4 bytes (variable- ● Byte order mark (BOM): width character encoding form): – Code points in the range U+0000–U+007F are represented by a single byte – The character U+FEFF (ZERO WIDTH NO-BREAK (the 128 ASCII characters). SPACE) is used as the byte order mark to indicate – Code points in the range U+0080–U+07FF are represented by using 2 bytes. the byte order at the very beginning of text. – All other code points in the BMP are represented by using 3 bytes. – – Code points outside of the BMP are represented by using 4 bytes. It is not part of the textual content and should be ● The first byte of a byte sequence representing a code points removed before processing. determines the number of bytes in the byte sequence. Encoding Scheme Byte Sequence ● The most compact encoding in terms of the number of bytes used. UTF-16 big-endian FE FF – It is less efficient when used for East Asian writing systems, such as Chinese, Japanese, and Korean. UTF-16 little-endian FF FE UTF-32 big-endian 00 00 FE FF 17 UTF-32 little-endian FF FE 00 00 19 Byte Order (1) Unicode Input ● For the UTF-16 and UTF-32 encoding forms the byte order ● Linux: (big-endian or little-endian) must also be also specified. – In GTK+ applications Unicode characters can be ● Taking account of the byte order, Unicode defines the entered by typing Ctrl-Shift-U, followed by a following seven encoding schemes: hexadecimal Unicode code point. – UTF-8 ● – UTF-16, UTF-16BE, UTF-16LE See: – UTF-32, UTF-32BE, UTF-32LE – Unicode input ● For the UTF-16 and UTF-32 encoding schemes the byte https://en.wikipedia.org/wiki/Unicode_input order is determined by the BOM at the very beginning of text. 18 20 ISO/IEC 8859 ECMAScript ● 8-bit character encoding standards (ISO/IEC 8859-1, … ● In string literals, regular expression literals, template literals , ISO/IEC 8859-16). and identifiers, any Unicode character may also be expressed using Unicode escape sequences of the form: – See: ISO/IEC 8859 – 8-bit single-byte coded graphic character sets – \uhhhh where hhhh is a sequence of four hexadecimal digits representing the code point. ● Relevant encodings for us in Hungary: ● Examples: \u00A9, \u262F – ISO/IEC 8859-1 (Latin-1): for Western European languages – \u{hhhhhh} where hhhhhh is a sequence of one to six – ISO/IEC 8859-2 (Latin-2): for Central European languages hexadecimal digits representing the code point. ● Examples: \u{A9}, \u{1F63A} ● Suitable to be used for the following languages: Albanian, Bosnian, Czech, Croatian, Polish, Hungarian, German, Romanian, Serbian ● See: https://262.ecma-international.org/12.0/#sec-source-text (Latin alphabet), Slovakian, Slovenian, Sorbian. 21 23 CSS JSON ● Unicode characters can be specified with escape sequences of the ● In strings, Unicode characters in the BMP may form \hhhhhh, where hhhhhh is a sequence of one to six hexadecimal digits representing the code point of the Unicode also be expressed using escape sequences of character. the form \uhhhh, where is a sequence of four – If the number of hex digits is less than six and a character in the range [0- hexadecimal digits representing the code point. 9a-fA-F] follows the hexadecimal number, then a whitespace character must end the escape sequence. – Examples: \u00A9, \u262F ● A whitespace character that immediately follows an escape sequence will be ignored. ● See: – Examples: https://www.rfc-editor.org/rfc/rfc8259#section-7 ● \A9, \0A9, …, \0000A9 ● \262F, \0262F, \00262F ● See: https://www.w3.org/TR/css-syntax-3/#escaping 22 24 XML, XHTML Conversion Tools ● In text, attribute values, and literal entity values ● Linux: Unicode characters may also be expressed – Recode (license: GNU GPL v3) using character references of the form: https://github.com/rrthomas/recode/ – &#nnnn; where nnnn is a sequence of decimal ● Example of use: digits representing the code point. $ recode --list $ recode UTF-8..ISO-8859-2 file.txt ● Examples: ©, ☯, 😺 $ recode UTF-8..UTF-16 *.txt – &#xhhhh; where hhhh is a sequence of hexadecimal digits representing the code point. ● Examples: ©, ☯, 😺 25 27 HMTL Useful Links ● A number of Unicode characters may be ● Shapecatcher: Draw the Unicode character you expressed using named character references of want! https://shapecatcher.com/ the form &name;.
Recommended publications
  • 1 Introduction 1
    The Unicode® Standard Version 13.0 – Core Specification To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trade- mark claim, the designations have been printed with initial capital letters or in all capitals. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries. The authors and publisher have taken care in the preparation of this specification, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. © 2020 Unicode, Inc. All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction. For information regarding permissions, inquire at http://www.unicode.org/reporting.html. For information about the Unicode terms of use, please see http://www.unicode.org/copyright.html. The Unicode Standard / the Unicode Consortium; edited by the Unicode Consortium. — Version 13.0. Includes index. ISBN 978-1-936213-26-9 (http://www.unicode.org/versions/Unicode13.0.0/) 1.
    [Show full text]
  • Base64 Character Encoding and Decoding Modeling
    Base64 Character Encoding and Decoding Modeling Isnar Sumartono1, Andysah Putera Utama Siahaan2, Arpan3 Faculty of Computer Science,Universitas Pembangunan Panca Budi Jl. Jend. Gatot Subroto Km. 4,5 Sei Sikambing, 20122, Medan, Sumatera Utara, Indonesia Abstract: Security is crucial to maintaining the confidentiality of the information. Secure information is the information should not be known to the unreliable person, especially information concerning the state and the government. This information is often transmitted using a public network. If the data is not secured in advance, would be easily intercepted and the contents of the information known by the people who stole it. The method used to secure data is to use a cryptographic system by changing plaintext into ciphertext. Base64 algorithm is one of the encryption processes that is ideal for use in data transmission. Ciphertext obtained is the arrangement of the characters that have been tabulated. These tables have been designed to facilitate the delivery of data during transmission. By applying this algorithm, errors would be avoided, and security would also be ensured. Keywords: Base64, Security, Cryptography, Encoding I. INTRODUCTION Security and confidentiality is one important aspect of an information system [9][10]. The information sent is expected to be well received only by those who have the right. Information will be useless if at the time of transmission intercepted or hijacked by an unauthorized person [7]. The public network is one that is prone to be intercepted or hijacked [1][2]. From time to time the data transmission technology has developed so rapidly. Security is necessary for an organization or company as to maintain the integrity of the data and information on the company.
    [Show full text]
  • Unicode Ate My Brain
    UNICODE ATE MY BRAIN John Cowan Reuters Health Information Copyright 2001-04 John Cowan under GNU GPL 1 Copyright • Copyright © 2001 John Cowan • Licensed under the GNU General Public License • ABSOLUTELY NO WARRANTIES; USE AT YOUR OWN RISK • Portions written by Tim Bray; used by permission • Title devised by Smarasderagd; used by permission • Black and white for readability Copyright 2001-04 John Cowan under GNU GPL 2 Abstract Unicode, the universal character set, is one of the foundation technologies of XML. However, it is not as widely understood as it should be, because of the unavoidable complexity of handling all of the world's writing systems, even in a fairly uniform way. This tutorial will provide the basics about using Unicode and XML to save lots of money and achieve world domination at the same time. Copyright 2001-04 John Cowan under GNU GPL 3 Roadmap • Brief introduction (4 slides) • Before Unicode (16 slides) • The Unicode Standard (25 slides) • Encodings (11 slides) • XML (10 slides) • The Programmer's View (27 slides) • Points to Remember (1 slide) Copyright 2001-04 John Cowan under GNU GPL 4 How Many Different Characters? a A à á â ã ä å ā ă ą a a a a a a a a a a a Copyright 2001-04 John Cowan under GNU GPL 5 How Computers Do Text • Characters in computer storage are represented by “small” numbers • The numbers use a small number of bits: from 6 (BCD) to 21 (Unicode) to 32 (wchar_t on some Unix boxes) • Design choices: – Which numbers encode which characters – How to pack the numbers into bytes Copyright 2001-04 John Cowan under GNU GPL 6 Where Does XML Come In? • XML is a textual data format • XML software is required to handle all commercially important characters in the world; a promise to “handle XML” implies a promise to be international • Applications can do what they want; monolingual applications can mostly ignore internationalization Copyright 2001-04 John Cowan under GNU GPL 7 $$$ £££ ¥¥¥ • Extra cost of building-in internationalization to a new computer application: about 20% (assuming XML and Unicode).
    [Show full text]
  • The Unicode Cookbook for Linguists: Managing Writing Systems Using Orthography Profiles
    Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch Year: 2017 The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles Moran, Steven ; Cysouw, Michael DOI: https://doi.org/10.5281/zenodo.290662 Posted at the Zurich Open Repository and Archive, University of Zurich ZORA URL: https://doi.org/10.5167/uzh-135400 Monograph The following work is licensed under a Creative Commons: Attribution 4.0 International (CC BY 4.0) License. Originally published at: Moran, Steven; Cysouw, Michael (2017). The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles. CERN Data Centre: Zenodo. DOI: https://doi.org/10.5281/zenodo.290662 The Unicode Cookbook for Linguists Managing writing systems using orthography profiles Steven Moran & Michael Cysouw Change dedication in localmetadata.tex Preface This text is meant as a practical guide for linguists, and programmers, whowork with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together. The intersection of the Unicode Standard and the International Phonetic Al- phabet is often not met without frustration by users. Nevertheless, thetwo standards have provided language researchers with a consistent computational architecture needed to process, publish and analyze data from many different languages. We bring to light common, but not always transparent, pitfalls that researchers face when working with Unicode and IPA. Our research uses quantitative methods to compare languages and uncover and clarify their phylogenetic relations. However, the majority of lexical data available from the world’s languages is in author- or document-specific orthogra- phies.
    [Show full text]
  • Unicode and Code Page Support
    Natural for Mainframes Unicode and Code Page Support Version 4.2.6 for Mainframes October 2009 This document applies to Natural Version 4.2.6 for Mainframes and to all subsequent releases. Specifications contained herein are subject to change and these changes will be reported in subsequent release notes or new editions. Copyright © Software AG 1979-2009. All rights reserved. The name Software AG, webMethods and all Software AG product names are either trademarks or registered trademarks of Software AG and/or Software AG USA, Inc. Other company and product names mentioned herein may be trademarks of their respective owners. Table of Contents 1 Unicode and Code Page Support .................................................................................... 1 2 Introduction ..................................................................................................................... 3 About Code Pages and Unicode ................................................................................ 4 About Unicode and Code Page Support in Natural .................................................. 5 ICU on Mainframe Platforms ..................................................................................... 6 3 Unicode and Code Page Support in the Natural Programming Language .................... 7 Natural Data Format U for Unicode-Based Data ....................................................... 8 Statements .................................................................................................................. 9 Logical
    [Show full text]
  • Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
    1 Assessment of Options for Handling Full Unicode Character Encodings in MARC21 A Study for the Library of Congress Part 1: New Scripts Jack Cain Senior Consultant Trylus Computing, Toronto 1 Purpose This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format. 1.1 “Encoding Scheme” vs. “Repertoire” An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding schemes. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, & “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal). 1.2 MARC8 "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one byte tables in which each character is encoded using a single 8 bit byte. (It also includes the EACC set which actually uses fixed length 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only.
    [Show full text]
  • Bbedit 13.5 User Manual
    User Manual BBEdit™ Professional Code and Text Editor for the Macintosh Bare Bones Software, Inc. ™ BBEdit 13.5 Product Design Jim Correia, Rich Siegel, Steve Kalkwarf, Patrick Woolsey Product Engineering Jim Correia, Seth Dillingham, Matt Henderson, Jon Hueras, Steve Kalkwarf, Rich Siegel, Steve Sisak Engineers Emeritus Chris Borton, Tom Emerson, Pete Gontier, Jamie McCarthy, John Norstad, Jon Pugh, Mark Romano, Eric Slosser, Rob Vaterlaus Documentation Fritz Anderson, Philip Borenstein, Stephen Chernicoff, John Gruber, Jeff Mattson, Jerry Kindall, Caroline Rose, Allan Rouselle, Rich Siegel, Vicky Wong, Patrick Woolsey Additional Engineering Polaschek Computing Icon Design Bryan Bell Factory Color Schemes Luke Andrews Additional Color Schemes Toothpaste by Cat Noon, and Xcode Dark by Andrew Carter. Used by permission. Additional Icons By icons8. Used under license Additional Artwork By Jonathan Hunt PHP keyword lists Contributed by Ted Stresen-Reuter. Previous versions by Carsten Blüm Published by: Bare Bones Software, Inc. 73 Princeton Street, Suite 206 North Chelmsford, MA 01863 USA (978) 251-0500 main (978) 251-0525 fax https://www.barebones.com/ Sales & customer service: [email protected] Technical support: [email protected] BBEdit and the BBEdit User Manual are copyright ©1992-2020 Bare Bones Software, Inc. All rights reserved. Produced/published in USA. Copyrights, Licenses & Trademarks cmark ©2014 by John MacFarlane. Used under license; part of the CommonMark project LibNcFTP Used under license from and copyright © 1996-2010 Mike Gleason & NcFTP Software Exuberant ctags ©1996-2004 Darren Hiebert (source code here) PCRE2 Library Written by Philip Hazel and Zoltán Herczeg ©1997-2018 University of Cambridge, England Info-ZIP Library ©1990-2009 Info-ZIP.
    [Show full text]
  • Chinese Script Generation Panel Document
    Chinese Script Generation Panel Document Proposal for the Generation Panel for the Chinese Script Label Generation Ruleset for the Root Zone 1. General Information Chinese script is the logograms used in the writing of Chinese and some other Asian languages. They are called Hanzi in Chinese, Kanji in Japanese and Hanja in Korean. Since the Hanzi unification in the Qin dynasty (221-207 B.C.), the most important change in the Chinese Hanzi occurred in the middle of the 20th century when more than two thousand Simplified characters were introduced as official forms in Mainland China. As a result, the Chinese language has two writing systems: Simplified Chinese (SC) and Traditional Chinese (TC). Both systems are expressed using different subsets under the Unicode definition of the same Han script. The two writing systems use SC and TC respectively while sharing a large common “unchanged” Hanzi subset that occupies around 60% in contemporary use. The common “unchanged” Hanzi subset enables a simplified Chinese user to understand texts written in traditional Chinese with little difficulty and vice versa. The Hanzi in SC and TC have the same meaning and the same pronunciation and are typical variants. The Japanese kanji were adopted for recording the Japanese language from the 5th century AD. Chinese words borrowed into Japanese could be written with Chinese characters, while Japanese words could be written using the character for a Chinese word of similar meaning. Finally, in Japanese, all three scripts (kanji, and the hiragana and katakana syllabaries) are used as main scripts. The Chinese script spread to Korea together with Buddhism from the 2nd century BC to the 5th century AD.
    [Show full text]
  • SAS 9.3 UTF-8 Encoding Support and Related Issue Troubleshooting
    SAS 9.3 UTF-8 Encoding Support and Related Issue Troubleshooting Jason (Jianduan) Liang SAS certified: Platform Administrator, Advanced Programmer for SAS 9 Agenda Introduction UTF-8 and other encodings SAS options for encoding and configuration Other Considerations for UTF-8 data Encoding issues troubleshooting techniques (tips) Introduction What is UTF-8? . A character encoding capable of encoding all possible characters Why UTF-8? . Dominant encoding of the www (86.5%) SAS system options for encoding . Encoding – instructs SAS how to read, process and store data . Locale - instructs SAS how to present or display currency, date and time, set timezone values UTF-8 and other Encodings ASSCII (American Standard Code for Information Interchange) . 7-bit . 128 - character set . Examples (code point-char-hex): 32-Space-20; 63-?-3F; 64-@-40; 65-A-41 UTF-8 and other Encodings ISO 8859-1 (Latin-1) for Western European languages Windows-1252 (Latin-1) for Western European languages . 8-bit (1 byte, 256 character set) . Identical to asscii for the first 128 chars . Extended ascii chars examples: . 155-£-A3; 161- ©-A9 . SAS option encoding value: wlatin1 (latin1) UTF-8 and other Encodings UTF-8 and other Encodings Problems . Only covers English and Western Europe languages, ISO-8859-2, …15 . Multiple encoding is required to support national languages . Same character encoded differently, same code point represents different chars Unicode . Unicode – assign a unique code/number to every possible character of all languages . Examples of unicode points: o U+0020 – Space U+0041 – A o U+00A9 - © U+C3BF - ÿ UTF-8 and other Encodings UTF-8 .
    [Show full text]
  • DFDL WG Stephen M Hanson, IBM [email protected] September 2014
    GFD-P-R.207 (OBSOLETED by GFD-P-R.240) Michael J Beckerle, Tresys Technology OGF DFDL WG Stephen M Hanson, IBM [email protected] September 2014 Data Format Description Language (DFDL) v1.0 Specification Status of This Document Grid Final Draft (GFD) Obsoletes This document obsoletes GFD-P-R.174 dated January 2011 [OBSOLETE_DFDL]. Copyright Notice Copyright © Global Grid Forum (2004-2006). Some Rights Reserved. Distribution is unlimited. Copyright © Open Grid Forum (2006-2014). Some Rights Reserved. Distribution is unlimited Abstract This document is OBSOLETE. It is superceded by GFD-P-R.240. This document provides a definition of a standard Data Format Description Language (DFDL). This language allows description of text, dense binary, and legacy data formats in a vendor- neutral declarative manner. DFDL is an extension to the XML Schema Description Language (XSDL). GFD-P-R.207 (OBSOLETED by GFD-P-R.240) September 2014 Contents Data Format Description Language (DFDL) v1.0 Specification ...................................................... 1 1. Introduction ............................................................................................................................... 9 1.1 Why is DFDL Needed? ................................................................................................... 10 1.2 What is DFDL? ................................................................................................................ 10 Simple Example ......................................................................................................
    [Show full text]
  • JS Character Encodings
    JS � Character Encodings Anna Henningsen · @addaleax · she/her 1 It’s good to be back! 2 ??? https://travis-ci.org/node-ffi-napi/get-symbol-from-current-process-h/jobs/641550176 3 So … what’s a character encoding? People are good with text, computers are good with numbers Text List of characters “Encoding” List of bytes List of integers 4 So … what’s a character encoding? People are good with text, computers are good with numbers Hello [‘H’,’e’,’l’,’l’,’o’] 68 65 6c 6c 6f [72, 101, 108, 108, 111] 5 So … what’s a character encoding? People are good with text, computers are good with numbers 你好! [‘你’,’好’] ??? ??? 6 ASCII 0 0x00 <NUL> … … … 65 0x41 A 66 0x42 B 67 0x43 C … … … 97 0x61 a 98 0x62 b … … … 127 0x7F <DEL> 7 ASCII ● 7-bit ● Covers most English-language use cases ● … and that’s pretty much it 8 ISO-8859-*, Windows code pages ● Idea: Usually, transmission has 8 bit per byte available, so create ASCII-extending charsets for more languages ISO-8859-1 (Western) ISO-8859-5 (Cyrillic) Windows-1251 (Cyrillic) (aka Latin-1) … … … … 0xD0 Ð а Р 0xD1 Ñ б С 0xD2 Ò в Т … … … … 9 GBK ● Idea: Also extend ASCII, but use 2-byte for Chinese characters … … 0x41 A 0x42 B … … 0xC4 0xE3 你 0xC4 0xE4 匿 … … 10 https://xkcd.com/927/ 11 Unicode: Multiple encodings! 4d c3 bc 6c 6c (UTF-8) U+004D M “Müll” U+00FC ü 4d 00 fc 00 6c 00 6c 00 (UTF-16LE) U+006C l U+006C l 00 4d 00 fc 00 6c 00 6c (UTF-16BE) 12 Unicode ● New idea: Don’t create a gazillion charsets, and drop 1-byte/2-byte restriction ● Shared character set for multiple encodings: U+XXXX with 4 hex digits, e.g.
    [Show full text]
  • San José, October 2, 2000 Feel Free to Distribute This Text
    San José, October 2, 2000 Feel free to distribute this text (version 1.2) including the author’s email address ([email protected]) and to contact him for corrections and additions. Please do not take this text as a literal translation, but as a help to understand the standard GB 18030-2000. Insertions in brackets [] are used throughout the text to indicate corresponding sections of the published Chinese standard. Thanks to Markus Scherer (IBM) and Ken Lunde (Adobe Systems) for initial critical reviews of the text. SUMMARY, EXPLANATIONS, AND REMARKS: CHINESE NATIONAL STANDARD GB 18030-2000: INFORMATION TECHNOLOGY – CHINESE IDEOGRAMS CODED CHARACTER SET FOR INFORMATION INTERCHANGE – EXTENSION FOR THE BASIC SET (信息技术-信息交换用汉字编码字符集 Xinxi Jishu – Xinxi Jiaohuan Yong Hanzi Bianma Zifuji – Jibenji De Kuochong) March 17, 2000, was the publishing date of the Chinese national standard (国家标准 guojia biaozhun) GB 18030-2000 (hereafter: GBK2K). This standard tries to resolve issues resulting from the advent of Unicode, version 3.0. More specific, it attempts the combination of Uni- code's extended character repertoire, namely the Unihan Extension A, with the character cov- erage of earlier Chinese national standards. HISTORY The People’s Republic of China had already expressed her fundamental consent to support the combined efforts of the ISO/IEC and the Unicode Consortium through publishing a Chinese National Standard that was code- and character-compatible with ISO 10646-1/ Unicode 2.1. This standard was named GB 13000.1. Whenever the ISO and the Unicode Consortium changed or revised their “common” standard, GB 13000.1 adopted these changes subsequently. In order to remain compatible with GB 2312, however, which at the time of publishing Unicode/GB 13000.1 was an already existing national standard widely used to represent the Chinese “simplified” characters, the “specification” GBK was created.
    [Show full text]