10.1 Han CJK Unified Ideographs
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
1 Assessment of Options for Handling Full Unicode Character Encodings in MARC21 A Study for the Library of Congress Part 1: New Scripts Jack Cain Senior Consultant Trylus Computing, Toronto 1 Purpose This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format. 1.1 “Encoding Scheme” vs. “Repertoire” An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding schemes. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, & “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal). 1.2 MARC8 "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one byte tables in which each character is encoded using a single 8 bit byte. (It also includes the EACC set which actually uses fixed length 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only. -
ISO/IEC JTC1/SC2/WG2 N 3716 Date: 2009-10-29
ISO/IEC JTC1/SC2/WG2 N 3716 Date: 2009-10-29 ISO/IEC JTC1/SC2/WG2 Coded Character Set Secretariat: Japan (JISC) Doc. Type: Disposition of comments Title: Disposition of comments on SC2 N 4079 (ISO/IEC CD 10646, Information Technology – Universal Coded Character Set (UCS)) Source: Michel Suignard (project editor) Project: JTC1 02.10646.00.00.00.02 Status: For review by WG2 Date: 2009-10-14 Distribution: WG2 Reference: SC2 N4079, 4088, WG2 N3691 Medium: Paper, PDF file Comments were received from India, Japan, Korea (ROK), U.K, and U.S.A. The following document is the draft disposition of those comments. The disposition is organized per country. Note – The full content of the ballot comments have been included in this document to facilitate the reading. The dispositions are inserted in between these comments and are marked in Underlined Bold Serif text, with explanatory text in italicized serif. As a result of these dispositions, only one country (Korea, ROK) maintained its NO vote. Page 1 of 31 India: Positive with comments Technical comments T1 Proposal to add one character in the Arabic block for representation of Kasmiri and annotation of existing characters Kashmiri: Perso-Arabic script The Kashmiri language is mainly written in the Perso-Arabic script. Arabic is already encoded in the Unicode to cater the requirement of Arabic based languages which includes Urdu, Kashmiri and Sindhi. Experts at Shrinagar University examined the Arabic Unicode for representation of Kashmiri language and opined that few characters need to be added for representation of the Kashmiri using Perso-Arabic script. -
Proposal for a Korean Script Root Zone LGR 1 General Information
(internal doc. #: klgp220_101f_proposal_korean_lgr-25jan18-en_v103.doc) Proposal for a Korean Script Root Zone LGR LGR Version 1.0 Date: 2018-01-25 Document version: 1.03 Authors: Korean Script Generation Panel 1 General Information/ Overview/ Abstract The purpose of this document is to give an overview of the proposed Korean Script LGR in the XML format and the rationale behind the design decisions taken. It includes a discussion of relevant features of the script, the communities or languages using it, the process and methodology used and information on the contributors. The formal specification of the LGR can be found in the accompanying XML document below: • proposal-korean-lgr-25jan18-en.xml Labels for testing can be found in the accompanying text document below: • korean-test-labels-25jan18-en.txt In Section 3, we will see the background on Korean script (Hangul + Hanja) and principal language using it, i.e., Korean language. The overall development process and methodology will be reviewed in Section 4. The repertoire and variant groups in K-LGR will be discussed in Sections 5 and 6, respectively. In Section 7, Whole Label Evaluation Rules (WLE) will be described and then contributors for K-LGR are shown in Section 8. Several appendices are included with separate files. proposal-korean-lgr-25jan18-en 1 / 73 1/17 2 Script for which the LGR is proposed ISO 15924 Code: Kore ISO 15924 Key Number: 287 (= 286 + 500) ISO 15924 English Name: Korean (alias for Hangul + Han) Native name of the script: 한글 + 한자 Maximal Starting Repertoire (MSR) version: MSR-2 [241] Note. -
Alan Ng Systems Librarian the Chinese University of Hong Kong Library Agenda
Enhancing Chinese character normalization in Primo with the HKIUG TSVCC mapping table Alan Ng Systems Librarian The Chinese University of Hong Kong Library Agenda ❖ Primo out-of-box character normalization ❖ Background on CJK normalization ❖ HKIUG TSVCC mapping table ❖ Implementing TSVCC on Primo About CUHK Library ❖ established in 1963 ❖ 7 branches ❖ 200 staff ❖ 260K current patrons ❖ 130K journal subscriptions, 4.5M ebooks, 2.5M printed volumes ❖ special collections includes from oracles bones, Chinese rare books, modern Chinese literary archives … character normalization ❖ different people type differently ❖ normal to expect “Apple” will have the exact results from “aPPLE”, “ApPle”, “appLE” … ❖ before indexing, Primo will first “clean up” (normalize) the data to its lower case (e.g. A —> a, B —> b …) ❖ Primo FE will do the same for the search term typed by users, to get a match with the index Primo out-of-box normalizations ❖ Primo provides OTB normalizations for different languages at: ❖ /exlibris/primo/p4_1/ng/jaguar/home/profile/ analysis/specialCharacters/CharConversion/OTB/ OTB ❖ e.g. ❖ latin languages (non_cjk_unicode_normalization.txt) ❖ CJK (cjk_unicode_trad_to_simp_normalization.txt) OTB CJK normalization table ❖ 2700+ entries ❖ mainly for mapping Traditional Chinese into its Simplified form ❖ assume it is a 1:1 mapping, Simplified Chinese being the “lowercase” like the English language ❖ But in fact, Simplified Chinese is only one kind of variant form for Chinese character ❖ other variant forms (ideograph) of the same character need to be cover as well extract of the OTB table background on CJK ❖ Traditional Chinese characters have been used since as early as 2nd centuryBC (Han Dynasty, 漢朝) ❖ used by people in Taiwan, Hong Kong and Macau ❖ Simplified Chinese characters were introduced by PRC government during 1950’s ❖ used by people in PRC, SE Asia countries e.g. -
CJK Compatibility Ideographs Range: F900–FAFF
CJK Compatibility Ideographs Range: F900–FAFF This file contains an excerpt from the character code tables and list of character names for The Unicode Standard, Version 14.0 This file may be changed at any time without notice to reflect errata or other updates to the Unicode Standard. See https://www.unicode.org/errata/ for an up-to-date list of errata. See https://www.unicode.org/charts/ for access to a complete list of the latest character code charts. See https://www.unicode.org/charts/PDF/Unicode-14.0/ for charts showing only the characters added in Unicode 14.0. See https://www.unicode.org/Public/14.0.0/charts/ for a complete archived file of character code charts for Unicode 14.0. Disclaimer These charts are provided as the online reference to the character contents of the Unicode Standard, Version 14.0 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. For a complete understanding of the use of the characters contained in this file, please consult the appropriate sections of The Unicode Standard, Version 14.0, online at https://www.unicode.org/versions/Unicode14.0.0/, as well as Unicode Standard Annexes #9, #11, #14, #15, #24, #29, #31, #34, #38, #41, #42, #44, #45, and #50, the other Unicode Technical Reports and Standards, and the Unicode Character Database, which are available online. See https://www.unicode.org/ucd/ and https://www.unicode.org/reports/ A thorough understanding of the information contained in these additional sources is required for a successful implementation. -
Hong Kong Supplementary Character Set – 2016 (Draft)
中 文 界 面 諮 詢 委 員 會 工 作 小 組 文 件 編 號 2017/02 (B) Hong Kong Supplementary Character Set – 2016 (Draft) Office of the Government Chief Information Officer & Official Languages Division, Civil Service Bureau The Government of the Hong Kong Special Administrative Region April 2017 1/21 中 文 界 面 諮 詢 委 員 會 工 作 小 組 文 件 編 號 2017/02 (B) Table of Contents Preface Section 1 Overview……………….……………………………………………. 1 - 1 Section 2 Coding Scheme of the HKSCS–2016….……………………………. 2 - 1 Section 3 HKSCS–2016 under the Architecture of the ISO/IEC 10646………. 3 - 1 Table 1: Code Table of the HKSCS–2016……………………………………….. i - 1 Table 2: Newly Included Characters in the HKSCS–2016...………………….…. ii - 1 Table 3: Compatibility Characters in the HKSCS–2016…......………………..…. iii - 1 2/21 中 文 界 面 諮 詢 委 員 會 工 作 小 組 文 件 編 號 2017/02 (B) Preface After the first release of the Hong Kong Supplementary Character Set (HKSCS) in 1999, there have been three updated versions. The HKSCS-2001, HKSCS-2004 and HKSCS-2008 were published with 116, 123 and 68 new characters added respectively. A total of 5 009 characters were included in the HKSCS-2008. These publications formed the foundation for promoting the adoption of the ISO/IEC 10646 international coding standard, and were widely supported and adopted by the IT sector and members of the public. The ISO/IEC 10646 international coding standard is developed by the International Organization for Standardization (ISO) to provide a common technical basis for the storage and exchange of electronic information. -
Wg2 N5125 & L2/19-386
ISO/IEC JTC1/SC2/WG2 N5125 L2/19-386 Title: Proposal to change the code chart font for 21 blocks in Unicode Version 14.0 Author: Ken Lunde Date: 2019-12-02 This document proposes to change the code chart font for 21 blocks—affecting 1,762 charac- ters—that are CJK-related or otherwise include characters that are used primarily in CJK con- texts and are therefore supported in fonts for typical CJK typefaces. The suggested target for this change is Unicode Version 14.0 (2021). The purpose of this font change is four-fold: • Reduce the number of fonts that are necessary for code chart production. • Improve the consistency of the glyphs for similar or related characters by using a uniform typeface design and typeface style, at least for characters whose glyphs do not need to ad- here to a particular specification.. • Simplify code chart font management by covering as many complete blocks as possible using a single font. • Ensure that the font is accessible via open source. The actual font, CJK Symbols, will be open-sourced in both TrueType and OpenType/CFF for- mats under the terms of the SIL Open Font License, Version 1.1, and therefore can be updated to include glyphs for additional blocks, or to add glyphs to blocks that are already supported. This is maintenance work that I plan to take on for the foreseeable future. Code Tables This and subsequent pages include left-to-right, top-to-bottom code tables that serve as a glyph synopsis for the CJK Symbols font. -
Suggestions for the ISO/IEC 14651 CTT Part for Hangul
SC22/WG20 N891R ISO/IEC JTC 1/SC2/WG2 N2405R L2/01-469 (formerly L2/01-405) Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation internationale de normalisation Title: Ordering rules for Hangul Source: Kent Karlsson Date: 2001-11-29 Status: Expert Contribution Document Type: Working Group Document Action: For consideration by the UTC, JTC 1/SC 2/WG 2’s ad hoc on Korean, and JTC 1/SC 22/WG 20 1 Introduction The Hangul script as such is very elegantly designed. However, its incarnation in 10646/Unicode is far from elegant. This paper is about restoring the elegance of Hangul, as much as it can be restored, for the process of string ordering. 1.1 Hangul syllables A lot of Hangul syllables have a character of their own in the range AC00-D7A3. They each have a canonical decomposition into two (choseong, jungseong) or three (choseong, jungseong, jongseong) Hangul Jamo characters in the ranges 1100-1112, 1161-1175, and 11A8-11C2. The choseong are leading consonants, one of which is mute. The jungseong are vowels. And the jongseong are trailing consonants. A Hangul Jamo character is either a letter or letter cluster. The Hangul syllable characters alone can represent most modern Hangul words. They cannot represent historic Hangul words (Middle Korean), nor modern/future Hangul words using syllables not preallocated. However, all Hangul words can elegantly be represented by sequences of single-letter Hangul Jamo characters plus optional tone mark. 1 1.2 Single-letter and cluster Hangul Jamo characters Cluster Hangul Jamo characters represent either clusters of two or three consonants, or clusters of two or three vowels. -
Jamo Pair Encoding: Subcharacter Representation-Based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 3490–3497 Marseille, 11–16 May 2020 c European Language Resources Association (ELRA), licensed under CC-BY-NC Jamo Pair Encoding: Subcharacter Representation-based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization Sangwhan Moonyz, Naoaki Okazakiy Tokyo Institute of Technologyy, Odd Concepts Inc.z, [email protected], [email protected] Abstract In the context of multilingual language model pre-training, vocabulary size for languages with a broad set of potential characters is an unsolved problem. We propose two algorithms applicable in any unsupervised multilingual pre-training task, increasing the elasticity of budget required for building the vocabulary in Byte-Pair Encoding inspired tokenizers, significantly reducing the cost of supporting Korean in a multilingual model. Keywords: tokenization, vocabulary compaction, sub-character representations, out-of-vocabulary mitigation 1. Background BPE. Roughly, the minimum size of the subword vocab- ulary can be approximated as jV j ≈ 2jV j, where V is the With the introduction of large-scale language model pre- c minimal subword vocabulary, and V is the character level training in the domain of natural language processing, the c vocabulary. domain has seen significant advances in the performance Since languages such as Japanese require at least 2000 char- of downstream tasks using transfer learning on pre-trained acters to express everyday text, in a multilingual training models (Howard and Ruder, 2018; Devlin et al., 2018) when setup, one must make a tradeoff. One can reduce the av- compared to conventional per-task models. As a part of this erage surface of each subword for these character vocabu- trend, it has also become common to perform this form of lary intensive languages, or increase the vocabulary size. -
2 Hangul Jamo Auxiliary Canonical Decomposition Mappings
DRAFT Unicode technical note NN Auxiliary character decompositions for supporting Hangul Kent Karlsson 2006-09-24 1 Introduction The Hangul script is very elegantly designed. There are just a small number of letters (28, plus a small number of variant letters introduced later, but the latter have fallen out of use) and even a featural design philosophy for the shapes of the letters. However, the incarnation of Hangul as characters in ISO/IEC 10646 and Unicode is not so elegant. In particular, there are many Hangul characters that are not needed, for precomposed letter clusters as well as precomposed syllable characters. The precomposed syllables have arithmetically specified canonical decompositions into Hangul jamos (conjoining Hangul letters). But unfortunately the letter cluster Hangul jamos do not have canonical decompositions to their constituent letters, which they should have had. This leads to multiple representations for exactly the same sequence of letters. There is not even any compatibility-like distinction; i.e. no (intended) font difference, no (intended) width difference, no (intended) ligaturing difference of any kind. They have even lost the compatibility decompositions that they had in Unicode 2.0. There are also some problems with the Hangul compatibility letters, and their proper compatibility decompositions to Hangul jamo characters. Just following their compatibility decompositions in UnicodeData.txt does not give any useful results in any setting. In this paper and its two associated datafiles these problems are addressed. Note that no changes to the standard Unicode normal forms (NFD, NFC, NFKD, and NFKC) are proposed, since these normal forms are stable for already allocated characters. -
CJK Compatibility Ideographs Supplement Range: 2F800–2FA1F
CJK Compatibility Ideographs Supplement Range: 2F800–2FA1F This file contains an excerpt from the character code tables and list of character names for The Unicode Standard, Version 14.0 This file may be changed at any time without notice to reflect errata or other updates to the Unicode Standard. See https://www.unicode.org/errata/ for an up-to-date list of errata. See https://www.unicode.org/charts/ for access to a complete list of the latest character code charts. See https://www.unicode.org/charts/PDF/Unicode-14.0/ for charts showing only the characters added in Unicode 14.0. See https://www.unicode.org/Public/14.0.0/charts/ for a complete archived file of character code charts for Unicode 14.0. Disclaimer These charts are provided as the online reference to the character contents of the Unicode Standard, Version 14.0 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. For a complete understanding of the use of the characters contained in this file, please consult the appropriate sections of The Unicode Standard, Version 14.0, online at https://www.unicode.org/versions/Unicode14.0.0/, as well as Unicode Standard Annexes #9, #11, #14, #15, #24, #29, #31, #34, #38, #41, #42, #44, #45, and #50, the other Unicode Technical Reports and Standards, and the Unicode Character Database, which are available online. See https://www.unicode.org/ucd/ and https://www.unicode.org/reports/ A thorough understanding of the information contained in these additional sources is required for a successful implementation. -
Areal Script Form Patterns with Chinese Characteristics James Myers
Areal script form patterns with Chinese characteristics James Myers National Chung Cheng University http://personal.ccu.edu.tw/~lngmyers/ To appear in Written Language & Literacy This study was made possible through a grant from Taiwan’s Ministry of Science and Technology (103-2410-H-194-119-MY3). Iwano Mariko helped with katakana and Minju Kim with hangul, while Tsung-Ying Chen, Daniel Harbour, Sven Osterkamp, two anonymous reviewers and the special issue editors provided all sorts of useful suggestions as well. Abstract It has often been claimed that writing systems have formal grammars structurally analogous to those of spoken and signed phonology. This paper demonstrates one consequence of this analogy for Chinese script and the writing systems that it has influenced: as with phonology, areal script patterns include the borrowing of formal regularities, not just of formal elements or interpretive functions. Whether particular formal Chinese script regularities were borrowed, modified, or ignored also turns out not to depend on functional typology (in morphemic/syllabic Tangut script, moraic Japanese katakana, and featural/phonemic/syllabic Korean hangul) but on the benefits of making the borrowing system visually distinct from Chinese, the relative productivity of the regularities within Chinese character grammar, and the level at which the borrowing takes place. Keywords: Chinese characters, Tangut script, Japanese katakana, Korean hangul, writing system grammar, script outer form, areal patterns 1. Areal phonological patterns and areal script patterns Sinoform writing systems look Chinese, even when they are functionally quite different. The visual traits of non-logographic Japanese katakana and Korean hangul are nontrivially like those of logographic Chinese script, as are those of the logographic but structurally unique script of Tangut, an extinct Tibeto-Burman language of what is now north-central China.