Jamo Pair Encoding: Subcharacter Representation-Based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization

Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 3490–3497, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.

Sangwhan Moon†‡, Naoaki Okazaki†
†Tokyo Institute of Technology, ‡Odd Concepts Inc.
[email protected], [email protected]

Abstract

In the context of multilingual language model pre-training, vocabulary size for languages with a broad set of potential characters is an unsolved problem. We propose two algorithms applicable in any unsupervised multilingual pre-training task, increasing the elasticity of the budget required for building the vocabulary in Byte-Pair Encoding inspired tokenizers, significantly reducing the cost of supporting Korean in a multilingual model.

Keywords: tokenization, vocabulary compaction, sub-character representations, out-of-vocabulary mitigation

1. Background

With the introduction of large-scale language model pre-training in the domain of natural language processing, the field has seen significant advances in the performance of downstream tasks using transfer learning on pre-trained models (Howard and Ruder, 2018; Devlin et al., 2018) when compared to conventional per-task models. As a part of this trend, it has also become common to perform this form of pre-training against multiple languages when training a single model. For these multilingual pre-training cases, state-of-the-art methods have relied on subword based tokenizers, such as Byte-Pair Encoding (BPE) (Sennrich et al., 2016) or SentencePiece (Kudo and Richardson, 2018), as a robust mechanism to mitigate the out-of-vocabulary (OOV) problem at the tokenizer level, by falling back to a character level vocabulary. Not only have these methods been shown to be robust against OOV compared to standard lexicon-based tokenization methods, but they have also benefited from a computational cost perspective, as they reduce the size of the input and output layers.

While these methods have shown significant improvements in alphabetic languages such as English and other Western European languages that use a Latin alphabet, they have limitations when applied to languages that have a large and diverse character level vocabulary, such as the Chinese, Japanese, and Korean (CJK) languages.

In this paper, we describe the challenges of subword tokenization when applied to CJK. We discuss how Korean differs from the other CJK languages, how to take advantage of that difference when a subword tokenizer is used, and finally propose a subword tokenizer-agnostic method which allows the tokenizer to take advantage of Korean-specific properties.

2. Problem Definition

CJK languages, due to the strong linguistic dependency on words borrowed from Chinese as part of their vocabulary, need a much more extensive range of characters to express the language compared to alphabetic (e.g., Latin) languages. This reflects directly on the vocabulary budget requirements of an algorithm that builds a subword vocabulary on character pairs, such as BPE. Roughly, the minimum size of the subword vocabulary can be approximated as |V| ≈ 2|V_c|, where V is the minimal subword vocabulary and V_c is the character level vocabulary.

Since languages such as Japanese require at least 2,000 characters to express everyday text, in a multilingual training setup one must make a tradeoff. One can reduce the average surface of each subword for these character-vocabulary-intensive languages, or increase the vocabulary size. The former trades off the performance and representational power of the model, and the latter has a computational cost. Similar problems also apply to Chinese, as it shares a significant portion of the character level vocabulary. However, this also allows some level of sharing, which reduces the final budget needed for the vocabulary.

Korean is an outlier in the CJK family: linguistically it has a shared vocabulary in terms of roots, but it uses an entirely different character representation. A straightforward approach would be to share the character level vocabulary between the CJK languages, as was possible between Chinese and Japanese. Unfortunately, this is not a straightforward operation, as Hangul (the Korean writing system) is phonetic, unlike the other two examples. This means that while the lexicon may have the exact same roots, the phonetic transcription is challenging to inverse-transform algorithmically. Doing so requires comprehension of the context to select the most likely candidate, which would be analogous to a quasi-masked language modeling task.

3. Related Work and Background

The fundamental idea of using characters is not new; in the past, many character-level approaches have been proposed in the form of task-specific architectures. There are also sub-character level methods analogous to our method, all of which we discuss in the language-specific sections below.

3.1. Non-Korean Languages

A study on a limited subset of Brahmic languages (Ding et al., 2018) proposes a method which can be used to reduce the vocabulary budget needed for all languages by generalizing, simplifying, then aligning multiple language alphabets together.
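The vocabulary budget problem above stems from how BPE builds its vocabulary: every merged symbol is the concatenation of two existing entries, so the character-level fallback vocabulary V_c sets a floor on the final subword vocabulary. As a minimal illustration only (a toy sketch, not the code or tokenizer used in this paper; the function name and corpus are hypothetical), a BPE-style merge loop looks like this:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    Returns the resulting vocabulary: all fallback characters plus one
    new symbol per merge, which is why |V| grows from |V_c| upward.
    """
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = Counter(tuple(w) for w in words)
    vocab = {s for w in corpus for s in w}  # character-level fallback vocab
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        # Re-segment the corpus with the new merged symbol.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab
```

For Korean written as precomposed syllables, the fallback alphabet alone can reach 11,172 characters (the full Hangul syllable block), which is what makes the character-level floor so expensive in a multilingual vocabulary.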
This is applicable when the writing systems have genealogical relations that allow this form of alignment. Previous works (He et al., 2018; Shi et al., 2015; Sun et al., 2014; Yin et al., 2016) demonstrate the potential of sub-character based methods for the Chinese and Japanese members of the CJK languages through radical based decomposition.

3.2. Korean

Korean, as shown in Figure 1, builds on a small phonetic alphabet, using a combination of consonants and vowels called Jamo as building blocks and composing each character out of a combination of those. Following the notation in (Stratos, 2017), the Jamo alphabet J, with |J| = 51, is a union defined as J = J_h ∪ J_v ∪ J_t, where J_h is Choseong (head consonant), J_v is Jungseong (vowel), and J_t is Jongseong (tail consonant). Table 1 lists the sub-characters along with each possible role they take when forming a complete character. The first example illustrated in Figure 1 can therefore be explained as ㄱ ∈ J_h, ㅏ ∈ J_v, and ㅇ ∈ J_t. Note that J_t can be omitted, in which case it corresponds to <nil> ∈ J_t.

Figure 1: Transformation process and Hangul Jamo sub-character composition (e.g., 江 → phonetic 강 → Jamo ㄱ[choseong] ㅏ[jungseong] ㅇ[jongseong]; 宮 → phonetic 궁 → Jamo ㄱ[choseong] ㅜ[jungseong] ㅇ[jongseong]). In the real world, Hangul to Chinese almost always has a 1:n mapping.

Table 1: Hangul Jamo sub-characters, along with their respective positional roles for composition.

  J_h, Choseong:  ㄱ ㄲ ㄴ ㄷ ㄸ ㄹ ㅁ ㅂ ㅃ ㅅ ㅆ ㅇ ㅈ ㅉ ㅊ ㅋ ㅌ ㅍ ㅎ
  J_v, Jungseong: ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅗ ㅘ ㅙ ㅚ ㅛ ㅜ ㅝ ㅞ ㅟ ㅠ ㅡ ㅢ ㅣ
  J_t, Jongseong: <nil> ㄱ ㄲ ㄳ ㄴ ㄵ ㄶ ㄷ ㄹ ㄺ ㄻ ㄼ ㄽ ㄾ ㄿ ㅀ ㅁ ㅂ ㅄ ㅅ ㅆ ㅇ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ

Exploiting these characteristics is not a new idea, and has been explored in the context of an end-to-end architecture in (Stratos, 2017), as a method for learning subword embeddings in (Park et al., 2018), and for classification in (Cho et al., 2019). One significant contribution of our method is its guarantee of round-trip consistency. The previous work discussed above also covers sub-character (Jamo) based methods, but the evaluation was limited to tasks that do not require generation, and reconstruction was not discussed. We attempt to address this limitation in our work. This is analogous to how subword tokenization methods brought to the field guarantees of lossless encoding and decoding, which was not possible with conventional lossy encoding methods such as lemmatization, stemming, and other normalization methods.

Figure 2: In this example, the standard form 하다 (to do) can be conjugated into many forms. The first two Jamo correspond to a common morpheme, which through agglutination becomes different conjugations:

  Standard Form:                   하다   → ㅎㅏㄷㅏ
  Declarative Present Formal Low:  한다   → ㅎㅏㄴㄷㅏ
  Declarative Future Informal Low: 할거야 → ㅎㅏㄹㄱㅓㅇㅑ
  Declarative Present Formal High: 합니다 → ㅎㅏㅂㄴㅣㄷㅏ

4. Method

The core motivation of our method is to remove the Unicode bottleneck described in the previous section and to expose the alphabet composition characteristics and the agglutinative nature of Korean to the subword tokenizers. We also introduce a hard round-trip requirement, which guarantees lossless reconstruction of the transformed text back to the original input while not introducing any unexpected side-effects when training with other languages.

Our method uses special unused characters in the Hangul Unicode page as hints for the processors to operate on. Tokenizers will treat these invisible characters the same as a standard character in the same Unicode page. This is crucial for tokenizer implementations that treat pairs from different Unicode pages as non-mergeable.

The method itself is implemented and provided as two modules, with two different algorithms. The two modules are the pre-processor, which performs the Jamo decomposition, and the post-processor, which reconstructs the text back into a more human-readable form. The post-processor is only needed for generative tasks in which the output will be presented to a human. The pre-processor needs to be run at least once for each input.

We propose two different algorithms: a simple method, which aligns the output to a character grid of 3, and a complex method, which does not have alignment and instead relies on an automaton to reconstruct the transformed text. Both methods prefix orphaned Jamo (e.g., ㅋㅋ, which roughly means "laugh out loud" and is extremely common in internet corpora) with a post-processor hint. These methods will be referenced as aligned and automaton in later parts of our paper.

The two methods have different characteristics. Aligned can be reconstructed with an extremely simple post-processor, and has much higher guarantees for reconstruction.

  Input:  강     강가        ㅋㅋ
  Output: ㄱㅏㅇ  ㄱㅏㅇㄱㅏ<f>  <o>ㅋ<o>ㅋ
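The Jamo decomposition performed by the pre-processor rests on standard Unicode arithmetic for the precomposed syllable block (U+AC00–U+D7A3); each syllable's code point encodes its Choseong, Jungseong, and Jongseong indices. The sketch below illustrates only that standard arithmetic, with hypothetical function names; it is not the paper's released modules and omits the <f> filler and <o> orphan hints of the aligned and automaton variants:

```python
# Standard Unicode arithmetic for precomposed Hangul syllables (U+AC00-U+D7A3).
S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11,172 precomposed syllables

# Compatibility Jamo, indexed by Choseong/Jungseong/Jongseong position.
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def decompose(ch):
    """Split one syllable into (choseong, jungseong, jongseong) indices."""
    code = ord(ch) - S_BASE
    if not 0 <= code < S_COUNT:
        raise ValueError(f"{ch!r} is not a precomposed Hangul syllable")
    cho, rest = divmod(code, V_COUNT * T_COUNT)
    jung, jong = divmod(rest, T_COUNT)
    return cho, jung, jong  # jong == 0 corresponds to <nil> (no tail consonant)

def compose(cho, jung, jong):
    """Lossless inverse transform back to the precomposed syllable."""
    return chr(S_BASE + (cho * V_COUNT + jung) * T_COUNT + jong)

def to_jamo(text):
    """Pre-processor sketch: rewrite syllables as Jamo, pass others through."""
    out = []
    for ch in text:
        if 0 <= ord(ch) - S_BASE < S_COUNT:
            cho, jung, jong = decompose(ch)
            out.append(CHO[cho] + JUNG[jung] + JONG[jong])
        else:
            out.append(ch)
    return "".join(out)
```

For example, to_jamo("강가") yields "ㄱㅏㅇㄱㅏ", and compose(*decompose(ch)) round-trips every syllable in the block, which is the property the hard round-trip requirement builds on.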