Coarse-To-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database


UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database

Canwen Xu1*, Tao Ge2, Chenliang Li3, Furu Wei2
1 University of California, San Diego  2 Microsoft Research Asia  3 Wuhan University
1 [email protected]  2 {tage,fuwei}@microsoft.com  3 [email protected]

Abstract

Chinese and Japanese share many characters with similar surface morphology. To better utilize the shared knowledge across the languages, we propose UnihanLM, a self-supervised Chinese-Japanese pretrained masked language model (MLM) with a novel two-stage coarse-to-fine training approach. We exploit Unihan, a ready-made database constructed by linguistic experts, to first merge morphologically similar characters into clusters. The resulting clusters are used to replace the original characters in sentences for the coarse-grained pretraining of the MLM. Then, we restore the clusters back to the original characters in sentences for the fine-grained pretraining to learn the representations of the specific characters. We conduct extensive experiments on a variety of Chinese and Japanese NLP benchmarks, showing that our proposed UnihanLM is effective on both mono- and cross-lingual Chinese and Japanese tasks, shedding light on a new path to exploit the homology of languages.¹

JA    台¹風²は熱³帯⁴低気⁵圧⁶の一種⁷です。
T-ZH  颱¹風²是熱³帶⁴低氣⁵壓⁶的一種⁷。
S-ZH  台¹风²是热³带⁴低气⁵压⁶的一种⁷。
EN    Typhoon is a type of tropical depression.

Table 1: A sentence example in Japanese (JA), Traditional Chinese (T-ZH) and Simplified Chinese (S-ZH) with its English translation (EN). The characters that already share the same Unicode are marked with an underline. In this work, we further match characters with identical meanings but different Unicode, then merge them. Characters eligible to be merged together are marked with the same superscript.

1 Introduction

Recently, pretrained language models have shown promising performance on many NLP tasks (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019c; Lan et al., 2020). Many attempts have been made to train a model that supports multiple languages. Among them, Multilingual BERT (mBERT) (Devlin et al., 2019), released as part of BERT, directly adopts the same model architecture and training objective and is trained on Wikipedia in multiple languages. XLM (Lample and Conneau, 2019) adds a language embedding and a new training objective (translation language modeling, TLM). XLM-R (Conneau et al., 2019) has a larger size and is trained with more data. Based on XLM, Unicoder (Huang et al., 2019) collects more data and uses multi-task learning to train on three supervised tasks.

The consensus of cross-lingual approaches is to allow lexical information to be shared between languages. XLM and mBERT exploit shared lexical information through Byte Pair Encoding (BPE) (Sennrich et al., 2016) and WordPiece (Wu et al., 2016), respectively. However, these automatically learned shared representations have been criticized by recent work (K et al., 2020), which reveals their limitations in sharing meaningful semantics across languages. Also, words in both Chinese and Japanese are short, which prevents effective learning of sub-word representations. Unlike European languages, Chinese and Japanese naturally share Chinese characters as a subword component. Early work (Chu et al., 2013) shows that shared characters in these two languages can benefit Example-based Machine Translation (EBMT) with statistical phrase extraction and alignment. For Neural Machine Translation (NMT), Zhang and Komachi (2019) exploited such information by learning a BPE representation over sub-character (i.e., ideograph and stroke) sequences. They applied this technique to unsupervised Chinese-Japanese machine translation and achieved state-of-the-art performance. However, this approach relies heavily on unreliable automatic BPE learning and may suffer from the noise introduced by character variants.

To facilitate lexical sharing, we propose the Unihan Language Model (UnihanLM), a cross-lingual pretrained masked language model for Chinese and Japanese. We propose a two-stage coarse-to-fine pretraining procedure to enable better generalization and take advantage of the characters shared among Japanese, Traditional Chinese and Simplified Chinese. First, we let the model exploit the maximum possible shared lexical information. Instead of learning a shared sub-word vocabulary as in prior work, we leverage the Unihan database (Jenkins et al., 2019), a ready-made constituent of the Unicode standard, to extract the shared lexical information across the languages. By exploiting this database, we can effectively merge characters with similar surface morphology but independent Unicode code points, as shown in Table 1, into thousands of clusters. The clusters are used to replace the characters in sentences during the first-stage coarse-grained pretraining. After the coarse-grained pretraining finishes, we restore the clusters back to the original characters, initialize each character's representation with its corresponding cluster's representation, and then learn character-specific representations during the second-stage fine-grained pretraining. In this way, our model can make full use of shared characters while maintaining a good sense of the nuances of similar characters.

To verify the effectiveness of our approach, we evaluate on both lexical and semantic tasks in Chinese and Japanese. On word segmentation, our model outperforms monolingual and multilingual BERT (Devlin et al., 2019) and shows much higher performance on cross-lingual zero-shot transfer. Also, our model achieves state-of-the-art performance on unsupervised Chinese-Japanese machine translation, and is even comparable to the supervised baseline on Chinese-to-Japanese translation. On classification tasks, our model achieves performance comparable to monolingual BERT and other cross-lingual models trained with the same scale of data.

To summarize, our contributions are three-fold: (1) We propose UnihanLM, a cross-lingual pretrained language model for Chinese and Japanese NLP tasks. (2) We pioneer the use of a linguistic resource, the Unihan database, to help model pretraining, allowing more lexical information to be shared between the two languages. (3) We devise a novel coarse-to-fine two-stage pretraining strategy with different granularities for Chinese-Japanese language modeling.

* This work was done during Canwen's internship at Microsoft Research Asia.
¹ The code and pretrained weights are available at https://github.com/JetRunner/unihan-lm.

Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 201–211, December 4-7, 2020. © 2020 Association for Computational Linguistics

2 Preliminaries

2.1 Chinese Characters

Chinese characters are pictographs used in Chinese and Japanese. These characters often share the same background and origin. However, due to historical reasons, Chinese characters have developed into different writing systems, including Japanese Kanji, Traditional Chinese and Simplified Chinese. Also, even within a single text, multiple variants of the same character can be used interchangeably (e.g., "台灣" and "臺灣" for "Taiwan" in Traditional Chinese). These characters have identical or overlapping meanings. Thus, it is critical to better exploit such information for modeling cross-lingual (i.e., between Chinese and Japanese), cross-system (i.e., between Traditional and Simplified Chinese) and cross-variant semantics.

Both Chinese and Japanese have no delimiter (e.g., white space) to mark the boundaries of words. There have long been debates over whether word segmentation is necessary for Chinese NLP; recent work (Li et al., 2019) concludes that it is not necessary for various Chinese NLP tasks. Previous cross-lingual language models use different tokenization methods. mBERT adds white spaces around Chinese characters and leaves Katakana/Hiragana Japanese (also known as kana) unprocessed. Different from mBERT, XLM uses the Stanford Tokenizer² and KyTea³ to segment Chinese and Japanese sentences, respectively. After tokenization, mBERT and XLM use WordPiece (Wu et al., 2016) and Byte Pair Encoding (Sennrich et al., 2016) for sub-word encoding, respectively.

Nevertheless, both approaches suffer from obvious drawbacks. For mBERT, kana and Chinese characters are treated differently, which causes a mismatch for labeling tasks; leaving kana untokenized may also cause a data sparsity problem. For XLM, as pointed out by Li et al. (2019), an external word segmenter would introduce extra segmentation errors and compromise the performance of the model. Also, as a word-based model, it is difficult to share cross-lingual characters unless the segmented words in Chinese and Japanese match exactly. Furthermore, both approaches would enlarge the vocabulary size and thus introduce more parameters.

² https://nlp.stanford.edu/software/tokenizer.html
³ http://www.phontron.com/kytea/

Tokenization Scheme  Result
BERT (2019)          台 / 風 / はひどい
XLM (2019)           台風 / は / ひどい
UnihanLM             台 / 風 / は / ひ / ど / い

Table 3: Different tokenization schemes used in recent work and ours. Note that the tokenized results of both BERT and XLM in this table are before WordPiece/BPE is applied; WordPiece/BPE may further split a token.

2.2 Unihan Database

Chinese, Japanese and Korean (CJK) characters share a common origin in ancient Chinese characters. However, with the development of each writing system, a character may have multiple variant characters (e.g., the traditional variants of "发" in Table 2).

Variant                       Description                                                      Example
Traditional Variant           The traditional versions of a simplified Chinese character.      发 → 髮 (hair), 發 (to burgeon)
Simplified Variant            The simplified version of a traditional Chinese character.       團 → 团 (group)
Z-Variant                     Same character with different Unicode code points,
                              kept only for compatibility.                                     說 ↔ 説 (say)
Semantic Variant              Characters with identical meaning.                               兎 ↔ 兔 (rabbit)
Specialized Semantic Variant  Characters with overlapping meaning.                             丼 (rice bowl, well) ↔ 井 (well)

Table 2: The five types of variants in the Unihan database.
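The cluster-merging step described above can be sketched with a union-find structure over variant pairs. This is an illustrative sketch, not the authors' released code: a real run would parse the variant properties of Unihan_Variants.txt from the Unicode Character Database, whereas the three hard-coded pairs here are taken from Table 2.

```python
# Sketch: merging characters into clusters from Unihan variant pairs.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Path-compressing find; unseen characters become their own root.
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

# Illustrative (char, variant) pairs; a real implementation would read these
# from the kSimplifiedVariant, kZVariant and kSemanticVariant fields.
variant_pairs = [
    ("團", "团"),  # simplified variant (group)
    ("兎", "兔"),  # semantic variant (rabbit)
    ("說", "説"),  # Z-variant (say)
]

uf = UnionFind()
for a, b in variant_pairs:
    uf.union(a, b)

# Characters in the same cluster share one vocabulary entry during the
# coarse-grained pretraining stage.
cluster_of = {c: uf.find(c) for c in uf.parent}
```

Transitive chains (e.g., a character linked to two different variants) collapse into a single cluster automatically, which is why a union-find fits this step.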
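The two-stage coarse-to-fine data flow can be sketched as follows. The `cluster_rep` mapping, the cluster IDs, and the embedding values are toy stand-ins invented for illustration; in the actual model the coarse stage trains an MLM over cluster-replaced text, and the fine stage restores the original characters and seeds each character embedding from its cluster's learned embedding.

```python
# Hypothetical character-to-cluster mapping (stage 1 vocabulary).
cluster_rep = {"台": "C1", "臺": "C1", "风": "C2", "風": "C2"}

def coarsen(sentence):
    """Stage 1: replace every character by its cluster ID; characters
    outside any cluster pass through unchanged."""
    return [cluster_rep.get(ch, ch) for ch in sentence]

# Variant spellings of the same word map to identical coarse sequences,
# so they share one representation during coarse-grained pretraining.
coarse_tokens = coarsen("台風")

# Stage 2: initialize the per-character embedding table from the
# per-cluster embeddings learned in stage 1 (toy 2-d vectors here).
coarse_emb = {"C1": [0.1, 0.2], "C2": [0.3, 0.4]}
fine_emb = {ch: list(coarse_emb[cid]) for ch, cid in cluster_rep.items()}
# Each character starts from its cluster's vector and then diverges
# during fine-grained pretraining to capture character-specific nuance.
```

This is the sense in which the model "makes full use of shared characters while maintaining a good sense of the nuances of similar characters": identical initialization, separate parameters afterwards.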
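The contrast in Table 3 can be reproduced with a minimal sketch. Two simplifications to note: `is_cjk` covers only the basic CJK Unified Ideographs block (the real mBERT check spans more ranges), and UnihanLM's scheme is approximated here as plain character splitting.

```python
def is_cjk(ch):
    # Simplified check: basic CJK Unified Ideographs block only.
    return "\u4e00" <= ch <= "\u9fff"

def mbert_style_tokenize(text):
    """Add spaces around each CJK character; kana runs are left untouched,
    mirroring the mBERT behavior described in Section 2.1."""
    out, buf = [], ""
    for ch in text:
        if is_cjk(ch):
            if buf:
                out.append(buf)
                buf = ""
            out.append(ch)
        else:
            buf += ch
    if buf:
        out.append(buf)
    return out

def unihanlm_style_tokenize(text):
    """Uniform character-level tokenization, kana included."""
    return list(text)
```

On the Table 3 example 台風はひどい, the first function yields 台 / 風 / はひどい while the second yields 台 / 風 / は / ひ / ど / い, showing why the character-level scheme avoids the kana/ideograph mismatch in labeling tasks.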