§6.0 Annotated English Version and Transcription of 許慎 Xǔ Shèn's SW Postface


NEH: Scholarly Editions and Translations Application §6.0

6.0) Project Application Appendices

Appendix §6.1: Index of Sample Pages

§ | pg. | Type | Source | Description
6.1.0 | 28-9 | Interlinear translation & apparatus | Cook 2003: 47-8 | Annotated English version and transcription of 許慎 Xǔ Shèn's SW Postface (his last known writing), 段玉裁 Duàn Yùcái 1815 [1989:783.2-784.1]
6.1.1 | 30-3 | Text, translation & exegesis | Cook 1995: 0-2, 9 | 《辰字的原始義》 The Etymology of Chinese 辰 Chén. Section 1: “說文解字的辰字” The Shuō Wén Analysis [frontispiece and two translations, literal and free]
6.1.2 | 34-45 | Interlinear annotated translation & inventory of texts [3 editions] | Cook 2003: 57-68; Duàn Yùcái 1799 | 《汲古閣·說文·訂·序·注》 Jígǔ Gé SW Dìng Xù Zhù [Annotated translation of 段玉裁 Duàn Yùcái's (1799) Preface to his book, The Jígǔ Gé Studio's Shuō Wén Reproductions: A fair appraisal.] Illustrates Duàn's study of transmission issues.
6.1.3 | 46-7 | Collation: base text & commentary | Cook 2003: 10, 19 | Case Study: Comparing Four Commentary Editions: 徐鍇 Xú Kǎi (974 AD), 徐鉉 Xú Xuàn (991 AD), 桂馥 Guì Fù (1805 AD), 段玉裁 Duàn Yùcái (1815 AD)
6.1.4 | 48 | Concordance | Cook 2003: 1744 | Sample concordance page (10607-10613).
6.1.5 | 49 | Urtext | Cook 2003 | Sample reconstructed Urtext entries, p. 608 (10600-10613).
6.1.6 | 50 | Qīng commentary original | 段玉裁 Duàn Yùcái 1815 [1989:745] | Modern (上海古籍出版社) photographic reproduction of Duàn's original (1815) text, typeset by his family in vertical columns: the base text is in large type, and Duàn's commentary is in half-size small type.
6.1.7 | 51 | The 540 SW lexical classifiers | Cook 2003 Frontispiece | 《說文解字·注·部首》 SW - Bùshǒu. The set of 540 SW Lexical Classifiers (Radicals) [typeset in Adobe InDesign CS4, in the OpenType font]

Appendix §6.2: Project Staff Résumés

§ | page | Person
6.2.1 | 52-3 | Bishop, Thomas E.
6.2.2 | 54-5 | Cook, Richard S.
Eastern Han Chinese Dictionary Translation Project

Thomas Eugene Bishop
2500 Spring Street, Eureka, California 95501, USA
[email protected]
877.4.WENLIN
Born May 21, 1962 in Princeton, New Jersey, USA

WORKS PUBLISHED

Wenlin Software for Learning Chinese
Wenlin Institute, Inc., 1997 to Present
I started what was to become the Wenlin project in 1988 after returning from a year in China. I wrote the dictionaries as well as the software for the first version, published in 1997. Subsequent versions 2 and 3 incorporate the ABC Dictionary (see below). I co-authored the 250-page printed User's Guide. The current version of the software is 3.4.1. Thousands of students and scholars around the world use this software. Customers include universities and governmental institutions as well as individuals. Version 4 is planned for publication in 2010.

ABC English-Chinese/Chinese-English Dictionary
University of Hawai'i Press, 2010 (forthcoming)
The chief editors are John DeFrancis and Zhang Yanyin. I am an associate editor. The English-Chinese part of this student-oriented dictionary is new, while the Chinese-English part is abridged and improved from the Comprehensive edition (see below). I provided linguistic expertise and technical implementations, including addition of International Phonetic Alphabet for English pronunciations, creation of new entries, checking and revision of existing entries (for consistency, orthography, parts of speech, grammar, etc.), abridgment, indexing, and typesetting.

ABC Chinese-English Comprehensive Dictionary
University of Hawai'i Press, 2003 (ISBN 0-8248-2766-X)
The chief editor was the famous linguist and author John DeFrancis. I was an associate editor, and was responsible for the technical aspects of the project as well as sharing in linguistic tasks, including addition of complex (traditional) forms of Chinese characters, creation of new entries, consistency checking, orthography, parts of speech, grammar, indexing, and typesetting. This dictionary contains approximately two hundred thousand entries. We actively maintain it in preparation for future Comprehensive editions, in both printed and digital forms.

Pop Mandarin: A Postmodern Chinese Phrasebook from Feng Shui to Wallstreet
Red Mansions Publishing, 2005 (ISBN 1-891688-04-9)
By Kirsten Ditterich-Shilakes & Janey Chen. I did the typesetting and addition of pinyin romanization, and assisted in design and proofreading.

A Specification for CDL: Character Description Language and Character Description Language (CDL): The Set of Basic CJK Unified Stroke Types
Published in digital form, 2003-2004
With Richard Cook. These documents were submitted as expert contributions for reference of the Unicode Technical Committee and the Ideographic Rapporteur Group (part of the International Standards Organization, in charge of encoding Chinese, Japanese, Korean, and Vietnamese characters). One result was the encoding of the CJK Strokes block characters in Unicode.

ABC Dictionary of Chinese Proverbs
University of Hawai'i Press, 2002 (ISBN 0-8248-2221-8)
By John S. Rohsenow. I did the typesetting and assisted in indexing and proofreading.

EMPLOYMENT

President, 1996 to Present
Wenlin Institute, Inc., Eureka, California, http://www.wenlin.com
In addition to running the corporation, duties include continued development and support of Wenlin Software for Learning Chinese, serving as an associate editor of the ABC Dictionary, and sublicensing the ABC Dictionary to other software developers. Another exciting current project (Fall, 2009) is conversion of Wenlin Institute into a non-profit corporation, which entails forming a board of directors, and applying for 501(c)(3) status.

Computer Programmer/Consultant, 1988 to Present
Self-employed
Designed and wrote Wenlin Software for Learning Chinese. Also worked as independent consultant for high tech industry, programming in C and assembly language.

Programmer/Mathematician, 1989 to 1990
Netrologic, Inc., San Diego, California
Applied neural networks to diverse problems such as handwritten digit recognition and space shuttle main engine fault detection. Co-authored papers for NASA.

Teacher of English as a Foreign Language, 1986 to 1987
Chongqing Institute of Posts and Telecommunications, People's Republic of China
Taught advanced and beginning English, as well as mathematics (complex analysis) to teachers and students at a 4-year technical college for one year.

EDUCATION

Bachelor of Arts in Applied Mathematics, March 1985
University of California at San Diego
(Also attended U.C. Santa Cruz and U.C. Berkeley.)

Graduate Record Examination (GRE) scores: Quantitative 800, Analytical 790, Verbal 780
(Highest possible score is 800.)

Language and Linguistics courses:
Russian (two years)
Spanish (one year)
Chinese (one quarter)
Japanese (one quarter)
Latin (three years, Jr. High)
Phonetics (one quarter)
Advanced Syntax (one quarter)
Grammar & Cognition (one quarter)
Grade A in each of these courses except Grammar & Cognition, a graduate course taken on a pass/not-pass basis; and Spanish and Phonetics, both of which were taken at U.C. Santa Cruz, where written evaluations were assigned instead of letter grades: “easily the best student” (Spanish) and “brilliant...in the top three of a class of over ninety” (Phonetics).

Computer Languages: C, Perl, Java, HTML, XML, PostScript, TeX, Metafont, SQL

Richard Sterling Cook, Jr.
<mailto:[email protected]> <http://linguistics.berkeley.edu/~rscook/>
STEDT Project · Linguistics Dept. · UC Berkeley · 1203 Dwinelle Hall · Berkeley, CA 94720 · (510) 643-9910
WCS · Artificial Intelligence Group · ICSI · 1947 Center St., Suite 600 · Berkeley, CA 94704 · (510) 666-2954
HOME · 2048 Cleveland St., San Leandro, CA 94577 · (510) 667-0957

Academic

University of California at Berkeley, Department of Linguistics
2003 ◊ Ph.D. 《說文解字·電子版》 Digital Recension of the Eastern Hàn Chinese Grammaticon.
2000 ◊ M.A. [Chinese languages, instrumental phonetics and phonology, computational linguistics.]

Columbia University in the City of New York, Dept. of English and Comparative Literature
1989 ◊ B.A. [Epic literature, Classical Greek & Latin, Anglo-Saxon, Russian.]

Current

1998- University of California at Berkeley, Department of Linguistics
The Sino-Tibetan Etymological Dictionary and Thesaurus Project (STEDT)
National Endowment for the Humanities (NEH) & National Science Foundation (NSF) funding (1987-)
• Researcher, Programmer, Archivist, Grant Writer <http://stedt.berkeley.edu/> [w/ Prof. James Matisoff]

2002- International Computer Science Institute (ICSI), Artificial Intelligence Group
The World Color Survey (WCS)
• Researcher, Programmer, Archivist, Grant Writer <http://icsi.berkeley.edu/wcs/> [w/ Prof. Paul Kay]

Awards

2007-2008 National Endowment for the Humanities (NEH) Digital Humanities Start-Up Grant
Character Description Language (CDL) Project <http://wenlin.com/cdl/> [w/ Thomas Bishop]
1999-2003 STEDT Graduate Research Fellow (Linguistics)
1998-1999 US Foreign Language and Area Studies Fellow (Chinese)
1998-1999 University of California Regents Graduate Fellow (Linguistics)

Languages & Computing

• Modern & historical languages & writing systems
• Chinese; Tangut; Classical Greek & Latin; Anglo-Saxon, Russian; Marshallese
• Lexicographic database, corpora, publications systems; C, Perl, SQL, XML
• Digital typography: Modern & historical orthographies, phonetic transcription, CDL
• Unicode Technical & Editorial Committees; ISO/IEC JTC1/SC2/WG2/IRG
• Script Encoding Initiative (SEI) <http://linguistics.berkeley.edu/sei/> [w/ Dr. Deborah Anderson]

Publications (Selected)