L2/07-143 Title: Proposal to Encode Tangut Characters in UCS Plane 1

TANGUT ENCODING PROPOSAL: U+17000 .. U+18715 L2/07-143

ISO/IEC JTC1/SC2/WG2
Universal Multiple-Octet Coded Character Set
International Organization for Standardization
Organisation Internationale de Normalisation
Международная организация по стандартизации

Title: Proposal to encode Tangut characters in UCS Plane 1
Doc Type: Working Group Document
Source: UC Berkeley Script Encoding Initiative (Universal Scripts Project)
Author: Richard COOK
Status: Liaison Contribution
Action: For consideration by JTC1/SC2/WG2 and UTC
Date: 2007-05-09

This proposal presents a new block of 5,910 Tangut (a.k.a. 西夏, Xī Xià, Тангут, Си Ся) characters for encoding in Plane 1 of the UCS, in the range U+17000..U+18715, coordinating contributions from scholars in China, Japan, Russia, Taiwan, and the United States.

Tangut is an extinct language of central China (都興慶府, 今寧夏銀川 [capital at Xīngqìng Prefecture, present-day Yínchuān, Níngxià]). Tangut characters were in use for less than 500 years (1036-1502), including some 300 years of classical use beyond the Mongol destruction of Tangut civilization (1227). Tangut writing was invented by imperial decree, with Chinese writing as its conceptual model. Tangut characters are confined to a uniform em-square, and are composed of a finite set of stroke types combined in recurrent patterns. The resemblance to Chinese and CJK writing ends there: Tangut is a unique writing system, with no overlap with Unified CJK Ideographs.

A considerable body of native Tangut literature has been uncovered since the early 1900s, and a large body of paleographic and linguistic work has also been published, making a standard encoding of great consequence for future research.
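As a quick consistency check on the figures quoted above, the size of the proposed range and its plane can be computed directly from the two endpoints. This is only an illustrative sketch; the constant and function names are hypothetical and not part of the proposal.

```python
# Endpoints of the proposed Tangut range, as quoted in the proposal.
FIRST = 0x17000
LAST = 0x18715


def plane(code_point: int) -> int:
    """Return the UCS plane number (each plane holds 0x10000 code points)."""
    return code_point >> 16


# Both endpoints are inclusive, hence the +1.
block_size = LAST - FIRST + 1

print(block_size)                 # 5910
print(plane(FIRST), plane(LAST))  # 1 1
```

The result agrees with the stated repertory of 5,910 characters, and both endpoints fall in Plane 1.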
The first attempt to define a standard electronic encoding of Tangut was Grinstead’s “Tangut Telecode” (1971), but neither this nor more ambitious subsequent computerized systems have gained much currency among scholars, though a de facto standard point of reference in Tangut studies has slowly emerged.

The proposed repertory derives from three main sources, two by 李范文 Lǐ Fànwén (TY:1986, XiaHan:1997) and one by 韓小忙 Hán Xiǎománg (HXM:2004). Hán Xiǎománg was the executive editor of his teacher Lǐ Fànwén’s mammoth 1997 《夏漢字典》Xià-Hàn Zìdiǎn [Tangut / Chinese Dictionary]. Seven years later, Hán’s doctoral dissertation 《西夏文正字研究》 [On Tangut Orthography] presented a comprehensive and systematic analysis of the distinctive elements of Tangut writing, mapped to nine native Tangut dictionaries.

Complete mappings and glyphs for these three sources (TY, XiaHan, HXM) are provided in the Multi-Column Code Chart (L2/07-144) accompanying this proposal; these mappings are a subset of the complete UniHan-style Tangut mapping database, described in the Proposed Draft Unicode Technical Report #43: A User’s Guide to the UniTangut Database <http://unicode.org/~rscook/Xixia/UCS_proposal/tr43.html>.

The Multi-Column Code Chart (L2/07-144) has four fields per record, one field per source, and 6217 records total (including variants), for 5910 proposed new characters. It is printed with 24 pt. Tangut glyphs, in 4 columns and 17 records per page, over 92 pages. The four columns of the Multi-Column Code Chart have header labels “W” (TY:1986), “X” (XiaHan:1997), “Y” (HXM:2004), “Z” (proposed UCS representative glyph). Under each glyph in the first 3 columns (W,X,Y) is its source mapping, and under each cell in column Z is the proposed UCS code point.
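The chart figures above can be cross-checked with simple arithmetic. The sketch below assumes one particular reading of "4 columns and 17 records per page", namely four columns of 17 records each; that reading is an assumption, chosen because it is the one under which the stated 92 pages comes out. All names are illustrative.

```python
import math

RECORDS = 6217           # chart records, including variant rows
CHARACTERS = 5910        # proposed new characters
COLUMNS_PER_PAGE = 4     # assumed: four columns of records on each page
RECORDS_PER_COLUMN = 17  # assumed: 17 records in each column

# Rows beyond one per proposed character are the additional variant rows.
variant_rows = RECORDS - CHARACTERS

# Page count under the assumed layout (4 x 17 = 68 records per page).
records_per_page = COLUMNS_PER_PAGE * RECORDS_PER_COLUMN
pages_assumed_layout = math.ceil(RECORDS / records_per_page)

# Page count if only 17 records fit on a page in total, for comparison.
pages_alternative = math.ceil(RECORDS / RECORDS_PER_COLUMN)

print(variant_rows)          # 307
print(pages_assumed_layout)  # 92
print(pages_alternative)     # 366
```

Under the assumed layout the computed page count matches the 92 pages given in the text; the alternative reading does not.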
Each record in this chart has at least two glyphs, and variant classes occupy adjacent rows with the same proposed code point: the first row for each variant class provides the proposed representative glyph in field Z (as determined by HXM). The Single-Column Code Chart (L2/07-145) shows only the column “Z” proposed representative glyph.

The consolidated mappings and glyphs underwent rigorous proofing and correction over the course of the year 2006 and into 2007, resulting in the set of five related Tangut TrueType fonts presented in this proposal. The base font and mappings for column “W” came from Taiwan (Academia Sinica); the fonts for columns “X” and “Y” were created specifically for this proposal, based on scans of the two main print sources (XiaHan, HXM). The font for column “Z” and its subset font for the Single-Column Code Chart were also built specifically for this proposal, based on a bitmap set from Japan (created in work by 荒川慎太郎 Arakawa Shintaro et al., based also on the XiaHan source), corrected, expanded, and brought into line with the HXM source. Three of the fonts in the Multi-Column Code Chart (X,Y,Z) contain glyphs for the full proposal repertory (5910 characters); the Sinica font (W) only contains Tongyin (TY) glyphs.

The ordering in the Single- and Multi-Column Charts derives from the system presented in HXM (2004), and virtual positions are assigned for characters not in that source. That scheme is especially attractive for its logic and high degree of usability. For every character, the left-side, top- or bottom-spanning component is the “radical”. The stroke types and counts of these components order these component classes, and within a given class the ordering is also by stroke count and by the type of the first residual stroke. This ordering eliminates the immediate need for the encoding of a block of Tangut radicals.
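The ordering principle just described can be pictured as a multi-level sort key. The following is a minimal sketch only: the field names, the integer coding of stroke types, and the precedence of stroke type over stroke count at each level are all assumptions made for illustration, and the sample values are invented rather than taken from HXM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderingInfo:
    """Hypothetical per-character ordering data (not from the proposal)."""
    label: str
    radical_stroke_type: int         # assumed numeric code for the radical's stroke type
    radical_stroke_count: int        # stroke count of the radical component
    residual_stroke_count: int       # strokes remaining outside the radical
    first_residual_stroke_type: int  # assumed numeric code for the first residual stroke


def sort_key(info: OrderingInfo) -> tuple:
    # Radical class first (type, then count), then residual count,
    # then the type of the first residual stroke.
    return (
        info.radical_stroke_type,
        info.radical_stroke_count,
        info.residual_stroke_count,
        info.first_residual_stroke_type,
    )


# Invented sample data, purely to exercise the key.
sample = [
    OrderingInfo("c", 2, 3, 1, 1),
    OrderingInfo("a", 1, 4, 5, 2),
    OrderingInfo("b", 1, 4, 5, 3),
    OrderingInfo("d", 1, 2, 9, 1),
]

ordered = sorted(sample, key=sort_key)
print([info.label for info in ordered])  # ['d', 'a', 'b', 'c']
```

Because the key depends only on properties of each character's own components, such an ordering needs no separately encoded radical characters, which is the point the paragraph above makes.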
The competing systems of Tangut radicals (and there are several, though apparently no known native systems) are idiosyncratic and partial. The task of enumerating the complete set of Tangut radicals is rather open-ended, since this is in effect a subset of the similarly open-ended set of Tangut components (which is especially open-ended when character variants and variant component analyses are considered). On-going CDL analysis of the encoded Tangut repertory will provide the basis for the future encoding of a well-defined set of Tangut components. CDL itself is the most effective way to address the radical/component problem in Tangut, and the larger problems relating to the indexing of this large character set.

Tangut characters are processed for most purposes like Chinese characters. They occupy unit squares, and line breaks can occur between any characters (subject to ordinary Han script rules for placement of punctuation). The default sort order is determined by binary (code-chart) order. Tangut, Chinese, Russian, English, IPA, etc. are commonly intermixed in the same lines of left-to-right running text. Tangut may also be used in vertical text. For examples, see the images at <http://unicode.org/~rscook/Xixia/>.

The Single-Column Code Chart (L2/07-145) includes at the end a full set of names. All character names have the form “TANGUT CHARACTER-17000”. Due to the large number of Tangut characters, it is suggested that UniHan-style short-hand notation be used in “NamesList.txt”.

In addition to lexical source mappings, the mapping data documented in Proposed Draft Unicode Technical Report #43: A User’s Guide to the UniTangut Database will provide several other kinds of property information. Some of these are described below.

Sources and Properties

The following is a list of fields and abbreviations used in the online mapping database: <http://linguistics.berkeley.edu/~rscook/cgi/ztangut.html>

B5 : Academia Sinica’s Big5-based encoding of this character (see PUA).
HXM : 韓小忙 Hán Xiǎománg (2004): 《西夏文正字研究》 [On Tangut Orthography; Ph.D. dissertation K246.3 H211.7, directed by 李范文 Lǐ Fànwén, see TY]. HXM undertakes a comprehensive and systematic collation of Tangut characters, based on nine Tangut dictionaries (《同音》, 《文海寶韻》, 《同音文海寶韻合編》, 《番漢合時掌中珠》, 《三才雜字》, 《纂要》, 《同義》, 《五音切韻》, 《新集碎金置掌文》), and catalogues a total of 6,066 forms, including 169 variants, 36 errors, and 5,861 unique ‘standard-style characters’ (“正字” zhèngzì ‘orthography’). In addition to the primary source mappings, this work contains mappings to Lǐ (1997) and Sofronov (1968).
Kychanov : Е. И. Кычанов Словарь Тангутского (Си Ся) Языка [E.I. Kychanov; Tangut Dictionary: Tangut-Russian-English-Chinese Dictionary], Kyoto Univ., 2006; this dictionary uses the Arakawa Mojikyo (文字鏡) fonts; pronunciations after Sofronov, gloss material after 李范文 Lǐ Fànwén; 5803 indexed entries, including many variants.
Nevsky : Н. А. Невский (Nevsky, N. A.) [1892-1938] (1960) Тангутская Филология. Издательство восточной литературы, Москва [Tangut Philology. 2 vols (Russian), Moscow]. (This field may have a maximum of four space-delimited values.)
Nishida : 西田龍雄 Nishida Tatsuo, 《西夏語的研究》 [A Study of the Tangut Language], vol. 2, Appendix I, 西夏文字小字典 [Small Dictionary of Tangut Characters], pp. 303-507; the value is the entry number given there.
PUA : Academia Sinica’s Unicode Private Use Area encoding of B5, see UNI.
SN : Serial Number, a numbering of all 5,809 elements of the TY character set.
Sofronov : Софронов, М. В. (M. V. Sofronov), Грамматика Тангутского Языка [Grammatika Tangutskovo Yazyka ‘Tangut Grammar’], 1968; Chinese title 《西夏語文法》.
TY : 《同音研究》Tóngyīn Yánjiū (‘Homophones’ Research, 李范文 Lǐ Fànwén. 寧夏人民出版社, 1986). The Sinica TY database contains a total of 5,809 records, including variants.
TYBH : 《同音研究》筆畫 (TY Stroke-count Range).
TYBS : 《同音研究》部首 (TY Radical).
TYP : 《同音研究》品 (TY Class).
TYYY : 《同音研究》音韻 (TY Rhyme).
TYYZ : 《同音研究》頁字 (TY character mapping [page + character ID]).
UNI : non-PUA Unicode code point (in the proposed range [U+17000 .. U+18715]), with block ordering as in HXM.
WHYJ : 《文海研究》 Wén Hǎi Yánjiū (史金波, 白濱, 黃振華, 1983).
WenHai : 《文海》 Wén Hǎi (K. V. Keping et al., 1969).
XiaHan : 《夏漢字典》Xià-Hàn Zìdiǎn [Tangut / Chinese Dictionary] (李范文 Lǐ Fànwén 1997; ISBN: 7-5004-2113-3). This dictionary has 6,000 numbered entries, including a number of variants unified in the proposed repertory. In the mapping data, indices > 6000 are virtual.
YTYL : 《義同》一類 Yì Tóng yīlèi (李范文, 韓小忙, 2000; cf. 韓小忙 2004:354).

Several other fields are in the process of being added to this database, including phonological and semantic information.
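A few of the constraints stated in the field list can be expressed as small checks. The sketch below assumes a UniHan-style, tab-separated line layout (code point, field name, value) and uses the abbreviations above as field names; the actual layout of the UniTangut data is not specified in this proposal, so the line format, the sample line, and every function name here are hypothetical.

```python
from typing import List, NamedTuple

FIRST, LAST = 0x17000, 0x18715  # proposed range, from the proposal
XIAHAN_NUMBERED_ENTRIES = 6000  # XiaHan has 6,000 numbered entries
NEVSKY_MAX_VALUES = 4           # Nevsky field: at most four values


class Entry(NamedTuple):
    code_point: int
    field: str
    value: str


def parse_line(line: str) -> Entry:
    """Parse one assumed 'U+XXXXX<TAB>field<TAB>value' line."""
    cp_text, field, value = line.rstrip("\n").split("\t", 2)
    if not cp_text.startswith("U+"):
        raise ValueError(f"not a code point: {cp_text!r}")
    code_point = int(cp_text[2:], 16)
    if not FIRST <= code_point <= LAST:
        raise ValueError(f"outside proposed range: {cp_text}")
    return Entry(code_point, field, value)


def character_name(code_point: int) -> str:
    """Build a name of the form given in the proposal."""
    return f"TANGUT CHARACTER-{code_point:X}"


def is_virtual_xiahan(index: int) -> bool:
    """XiaHan indices above 6000 are virtual in the mapping data."""
    return index > XIAHAN_NUMBERED_ENTRIES


def nevsky_values(value: str) -> List[str]:
    """Split a Nevsky field into its space-delimited values."""
    parts = value.split()
    if len(parts) > NEVSKY_MAX_VALUES:
        raise ValueError("Nevsky field allows at most four values")
    return parts


# Invented sample line, for illustration only.
entry = parse_line("U+17000\tXiaHan\t6001")
print(character_name(entry.code_point))     # TANGUT CHARACTER-17000
print(is_virtual_xiahan(int(entry.value)))  # True
```

Keeping the range and limit values as named constants makes it straightforward to adjust the checks if the repertory or the field definitions change in a later draft.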