Chinese Taalverwerking Op De Computer (Chinese Language Processing on the Computer)

Total Page:16

File Type:pdf, Size:1020Kb

FACULTY OF ARTS
DEPARTMENT OF ORIENTAL AND SLAVONIC STUDIES
KATHOLIEKE UNIVERSITEIT LEUVEN

CHINESE LANGUAGE PROCESSING ON THE COMPUTER
Part I: Theoretical Overview

Supervisor: Prof. Dr. Fred Truyen
Thesis submitted to obtain the degree of licentiate in Sinology by: Sébastien Bruggeman
- 2001-2002 -

FOREWORD

This theoretical overview deals with Chinese language processing on the computer. It aims to be as complete as possible, but the breadth of the subject means it can never be fully so. Although this part contains many technical details, no prior knowledge is required.

Alongside this theoretical overview there is a practical guide for people who want to use Chinese on their own computer. That part requires no prior knowledge either, although a basic knowledge of Microsoft Windows is assumed. Having a computer with an internet connection at hand makes it possible to put everything into practice immediately.

The third part of this thesis is a website. It offers extra documentation, examples and links, and its forum is available for further questions and answers.

Finally, I wish you pleasant reading and hope that this licentiate thesis gives you a better view of Chinese language processing on the computer.

Sébastien Bruggeman

Thesis Sébastien Bruggeman, page 2

TABLE OF CONTENTS

0. Conventions used
1. Introduction
  1.1. Languages and scripts
  1.2. Simplification of Chinese characters
  1.3. Typography
  1.4. Characters and computers
2. Character sets
  2.1. Western languages
  2.2. Eastern languages
    2.2.1. Traditional Chinese
      a) CCCII and EACC
      b) CNS
      c) Big5
      d) Big5+
      e) Big5E
      f) Hong Kong GCCS and SCS
    2.2.2. Simplified Chinese
      a) GB 1988-80
      b) GB 2312-80
      c) GB 6345.1-86
      d) GB 8565.2-88
      e) ISO-IR-165:1992
      f) GB/T 12345-90
      g) GBK
      h) GB 13000.1
      i) GB 18030-2000
      j) Other GB character sets
  2.3. Multilingual character sets
      a) Unicode and ISO 10646
  2.4. Conversion
3. Encoding
  3.1. Western languages
  3.2. Chinese
      a) HZ and EHZ
      b) ISO 2022
      c) EUC
      d) GBK
      e) Big5 and Big5+
      f) Overview
  3.3. Multilingual
      a) UCS
      b) UTF
4. Hardware
  4.1. Keyboard
      a) Pronunciation-based
      b) Structure-based
      c) Combined pronunciation and structure
      d) Direct input
  4.2. Other
5. Applications and uses
  5.1. DOS
  5.2. Microsoft Windows
      a) Native Chinese Windows
      b) Non-Chinese Windows
  5.3. Unix / Linux
      a) Native Chinese Linux
      b) Non-Chinese Linux
      c) Linux in China & Taiwan
  5.4. Apple
  5.5. Chinese and programming languages
  5.6. Chinese and databases
6. The Chinese internet
7. Appendix
  7.1. Bibliography
Recommended publications
  • ISO/IEC JTC1/SC2/WG2 N 3936 L2/10-385
    ISO/IEC JTC1/SC2/WG2 N 3936 Date: 2010-10-06 ISO/IEC JTC1/SC2/WG2 Coded Character Set Secretariat: Japan (JISC) Doc. Type: Disposition of comments Title: Disposition of comments on SC2 N 4146 (ISO/IEC CD 10646, 3rd Ed. Information Technology – Universal Coded Character Set (UCS)) Source: Michel Suignard (project editor) Project: JTC1 02.10646.00.00.00.03 Status: For review by WG2 Date: 2010-09-24 Distribution: WG2 Reference: SC2 N4146, N4156, WG2 N3892 Medium: Paper, PDF file Comments were received from Armenia, China, Egypt, Ireland, Japan, Korea (ROK), Norway, and U.S.A. The following document is the disposition of those comments. The disposition is organized per country. Note – The full content of the ballot comments has been included in this document to facilitate the reading. The dispositions are inserted in between these comments and are marked in Underlined Bold Serif text, with explanatory text in italicized serif. As a result of these dispositions all countries with negative vote have changed their vote to positive. Armenia: comments Technical comments T1. a) Armenian Dram Sign Upon consultation with the local specialist and the Armenian Dram Sign author SARM decided to stay with its request to place the sign in the "Currency Symbols" range 20A0-20CF at the available position 20B9. One of the main reasons for that is that the currency symbols are united in one and the same block on the basis of the main elements repeated in those things, and not on the basis of national alphabets or scripts. In other words the signs in this range are grouped in accordance with their functionality, like the three-letter abbreviations for the monetary instruments.
  • Assessment of Options for Handling Full Unicode Character Encodings in MARC21: A Study for the Library of Congress
    Assessment of Options for Handling Full Unicode Character Encodings in MARC21 A Study for the Library of Congress Part 1: New Scripts Jack Cain Senior Consultant Trylus Computing, Toronto 1 Purpose This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format. 1.1 “Encoding Scheme” vs. “Repertoire” An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding scheme. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, & “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal). 1.2 MARC8 "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one byte tables in which each character is encoded using a single 8 bit byte. (It also includes the EACC set which actually uses fixed length 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only.
  • Combining Diacritical Marks Range: 0300–036F The Unicode Standard
    Combining Diacritical Marks Range: 0300–036F The Unicode Standard, Version 4.0 This file contains an excerpt from the character code tables and list of character names for The Unicode Standard, Version 4.0. Characters in this chart that are new for The Unicode Standard, Version 4.0 are shown in conjunction with any existing characters. For ease of reference, the new characters have been highlighted in the chart grid and in the names list. This file will not be updated with errata, or when additional characters are assigned to the Unicode Standard. See http://www.unicode.org/charts for access to a complete list of the latest character charts. Disclaimer These charts are provided as the on-line reference to the character contents of the Unicode Standard, Version 4.0 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. For a complete understanding of the use of the characters contained in this excerpt file, please consult the appropriate sections of The Unicode Standard, Version 4.0 (ISBN 0-321-18578-1), as well as Unicode Standard Annexes #9, #11, #14, #15, #24 and #29, the other Unicode Technical Reports and the Unicode Character Database, which are available on-line. See http://www.unicode.org/Public/UNIDATA/UCD.html and http://www.unicode.org/unicode/reports A thorough understanding of the information contained in these additional sources is required for a successful implementation. Fonts The shapes of the reference glyphs used in these code charts are not prescriptive. Considerable variation is to be expected in actual fonts.
  • Proposal for a Korean Script Root Zone LGR 1 General Information
    (internal doc. #: klgp220_101f_proposal_korean_lgr-25jan18-en_v103.doc) Proposal for a Korean Script Root Zone LGR LGR Version 1.0 Date: 2018-01-25 Document version: 1.03 Authors: Korean Script Generation Panel 1 General Information/ Overview/ Abstract The purpose of this document is to give an overview of the proposed Korean Script LGR in the XML format and the rationale behind the design decisions taken. It includes a discussion of relevant features of the script, the communities or languages using it, the process and methodology used and information on the contributors. The formal specification of the LGR can be found in the accompanying XML document below: • proposal-korean-lgr-25jan18-en.xml Labels for testing can be found in the accompanying text document below: • korean-test-labels-25jan18-en.txt In Section 3, we will see the background on Korean script (Hangul + Hanja) and principal language using it, i.e., Korean language. The overall development process and methodology will be reviewed in Section 4. The repertoire and variant groups in K-LGR will be discussed in Sections 5 and 6, respectively. In Section 7, Whole Label Evaluation Rules (WLE) will be described and then contributors for K-LGR are shown in Section 8. Several appendices are included with separate files. 2 Script for which the LGR is proposed ISO 15924 Code: Kore ISO 15924 Key Number: 287 (= 286 + 500) ISO 15924 English Name: Korean (alias for Hangul + Han) Native name of the script: 한글 + 한자 Maximal Starting Repertoire (MSR) version: MSR-2 [241] Note.
  • Chapter 4 Formatting Text Copyright
    Writer 6.0 Guide Chapter 4 Formatting Text Copyright This document is Copyright © 2018 by the LibreOffice Documentation Team. Contributors are listed below. You may distribute it and/or modify it under the terms of either the GNU General Public License (http://www.gnu.org/licenses/gpl.html), version 3 or later, or the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), version 4.0 or later. All trademarks within this guide belong to their legitimate owners. Contributors Jean Hollis Weber Bruce Byfield Gillian Pollack Acknowledgments This chapter is updated from previous versions in the LibreOffice Writer Guide. Contributors to earlier versions are: Jean Hollis Weber John A. Smith Hazel Russman John M. Długosz Ron Faile Jr. Figure 4 is from Bruce Byfield’s Designing with LibreOffice. This chapter is adapted from part of Chapter 3 of the OpenOffice.org 3.3 Writer Guide. The contributors to that chapter are: Jean Hollis Weber Agnes Belzunce Daniel Carrera Laurent Duperval Katharina Greif Peter Hillier-Brook Michael Kotsarinis Peter Kupfer Iain Roberts Gary Schnabl Barbara M. Tobias Michele Zarri Sharon Whiston Feedback Please direct any comments or suggestions about this document to the Documentation Team’s mailing list: [email protected] Note Everything you send to a mailing list, including your email address and any other personal information that is written in the message, is publicly archived and cannot be deleted. Publication date and software version Published July 2018. Based on LibreOffice 6.0. Note for macOS users Some keystrokes and menu items are different on macOS from those used in Windows and Linux.
  • The Not So Short Introduction to LaTeX 2ε
    The Not So Short Introduction to LaTeX 2ε Or LaTeX 2ε in 139 minutes by Tobias Oetiker Hubert Partl, Irene Hyna and Elisabeth Schlegl Version 4.20, May 31, 2006 Copyright ©1995-2005 Tobias Oetiker and Contributors. All rights reserved. This document is free; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This document is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this document; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. Thank you! Much of the material used in this introduction comes from an Austrian introduction to LaTeX 2.09 written in German by: Hubert Partl <[email protected]> Zentraler Informatikdienst der Universität für Bodenkultur Wien Irene Hyna <[email protected]> Bundesministerium für Wissenschaft und Forschung Wien Elisabeth Schlegl <noemail> in Graz If you are interested in the German document, you can find a version updated for LaTeX 2ε by Jörg Knappen at CTAN:/tex-archive/info/lshort/german Thank you! The following individuals helped with corrections, suggestions and material to improve this paper. They put in a big effort to help me get this document into its present shape.
  • Bopomofo Extended Range: 31A0–31BF
    Bopomofo Extended Range: 31A0–31BF This file contains an excerpt from the character code tables and list of character names for The Unicode Standard, Version 14.0 This file may be changed at any time without notice to reflect errata or other updates to the Unicode Standard. See https://www.unicode.org/errata/ for an up-to-date list of errata. See https://www.unicode.org/charts/ for access to a complete list of the latest character code charts. See https://www.unicode.org/charts/PDF/Unicode-14.0/ for charts showing only the characters added in Unicode 14.0. See https://www.unicode.org/Public/14.0.0/charts/ for a complete archived file of character code charts for Unicode 14.0. Disclaimer These charts are provided as the online reference to the character contents of the Unicode Standard, Version 14.0 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. For a complete understanding of the use of the characters contained in this file, please consult the appropriate sections of The Unicode Standard, Version 14.0, online at https://www.unicode.org/versions/Unicode14.0.0/, as well as Unicode Standard Annexes #9, #11, #14, #15, #24, #29, #31, #34, #38, #41, #42, #44, #45, and #50, the other Unicode Technical Reports and Standards, and the Unicode Character Database, which are available online. See https://www.unicode.org/ucd/ and https://www.unicode.org/reports/ A thorough understanding of the information contained in these additional sources is required for a successful implementation.
  • AIX Globalization
    AIX Version 7.1 AIX globalization IBM Note Before using this information and the product it supports, read the information in “Notices” on page 233 . This edition applies to AIX Version 7.1 and to all subsequent releases and modifications until otherwise indicated in new editions. © Copyright International Business Machines Corporation 2010, 2018. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
    Contents
      About this document
        Highlighting
        Case-sensitivity in AIX
        ISO 9000
      AIX globalization
        What's new
        Separation of messages from programs
        Conversion between code sets
  • List of Approved Special Characters
    List of Approved Special Characters The following list represents the Graduate Division's approved character list for display of dissertation titles in the Hooding Booklet. Please note these characters will not display when your dissertation is published on ProQuest's site. To insert a special character, simply hold the ALT key on your keyboard and enter in the corresponding code. This is only for entering in a special character for your title or your name. The abstract section has different requirements. See abstract for more details.

    Special Character   Alt+   Description
    (space)             0032   Space
    !                   0033   Exclamation mark
    "                   0034   Double quotes (or speech marks)
    #                   0035   Number
    $                   0036   Dollar
    %                   0037   Percent sign
    &                   0038   Ampersand
    '                   0039   Single quote
    (                   0040   Open parenthesis (or open bracket)
    )                   0041   Close parenthesis (or close bracket)
    *                   0042   Asterisk
    +                   0043   Plus
    ,                   0044   Comma
    ‐                   0045   Hyphen
    .                   0046   Period, dot or full stop
    /                   0047   Slash or divide
    0                   0048   Zero
    1                   0049   One
    2                   0050   Two
    3                   0051   Three
    4                   0052   Four
    5                   0053   Five
    6                   0054   Six
    7                   0055   Seven
    8                   0056   Eight
    9                   0057   Nine
    :                   0058   Colon
    ;                   0059   Semicolon
    <                   0060   Less than (or open angled bracket)
    =                   0061   Equals
    >                   0062   Greater than (or close angled bracket)
    ?                   0063   Question mark
    @                   0064   At symbol
    A                   0065   Uppercase A
    B                   0066   Uppercase B
    C                   0067   Uppercase C
    D                   0068   Uppercase D
    E                   0069   Uppercase E
    F                   0070   Uppercase F
    G                   0071   Uppercase G
    H                   0072   Uppercase H
    I                   0073   Uppercase I
    J                   0074   Uppercase J
    K                   0075   Uppercase K
    L                   0076   Uppercase L
    M                   0077   Uppercase M
    N                   0078   Uppercase N
    O                   0079   Uppercase O
    P                   0080   Uppercase
  • Suggestions for the ISO/IEC 14651 CTT Part for Hangul
    SC22/WG20 N891R ISO/IEC JTC 1/SC2/WG2 N2405R L2/01-469 (formerly L2/01-405) Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation internationale de normalisation Title: Ordering rules for Hangul Source: Kent Karlsson Date: 2001-11-29 Status: Expert Contribution Document Type: Working Group Document Action: For consideration by the UTC, JTC 1/SC 2/WG 2’s ad hoc on Korean, and JTC 1/SC 22/WG 20 1 Introduction The Hangul script as such is very elegantly designed. However, its incarnation in 10646/Unicode is far from elegant. This paper is about restoring the elegance of Hangul, as much as it can be restored, for the process of string ordering. 1.1 Hangul syllables A lot of Hangul syllables have a character of their own in the range AC00-D7A3. They each have a canonical decomposition into two (choseong, jungseong) or three (choseong, jungseong, jongseong) Hangul Jamo characters in the ranges 1100-1112, 1161-1175, and 11A8-11C2. The choseong are leading consonants, one of which is mute. The jungseong are vowels. And the jongseong are trailing consonants. A Hangul Jamo character is either a letter or letter cluster. The Hangul syllable characters alone can represent most modern Hangul words. They cannot represent historic Hangul words (Middle Korean), nor modern/future Hangul words using syllables not preallocated. However, all Hangul words can elegantly be represented by sequences of single-letter Hangul Jamo characters plus optional tone mark. 1 1.2 Single-letter and cluster Hangul Jamo characters Cluster Hangul Jamo characters represent either clusters of two or three consonants, or clusters of two or three vowels.
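The canonical decomposition of precomposed Hangul syllables described in the excerpt above is purely arithmetic. The sketch below follows the algorithm given in the Unicode Standard; the constants and the function name `decompose_syllable` are assumptions brought in for illustration and do not come from the excerpt itself.

```python
# Constants as defined in the Unicode Standard's Hangul syllable decomposition
# algorithm (assumed here; the excerpt only lists the resulting ranges).
S_BASE = 0xAC00   # first precomposed syllable
L_BASE = 0x1100   # first choseong (leading consonant)
V_BASE = 0x1161   # first jungseong (vowel)
T_BASE = 0x11A7   # one before the first jongseong (trailing consonant)
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 syllables, AC00 through D7A3


def decompose_syllable(syllable: str) -> str:
    """Split one precomposed Hangul syllable into two or three Jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    jamo = chr(lead) + chr(vowel)
    if trail != T_BASE:  # index 0 means there is no trailing consonant
        jamo += chr(trail)
    return jamo


for ch in "가한":
    print(ch, [f"U+{ord(j):04X}" for j in decompose_syllable(ch)])
```

The first syllable, U+AC00, yields two Jamo (U+1100, U+1161), while U+D55C yields three (U+1112, U+1161, U+11AB), matching the two-or-three pattern described in the text.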
  • Jamo Pair Encoding: Subcharacter Representation-Based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization
    Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 3490–3497, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC. Jamo Pair Encoding: Subcharacter Representation-based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization. Sangwhan Moon†‡, Naoaki Okazaki†. Tokyo Institute of Technology†, Odd Concepts Inc.‡. [email protected], [email protected]
    Abstract: In the context of multilingual language model pre-training, vocabulary size for languages with a broad set of potential characters is an unsolved problem. We propose two algorithms applicable in any unsupervised multilingual pre-training task, increasing the elasticity of budget required for building the vocabulary in Byte-Pair Encoding inspired tokenizers, significantly reducing the cost of supporting Korean in a multilingual model.
    Keywords: tokenization, vocabulary compaction, sub-character representations, out-of-vocabulary mitigation
    1. Background: With the introduction of large-scale language model pre-training in the domain of natural language processing, the domain has seen significant advances in the performance of downstream tasks using transfer learning on pre-trained models (Howard and Ruder, 2018; Devlin et al., 2018) when compared to conventional per-task models. As a part of this trend, it has also become common to perform this form of
    […] BPE. Roughly, the minimum size of the subword vocabulary can be approximated as |V| ≈ 2|V_c|, where V is the minimal subword vocabulary, and V_c is the character level vocabulary. Since languages such as Japanese require at least 2000 characters to express everyday text, in a multilingual training setup, one must make a tradeoff. One can reduce the average surface of each subword for these character vocabulary intensive languages, or increase the vocabulary size.
  • 2 Hangul Jamo Auxiliary Canonical Decomposition Mappings
    DRAFT Unicode technical note NN Auxiliary character decompositions for supporting Hangul Kent Karlsson 2006-09-24 1 Introduction The Hangul script is very elegantly designed. There are just a small number of letters (28, plus a small number of variant letters introduced later, but the latter have fallen out of use) and even a featural design philosophy for the shapes of the letters. However, the incarnation of Hangul as characters in ISO/IEC 10646 and Unicode is not so elegant. In particular, there are many Hangul characters that are not needed, for precomposed letter clusters as well as precomposed syllable characters. The precomposed syllables have arithmetically specified canonical decompositions into Hangul jamos (conjoining Hangul letters). But unfortunately the letter cluster Hangul jamos do not have canonical decompositions to their constituent letters, which they should have had. This leads to multiple representations for exactly the same sequence of letters. There is not even any compatibility-like distinction; i.e. no (intended) font difference, no (intended) width difference, no (intended) ligaturing difference of any kind. They have even lost the compatibility decompositions that they had in Unicode 2.0. There are also some problems with the Hangul compatibility letters, and their proper compatibility decompositions to Hangul jamo characters. Just following their compatibility decompositions in UnicodeData.txt does not give any useful results in any setting. In this paper and its two associated datafiles these problems are addressed. Note that no changes to the standard Unicode normal forms (NFD, NFC, NFKD, and NFKC) are proposed, since these normal forms are stable for already allocated characters.
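The contrast drawn in the technical-note excerpt above, between precomposed syllables (which decompose canonically) and letter-cluster Jamo (which do not), can be observed with Python's standard `unicodedata` module. This is a small sketch and not part of the note; the choice of U+1101 as the sample cluster is an assumption made for illustration.

```python
import unicodedata

# A precomposed syllable decomposes canonically and recomposes losslessly.
syllable = "\uD55C"                          # 한
nfd = unicodedata.normalize("NFD", syllable)
nfc = unicodedata.normalize("NFC", nfd)

# A cluster Jamo and the same two letters written separately.
cluster = "\u1101"                           # HANGUL CHOSEONG SSANGKIYEOK
two_letters = "\u1100\u1100"                 # HANGUL CHOSEONG KIYEOK, twice

cluster_has_decomposition = unicodedata.decomposition(cluster) != ""
same_after_nfd = (
    unicodedata.normalize("NFD", cluster)
    == unicodedata.normalize("NFD", two_letters)
)

print([f"U+{ord(c):04X}" for c in nfd])      # ['U+1112', 'U+1161', 'U+11AB']
print(nfc == syllable)                       # True
print(cluster_has_decomposition)             # False
print(same_after_nfd)                        # False
```

Because the cluster has no canonical decomposition, the two spellings stay distinct under normalization, which is the "multiple representations" problem the note sets out to address.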