Chinese Script Generation Panel Document

Chinese Script Generation Panel Document Proposal for the Generation Panel for the Chinese Script Label Generation Ruleset for the Root Zone 1. General Information Chinese script is the logograms used in the writing of Chinese and some other Asian languages. They are called Hanzi in Chinese, Kanji in Japanese and Hanja in Korean. Since the Hanzi unification in the Qin dynasty (221-207 B.C.), the most important change in the Chinese Hanzi occurred in the middle of the 20th century when more than two thousand Simplified characters were introduced as official forms in Mainland China. As a result, the Chinese language has two writing systems: Simplified Chinese (SC) and Traditional Chinese (TC). Both systems are expressed using different subsets under the Unicode definition of the same Han script. The two writing systems use SC and TC respectively while sharing a large common “unchanged” Hanzi subset that occupies around 60% in contemporary use. The common “unchanged” Hanzi subset enables a simplified Chinese user to understand texts written in traditional Chinese with little difficulty and vice versa. The Hanzi in SC and TC have the same meaning and the same pronunciation and are typical variants. The Japanese kanji were adopted for recording the Japanese language from the 5th century AD. Chinese words borrowed into Japanese could be written with Chinese characters, while Japanese words could be written using the character for a Chinese word of similar meaning. Finally, in Japanese, all three scripts (kanji, and the hiragana and katakana syllabaries) are used as main scripts. The Chinese script spread to Korea together with Buddhism from the 2nd century BC to the 5th century AD. In times past, until the 15th century, in Korea, Literary Chinese was the dominant form of written communication, prior to the creation of Hangul, the Korean alphabet. In the modern Hangul-based Korean writing system, Chinese characters are no longer officially used to represent native morphemes, but still sometimes used by a few Korean people in daily life. Historically, Chinese characters were also used in Mongolia and Vietnam, but not anymore. Accordingly, the Chinese Generation Panel does not take into account the usage of Chinese scripts within Mongolia and Vietnam. 1 Chinese Script Generation Panel Document 2 Countries with Significant User Communities for Chinese Script Chinese script is used to write a diverse set of languages across East Asia and South East Asia. Some major countries and regions using Chinese script are depicted as follows: Traditional Chinese script used exclusively or almost exclusively (Taiwan, Macau and Hong Kong) Simplified Chinese script used exclusively or almost exclusively (Mainland China and Singapore) Simplified Chinese script used formally but Traditional script still used widely (Malaysia) Chinese script used in conjunction with other systems of writing in the same language, K anji (Japan) Chinese script no longer officially used, Hanja (Republic of Korea) 3 Target Script and Scope 3.1 Target Script: Hani In ISO 15924, the script for Chinese Language is mainly defined in this specification: ISO 15924 code: Hani ISO 15924 no.: 500 English Name: Han (Hanzi, Kanji, Hanja) 2 Chinese Script Generation Panel Document Following ISO setting, CGP will directly take “Hani” as the Language Tag for Chinese. 3.2 Principal Languages using the Script As shown in the following non-exhaustive table, Chinese, Japanese and Korean are three main languages using the Chinese script today but it does not imply that unlisted languages are less significant. For example, there are cases where a language may have a large population, but only a small part of it writes it in Chinese script. Such languages are also excluded from this list. For these language all ISO 639-3 available as “living” are included from http://www-01.sil.org/ISO639-3/codes.asp, which may refer to a macro or an individual language. Language ISO 639-3 Code(s) Countries Local Names of the Script Chinese cdo, cjy, cmn, cpx, czh, China 汉字 czo, gan, hak, hsn, lzh, mnp, nan, wuu, yue, zho Japanese jpn Japan 漢字 Korean kor Korea 한자 3.3 Scope In 2004, according to RFC 3743 and RFC 4713, the Chinese Domain Name Consortium (CDNC) submitted to IANA a unified Chinese Character Set (19520 characters) for domain name registration, building up mapping relationships between any given simplified character, its traditional character(s) and its variant(s). All those 19520 characters are included in MSR-1. In 2012, CDNC added 17 more Chinese characters into the character set as requested by HKIRC on behalf of Hongkong community, increasing the set number to 19537. But only 15 of those 17 characters are included in MSR-1. Thus CGP takes the intersection of MSR-1 and the latest version of CDNC character set, amounting to 19535 characters, without Latin Hyphen, digits (002D, 0030-0039, which are not allowed in LGR for the root zone) and letters (0061-007A), as the repertoire (listed in Appendix A), which consequentially falls in the range of MSR-1. Before the step of whole label rules, CGP will coordinate with KGP and JGP on the variant 3 Chinese Script Generation Panel Document characters inclusion and removal within the range of MSR-1, modifying the CGP repertoire if it is necessary. 3.4 Chinese Characters in Different Languages and Script Chinese characters are used as Hanzi in Chinese, Kanji in Japanese and Hanja in Korean: Hanzi normally consists of two subsets, Simplified Chinese characters (Hans) and Traditional Chinese characters (Hant). Kanji is used to in Japanese in addition to two other scripts (hiragana and katakana), together known as Jpan (ISO 15924 code). Hanja is used in Korean in addition to the Hangul script, together known as Kore (ISO 15924 code). The relationship between Hanzi, Kanji and Hanja is as shown below, Hanzi (Hans & Hans), Kanji and Hanjia will all be covered by CGP. 4 Variants Coordination between C, J and K As introduced above, different panels (C, J and K) have different views on the variants corresponding to the same Chinese character. This will accordingly generates impacts on the regulation of Chinese script sets and rules for Chinese domain name registration (even results in conflicts during IDN TLD registration).The following will explain it from the aspects of Chinese, Japanese and Korean languages. Chinese: 4 Chinese Script Generation Panel Document In the Chinese language, there are two types of variant: The first type is created by regional variations in the standard writing system. There are now two common writing systems: Simplified Chinese and Traditional Chinese. Both writing systems use different subsets of the same Unicode Han script, but they are not mutually exclusive to each other. The second type is the generic variant. Several Chinese characters are visually different in forms, but treated equally with universal interchangeability. This relationship of interchangeability is much stronger than the relationship between the Traditional and Simplified forms. This understanding and variants mapping relationship has been reflected in the CDNC Character Set, the IANA IDN table for .CN and .TW. Japanese There are more than 6,000 Kanji characters used in Japan, 2,000 of which are regularly used. Among those 2,000 Kanji characters, some are in a simplified form (called the “new character form”), derived from the traditional imported form (called the “old character form”). For Chinese characters used in Japanese (Kanji), it is appropriate to distinguish new and old forms as different and independent characters instead of pure variants. This understanding has been reflected in the IANA IDN table developed by the .JP registry, JPRS, in which no variants are identified for Kanji. Korean Hanja is no longer widely used in the ROK. A law enacted on April 14th, 2011 orders all ROK official government documents to be written only in Hangul. Hanja or other scripts can only be written within parentheses if allowed by presidential decree. Many Korean words though derived from Hanja are usually written in Hangul. Staff from the Korea Internet & Security Agency, the registry operator for the .KR ccTLD, stated that its IDN policy does not allow Hanja to be reflected by the language table submitted to IANA, and nor does they have any intention of allowing the use of Hanja in their domestic market. Thus, a coordination scheme between different panels is needed in that we expect unified Chinese script generation rules in the DNS root zone. However, if the Chinese script (Hanja) is no longer widely used in the ROK (we still need conformation of this from the KGP), this issue turns into coordination mainly between the CGP and the JGP. During the CDNC meeting in Shanghai (May, 2014), we were glad to see a coordination scheme 5 Chinese Script Generation Panel Document proposed by the Integration Panel. Following are the basic principles : Each CJK panel creates an LGR and each LGR includes a repertoire and variants. If an LGR includes Han characters, the variant mappings must agree for all three panels. The variant types may be different (blocked or allocatable), the variant types do not have to agree across LGRs Based on the above principles, CGP, JGP and KGP need to cooperate together to define a unified variant mapping table for Chinese scripts, then define variant types (e.g., allocatable or blocked) for each character contained in this table. (as described in the document of “Procedure to Develop and Maintain the Label Generation Rules for the Root Zone in Respect of IDNA Labels”). During this process, we can reasonably anticipate that the CGP repertoire will be increased to add some characters or decreased due to some characters with controversy. All relevant characters are supposed to be included in MSR-1.

Chinese Script Generation Panel Document

Stochastic Model of Stroke Order Variation

1 Introduction 1

The Study of Old Documents of Hokkaido and Kuril Ainu

The Unicode Cookbook for Linguists: Managing Writing Systems Using Orthography Profiles

Neural Substrates of Hanja (Logogram) and Hangul (Phonogram) Character Readings by Functional Magnetic Resonance Imaging

Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress

A Functional MRI Study on the Japanese Orthographies

Nushu and the Writing of Religious

Sinitic Language and Script in East Asia: Past and Present

A Student Model of Katakana Reading Proficiency for a Japanese Language Intelligent Tutoring System

A Comparative Analysis of the Simplification of Chinese Characters in Japan and China

Proposal for a Korean Script Root Zone LGR 1 General Information