L2/19-388 (New Unihan Database Property: kUnihanCore2020)
L2/19-388

Title: New Unihan Database property: kUnihanCore2020
Author: Ken Lunde
Date: 2019-12-02

Per L2/18-066R2, I previously proposed what I considered to be modest changes to the existing kIICore property, mainly to address some shortcomings that were identified in a series of five CJK Type Blog articles (appended to this document). Given the reluctance on the part of some national bodies to accept such modest changes, I decided to instead propose via L2/18-279R a completely new Unihan Database property that releases the set from being hampered by memory constraints that may have been applicable 15 years ago, but which arguably no longer apply to modern environments.

The new Unihan Database property name is kUnihanCore2020, which includes as part of its name the year in which the first version of Unicode that would include this new property is released, specifically Version 13.0.

The attached unihancore2020-data.txt data file provides all of the property data, which covers 20,652 CJK Unified Ideographs and 68 CJK Compatibility Ideographs. Compared to the existing kIICore property, the proposed kUnihanCore2020 property includes 10,910 additional ideographs. 22 ideographs that have a kIICore property value failed to meet the criteria for the kUnihanCore2020 property, but have been grandfathered.
The following table lists these 22 grandfathered ideographs, their kIICore property value, and the source reference that corresponds to their source tag:

Code Point    kIICore Property Value    Corresponding Source Reference
U+3960 㥠     CK                        K3-2554
U+4137 䄷     CK                        K3-2D4F
U+48B5 䢵     CG                        G5-6F4F
U+48C5 䣅     CG                        G3-6F29
U+48D3 䣓     CG                        G3-7B67
U+49D1 䧑     CG                        GKX-1352.16
U+4A12 䨒     CK                        K3-3455
U+4CB3 䲳     CT                        T3-5028
U+4D08 䴈     CT                        T4-6C52
U+593D 夽     CK                        K2-2B54
U+5D44 嵄     CK                        K2-2F33
U+5F34 弴     CJ                        J13-7436
U+5F45 彅     CJ                        J13-743A
U+66A3 暣     CK                        K1-5B6F
U+713F 焿     CT                        T3-6552
U+7807 砇     CK                        none
U+7A66 穦     CK                        none
U+974D 靍     CJ                        J3-7D68
U+974F 靏     CJ                        J13-7D6A
U+9964 饤     CG                        G8-2D43
U+997E 饾     CG                        G8-2D48
U+9AD9 髙     CJ                        none

Also see the attached grandfathered-22.txt data file.

The seven sections that follow describe the scope of each of the seven supported source tags, which are the same as those used by the existing kIICore property.

G—PRC

The scope of the “G” source tag is the union of the GB 2312 (6,763), TGH-2013/通用规范汉字表/Tōngyòng Guīfàn Hànzìbiǎo (8,105—see the kTGH property), and 现代汉语通用字表/Xiàndài Hànyǔ Tōngyòngzìbiǎo (7,000) standards, which results in 8,241 unique ideographs, all of which are CJK Unified Ideographs. This figure is only 136 ideographs more than TGH-2013 itself.

The following six ideographs were grandfathered from the kIICore property and use the “G” source tag: U+48B5 䢵 (G5), U+48C5 䣅 (G3), U+48D3 䣓 (G3), U+49D1 䧑 (GKX), U+9964 饤 (G8) & U+997E 饾 (G8). The total number of ideographs with the “G” source tag is therefore 8,247.

SPECIAL NOTES: 22 existing kIICore ideographs with the “G” source tag are excluded, because they are outside the scope of the three specified standards, but are included via other source tags. See the attached excluded-g-22.txt data file.

H—Hong Kong SAR

The scope of the “H” source tag is the union of the Big Five (13,060—see the kBigFive property) and HKSCS (4,603) standards, which results in 17,663 unique ideographs, 11 of which are CJK Compatibility Ideographs.
There is no overlap between these two standards.

J—Japan

The scope of the “J” source tag is the union of the JIS X 0208 (6,356), 常用漢字/Jōyō Kanji (2,136—see the kJoyoKanji property), 人名用漢字/Jinmei-yō Kanji (863—see the kJinmeiyoKanji property), and 表外漢字/Hyōgai Kanji (1,022) standards, which results in 6,485 unique ideographs, 58 of which are CJK Compatibility Ideographs. This figure is only 129 more ideographs than JIS X 0208 itself.

The following five ideographs were grandfathered from the kIICore property and use the “J” source tag: U+5F34 弴 (J13), U+5F45 彅 (J13), U+974D 靍 (J3), U+974F 靏 (J13) & U+9AD9 髙 (no kIRG_JSource). The total number of ideographs with the “J” source tag is therefore 6,490.

SPECIAL NOTES: One existing kIICore ideograph with the “J” source tag is excluded, because it is outside the scope of the four specified standards, but is included via other source tags. See the attached excluded-j-1.txt data file.

K—ROK

The scope of the “K” source tag is the union of the KS X 1001 (4,620) and 한문 교육용 기초 한자/漢文敎育用基礎漢字/Hanmun Gyoyug-yong Gicho Hanja (1,800—see the kKoreanEducationHanja property) standards, which results in 4,632 unique ideographs, all of which are CJK Unified Ideographs. This figure is only 12 more ideographs than KS X 1001 itself.

The following eight ideographs were grandfathered from the kIICore property and use the “K” source tag: U+3960 㥠 (K3), U+4137 䄷 (K3), U+4A12 䨒 (K3), U+593D 夽 (K2), U+5D44 嵄 (K2), U+66A3 暣 (K1), U+7807 砇 (no kIRG_KSource) & U+7A66 穦 (no kIRG_KSource). The total number of ideographs with the “K” source tag is therefore 4,640.

SPECIAL NOTES: 126 existing kIICore ideographs with the “K” source tag are excluded, because they are outside the scope of the two specified standards, but are included via other source tags. See the attached excluded-k-126.txt data file.
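The two-standard unions in the H and K sections above can be cross-checked with inclusion-exclusion. The following is a minimal sketch, assuming each quoted figure counts unique ideographs; the function name `implied_overlap` is hypothetical and not part of the proposal.

```python
def implied_overlap(size_a: int, size_b: int, size_union: int) -> int:
    """Inclusion-exclusion for two sets: |A and B| = |A| + |B| - |A or B|."""
    overlap = size_a + size_b - size_union
    # A valid overlap can be neither negative nor larger than the smaller set.
    if overlap < 0 or overlap > min(size_a, size_b):
        raise ValueError("counts are inconsistent for a two-set union")
    return overlap


# Figures quoted in the H section: Big Five, HKSCS, and their union.
h_overlap = implied_overlap(13_060, 4_603, 17_663)

# Figures quoted in the K section: KS X 1001, the education hanja set, and their union.
k_overlap = implied_overlap(4_620, 1_800, 4_632)

print(h_overlap)  # 0
print(k_overlap)  # 1788
```

Under that assumption, the H result of 0 agrees with the statement that the two standards do not overlap, and the K result implies that all but 12 of the 1,800 education hanja are already in KS X 1001, which matches the stated difference of 12.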
M—Macao SAR

The scope of the “M” source tag is the union of the Big Five standard (13,060—see the kBigFive property) and the existing kIICore ideographs that have the “M” source tag (4,954), which results in 13,119 unique ideographs, all of which are CJK Unified Ideographs. This figure is only 59 more ideographs than Big Five itself.

SPECIAL NOTES: Only one existing kIICore ideograph with the “M” source tag, U+5F66 彦, is excluded for reasons explained in the 2018-02-15 CJK Type Blog article, but is covered by four of the other six source tags (G, J, K & P):

    Only one ideograph, U+5F66 彦, stands out as odd in that its source references do not suggest Macao SAR use. Its related ideograph, U+5F65 彥, is also tagged “M” in kIICore (ATHM), and its source references, particularly T1-507D, more strongly suggest Macao SAR use.

See the attached excluded-m-1.txt data file.

P—DPRK

The scope of the “P” source tag is the KPS 9566 (4,653) standard, which means that this is unchanged from kIICore.

T—ROC

The scope of the “T” source tag is the union of the CNS 11643 Levels 1 & 2 (13,064) and Big Five (13,060—see the kBigFive property) standards, which results in 13,065 unique ideographs, all of which are CJK Unified Ideographs.

The following three ideographs were grandfathered from the kIICore property and use the “T” source tag: U+4CB3 䲳 (T3), U+4D08 䴈 (T4) & U+713F 焿 (T3). The total number of ideographs with the “T” source tag is therefore 13,068.

SPECIAL NOTES: 90 existing kIICore ideographs with the “T” source tag are excluded, because they are outside the scope of the two specified standards, but are included via other source tags. See the attached excluded-t-90.txt data file.

No Priority Tags

Because the notion of priority is largely source-specific, the kUnihanCore2020 property does not have a provision to specify priority tags. The author of the proposal felt that they are not necessary, and that the source tags are sufficient.
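The per-source totals stated in the G, J, K, and T sections, and the overall counts given in the introduction, can be checked with simple arithmetic. This is only a sketch over the figures quoted in this document; the names `SOURCES` and `check_totals` are hypothetical, and reading "10,910 additional ideographs" as a difference in total size is an assumption.

```python
# (union size, grandfathered count, stated total) for each source tag that
# has grandfathered ideographs, as quoted in the sections above.
SOURCES = {
    "G": (8_241, 6, 8_247),
    "J": (6_485, 5, 6_490),
    "K": (4_632, 8, 4_640),
    "T": (13_065, 3, 13_068),
}


def check_totals(sources):
    """Return the tags whose union + grandfathered count differs from the stated total."""
    return [
        tag
        for tag, (union, extra, total) in sources.items()
        if union + extra != total
    ]


mismatches = check_totals(SOURCES)

# The grandfathered counts per tag should add up to the 22 listed in the table.
grandfathered_total = sum(extra for _, extra, _ in SOURCES.values())

# Overall size: CJK Unified Ideographs plus CJK Compatibility Ideographs.
overall = 20_652 + 68

# Assumption: kIICore size (9,810, from the appended blog article) plus the
# stated 10,910 additional ideographs should give the same overall size.
from_iicore = 9_810 + 10_910

print(mismatches)            # []
print(grandfathered_total)   # 22
print(overall, from_iicore)  # 20720 20720
```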
CJK Compatibility Ideographs

Although the kUnihanCore2020 property specifies source tags for 68 CJK Compatibility Ideographs—11 with the “H” source tag, and 57 with the “J” source tag—it is expected that their corresponding SVSes (Standardized Variation Sequences) be used in actual implementations. In addition, the CJK Compatibility Ideographs that correspond to the Big Five (2) and KS X 1001 (268) standards have been intentionally excluded, because they represent genuine duplicate ideographs.

See the attached svs-68.txt data file that provides a correspondence between these 68 CJK Compatibility Ideographs and their SVSes.

That is all.

CJK Type Blog

Exploring IICore—Part 1

By Dr. Ken Lunde
Created February 5, 2018

Today’s article is the very first one that references IICore (International Ideographs Core), which is best described as a region-agnostic subset that includes the most commonly used CJK Unified Ideographs in Unicode, and is intended for use in memory-challenged devices and environments. Included are 9,810 ideographs, the bulk of which are in the URO (9,706), with the remaining ones in Extensions A (42) and B (62). IICore is instantiated as the kIICore property of the Unihan Database, and documented in UAX #38. The kIICore property values consist of an initial letter—A, B, or C—that indicates priority, followed by one or more letters that specify a source that more or less corresponds to a region: G, H, J, K, M, P (short for KP), and T.

In Part 1 of what may eventually become a multiple-part series about IICore, I will briefly explore the ideographs that are tagged “K” for Korean use, along with pointing out some that should have been tagged “K” after examining the mappings to the KS X 1001 standard.
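The kIICore value format described above (one priority letter followed by one or more source letters) can be split mechanically. The following is a minimal sketch; the function name `parse_kiicore` is hypothetical, and the pattern is inferred from the description here rather than copied from UAX #38.

```python
import re

# Assumed pattern: one priority letter (A, B or C) followed by one or more
# source letters drawn from G, H, J, K, M, P and T.
_KIICORE_RE = re.compile(r"([ABC])([GHJKMPT]+)")


def parse_kiicore(value: str):
    """Split a kIICore value into (priority, list of source letters)."""
    match = _KIICORE_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a kIICore value: {value!r}")
    priority, sources = match.groups()
    return priority, list(sources)


# "CK" and "ATHM" are values that appear in this document.
print(parse_kiicore("CK"))    # ('C', ['K'])
print(parse_kiicore("ATHM"))  # ('A', ['T', 'H', 'M'])
```

The sketch does not check for repeated source letters, which a stricter validator might also reject.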