
Multiple-policy Character Annotation based on CHISE

Tomohiko Morioka*

Abstract

This paper describes a knowledge-based character processing model that resolves some problems of the coded character model. Currently, in the field of information processing of digital texts, each character is represented and processed by the “Coded Character Model.” In this model, each character is defined and shared using a coded character set (code) and represented by a code point (integer) of the code. In other words, when knowledge about characters is defined (standardized) in the specification of a coded character set, there is no need to store large and detailed knowledge about characters in computers for basic text processing. In terms of flexibility, however, the coded character model has some problems, because it assumes a finite set of characters, with each character of the set having a stable concept shared in the community. Real character usage, however, is not static and stable. Especially in , it is not so easy to select a finite set of characters which covers all usages. To resolve these problems, we have proposed the “Chaon” model. This is a new model of character processing based on character ontology. This report briefly describes the Chaon model and the CHISE (Character Information Service Environment) project, and focuses on how to represent Chinese characters and their glyphs in the context of multiple unification rules.

1 Introduction

Digital texts are usually represented by sequences of coded characters, defined in major coded character sets for information interchange, such as UCS1. In this sense, digital texts are highly dependent on the definition of the coded character set. UCS is an international standard, so its use for encoding digital texts has two opposite problems:

1. It is difficult to change, so necessary characters cannot be encoded in the current standard.
2. It is always in flux, so representations based on old standards need to be modified for the new standard.

Especially for the digitization of premodern Chinese characters (漢字; J. kanji, Ch. hanzi), the first problem is very serious. Even if required characters are proposed and added to UCS, we

* Kyoto University 1 International Organization for Standardization (ISO), Information technology — Universal Multiple- Octet Coded Character Set (UCS), June 2012, ISO/IEC 10646:2012.

Journal of the Japanese Association for Digital Humanities, vol. 1, 86 Multiple-policy Character Annotation based on CHISE

need several years and a lot of work before the ideograph is published in the standard. We usually cannot wait for this, and therefore use gaiji (non-standard characters encoded in the PUA [Private Use Area]). However, gaiji may cause problems in an open environment, because in most cases each gaiji carries only information about its glyph image. For digital texts and databases, searchability is a very important issue. For digitization of Chinese characters, we often think only about their glyph images. However, the pure glyph images that represent characters are not characters themselves. They do not contain annotations of characters, so they are not searchable.

The second problem is also serious. For digital archives, the information in digital texts should be stable and permanent. However, UCS disunified some abstract characters of the CJK Unified Ideographs in the BMP (Basic Multilingual Plane) and added some of their glyphs to CJK Unified Ideographs Extension B (Ext-B) as different abstract characters. As the disunification of CJK Unified Ideographs at the time Ext-B was added shows, the addition of characters may divide the semantic coverage indicated by existing code points, especially while the addition of characters to UCS is ongoing. In UCS and other standards of coded character sets for Chinese characters, each code point indicates a set of glyph images unified by the corresponding abstract character, even if it shows only a representative glyph image. The semantics of code points are the corresponding sets of glyph images (unifiable coverages) defined by the unification rules of the coded character set. Stability of unifiable coverages and/or unification rules is very important, especially for digital archives.

In addition, in the case of the digitization of Chinese characters in premodern texts, paying attention to variations of character shapes and their semantics is very important. In some cases, glyph variations of a character may be outside the unifiable coverage of the character. In other cases, the unifiable coverage of a character based on premodern character usage may be narrower than in most cases of modern character usage. In addition, a character’s shape is an element of the meaning represented by the character, but it is not the only element; several elements are involved in the construction of a character’s meaning. For example, the ideograph 「芸」 (“talent”) is used as a simplified (and canonical) variant of gei 「藝」 in the current Japanese orthography. In this usage, 「芸」 has the Japanese pronunciation gei. However, in earlier historical periods 「芸」 had another origin unrelated to 「藝」. In that usage, 「芸」 (“rue”) has the Japanese pronunciation un. In the digitization of ancient Chinese texts, the modern character that corresponds to an ancient one based on the structure of its shape is often different from the modern character whose correspondence is based on its semantics. In any case, in the digitization of Chinese characters, we may have to think about multidimensional factors of character concepts, including detailed differences among glyph images, the differing usage of characters in current writing systems, and various factors unrelated to shape.

To resolve these problems, we have proposed the “Chaon” model. This is a new model of character processing based on character ontology. In this model, each character is represented by


a set of features that represents information about and/or the concept of the character. So we have created a large-scale character ontology that includes 241,000 character objects (899,000 triples2) including Unicode characters, non-Unicode characters and their glyphs. This report briefly describes the Chaon model (Morioka 2008) and the CHISE (Character Information Service Environment) project,3 and focuses on how to represent Chinese characters and their glyphs in the context of multiple unification rules.

2 Character Representation in CHISE

2.1 The Chaon Model

In the Chaon model, each character is represented not according to its code point within a coded character set, but according to its feature set. Each character is represented as a character object, and the character object is defined by character features. Character objects and character features are stored in a character ontology, and each character can be accessed using its features as keys. There are various kinds of information related to characters, so we can regard various things as character features, for example, shapes, phonetic values, semantic values, or code points in various character sets. Figure 1 shows a sample image of a character representation in the Chaon model.

Figure 1. Sample of character features for “吉”

In the Chaon model, each character is represented by a set of character features, so we can use set operations to compare characters. Figure 2 shows a sample of a Venn diagram of character

2 “Triple” is an RDF term. A triple consists of a subject, a predicate, and an object, and a set of triples forms an RDF graph. In CHISE, a feature-name and a feature-value correspond to the predicate and object of a triple, and a character object described by a set of feature pairs (a feature pair consists of a feature-name and a feature-value) corresponds to an RDF subject. See World Wide Web Consortium 2004.
3 CHISE Project, http://www.chise.org.


objects. The diagram indicates that the characters “言”, “云” and “謂” share semantic and phonetic values in modern Japanese even though they have quite different glyphs, and that the characters “云” and “雲” share semantic and phonetic values in modern Chinese as used in the People’s Republic of China.

As we have already explained, a coded character set (CCS) and a code point in the set can also be character features. Those features enable the exchange of character information with applications that depend on coded character sets. If a character object has only a CCS feature, processing of the character object is the same as processing based on the coded character model presently in standard use. This means that we can regard the coded character model as a subset of the Chaon model.

Figure 2. Venn diagram of characters
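
Comparing characters by set operations can be sketched in Python as follows. This is a minimal illustration rather than CHISE code: the feature names (such as `sound@ja` and `meaning@zh-CN`) and their values are hypothetical stand-ins chosen only to mirror the kind of overlap shown in figure 2.

```python
# Toy character objects: each is a set of (feature-name, feature-value) pairs.
# All names and values here are illustrative assumptions, not CHISE data.
CHARACTERS = {
    "言": {("sound@ja", "iu"), ("meaning@ja", "say")},
    "謂": {("sound@ja", "iu"), ("meaning@ja", "say")},
    "云": {("sound@ja", "iu"), ("meaning@ja", "say"),
           ("sound@zh-CN", "yun2"), ("meaning@zh-CN", "cloud")},
    "雲": {("sound@zh-CN", "yun2"), ("meaning@zh-CN", "cloud")},
}


def shared_features(a, b):
    """Features that two character objects have in common (set intersection)."""
    return CHARACTERS[a] & CHARACTERS[b]


def find_by_features(*features):
    """Characters whose feature set contains every requested feature."""
    wanted = set(features)
    return {char for char, feats in CHARACTERS.items() if wanted <= feats}
```

Here `find_by_features` uses features as the access key, in the sense that no code point is involved in selecting a character.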

2.2 Notations

In the Chaon model, each character is represented by a set of character features. Each character feature is a pair of a feature-name and a feature-value. Therefore each character can be represented by an associative array. For example, it can be represented as:

  name 1 : value 1
  name 2 : value 2
     :
  name n : value n


This style is human readable. Therefore, CHISE-Wiki (see section 5.2) uses it as the default style. To avoid ambiguities, JSON style

  {
    "name 1": "value 1",
    "name 2": "value 2",
       :
    "name n": "value n"
  }

may be useful. For historical reasons, Lisp style notation (S-expression)

  ((name 1 . value 1)
   (name 2 . value 2)
      :
   (name n . value n))

is used as the standard format of the CHISE character ontology. It is called “char-spec” (character-specification) format.

To represent complex information, feature names are structured (see sections 2.4, 2.5, and 2.6). A defined character and its feature-name/value pair correspond to a triple (a subject, a predicate, and an object) in RDF. However, an RDF predicate is not structured like a feature-name in CHISE. Therefore, some mapping rules are required.
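
To make the relationship among the three notations concrete, the following Python sketch renders one associative array both as JSON and as a Lisp association list in the style of char-spec. It is only an approximation under stated assumptions: the helper names `to_char_spec` and `to_json` are hypothetical, the feature values for 吉 are illustrative, and real char-spec data may use Lisp value types that this sketch does not model.

```python
import json


def to_char_spec(features):
    """Render a feature dictionary as a Lisp association list."""
    def render(value):
        # Integers are written bare; everything else becomes a quoted string.
        if isinstance(value, int):
            return str(value)
        return json.dumps(str(value), ensure_ascii=False)

    pairs = " ".join(
        f"({name} . {render(value)})" for name, value in features.items()
    )
    return f"({pairs})"


def to_json(features):
    """Render the same feature dictionary in JSON style."""
    return json.dumps(features, ensure_ascii=False)


# Illustrative feature values for 吉 (U+5409); not copied from CHISE.
kichi = {"=ucs": 0x5409, "total-strokes": 6}
```

With these definitions, `to_char_spec(kichi)` yields `((=ucs . 21513) (total-strokes . 6))`.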

2.3 Categorization of Features

In character processing based on the Chaon model, it is important to analyze characters and their various properties and behaviors and to represent them as character features. We can find various properties and behaviors of characters, and we can define an unlimited number of character features. However, a general-purpose character ontology requires a guideline about character features. We think it is feasible to regard each character feature as an abstraction of an operation on characters. Under this approach, each character feature can be categorized as follows:

1. general character property (such as radicals, strokes, phonetic values, or other descriptions of Chinese character dictionaries)
2. mapping for character ID
3. relationship between characters

For example, radicals, strokes, and phonetic values can be placed into category 1, code points of UCS can be placed into category 2, and relations between character variants can be placed into category 3.


2.4 Mapping for Character ID

The information from category 2 is used for character code processing, such as code conversion. Character code processing consists of two kinds of operations: encoding and decoding. To encode a character using a CCS (coded character set) is to get the value of the CCS feature in the character. To decode a code point of a CCS is to search for a character whose CCS feature has the code point as its value. Character code processing should be fast, so the character ontology of CHISE has special indexes for decoding.4 For processing concerned with character variants, information from category 3 is used.

In character representation based on the Chaon model, each character object is independent of character codes, so a code point of a CCS may have more than one corresponding character object. In the sense of CCS standards, such as UCS or JIS X 0208:1997, a code point in a CCS indicates an abstract character, and what it covers is defined in the standards’ unification rules. However, a CCS code point is usually regarded as an abstract glyph or a glyph image printed as the representative glyph image of the code point. In the case of glyph sets, such as Adobe-Japan1, a code point of a glyph set indicates a glyph, and what it covers is ambiguous because the unification rules of the glyph set are not defined in the standard. In such a case, we have to work out the code point’s coverage inductively. In any case, because what a code point covers may be ambiguous, we should specify its coverage. Therefore, we introduced prefixes of features to indicate granularity of coverage (table 1). For example, =jis-x0208 means abstract glyphs corresponding with the representative glyph images defined in JIS X 0208, and =>jis-x0208 means abstract characters defined in JIS X 0208.
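
The encoding and decoding operations of section 2.4 can be sketched as a feature lookup plus a reverse index. The class below is a hypothetical toy, not the libchise or Concord API: `CharacterStore`, its method names, and the character identifiers are assumptions made for illustration, and the rule that CCS features are those starting with `=` is a simplification.

```python
from collections import defaultdict


class CharacterStore:
    """Toy character ontology with a reverse index for decoding."""

    def __init__(self):
        self.specs = {}                       # character id -> {feature: value}
        self.decode_index = defaultdict(set)  # (CCS feature, code point) -> ids

    def define(self, char_id, spec):
        """Store a character object and index its CCS features."""
        self.specs[char_id] = dict(spec)
        for feature, value in spec.items():
            # Simplification: treat features starting with "=" as CCS features.
            if feature.startswith("="):
                self.decode_index[(feature, value)].add(char_id)

    def encode(self, char_id, ccs_feature):
        """Encoding: read the CCS feature's value from the character."""
        return self.specs[char_id].get(ccs_feature)

    def decode(self, ccs_feature, code_point):
        """Decoding: find every character whose CCS feature has this value."""
        return set(self.decode_index.get((ccs_feature, code_point), set()))
```

`decode` returns a set because, as noted above, one code point may correspond to more than one character object.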

Prefix  Description
=->     Super abstract character (a set of abstract characters which have similar phonetic values and concepts)
=>      Abstract character
=+>     Unified abstract glyph
=       Abstract glyph
=>>     Concrete glyph
==      Abstract glyph image
===     Representative glyph image

Table 1. Prefixes of CCS features
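
Because some prefixes in table 1 are themselves prefixes of others (for example `=` of `=>`, and `=>` of `=>>`), a program reading CCS feature names has to try the longest prefix first. The following Python sketch shows one way to do this; the function name `split_ccs_feature` is hypothetical, and the mapping is copied from table 1.

```python
# Granularity prefixes as listed in table 1.
GRANULARITY = {
    "=->": "super abstract character",
    "=>": "abstract character",
    "=+>": "unified abstract glyph",
    "=": "abstract glyph",
    "=>>": "concrete glyph",
    "==": "abstract glyph image",
    "===": "representative glyph image",
}


def split_ccs_feature(name):
    """Split a CCS feature name into (granularity, CCS name).

    Longer prefixes are tried first so that "=>>" is not mistaken for "=>".
    """
    for prefix in sorted(GRANULARITY, key=len, reverse=True):
        if name.startswith(prefix):
            return GRANULARITY[prefix], name[len(prefix):]
    raise ValueError(f"not a CCS feature name: {name!r}")
```

For the two examples in the text, `split_ccs_feature("=jis-x0208")` gives the abstract glyph granularity and `split_ccs_feature("=>jis-x0208")` gives the abstract character granularity.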

4 In special cases, we distinguish categories 1 and 2, but there are no essential differences.


2.5 Relations between Characters

The Chaon model itself is a method to indicate characters or sets of characters by their character features; it was not originally based on graph theory. However, we can introduce a kind of character feature that includes a list of characters as its feature value. Characters stored in the list can also have this kind of character feature, so we can actually represent a network of characters (fig. 7). This kind of character feature is called a “relation feature.”

In CHISE, each relation feature name starts with -> or <- (except its metadata feature names described in section 2.6). The property of relation features is defined as follows: If characters A and B exist and there is a character relation A ->foo B, character A has a relation feature whose name is ->foo and character B is an element of its value. In this case, character B has a reversed relation feature whose name is <-foo and character A is an element of its value.
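
The property of relation features stated above (a forward feature on A implies a reversed feature on B) can be sketched as follows. `CharacterGraph` and `relate` are hypothetical names used only for this illustration; how CHISE itself maintains the reversed features is not described in this paper.

```python
from collections import defaultdict


class CharacterGraph:
    """Toy store that keeps forward and reversed relation features in sync."""

    def __init__(self):
        # character -> {relation feature name -> list of characters}
        self.features = defaultdict(lambda: defaultdict(list))

    def relate(self, a, name, b):
        """Record the relation a ->name b together with b <-name a."""
        forward = self.features[a]["->" + name]
        if b not in forward:
            forward.append(b)
        backward = self.features[b]["<-" + name]
        if a not in backward:
            backward.append(a)
```

After `relate("A", "foo", "B")`, character A carries `->foo` with B in its value and character B carries `<-foo` with A in its value, so the network can be walked in either direction.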

2.6 Description for Complex Information

For development of a general-purpose character ontology, we may find cases where different usages, purposes, applications, sources, interpretations, or theories make it hard to choose one feature value. Therefore we need to provide alternative values. In such cases, we may want to add metadata, for instance the sources of the values. This requires introducing a structured feature value or a structured feature name.

To represent structured feature values, a format called “character reference” (char-ref) is used in CHISE. For example,

  (:char ((fn1 . fv1) (fn2 . fv2) ...) :prop1 pval1 :prop2 pval2 ...)

Char-ref is a kind of property list written as an S-expression (Lisp). The property name indicates the kind of metadata and the property value indicates its data. As a special property name, :char is reserved to indicate the character to which metadata is added. Currently :sources is defined to indicate the information source.

CHISE also has a format for representing structured feature names. In the structured feature names, “domain identifiers” and/or “metadata identifiers” are added to ordinary (base) feature names. The format is defined as follows: := | := | @


:= * | * | @ := | /

For example, when the total number of strokes is represented by a character feature total-strokes and UCS is used as a domain identifier, total-strokes in the UCS domain is represented by total-strokes@ucs. When source is represented by the metadata identifier sources, the source of total-strokes@ucs is represented by the metadata feature name total-strokes@ucs*sources.

If there is a correspondence between different kinds of features, such as radical and body-strokes, we can represent the correspondence using a domain identifier. For example, when radical is represented by ideographic-radical and body-strokes is represented by ideographic-strokes, there are two concrete feature names that correspond: ideographic-radical@ucs and ideographic-strokes@ucs.
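
The examples total-strokes@ucs and total-strokes@ucs*sources suggest how a structured feature name could be taken apart. The Python sketch below is based only on those examples, not on the formal grammar: it assumes that `*` introduces a metadata identifier, `@` introduces a domain identifier, and `/` separates components within a domain. `FeatureName` and `parse_feature_name` are hypothetical names.

```python
from typing import NamedTuple, Optional, Tuple


class FeatureName(NamedTuple):
    base: str                  # ordinary (base) feature name
    domain: Tuple[str, ...]    # domain identifier components, possibly empty
    metadata: Optional[str]    # metadata identifier, if any


def parse_feature_name(name):
    """Split a structured feature name into base, domain, and metadata parts.

    Assumed shape (inferred from the examples in the text):
        base[@domain[/subdomain...]][*metadata]
    """
    head, star, metadata = name.partition("*")
    base, at, domain = head.partition("@")
    return FeatureName(
        base=base,
        domain=tuple(domain.split("/")) if at else (),
        metadata=metadata if star else None,
    )
```

Under these assumptions, `parse_feature_name("total-strokes@ucs*sources")` separates the base name `total-strokes`, the domain `ucs`, and the metadata identifier `sources`.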

2.7 Inheritance of character definition

If we construct a large-scale character database including a lot of character variants, inheritance of character definition is a good way to avoid writing a lot of common features. So CHISE introduces four special features to represent parent and child relationships:

<-subsumptive: the defined character is a child of each character indicated by its value
<-denotational: likewise
->subsumptive: each character indicated by its value is a child of the defined character
->denotational: likewise

In the sense of inheritance, A ->subsumptive B and A ->denotational B have the same semantics. However, they differ in nuance: A ->subsumptive B means that B is a close (glyph image) variant of A and they (or brothers/sisters of B) differ only slightly. A ->denotational B means that B is a distant (glyph) variant of A and they (or brothers/sisters of B) have relatively large differences. The ->denotational feature is useful for describing abstract characters.
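
Inheritance through the parent features can be sketched as a lookup that falls back to the parents named by <-subsumptive and <-denotational when a character does not define a feature itself. This is an assumed resolution strategy for illustration (depth-first, first value found wins); the function name `lookup`, the character identifiers, and the feature values are hypothetical.

```python
PARENT_FEATURES = ("<-subsumptive", "<-denotational")


def lookup(ontology, char_id, feature, _seen=None):
    """Return a feature value, inheriting from parent characters if needed."""
    seen = set() if _seen is None else _seen
    if char_id in seen:          # guard against cyclic parent links
        return None
    seen.add(char_id)
    spec = ontology.get(char_id, {})
    if feature in spec:
        return spec[feature]
    for parent_feature in PARENT_FEATURES:
        for parent in spec.get(parent_feature, []):
            value = lookup(ontology, parent, feature, seen)
            if value is not None:
                return value
    return None


# Toy ontology: an abstract character, a glyph, and a glyph image.
ONTOLOGY = {
    "abstract-kichi": {"sound@ja": "kichi"},
    "glyph-kichi": {"<-denotational": ["abstract-kichi"], "=ucs": 0x5409},
    "image-kichi": {"<-subsumptive": ["glyph-kichi"]},
}
```

In this toy data only the abstract character defines `sound@ja`, yet the glyph and the glyph image both resolve it through their parent links.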

3 Implementation

Character processing based on the Chaon model represents each character as a set of character features instead of a code point of a coded character set and processes the character by various character features. This means that a character processing system based on the Chaon model is a kind of database system for referring to and editing a character ontology. So the major


goals of the CHISE Project are (1) character database systems, (2) character database contents, and (3) CHISE-based applications. The CHISE Project is working on these goals and makes some of its results available as free software.5

As character database systems (and language bindings), the following implementations are available:

Concord is a library that provides fundamental features for manipulating linked data represented by labeled and directed graphs
libchise is a library that provides fundamental features for manipulating character databases
XEmacs CHISE is a Chaon implementation based on XEmacs6 (an extensible text editing environment with an Emacs Lisp interpreter) (fig. 3)

For implementations of Chaon, the contents of two databases are currently available:

CHISE basic character ontology is a general character ontology attached to XEmacs CHISE; it now includes about 241,000 defined character objects (about 899,000 triples)
CHISE-IDS is a database for the structure of the components of Chinese characters, presently including about eighty thousand characters. It is designed to integrate into the CHISE basic character ontology (the ‘ideographic-structure’ feature is used in CHISE).

The former is a basic character ontology attached to XEmacs CHISE, which is realized by a collection of define-chars (S-expressions), while the latter is a database of Chinese characters that provides information about shapes represented in the “Ideographic Description Sequence” (IDS) format defined in ISO/IEC 10646. Currently the CHISE basic character ontology and the CHISE-IDS package are distributed separately. However, the CHISE-IDS package provides an installer to integrate the CHISE-IDS database files with the CHISE basic character ontology.

5 Available as Git repositories at http://git.chise.org/gitweb/; information about them is available at http://www.chise.org.
6 XEmacs, http://www.xemacs.org/.


Figure 3. XEmacs CHISE

4 Information about the Structures of Shapes

4.1 The CHISE IDS Database

Structural information on Chinese characters is information about combinations of components of Chinese characters. Most Chinese characters can be represented by a combination of components, so the information is very useful. The information in the database represents not only abstract shapes, but their relationships with semantic and/or phonetic values. Therefore, we started off by planning to develop a database containing the structural information for every


Chinese character that consists of a combination of components. In 2002, we completed the CHISE-IDS database with support from the “Exploratory Software Project” run by IPA (Information-technology Promotion Agency, Japan) (Morioka and Wittern 2002). The database currently covers most of the CJKV Unified Ideographs of ISO/IEC 10646, including Extensions A, B, C, D and . We are also working to include representative glyph images from JIS X 0208:1990 and the Morohashi Daikanwa Jiten.

Before we developed the CHISE-IDS database, there were three other databases that included structural information about Chinese characters: the CDP (Chinese Document Processing Lab) database7 by , a database covering gaiji (private use characters) used in CBETA, and the Konjaku Mojikyo.8 Since these databases use their own idiosyncratic formats, they are not easy to convert to IDS format, and since Konjaku Mojikyo is a proprietary software package, its data are not open to the public. The CDP and CBETA databases are available with free software licensed under the terms of the GPL, while Konjaku Mojikyo is not. We have therefore attempted to convert the CDP database and the CBETA database to IDS and integrate them with the CHISE database.

4.2 Multiple granularity IDS

In the syntax of the Ideographic Description Sequence (IDS) defined in ISO/IEC 10646, only coded ideographs (Chinese characters included in UCS) and radical characters can be used as terminal components (leaf nodes of an IDS tree). However, theoretically, any component can be used as a leaf node. CHISE can represent and process characters not included in UCS. Therefore, in CHISE, Chinese characters and special components not included in UCS are also available as components of the extended IDS represented by the ideographic-structure feature.

As discussed in section 2.7, CHISE supports inheritance of character definition, and the CHISE character ontology uses this to represent relationships among different unification granularities, such as abstract character, abstract glyph, and glyph image. If we use abstract characters as terminal components of an IDS, the IDS represents the structure of an abstract character. If we use abstract glyphs, the IDS represents the structure of an abstract glyph. If we use glyph images, the IDS represents the structure of an abstract glyph image. Thus, the extended IDS of CHISE can represent the unification coverage (granularity) of a character object (fig. 4).
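
An IDS is a prefix expression: an ideographic description character (U+2FF0 to U+2FFB) is followed by its operands, which are either leaf components or nested sequences. The Python sketch below parses such a string into a tree. It assumes the arities of the original twelve operators (three operands for U+2FF2 and U+2FF3, two for the others) and treats every other single character as a leaf; the extended IDS of CHISE, whose leaves may also be non-UCS components, is not modeled. The function name `parse_ids` is hypothetical.

```python
# Arity of each ideographic description character (U+2FF0..U+2FFB).
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2ff2"] = 3  # left, middle, right
ARITY["\u2ff3"] = 3  # top, middle, bottom


def parse_ids(ids):
    """Parse an IDS string into nested tuples (operator, operand, ...)."""
    def parse(pos):
        if pos >= len(ids):
            raise ValueError("truncated IDS")
        ch = ids[pos]
        pos += 1
        if ch not in ARITY:
            return ch, pos           # leaf component
        children = []
        for _ in range(ARITY[ch]):
            child, pos = parse(pos)
            children.append(child)
        return (ch, *children), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after IDS")
    return tree
```

Replacing a leaf by its own IDS, as in the second assertion-style example `parse_ids("⿱艹⿰亻匕")`, yields a deeper tree for the same character, which is the sense in which the choice of terminal components determines the granularity described.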

7 漢字庫, http://www.sinica.edu.tw/~cdp/zip/hanzi/hanzicd.zip.
8 文字鏡研究会, http://www.mojikyo.org/.


Figure 4. Conceptual graph of multiple-granularity IDS(花)

5 Web Applications

5.1 CHISE IDS Find

“CHISE IDS Find”9 is a web service to search the CHISE ontology for kanji (Chinese characters) and/or historical Chinese characters, such as oracle bone , using the feature ideographic-structure. If a user enters one or more components of a kanji into the “Character components” window and runs a search, the characters that include every specified component are displayed. In the display of results, if a character is used as a component of other characters, these derived characters are indented after the character in tree form (fig. 5).
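
The search behavior described here (find the characters that include every specified component) can be sketched over a small table of IDS strings. This is not the implementation of CHISE IDS Find: the table `IDS` holds only four sample decompositions, the names `components` and `ids_find` are hypothetical, and the sketch counts a character as one of its own components, which is a simplification.

```python
# A tiny sample of IDS data; the real CHISE-IDS database is far larger.
IDS = {
    "花": "⿱艹化",
    "化": "⿰亻匕",
    "靴": "⿰革化",
    "草": "⿱艹早",
}


def components(char, db):
    """All components reachable from char, including char itself."""
    found = {char}
    for part in db.get(char, ""):
        is_operator = "\u2ff0" <= part <= "\u2ffb"
        if part != char and not is_operator:
            found |= components(part, db)   # descend into nested components
    return found


def ids_find(wanted, db):
    """Characters in db that contain every component in wanted."""
    required = set(wanted)
    return sorted(c for c in db if required <= components(c, db))
```

Because `components` descends recursively, a search for 匕 also finds 花 even though 匕 appears there only inside the component 化.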

9 CHISE IDS Find, http://www.chise.org/ids-find.


Figure 5. CHISE IDS Find (example of search results)

5.2 CHISE-wiki

In a results page of the CHISE IDS Find, the first column of each line indicates an entry for a character. It has a link to a page in the CHISE-wiki that displays details of the character (fig. 6). CHISE-wiki is a web service to view and edit character objects in the CHISE character ontology. The CHISE-wiki is based on EST, which is a Wiki-like system for structured data. A CHISE-wiki page represents a character object in the CHISE character ontology. It shows


a representative glyph image (or name) of an object and each feature of the object.

5.3 CHISE link map

On the CHISE IDS Find search results page, the third column of each line shows a link, “link map,” to the CHISE link map by Koichi Kamichi10 (fig. 7). The CHISE link map is a web service for visualizing relationships among character objects. It visualizes an overview of a character’s variant network.

6 External Representations

6.1 Entity Reference

XEmacs CHISE and some other implementations of CHISE support a special format to encode non-UCS characters:

  &(<PREFIX>-)<CNAME><CPOS>;

It is designed to be interpreted as an SGML/XML entity reference and is therefore called “entity reference.” The entity reference format is not rigidly based on the Chaon model. However, CHISE uses a special convention to represent granularities of encoded characters. Under this convention, an entity reference consists of an optional prefix part (<PREFIX>; table 2), a CCS-name part (<CNAME>; table 3), and a code-point part (<CPOS>). For example, &ZOB-0123; is an entity reference indicating an abstract glyph that is included in Zinbun Oracle Bones (a private character set for Oracle Bones characters), whose code point is 123, and &A-ZOB-0123; indicates an abstract character corresponding with &ZOB-0123;.

Prefix  Description
a2      Super abstract character
A       Abstract character
        Unified abstract glyph
(none)  Abstract glyph
G       Concrete glyph
g2      Abstract glyph image
R       Representative glyph image

Table 2. Parts of prefixes
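
Using the examples &ZOB-0123; and &A-ZOB-0123; together with table 2, an entity reference can be decomposed as in the Python sketch below. Several points are assumptions made for illustration: the function name `parse_entity_reference` is hypothetical, only a handful of CCS names from table 3 are included, `d` and `h` in the CPOS column are read as decimal and hexadecimal digits, and the prefix for the unified abstract glyph is left out because it is not given in table 2.

```python
# Granularity prefixes from table 2 (no prefix means "abstract glyph").
PREFIXES = {
    "a2": "super abstract character",
    "A": "abstract character",
    "G": "concrete glyph",
    "g2": "abstract glyph image",
    "R": "representative glyph image",
}

# A small subset of CCS names from table 3, with the assumed digit radix.
CCS_RADIX = {"ZOB-": 10, "GT-": 10, "MJ": 10, "J90-": 16, "C1-": 16}


def parse_entity_reference(ref):
    """Return (granularity, CCS name, code point) for an entity reference."""
    if not (ref.startswith("&") and ref.endswith(";")):
        raise ValueError(f"not an entity reference: {ref!r}")
    body = ref[1:-1]
    granularity = "abstract glyph"
    for prefix, meaning in PREFIXES.items():
        rest = body[len(prefix) + 1:]
        # Accept the prefix only if a known CCS name follows it.
        if body.startswith(prefix + "-") and rest.startswith(tuple(CCS_RADIX)):
            granularity, body = meaning, rest
            break
    for cname in sorted(CCS_RADIX, key=len, reverse=True):
        if body.startswith(cname):
            return granularity, cname, int(body[len(cname):], CCS_RADIX[cname])
    raise ValueError(f"unknown CCS name in {ref!r}")
```

Checking that a known CCS name follows the prefix keeps names such as `GT-` from being misread as the prefix `G` followed by something else.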

10 Kamichi, Koichi. CHISE Link Map, http://fonts.jp/chise_linkmap/.


Figure 6. CHISE-wiki


Figure 7. CHISE link map (+8FC5)

CNAME     CPOS    Description
MJ        dddddd  Moji-Joho kiban
AJ1-      ddddd   Adobe Japan1
GT-       ddddd   GT Mincho
GT-K      ddddd   GT components
J90-      hhhh    JIS X0208:1990
J83-      hhhh    JIS X0208:1983
J78-      hhhh    JIS X0208:1978
JX1-      hhhh    JIS X0213:2000 plane 1
JX2-      hhhh    JIS X0213:2000 plane 2
JX3-      hhhh    JIS X0213:2004 plane 1
JSP-      hhhh    JIS X0212:1990
HD-JA-    hhhh    Hanyo-Denshi/JA
HD-JB-    hhhh    Hanyo-Denshi/JB
HD-JC-    hhhh    Hanyo-Denshi/JC
HD-JD-    hhhh    Hanyo-Denshi/JD
HD-FT-    hhhh    Hanyo-Denshi/FT
HD-IA-    hhhh    Hanyo-Denshi/IA
HD-IB-    hhhh    Hanyo-Denshi/IB
HD-HG-    hhhh    Hanyo-Denshi/HG
HD-IP-    hhhh    Hanyo-Denshi/IP
HD-JT-    hhhh    Hanyo-Denshi/JT
HD-KS-    dddddd  Hanyo-Denshi/KS
C1-       hhhh    CNS 11643 plane 1
C2-       hhhh    CNS 11643 plane 2
C3-       hhhh    CNS 11643 plane 3
C4-       hhhh    CNS 11643 plane 4
C5-       hhhh    CNS 11643 plane 5
C6-       hhhh    CNS 11643 plane 6
C7-       hhhh    CNS 11643 plane 7
CDP-      hhhh    CDP private components
B-        hhhh
BE-       hhhh    Big5-ETEN
ZOB-      dddd    Zinbun Oracle Bones
M-        ddddd   Daikanwa (unchanged)
M-p       ddddd   Daikanwa with '
M-2p      ddddd   Daikanwa with ''
M-H       dddd    Daikanwa Hokan
r2M-      ddddd   Daikanwa revised ver.2
r1M-      ddddd   Daikanwa revised version
K0-       hhhh    KS X1001
EGB-      hhhh    ISO IR165
KOSEKI-   dddddd  Koseki-Toitsu-Moji
CB        ddddd   CBETA private characters
BUCS-     hhhh    BUCS
J97-      hhhh    JIS X0208:1997 unification
UU+       hhhh    Unicode representative glyphs
IU+       hhhhh   UCS representative glyphs
JU+       hhhh    JIS representative glyphs of UCS
J04U+     hhhh    JIS2004 representative glyphs of UCS
J00U+     hhhh    JIS2000 representative glyphs of UCS
J90+      hhhh    JIS1990 representative glyphs of UCS
+         hhhh    KS representative glyphs of UCS
CU+       hhhh    CNS representative glyphs of UCS
GU+       hhhh    GB representative glyphs of UCS
RUI6-     hhhh    Ruimoku-v6 private characters
JC3-      hhhh    JEF+CHINA3 private characters

Table 3. Prefixes of CCS features


6.2 TEI

TEI P5 includes features for representing “Non-standard Characters and Glyphs.”11 The specifications were influenced by an early version of the Chaon model and XEmacs CHISE ( was a member of the working group that developed the specifications), so TEI P5 allows for the representation of character features.12

The tag corresponds with the define-char function of XEmacs CHISE, and the tag corresponds with char-spec (see section 2.2). However, TEI requires using the tag for glyph variants of characters included in Unicode. Therefore, if a character can be mapped to Unicode, one should use the tag instead of the tag.

The tag can be used to represent character features. If a character feature corresponds with a character property of Unicode, the tag should be used for the feature name. Otherwise each character feature name can be represented by the tag. The tag can be used to represent the feature value in either case.

The tag of TEI P5 does not support inheritance of character definition (see section 2.7). Instead, the tag may be available to map information about inheritance represented by the <-subsumptive feature and the <-denotational feature.

7 Derived Works

7.1 Character Analysis

A character ontology can be used to analyze networks of characters. For example, the CHISE-IDS database forms a network of characters based on relationships among characters which share character components. Yoshi Fujiwara and colleagues analyzed these kinds of character relationships based on shared character components (Fujiwara, Suzuki, and Morioka 2004).

7.2 IDS-based activity in IRG

At the IRG13 Meeting no. 24 in Kyoto, Japan (May 24–27, 2005), I demonstrated the CHISE project and related systems including CHISE-IDS Find (Kawabata 2005). Since then, IRG has used IDS for their work. Taichi Kawabata’s works on this topic are especially important. He processed the CHISE-IDS database and analyzed every Chinese character14 included in UCS.

11 http://www.tei-c.org/Vault/P5/1.8.0/doc/tei-p5-doc/en/html/WD.html.
12 http://www.tei-c.org/Vault/P5/1.8.0/doc/tei-p5-doc/en/html/WD.html.
13 ISO/IEC JTC1/SC2/WG2/IRG (Ideographic Rapporteur Group).
14 In ISO/IEC 10646 and Unicode, Chinese characters are called “CJKV Ideographs”. There are two categories of CJKV Ideographs: one is “CJKV Unified Ideographs” and the other is “Compatibility Ideographs”. The second category is maintained for backward compatibility


Based on the results of this work, he defined a unification rule for CJKV Unified Ideographs to check duplications in proposed Ideographs (Kawabata 2006). In addition, he proposed requiring proponents to add IDS for each proposed ideograph. This proposal has now been adopted by the IRG (Kawabata and Kobayashi 2007).

8 Related Works

8.1 Hantology

Hantology (Chou and Huang 2005, 2006) is a Hanzi ontology based on the conventional orthographic system of Chinese characters. It is based on SUMO (Suggested Upper Merged Ontology) and coded in OWL. CHISE is designed to be high-performance, lightweight system software to replace coded-character technology, while Hantology seems to be designed for heavier applications.

8.2 CDP C. C. Hsieh and his colleagues at Academia Sinica were among the earliest pioneers (since the 1990s) in developing a Chinese character processing system based on knowledge processing. To express parts of characters that are not characters themselves, the system used more than 2,000 code points from the private use area (PUA) of Big5. Furthermore, the CDP database initially used a set of only three operators for connecting characters, although in practice this was expanded to eleven through the introduction of shortcut operators for handling multiple occurrences of the same component in one character. There are three more operator-like characters, which are used when embedding glyph expressions into running text. CHISE and CDP share a similar purpose and spirit. However, CHISE supports not only Chinese ideographs but also various other scripts, while CDP is designed specifically for Chinese ideographs. In fact, the CHISE ontology includes information on Unicode data and uses the same mechanism for every script. In addition, CHISE is designed not to depend on a particular character code, while CDP depends on Big5. To represent the structure of ideographs in terms of their components, CDP uses its own infix format, while CHISE uses the IDS format.
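Unlike an infix format, IDS is a prefix notation: each Ideographic Description Character comes first and is followed by its operands, so a sequence can be parsed without brackets. The following minimal sketch shows such a parser. It covers only the twelve description characters U+2FF0 to U+2FFB, the function name is hypothetical, and it reproduces neither the CHISE implementation nor the CDP operators.

```python
# Ideographic Description Characters U+2FF0..U+2FFB and their operand counts.
# U+2FF2 and U+2FF3 take three operands; the other ten take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


def parse_ids(ids):
    """Parse a prefix-notation IDS into a nested (operator, operand, ...) tuple."""

    def parse(pos):
        if pos >= len(ids):
            raise ValueError("IDS ended before all operands were read")
        ch = ids[pos]
        arity = IDC_ARITY.get(ch)
        if arity is None:
            return ch, pos + 1  # a plain component, not an operator
        operands = []
        pos += 1
        for _ in range(arity):
            node, pos = parse(pos)
            operands.append(node)
        return (ch, *operands), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


# A nested description: left-to-right, whose right part is again left-to-right.
tree = parse_ids("⿰氵⿰古月")
print(tree)
```

Because the operand count of every operator is fixed, the nesting is recovered from the flat sequence alone, which is why no grouping symbols are needed.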

9 Conclusion We have described the character processing model Chaon and given an overview of CHISE. The Chaon character representation model is powerful and radical enough to solve the problems of the present coded character model, and implementations of the Chaon model, such as XEmacs CHISE or CHISE-IDS Find, have shown that the model is sufficiently viable for supporting


application development. The Chaon model gives users freedom to create, define, and exchange characters according to their needs, as it is easy to change the character ontology or modify character features dynamically. With the CHISE basic character database, XEmacs CHISE can even handle characters not defined in Unicode. If a character is not defined in Unicode, users can add it to the CHISE character ontology and define it by its features. Users can handle each character based on their own point of view or policy. For example, CHISE provides several unification rules and mapping policies for dealing with Unicode. With the CHISE-IDS database, users can easily search for Chinese characters. This method is also available for non-Unicode characters. CHISE is an open source project, so its results are distributed as free software. Its web pages, programs, and data are managed using Git (a kind of version control system), allowing users to get the latest snapshot. There are mailing lists about the CHISE project in English and Japanese. Anyone interested in the CHISE project is invited to join the lists.

References
Chou, Ya-Min, and Chu-Ren Huang. 2005. “Hantology: An Ontology based on Conventionalized Conceptualization.” In Proceedings of OntoLex 2005 Workshop in IJCNLP 2005, 7–15.
Chou, Ya-Min, and Chu-Ren Huang. 2006. “Hantology: A Linguistic Resource for Chinese Language Processing and Studying.” In LREC 2006: 5th edition of the International Conference on Language Resources and Evaluation, 587–90. http://www.lrec-conf.org/proceedings/lrec2006/.
Fujiwara, Yoshi, Yasuhiro Suzuki, and Tomohiko Morioka. 2004. “Network of Words.” Artificial Life and Robotics 7 (4): 160–63. doi:10.1007/BF02471199.
Kawabata, Taichi. 2005. “Reference Information on Ideographs and IDS Informations on Japanese researchers.” May 26. IRG N1139. http://www.cse.cuhk.edu.hk/~irg/irg/irg24/IRGN1139IDSResearch.pdf.
Kawabata, Taichi. 2006. “A Judgment Method of ‘Equivalence’ of UCS Ideographs Based on IDS” (in Japanese). In The 17th seminar on Computing for Oriental Studies, 105–19. Kyoto University.
Kawabata, Taichi, and Tatsuo Kobayashi. 2007. “On better usage of IDS for IRG development process.” IRG N1372. http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg29/IRGN1372_IDSUse.doc.
Kawabata, Taichi. “Unification/subsumption Criterion of UCS Ideographs” (beta edition). http://kanji-database.sourceforge.net/ucv.html.
Morioka, Tomohiko. 2008. “CHISE: Character Processing based on Character Ontology.” In Large-scale Knowledge Resources: Construction and Application. Third International Conference on Large-Scale Knowledge Resources, LKR2008, Tokyo, Japan, March 2008, Proceedings, edited by Takenobu Tokunaga and Antonio Ortega, 148–62. Berlin: Springer. doi:10.1007/978-3-540-78159-2_14.15
Morioka, Tomohiko, and Christian Wittern. 2002. “Developing of Character Object Technology with Character Databases” (in Japanese). In IPA Result Report 2002. Information-Technology Promotion Agency, Japan. https://www.ipa.go.jp/files/000005566.pdf.
Morohashi, Tetsuji (諸橋轍次). 1930. Dai Kanwa Jiten. Tokyo: Taishūkan.
World Wide Web Consortium. 2004. Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation, February 10. http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/.

15 This paper described an old version of CHISE. The current CHISE described in the present article extends the models and features to support multiple granularities of glyphs by (a) introducing a structured naming scheme of CCS features to represent granularity of coverage (see section 2.4, table 1); (b) introducing multiple-granularity IDS (see section 4.2, fig. 4); and (c) expanding the entity reference feature to support multiple granularities (see section 6.1). As an improvement of the web service, CHISE-wiki (see section 5.2) is used instead of the “CHISE character description” described in the old paper. In addition, the present paper adds a description of external representations (see section 6).
