Multiple Indexing in an Electronic Kanji Dictionary James BREEN Monash University Clayton 3800, Australia [email protected]
Abstract

Kanji dictionaries, which need to present a large number of complex characters in an order that makes them accessible to users, traditionally use several indexing techniques that are particularly suited to the printed medium. Electronic dictionary technology provides the opportunity both to introduce new indexing techniques that are not feasible with printed dictionaries, and to allow a wide range of index methods with each dictionary. It also allows dictionaries to be interfaced at the character level with documents and applications, thus removing much of the requirement for complex index methods. This paper surveys the traditional indexing methods, introduces some of the new indexing techniques that have become available with electronic kanji dictionaries, and reports on an analysis of the index usage patterns in a major WWW-based electronic kanji dictionary. This is believed to be the first such analysis conducted and reported.

1 Introduction

Unlike languages written in alphabetic, syllabic or similar scripts, languages such as Japanese and Chinese, which are written using a large number of characters (hanzi in Chinese, kanji in Japanese), require two distinct sets of dictionaries. These are:

a. the traditional "word" dictionaries, as used in most recorded languages. Such dictionaries are usually ordered in some recognized phonetic sequence, and typically include the pronunciation or reading of the word as well as the usual dictionary components: part-of-speech, explanation, etc.

b. character dictionaries, which typically have an entry for each character, and contain such information as the classification of the character according to shape, usage, components, etc., the pronunciation or reading of the character, variants of the character, the meaning or semantic application of the character, and often a selection of words demonstrating the use of the character in the language's orthography. These dictionaries are usually ordered on some visual characteristic of the characters.

A typical learner of Japanese needs to have both forms of dictionary, and the process of "looking up" an unknown word often involves initially using the character dictionary to determine the pronunciation of one or more of the characters, then using that pronunciation as an index to a word dictionary, in a process that can be time-consuming and error-prone.

The advent of electronic dictionaries has had a considerable impact on Japanese dictionary usage:

a. it has facilitated the integration or association of character and word dictionaries such that a user can index between them in a relatively straightforward manner. This integration was pioneered by the author in the early 1990s (Breen, 1995), and is now a common feature of almost all hand-held electronic Japanese dictionaries and PC-based dictionary packages;

b. it has allowed the direct transfer of words between text documents and dictionary software, thus removing the often-laborious character identification;

c. for kanji dictionaries, it has greatly increased the number of character indexing methods that can effectively be used, and has also provided the opportunity for new indexing methods that are not available to traditional paper dictionaries.

This paper will concentrate on the issues associated with Japanese kanji dictionaries. Many of these also apply to Chinese.

2 Indexing a Kanji Dictionary

The general problem confronting the publication of kanji dictionaries is the large number of kanji in use and the absence of an intrinsic and recognized lexical order for kanji. In the post-war educational reforms in Japan, the number of kanji taught in schools was restricted to a basic 1,850, which has now been increased to 1,945. This set of kanji, along with a small set designated for use in personal names, accounts for all but a small proportion of kanji usage in modern Japanese. Many dictionaries and similar reference books compiled for students are based on this set (Sakade, 1961; Henshall, 1988; Halpern, 1999; etc.). The main computer character-set standard used in Japan, JIS X 0208 (JIS, 1997), which extends to less-common kanji including those used in place names, has 6,355 kanji. This set is the basis for several kanji dictionaries (Nelson, 1997; Spahn & Hadamitzky, 1996), while larger sets of kanji are covered in many dictionaries, e.g. the Kodansha Daijiten (Ueda, 1963) has 14,900 kanji and the 13-volume Morohashi Daikanwajiten (Morohashi, 1989) has over 45,000 kanji.

In this paper, the term "primary index" has been used for the method of ordering the kanji entries, and "secondary index" has been used for cross-reference lists of kanji based on alternative ordering systems.

The major traditional indexing technique for kanji and hanzi dictionaries has been the radical system (bushu in Japanese), based on 214 elements plus about 150 variants. These elements are graphic components of the character that occur frequently enough to be used for indexing purposes. For example, the kanji 村 (mura: village) is identified by the 木 radical, and in a dictionary would be grouped with other kanji identified by that radical (札, 朷, 李, 松, etc.), with the grouped kanji ordered by the number of strokes in the remainder of the kanji. Radical systems have been used in Chinese character dictionaries for nearly 2,000 years, and the dominant 214-radical system was first used in the 康煕字典 (kangxi zidian) published in 1716.

Virtually all major kanji dictionaries published in Japan use the radical indexing method as the primary index, as do a number of dictionaries published elsewhere. Some dictionaries use modified or reduced sets of radicals. The technique is not simple to use, and some skill and practice are required in correctly identifying the radical and counting the residual strokes. The difficulty has been compounded by recent simplifications of the glyphs of the kanji, which in some cases have modified or eliminated the radical.

There are a number of other techniques used for indexing kanji in a dictionary:

a. reading. The reading or pronunciation of a kanji is a common and useful method of identification, and virtually all kanji dictionaries have a separate reading/kanji index. The reading cannot be used effectively as the primary index, as in Japanese each kanji usually has two sets of readings, and some kanji have as many as fifteen distinct readings.

b. shape/stroke. A number of techniques have been used to decompose the shape of a kanji according to coded patterns. One, which was popular in China, is the Four-Corner code, which allocates a number (0-9) to the pattern of strokes at each corner of the character, leading to a four-digit index. Another method, which is quite popular, is the SKIP (System of Kanji Indexing by Patterns) used by Jack Halpern in his kanji dictionaries (Halpern, 1990, 1999). In this, a kanji is typically divided into two portions, and a code constructed from the division type and the stroke-counts in the portions. Thus 村 has a SKIP code of 1-4-3, indicating a vertical division into four and three-stroke portions.

c. school grade. In Japan the kanji to be taught in each grade of elementary school are prescribed, and some references either organize kanji in those groupings or provide a secondary index of grades.

d. stroke count. The number of pen or brush strokes making up a kanji, ranging from one to over forty, can be an effective indexing technique, particularly for the simpler kanji. Some dictionaries employ a secondary index using the total number of strokes in a kanji.

e. frequency. The ranking of kanji according to frequency-of-use can be a useful secondary index, especially for the commonly used kanji.

f. code-point. The standardization of character set code-points for kanji has led to the emergence of dictionaries with these as the primary index. The Sanseido Unicode Kanji Information Dictionary (Tanaka, 2000) uses the Unicode code-point as the primary index, and the first edition of the JIS Kanji Dictionary (Shibano, 1997) used the JIS X 0208 code-point. It is interesting to note that the second edition (Shibano, 2002) changed to the traditional radical system, with the code-points being relegated to a secondary index.

A summary of the indices available in a selection of dictionaries and references is in Table 1. The "P" indicates the primary index and an "S" indicates a secondary index. (The original Nelson uses a slightly modified version of the traditional radical index, and the Spahn & Hadamitzky Kanji Dictionary uses a simplified 79-radical system.)

Index Type: Radical, Shape, Code-point, Grade, Reading, Frequency, Stroke Order
Dictionary: index marks in column order
Morohashi (1989): P S
Ueda (1963): P S
Nelson (1974): P* S S S
Nelson (1997): P S S
S&H (1996): P* S
S&H (1997): S S P S
Halpern (1990): S P S S S
Halpern (1999): S P S S
Sakade (1961): P S
Henshall (1988): P S S
Shibano (1997): P S
Shibano (2002): P S S
Tanaka (2000): S P

Table 1: Index Types in Printed Kanji Dictionaries.

3 Electronic Kanji Dictionaries

As mentioned above, electronic kanji dictionaries have an increased number of indexing methods available, and in particular have navigational advantages over traditional paper dictionaries:

a. the concept of a "secondary" index no longer applies, as every index is capable of linking directly to the kanji entries;

b. dictionary users can choose flexibly between index methods according to preference, and can select a method appropriate to the characteristics of an individual kanji;

c. the above-mentioned capability to index directly to a kanji entry from a kanji selected from a text or application;

d. suitable GUIs can enhance the kanji lookup process by providing visual cues and a degree of interactivity.

Figures 1 and 2 show the GUIs for the bushu and SKIP methods in the kanji dictionary module of the JWPce word-processor (Rosenthal, 2002).
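
The point that every index in an electronic dictionary can link directly to the kanji entries may be illustrated with a minimal sketch. It is not the implementation of any dictionary discussed in this paper: the names KanjiEntry, KanjiIndex and parse_skip are hypothetical, and the sample data beyond the radical, reading and SKIP code given above for 村 are assumptions added for illustration. Radical groups are ordered by residual stroke count, following the printed convention described in Section 2.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class KanjiEntry:
    kanji: str
    radical: str            # indexing radical (bushu)
    residual_strokes: int   # strokes in the remainder, outside the radical
    skip: str               # SKIP code written as "division-first-second"
    readings: tuple         # romanised readings (illustrative subset only)


def parse_skip(code):
    """Split a SKIP code such as '1-4-3' into three integers."""
    division, first, second = (int(part) for part in code.split("-"))
    return division, first, second


class KanjiIndex:
    """Hypothetical in-memory index: every key maps straight to entries."""

    def __init__(self, entries):
        self._by_radical = defaultdict(list)
        self._by_skip = defaultdict(list)
        self._by_reading = defaultdict(list)
        for entry in entries:
            self._by_radical[entry.radical].append(entry)
            self._by_skip[entry.skip].append(entry)
            for reading in entry.readings:
                self._by_reading[reading].append(entry)
        # Within a radical group, order by residual stroke count,
        # as a printed radical-indexed dictionary would.
        for group in self._by_radical.values():
            group.sort(key=lambda e: e.residual_strokes)

    def lookup_radical(self, radical):
        return list(self._by_radical.get(radical, []))

    def lookup_skip(self, code):
        return list(self._by_skip.get(code, []))

    def lookup_reading(self, reading):
        return list(self._by_reading.get(reading, []))


# Sample entries under the 木 radical. Only the values for 村 come from
# the text; the others are assumptions added for illustration.
SAMPLE = [
    KanjiEntry("松", "木", 4, "1-4-4", ("matsu",)),
    KanjiEntry("村", "木", 3, "1-4-3", ("mura",)),
    KanjiEntry("札", "木", 1, "1-4-1", ("fuda",)),
]

index = KanjiIndex(SAMPLE)
radical_group = [e.kanji for e in index.lookup_radical("木")]
skip_hits = [e.kanji for e in index.lookup_skip("1-4-3")]
reading_hits = [e.kanji for e in index.lookup_reading("mura")]
```

In this sketch no index is privileged: the radical, SKIP and reading lookups each return the entries themselves rather than a cross-reference into a primary ordering, so a user could choose whichever method suits a particular kanji.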