Bringing the Dictionary to the User: the FOKS System

Slaven Bilac†, Timothy Baldwin∗ and Hozumi Tanaka†

† Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8552 JAPAN
{sbilac,tanaka}@cl.cs.titech.ac.jp

∗ CSLI, Ventura Hall, Stanford University, Stanford, CA 94305-4115 USA
[email protected]

Abstract

The dictionary look-up of unknown words is particularly difficult in Japanese due to the complicated writing system. We propose a system which allows learners of Japanese to look up words according to their expected, but not necessarily correct, reading. This is an improvement over previous systems which provide no handling of incorrect readings. In preprocessing, we calculate the possible readings each kanji character can take and different types of phonological and conjugational changes that can occur, and associate a probability with each. Using these probabilities and corpus-based frequencies we calculate a plausibility measure for each generated reading given a dictionary entry, based on the naive Bayes model. In response to a reading input, we calculate the plausibility of each dictionary entry corresponding to the reading and display a list of candidates for the user to choose from. We have implemented our system in a web-based environment and are currently evaluating its usefulness to learners of Japanese.

1 Introduction

Unknown words are a major bottleneck for learners of any language, due to the high overhead involved in looking them up in a dictionary. This is particularly true in non-alphabetic languages such as Japanese, as there is no easy way of looking up the component characters of new words. This research attempts to alleviate the dictionary look-up bottleneck by way of a comprehensive dictionary interface which allows Japanese learners to look up Japanese words in an efficient, robust manner. While the proposed method is directly transferable to other language pairs, for the purposes of this paper, we will focus exclusively on a Japanese–English dictionary interface.

The Japanese writing system consists of the three orthographies of hiragana, katakana and kanji, which appear intermingled in modern-day texts (NLI, 1986). The hiragana and katakana syllabaries, collectively referred to as kana, are relatively small (46 characters each), and each character takes a unique and mutually exclusive reading which can easily be memorized. Thus they do not present a major difficulty for the learner. Kanji characters (ideograms), on the other hand, present a much bigger obstacle. The high number of these characters (1,945 prescribed by the government for daily use, and up to 3,000 appearing in newspapers and formal publications) in itself presents a challenge, but the matter is further complicated by the fact that each character can and often does take on several different and frequently unrelated readings. The kanji 発, for example, has readings including hatsu and ta(tsu), whereas 表 has readings including omote, hyou and arawa(reru). Based on simple combinatorics, therefore, the kanji compound 発表 happyou “announcement” can take at least 6 basic readings, and when one considers phonological and conjugational variation, this number becomes much greater. Learners presented with the string 発表 for the first time will, therefore, have a possibly large number of potential readings (conditioned on the number of component character readings they know) to choose from. The problem is further complicated by the occurrence of character combinations which do not take on compositional readings. For example, 風邪 kaze “common cold” is formed non-compositionally from 風 kaze/fuu “wind” and 邪 yokoshima/ja “evil”.

With paper dictionaries, look-up typically occurs in two forms: (a) directly based on the reading of the entire word, or (b) indirectly via component kanji characters and an index of words involving those kanji. Clearly in the first case, the correct reading of the word must be known in order to look it up, which is often not the case. In the second case, the complicated radical and stroke count systems make the kanji look-up process cumbersome and time consuming.

With electronic dictionaries, both commercial and publicly available (e.g. EDICT (2000)), the options are expanded somewhat. In addition to reading- and kanji-based look-up, for electronic texts, simply copying and pasting the desired string into the dictionary look-up window gives us direct access to the word.¹ Several reading-aid systems (e.g. Reading Tutor (Kitamura and Kawamura, 2000) and Rikai²) provide greater assistance by segmenting longer texts and outputting individual translations for each segment (word). If the target text is available only in hard copy, it is possible to use kana–kanji conversion to manually input component kanji, assuming that at least one reading or lexical instantiation of those kanji is known by the user. Essentially, this amounts to individually inputting the readings of words the desired kanji appear in, and searching through the candidates returned by the kana–kanji conversion system. Again, this is complicated and time-inefficient, so the need for a more user-friendly dictionary look-up remains.

In this paper we describe the FOKS (Forgiving Online Kanji Search) system, which allows a learner to use his/her knowledge of kanji to the fullest extent in looking up unknown words according to their expected, but not necessarily correct, reading. Learners are exposed to certain kanji readings before others, and quickly develop a sense of the pervasiveness of different readings. We attempt to tap into this intuition, in predicting how Japanese learners will read an arbitrary kanji string based on the relative frequency of readings of the component kanji, and also the relative rates of application of phonological processes. An overall probability is obtained for each candidate reading using the naive Bayes model over these component probabilities. Below, we describe how this is intended to mimic the cognitive ability of a learner, how the system interacts with a user and how it benefits a user.

The remainder of this paper is structured as follows. Section 2 describes the preprocessing steps of reading generation and ranking. Section 3 describes the actual system as is currently visible on the internet. Finally, Section 4 provides an analysis and evaluation of the system.

2 Data Preprocessing

2.1 Problem domain

Our system is intended to handle strings both in the form they appear in texts (as a combination of the three Japanese orthographies) and as they are read (with the reading expressed in hiragana). Given a reading input, the system needs to establish a relationship between the reading and one or more dictionary entries, and rate the plausibility of each entry being realized with the entered reading.

In a sense this problem is analogous to kana–kanji conversion (see, e.g., Ichimura et al. (2000) and Takahashi et al. (1996)), in that we seek to determine a ranked listing of kanji strings that could correspond to the input kana string. There is one major difference, however. Kana–kanji conversion systems are designed for native speakers of Japanese and as such expect accurate input. In cases when the correct or standardized reading is not available, kanji characters have to be converted one by one. This can be a painstaking process due to the large number of characters taking on identical readings, resulting in large lists of characters for the user to choose from.

Our system, on the other hand, does not assume 100% accurate knowledge of readings, but instead expects readings to be predictably derived from the source kanji. What we do assume is that the user is able to determine word boundaries, which is in reality a non-trivial task due to Japanese being non-segmenting (see Kurohashi et al. (1994) and Nagata (1994), among others, for details of automatic segmentation methods). In a sense, the problem of word segmentation is distinct from the dictionary look-up task, so we do not tackle it in this paper.

To be able to infer how kanji characters can be read, we first determine all possible readings a kanji character can take based on automatically-derived alignment data. Then, we machine-learn phonological rules governing the formation of compound kanji strings. Given this information we are able to generate a set of readings for each dictionary entry that might be perceived as correct by a learner possessing some, potentially partial, knowledge of the character readings. Our generative method is analogous to that successfully applied by Knight and Graehl (1998) to the related problem of Japanese (back) transliteration.

2.2 Generating and grading readings

In order to generate a set of plausible readings we first extract all dictionary entries containing kanji, and for each entry perform the following steps:

1. Segment the kanji string into minimal morpho-phonemic units³ and align each resulting unit with the corresponding reading. For this purpose, we modified the TF-IDF based method proposed by Baldwin and Tanaka (2000) to accept bootstrap data.

2. Perform conjugational, phonological and morphological analysis of each segment–reading pair and standardize the reading to canonical form (see Baldwin et al. (2002) for full details). In particular, we consider gemination (onbin) and sequential voicing (rendaku) as the most commonly-occurring phonological alternations in kanji compound formation (Tsujimura, 1996)⁴. The canonical reading for a given seg-

¹ Although even here, life is complicated by Japanese being a non-segmenting language, putting the onus on the user to correctly identify word boundaries.

³ A unit is not limited to one character. For example, verbs and adjectives commonly have conjugating suffixes that are treated as part of the same segment.

⁴ In the previous example of 発表 happyou “announcement”
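As a rough illustration of the ideas described in the Introduction and Section 2 (combining the readings of the component kanji, then ranking the combinations by a product of per-kanji probabilities in the spirit of the naive Bayes model), the following Python sketch enumerates the 6 basic readings of the compound happyou “announcement” (発表). The `READING_PROBS` table, its probability values and the `candidate_readings` function are hypothetical and chosen only for illustration; they are not taken from the FOKS system or from any corpus, and readings such as ta(tsu) are flattened to plain strings for simplicity.

```python
from itertools import product

# Hypothetical per-kanji reading probabilities (illustrative numbers only,
# not taken from the paper or from any corpus).
READING_PROBS = {
    "発": {"hatsu": 0.8, "tatsu": 0.2},
    "表": {"hyou": 0.6, "omote": 0.3, "arawareru": 0.1},
}


def candidate_readings(word):
    """Return (reading, score) pairs for every combination of component
    readings, scored by the product of the per-kanji probabilities and
    sorted from most to least plausible."""
    per_char = [READING_PROBS[ch].items() for ch in word]
    scored = []
    for combo in product(*per_char):
        reading = "".join(r for r, _ in combo)
        score = 1.0
        for _, p in combo:
            score *= p  # independence assumption between characters
        scored.append((reading, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = candidate_readings("発表")
for reading, score in candidates:
    print(f"{reading}\t{score:.2f}")
```

With these made-up numbers the highest-ranked candidate is hatsuhyou, not the actual reading happyou, because the sketch only concatenates basic readings and does not model gemination or sequential voicing. Covering such alternations is the purpose of the phonological analysis in step 2 of Section 2.2.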
