Bringing the Dictionary to the User: the FOKS System

Slaven Bilac†, Timothy Baldwin∗ and Hozumi Tanaka†
† Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8552 JAPAN
{sbilac,tanaka}@cl.cs.titech.ac.jp
∗ CSLI, Ventura Hall, Stanford University, Stanford, CA 94305-4115 USA
[email protected]

Abstract

The dictionary look-up of unknown words is particularly difficult in Japanese due to the complicated writing system. We propose a system which allows learners of Japanese to look up words according to their expected, but not necessarily correct, reading. This is an improvement over previous systems, which provide no handling of incorrect readings. In preprocessing, we calculate the possible readings each kanji character can take and the different types of phonological and conjugational changes that can occur, and associate a probability with each. Using these probabilities and corpus-based frequencies, we calculate a plausibility measure for each generated reading given a dictionary entry, based on the naive Bayes model. In response to a reading input, we calculate the plausibility of each dictionary entry corresponding to the reading and display a list of candidates for the user to choose from. We have implemented our system in a web-based environment and are currently evaluating its usefulness to learners of Japanese.

1 Introduction

Unknown words are a major bottleneck for learners of any language, due to the high overhead involved in looking them up in a dictionary. This is particularly true in non-alphabetic languages such as Japanese, as there is no easy way of looking up the component characters of new words. This research attempts to alleviate the dictionary look-up bottleneck by way of a comprehensive dictionary interface which allows Japanese learners to look up Japanese words in an efficient, robust manner. While the proposed method is directly transferable to other language pairs, for the purposes of this paper we will focus exclusively on a Japanese–English dictionary interface.

The Japanese writing system consists of the three orthographies of hiragana, katakana and kanji, which appear intermingled in modern-day texts (NLI, 1986). The hiragana and katakana syllabaries, collectively referred to as kana, are relatively small (46 characters each), and each character takes a unique and mutually exclusive reading which can easily be memorized. Thus they do not present a major difficulty for the learner. Kanji characters (ideograms), on the other hand, present a much bigger obstacle. The high number of these characters (1,945 prescribed by the government for daily use, and up to 3,000 appearing in newspapers and formal publications) in itself presents a challenge, but the matter is further complicated by the fact that each character can, and often does, take on several different and frequently unrelated readings. The kanji 発, for example, has readings including hatsu and ta(tsu), whereas 表 has readings including omote, hyou and arawa(reru). Based on simple combinatorics, therefore, the kanji compound 発表 happyou "announcement" can take at least 6 basic readings, and when one considers phonological and conjugational variation, this number becomes much greater. Learners presented with the string 発表 for the first time will, therefore, have a possibly large number of potential readings (conditioned on the number of component character readings they know) to choose from. The problem is further complicated by the occurrence of character combinations which do not take on compositional readings. For example, 風邪 kaze "common cold" is formed non-compositionally from 風 kaze/fuu "wind" and 邪 yokoshima/ja "evil".

With paper dictionaries, look-up typically occurs in two forms: (a) directly, based on the reading of the entire word, or (b) indirectly, via component kanji characters and an index of words involving those kanji. Clearly, in the first case the correct reading of the word must be known in order to look it up, which is often not the case. In the second case, the complicated radical and stroke count systems make the kanji look-up process cumbersome and time consuming.

With electronic dictionaries—both commercial and publicly available (e.g. EDICT (2000))—the options are expanded somewhat. In addition to reading- and kanji-based look-up, for electronic texts, simply copying and pasting the desired string into the dictionary look-up window gives us direct access to the word.[1] Several reading-aid systems (e.g. Reading Tutor (Kitamura and Kawamura, 2000) and Rikai[2]) provide greater assistance by segmenting longer texts and outputting individual translations for each segment (word). If the target text is available only in hard copy, it is possible to use kana-kanji conversion to manually input component kanji, assuming that at least one reading or lexical instantiation of those kanji is known by the user. Essentially, this amounts to individually inputting the readings of words the desired kanji appear in, and searching through the candidates returned by the kana-kanji conversion system. Again, this is complicated and time inefficient, and the need for a more user-friendly dictionary look-up remains.

In this paper we describe the FOKS (Forgiving Online Kanji Search) system, which allows a learner to use his/her knowledge of kanji to the fullest extent in looking up unknown words according to their expected, but not necessarily correct, reading. Learners are exposed to certain kanji readings before others, and quickly develop a sense of the pervasiveness of different readings. We attempt to tap into this intuition, in predicting how Japanese learners will read an arbitrary kanji string based on the relative frequency of readings of the component kanji, and also the relative rates of application of phonological processes. An overall probability is attained for each candidate reading using the naive Bayes model over these component probabilities. Below, we describe how this is intended to mimic the cognitive ability of a learner, how the system interacts with a user, and how it benefits a user.

The remainder of this paper is structured as follows. Section 2 describes the preprocessing steps of reading generation and ranking. Section 3 describes the actual system as is currently visible on the internet. Finally, Section 4 provides an analysis and evaluation of the system.

2 Data Preprocessing

2.1 Problem domain

Our system is intended to handle strings both in the form they appear in texts (as a combination of the three Japanese orthographies) and as they are read (with the reading expressed in hiragana). Given a reading input, the system needs to establish a relationship between the reading and one or more dictionary entries, and rate the plausibility of each entry being realized with the entered reading.

In a sense this problem is analogous to kana–kanji conversion (see, e.g., Ichimura et al. (2000) and Takahashi et al. (1996)), in that we seek to determine a ranked listing of kanji strings that could correspond to the input kana string. There is one major difference, however. Kana–kanji conversion systems are designed for native speakers of Japanese, and as such expect accurate input. In cases when the correct or standardized reading is not available, kanji characters have to be converted one by one. This can be a painstaking process due to the large number of characters taking on identical readings, resulting in large lists of characters for the user to choose from. Our system, on the other hand, does not assume 100% accurate knowledge of readings, but instead expects readings to be predictably derived from the source kanji. What we do assume is that the user is able to determine word boundaries, which is in reality a non-trivial task due to Japanese being non-segmenting (see Kurohashi et al. (1994) and Nagata (1994), among others, for details of automatic segmentation methods). In a sense, the problem of word segmentation is distinct from the dictionary look-up task, so we do not tackle it in this paper.

To be able to infer how kanji characters can be read, we first determine all possible readings a kanji character can take, based on automatically-derived alignment data. Then, we machine learn phonological rules governing the formation of compound kanji strings. Given this information, we are able to generate a set of readings for each dictionary entry that might be perceived as correct by a learner possessing some, potentially partial, knowledge of the character readings. Our generative method is analogous to that successfully applied by Knight and Graehl (1998) to the related problem of Japanese (back) transliteration.

2.2 Generating and grading readings

In order to generate a set of plausible readings, we first extract all dictionary entries containing kanji, and for each entry perform the following steps:

1. Segment the kanji string into minimal morpho-phonemic units[3] and align each resulting unit with the corresponding reading. For this purpose, we modified the TF-IDF based method proposed by Baldwin and Tanaka (2000) to accept bootstrap data.

2. Perform conjugational, phonological and morphological analysis of each segment–reading pair and standardize the reading to canonical form (see Baldwin et al. (2002) for full details). In particular, we consider gemination (onbin) and sequential voicing (rendaku) as the most commonly-occurring phonological alternations in kanji compound formation (Tsujimura, 1996).[4] The canonical reading for a given segment is the basic reading to which conjugational and phonological processes apply.

3. Calculate the probability of a given segment being realized with each reading (P(r|k)), and of phonological (P_phon(r)) or conjugational (P_conj(r)) alternation occurring. The set of reading probabilities is specific to each (kanji) segment, whereas the phonological and conjugational probabilities are calculated based on the reading only. After obtaining the composite probabilities of all readings for a segment, we normalize them to sum to 1.

4. Create an exhaustive listing of reading candidates for each dictionary entry s and calculate the probability P(r|s) for each reading, based on evidence from step 3 and the naive Bayes model (assuming independence between all parameters):

    P(r|s) = P(r_1..n | k_1..n)    (1)

    P(r_1..n | k_1..n) = Π_{i=1..n} P(r_i|k_i) × P_phon(r_i) × P_conj(r_i)    (2)

5. Calculate the corpus-based frequency F(s) of each dictionary entry s in the corpus, and then the string probability P(s), according to equation (3). Notice that the term Σ_i F(s_i) depends on the given corpus and is constant for all strings s.

    P(s) = F(s) / Σ_i F(s_i)    (3)

6. Use Bayes' rule to calculate the probability P(s|r) of each entry given the reading, according to equation (4):

    P(s|r) = P(r|s) P(s) / P(r)    (4)

Here, as we are only interested in the relative score for each s given an input r, we can ignore P(r) and the constant Σ_i F(s_i). The final plausibility grade is thus estimated as in equation (5):

    Grade(s|r) = P(r|s) × F(s)    (5)

The resulting readings and their scores are stored in the system database to be queried as necessary. Note that the above processing is fully automated, a valuable quality when dealing with a volatile dictionary such as EDICT.

3 System Description

The section above described the preprocessing steps necessary for our system. In this section we describe the actual implementation.

3.1 System overview

The base dictionary for our system is the publicly-available EDICT Japanese–English electronic dictionary.[5] We extracted all entries containing at least one kanji character and executed the steps described above for each. Corpus frequencies were calculated over the EDR Japanese corpus (EDR, 1995).

During the generation step we ran into problems with extremely large numbers of generated readings, particularly for strings containing large numbers of kanji. Therefore, to reduce the size of the generated data, we only generated readings for entries with less than 5 kanji segments, and discarded any readings not satisfying P(r|s) ≥ 5 × 10⁻⁵. Finally, to complete the set, we inserted correct readings for all dictionary entries s_kana that did not contain any kanji characters (for which no readings were generated above), with plausibility grade calculated by equation (6):[6]

    Grade(s_kana|r) = F(s_kana)    (6)

This resulted in the following data:

    Total entries:               97,399
    Entries containing kanji:    82,961
    Average number of segments:  2.30
    Total readings:              2,646,137
    Unique readings:             2,194,159
    Average entries per reading: 1.21
    Average readings per entry:  27.24
    Maximum entries per reading: 112
    Maximum readings per entry:  471

The above set is stored in a MySQL relational database and queried through a CGI script. Since the readings and scores are precalculated, there is no time overhead in response to a user query. Figure 1 depicts the system output for the query atamajou.[7] The system is easily accessible through any web browser. Currently we include only a Japanese–English dictionary, but it would be a trivial task to add links to translations in alternative languages.

Footnotes:
1. Although even here, life is complicated by Japanese being a non-segmenting language, putting the onus on the user to correctly identify word boundaries.
2. http://www.rikai.com
3. A unit is not limited to one character. For example, verbs and adjectives commonly have conjugating suffixes that are treated as part of the same segment.
4. In the previous example of 発表 happyou "announcement", the underlying readings of the individual characters are hatsu and hyou, respectively. When the compound is formed, the hatsu segment undergoes gemination and the hyou segment undergoes sequential voicing, resulting in the surface form reading happyou.
5. http://www.csse.monash.edu.au/~jwb/edict.html
6. Here, P(r|s_kana) is assumed to be 1, as there is only one possible reading (i.e. r).
7. This is a screen shot of the system as it is visible at http://tanaka-www.titech.ac.jp/foks/.
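To make the alternations in step 2 of Section 2.2 concrete, the following toy function derives surface reading variants for a two-segment compound. The single rule shown (gemination of a segment-final tsu, with the following h- hardening to p-, as in hatsu + hyou → happyou) is a hypothetical stand-in for the much broader alternation rules the system actually learns from alignment data:

```python
def surface_variants(r1, r2):
    """Toy sketch of step 2's phonological alternations (one hypothetical
    rule, not the learned rule set): gemination (onbin) of a segment-final
    "tsu", with a following "h" hardening to "p"."""
    variants = {r1 + r2}  # the plain compositional reading
    if r1.endswith("tsu") and r2.startswith("h"):
        # drop "tsu" and geminate the following consonant: ha + pp + you
        variants.add(r1[:-3] + "pp" + r2[1:])
    return variants

print(sorted(surface_variants("hatsu", "hyou")))  # ['happyou', 'hatsuhyou']
```

Running the function on the hatsu/hyou pair yields both the naive concatenation and the attested surface form; the real system additionally attaches a probability to each alternation.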

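The grading procedure of steps 3–6 can then be sketched as follows. The per-segment distributions below are purely illustrative numbers (with the P_phon and P_conj factors already folded into each reading's composite probability and normalized as in step 3), and the corpus frequency F(s) is likewise invented:

```python
from itertools import product

# Hypothetical normalized composite probabilities
# P(r_i|k_i) * P_phon(r_i) * P_conj(r_i) for the two segments of 発表.
SEGMENTS = [
    {"hatsu": 0.55, "hap": 0.25, "ta": 0.20},     # segment 発
    {"hyou": 0.60, "pyou": 0.25, "omote": 0.15},  # segment 表
]
FREQ = 120          # invented corpus frequency F(s) of the entry
THRESHOLD = 5e-5    # cutoff on P(r|s), as in Section 3.1

def generate_readings(segments, freq, threshold=THRESHOLD):
    """Enumerate candidate readings (step 4), prune implausible ones,
    and grade each by equation (5): Grade(s|r) = P(r|s) * F(s)."""
    graded = {}
    for combo in product(*(seg.items() for seg in segments)):
        reading = "".join(r for r, _ in combo)
        p = 1.0
        for _, p_i in combo:
            p *= p_i                    # naive Bayes independence, equation (2)
        if p >= threshold:              # discard readings with P(r|s) < 5e-5
            graded[reading] = p * freq  # equation (5)
    return sorted(graded.items(), key=lambda kv: -kv[1])

for reading, grade in generate_readings(SEGMENTS, FREQ)[:3]:
    print(f"{reading}\t{grade:.4f}")
```

Under these toy numbers the naively compositional hatsuhyou outranks the true reading happyou, mirroring the behaviour discussed in Section 3.3: the grading procedure does not check whether a generated reading is actually correct.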
3.2 Search facility

The system supports two major search modes: simple and intelligent. Simple search emulates a conventional search (see, e.g., Breen (2000)) over the original dictionary, taking both kanji and kana as query strings and displaying the resulting entries with their reading and translation. It also supports wild character and specified character length searches. These functions enable lookup of novel kanji combinations as long as at least one kanji is known and can be input into the dictionary interface.

Intelligent search is over the set of generated readings. It accepts only kana query strings[8] and proceeds in two steps. Initially, the user is provided with a list of candidates corresponding to the query, displayed in descending order of the score calculated from equation (5). The user must then click on the appropriate entry to get the full translation. This search mode is what separates our system from existing electronic dictionaries.

3.3 Example search

Let us explain the benefit of the system to the Japanese learner through an example. Suppose the user is interested in looking up 頭上 zujou "overhead" in the dictionary but does not know the correct reading. Both 頭 "head" and 上 "over/above" are quite common characters, but are frequently realized with different readings, namely atama, tou, etc. and ue, jou, etc., respectively. As a result, the user could interpret the string 頭上 as being read as atamajou or toujou, and query the system accordingly. Tables 1 and 2 show the results of these two queries.[9] Note that the displayed readings are always the correct readings for the corresponding Japanese dictionary entry, and not the reading in the original query. For all those entries where the actual reading coincides with the user input, the reading is displayed in boldface.

[Figure 1: Example of system display]

Table 1: Results of search for atamajou

    Entry   Reading  Grade       Translation
    頭上    zujou    0.40844     overhead
    頭声    tousei   0.00271168  head voice

From Table 1 we see that only two results are returned for atamajou, and that the highest ranking candidate corresponds to the desired string 頭上. Note that atamajou is not a valid word in Japanese, and that a conventional dictionary search would yield no results.

Things get somewhat more complicated for the reading toujou, as can be seen from Table 2. A total of 14 entries is returned, for four of which toujou is the correct reading (as indicated in bold). The string 頭上 is second in rank, scored higher than three entries for which toujou is the correct reading, due to the scoring procedure not considering whether the generated readings are correct or not.

Table 2: Results of search for toujou

    Entry    Reading    Grade        Translation
    登場     toujou     73.2344      appearance
    頭上     zujou      1.51498      overhead
    搭乗     toujou     1.05935      embarkation
    塔状     toujou     0.563065     cylindrical
    道場     doujou     0.201829     dojo
    東上     toujou     0.126941     going to Tokyo
    霜焼け   shimoyake  0.0296326    frostbite
    凍傷     toushou    0.0296326    frostbite
    刀匠     toushou    0.0144911    swordsmith
    頭声     tousei     0.0100581    head voice
    刀傷     toushou    0.00858729   sword wound
    痘瘡     tousou     0.00341006   smallpox
    凍瘡     tousou     0.0012154    frostbite
    東清     toushin    0.000638839  Eastern China

For both of these inputs, a conventional system would not provide access to the desired translation without additional user effort, while the proposed system returns the desired entry as a first-pass candidate in both cases.

Footnotes:
8. In order to retain the functionality offered by the simple interface, we automatically default all queries containing kanji characters and/or wild characters into simple search.
9. Readings here are given in romanized form, whereas they appear only in kana in the actual system interface. See Figure 1 for an example of real-life system output.

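The contrast between the two search modes can be sketched over a toy version of the precomputed index (the grades are those reported in Tables 1 and 2, heavily abridged; the real system queries the full reading set in a MySQL database via CGI):

```python
# Toy stand-in for the precomputed reading index of Section 3.1.
# Each entry is (correct_reading, translation, grade).
READING_INDEX = {
    "atamajou": [("zujou", "overhead", 0.40844),
                 ("tousei", "head voice", 0.00271168)],
    "toujou": [("toujou", "appearance", 73.2344),
               ("zujou", "overhead", 1.51498),
               ("toujou", "embarkation", 1.05935)],
}

def simple_search(query):
    """Conventional look-up: only entries whose correct reading equals the query."""
    return [e for e in READING_INDEX.get(query, []) if e[0] == query]

def intelligent_search(query):
    """FOKS-style look-up: every entry for which the query is a plausible
    (possibly erroneous) reading, in descending order of grade."""
    return sorted(READING_INDEX.get(query, []), key=lambda e: -e[2])

# atamajou is not a valid reading of any word, so a conventional search
# fails, while intelligent search still surfaces "overhead" first.
print(simple_search("atamajou"))   # []
print(intelligent_search("atamajou")[0])
```

The `simple_search` filter is, of course, a simplification of a real dictionary look-up, but it captures the key behavioural difference: only the intelligent mode recovers an entry from an erroneous reading.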
system returns the desired entry as a first-pass can- ¡ interpret the string as being read as atamajou or toujou and query the system accordingly. Tables didate in both cases. 9 1 and 2 show the results of these two queries. Note 4 Analysis and Evaluation that the displayed readings are always the correct readings for the corresponding Japanese dictionary To evaluate the proposed system, we first provide entry, and not the reading in the original query. For a short analysis of the reading set distribution and 8 then describe results of a preliminary experiment on In order to retain the functionality offered by the simple real-life data. interface, we automatically default all queries containing kanji characters and/or wild characters into simple search. 4.1 Reading set analysis 9Readings here are given in romanized form, whereas they appear only in kana in the actual system interface. See Figure Since we create a large number of plausible read- 1 for an example of real-life system output. ings, one potential problem is that a large number Number of results returned per reading in the original dictionary; and All is all readings in 12 All Existing the generated set. The distribution of the latter two 11 Baseline sets is calculated over the generated set of readings. 10 In Figure 2 the x-axis represents the number of 9 results returned for the given reading and the y-axis 8 represents the natural log of the number of readings 7 returning that number of results. It can be seen that 6 only a few readings return a high number of entries. 5 308 out of 2,194,159 or 0.014% readings return over 4 Number of Readings(log) 30 results. As it happens, most of the readings - 3 turning a high number of results are readings that 2 existed in the original dictionary, as can be seen from 1 the fact that Existing and All are almost identical 0 0 10 20 30 40 50 60 70 80 90 100 110 120 for x values over 30. 
Note that the average number Results per Reading of dictionary entries returned per reading is 1.21 for the complete set of generated readings. Figure 2: Distribution of results returned per read- Moreover, as seen from Figure 3 the number of ing results depends heavily on the length of the reading. In this figure, the x-axis gives the length of the read- y Average number of results returned depending on length of reading ing in characters and the -axis the average number 27 All of entries returned. It can be seen that queries con- Existing 24 Baseline taining 4 characters or more are likely to return 3 results or less on average. Here again, the Exist- 21 ing readings have the highest average of 2.88 results 18 returned for 4 character queries. The 308 readings

15 mentioned above were on average 2.80 characters in length. 12 From these results, it would appear that the re- 9 turned number of entries is ordinarily not over- Average number of results returned 6 whelming, and provided that the desired entries are included in the list of candidates, the system should 3 prove itself useful to a learner. Furthermore, if a 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 user is able to postulate several readings for a target Length of reading (in characters) string, s/ is more likely to obtain the translation with less effort by querying with the longer of the two postulates. Figure 3: Distribution of results for different query string lengths 4.2 Comparison with a conventional system As the second part of evaluation, we tested to see of candidates would be returned for each reading, ob- whether the set of candidates returned for a query scuring dictionary entries for which the input is the over the wrong reading, includes the desired entry. correct reading. This could result in a high penalty We ran the following experiment. As a data set we for competent users who mostly search the dictio- used a collection of 139 entries taken from a web site displaying real-world reading errors made by nary with correct readings, potentially making the 10 system unusable. native speakers of Japanese. For each entry, we To verify this, we tried to establish how many queried our system with the erroneous reading to candidates are returned for user queries over read- see whether the intended entry was returned among ings the system has knowledge of, and also tested the system output. To transform this collection of whether the number of results depends on the length items into a form suitable for dictionary querying, we of the query. converted all readings into hiragana, sometimes re- The distribution of results returned for different moving context words in the process. 
Table 3 gives queries is given in Figure 2, and the average num- a comparison of results returned in simple (con- ventional) and intelligent (proposed system) search ber of results returned for different-length queries 11 is given in Figure 3. In both figures, Baseline is modes. 62 entries, mostly proper names and 4- calculated over only those readings in the original 10http://www.sutv.zaq.ne.jp/shirokuma/godoku.html dictionary (i.e. correct readings); Existing is the 11We have also implemented the proposed system with the subset of readings in the generated set that existed ENAMDICT, a name dictionary in the EDICT distribution, Conventional Our System the 34 entries. The average relative rank was 12.8 In dictionary 77 77 for erroneous readings and 19.6 for correct readings. Ave. # Results 1.53 5.42 Thus, on average, erroneous readings were ranked Successful 10 34 higher than the correct readings, in line with our Mean Rank 1.4 4.71 prediction above. Admittedly, this evaluation was over a data set Table 3: Comparison between a conventional dictio- of limited size, largely because of the difficulty in nary look-up and our system gaining access to naturally-occurring kanji–reading confusion data. The results are, however, promising. character proverbs, were not contained in the dictio- nary and have been excluded from evaluation. The 4.3 Discussion erroneous readings of the 77 entries that were con- tained in the dictionary averaged 4.39 characters in In order to emulate the limited cognitive abilities of length. a language learner, we have opted for a simplistic view of how individual kanji characters combine in From Table 3 we can see that our system is able compounds. In step 4 of preprocessing, we use the to handle more than 3 times more erroneous read- naive Bayes model to generate an overall probability ings then the conventional system, representing an for each reading, and in doing so assume that com- error rate reduction of 35.8%. 
However, the average ponent readings are independent of each other, and number of results returned (5.42) and mean rank of that phonological and conjugational alternation in the desired entry (4.71 — calculated only for suc- readings does not depend on lexical context. Clearly cessful queries) are still sufficiently small to make this is not the case. For example, kanji readings de- the system practically useful. riving from Chinese and native Japanese sources (on That the conventional system covers any erro- and kun readings, respectively) tend not to co-occur neous readings at all is due to the fact that those in compounds. Furthermore, phonological and con- readings are appropriate in alternative contexts, and jugational alternations interact in subtle ways and as such both readings appear in the dictionary. are subject to a number of constraints (Vance, 1987). Whereas our system is generally able to return all reading-variants for a given kanji string and there- However, depending on the proficiency level of fore provide the full set of translations for the kanji the learner, s/he may not be aware of these rules, string, conventional systems return only the transla- and thus may try to derive compound readings in tion for the given reading. That is, with our system, a more straightforward fashion, which is adequately the learner will be able to determine which of the modeled through our simplistic independence-based readings is appropriate for the given context based model. As can be seen from preliminary experi- on the translation, whereas with conventional sys- mentation, our model is effective in handling a large tems, they will be forced into attempting to contex- number of reading errors but can be improved fur- tualize a (potentially) wrong translation. ther. 
We intend to modify it to incorporate further constraints in the generation process after observ- Out of 42 entries that our system did not handle, ing the correlation between the search inputs and the majority of misreadings were due to the usage of selected dictionary entries. incorrect character readings in compounds (17) and graphical similarity-induced error (16). Another 5 Furthermore, the current cognitive model does not errors were a result of substituting the reading of include any notion of possible errors due to graphic a semantically-similar word, and the remaining 5 a or semantic similarity. But as seen from our pre-

result of interpreting words as personal names. liminary experiment these error types are also com-

¡ £ Finally, for the same data set we compared the mon. For example, bochi “graveyard” and

relative rank of the correct and erroneous readings ¡ kichi “base” are graphically very similar but read ¥ to see which was scored higher by our grading pro- differently, and ¤ mono “thing” and koto “thing” cedure. Given that the data set is intended to ex- are semantically similar but take different readings. emplify cases where the expected reading is different This leads to the potential for cross-borrowing of er- from the actual reading, we would expect the erro- rant readings between these kanji pairs. neous readings to rank higher than the actual read- Finally, we are working under the assumption that ings. An average of 76.7 readings was created for the target string is contained in the original dictio- nary and thus base all reading generation on the allowing for name searches on the same basic methodology. existing entries, assuming that the user will only at- We feel that this part of the system should prove itself useful tempt to look up words we have knowledge of. We even to the native speakers of Japanese who often experience problems reading uncommon personal or place names. How- also provide no immediate solution for random read- ever, as of yet, we have not evaluated this part of the system ing errors or for cases where user has no intuition as and will not discuss it in detail. to how to read the characters in the target string. 4.4 Future work EDICT. 2000. EDICT Japanese-English Dictio- nary File. ftp://ftp.cc.monash.edu.au/pub/ So far we have conducted only limited tests of cor- nihongo/. relation between the results returned and the target words. In order to truly evaluate the effectiveness of EDR. 1995. EDR Electronic Dictionary Technical our system we need to perform experiments with a Guide. Japan Electronic Dictionary Research In- larger data set, ideally from actual user inputs (cou- stitute, Ltd. In Japanese. pled with the desired dictionary entry). 
The reading Yumi Ichimura, Yoshimi Saito, Kazuhiro Kimura, generation and scoring procedure can be adjusted by and Hideki Hirakawa. 2000. Kana-kanji conver- adding and modifying various weight parameters to sion system with input support based on pre- modify calculated probabilities and thus affect the diction. In Proc. of the 18th International Con- results displayed. ference on Computational Linguistics (COLING 2000), pages 341–347. Also, to get a full coverage of predictable errors, we would like to expand our model further to in- Tatuya Kitamura and Yoshiko Kawamura. 2000. corporate consideration of errors due to graphic or Improving the dictionary display in a reading semantic similarity of kanji. support system. International Symposium of Japanese Language Education. (In Japanese). Kevin Knight and Jonathan Graehl. 1998. - 5 Conclusion chine transliteration. Computational Linguistics, In this paper we have proposed a system design 24:599–612. which accommodates user reading errors and supple- Sadao Kurohashi, Toshihisa Nakamura, Yuji Mat- ments partial knowledge of the readings of Japanese sumoto, and Makoto Nagao. 1994. Improvements words. Our method takes dictionary entries con- of Japanese morphological analyzer JUMAN. In taining kanji characters and generates a number of SNLR, pages 22–28. readings for each. Readings are scored depending on Masaaki Nagata. 1994. A stochastic Japanese mor- phological analyzer using a forward-DP backward- their likeliness and stored in a system database ac- ∗ cessed through a web interface. In response to a user A N-best search algorithm. In Proc. of the 15th query, the system displays dictionary entries likely International Conference on Computational Lin- to correspond to the reading entered. Initial evalua- guistics )(COLING 1994, pages 201–207. tion indicates that the proposed system significantly NLI. 1986. Character and Writing system Edu- increases error-resilience in dictionary searches. 
cation, volume 14 of Japanese Language Educa- tion Reference. National Language Institute. (in Acknowledgements Japanese). This research was supported in part by the Re- Masahito Takahashi, Tsuyoshi Shichu, Kenji search Collaboration between NTT Communication Yoshimura, and Kosho Shudo. 1996. Process- Science Laboratories, Nippon Telegraph and Tele- ing homonyms in the kana-to-kanji conversion. phone Corporation and CSLI, Stanford University. In Proc. of the 16th International Conference We would particularly like to thank Ryo Okumura on Computational Linguistics (COLING 1996), for help in the development of the FOKS system, pages 1135–1138. Prof. Nishina Kikuko of the International Student Natsuko Tsujimura. 1996. An Introduction to Center (TITech) for initially hosting the FOKS sys- Japanese Linguistics. Blackwell, Cambridge, tem, and Francis Bond, Christoph Neumann and Massachusetts, first edition. two anonymous referees for providing valuable feed- Timothy J. Vance. 1987. Introduction to Japanese back during the writing of this paper. Phonology. SUNY Press, New York.

References Timothy Baldwin and Hozumi Tanaka. 2000. A comparative study of unsupervised grapheme- phoneme alignment methods. In Proc. of the 22nd Annual Meeting of the Cognitive Science Society (CogSci 2000), pages 597–602, Philadelphia. Timothy Baldwin, Slaven Bilac, Ryo Okumura, Takenobu Tokunaga, and Hozumi Tanaka. 2002. Enhanced Japanese electronic dictionary look-up. In Proc. of LREC. (to appear). Jim Breen. 2000. A WWW Japanese Dictionary. Japanese Studies, 20:313–317.