Error Correcting Romaji-Kana Conversion for Japanese Language Education
Total Page:16
File Type:pdf, Size:1020Kb
Error Correcting Romaji-kana Conversion for Japanese Language Education Seiji Kasahara† Mamoru Komachi† Masaaki Nagata†† Yuji Matsumoto† NTT Communication Science Nara Institute of Science and Technology †† Laboratories † 8916-5 Takayama-cho, Ikoma-shi, 2-4 Hikari-dai, Seika-cho, Nara, 630-0192 Japan Soraku-gun,Kyoto, 619-0237 Japan seiji-k, komachi, matsu @is.naist.jp { } [email protected] Abstract tification of language, spelling correction and con- verting text from Roman to kana. First, learners We present an approach to help editors of often write a word from their native language di- Japanese on a language learning SNS cor- rectly in a Japanese sentence. However, they are rect learners’ sentences written in Roman not converted correctly into their kana counterpart characters by converting them into kana. since the original spelling is usually not equiva- Our system detects foreign words and con- lent to the Japanese transliteration. Thus it is bet- verts only Japanese words even if they ter to leave these word unchanged for the read- contain spelling errors. Experimental re- ability of editors. Second, since erroneous words sults show that our system achieves about cannot be converted correctly, spelling correction 10 points higher conversion accuracy than is effective. We combined filtering with cosine traditional input method (IM). Error anal- similarities and edit distance to correct learners’ ysis reveals some tendencies of the errors spelling errors. Third, we greedily convert Ro- specific to language learners. man characters to kana for manual correction by native Japanese teachers. We compared our pro- 1 Introduction posed system with a standard IM and conducted The Japan Foundation reports that more than 3.65 error analysis of our system, showing the charac- million people in 133 countries and regions were teristics of the learner’s errors. studying Japanese in 2009. Japanese is normally written in thousands of ideographic characters im- 2 Related Work ported from Chinese (kanji) and about 50 unique Our interest is mainly focused on how to deal syllabic scripts (kana). Because memorizing these with erroneous inputs. Error detection and correc- characters is tough for people speaking European tion on sentences written in kana with kana char- languages, many learners begin their study with acter N-gram was proposed in (Shinnou, 1999). romaji, or romanization of Japanese. Our approach is similar to this, but our target is However, sentences written in kana are eas- sentences in Roman characters and has the addi- ier to edit for native Japanese than the ones in tional difficulty of language identification. Error- Roman characters. Converting Roman characters tolerant Chinese input methods were introduced in into kana helps Japanese editors correct learners’ (Zheng et al., 2011; Chen and Lee, 2000). Though sentences, but naive romaji-kana conversion does Roman-to-kana conversion is similar to pinyin-to- not work well because there are spelling errors in Chinese conversion, our target differs from them learners’ sentences. Even though traditional in- because our motivation is to help Japanese lan- put methods have functionality to convert Roman guage teachers. Japanese commercial IMs such characters into kana, existing IMs cannot treat as Microsoft Office IME1, ATOK2, and Google learners’ errors correctly since they are mainly de- IME3 have a module of spelling correction, but signed for native Japanese speakers. their target is native Japanese speakers. (Ehara and In this paper, we present an attempt to make the Tanaka-Ishii, 2008) presented a high accuracy lan- learner’s sentences easier to read and correct for guage detection system for text input. We perform a native Japanese editor by converting erroneous 1http://www.microsoft.com/japan/ text written in Roman characters into correct text office/2010/ime/default.mspx written in kana while leaving foreign words un- 2http://www.atok.com/ changed. Our method consists of three steps: iden- 3http://www.google.com/intl/ja/ime/ 38 Proceedings of the Workshop on Advances in Text Input Methods (WTIM 2011), pages 38–42, Chiang Mai, Thailand, November 13, 2011. error correction in addition to language identifica- 5.1 Language Identification tion. Correcting Japanese learners’ error is also Language identification is done by exact match- proposed in (Mizumoto et al., 2011). They try to ing input sequences in English with a roman- correct sentences written in kana and kanji mixed, ized5 Japanese dictionary. Learners sometimes di- whereas we aim at texts in Roman characters. rectly write words in their native language without adapting to Japanese romaji style. Since we are 3 Romanization of Japanese not focusing on implementing full transliteration There are some different standards of romaniza- (Knight and Graehl, 1998), we would like to con- tion in Japanese. The three main ones are Hepburn vert only Japanese words into kana. To achieve romanization, Kunrei-shiki Romaji, and Nihon- this, we use an English word dictionary because shiki Romaji. Most Japanese learners write in the most foreign words found in learners’ sentences Hepburn system, so we use this standard for our are English words. By adding dictionary, we conversion system. Hepburn romanization gen- can easily extend our system to another language. erally follows English phonology with Romance Those words matched with the dictionary are not 6 vowels. It is an intuitive method of showing the converted. WordNet 2.1 is used as the dictionary. pronunciation of a word in Japanese. The most It has 155,287 unique words. common variant is to omit the macrons or circum- We also use a Japanese word dictionary to de- flexes used to indicate a long vowel. cide whether a word goes to the approximate word matching phase or not. The Japanese word dic- 4 Romanized Japanese Learners Corpus tionary is IPADic 2.7.0. We also use a dictionary from Lang-8 of Japanese verb conjugations, because verbs in learners’ sentence are followed by conjugational To our knowledge, there are no Japanese learn- endings but they are separated in our word dic- ers’ copora written in Roman characters. There- tionary. The conjugation dictionary is made of fore, we collected text for a romanized Japanese all the occurrences of verbs and their conjugations 4 learners’ corpus from Lang-8 , a language learn- extracted from Mainichi newspaper of 1991, with ing SNS. Since it does not officially distribute the a Japanese dependency parser CaboCha 0.537 to data, we crawled the site in Dec 2010. It has ap- find bunsetsu (phrase) containing at least one verb. proximately 75,000 users writing on a wide range The number of extracted unique conjugations is of topics. There are 925,588 sentences written by 243,663. Japanese learners and 763,971 (93.4%) are revised by human editors (Mizumoto et al., 2011). About 5.2 Error Correction 10,000 sentences of them are written in Roman Words which are not matched in either the En- characters. Table 1 shows some examples of sen- glish or the Japanese dictionary in the language tences in Lang-8. As a feature of learners’ sen- identification step are corrected by the following tences in Roman characters, most of them have method. Spelling error correction is implemented delimiters between words, but verbs and their con- by approximate word matching with two different jugational endings are conjoined. Another is the measures. One is the cosine similarity of character ambiguity of particle spelling. For example, “は” unigrams. The other is edit distance. We use only (topic marker) is assigned to ha by the conver- IPADic to get approximate words. sion rule of Hepburn romanization, but it is pro- nounced as wa, so both of them are found in the 5.2.1 Candidate Generation with corpus. Pairs of “を” wo (accusative case marker) Approximate Word Matching and o,“へ” he (locative-goal case marker) and e First, we would like to select candidates with also have the same ambiguity. the minimum edit distance (Wagner and Fischer, 1974). Edit distance is the minimum number of 5 Error Tolerant Romaji-kana editing operations (insertion, deletion and substi- Conversion System tution) required to transform one string into an- The system consists of three components: lan- 5Romanization was performed by kakasi 2.3.4. http: guage identification, error correction with approx- //kakasi.namazu.org/ imate matching, and Roman-to-kana conversion. 6http://wordnet.princeton.edu/ 7http://chasen.org/˜taku/software/ 4http://lang-8.com/ cabocha/ 39 Table 1: Examples of learners’ sentences in Lang-8. Spell errors are underlined. learners’ sentence correct kana yorushiku onegia shimasu. yoroshiku onegai shimasu. よろしく おねがい します。 を みたい。 ::::::Muscle musical:::::: wo mietai. Muscle musical wo mitai. Muscle musical anatah wa aigo ga wakarimasu ka. anata wa eigo ga wakarimasu ka. あなた は えいご が わかります か。 other. However, the computational cost of edit dis- Table 2: Examples of successfully corrected word tance calculations can be a problem with a large misspelled kana correct kana 8 vocabulary. Therefore, we reduce the number of shuutmatsu しゅう t まつ shuumatsu しゅうまつ candidates using approximate word matching with do-yoobi どよおび doyoubi どようび cosine distance before calculating edit distance packu ぱ c く pakku ぱっく (Kukich, 1992). Cosine distance is calculated us- some pairs have several possibilities. One of them ing character n-gram features. We set n = 1 be- is a pair of n and following characters. For exam- cause it covers most candidates in dictionary and ple, we can read Japanese word kinyuu as “きん reduces the number of candidates appropriately. ゆう/kin-yuu: finance” and “きにゅう/kinyuu: en- For example, when we retrieved the approximate try.” The reason why it occurs is that n can be a words for packu in our dictionary with cosine dis- syllable alone. Solving this kind of ambiguity is tance, the number of candidates is reduced to 163, out of scope of this paper; and we hope it is not a and examples of retrieved words are kau, pakku, problem in practice, because after manual correc- chikau, pachikuri, etc.