Error Correcting Romaji-kana Conversion for Japanese Language Education

Seiji Kasahara† Mamoru Komachi† Masaaki Nagata†† Yuji Matsumoto†

† Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara, 630-0192 Japan
†† NTT Communication Science Laboratories, 2-4 Hikari-dai, Seika-cho, Soraku-gun, Kyoto, 619-0237 Japan
{seiji-k, komachi, matsu}@is.naist.jp, [email protected]

Abstract

We present an approach to help editors of Japanese on a language learning SNS correct learners' sentences written in Roman characters by converting them into kana. Our system detects foreign words and converts only Japanese words, even if they contain spelling errors. Experimental results show that our system achieves about 10 points higher conversion accuracy than a traditional input method (IM). Error analysis reveals some tendencies of the errors specific to language learners.

1 Introduction

The Japan Foundation reports that more than 3.65 million people in 133 countries and regions were studying Japanese in 2009. Japanese is normally written in thousands of ideographic characters imported from Chinese (kanji) and about 50 unique syllabic scripts (kana). Because memorizing these characters is tough for speakers of European languages, many learners begin their study with romaji, the romanization of Japanese.

However, sentences written in kana are easier to edit for native Japanese speakers than ones in Roman characters. Converting Roman characters into kana helps Japanese editors correct learners' sentences, but naive romaji-kana conversion does not work well because learners' sentences contain spelling errors. Even though traditional input methods can convert Roman characters into kana, existing IMs cannot handle learners' errors correctly, since they are mainly designed for native Japanese speakers.

In this paper, we present an attempt to make learners' sentences easier to read and correct for a native Japanese editor by converting erroneous text written in Roman characters into correct text written in kana, while leaving foreign words unchanged. Our method consists of three steps: language identification, spelling correction, and conversion of text from Roman characters to kana. First, learners often write words from their native language directly in a Japanese sentence. Such words are not converted correctly into their kana counterparts, since the original spelling is usually not equivalent to the Japanese transliteration. Thus it is better to leave these words unchanged for the readability of editors. Second, since erroneous words cannot be converted correctly, spelling correction is effective. We combine filtering with cosine similarity and edit distance to correct learners' spelling errors. Third, we greedily convert Roman characters to kana for manual correction by native Japanese teachers. We compared our proposed system with a standard IM and conducted error analysis of our system, showing the characteristics of learners' errors.

2 Related Work

Our interest is mainly focused on how to deal with erroneous inputs. Error detection and correction on sentences written in kana with a kana character n-gram model was proposed by Shinnou (1999). Our approach is similar, but our target is sentences in Roman characters, which poses the additional difficulty of language identification. Error-tolerant Chinese input methods were introduced in (Zheng et al., 2011; Chen and Lee, 2000). Though roman-to-kana conversion is similar to pinyin-to-Chinese conversion, our target differs from theirs because our motivation is to help Japanese language teachers. Japanese commercial IMs such as Microsoft Office IME (http://www.microsoft.com/japan/office/2010/ime/default.mspx), ATOK (http://www.atok.com/), and Google IME (http://www.google.com/intl/ja/ime/) have spelling correction modules, but their target is native Japanese speakers. Ehara and Tanaka-Ishii (2008) presented a high accuracy language detection system for text input; we perform error correction in addition to language identification. Correcting Japanese learners' errors is also proposed in (Mizumoto et al., 2011). They correct sentences written in mixed kana and kanji, whereas we aim at texts in Roman characters.

3 Romanization of Japanese

There are several different standards for romanizing Japanese. The three main ones are Hepburn romanization, Kunrei-shiki romaji, and Nihon-shiki romaji. Since most Japanese learners write in the Hepburn system, we use this standard for our conversion system. Hepburn romanization generally follows English phonology with Romance vowels, and is an intuitive way of showing the pronunciation of a word in Japanese. The most common variant omits the macrons or circumflexes used to indicate a long vowel.

4 Romanized Japanese Learners Corpus from Lang-8

To our knowledge, there are no Japanese learners' corpora written in Roman characters. Therefore, we collected text for a romanized Japanese learners' corpus from Lang-8 (http://lang-8.com/), a language learning SNS. Since Lang-8 does not officially distribute its data, we crawled the site in December 2010. It has approximately 75,000 users writing on a wide range of topics. There are 925,588 sentences written by Japanese learners, of which 763,971 (93.4%) are revised by human editors (Mizumoto et al., 2011). About 10,000 of these sentences are written in Roman characters. Table 1 shows some examples of sentences in Lang-8. One characteristic of learners' sentences in Roman characters is that most have delimiters between words, but verbs and their conjugational endings are conjoined. Another is the ambiguity of particle spelling. For example, "は" (topic marker) is assigned "ha" by the Hepburn conversion rule but is pronounced "wa", so both spellings are found in the corpus. The particles "を" (accusative case marker), written "wo" or "o", and "へ" (locative-goal case marker), written "he" or "e", show the same ambiguity.

Table 1: Examples of learners' sentences in Lang-8. Spelling errors appear in the first column; foreign words (e.g., "Muscle musical") are left unchanged.

learners' sentence                  correct                            kana
yorushiku onegia shimasu.           yoroshiku onegai shimasu.          よろしく おねがい します。
Muscle musical wo mietai.           Muscle musical wo mitai.           Muscle musical を みたい。
anatah wa aigo ga wakarimasu ka.    anata wa eigo ga wakarimasu ka.    あなた は えいご が わかります か。

5 Error Tolerant Romaji-kana Conversion System

The system consists of three components: language identification, error correction with approximate matching, and roman-to-kana conversion.

5.1 Language Identification

Language identification is done by exact matching of input sequences against an English dictionary and a romanized Japanese dictionary (Japanese entries were romanized with kakasi 2.3.4, http://kakasi.namazu.org/). Learners sometimes write words from their native language directly, without adapting them to Japanese romaji style. Since we are not focusing on implementing full transliteration (Knight and Graehl, 1998), we would like to convert only Japanese words into kana. To achieve this, we use an English word dictionary, because most foreign words found in learners' sentences are English. By adding a dictionary for another language, the system can easily be extended to that language. Words matched against the English dictionary are not converted. WordNet 2.1 (http://wordnet.princeton.edu/), which has 155,287 unique words, is used as the English dictionary.

We also use a Japanese word dictionary, IPADic 2.7.0, to decide whether a word goes to the approximate word matching phase. In addition, we use a dictionary of Japanese verb conjugations, because verbs in learners' sentences are followed by conjugational endings, whereas the two are separated in our word dictionary. The conjugation dictionary consists of all occurrences of verbs and their conjugations extracted from the 1991 Mainichi newspaper, using the Japanese dependency parser CaboCha 0.53 (http://chasen.org/~taku/software/cabocha/) to find bunsetsu (phrases) containing at least one verb. The number of extracted unique conjugations is 243,663.
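To make the routing concrete, the following is a minimal Python sketch of the dictionary-cascade logic described above. The inline word sets are placeholders of our own; the actual system looks words up in WordNet 2.1 and in the romanized IPADic plus the conjugation dictionary.

```python
# Sketch of the dictionary-cascade language identification step.
# The tiny inline sets stand in for the real resources (WordNet 2.1
# for English; romanized IPADic 2.7.0 plus the verb conjugation
# dictionary for Japanese).
english_dict = {"muscle", "musical", "hello"}
japanese_dict = {"yoroshiku", "onegai", "shimasu", "eigo"}

def identify(word: str) -> str:
    """Route a token: foreign words are left unchanged, known Japanese
    words go straight to kana conversion, and unknown words are sent
    to approximate word matching (Section 5.2)."""
    w = word.lower()
    if w in english_dict:
        return "foreign"          # leave as-is for readability
    if w in japanese_dict:
        return "japanese"         # convert directly to kana
    return "needs_correction"     # candidate for spelling correction

for w in ["musical", "yoroshiku", "yorushiku"]:
    print(w, "->", identify(w))
```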
5.2 Error Correction

Words that are matched in neither the English nor the Japanese dictionary in the language identification step are corrected by the following method. Spelling error correction is implemented by approximate word matching with two different measures: one is the cosine similarity of character unigrams, the other is edit distance. We use only IPADic to get approximate words.

5.2.1 Candidate Generation with Approximate Word Matching

First, we would like to select candidates with the minimum edit distance (Wagner and Fischer, 1974). Edit distance is the minimum number of editing operations (insertion, deletion, and substitution) required to transform one string into another. However, the computational cost of edit distance calculation can be a problem with a large vocabulary. Therefore, we reduce the number of candidates using approximate word matching with cosine distance before calculating edit distance (Kukich, 1992). We set the maximum edit distance between input and candidate to 1, because this achieved the best accuracy in a preliminary experiment. Cosine distance is calculated using character n-gram features. We set n = 1 because this covers most candidates in the dictionary while reducing the number of candidates appropriately. For example, when we retrieved the approximate words for packu in our dictionary with cosine distance, the number of candidates was reduced to 163; examples of retrieved words are kau, pakku, chikau, pachikuri, etc. Approximate word matching with cosine similarity can be performed very efficiently with SimString (Okazaki and Tsujii, 2010; http://www.chokkan.org/software/simstring/) to get candidates from a large-scale word dictionary.
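The two-stage filtering can be sketched as follows. This is an illustrative brute-force version: the paper performs the cosine step efficiently with SimString over IPADic, and the cosine cutoff used below is our own assumption, since the paper does not report one.

```python
# Sketch of candidate generation: a character-unigram cosine prefilter
# followed by an edit-distance threshold of 1 (the paper's setting).
# The cosine cutoff of 0.8 is an illustrative assumption.
import math
from collections import Counter

def cosine_unigram(a: str, b: str) -> float:
    """Cosine similarity between character-unigram count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                     # deletion
                            curr[j - 1] + 1,                 # insertion
                            prev[j - 1] + (ch_a != ch_b)))   # substitution
        prev = curr
    return prev[-1]

def candidates(word, dictionary, cos_threshold=0.8, max_dist=1):
    """Two-stage filtering: cheap cosine prefilter, then edit distance."""
    rough = [w for w in dictionary if cosine_unigram(word, w) >= cos_threshold]
    return [w for w in rough if edit_distance(word, w) <= max_dist]

dictionary = ["pakku", "kau", "chikau", "pachikuri", "renshuu"]
print(candidates("packu", dictionary))  # ['pakku']
```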

5.2.2 Selecting the Most Likely Candidate

The system selects the correct word by choosing the most likely candidate according to an n-gram cost normalized by word length. The cost is calculated with a romanized character 5-gram model built from the kakasi-romanized Mainichi newspaper corpus of 1991, using SRILM 1.5.12 (http://www-speech.sri.com/projects/srilm/) with Witten-Bell smoothing, which works well compared to Kneser-Ney when data is very sparse.
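A minimal stand-in for this ranking step is sketched below. We substitute a toy add-one-smoothed character bigram model for the paper's SRILM character 5-gram model; only the scoring scheme (negative log-probability per character, lowest cost wins) is the point here.

```python
# Sketch of candidate ranking by length-normalized character n-gram cost.
# A toy add-one-smoothed bigram model trained on a few words stands in
# for the paper's 5-gram model over romanized newspaper text.
import math
from collections import Counter

training = "yoroshiku onegai shimasu doyoubi shuumatsu renshuu"
bigrams = Counter(zip(training, training[1:]))
unigrams = Counter(training)
vocab = len(set(training))

def logprob(prev_ch: str, ch: str) -> float:
    """Add-one smoothed character bigram log-probability."""
    return math.log((bigrams[(prev_ch, ch)] + 1) /
                    (unigrams[prev_ch] + vocab))

def normalized_cost(word: str) -> float:
    """Negative log-probability per character (lower is better)."""
    cost = -sum(logprob(a, b) for a, b in zip(word, word[1:]))
    return cost / len(word)

def select(cands):
    """Pick the candidate with the lowest normalized n-gram cost."""
    return min(cands, key=normalized_cost)

print(select(["doyoubi", "doyobii"]))  # doyoubi
```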

5.3 Converting Roman Characters into Kana

We greedily convert Roman characters into the longest matching kana characters. If a word includes a character with a circumflex, it is assumed to represent two vowels forming a long sound (e.g., "kyôdai" is expanded as kyoudai: brother). Characters not used in the Hepburn system are treated as other characters with a similar sound in English where possible. For example, ca, ci, cu, ce, co are treated as ka, shi, ku, se, ko, respectively.

Most kanas correspond to a pair of a consonant and a vowel. Although most pairs of Roman characters are converted into kana unambiguously, some pairs have several possibilities. One of them is a pair of n and the following characters. For example, the Japanese word kinyuu can be read as "きんゆう/kin-yuu: finance" or "きにゅう/kinyuu: entry." This ambiguity arises because n can form a mora on its own. Solving this kind of ambiguity is out of the scope of this paper, and we expect it is not a problem in practice, because after manual correction we can translate kana back to Roman characters unambiguously.
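A sketch of the greedy longest-match conversion is given below. The mapping table holds only a handful of Hepburn entries for illustration; a full implementation also needs the fallbacks described above (circumflex expansion, ca/ci/cu/ce/co respellings, doubled consonants, and so on).

```python
# Sketch of greedy longest-match romaji-to-kana conversion.
# Only a few Hepburn mappings are included; a real table covers all
# kana plus the fallback rules described in the text.
ROMAJI_TO_KANA = {
    "kyo": "きょ", "shi": "し", "ka": "か", "ki": "き", "ku": "く",
    "yo": "よ", "ro": "ろ", "ma": "ま", "su": "す", "u": "う",
    "o": "お", "i": "い", "da": "だ", "n": "ん",
}
MAX_KEY = max(len(k) for k in ROMAJI_TO_KANA)

def to_kana(word: str) -> str:
    """Scan left to right, always taking the longest matching key."""
    out, i = [], 0
    while i < len(word):
        for length in range(min(MAX_KEY, len(word) - i), 0, -1):
            chunk = word[i:i + length]
            if chunk in ROMAJI_TO_KANA:
                out.append(ROMAJI_TO_KANA[chunk])
                i += length
                break
        else:  # no mapping found: keep the character as-is
            out.append(word[i])
            i += 1
    return "".join(out)

print(to_kana("yoroshiku"))  # よろしく
print(to_kana("kyoudai"))    # きょうだい
```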
6 Experiments

We evaluated our approach in converting Roman characters into kana after spelling error correction of sentences.

6.1 Evaluation Metrics

We evaluate the word accuracy of error correction. We also evaluate error correction performance with recall and precision, defined as follows:

Recall = N_t / N_w,  Precision = N_t / N_e

where N_t, N_w, and N_e denote the number of words corrected from a wrong word to the right word by the system, the number of words that contain errors, and the number of words edited by the system, respectively.
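As a quick worked example of these metrics (all counts below are invented purely for illustration):

```python
# Hypothetical counts, for illustration only.
N_w = 200  # words in the test set that contain errors
N_e = 180  # words the system edited
N_t = 140  # words corrected from a wrong form to the right one

recall = N_t / N_w      # 140 / 200 = 0.700
precision = N_t / N_e   # 140 / 180 ≈ 0.778
print(f"Recall = {recall:.3f}, Precision = {precision:.3f}")
```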

6.2 Experimental Settings

For comparison, we use Anthy 7900 (http://anthy.sourceforge.jp/) as a baseline; it is one of the de facto standard open source IMs. It uses neither language identification nor approximate word matching, and it is not particularly tailored for spelling error correction. To compare with another system that has an error correction function, we also experimented with the Google CGI API for Japanese Input (http://www.google.com/intl/ja/ime/cgiapi.html). Since it does not have a romaji-kana conversion module, this experiment was conducted using romaji-kana conversion by Anthy and error correction by the Google API. We also compare our system with and without approximate word matching.

6.3 Data Set

We collected 500 sentences written in Roman characters from Lang-8. Although some of them had already been revised, we manually re-annotated gold standard answers to enhance consistency of quality. While making the test set, we corrected only spelling, even when sentences contained other types of errors, because our main purpose is correcting spelling errors. There are 3,274 words in the test data, with 32 characters per sentence on average.

6.4 Experimental Results

Table 4 shows the spelling correction accuracy. The word accuracy of the proposed system is 85.0%, about 10 points higher than Anthy's 74.5%. The accuracy of our method without approximate word matching is 84.5%, showing that language identification is the crucial component of our method (of 137 foreign words in the test data, 124 were correctly identified). Examples of successfully corrected words are shown in Table 2. Spelling correction with approximate matching improves precision without degrading recall. However, the low performance of the baseline system shows the difficulty of this task.

Table 4: Performance of error correction

method                     Acc    P     R
Anthy (baseline)           74.5   66.7  69.7
Anthy w/ Google API        77.8   69.8  72.9
Proposed w/o word match    84.5   76.6  77.3
Proposed w/ word match     85.0   78.1  78.6

Table 2: Examples of successfully corrected words

misspelled    kana          correct     kana
shuutmatsu    しゅうtまつ    shuumatsu   しゅうまつ
do-yoobi      どよおび       doyoubi     どようび
packu         ぱcく          pakku       ぱっく

7 Discussion

Examples of uncorrected words are shown in Table 3. The three largest sources of failure are matching with a valid word (40%), too large an edit distance between the original word and the correct word (24%), and compound words (14%).

Table 3: Examples of uncorrected words

misspelled        kana               correct
renshou           れんしょう          renshuu
musugashi         むすがし            muzukashii
noryoukushiken    のりょうくしけん     nouryokushiken

Matching with a valid word: This occurs when the input matches a word in the dictionary. For example, if a learner incorrectly writes renshou instead of renshuu, it is not corrected because renshou is found in the Japanese dictionary. This type of error cannot be corrected without context information, so a word-based language model is worth trying.

Too large edit distance: A word whose edit distance from the input is larger than the threshold is not selected as a candidate. For example, if the learner writes muzukashii as musugashi, the edit distance between the two is 3, which exceeds our threshold of 1. We can raise the threshold, but a larger threshold introduces dissimilar words into the candidate list. Table 5 shows error types with their percentage of all erroneous words, together with system performance on each type (L1 denotes the learner's native language). Learners tend to confuse vowels and write erroneous words such as domou instead of doumo. Setting a lower cost for edit operations on vowels than on consonants may fix this kind of phonetic error; a sketch of this idea follows the table. A Japanese IM that lets users input kana and kanji by typing only consonants (Tanaka-Ishii et al., 2001) can be seen as a special case where the cost of edit operations on vowels is set to zero.

Table 5: Error types and system performance (percentages in parentheses)

error type              number       corrected
Typo                    31 (13.1)    7 (22.6)
Due to L1 phonetics     62 (26.3)    4 (6.5)
Due to L1 writing       28 (11.9)    2 (7.1)
Confusing vowels        88 (37.3)    7 (8.0)
Others                  27 (11.4)    0 (0.0)
Total                   236          20 (8.5)
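The vowel-discounting idea can be sketched as a weighted edit distance. The 0.5 substitution cost below is an arbitrary choice of ours for illustration, and only vowel-to-vowel substitutions are discounted; the paper suggests only the direction (vowel operations cheaper than consonant ones), not a specific weighting.

```python
# Sketch of a vowel-discounted edit distance: substitutions between
# two vowels cost 0.5 instead of 1, so common vowel confusions fall
# back under the selection threshold while consonant errors do not.
VOWELS = set("aeiou")

def weighted_edit_distance(a: str, b: str, vowel_sub_cost=0.5) -> float:
    """Levenshtein distance where vowel-to-vowel substitutions are cheap."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [float(i)]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                sub = 0.0
            elif ca in VOWELS and cb in VOWELS:
                sub = vowel_sub_cost
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0,        # deletion
                            curr[j - 1] + 1.0,    # insertion
                            prev[j - 1] + sub))   # substitution
        prev = curr
    return prev[-1]

# A pure vowel confusion now scores below a threshold of 1, while
# consonant errors keep their full cost.
print(weighted_edit_distance("renshou", "renshuu"))       # 0.5
print(weighted_edit_distance("musugashi", "muzukashii"))  # 3.0
```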
To valid word (40%), too large edit distance between solve this problem, word segmentation techniques original word and correct word (24%), and com- may be effective. pound words (14%). Acknowledgment Matching with valid word: Matching with valid word occurs when the input matches a word We would like to express our gratitude to our col- 14There are 3,274 words in the test data and 32 characters leagues, Tomoya Mizumoto and Joseph Irwin for in a sentence on average. their cooperation. 15The number of foreign words in the test data is 137 and 124 words of them were correctly identified.

References

Zheng Chen and Kai-Fu Lee. 2000. A New Statistical Approach to Chinese Pinyin Input. In Proceedings of ACL, pages 241–247.

Yo Ehara and Kumiko Tanaka-Ishii. 2008. Multilingual Text Entry using Automatic Language Detection. In Proceedings of IJCNLP, pages 441–448.

Kevin Knight and Jonathan Graehl. 1998. Machine Transliteration. Computational Linguistics, 24(4):599–612.

Karen Kukich. 1992. Techniques for Automatically Correcting Words in Text. ACM Computing Surveys, 24(4):377–439.

Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners. In Proceedings of IJCNLP.

Naoaki Okazaki and Jun'ichi Tsujii. 2010. Simple and Efficient Algorithm for Approximate Dictionary Matching. In Proceedings of COLING, pages 851–859.

Hiroyuki Shinnou. 1999. Detection and Correction for Errors in Kana Sequences by a Hiragana Character N-gram (in Japanese). Transactions of Information Processing Society of Japan, 40(6):2690–2698.

Kumiko Tanaka-Ishii, Yusuke Inutsuka, and Masato Takeichi. 2001. Japanese input system with digits –Can Japanese be input only with consonants? In Proceedings of HLT, pages 211–218.

Robert A. Wagner and Michael J. Fischer. 1974. The String to String Correction Problem. Journal of the ACM, 21(1):168–173.

Yabin Zheng, Chen Li, and Maosong Sun. 2011. CHIME: An Efficient Error-Tolerant Chinese Pinyin Input Method. In Proceedings of IJCAI, pages 2551–2556.
