<<

Automatic Construction of Japanese Variant List from Large Corpus Takeshi Masuyama Satoshi Sekine Hiroshi Nakagawa Information Technology Center Computer Science Department Information Technology Center University of Tokyo New York University University of Tokyo 7-3-1, Hongo, Bunkyo 715 Broadway, 7th floor 7-3-1, Hongo, Bunkyo Tokyo 113-0023 New York NY 10003 Tokyo 113-0023 Japan USA Japan [email protected].-tokyo.ac.jp [email protected] [email protected]

Abstract gine or query for a question answering system, This paper presents a method construct may not be able to find the web pages or the Japanese KATAKANA variant list from answers for which we are looking, if a different large corpus. Our method is useful for orthography for “スパゲッティ ”isused. information retrieval, information extrac- We investigated how many documents were tion, question answering, and on, because retrieved by Google 1 when each Japanese KATAKANAwordstendtobeusedas KATAKANA variant of “spaghetti” was used “loan words” and the transliteration causes as a query. The result is shown as Table 1. several variations of spelling. Our method consists of three steps. At step 1, our sys- For example, when we inputted “スパゲッテ tem collects KATAKANA words from large ィ” as a query of Google, 104,000 documents corpus. At step 2, our system collects can- were retrieved and the percentage was 34.6%, didate pairs of KATAKANA variants from calculated by 104,000 divided by 300,556. From the collected KATAKANA words using a Table 1, we see that each of six variants appears spelling similarity which is based on the edit frequently and thus we may not be able to find distance. At step 3, our system selects vari- the web pages for which we are looking. ant pairs from the candidate pairs using a semantic similarity which is calculated by Although we can manually create Japanese a vector space model of a context of each KATAKANA variant list, it is a labor-intensive KATAKANA word. We conducted exper- task. In order to solve the problem, we propose iments using 38 years of Japanese news- an automatic method to construct Japanese paper articles and constructed Japanese KATAKANA variant list from large corpus. KATAKANA variant list with the perfor- mance of 97.4% recall and 89.1% precision. Variant # of retrieved documents Estimating from this precision, our system スパゲッティ 104,000 (34.6%) can extract 178,569 variant pairs from the corpus. スパゲッティー 25,400 (8.5%) スパゲッテイ 1,570 (0.5%) 1 Introduction スパゲティ 131,000 (43.6%) スパゲティー 37,700 (12.5%) “Loan words” in Japanese are usually writ- スパゲテイ 886 (0.3%) ten by a phonogram type of Japanese charac- Total 300,556 (100%) ter set, KATAKANA. Because of loan words, the transliteration causes several variations of Table 1: Number of retrieved documents when spelling. Therefore, Japanese KATAKANA we inputted each Japanese KATAKANA vari- words sometimes have several different or- ant of “spaghetti” as a query of Google. thographies for each original word. For exam- ple, we found at least six different spellings of “spaghetti” in 38 years of Japanese newspaper Our method consists of three steps. First, articles, such as “スパゲッティ ,” “スパゲッテ we collect Japanese KATAKANA words from ィー,” “スパゲッテイ ,” “スパゲティ ,” “スパゲ large corpus. Then, we collect candidate pairs of ティー ,” and “スパゲテイ .” The different ex- KATAKANA variants based on a spelling simi- pression causes problems when we use search larity from the collected Japanese KATAKANA engines, question answering systems, and so on words. Finally, we select variant pairs using (Yamamoto et al., 2003). For example, when we input “スパゲッティ ”asaqueryforasearchen- 1http://www.google.co.jp/ a semantic similarity based on a vector space their contexts. model of a context of each KATAKANA word. This paper is organized as follows. Section 3 Construct Japanese KATAKANA 2 describes related work. Section 3 presents Variant List from Large Corpus our method to construct Japanese KATAKANA Our method consists of the following three variant list from large corpus. Section 4 shows steps. some experimental results using 38 years of Japanese newspaper articles, which we call “the 1. Collect Japanese KATAKANA words from Corpus” from now on, followed by evaluation large corpus. and discussion. Section 5 describes future work. Section 6 offers some concluding remarks. 2. Collect candidate pairs of KATAKANA variants from the collected KATAKANA 2 Related Work words using a spelling similarity. There are some related work for the problems 3. Select variant pairs from the candidate with Japanese spelling variations. In (Shishi- pairs based on a semantic similarity. bori and Aoe, 1993), they have proposed a method for generating Japanese KATAKANA 3.1 Collect KATAKANA Words from variants by using replacement rules, such as Large Corpus ベ (be) ↔ ヴェ (ve) and チ () ↔ ツィ(tsi). Here, “↔” represents “substitution.” For ex- At the first step, we collected Japanese ample, when we apply these rules to “ベネチア KATAKANA words which consist of a (Venezia),” three different spellings are gener- KATAKANA character, ・ (bullet), ー ated as variants, such as “ベネツィア,” “ヴェネ (macron-1), and - (macron-2), which are チア,” and “ヴェネツィア.” commonly used as a part of KATAKANA Kubota et al. have extracted Japanese words, using pattern matching. For example, KATAKANA variants by first transforming our system collects three KATAKANA words “ KATAKANA words to directed graphs based on ルートウィヒ・エアハルト (Ludwig Erhard-1),” rewrite rules and by then checking whether the “ソ (Soviet),” “ルードウィッヒ・エアハルト directed graphs contain the same labeled path (Ludwig Erhard-2),” “ドイツ (Germany)” or not (Kubota et al., 1993). A part of their from the following sentences. Note that two rewrite rules is shown in Table 2. For exam- mentions of “Ludwig Erhard” have different ple, when applying these rules to “クウェート orthographies. (Kuwait),” “ク aαc,” “ク bαc,” “クウ dαc” are • generated as variants. “奇跡の経済復興の父” といわれる故ルート ウィヒ・エアハルト氏。(Defunct Ludwig KATAKANA String → Symbol Erhard-1 is called “Father of The Mirac- ウェ (we), エ () → a ulous Economic Revival.”) → ウェ (we), ウエ (ue) b • もしソ連や東欧諸国が統制志向を捨て、一 → トゥ (twu), ト (to), ツ () c 九四八年に西独のルードウィッヒ・エアハル → ー (macron) α トがとったような経済の自由化へと突き進 → ェ (small e), エ (e) d めば、西ドイツ のように奇跡の復興を遂げ るかもしれない。(If Soviet and East Eu- Table 2: A part of rewrite rules. ropean countries give up their controlling concepts and pursue the economic deregu- In (Shishibori and Aoe, 1993) and (Kubota lation which Ludwig Erhard-2 of West Ger- et al., 1993), they only paid attention to apply- many did in 1948, they may achieve the ing their replacement or rewrite rules to words miraculous revival like West Germany.) themselves and didn’t pay attention to their contexts. Therefore, they wrongly decide that 3.2 Spelling Similarity “ウエーブ”isavariantof“ウェブ.” Here, “ウ At the second step, our system collects candi- エーブ”represents“wave”and“ウェブ”repre- date pairs of two KATAKANA words, which sents “web.” In our method, we will decide if “ are similar in spelling, from the collected ウエーブ”and“ウェブ” convey the same mean- KATAKANA words described in Section 3.1. ing or not using a semantic similarity based on We used “string penalty” to collect candidate pairs. String penalty is based on the edit dis- 3.3 Context Similarity tance (Hall and DOWLING, 1980) which is a At the final step, our system selects variant similarity measure between two strings. We pairs from the candidate pairs described in Sec- used the following three types of operations. tion 3.2 based on a semantic similarity. We used a vector space model as a semantic similarity. • Substitution In the vector space model, we treated 10 ran- Replace a character with another charac- domly selected articles from the Corpus as a ter. context of each KATAKANA word. • We divided sentences of the articles into Deletion words using JUMAN2 (Kurohashi and Nagao, Delete a character. 1999) which is the Japanese morphological an- • Insertion alyzer, and then extracted content words which Insert a character. consist of nouns, verbs, adjectives, adverbs, and unknown words except stopwords. Stopwords We also added some scoring heuristics to the are composed of Japanese charac- operations based on a pronunciation similarity ters, punctuations, numerals, common words, between characters. The rules are tuned by and so on. hand using randomly selected training data. We used a cosine measure to calculate a - Some examples are shown in Table 3. Here, mantic similarity of two KATAKANA words. “↔” represents “substitution” and lines with- Suppose that one KATAKANA word makes a out ↔ represent “deletion” or “insertion.” Note context vector a and the other one makes b. that “Penalty” represents a score of the string The semantic similarity between two vectors a penalty from now on. and b is calculated as follows. For example, we give penalty 1 between “ジ a · b ャーナル”and“ヂャーナル,” because the strings sim(a, b)=cosθ = | || | (1) become the same when we replace “ヂ”with“ a b ジ” and its penalty is 1 as shown in Table 3. The cosine measure tends to overscore fre- Rules Penalty quently appeared words. Therefore, in order to ア (a) ↔ ァ (small a) 1 avoid the problem, we treated log( +1)asa score of a word appeared in a context. Here, N ジ (zi) ↔ ヂ (di) 1 represents the frequency of a word in a context. ー (macron) 1 We set 0.05 for the threshold of the seman- ハ () ↔ バ (ba) 2 tic similarity, .e. we regard candidate pairs as ウ (u) ↔ ヴ (vu) 2 variant pairs when the semantic similarities are (a) ↔ () 3 ア ヤ more than 0.05. The threshold was tuned by (tsu) ↔ (small tsu) 3 ツ ッ hand using randomly selected training data. Table 3: A part of our string penalty rules. In the case of “ルートウィヒ・エアハルト (Lud- wig Erhard-1)” and “ルードウィッヒ・エアハル ト (Ludwig Erhard-2)”, the semantic similarity We analyzed hundreds of candidate pairs becomes 0.17 as shown in Table 4. Therefore, of training data and figured out that most we regard them as a variant pair. KATAKANA variations occur when the string Note that in Table 4, a decimal number rep- penalties were less than a certain threshold. In resents a score of a word appeared in a context this paper, we set 4 for the threshold and regard calculated by log(N +1). For example, the score KATAKANA pairs as candidate pairs when the of 奇跡 (miracle) in the first context is 0.7. string penalties are less than 4. The thresh- 4 Experiments old was tuned by hand using randomly selected training data. 4.1 Data Preprocessing and For example, from the collected KATAKANA Performance Measures words described in Section 3.1, our system col- We conducted the experiments using the Cor- lects the pair of ルートウィヒ・エアハルト and pus. The number of documents in the Cor- ルードウィッヒ・エアハルト, since the string 2http://www.kc.t.u-tokyo.ac.jp/nl- penalty is 3. resource/juman.html Word ルートウィヒ・エアハルト following meanings. 奇跡 (miracle):0.7 経済 (economy):1.9 Methodp: The method using only the spelling Context 父 (father):0.7 similarity 復興 (revival):0.7 ··· Methodp&s: The method using both the spelling similarity and the semantic Word ルードウィッヒ・エアハルト similarity 奇跡 (miracle):1.1 自由化 (liberalization):1.4 Ext: The number of extracted candidate pairs Context (economy):2.4 経済 Cor: The number of correct variant pairs among 復興 (revival):1.1 ··· the extracted candidate pairs

Similarity 0.17 Note that in Methodp&s, we ignored candi- date pairs whose string penalties ranged from Table 4: Semantic similarity between “ルート 4 to 12, since we set 4 for the threshold of the ウィヒ・エアハルト”and“ルードウィッヒ・エア string penalty as described in Section 3.2. ハルト.” The result is shown in Table 5. For example, when the penalty was 2, 81 out of 117 were se- lected as correct variant pairs in Methodp and pus was 4,678,040 and the distinct number the precision was 69.2%. Also, 80 out of 98 were of KATAKANA words in the Corpus was selected as correct variant pairs in Methodp&s 1,102,108. and the precision was 81.6%. As for a test set, we collected candidate As for Penalty 1-12 of Methodp, i.e. we fo- pairs whose string penalties range from 1 to cused on the string penalties between 1 and 12, 12. The number of collected candidate pairs the recall was 100%, because we regarded 269 was 2,590,240. In order to create sample correct out of 500 as correct variant pairs and Methodp KATAKANA variant data, 500 out of 2,590,240 extracted all of them. Also, the precision was were randomly selected and we evaluated them 53.8%, calculated by 269 divided by 500. Com- manually by checking their contexts. Through paring Methodp&s to Methodp, the recall and the evaluation, we found that correct vari- the precision of Methodp&s were well-balanced, ant pairs appeared from 10 to 12. Thus, we since the recall was 97.4% and the precision was think that treating candidate pairs whose string 89.1%. penalties range from 1 to 12 can cover almost In the same way, for Penalty 1-3, i.e. the all of correct variant pairs. string penalties between 1 and 3, the recall of To evaluate our method, we used recall (), Methodp was 98.1%, since five correct variant precision (Pr), and F measure (F ). These per- pairs between 4 and 12 were ignored and the formance measures are calculated by the follow- remaining 264 out of 269 were found. The preci- ing formulas: sion of Methodp was 77.2%. It was 23.4% higher than the one of Penalty 1-12. Thus, F measure number of pairs found and correct also improved 16.4%. This result indicates that Re = , setting 4 for the threshold works well to improve total number of pairs correct overall performance. number of pairs found and correct Pr = , Now, comparing Methodp&s to Methodp total number of pairs found when the string penalties ranged from 1 to 3, the 2RePr recall of Methodp&s was0.7%lower.Thiswas F = . Re + Pr because Methodp&s couldn’t select two correct variant pairs when the penalties were 1 and 2. 4.2 Experiment-1 However, the precision of Methodp&s was 16.2% We conducted the first experiment based on higher. Thus, F measure of Methodp&s im- two settings; one method uses only the spelling proved 6.7% compared to the one of Methodp. similarity and the other method uses both the From this result, we think that taking the se- spelling similarity and the semantic similarity. mantic similarity into account is a better strat- Henceforth, we use “Methodp,” egy to construct Japanese KATAKANA variant “Methodp&s,” “Ext,” and “Cor” as the list. Penalty Methodp Methodp&s precision of Methodp&s as shown in Table 5. Cor/Ext (%) Cor/Ext (%) The result is shown in Table 7. We find that 1 130/134 (97.0) 129/129 (100) the number of candidate pairs in the Corpus 2 81/117 (69.2) 80/98 (81.6) was 100,746 for the penalty of 1, and 56,569 for 3 53/91 (58.2) 53/67 (79.1) the penalty of 2, and 40,004 for the penalty of 4 2/14 (14.3) 3. 5 0/30 (0.0) For example, when the penalty was 2, we esti- 6 1/14 (7.1) mate that 46,178 out of 56,569 could be selected 7 1/20 (5.0) as correct variant pairs, since the precision was 8 0/14 (0.0) 81.6% as shown in Table 5. In total, we estimate 9 1/12 (8.3) that 178,569 out of 197,319 could be selected as 10 0/16 (0.0) correct variant pairs from the Corpus. 11 0/17 (0.0) 12 0/21 (0.0) Penalty # of expected variant pairs Re 264/269 (98.1) 262/269 (97.4) 1 100,746/100,746 (100%) 1-3 Pr 264/342 (77.2) 262/294 (89.1) 2 46,178/56,569 (81.6%) F 86.4% 93.1% 3 31,645/40,004 (79.1%) Re 269/269 (100) Total 178,569/197,319 (90.5%) 1-12 Pr 269/500 (53.8) Table 7: Estimation of expected correct variant F 70.0% pairs. Table 5: Comparison of Methodp and Methodp&s. 4.5 Error Analysis-1 As shown in Table 5, our system couldn’t select 4.3 Experiment-2 two correct variant pairs using semantic sim- We investigated how many variant pairs were ilarity when the penalties were 1 and 2. We extracted in the case of six different spellings investigated the reason from the training data. of “spaghetti” described in Section 1. Table 6 The problem was caused because the contexts shows the result of all combination pairs when of the pairs were diffrent. For example, in the we applied Methodp&s. case of “アロックサンワ”and“アロック・サン For example, when the penalty was 1, ワ,” which represent the same building material Methodp&s selected seven candidate pairs and company “Aroc Sanwa” of Fukui prefecture in all of them were correct. Thus, the recall was Japan, their contexts were completely different 100%. From Table 6, we see that the string because of the following reason. penalties of all combination pairs ranged from • 1 to 3 and our system selected all of them by アロックサンワ, アロック・サンワ the semantic similarity. アロックサンワ (Aroc Sanwa): This word appeared with the name of Penalty Methodp&s an athlete who took part in the 1 7/7 (100%) national athletic meet held in Toyama 2 6/6 (100%) prefecture in Japan, and the company 3 2/2 (100%) sponsored the athlete. Total 15/15 (100%) アロック・サンワ (Aroc・Sanwa): This Table 6: A result of six different spellings of word was used to introduce the “spaghetti” described in Section 1. company in the article. Note that each context of these words was composed of only one article. 4.4 Estimation of expected correct variant pairs 4.6 Error Analysis-2 We estimated how many correct variant pairs FromTable5,weseethatthenumbersofincor- could be selected from the Corpus based on the rect variant pairs selected by Methodp&s were 18 and 14 for each penalty of 2 and 3. We in- From the experiments, we found that vestigated such cases in the training data. The Methodp&s performs better than Methodp, example of “カート (Cart, Kart)” and “カード since it constructed Japanese KATAKANA (Card)” is shown as follows. variant list with high performance of 97.4% re- call and 89.1% precision. • カート, カード Estimating from the precision, we found that 178,569 out of 197,319 could be selected as cor- カート (Cart, Kart): This word was used as the abbreviation of “Shopping rect variant pairs from the Corpus. The result Cart,” “Racing Kart,” or “Sport could be helpful to solve the variant problems Kart.” of information retrieval, information extraction, question answering, and so on. カード (Card): This word was used as the abbreviation of “Credit Card” or References “Cash Card” and was also used as the PatrickA.V.HallandGEOFFR.DOWLING. meaning of “Schedule of Games.” 1980. Approximate string matching. Com- puting Surveys, 12(4):381–402. Although these were not a variant pair, our system regarded the pair as the variant pair, Jun’ichi Kubota, Yukie Shoda, Masahiro because their contexts were similar. In both Kawai, Hirofumi Tamagawa, and Ryoichi Sugimura. 1993. A method of detecting contexts, “利用 (utilization),” “記録 (record),” KATAKANA variants in a document. IPSJ “客 (guest),” “目指す (aim),” “チーム (team),” NL97-16, pages 111–117. “優勝 (victory),” “高い (high, expensive),” “活 躍 (success),” “出場 (entry),” and so on were Sadao Kurohashi and Makoto Nagao. 1999. appeared frequently and therefore the semantic Japanese morphological analysis system JU- similarity became high. MAN version 3.61. Department of Informat- ics, Kyoto University. 5 Future Work Masami Shishibori and Jun’ichi Aoe. 1993. A In this paper, we have used newspaper articles method for generation and normalization of to construct Japanese KATAKANA variant list. katakana variant notations. IPSJ NL94-5, We are planning to apply our method on differ- pages 33–40. ent types of corpus, such as patent documents Eiko Yamamoto, Yoshiyuki Takeda, and Kyoji and Web data. We think that more variations Umemura. 2003. An IR similarity measure can be found from Web data. In the case of which is tolerant for morphological variation. “spaghetti” described in Section 1, we found at Natural Language Processing, 10(1):63–80. least seven more different spellings, such as “ スッパゲティ,” “スッパゲティー,” “スパゲーテ ィ,” “スパゲーティー”“スパゲテー,” “スパッゲ ティ,” and “スパッゲティー.” Although we have manually tuned scoring rules of the string penalty using training data, we are planning to introduce an automatic method for learning the rules. We will also have to consider other charac- ter types of Japanese, i.e. variations and HIRAGANA variations, though we have fo- cused on only KATAKANA variations in this paper. For example, both “引越し”and“引っ こし” mean “move” in Japanese. 6Conclusion We have described the method to construct Japanese KATAKANA variant list from large corpus. Unlike the previous work, we focused not only on the similarity in spelling but also on the semantic similarity.