A Simple Approach to Unknown Word Processing in Japanese Morphological Analysis

Ryohei Sasano¹  Sadao Kurohashi²  Manabu Okumura¹
¹ Precision and Intelligence Laboratory, Tokyo Institute of Technology
² Graduate School of Informatics, Kyoto University
{sasano,oku}@pi.titech.ac.jp, [email protected]

[Figure 1: Example of word lattice for the input "父は日本人" (My father is Japanese). The bold lines indicate the optimal path.]

Abstract

This paper presents a simple but effective approach to unknown word processing in Japanese morphological analysis, which handles 1) unknown words that are derived from words in a pre-defined lexicon and 2) unknown onomatopoeias. Our approach leverages derivation rules and onomatopoeia patterns, and correctly recognizes certain types of unknown words. Experiments revealed that our approach recognized about 4,500 unknown words in 100,000 Web sentences with only 80 harmful side effects and a 6% loss in speed.

1 Introduction

Morphological analysis is the first step in many natural language applications. Since words are not segmented by explicit delimiters in Japanese, Japanese morphological analysis consists of two subtasks: word segmentation and part-of-speech (POS) tagging. Japanese morphological analysis has successfully adopted lexicon-based approaches for newspaper articles (Kurohashi et al., 1994; Asahara and Matsumoto, 2000; Kudo et al., 2004), in which an input sentence is transformed into a lattice of candidate words using a pre-defined lexicon, and an optimal path in the lattice is then selected. Figure 1 shows an example of a word lattice for morphological analysis and an optimal path. Since the transformation from a sentence into a word lattice basically depends on the pre-defined lexicon, the existence of unknown words, i.e., words that are not included in the pre-defined lexicon, is a major problem in Japanese morphological analysis.

There are two major approaches to this problem: one is to augment the lexicon by acquiring unknown words from a corpus in advance (Mori and Nagao, 1996; Murawaki and Kurohashi, 2008), and the other is to introduce better unknown word processing into the morphological analyzer (Nagata, 1999; Uchimoto et al., 2001; Asahara and Matsumoto, 2004; Azuma et al., 2006; Nakagawa and Uchimoto, 2007). Although both approaches have their own advantages and should be exploited cooperatively, this paper focuses only on the latter approach.

Most previous work on this approach has aimed at developing a single general-purpose unknown word model. However, there are several types of unknown words, some of which can be easily dealt with by introducing simple derivation rules and unknown word patterns. In addition, as we will discuss in Section 2.3, the importance of unknown word processing varies across unknown word types. In this paper, we aim to deal with unknown words that are considered important and can be dealt with using simple rules and patterns.

Table 1 lists several types of Japanese unknown words, some of which often appear in Web text. First, we broadly divide the unknown words into two classes: words derived from the words in the lexicon, and the others. There are a lot of informal spelling variations in Web text that are derived from the words in the lexicon, such as "ぁなた" (y0u) instead of "あなた" (you) and "冷たーーい" (coooool) instead of "冷たい" (cool). The types of derivation are limited, and thus most of them can be resolved by introducing derivation rules. Unknown words other than those derived from known words are generally difficult to resolve using only simple rules, and the lexicon augmentation approach would be better for them. However, this is not true for onomatopoeias. Although Japanese is rich in onomatopoeias and some of them do not

appear in the lexicon, most of them follow several patterns such as 'ABAB,' 'AっBり,' and 'ABっと,'¹ and they thus can be resolved by considering typical patterns.

Therefore, in this paper, we introduce derivation rules and onomatopoeia patterns into the unknown word processing of Japanese morphological analysis, and aim to resolve 1) unknown words derived from words in a pre-defined lexicon and 2) unknown onomatopoeias.

¹ 'A' and 'B' denote Japanese characters.

[International Joint Conference on Natural Language Processing, pages 162–170, Nagoya, Japan, 14-18 October 2013.]

Unknown words derived from known words
  Type                                   Unknown word                           Original word
  Rendaku* (sequential voicing)          (たまご)ざけ ((tamago-)zake, sake-nog)   さけ (sake, Japanese alcoholic drink)
  Substitution with long sound symbols*  ほんとー (troo)                         ほんとう (true)
  Substitution with lowercases*          ぁなた (y0u)                            あなた (you)
  Substitution with normal symbols       うれ∪い (h@ppy)                         うれしい (happy)
  Insertion of long sound symbols*       冷たーーーい (coooool)                   冷たい (cool)
  Insertion of lowercases*               冷たぁぁぁい (coooool)                   冷たい (cool)
  Insertion of vowel characters          冷たあああい (coooool)                   冷たい (cool)

Unknown words other than those derived from known words
  Type                            Unknown word      Corresponding English expression
  Onomatopoeia with repetition*   かあかあ           caw-caw
  Onomatopoeia w/o repetition*    シュッと           hiss
  Rare word / New word            除染 / ツイッター   decontamination / Twitter

Table 1: Various types of Japanese unknown words. The '*' denotes that the type is a target of this research. See Section 2.2 for more details.

2 Background

2.1 Japanese morphological analysis

As mentioned earlier, lexicon-based approaches have been widely adopted for Japanese morphological analysis. In these approaches, we assume that a lexicon, which lists pairs consisting of a word and its corresponding part-of-speech, is available. The process of traditional Japanese morphological analysis is as follows:

1. Build a lattice of words that represents all the candidate sequences of words for an input sentence.
2. Find an optimal path through the lattice.

Figure 1 in Section 1 shows an example of a word lattice for the input sentence "父は日本人" (My father is Japanese), where a total of six candidate paths are encoded and the optimal path is marked with bold lines. The lattice is mainly built with the words in the lexicon. Some heuristics are also used for dealing with unknown words, but in most cases, only a few simple heuristics are used. In fact, the three major Japanese morphological analyzers, JUMAN (Kurohashi and Kawahara, 2005), ChaSen (Matsumoto et al., 2007), and MeCab (Kudo, 2006), use only a few simple heuristics based on the character types, such as hiragana, katakana, and alphabets,² which regard a character sequence consisting of a single character type as a word candidate.

² Four different character types are used in Japanese: hiragana, katakana, Chinese characters, and the Roman alphabet.

The optimal path is searched for based on the sum of the costs along the path. There are two types of costs: the cost for a candidate word and the cost for a pair of adjacent parts-of-speech. The cost for a word reflects the probability of the occurrence of the word, and the connectivity cost of a pair of parts-of-speech reflects the probability of an adjacent occurrence of the pair. A greater cost means a lower probability. The costs are manually assigned in JUMAN, and assigned by adopting supervised machine learning techniques in ChaSen and MeCab, while the algorithm to find the optimal path is the same in all three analyzers and is based on the Viterbi algorithm.

2.2 Types of unknown words

In this section, we detail the target unknown word types of this research.

Rendaku (sequential voicing) is a phenomenon in Japanese morpho-phonology that voices the initial consonant of the non-initial portion of a compound word. In the following example, the initial consonant of the Japanese noun "さけ" (sake, alcoholic drink) is voiced into "ざけ" (zake):

(1) たまご-ざけ (eggnog)
    tamago-zake

Since the expression "ざけ" (zake) is not included in a standard lexicon, it is regarded as an unknown word even if the original word "さけ" (sake) is included in the lexicon. There are a lot

of studies on rendaku in the field of phonetics and linguistics, and several conditions that prevent rendaku are known, such as Lyman's Law (Lyman, 1894), which states that rendaku does not occur when the second element of the compound contains a voiced obstruent. However, few studies have dealt with rendaku in morphological analysis. Since we have to check the adjacent word to recognize rendaku, it is difficult to deal with rendaku using only the lexicon augmentation approach.

In informal text, some characters are substituted by peculiar characters or symbols such as long sound symbols and lowercase characters.³ First, if there is little difference in pronunciation, the Japanese vowel characters 'あ' (a), 'い' (i), 'う' (u), 'え' (e), and 'お' (o) are sometimes substituted by the long sound symbols 'ー' or '〜.' For example, the vowel character 'う' in the Japanese adjective "ほんとう" (hontou, true) is sometimes substituted by 'ー,' and this adjective is written as "ほんとー" (hontô, troo). We call this phenomenon substitution with long sound symbols. As well as long sound symbol substitution, some hiragana characters such as 'あ' (a), 'い' (i), 'う' (u), 'え' (e), 'お' (o), 'わ' (wa), and 'か' (ka) are substituted by their lowercases: 'ぁ,' 'ぃ,' 'ぅ,' 'ぇ,' 'ぉ,' 'ゎ,' and 'ヵ.' We call this phenomenon substitution with lowercases.

³ In this paper, we call the following characters lowercase: 'ぁ,' 'ぃ,' 'ぅ,' 'ぇ,' 'ぉ,' 'ゎ,' and 'ヵ.'

There are also other types of derivation; that is, some characters are inserted into a word that is included in the lexicon. In the following examples, long sound symbols and lowercases are inserted into the Japanese adjective "冷たい" (cool).

(2) 冷たーーーい (coooool)  (insertion of long sound symbols)
(3) 冷たぁぁぁい (coooool)  (insertion of lowercases)

In addition to the unknown words derived from words in the lexicon, there are several other types of unknown words, including rare words such as "除染" (decontamination), new words such as "ツイッター" (Twitter), and onomatopoeias such as "かあかあ" (caw-caw). We can easily generate Japanese onomatopoeias that are not included in the lexicon. Most of them follow several patterns, such as 'ABAB,' 'AっBり,' and 'ABっと,' and we classify them into two types: onomatopoeias with repetition such as 'ABAB,' and onomatopoeias without repetition such as 'AっBり.'

2.3 Importance of unknown word processing of each type

The importance of unknown word processing varies across unknown word types. We give three example sentences, (4), (5), and (6), which include the unknown words "もこもこ" (fluffy), "除染" (decontamination), and "ツイッター" (Twitter), respectively. In these examples, (a) denotes the desirable morphological analysis and (b) is the output of our baseline morphological analyzer, JUMAN version 5.1 (Kurohashi and Kawahara, 2005).

(4) Input: ふわふわでもこもこの肌触り。 (A soft and fluffy feeling to the touch.)
    (a) ふわふわ / で / もこもこ / の / 肌触り。
        (soft / and / fluffy / of / touch)
    (b) ふわふわ / でも / こも / この / 肌触り。
        (soft / but / straw matting / this / touch)

(5) Input: 除染が必要。 (Decontamination is required.)
    (a) 除染 / が / 必要。
        (decontamination / is required)
    (b) 除 / 染 / が / 必要。
        (UNKNOWN WORD / UNKNOWN WORD / is required)

(6) Input: 昨日、ツイッターを始めた。 (I started Twitter yesterday.)
    (a) 昨日、/ ツイッター / を / 始めた。
        (yesterday / Twitter / ACC / started)
    (b) 昨日、/ ツイッター / を / 始めた。
        (yesterday / UNKNOWN WORD / ACC / started)

In the case of (4), the unknown word "もこもこ" (fluffy) is divided into three parts by JUMAN, and this influences the analyses of the adjacent function words: "で" (and) is changed to "でも" (but), and "の" (of) is changed to "この" (this), which will strongly affect other NLP applications. The wide scope of influence is due to the fact that "もこもこ" consists of hiragana characters, like most Japanese function words. On the other hand, in the case of (5), although the unknown word "除染" (decontamination) is divided into two parts by JUMAN, there is no influence on the adjacent analyses. Moreover, in the case of (6), although there is no lexical entry for "ツイッター" (Twitter), the segmentation is correct thanks to simple character-based heuristics for out-of-vocabulary (OOV) words.

These two unknown words do not contain hiragana characters, and thus we think it is important to resolve unknown words that contain hiragana. Since unknown words derived from words in the lexicon and onomatopoeias often contain hiragana characters, we came to the conclusion that it is more important to resolve them than to resolve rare words and new words, which often consist of katakana and Chinese characters.

[Figure 2: Example of a word lattice for the input "ぉぃしかったでーーす" (おいしかったです, It was delicious) with new nodes "ぉぃ," "ぉぃしかった," and "でーーす." The broken lines indicate the added nodes and paths, and the bold lines indicate the optimal path.]

2.4 Related work

Much work has been done on Japanese unknown word processing. Several approaches aimed to acquire unknown words from a corpus in advance (Mori and Nagao, 1996; Murawaki and Kurohashi, 2008), and others aimed to introduce a better unknown word model into the morphological analyzer (Nagata, 1999; Uchimoto et al., 2001; Asahara and Matsumoto, 2004; Nakagawa and Uchimoto, 2007). However, few works have focused on specific types of unknown words.

Kazama et al. (1999)'s work is one of them. Kazama et al. improved the morphological analyzer JUMAN to deal with the informal expressions in online chat conversations. They focused on substitution and insertion, which are also the target of this paper. However, while our approach aims to develop heuristics to flexibly search the lexicon, they expanded the lexicon, and thus their approach cannot deal with an infinite number of derivations, such as "冷たーーい" and "冷ーたーいー" for the original word "冷たい." In addition, Ikeda et al. (2009) conducted experiments using Kazama et al.'s approach on 2,000,000 blogs, and reported that it made 37.2% of the affected sentences worse. We therefore conjecture that their approach only benefits text that is very similar to online chat conversations.

Kacmarcik et al. (2000) exploited normalization rules in advance of morphological analysis, and Ikeda et al. (2009) replaced peculiar expressions with formal expressions after morphological analysis. In this research, we exploit the derivation rules and onomatopoeia patterns within morphological analysis itself. Owing to this design, our system can successfully deal with rendaku, which has not been dealt with in the previous works.

The UniDic dictionary (Den et al., 2008) handles orthographic and phonological variations, including rendaku and informal ones. However, the number of possible variations is not restricted to a fixed number, because we can insert any number of long sound symbols or lowercases into a word; thus, all the variations cannot be covered by a dictionary. In addition, as mentioned above, since we have to take into account the adjacent word to accurately recognize rendaku, lexical knowledge alone is not sufficient for rendaku recognition.

For languages other than Japanese, there is much work on text normalization that aims to handle informal expressions in social media (Beaufort et al., 2010; Liu et al., 2012; Han et al., 2012). However, their target languages are segmented languages such as English and French, and thus they can focus only on normalization. On the other hand, since Japanese is an unsegmented language, we also have to consider the word segmentation task.

3 Proposed Method

3.1 Overview

We use the rule-based Japanese morphological analyzer JUMAN version 5.1 as our baseline system. Basically, we only improve the method for building a word lattice and do not change the process for finding an optimal path through the lattice. That is, our proposed system only adds new nodes to the word lattice built by the baseline system, by exploiting the derivation rules and onomatopoeia patterns. If the new nodes and their costs are plausible, the conventional process for finding the optimal path will select the path with the added nodes. For example, if the sentence "ぉぃしかったでーーす。" is input into the baseline system, it builds the word lattice that is drawn with solid lines in Figure 2. However, this lattice does not include such expressions as "ぉぃしかった" and "でーす" since they are not included in the lexicon. Our proposed system transforms the informal expressions into their standard forms, such as "おいしかった" (delicious) and "です" (was), by exploiting the derivation rules, adds their nodes into the word lattice, and selects the path with these added nodes.
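The cost-based path search that our method leaves unchanged (Section 2.1) can be sketched as a Viterbi search over the word lattice. The following is a minimal illustration; all words, POS tags, and costs here are toy values for the Figure 1 example, not JUMAN's actual lexicon or costs.

```python
# Minimal Viterbi search over a word lattice: choose the segmentation
# that minimizes word costs plus POS-pair connectivity costs.
# All entries and costs below are illustrative, not JUMAN's real values.

def viterbi(sentence, lexicon, word_cost, conn_cost, default_conn=10.0):
    """Return the word/POS sequence with the minimum total cost."""
    n = len(sentence)
    # best[(i, pos)]: cheapest analysis of sentence[:i] ending with POS pos
    best = {(0, "BOS"): (0.0, [])}
    for i in range(n):
        states = [(pos, v) for (end, pos), v in best.items() if end == i]
        for prev_pos, (cost_i, path) in states:
            for j in range(i + 1, n + 1):
                word = sentence[i:j]
                for pos in lexicon.get(word, ()):
                    cost = (cost_i + word_cost[(word, pos)]
                            + conn_cost.get((prev_pos, pos), default_conn))
                    if (j, pos) not in best or cost < best[(j, pos)][0]:
                        best[(j, pos)] = (cost, path + [(word, pos)])
    finals = [v for (end, _), v in best.items() if end == n]
    return min(finals)[1] if finals else []

# Toy lattice for the Figure 1 example "父は日本人":
LEXICON = {"父": ("Noun",), "は": ("Particle",), "日": ("Noun",),
           "本": ("Noun",), "人": ("Noun",), "日本": ("Noun",),
           "本人": ("Noun",), "日本人": ("Noun",)}
WORD_COST = {("父", "Noun"): 1, ("は", "Particle"): 1, ("日", "Noun"): 3,
             ("本", "Noun"): 3, ("人", "Noun"): 1, ("日本", "Noun"): 2,
             ("本人", "Noun"): 2, ("日本人", "Noun"): 2}
CONN_COST = {("BOS", "Noun"): 0, ("Noun", "Particle"): 0,
             ("Particle", "Noun"): 0, ("Noun", "Noun"): 5}

print(viterbi("父は日本人", LEXICON, WORD_COST, CONN_COST))
# → [('父', 'Noun'), ('は', 'Particle'), ('日本人', 'Noun')]
```

Because the proposed method only inserts extra nodes (with small cost penalties) into the lattice, this search procedure itself needs no modification.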

3.2 Resolution of unknown words derived from words in the lexicon

We deal with five types of unknown words that are derived from words in the lexicon: rendaku, substitution with long sound symbols, substitution with lowercases, insertion of long sound symbols, and insertion of lowercases. Here, we describe how new nodes are added to the word lattice.

Rendaku.  The procedure for rendaku differs from the others. Since only the initial consonant of a word is voiced by rendaku, there is at most one possible voiced entry for each word in the lexicon. Hence, we add the voiced entries into the trie-based lexicon in advance if the original word does not satisfy any condition that prevents rendaku, such as Lyman's Law. For example, our system creates the entry "ざけ" (zake) from the original word "さけ" (sake) and adds it into the lexicon. When the system retrieves words that start from the fourth character of example (1) in Section 2.2, "たまござけ," the added entry "ざけ" (zake) is retrieved. Since rendaku occurs on the initial consonant of the non-initial portion of a compound word, our system adds the retrieved word only when it is the non-initial portion of a compound word.

Substitution with long sound symbols and lowercases.  In order to cope with substitution with long sound symbols and lowercases, our system transforms the input text into normalized strings by using simple rules. These rules substitute a long sound symbol with the one of the vowel characters 'あ,' 'い,' 'う,' 'え,' and 'お' that minimizes the difference in pronunciation. These rules also substitute lowercase characters with the corresponding uppercase characters. For example, if the sentence "ほんとーにぉぃしぃ。" (It is trooly DElicious.) is input, the nodes generated from the normalized string "ほんとうにおいしい。" are added to the word lattice along with the nodes generated from the original string.

Insertion of long sound symbols and lowercases.  In order to cope with the insertion of long sound symbols and lowercases, our system transforms the input text into a normalized string using simple rules. These rules delete long sound symbols and lowercase characters that are considered to be inserted to prolong the original pronunciation of the word. For example, if the sentence "冷たぁぁーいでーーーす。" (It iiisss coooool.) is input, the nodes generated from the normalized string "冷たいです。" are added to the word lattice. We do not consider partly deleted strings such as "冷たぁいでーす。" or the combination of substitution and insertion, in order to avoid combinatorial explosion. Therefore, our system cannot deal with unknown words generated by both insertion and substitution, but such words are rare in practice.

Costs for additional nodes.  Our system imposes small additional costs on the nodes generated from the normalized string, to give priority to the nodes generated from the original string. We set these costs by using a small development data set.

3.3 Resolution of unknown onomatopoeias

There are many onomatopoeias in Japanese. In particular, there are a lot of unfamiliar onomatopoeias in Web text. Most onomatopoeias follow limited patterns, and we thus can easily produce new onomatopoeias that follow these patterns. Hence, it seems more reasonable to recognize unknown onomatopoeias by exploiting the onomatopoeia patterns than by manually adding lexical entries for them.

Therefore, our system lists onomatopoeia candidates by using the onomatopoeia patterns shown in Tables 2 and 3, and adds them into the word lattice. Figure 3 shows examples. The number of potential entries of onomatopoeias with repetition is large, but such candidates can be quickly searched for by using a simple string matching strategy. On the other hand, searching for candidates of onomatopoeias without repetition is somewhat time-consuming compared with trie search. However, the number of potential entries of onomatopoeias without repetition is not so large, and thus our system adds all possible entries of onomatopoeias without repetition into the trie-based lexicon in advance.

  Pattern     Example           Transliteration
  ABAB        たゆたゆ           tayu-tayu
  ABCABC      ぽっかぽっか        pokka-pokka
  ABCDABCD    ちょろりちょろり     chorori-chorori

Table 2: Onomatopoeia patterns with repetition and their examples. 'A,' 'B,' 'C,' and 'D' denote either hiragana or katakana. We consider only repetitions of two to four characters.

  Pattern      Example      Transliteration
  H1っH2り      ぽっこり       pokkori
  K1ッK2リ      マッタリ       mattari
  H1っH2Yり     ぺっちゃり      pecchari
  K1ッK2Yリ     ポッチャリ      pocchari
  K1K2っと      チラっと       chiratto
  K1K2ッと      パキッと       pakitto

Table 3: Onomatopoeia patterns without repetition and their examples. 'H' denotes hiragana, 'K' denotes katakana, and 'Y' denotes palatalized consonants such as 'ゃ.'

[Figure 3: Examples of word lattices with new onomatopoeia nodes. The upper input contains "たゆたゆ" (tayu-tayu, swaying unsteadily); the lower input is "いくらくらい?" (Approximately how much?). The broken lines indicate the added nodes and paths, and the bold lines indicate the optimal path. While the optimal path includes the added node in the upper example, it does not in the lower example.]

4 Experiments

4.1 Setting

We used 100,000 Japanese sentences to evaluate our approach. These sentences were obtained from an open search engine infrastructure, TSUBAKI (Shinzato et al., 2008); each sentence includes at least one hiragana character and consists of more than twenty characters.

We first estimated the recall. Since it is too costly to create a set of data with all unknown words annotated, we made a set of data with only our target unknown words annotated. We could apply a set of regular expressions to reduce the unknown word candidates by limiting the type of unknown words. We manually annotated 100 expressions for each type, and estimated the recall.

A high recall, however, does not always imply that the proposed system performs well. It might be possible that our proposed method has bad effects on non-target words. Therefore, we also compared the whole analyses with and without the rules/patterns from the following seven aspects:⁴

1. The number of positive changes for 100 different outputs: P_100D.
2. The number of negative changes for 100 different outputs: N_100D.
3. The number of different outputs for 100,000 sentences: D_100kS.
4. The estimated number of positive changes for 100,000 sentences: P*_100kS.
5. The estimated number of negative changes for 100,000 sentences: N*_100kS.
6. The relative increase of the nodes: Node_inc.
7. The relative loss in speed: SP_loss.

Different outputs indicate cases in which the systems with and without the rules/patterns output a different result. First, for each type of rule/pattern, we extracted 100 different outputs and manually classified them into three categories: the system with the rules/patterns was better (positive), the system without the rules/patterns was better (negative), and both outputs were undesirable (others). When these outputs differed in word segmentation, we only compared the segmentation and did not take into account the POS tags. On the other hand, when these outputs did not differ in word segmentation, we compared the POS tags. Tables 6-10 list several examples. For example, "面白がれる" (can feel amused) in Table 6 should be analyzed as one word, but both the systems with and without the rules for rendaku divided it into several parts, and such a case is labeled as others.

We counted the number of different outputs for 100,000 sentences. We then calculated the estimated numbers of positive/negative changes for these sentences by using the equation:

  X*_100kS = D_100kS × X_100D / 100.

We also counted the number of created nodes in the lattice and calculated the relative increase, which would affect the time for finding the optimal path through the word lattice, and measured the analysis time and calculated the relative loss in speed.

⁴ There are two major reasons why we did not use the precision, recall, and F-measure metrics to evaluate the overall performance. The first reason is that creating a large set of annotated data is too costly. The second reason, which is more essential, is that there is no clear definition of Japanese word segmentation, especially for unknown words; that is, we can accept various word boundaries. We thought it more straightforward and efficient to compare the differences between the baseline system and the proposed system.

4.2 Results and Discussion

Table 4 lists the recall of our system for each unknown word type, together with the number of words that are covered by the UniDic dictionary. Note that while our system's recall denotes the ratio of actually recognized words, the coverage of UniDic

only denotes the number of words included in the dictionary, which can be interpreted as the upper bound of a system based on UniDic. We can confirm that our system achieved high recall for each type of unknown word. Since UniDic covered 95% of the unknown words of the rendaku type, we would be able to improve rendaku recognition by combining UniDic with our approach, which takes into account the adjacent word. Except for rendaku, our system's recall was higher than the coverage of UniDic, which confirms the effectiveness of our method.

  Unknown word type                      Recall of our system   # of words in UniDic
  Rendaku (sequential voicing)           83/100                 95
  Substitution with long sound symbols   99/100                 67
  Substitution with lowercases           100/100                84
  Insertion of long sound symbols        96/100                 50
  Insertion of lowercases                96/100                 73
  Onomatopoeia with repetition           89/100                 78
  Onomatopoeia w/o repetition            94/100                 47

Table 4: Recall of our system and the coverage of UniDic.

Table 5 summarizes the comparison between the analyses with and without the rules/patterns. In short, our method successfully recognized all types of unknown words with few bad effects. By introducing all the derivation rules and onomatopoeia patterns, we obtained 4,560 improvements for 100,000 sentences with only 80 deteriorations and a 6.2% loss in speed. In particular, the derivation rules for insertion and substitution of long sound symbols and lowercases produced 3,327 improvements for 100,000 sentences at high recall values (see Table 4), with only 27 deteriorations and a 3.8% loss in speed. We confirmed from these results that our approach is very effective for unknown words in informal text. Since the number of newly added nodes was small, the speed loss is considered to be derived not from the optimal path searching phase but from the lattice building phase.

Table 6 lists some examples of the outputs changed by introducing the derivation rule for rendaku. As listed in Tables 4 and 5, the rendaku processing produced more negative changes and a lower recall value compared with the other types. This indicates that rendaku processing is more difficult than resolving informal expressions with long sound symbols or lowercases. Since long sound symbols and lowercases rarely appear in the lexicon, there are few likely candidates other than the correct analysis. On the other hand, voiced characters often appear in the lexicon and in formal text, and thus there are many likely candidates.

  Positive
    Input: 洗濯ばさみ (clothespin)
    Our system: 洗濯/ばさみ   Baseline: 洗濯/ば/さ/み   Gold standard: 洗濯/ばさみ
  Negative
    Input: 借入れがない方 (the man without debt)
    Our system: 借入れ/がない   Baseline: 借入れ/が/ない   Gold standard: 借入れ/が/ない
  Others
    Input: 面白がれる (can feel amused)
    Our system: 面/白/がれる   Baseline: 面/白/が/れ/る   Gold standard: 面白がれる

Table 6: Examples of different outputs obtained by introducing the derivation rule for rendaku. The '/' denotes the boundary between words in the corresponding analysis, and bold font in the original paper indicates the output that is the same as the gold standard.

Table 7 lists some examples of the outputs changed by introducing the derivation rules for informal spelling with long sound symbols. We labeled the change in the analysis of "OKだよ~ん" (It's OK) as negative because the baseline system, unlike our proposed system, correctly tagged the POS of "だ"; however, the baseline system also could not correctly resolve the entire phrase. There was no different output that our proposed system could not resolve but the baseline system could fully resolve.

  Positive (insertion)
    Input: 苦~い経験 (a bitter experience)
    Our approach: 苦~い/経験   Baseline: 苦/~/い/経験   Gold standard: 苦~い/経験
  Positive (substitution)
    Input: おめでと~ (congratulations)
    Our approach: おめでと~   Baseline: お/めで/と/~   Gold standard: おめでと~
  Negative (substitution)
    Input: OKだよ~ん (It's OK)
    Our approach: OK/だ/よ~/ん   Baseline: OK/だ/よ/~/ん   Gold standard: OK/だ/よ~ん
  Others (insertion)
    Input: すげー豪華 (very luxurious)
    Our approach: す/げー/豪華   Baseline: すげ/ー豪華   Gold standard: すげー/豪華

Table 7: Examples of different outputs obtained by introducing the derivation rules for long sound symbol substitution and insertion.

Table 8 lists some examples of the outputs changed by introducing the derivation rules for informal spelling with lowercases. We labeled the change in the analysis of "ゆみぃの布団" (Yumi's bedclothes) as negative because the baseline system, unlike our proposed system, correctly segmented the postpositional particle "の." Again, for this example, the baseline system could not correctly resolve the entire phrase. As with the informal spelling with long sound symbols, there was no different output that our proposed system could not resolve but the baseline system could fully resolve.
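The substitution and insertion rules evaluated above (Section 3.2) can be approximated with a few lines of string processing. The following is only a sketch: the real system chooses the vowel for a long sound symbol so as to minimize the pronunciation difference, whereas the PROLONG table here is an assumed, partial stand-in covering just the paper's examples.

```python
import re

# Sketch of the normalization rules of Section 3.2 (illustrative only).
# Lowercase kana and their normal counterparts (footnote 3 of the paper):
SMALL2BASE = str.maketrans("ぁぃぅぇぉゎヵ", "あいうえおわか")
# Assumed, partial table: which vowel a long sound symbol stands for
# after a given kana (the actual system minimizes pronunciation difference).
PROLONG = {"ほ": "う", "と": "う", "た": "あ"}

def normalize_substitution(text):
    """Undo substitution: lowercases -> normal characters,
    long sound symbols -> vowel characters."""
    text = text.translate(SMALL2BASE)
    out = []
    for ch in text:
        if ch in "ー〜" and out and out[-1] in PROLONG:
            out.append(PROLONG[out[-1]])
        else:
            out.append(ch)
    return "".join(out)

def normalize_insertion(text):
    """Undo insertion: delete long sound symbols and lowercases that
    were inserted to prolong the original pronunciation."""
    return re.sub("[ー〜ぁぃぅぇぉゎヵ]", "", text)

print(normalize_substitution("ほんとーにぉぃしぃ。"))   # → ほんとうにおいしい。
print(normalize_insertion("冷たぁぁーいでーーーす。"))  # → 冷たいです。
```

In the actual method, nodes generated from both the original and the normalized strings are kept in the lattice, with a small cost penalty on the normalized ones, so the path search decides which reading wins.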

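The repetition patterns of Table 2 (ABAB, ABCABC, ABCDABCD) amount to a stem of two to four kana followed by an exact copy, which a backreference regex finds directly. A sketch (the non-repetition patterns of Table 3 are omitted; in the paper those are expanded into the trie-based lexicon in advance instead):

```python
import re

# Candidate generation for onomatopoeias with repetition (Table 2):
# a stem of 2-4 hiragana/katakana characters repeated once.
KANA = "[ぁ-んァ-ヶー]"
REPETITION = re.compile(rf"({KANA}{{2,4}})\1")

def onomatopoeia_candidates(sentence):
    """Return (position, surface) pairs that would be proposed as new
    onomatopoeia nodes in the word lattice."""
    return [(m.start(), m.group(0)) for m in REPETITION.finditer(sentence)]

print(onomatopoeia_candidates("ふわふわでもこもこの肌触り"))
# → [(0, 'ふわふわ'), (5, 'もこもこ')]
```

Such candidates can be spurious; for instance "いくらくらい" (Approximately how much?) contains an accidental repetition, which the cost-based path search then rejects, as in the lower example of Figure 3.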
  Rules/patterns                         P_100D  N_100D  D_100kS  P*_100kS  N*_100kS  Node_inc  SP_loss
  Rendaku (sequential voicing)           37      8       379      140       30        0.553%    2.0%
  Substitution with long sound symbols   55      1       920      506       9         0.048%    0.8%
  Substitution with lowercases           78      1       1,762    1,374     18        0.039%    0.7%
  Insertion of long sound symbols        84      0       1,301    1,093     0         0.038%    1.9%
  Insertion of lowercases                88      0       403      354       0         0.019%    0.4%
  Onomatopoeia with repetition           74      2       1,162    860       23        0.021%    0.4%
  Onomatopoeia w/o repetition            93      0       250      233       0         0.008%    0.0%
  Total                                  -       -       6,177    4,560     80        0.724%    6.2%

Table 5: Comparison between the analyses with and without the rules/patterns.

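The estimated columns of Table 5 follow from the equation of Section 4.1, X*_100kS = D_100kS × X_100D / 100; a quick arithmetic check (the rounding convention is assumed here):

```python
# Check of the estimation equation of Section 4.1 against Table 5:
#   X*_100kS = D_100kS * X_100D / 100, rounded (rounding assumed).
def estimate(x_100d, d_100ks):
    return round(d_100ks * x_100d / 100)

# Rendaku row of Table 5: P_100D = 37, N_100D = 8, D_100kS = 379
print(estimate(37, 379), estimate(8, 379))  # → 140 30
```

This reproduces the rendaku row's estimates P*_100kS = 140 and N*_100kS = 30.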
  Positive (insertion)
    Input: 出してくれぃ (please publish)
    Our system: 出して/くれぃ   Baseline: 出して/くれ/ぃ   Gold standard: 出して/くれぃ
  Positive (substitution)
    Input: おにぃちゃん (big brother)
    Our system: お/にぃちゃん   Baseline: お/に/ぃ/ちゃん   Gold standard: お/にぃちゃん
  Negative (substitution)
    Input: ゆみぃの布団 (Yumi's bedclothes)
    Our system: ゆみ/ぃの/布団   Baseline: ゆみ/ぃ/の/布団   Gold standard: ゆみぃ/の/布団
  Others (insertion)
    Input: さみすぃ (lonely)
    Our system: さ/みすぃ   Baseline: さ/みす/ぃ   Gold standard: さみすぃ

Table 8: Examples of different outputs obtained by introducing the derivation rules for lowercase substitution and insertion.

Table 9 lists some examples of the outputs changed by introducing the onomatopoeia patterns with repetition. Our system recognized unknown onomatopoeias with repetition at a recall of 89%, which is not very high. However, since there are several repetition expressions other than onomatopoeias, such as "あら/あら" (wow wow) shown in Table 9, we cannot lower the cost for onomatopoeias with repetition any further.

  Positive
    Input: たゆたゆと (wavy)
    Our system: たゆたゆ/と   Baseline: た/ゆ/た/ゆ/と   Gold standard: たゆたゆ/と
  Negative
    Input: あらあら (wow wow)
    Our system: あらあら   Baseline: あら/あら   Gold standard: あら/あら

Table 9: Examples of different outputs obtained by introducing the onomatopoeia patterns with repetition.

Table 10 lists some examples of the outputs changed by introducing the onomatopoeia patterns without repetition. Our system recognized the unknown onomatopoeias without repetition at a recall of 94%, and did not output anything worse than the baseline output, with no loss in speed.

  Positive
    Input: ぺっちゃり (flat)
    Our system: ぺっちゃり   Baseline: ぺ/っちゃ/り   Gold standard: ぺっちゃり
    Input: チラっと (at a glance)
    Our system: チラっと   Baseline: チラ/っと   Gold standard: チラっと

Table 10: Examples of different outputs obtained by introducing the onomatopoeia patterns without repetition.

In order to approximate the practical coverage of our method, we classified the unknown words that occur more than two times in the Kyoto University and NTT Blog (KNB) corpus⁵ into four types: words that are covered by the lexicon created by Murawaki and Kurohashi (2008) (Murawaki's Lexicon), words that are not covered by Murawaki's Lexicon but have entries in Wikipedia, words that are covered only by our method, and the others. Table 11 shows the results. There are in total 645 tokens of unknown words that occur more than two times in the KNB corpus, 105 of which are newly covered by our method. Since the number of tokens that are covered by neither Murawaki's Lexicon nor Wikipedia is only 187, we can say that the coverage of our method is not trivial.

  Type                            # of types   # of tokens
  Covered by Murawaki's Lexicon   13           51
  Covered by Wikipedia            68           407
  Covered by our method           15           105
  Others                          22           82
  Total                           118          645

Table 11: Classification results of unknown words that occur more than two times in the KNB corpus.

⁵ The KNB corpus consists of 4,186 sentences from Japanese blogs, and is available at http://nlp.kuee.kyoto-u.ac.jp/kuntt/.

5 Conclusion

We presented a simple approach to unknown word processing in Japanese morphological analysis. Our approach introduces derivation rules and onomatopoeia patterns, and correctly recognizes certain types of unknown words. Our experimental results on Web text revealed that our approach could recognize about 4,500 unknown words in 100,000 Web sentences with only 80 harmful side effects and a 6% loss in speed. In future work, we plan to apply our approach to machine learning-based morphological analyzers, such as MeCab, with the UniDic dictionary, which handles orthographic and phonological variations.

References

Masayuki Asahara and Yuji Matsumoto. 2000. Extended models and tools for high-performance part-of-speech tagger. In Proc. of COLING'00, pages 21–27.
Masayuki Asahara and Yuji Matsumoto. 2004. Japanese unknown word identification by character-based chunking. In Proc. of COLING'04, pages 459–465.
Ai Azuma, Masayuki Asahara, and Yuji Matsumoto. 2006. Japanese unknown word processing using conditional random fields (in Japanese). In Proc. of IPSJ SIG Notes NL-173-11, pages 67–74.
Richard Beaufort, Sophie Roekhaut, Louise-Amélie Cougnon, and Cédrick Fairon. 2010. A hybrid rule/model-based finite-state framework for normalizing SMS messages. In Proc. of ACL'10, pages 770–779.
Yasuharu Den, Junpei Nakamura, Toshinobu Ogiso, and Hideki Ogura. 2008. A proper approach to Japanese morphological analysis: Dictionary, model, and evaluation. In Proc. of LREC'08, pages 1019–1024.
Bo Han, Paul Cook, and Timothy Baldwin. 2012. Automatically constructing a normalisation dictionary for microblogs. In Proc. of EMNLP-CoNLL'12, pages 421–432.
Kazushi Ikeda, Tadashi Yanagihara, Kazunori Matsumoto, and Yasuhiro Takishima. 2009. Unsupervised text normalization approach for morphological analysis of blog documents. In Proc. of Australasian Conference on Artificial Intelligence, pages 401–411.
Gary Kacmarcik, Chris Brockett, and Hisami Suzuki. 2000. Robust segmentation of Japanese text into a lattice for parsing. In Proc. of COLING'00, pages 390–396.
Jun'ichi Kazama, Yutaka Mitsuishi, Makino Takaki, Kentaro Torisawa, Koich Matsuda, and Jun'ichi Tsujii. 1999. Morphological analysis for Japanese web chat (in Japanese). In Proc. of 5th Annual Meetings of the Japanese Association for Natural Language Processing, pages 509–512.
Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Proc. of EMNLP'04, pages 230–237.
Taku Kudo. 2006. MeCab: Yet Another Part-of-Speech and Morphological Analyzer. http://mecab.sourceforge.jp/.
Sadao Kurohashi and Daisuke Kawahara. 2005. Japanese morphological analysis system JUMAN version 5.1 manual.
Sadao Kurohashi, Toshihisa Nakamura, Yuji Matsumoto, and Makoto Nagao. 1994. Improvements of Japanese morphological analyzer JUMAN. In Proc. of The International Workshop on Sharable Natural Language Resources, pages 22–38.
Fei Liu, Fuliang Weng, and Xiao Jiang. 2012. A broad-coverage normalization system for social media language. In Proc. of ACL'12, pages 1035–1044.
Benjamin Smith Lyman. 1894. The change from surd to sonant in Japanese compounds. Philadelphia: Oriental Club of Philadelphia.
Yuji Matsumoto, Kazuma Takaoka, and Masayuki Asahara. 2007. ChaSen: Morphological analyzer version 2.4.0 user's manual.
Shinsuke Mori and Makoto Nagao. 1996. Word extraction from corpora and its part-of-speech estimation using distributional analysis. In Proc. of COLING'96, pages 1119–1122.
Yugo Murawaki and Sadao Kurohashi. 2008. Online acquisition of Japanese unknown morphemes using morphological constraints. In Proc. of EMNLP'08, pages 429–437.
Masaaki Nagata. 1999. A part of speech estimation method for Japanese unknown words using a statistical model of morphology and context. In Proc. of ACL'99, pages 277–284.
Tetsuji Nakagawa and Kiyotaka Uchimoto. 2007. A hybrid approach to word segmentation and POS tagging. In Proc. of ACL'07, pages 217–220.
Keiji Shinzato, Tomohide Shibata, Daisuke Kawahara, Chikara Hashimoto, and Sadao Kurohashi. 2008. TSUBAKI: An open search engine infrastructure for developing new information access methodology. In Proc. of IJCNLP'08, pages 189–196.
Kiyotaka Uchimoto, Satoshi Sekine, and Hitoshi Isahara. 2001. The unknown word problem: a morphological analysis of Japanese using maximum entropy aided by a dictionary. In Proc. of EMNLP'01, pages 91–99.