Automatic Labeling of Voiced Consonants for Morphological Analysis of Modern Japanese Literature
Total Page:16
File Type:pdf, Size:1020Kb
Automatic Labeling of Voiced Consonants for Morphological Analysis of Modern Japanese Literature Teruaki Oka† Mamoru Komachi† [email protected] [email protected] Toshinobu Ogiso‡ Yuji Matsumoto† [email protected] [email protected] Nara Institute of Science and Technology National† Institute for Japanese Language and Linguistics ‡ Abstract literary text,2 which achieves high performance on analysis for existing electronic text (e.g. Aozora- Since the present-day Japanese use of bunko, an online digital library of freely available voiced consonant mark had established books and work mainly from out-of-copyright ma- in the Meiji Era, modern Japanese lit- terials). erary text written in the Meiji Era of- However, the performance of morphological an- ten lacks compulsory voiced consonant alyzers using the dictionary deteriorates if the text marks. This deteriorates the performance is not normalized, because these dictionaries often of morphological analyzers using ordi- lack orthographic variations such as Okuri-gana,3 nary dictionary. In this paper, we pro- accompanying characters following Kanji stems pose an approach for automatic labeling of in Japanese written words. This is problematic voiced consonant marks for modern liter- because not all historical texts are manually cor- ary Japanese. We formulate the task into a rected with orthography, and it is time-consuming binary classification problem. Our point- to annotate by hand. It is one of the major issues wise prediction method uses as its feature in applying NLP tools to Japanese Linguistics be- set only surface information about the sur- cause ancient materials often contain a wide vari- rounding character strings. As a conse- ety of orthographic variations. quence, training corpus is easy to obtain For example, there is an issue of voiced con- and maintain because we can exploit a par- sonant marks. Any “Hiragana” character and tially annotated corpus for learning. We “Katakana” character (called Kana character alto- compared our proposed method as a pre- gether) represent either consonant (k, s, t, n, h, m, processing step for morphological analy- y, r, w) onset with vowel (a, i, u, e, o) nucleus or sis with a dictionary-based approach, and only the vowel (except for nasal codas N). Further- confirmed that pointwise prediction out- more, the characters alone can not represent syl- performs dictionary-based approach by a lables beginning with a voiced consonant (g, z, d, large margin. b) in current orthography. They are spelled with ゛ 1 Introduction Kana and a voiced consonant mark ( ) to the up- per right (see Figure 1). However, confusingly, it Recently, corpus-based approaches have been suc- was not ungrammatical to put down the charac- cessfully adopted in the field of Japanese Linguis- ter without the mark to represent voiced syllable tics. However, the central part of the fields has 2In historical linguistics, the phrase “modern Japanese” been occupied by historical research that uses an- refers to the language from 1600 on to the present in a broad cient material, on which fundamental annotations sense. However, most Japanese people regard the phrase to are often not yet available. the Meiji and Taisho Era; we also use the phrase to intend the narrower sense. Despite the limited annotated corpora, re- 3In Japanese Literature, both Kana (phonogramic char- searchers have developed several morphological acters) and Kanji (ideographic characters) are used together. analysis dictionaries for past-day Japanese. Na- Generally, conjugated form is ambiguous, given the preced- ing Kanji characters. However, the character’s pronunciation tional Institute for Japanese Language and Lin- can also be written using Kana characters. Thus, the pro- guistics creates Kindai-bungo UniDic,1 a morpho- nunciation’s tailing few syllables are hanged out (Okuri), us- logical analysis dictionary for modern Japanese ing Kana (gana) characters for disambiguating the form. Al- though the number of Okuri-gana is fixed for each Kanji char- 1http://www2.ninjal.ac.jp/lrc/index.php?UniDic acter now, it was not fixed in the Meiji Era. 292 Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 292–300, Chiang Mai, Thailand, November 8 – 13, 2011. c 2011 AFNLP ga が:か (ka) + ゛ da だ:た (ta) + ゛ gi → ぎ:き (ki) + ゛ di → ぢ:ち (ti) + ゛ → → 今や廣島は其名大に内外國に顯はれ苟も時 gu ぐ:く (ku) + ゛ du づ:つ (tu) + ゛ :: :: ge → げ:け (ke) + ゛ de → で:て (te) + ゛ 事を談 す るものは同地の形勢如何を知ら go → ご:こ (ko) + ゛ do → ど:と (to) + ゛ :: :: za → ざ:さ (sa) + ゛ ba → ば:は (ha) + ゛ んと欲せ さ るはあら す ::: :: :: :: :: zi → じ:し (si) + ゛ bi → び:ひ (hi) + ゛ zu → ず:す (su) + ゛ bu → ぶ:ふ (hu) + ゛ Today, the fame of “Hiroshima” has been ze → ぜ:せ (se) + ゛ be → べ:へ (he) + ゛ broadly known in and outside Japan, and if you zo → ぞ:そ (so) + ゛ bo → ぼ:ほ (ho) + ゛ talk about current affairs, you want to know how → → the place has been established. Figure 1: Spelling syllables beginning with the voiced consonants g, z, d and b by Hiragana char- acters with a voiced consonant mark. Figure 2: Example of sentences that includes un- marked characters. This text is an excerpt from Voiced Voiceless “The tide of Hiroshima”: Katsuichi Noguchi, Marked 3,124 0 Taiyo, No.2, p.64 (1925). Wavy-underlined char- Ambiguous 4,032 29,332 acters are ambiguous character, and gray-boxed Table 1: The contingency table of observed fre- characters are unmarked character. quencies of characters and voiceness. fully annotated sentence that does not include un- marked characters, and thus if the target sentence until the Meiji Era, because Japanese orthography includes unmarked character(s), the performance dates back to the Meiji Era. Consequently, mod- can degrade considerably. ern Japanese literary text written in the Meiji Era often lacks compulsory voiced consonant marks. There are two major approaches to handle The mark was used only when the author deems this problem: a dictionary-based approach and a it necessary to disambiguate; and it was not often classification-based approach. used if one can infer from the context that the pro- First, the dictionary-based approach creates a nunciation is voiced. dictionary that has both original spellings and Figure 2 shows characters which lack the voiced modified variants without the mark. For exam- consonant mark even though we expect it to be ple, Kindai-bungo UniDic includes both entries marked in the text. Hereafter, we call such char- “ず (zu)” and “す (zu)” for frequent words such as acters as “unmarked characters.” Also, we call the “ず (zu)” in auxiliary verb. This allows morpho- characters to which the voiced consonant mark can logical analysis algorithms to learn the weights of be attached as “ambiguous characters.” In Table 1, both entries all together from a corpus annotated we present the statistics of the voiced consonants with part-of-speech tags in order to select appro- in “Kokumin-no-tomo” corpus which we will use priate entries during decoding. for our evaluation. As you can see, 12% of the Second, the classification-based approach em- ambiguous characters are actually voiced but not ploys a corpus annotated with unmarked charac- marked. In addition, 44% of the voiced charac- ters to learn a classifier that labels the voiced ters have the voiced consonant mark, showing the consonant mark for unmarked characters. Unlike variation of using the voiced consonant mark in the dictionary-based approach, the classification- the corpus. based approach does not require part-of-speech In the modern Japanese literary text, ortho- tagged nor tokenized corpora. Since it is easier for graphic variations are not only the unmarked. human annotators to annotate unmarked characters However, unmarked characters appear a lot in the than word boundaries and part-of-speech tags, we text and can be annotated easily by hand. Thus, we can obtain a large scale annotated corpus at low can get natural texts for evaluation of our method cost. at low cost (in fact, it cost only a few weeks to an- Therefore, in this paper, we propose a notate our above-mentioned test corpus). There- classification-based approach to automatic la- fore, we decided to begin with attaching voiced beling of voiced consonant marks as a pre- consonant mark for unmarked characters as a start- processing step for morphological analysis for ing point for normalizing orthographic variations. modern Japanese literary language. Basically, Kindai-bungo UniDic is created for a We formulate the task of labeling voiced con- 293 sonant marks into a binary classification problem. and part-of-speech tags. Thus, a word segmenta- Our method uses as its feature set only surface in- tion model learned from such a limited amount of formation about the surrounding character strings data is unreliable. Unlike Nagata’s method, our with pointwise prediction, whose training data are classification-based method does not rely on word available at low cost. We use an online learning segmentation and can exploit low-level annotation method for learning large spelling variation from such as voiced consonant mark, which is available massive datasets rapidly and accurately. Thus, we quite easily. can improve its performance easily by increasing In addition, Nagata performed clustering of amount of training data. In addition, we perform characters for smoothing confusion probability clustering of Kanji, which is abundant in the train- among characters to narrow down correction can- ing data, and employ class n-grams for addressing didates. We also perform clustering on Kanji for the data sparseness problem. We compared our addressing the data sparseness problem. Though classification-based approach with the dictionary- Nagata uses character’s shape for clustering, we based approach and showed that the classification- instead use neighboring characters of the Kanji based method outperforms the dictionary-based character. The intuition behind this is that whether method, especially in an out-of-domain setting.