Automatic Labeling of Voiced Consonants for Morphological Analysis of Modern Japanese Literature Teruaki Oka† Mamoru Komachi†
[email protected] [email protected] Toshinobu Ogiso‡ Yuji Matsumoto†
[email protected] [email protected] Nara Institute of Science and Technology National† Institute for Japanese Language and Linguistics ‡ Abstract literary text,2 which achieves high performance on analysis for existing electronic text (e.g. Aozora- Since the present-day Japanese use of bunko, an online digital library of freely available voiced consonant mark had established books and work mainly from out-of-copyright ma- in the Meiji Era, modern Japanese lit- terials). erary text written in the Meiji Era of- However, the performance of morphological an- ten lacks compulsory voiced consonant alyzers using the dictionary deteriorates if the text marks. This deteriorates the performance is not normalized, because these dictionaries often of morphological analyzers using ordi- lack orthographic variations such as Okuri-gana,3 nary dictionary. In this paper, we pro- accompanying characters following Kanji stems pose an approach for automatic labeling of in Japanese written words. This is problematic voiced consonant marks for modern liter- because not all historical texts are manually cor- ary Japanese. We formulate the task into a rected with orthography, and it is time-consuming binary classification problem. Our point- to annotate by hand. It is one of the major issues wise prediction method uses as its feature in applying NLP tools to Japanese Linguistics be- set only surface information about the sur- cause ancient materials often contain a wide vari- rounding character strings.