A Probabilistic Algorithm for Segmenting Non-Kanji Japanese Strings (1994)
From: AAAI-94 Proceedings. Copyright © 1994, AAAI (www.aaai.org). All rights reserved.

A Probabilistic Algorithm for Segmenting Non-Kanji Japanese Strings

Virginia Teller
Hunter College and the Graduate School
The City University of New York
695 Park Avenue
New York, NY 10021
[email protected]

Eleanor Olds Batchelder
The Graduate School
The City University of New York
33 West 42nd Street
New York, NY 10036
[email protected]

Abstract

We present an algorithm for segmenting unrestricted Japanese text that is able to detect up to 98% of the words in a corpus. The segmentation technique, which is simple and extremely fast, does not depend on a lexicon or any formal notion of what a word is in Japanese, and the training procedure does not require annotated text of any kind. Relying almost exclusively on character type information and a table of hiragana bigram frequencies, the algorithm makes a decision as to whether to create word boundaries or not. This method divides strings of Japanese characters into units that are computationally tractable and that can be justified on lexical and syntactic grounds as well.

Introduction

A debate is being waged in the field of machine translation about the degree to which rationalist and empiricist approaches to linguistic knowledge should be used in MT systems. While most participants in the debate seem to agree that both methods are useful, albeit for different tasks, few have compared the limits of knowledge-based and statistics-based techniques in the various stages of translation.

The perspective adopted in this paper represents one end of the rationalist-empiricist spectrum. We report a study that assumes almost no rule-based knowledge and attempts to discover the maximum results that can be achieved with primarily statistical information about the problem domain. For the task of segmenting non-kanji strings in unrestricted Japanese text, we found that the success rate of this minimalist method approaches 95%.

Background

In recent years, researchers in the field of natural language processing have become interested in analyzing increasingly large bodies of text. Whereas a decade ago a corpus of one million words was considered large, corpora consisting of tens of millions of words are common today, and several are close to 100 million words. Since exhaustively parsing such enormous amounts of text is impractical, lexical analyzers called part-of-speech taggers have been used to obtain information about the lexical, syntactic, and some semantic properties of large corpora. Automatic text tagging is an important first step in discovering the linguistic structure of large text corpora.

Probabilistic approaches to tagging have developed in response to the failure of traditional rule-based systems to handle large-scale applications involving unrestricted text processing. Characterized by the brittleness of handcrafted rules and domain knowledge and the intractable amount of work required to build them and to port them to new domains and applications, the rule-based paradigm that has dominated NLP and artificial intelligence in general has come under close scrutiny. Stochastic taggers are one example of a class of alternative approaches, often referred to as "corpus-based" or "example-based" techniques, that use statistics rather than rules as the basis for NLP systems.

A major decision in the design of a tagger is to determine exactly what will count as a word, and whether two sequences of characters are instances of the same word or different words (Brown et al. 1992). This may sound trivial - after all, words are delimited by spaces - but it is a problem that has plagued linguists for decades. For example, is shouldn't one word or two? Is shouldn't different from should not? If hyphenated forms like rule-based and Baum-Welch (as in Baum-Welch algorithm) are to count as two words, then what about vis-à-vis? The effects of capitalization must also be considered, as the following example shows:

    Bill, please send the bill.
    Bill me today or bill me tomorrow.
    May I pay in May?

In the first sentence Bill is a proper noun and bill is a common noun. In the second sentence Bill and bill are the same - both are verbs - while in the third sentence May and May are different - one is a modal and the other a proper noun.

These problems are compounded in Japanese because, unlike English, sentences are written as continuous strings of characters without spaces between words. As a result, decisions about word boundaries are all the more difficult, and lexical analysis plays an important preprocessing role in all Japanese natural language systems. Before text can be parsed, a lexical analyzer must segment the stream of input characters comprising each sentence.

Japanese lexical analyzers typically segment sentences in two major steps, first dividing each sentence into major phrases called bunsetsu, composed of a content word and accompanying function words, e.g. noun+particle, verb+endings, and then discovering the individual words within each phrase. Algorithms based on the longest match principle perform the bulk of the work (Kawada 1990). To extract a bunsetsu structure from an input string, the system first proposes the longest candidate that matches a dictionary entry and then checks whether the result agrees with the rules for bunsetsu composition. If the check fails, the system backtracks, proposes another shorter word, and checks the composition rules again. This process is repeated until the sentence is divided into the least number of bunsetsu consistent with its structure (Ishizaki et al. 1989).
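The control flow just described (propose the longest dictionary match, check it, backtrack to a shorter candidate when the check fails) can be pictured with a small sketch. The Python below is not the analyzer of Kawada or Ishizaki et al.; it assumes a toy romanized dictionary built from the transliterated example discussed below and a placeholder composition check that accepts any sequence.

    # Illustrative sketch only: a longest-match segmenter with backtracking.
    # The dictionary and the composition check are toy stand-ins, not the
    # lexicons or bunsetsu rules of the systems cited above.

    TOY_DICTIONARY = {"sisutemu", "ga", "bun", "o", "bunkai", "suru"}
    MAX_WORD_LEN = max(len(w) for w in TOY_DICTIONARY)

    def composition_ok(words):
        """Placeholder for bunsetsu composition rules; accepts any sequence."""
        return True

    def segment(text, words=()):
        """Return one segmentation of text as a list of dictionary words, or None.

        Proposes the longest matching dictionary entry first; if no complete
        segmentation results, backtracks and tries shorter candidates.
        """
        words = list(words)
        if not text:
            return words if composition_ok(words) else None
        for length in range(min(MAX_WORD_LEN, len(text)), 0, -1):
            candidate = text[:length]
            if candidate in TOY_DICTIONARY:
                result = segment(text[length:], words + [candidate])
                if result is not None:
                    return result        # success; otherwise shorten and retry
        return None                      # no candidate worked: backtrack

    print(segment("sisutemugabunobunkaisuru"))
    # -> ['sisutemu', 'ga', 'bun', 'o', 'bunkai', 'suru']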
Maruyama et al. (1988) describe a sentence analyzer that consists of five stages, each of which exploits a distinct kind of knowledge that is stored in the form of a set of rules or a table. The five stages are: segmentation by (1) character type, (2) character sequence, and (3) a longest matching algorithm; (4) a bottom-up parallel algorithm if stage 3 fails; and (5) compound-word composition. The first three lines in the transliterated example below illustrate what the procedure must accomplish:

    input:        sisutemugabunobunkaisuru.
    bunsetsu:     sisutemuga/buno/bunkaisuru.
    words:        sisutemu-ga/bun-o/bunkai-suru.
    meaning:      system-subj/sentence-obj/analyze-nonpast
    translation:  A/The system analyzes a/the sentence.

Two recent projects at BBN have used rule-based lexical analyzers to construct probabilistic models of Japanese segmentation and part-of-speech assignment. Matsukawa, Miller, & Weischedel (1993) based their work on JUMAN, developed at Kyoto University, which has a 40,000-word lexicon and tags with a success rate of about 93%. They used hand-corrected output from JUMAN to train an example-based algorithm to correct both segmentation errors and part-of-speech errors in JUMAN's output. POST, a stochastic tagger, then selects among ambiguous alternative segmentation and part-of-speech assignments and predicts the part of speech of unknown words. Papageorgiou (1994) trained a bigram hidden Markov model to segment Japanese text using the output of MAJESTY, a rule-based morphological preprocessor (Kitani & Mitamura 1993) that is reported to segment and tag Japanese text with better than 98% accuracy. Papageorgiou's method uses neither a lexicon of Japanese words nor explicit rules, basing its decisions instead solely on whether a two-character sequence is deemed more likely to continue a word or contain a word boundary. This approach was able to segment 90% of the words in test sentences correctly, compared to 91.7% for the JUMAN-based method.
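The per-pair decision just described can be illustrated schematically. The sketch below is not Papageorgiou's bigram hidden Markov model; it collapses the idea into an independent comparison, for each adjacent character pair, of a boundary probability against a continuation probability, with an invented probability table and fallback prior.

    # Illustrative sketch of a per-pair boundary decision: for each adjacent
    # two-character sequence, ask whether it is more likely to contain a word
    # boundary or to continue a word. The probability table and fallback value
    # are invented; a real model would be estimated from segmented training text.

    BOUNDARY_PROB = {        # hypothetical P(boundary | c1, c2)
        ("a", "b"): 0.9,
        ("b", "c"): 0.1,
    }
    FALLBACK_PROB = 0.3      # hypothetical prior for pairs not in the table

    def insert_boundaries(text):
        """Insert '/' wherever a character pair favors a boundary."""
        if not text:
            return ""
        out = [text[0]]
        for c1, c2 in zip(text, text[1:]):
            if BOUNDARY_PROB.get((c1, c2), FALLBACK_PROB) > 0.5:
                out.append("/")      # boundary judged more likely than continuation
            out.append(c2)
        return "".join(out)

    print(insert_boundaries("abc"))  # -> "a/bc" with the toy table above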
Characteristics of Japanese Text

Japanese text is composed of four different types of characters: kanji, characters borrowed more than a millennium ago from Chinese; two kana syllabaries, hiragana and katakana; and romaji, consisting of Roman alphabetic and Arabic numeral characters. The syllabaries contain equivalent sets of around 80 characters each. Hiragana is used for Japanese words and inflections, while katakana is used for words borrowed from foreign languages and for other special purposes. Lunde (1993:4) describes the distribution of character types as follows:

    Given an average sampling of Japanese writing, one normally finds 30 percent kanji, 60 percent hiragana, and 10 percent katakana. Actual percentages depend on the nature of the text. For example, you may find a higher percentage of kanji in technical literature, and a higher percentage of katakana in the literature of fields such as computer science, which make extensive use of loan words written in katakana.

The variable proportions of character types can easily be seen in a comparison of three different samples of Japanese text. The first corpus consists of a set of short newspaper articles on business ventures from Yomiuri. The second corpus contains a series of editorial columns from Asahi Shinbun (tenseijingo shasetsu, 1985-1991). Information on a third corpus was drawn from the description provided by Yokoyama (1989) of an online dictionary, Shin-Meikai Kokugo Jiten. Table 1 gives the size of each corpus in thousands of characters and shows the percentage of text written in each of the four character types. Punctuation and special symbols have been excluded from the counts.

                     bus.     ed.    dict.
    size (K chars)     42     275    2,508
    % hiragana       30.2    58.0     52.4
    % kanji          47.5    34.6     37.9
    % katakana       19.3     4.8      6.8
    % num/rom         2.9     2.6      2.9

    Table 1

Of particular note is the fact that the business corpus contains roughly half the amount of hiragana of the other two samples, both of which come close to Lunde's norm, and three to four times as much katakana. Table 2 lists the ten most frequent hiragana in the three corpora expressed as a percentage of total hiragana.

The Segmentation Algorithm
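As a preview of the approach summarized in the abstract, the sketch below places a boundary wherever the character type changes and, within runs of hiragana, wherever an observed bigram frequency falls below a threshold. This is a reader's reconstruction, not the paper's actual decision rule: the character-type ranges, the frequency table, and the threshold are all hypothetical stand-ins.

    # Reader's sketch (not the paper's procedure): propose a boundary at every
    # character-type change, and inside hiragana runs split wherever the
    # hiragana bigram frequency falls below a threshold. The table and threshold
    # are toy values; the paper derives its bigram table from unannotated text.

    def char_type(ch):
        """Coarse character type from Unicode ranges: hiragana, katakana, kanji, other."""
        code = ord(ch)
        if 0x3040 <= code <= 0x309F:
            return "hiragana"
        if 0x30A0 <= code <= 0x30FF:
            return "katakana"
        if 0x4E00 <= code <= 0x9FFF:
            return "kanji"
        return "other"

    # Hypothetical relative frequencies of hiragana bigrams.
    HIRAGANA_BIGRAM_FREQ = {
        ("す", "る"): 0.010,   # frequent pair (toy value): kept together
        ("を", "ぶ"): 0.0001,  # infrequent pair (toy value): split apart
    }
    SPLIT_THRESHOLD = 0.001   # hypothetical cutoff

    def split_units(sentence):
        """Split a sentence into units using character type changes and bigram frequency."""
        if not sentence:
            return []
        units, current = [], sentence[0]
        for prev, ch in zip(sentence, sentence[1:]):
            boundary = char_type(prev) != char_type(ch)
            if not boundary and char_type(ch) == "hiragana":
                freq = HIRAGANA_BIGRAM_FREQ.get((prev, ch), 0.0)
                boundary = freq < SPLIT_THRESHOLD
            if boundary:
                units.append(current)
                current = ch
            else:
                current += ch
        units.append(current)
        return units

    print(split_units("システムが文を分解する"))
    # -> ['システム', 'が', '文', 'を', '分解', 'する'] with the toy values above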