Robust Segmentation of Japanese Text Into a Lattice for Parsing
Gary Kacmarcik, Chris Brockett, Hisami Suzuki
Microsoft Research
One Microsoft Way, Redmond WA 98052 USA
{garykac,chrisbkt,hisamis}@microsoft.com

Abstract

We describe a segmentation component that utilizes minimal syntactic knowledge to produce a lattice of word candidates for a broad-coverage Japanese NL parser. The segmenter is a finite-state morphological analyzer and text normalizer designed to handle the orthographic variations characteristic of written Japanese, including alternate spellings, script variation, vowel extensions and word-internal parenthetical material. This architecture differs from conventional Japanese wordbreakers in that it does not attempt to simultaneously attack the problems of identifying segmentation candidates and choosing the most probable analysis. To minimize duplication of effort between components and to give the segmenter greater freedom to address orthography issues, the task of choosing the best analysis is handled by the parser, which has access to a much richer set of linguistic information. By maximizing recall in the segmenter and allowing a precision of 34.7%, our parser currently achieves a breaking accuracy of ~97% over a wide variety of corpora.

Introduction

The task of segmenting Japanese text into word units (or other units such as bunsetsu (文節) phrases) has been discussed at great length in the Japanese NL literature ([Kurohashi98], [Fuchi98], [Nagata94], et al.). Japanese does not typically have spaces between words, which means that a parser must first have the input string broken into usable units before it can analyze a sentence. Moreover, a variety of issues complicate this operation, most notably that potential word candidate records may overlap (causing ambiguities for the parser) or there may be gaps where no suitable record is found (causing a broken span).

These difficulties are commonly addressed using either heuristics or statistical methods to create a model for identifying the best (or n-best) sequence of records for a given input string. This is typically done using a connective-cost model ([Hisamitsu90]), which is either maintained laboriously by hand or trained on large corpora.

Both of these approaches suffer from problems. Handcrafted heuristics may become a maintenance quagmire, and, as [Kurohashi98] suggests in his discussion of the JUMAN segmenter, statistical models may become increasingly fragile as the system grows, eventually reaching a point where side effects rule out further improvements. The sparse-data problem commonly encountered in statistical methods is exacerbated in Japanese by widespread orthographic variation (see §3).

Our system addresses these pitfalls by assigning completely separate roles to the segmenter and the parser, allowing each to delve deeper into the complexities inherent in its task.

Other NL systems ([Kitani93], [Kurohashi98]) have separated the segmentation and parsing components. However, these dual-level systems are prone to duplication of effort, since many segmentation ambiguities cannot be resolved without invoking higher-level syntactic or semantic knowledge. Our system avoids this duplication by relaxing the requirement that the segmenter identify the best path (or even n-best paths) through the lattice of possible records. The segmenter is responsible only for ensuring that a correct set of records is present in its output; it is the function of the parsing component to select the best analysis from this lattice. With this model, our system achieves roughly 97% recall/precision (see [Suzuki00] for more details).
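Concretely, the lattice can be pictured as a set of character-span records over the input string. The following minimal sketch (the record type and gap check are invented here for illustration, not taken from the system) shows how overlapping candidates and broken spans appear in such a representation:

```python
# Minimal sketch of a word-candidate lattice: each record spans a range of
# character positions in the input; records may overlap (an ambiguity left
# for the parser to resolve) or leave characters uncovered (a broken span).
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    start: int    # index of the first character covered
    end: int      # index one past the last character covered
    surface: str  # the candidate word as written in the text

def broken_spans(records: list, length: int) -> list:
    """Return (start, end) ranges of the input that no record covers."""
    covered = set()
    for r in records:
        covered.update(range(r.start, r.end))
    spans = []
    for i in range(length):
        if i in covered:
            continue
        if spans and spans[-1][1] == i:
            spans[-1] = (spans[-1][0], i + 1)  # extend the current gap
        else:
            spans.append((i, i + 1))           # open a new gap
    return spans
```

In this representation, two records with overlapping (start, end) ranges encode exactly the segmentation ambiguity that is deferred to the parser, while any range reported by the gap check corresponds to a broken span.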
1 System Overview

Figure 1 shows a simple block diagram of our Natural Language Understanding system for Japanese, the goal of which is to robustly produce syntactic and logical forms that allow automatic extraction of semantic relationships (see [Richardson98]) and support other linguistic projects like information retrieval, NL interfaces and dialog systems, auto-summarization and machine translation.

Figure 1: Block diagram of Japanese NL system (Word Segmentation, fed by an Orthography Lexicon; Derivational Assembly, fed by a Syntax Lexicon; Syntactic Analysis; Logical Form)

The segmenter is the first level of processing. This is a finite-state morphological analyzer responsible for generating all possible word candidates into a word lattice. It has a custom lexicon (automatically derived from the main lexicon to ensure consistency) that is designed to facilitate the identification of orthographic variants.

Records representing words and morphemes are handed off by the segmenter to the derivational assembly component, which uses syntax-like rules to generate additional derived forms that are then used by the parser to create syntax trees and logical forms. Many of the techniques here are similar to those we use in our Chinese NL system (see [Wu98] for more details).

The parser (described extensively in [Jensen93]) generates syntactic representations and logical forms. This is a bottom-up chart parser with binary rules within the Augmented Phrase Structure Grammar formalism. The grammar rules are language-specific while the core engine is shared among seven languages (Chinese, Japanese, Korean, English, French, German, Spanish). The Japanese parser is described in [Suzuki00].

2 Recall vs. Precision

In this architecture, data is fed forward from one component to the next; hence, it is crucial that the base components (like the segmenter) generate a minimal number of omission errors.

Since segmentation errors may affect subsequent components, it is convenient to divide these errors into two types: recoverable and non-recoverable. A non-recoverable error is one that prevents the syntax (or any downstream) component from arriving at a correct analysis (e.g., a missing record). A recoverable error is one that does not interfere with the operation of following components. An example of the latter is the inclusion of an extra record. This extra record does not (theoretically) prevent the parser from doing its job (although in practice it may, since it consumes resources).

Using standard definitions of recall (R) and precision (P):

    R = Seg_correct / Tag_total        P = Seg_correct / Seg_total

where Seg_correct and Seg_total are the number of "correct" segments and the total number of segments returned by the segmenter, and Tag_total is the total number of "correct" segments from a tagged corpus, we can see that recall measures non-recoverable errors and precision measures recoverable errors. Since our goal is to create a robust NL system, it behooves us to maximize recall (i.e., make very few non-recoverable errors) in open text while keeping precision high enough that the extra records (recoverable errors) do not interfere with the parsing component.

Achieving near-100% recall might initially seem to be a relatively straightforward task given a sufficiently large lexicon – simply return every possible record that is found in the input string. In practice, the mixture of scripts and flexible orthography rules of Japanese (in addition to the inevitable non-lexicalized words) make the task of identifying potential lexical boundaries an interesting problem in its own right.
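The following sketch illustrates both points above: exhaustive candidate generation and the recall/precision definitions. The names are hypothetical, and the real segmenter is a finite-state analyzer over a derived lexicon rather than a substring loop:

```python
# Naive maximal-recall strategy: emit a record for every substring of the
# input that appears in the lexicon, then score the output against a tagged
# corpus using the recall/precision definitions from this section.

def all_candidates(text: str, lexicon: set) -> set:
    """Return (start, end, surface) for every lexicon match in `text`."""
    return {(i, j, text[i:j])
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
            if text[i:j] in lexicon}

def recall_precision(segments: set, tagged: set):
    """R = Seg_correct / Tag_total; P = Seg_correct / Seg_total."""
    correct = len(segments & tagged)
    return correct / len(tagged), correct / len(segments)
```

Exhaustive lookup guarantees that every lexicalized word in the input is present in the lattice, so recall failures can come only from non-lexicalized words and from orthographic variation, which is the subject of the next section; the cost is a flood of extra (recoverable) records that depresses precision.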
3 Japanese Orthographic Variation

Over the centuries, Japanese has evolved a complex writing system that gives the writer a great deal of flexibility when composing text. Four scripts are in common use (kanji, hiragana, katakana and roman), and they can co-occur within lexical entries (as shown in Table 1).

  Kanji-Hiragana   新しい [atarashii = "new"]; ほ乳類 [honyuurui = "mammal"]
  Kanji-Katakana   歯ブラシ [haburashi = "toothbrush"]; ヒ素 [hiso = "arsenic"]
  Kanji-Alpha      CGS単位 [CGS tan'i = "CGS system"]
  Kanji-Symbol     γ線 [ganma sen = "gamma rays"]
  Mixed kana       おトイレ [otoire = "toilet"]; ダブる [daburu = "to double"]
  Kana-Alpha       IDカード [aidii kaado = "ID card"]; メッセンジャーRNA [messenjaa RNA = "messenger RNA"]
  Kana-Symbol      ストロンチウム90 [sutoronchiumu 90 = "Strontium 90"]; 吠える40° [hoeru yonjuu do = "roaring forties"]
  Other mixed      消しゴム [keshigomu = "eraser"]; αケンタウリ星 [arufa kentauri sei = "Alpha Centauri"]; ト書き [togaki = "stage directions"]
Table 1: Mixed-script lexical entries

Some mixed-script entries could be handled as syntactic compounds; for example, IDカード [ai dii kaado = "ID card"] could be derived from ID[NOUN] + カード[NOUN]. However, many such items are preferably treated as lexical entries because they have non-compositional syntactic or semantic attributes.

In addition, many Japanese verbs and adjectives (and words derived from them) have a variety of accepted spellings associated with okurigana, optional characters representing inflectional endings. For example, the present tense of 切り落とす (kiriotosu = "to prune") can be written as any

4 Segmenter Design

Given the broad long-term goals for the overall system, we address the issues of recall/precision and orthographic variation by narrowly defining the responsibilities of the segmenter as:

(1) Maximize recall
(2) Normalize word variants

4.1 Maximize Recall

Maximal recall is imperative. Any recall mistake made in the segmenter prevents the

  repeat characters              時々刻々 → 時時刻刻 [jijikokkoku = "every moment"]; 甲斐々々しい → 甲斐甲斐しい [kaigaishii = "diligent"]
  distribution of voicing marks  ヒ゛テ゛オ → ビデオ [bideo = "video"]
  halfwidth & fullwidth          ﾀﾞｲﾔｸﾞﾗﾑ → ダイヤグラム [daiyaguramu = "diagram"]; １２月 → 12月 [juunigatsu = "December"]; ＦＭ放送 → FM放送 [FM housou = "FM broadcast"]
  composite symbols              ㌫ → パーセント [paasento = "percent"]; ㍿ → 株式会社 [kabushiki gaisha = "incorporated"]; ㉘ → ２８日 [nijuuhachi nichi = "28th day of the month"]
Table 2: Character type normalizations
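Several rows of Table 2 correspond to standard Unicode compatibility foldings, so a first approximation of character-type normalization can be sketched as follows. This is a simplification, not the system's implementation; in particular, the repeat-character rule is approximated with a regex that covers only kanji repeats:

```python
import re
import unicodedata

def normalize_char_types(text: str) -> str:
    """Approximate the Table 2 normalizations on a Japanese string."""
    # NFKC folds halfwidth/fullwidth variants and composite symbols,
    # e.g. ＦＭ放送 -> FM放送, ﾀﾞｲﾔｸﾞﾗﾑ -> ダイヤグラム, ㍿ -> 株式会社.
    # Voiced-mark distribution composes only when the combining marks
    # (U+3099/U+309A) are used; the spacing marks need extra handling.
    text = unicodedata.normalize("NFKC", text)
    # Expand repeat marks: a run of n 々 repeats the n preceding kanji,
    # e.g. 時々刻々 -> 時時刻刻 and 甲斐々々しい -> 甲斐甲斐しい.
    def expand(match: re.Match) -> str:
        base, marks = match.group(1), match.group(2)
        return base + base[-len(marks):]
    return re.sub(r"([\u4e00-\u9fff]+)(\u3005+)", expand, text)

assert normalize_char_types("ＦＭ放送") == "FM放送"
assert normalize_char_types("時々刻々") == "時時刻刻"
```

The remaining variation types mentioned in the abstract, such as vowel extensions and word-internal parenthetical material, depend on lexical knowledge and lie beyond a purely character-level pass like this one.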