Robust Segmentation of Japanese Text into a Lattice for Parsing

Gary Kacmarcik, Chris Brockett, Hisami Suzuki
Microsoft Research
One Microsoft Way, Redmond WA 98052, USA
{garykac,chrisbkt,hisamis}@microsoft.com

Abstract

We describe a segmentation component that utilizes minimal syntactic knowledge to produce a lattice of word candidates for a broad-coverage Japanese NL parser. The segmenter is a finite-state morphological analyzer and text normalizer designed to handle the orthographic variations characteristic of written Japanese, including alternate spellings, script variation, vowel extensions and word-internal parenthetical material. This architecture differs from conventional Japanese wordbreakers in that it does not attempt to simultaneously attack the problems of identifying segmentation candidates and choosing the most probable analysis. To minimize duplication of effort between components and to give the segmenter greater freedom to address segmentation issues, the task of choosing the best analysis is handled by the parser, which has access to a much richer set of linguistic information. By maximizing recall in the segmenter and allowing a precision of 34.7%, our parser currently achieves a breaking accuracy of ~97% over a wide variety of corpora.

Introduction

The task of segmenting Japanese text into word units (or other units such as bunsetsu (≈ phrases)) has been discussed at great length in the Japanese NL literature ([Kurohashi98], [Fuchi98], [Nagata94], et al.). Japanese does not typically have spaces between words, which means that a parser must first have the input string broken into usable units before it can analyze a sentence. Moreover, a variety of issues complicate this operation, most notably that potential word candidate records may overlap (causing ambiguities for the parser) or there may be gaps where no suitable record is found (causing a broken span).

These difficulties are commonly addressed using either heuristics or statistical methods to create a model for identifying the best (or n-best) sequence of records for a given input string. This is typically done using a connective-cost model ([Hisamitsu90]), which is either maintained laboriously by hand or trained on large corpora. Both of these approaches suffer from problems. Handcrafted heuristics may become a maintenance quagmire, and as [Kurohashi98] suggests in his discussion of the JUMAN segmenter, statistical models may become increasingly fragile as the system grows and may eventually reach a point where side effects rule out further improvements. The sparse-data problem commonly encountered in statistical methods is exacerbated in Japanese by widespread orthographic variation (see section 3).

Our system addresses these pitfalls by assigning completely separate roles to the segmenter and the parser, allowing each to delve deeper into the complexities inherent in its own task. Other NL systems ([Kitani93], [Kurohashi98]) have also separated the segmentation and parsing components. However, these dual-level systems are prone to duplication of effort, since many segmentation ambiguities cannot be resolved without invoking higher-level syntactic or semantic knowledge. Our system avoids this duplication by relaxing the requirement that the segmenter identify the best path (or even the n-best paths) through the lattice of possible records. The segmenter is responsible only for ensuring that a correct set of records is present in its output; it is the function of the parsing component to select the best analysis from this lattice. With this model, our system achieves roughly 97% recall/precision (see [Suzuki00] for more details).

1 System Overview

Figure 1 shows a simple block diagram of our Natural Language Understanding system for Japanese, the goal of which is to robustly produce syntactic and logical forms that allow automatic

extraction of semantic relationships (see [Richardson98]) and support other linguistic projects like information retrieval, NL interfaces and dialog systems, auto-summarization and machine translation.

    Orthography Lexicon --> Word Segmentation
                                   |
                                   v
    Syntax Lexicon --+--> Derivational Assembly
                     |             |
                     |             v
                     +--> Syntactic Analysis
                                   |
                                   v
                             Logical Form

Figure 1: Block diagram of Japanese NL system

The segmenter is the first level of processing. It is a finite-state morphological analyzer responsible for generating all possible word candidates into a word lattice. It has a custom lexicon (automatically derived from the main lexicon to ensure consistency) that is designed to facilitate the identification of orthographic variants.

Records representing words are handed off by the segmenter to the derivational assembly component, which uses syntax-like rules to generate additional derived forms that are then used by the parser to create syntax trees and logical forms. Many of the techniques here are similar to those we use in our Chinese NL system (see [Wu98] for more details).

The parser (described extensively in [Jensen93]) generates syntactic representations and logical forms. It is a bottom-up chart parser with binary rules within the Augmented Phrase Structure Grammar formalism. The grammar rules are language-specific, while the core engine is shared among 7 languages (Chinese, Japanese, Korean, English, French, German, Spanish). The Japanese parser is described in [Suzuki00].

2 Recall vs. Precision

In this architecture, data is fed forward from one component to the next; hence, it is crucial that the base components (like the segmenter) generate a minimal number of omission errors.

Since segmentation errors may affect subsequent components, it is convenient to divide these errors into two types: recoverable and non-recoverable. A non-recoverable error is one that prevents the syntax (or any downstream) component from arriving at a correct analysis (e.g., a missing record). A recoverable error is one that does not interfere with the operation of following components. An example of the latter is the inclusion of an extra record. An extra record does not (theoretically) prevent the parser from doing its job (although in practice it may, since it consumes resources).

Using standard definitions of recall (R) and precision (P):

    R = Seg_correct / Tag_total        P = Seg_correct / Seg_total

where Seg_correct and Seg_total are the number of "correct" and the total number of segments returned by the segmenter, and Tag_total is the total number of "correct" segments from a tagged corpus, we can see that recall measures non-recoverable errors and precision measures recoverable errors. Since our goal is to create a robust NL system, it behooves us to maximize recall (i.e., make very few non-recoverable errors) in open text while keeping precision high enough that the extra records (recoverable errors) do not interfere with the parsing component.

Achieving near-100% recall might initially seem to be a relatively straightforward task given a sufficiently large lexicon: simply return every possible record that is found in the input string. In practice, the mixture of scripts and the flexible orthography rules of Japanese (in addition to the inevitable non-lexicalized words) make the task of identifying potential lexical boundaries an interesting problem in its own right.

3 Japanese Orthographic Variation

Over the centuries, Japanese has evolved a complex writing system that gives the writer a great deal of flexibility when composing text. Four scripts are in common use (kanji, hiragana, katakana and roman), and they can co-occur within lexical entries (as shown in Table 1).

    Kanji-Hiragana: 新しい [atarashii = "new"]; ほ乳類 [honyuurui = "mammal"]
    Kanji-Katakana: 歯ブラシ [haburashi = "toothbrush"]; ヒ素 [hiso = "arsenic"]
    Kanji-Alpha:    CGS単位 [CGS tan'i = "CGS system"]
    Kanji-Symbol:   12月 [juunigatsu = "December"]; γ線 [ganma sen = "gamma rays"]
    Kana-Alpha:     メッセンジャーRNA [messenjaa RNA = "messenger RNA"]
    Kana-Symbol:    ストロンチウム90 [sutoronchiumu 90 = "strontium 90"]; ほえる40° [hoeru yonjuu do = "roaring forties"]
    Other mixed:    おトイレ [otoire = "toilet"]; ダブる [daburu = "to double"]; IDカード [aidii kaado = "ID card"]; 消しゴム [keshigomu = "eraser"]; αケンタウリ星 [arufa kentauri sei = "Alpha Centauri"]; ト書き [togaki = "stage directions"]

Table 1: Mixed-script lexical entries

    Repeated characters:           時々刻々 → 時時刻刻 [jijikokkoku = "every moment"]; 甲斐々々しい → 甲斐甲斐しい [kaigaishii = "diligent"]
    Distribution of voicing marks: ヒ゛テ゛オ → ビデオ [bideo = "video"]
    Halfwidth and fullwidth:       ＦＭ放送 → FM放送 [FM housou = "FM broadcast"]; ﾀﾞｲﾔｸﾞﾗﾑ → ダイヤグラム [daiyaguramu = "diagram"]
    Composite symbols:             ㌫ → パーセント [paasento = "percent"]; ㈱ → 株式会社 [kabushiki gaisha = "incorporated"]; ㏧ → ２８日 [nijuuhachi nichi = "28th day of the month"]

Table 2: Character type normalizations

Some mixed-script entries could be handled as syntactic compounds; for example, IDカード [aidii kaado = "ID card"] could be derived from ID(NOUN) + カード(NOUN). However, many such items are preferably treated as lexical entries because they have non-compositional syntactic or semantic attributes.

In addition, many Japanese verbs and adjectives (and words derived from them) have a variety of accepted spellings associated with okurigana, optional kana characters representing inflectional endings. For example, the present tense of 切り落とす [kiriotosu = "to prune"] can be written as any of 切落す, 切り落す, 切落とす, 切りおとす, きりおとす, or even きり落とす, きり落す.

Matters become even more complex when one script is substituted for another at the word or sub-word level. This can occur for a variety of reasons: to replace a rare or difficult kanji (ら致 [rachi = "kidnap"] instead of 拉致); to highlight a word in a sentence (ヘンなかっこう [henna kakkou = "strange appearance"]); or to indicate a particular, often technical, sense (ワタって [watatte = "crossing over"] instead of 渡って, to emphasize the domain-specific sense of "connecting groups" in Go literature).

More colloquial writing allows for a variety of contracted forms, like オレ達ゃ ← オレ達 + は [ore-tacha ← ore-tachi + wa = "we" + TOPIC], and phonological mutations, as in でぇーす ← です [deesu ← desu = "is"].

This is only a sampling of the orthographic issues present in Japanese. Many of these variations pose serious sparse-data problems, and lexicalization of all variants is clearly out of the question.

4 Segmenter Design

Given the broad long-term goals for the overall system, we address the issues of recall/precision and orthographic variation by narrowly defining the responsibilities of the segmenter as:

(1) Maximize recall
(2) Normalize word variants

4.1 Maximize Recall

Maximal recall is imperative. Any recall mistake made in the segmenter prevents the parser from reaching a successful analysis. Since the parser in our NL system is designed to handle ambiguous input in the form of a word lattice of potentially overlapping records, we can accept lower precision if that is what is necessary to achieve high recall.

Conversely, high precision is specifically not a goal for the segmenter. While desirable, high precision may be at odds with the primary goal of maximizing recall. Note that the lower bound for precision is constrained by the lexicon.

4.2 Normalize Word Variants

Given the extensive amount of orthographic variability present in Japanese, some form of normalization into a canonical form is a prerequisite for any higher-order linguistic processing. The segmenter performs two basic kinds of normalization: lemmatization of inflected forms and orthographic normalization.
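Most of the character type normalizations in Table 2 correspond closely to Unicode compatibility (NFKC) folding. As a minimal sketch of this class of normalization (not the system's actual implementation), Python's unicodedata module handles the fullwidth/halfwidth, voicing-mark and composite-symbol cases; note that NFKC only approximates Table 2 (for instance, it expands ㈱ to (株) rather than to 株式会社, and iteration marks like 々 have no Unicode folding at all):

```python
import unicodedata

def normalize_char_types(text: str) -> str:
    """Fold character-type variants into one consistent form (cf. Table 2):
    fullwidth roman -> ASCII, halfwidth katakana -> fullwidth,
    combining voicing marks composed, composite symbols expanded."""
    return unicodedata.normalize("NFKC", text)

print(normalize_char_types("ＦＭ放送"))           # FM放送 (fullwidth -> ASCII)
print(normalize_char_types("ﾀﾞｲﾔｸﾞﾗﾑ"))          # ダイヤグラム (halfwidth katakana)
print(normalize_char_types("ヒ\u3099テ\u3099オ"))  # ビデオ (voicing marks composed)
print(normalize_char_types("㌫"))                # パーセント (composite symbol)
```

Cases with no Unicode equivalent, such as the repeated-character (々) expansions, would still require lexicon support.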

4.2.1 Lemmatization

Lemmatization in Japanese is the same as that for any language with inflected forms: a lemma, or dictionary form, is returned along with the inflection attributes. So a form like 食べた [tabeta = "ate"] would return a lemma of 食べる [taberu = "eat"] along with a PAST attribute.

Contracted forms are expanded and lemmatized individually, so that 食べてっちゃった [tabe-tecchatta = "has eaten and gone"] is returned as: 食べる + いく(GERUND) + しまう(PAST) [taberu = "eat" + iku = "go" + shimau = ASPECT].

4.2.2 Orthographic Normalization

Orthographic normalization smooths out orthographic variations so that words are returned in a standardized form. This facilitates lexical lookup and allows the system to map the variant representations to a single lexicon entry. We distinguish two classes of orthographic normalization: character type normalization and script normalization.

Character type normalization takes the various representations allowed by the Unicode specification and converts them into a single consistent form. Table 2 summarizes this class of normalization.

Script normalization rewrites the word so that it conforms to the script and spelling used in the lexical entry. Examples are given in Table 3. Two cases of special interest are okurigana and inline yomi/kanji normalizations. The okurigana normalization expands shortened forms into fully specified forms (i.e., forms with all optional characters present). The yomi/kanji handling takes infixed parenthetical material and normalizes it out (after using the parenthetical information to verify segmentation accuracy).

    Okurigana:           詫状 → 詫び状 [wabijou = "apology letter"]; 吹ぬき → 吹き抜き [fukinuki = "drafty"] / 吹き貫き [fukinuki = "a well-hole"]
    Non-standard script: 女のコ → 女の子 [onnanoko = "girl"]; でいすこ → ディスコ [disuko = "disco"]
    ヶ/ヵ variants:       一か月 → 一ヶ月 [ikkagetsu = "one month"]; 霞が関 → 霞ヶ関 [kasumigaseki = "Kasumigaseki"]
    Numerals for kanji:  ５輪 → 五輪 [gorin = "Olympics"]; １人 → 一人 [hitori = "one person"]
    Vowel extensions:    おにーさぁん → おにいさん [oniisan = "older brother"]; ファイトォ → ファイト [faito = "fight"]
    Katakana variants:   ヴァイオリン → バイオリン [baiorin = "violin"]; アラモード → ア・ラ・モード [aramoode = "à la mode"]
    Inline yomi:         映(は)える → 映える [haeru = "to shine"]; 爬虫(はちゅう)類 → 爬虫類 [hachuurui = "reptile"]
    Inline kanji:        げっ(齧)歯類 → 齧歯類 [gesshirui = "rodent"]

Table 3: Script normalizations

5 Lexicon Structures

Several special lexicon structures were developed to support these features. The most significant is an orthography lattice* that concisely encapsulates all orthographic variants for each lexicon entry and implicitly specifies the normalized form. This has the advantage of compactness and facilitates lexicon maintenance, since lexicographic information is stored in one location.

The orthography lattice stores kana information about each kanji or group of kanji in a word. For example, the lattice for 食べる [taberu = "eat"] is [食:た]べる, because the first character (ta) can be written as either kanji 食 or kana た. A richer lattice is needed for entries with okurigana variants, like 切り落とす [kiriotosu = "to prune"] cited earlier: commas separate each okurigana grouping. The lattice for kiriotosu is [切:き,り][落:お,と]す. Table 4 contains more lattice examples.

    たべる       [食:た]べる               [taberu = "to eat"]
    かみあわせる  [噛:か,み][合:あ,わ]せる   [kamiawaseru = "to engage (gears)"]
    みつもり     [見:み][積:つ,も]り        [mitsumori = "an estimate"]
    IDカード     [I:アイ][D:ディー]カード    [aidii kaado = "ID card"]
    かぎタバコ    [嗅:か,ぎ][煙草:タバコ]    [kagi tabako = "snuff tobacco"]
    ながい       [長:なが][!居:い]          [nagai = "a long visit"]

Table 4: Orthography lattices

Enabling all possible variants can proliferate records and confuse the analyzer (see [Kurohashi94]).

* Not to be confused with the word lattice, which is the set of records passed from the segmenter to the parser.
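To make the lattice notation concrete, the following sketch expands a lattice string into candidate surface forms. It assumes a simplified semantics for the notation (it is not the actual lexicon code): each [kanji:reading] group may surface as the kanji alone, the kanji with its okurigana written out, or entirely in kana.

```python
import itertools
import re

# Matches either a [kanji:reading] group or a run of fixed kana (e.g. the
# final す in [切:き,り][落:お,と]す). In the reading, the part after the
# comma is the optional okurigana.
SEG = re.compile(r"\[(?P<kanji>[^:\]]+):(?P<kana>[^\]]+)\]|(?P<lit>[^\[\]]+)")

def variants(lattice: str):
    """Yield candidate surface forms encoded by an orthography lattice."""
    choices = []
    for m in SEG.finditer(lattice):
        if m.group("lit"):                      # fixed kana segment
            choices.append([m.group("lit")])
        else:
            kanji = m.group("kanji")
            parts = m.group("kana").split(",")  # e.g. ["き", "り"]
            stem, okurigana = parts[0], "".join(parts[1:])
            choices.append([
                kanji,              # 切   (okurigana omitted)
                kanji + okurigana,  # 切り (okurigana written out)
                stem + okurigana,   # きり (all kana)
            ])
    for combo in itertools.product(*choices):
        yield "".join(combo)

print(sorted(set(variants("[切:き,り][落:お,と]す"))))
```

This deliberately over-generates (e.g., it also produces 切おとす); the real system relies on suppression markers, like the "!" in Table 4, and pruning heuristics to block pathological variants.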
We therefore suppress pathological variants that cause confusion with more common words and constructions. For example, 長居 [nagai = "a long visit"] never occurs as 長い, since this would be ambiguous with the highly frequent 長い [nagai = "long"]. Likewise, a word like 日本 [nihon = "Japan"] is constrained to inhibit invalid variants like に本, which cause confusion with に(POSP) + 本(NOUN) [ni = PARTICLE + hon = "book"].

We default to enabling all possible variants for each entry and disable variants only where required. This saves us from having to update the lexicon whenever we encounter a novel orthographic variant, since the lattice anticipates all possible variants.

6 Unknown Words

Unknown words pose a significant recall problem in languages that don't place spaces between words. The inability to identify a word in the input stream of characters can cause neighboring words to be misidentified. We have divided this problem space into six categories: variants of lexical entries (e.g., okurigana variations, vowel extensions, et al.); non-lexicalized proper nouns; derived forms; foreign loanwords; mimetics; and typographical errors. This allows us to devise focused heuristics to attack each class of unfound words.

The first category, variants of lexical entries, has been addressed through the script normalizations discussed earlier.

Non-lexicalized proper nouns and derived words, which account for the vast majority of unfound words, are handled in the derivational assembly component. This is where compounds like フランス語 [furansugo = "French (language)"] are assembled from their base components フランス [furansu = "France"] and 語 [go = "language"].

Unknown foreign loanwords are identified by a simple maximal-katakana heuristic that returns the longest run of katakana characters. Despite its simplicity, this algorithm appears to work quite reliably when used in conjunction with the other mechanisms in our system.

Mimetic words in Japanese tend to follow simple ABAB or ABCABC patterns in hiragana or katakana, so we look for these patterns and propose them as adverb records.

The last category, typographical errors, remains mostly the subject of future work. Currently, we only address basic 二 (kanji) ↔ ニ (katakana) and へ (hiragana) ↔ ヘ (katakana) substitutions.

7 Evaluation

Our goal is to improve parser coverage by improving recall in the segmenter. Evaluation of this component is appropriately conducted in the context of its impact on the entire system.

7.1 Parser Evaluation

Running on top of our segmenter, our current parsing system reports ~71% coverage (i.e., input strings for which a complete and acceptable sentential parse is obtained) and ~97% POS-labeled breaking accuracy, tested on a 5,000 sentence blind, balanced corpus. A full description of these results is given in [Suzuki00].

7.2 Segmenter Evaluation

Three criteria are relevant to segmenter performance: recall, precision and speed.

7.2.1 Recall

Analysis of a randomly chosen set of tagged sentences gives a recall of 99.9%. This result is not surprising, since maximizing recall was a primary focus of our efforts. The breakdown of the recall errors is as follows: missing proper nouns = 47%, missing nouns = 15%, missing verbs/adjectives = 15%, orthographic idiosyncrasies = 15%, archaic inflections = 8%.

It is worth noting that for derived forms (those that are handled in the derivational assembly component), the segmenter is considered correct as long as it produces the necessary base records needed to build the derived form.

7.2.2 Precision

Since we focused our efforts on maximizing recall, a valid concern is the impact of the extra records on the parser, that is, the effect of lower segmenter precision on the system as a whole.

Figure 2 shows the baseline segmenter precision plotted against sentence length, using the 3888 tagged sentences (the tagging was obtained by using the results of the parser on untagged sentences). For comparison, data for Chinese (3982 sentences tagged in a similar fashion using our Chinese NLP system) is included. These are baseline values in the sense that they represent the number of records looked up in the lexicon without application of any heuristics to suppress invalid records. Thus, these numbers represent worst-case segmenter precision.

Figure 2: Worst-case segmenter precision (y-axis) versus sentence length (x-axis, in characters), for Japanese and Chinese

The baseline precision for the Japanese segmenter averages 24.8%, which means that a parser would need to discard 3 records for each record it used in the final parse. This value stays fairly constant as the sentence length increases. The baseline precision for Chinese averages 37.1%. The disparity between the Japanese and Chinese worst-case scenarios is believed to reflect the greater ambiguity inherent in the Japanese writing system, owing to orthographic variation and the use of a syllabic script.

Using conservative pruning heuristics, we are able to bring the precision up to 34.7% without affecting parser recall. Primarily, these heuristics work by suppressing the hiragana form of short, ambiguous words (like き [ki = "tree, air, spirit, season, record, yellow, ..."], which is normally written using kanji to identify the intended sense).

7.2.3 Speed

Another concern with lower precision values has to do with performance measured in terms of speed. Figure 3 summarizes characters-per-second performance of the segmentation component and of our NL system as a whole (including the segmentation component). As expected, the system takes more time for longer sentences. Crucially, however, the system slowdown is shown to be roughly linear.

Figure 3: Characters/second (y-axis) vs. sentence length (x-axis) for the segmenter alone (upper curve) and our NL system as a whole (lower curve)

Figure 4 shows how much time is spent in each component during sentence analysis. As the sentence length increases, lexical lookup, derivational morphology and "other" stay approximately constant, while the percentage of time spent in the parsing component increases.

Figure 4: Percentage of time spent in each component (y-axis) vs. sentence length (x-axis)

Table 5 compares parse time performance for tagged and untagged sentences. This table quantifies the potential speed improvement that the parser could realize if segmenter precision were improved. Column A provides baseline lexical lookup and parsing times based on untagged input. (Note that segmenter time is not given in this table because it would not be comparable to the hypothetical segmenters devised for columns B and C.)
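The recall and precision percentages reported in this section follow the definitions given in section 2. As an illustrative sketch (with records simplified to (start, end) character spans, which is an assumption, not the system's record format), they can be computed from a segmenter's proposed spans and a tagged corpus as:

```python
def segmentation_metrics(proposed, gold):
    """Recall and precision as defined in section 2.

    proposed: set of (start, end) spans returned by the segmenter (Seg_total)
    gold:     set of (start, end) spans from the tagged corpus (Tag_total)
    A proposed span counts as "correct" if it also appears in the gold set.
    """
    correct = len(proposed & gold)                            # Seg_correct
    recall = correct / len(gold) if gold else 1.0             # Seg_correct / Tag_total
    precision = correct / len(proposed) if proposed else 1.0  # Seg_correct / Seg_total
    return recall, precision

# An over-generating segmenter: all 4 gold spans are present (recall 1.0,
# no non-recoverable errors), plus 4 extra overlapping candidates that the
# parser must discard (precision 0.5, recoverable errors only).
gold = {(0, 2), (2, 3), (3, 5), (5, 6)}
proposed = gold | {(0, 3), (2, 5), (3, 6), (0, 1)}
print(segmentation_metrics(proposed, gold))  # (1.0, 0.5)
```

This mirrors the paper's trade-off: extra spans lower precision but are recoverable, while a missing gold span lowers recall and is not.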

Columns B and C give timings based on a (hypothetical) segmenter that correctly identifies all word boundaries (B) and one that identifies all word boundaries and POS (C); for these hypothetical segmenters, our segmenter was modified to return only the records consistent with a tagged input set. C represents the best-case parser performance, since it assumes perfect precision and recall in the segmenter. The bottom portion of Table 5 restates these improvements as percentages.

                          A          B          C
    Lexical processing    7.661 s    2.510 s    2.324 s
    Parsing               13.480 s   8.865 s    7.179 s
    Other                 4.195 s    3.620 s    3.519 s
    Total                 25.336 s   14.995 s   13.022 s

    Percent improvement
    Overall               -          40.82%     48.60%
    Lexical               -          67.24%     69.66%
    Parsing               -          34.24%     46.74%
    Other                 -          13.7%      16.1%

Table 5: Summary of performance (speed) experiment, where untagged input (A) is compared with space-broken input (B) and space-broken input with POS tags (C)

This table suggests that adding conservative pruning to enhance segmenter precision may improve overall system performance. It also provides a metric for evaluating the impact of heuristic rule candidates: the parse-time improvement from a rule candidate can be weighed against the cost of implementing the additional code, to determine the overall benefit to the entire system.

8 Future Work

Planned near-term enhancements include adding context-sensitive heuristic rules to the segmenter as appropriate. In addition to the speed gains quantified in Table 5, these heuristics can also be expected to improve parser coverage by reducing resource requirements.

Other areas for improvement are unfound-word models, particularly typographical error detection, and addressing the issue of probabilities as they apply to orthographic variants. Additionally, we are experimenting with various lexicon formats to more efficiently support Japanese.

9 Conclusion

The complexities involved in segmenting Japanese text make it beneficial to treat this task independently from parsing. These separate tasks are each simplified, facilitating the processing of a wider range of phenomena specific to their respective domains. The gains in robustness greatly outweigh the impact on parser performance caused by the additional records. Our parsing results demonstrate that this compartmentalized approach works well, with overall parse times increasing linearly with sentence length.

10 References

[Fuchi98] Fuchi, T., Takagi, S., "Japanese Morphological Analyzer using Word Co-occurrence", ACL/COLING 98, pp. 409-413, 1998.

[Hisamitsu90] Hisamitsu, T., Nitta, Y., "Morphological Analysis by Minimum Connective-Cost Method", SIGNLC 90-8, IEICE, pp. 17-24, 1990 (in Japanese).

[Jensen93] Jensen, K., Heidorn, G., Richardson, S. (eds.), "Natural Language Processing: The PLNLP Approach", Kluwer, Boston, 1993.

[Kitani93] Kitani, T., Mitamura, T., "A Japanese Preprocessor for Syntactic and Semantic Parsing", 9th Conference on AI in Applications, pp. 86-92, 1993.

[Kurohashi94] Kurohashi, S., Nakamura, T., Matsumoto, Y., Nagao, M., "Improvements of Japanese Morphological Analyzer JUMAN", SNLR, pp. 22-28, 1994.

[Kurohashi98] Kurohashi, S., Nagao, M., "Building a Japanese Parsed Corpus while Improving the Parsing System", First LREC Proceedings, pp. 719-724, 1998.

[Nagata94] Nagata, M., "A Stochastic Japanese Morphological Analyzer Using a Forward-DP Backward-A* N-Best Search Algorithm", COLING, pp. 201-207, 1994.

[Richardson98] Richardson, S.D., Dolan, W.B., Vanderwende, L., "MindNet: Acquiring and Structuring Semantic Information from Text", COLING/ACL 98, pp. 1098-1102, 1998.

[Suzuki00] Suzuki, H., Brockett, C., Kacmarcik, G., "Using a Broad-Coverage Parser for Word-Breaking in Japanese", COLING 2000.

[Wu98] Wu, A., Jiang, Z., "Word Segmentation in Sentence Analysis", Microsoft Technical Report MSR-TR-99-10, 1999.