Filling Knowledge Gaps in a Broad-Coverage Machine Translation System*

Kevin Knight, Ishwar Chander, Matthew Haines, Vasileios Hatzivassiloglou, Eduard Hovy, Masayo Iida, Steve K. Luk, Richard Whitney, Kenji Yamada
USC/Information Sciences Institute, 4676 Admiralty Way, Marina del Rey, CA 90292

*This work was supported in part by the Advanced Research Projects Agency (Order 8073, Contract MDA904-91-C-5224) and by the Department of Defense. Vasileios Hatzivassiloglou's address is Department of Computer Science, Columbia University, New York, NY 10027. All authors can be reached at lastname@isi.edu.

Abstract

Knowledge-based machine translation (KBMT) techniques yield high quality in domains with detailed semantic models, limited vocabulary, and controlled input grammar. Scaling up along these dimensions means acquiring large knowledge resources. It also means behaving reasonably when definitive knowledge is not yet available. This paper describes how we can fill various KBMT knowledge gaps, often using robust statistical techniques. We describe quantitative and qualitative results from JAPANGLOSS, a broad-coverage Japanese-English MT system.

1 Introduction

Knowledge-based machine translation (KBMT) techniques have led to high quality MT systems in circumscribed problem domains [Nirenburg et al., 1992] with limited vocabulary and controlled input grammar [Nyberg and Mitamura, 1992]. This high quality is delivered by algorithms and resources that permit some access to the meaning of texts. But can KBMT be scaled up to unrestricted newspaper articles? We believe it can, provided we address two additional questions:

1. In constructing a KBMT system, how can we acquire knowledge resources (lexical, grammatical, conceptual) on a large scale?

2. In applying a KBMT system, what do we do when definitive knowledge is missing?

There are many approaches to these questions. Our working hypotheses are that (1) a great deal of useful knowledge can be extracted from online dictionaries and text, and (2) statistical methods, properly integrated, can effectively fill knowledge gaps until better knowledge bases or linguistic theories arrive.

When definitive knowledge is missing in a KBMT system, we call this a knowledge gap. A knowledge gap may be an unknown word, a missing grammar rule, a missing piece of world knowledge, etc. A system can be designed to respond to knowledge gaps in any number of ways. It may signal an error. It may make a default or random decision. It may appeal to a human operator. It may give up and turn over processing to a simpler, more robust MT program. Our strategy is to use programs (sometimes statistical ones) that can operate on more readily available data and effectively address particular classes of KBMT knowledge gaps. This gives us robust throughput and better quality than that of default decisions. It also gives us a platform on which to build and test knowledge bases that will produce further improvements in quality.

Our research is driven in large part by the practical problems we encountered while constructing JAPANGLOSS, a Japanese-English newspaper MT system built at USC/ISI. JAPANGLOSS is a year-old effort within the PANGLOSS MT project [Nirenburg and Frederking, 1994; NMSU/CRL et al., 1995], and it participated in the most recent ARPA evaluation of MT quality [White and O'Connell, 1994].

2 System Overview

This section gives a brief tour of the JAPANGLOSS system. Later sections address questions of knowledge gaps on a module-by-module basis.

Our design philosophy has been to chart a middle course between "know-nothing" statistical MT [Brown et al., 1993] and what might be called "know-it-all" knowledge-based MT [Nirenburg et al., 1992]. Our goal has been to produce an MT system that "knows something" and has a place for each piece of new knowledge, whether it be lexical, grammatical, or conceptual. The system is always running, and as more knowledge is added, performance improves.
Figure 1 shows the modules and knowledge resources of our current translator. With the exceptions of JUMAN and PENMAN, all were constructed new for JAPANGLOSS. The system includes components typical of many KBMT systems: a syntactic parser driven by a feature-based augmented context-free grammar, a semantic analyzer that turns parse trees into candidate interlinguas, a semantic ranker that prunes away meaningless interpretations, and a flexible generator for rendering interlingua expressions into English.

Because written Japanese has no inter-word spacing, we added a word segmenter. We also added a pre-parsing (chunking) module that performs several tasks: (1) dictionary-based word chunking, (2) recognition of personal, corporate, and place names, (3) number/date recognition, and (4) phrase chunking. This last type of chunking inserts new strings into the input sentence, such as BEGIN-NP and END-NP. The syntactic grammar is written to take advantage of these barriers, prohibiting the combination of partial constituents across phrase boundaries. NP and S chunking allow us to process the very long sentences characteristic of newspaper text, and to reduce the number of syntactic analyses. We also employ a limited inference module between semantic analysis and ranking. One of the tasks of this module is to take topic-marked entities and insert them into particular semantic roles.

The processing modules are themselves independent of particular natural languages; they are driven by language-specific knowledge bases. The interlingua expressions are also language independent. In theory, it should be possible to translate interlingua to and from any natural language. In practice, our interlingua is sometimes close to the languages we are translating. Here is an example interlingua expression produced by our system:
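(The expression itself appeared as a figure; what follows is a hypothetical reconstruction with invented concept and role names, showing only the general shape, including the syntactic relation Q-MOD discussed below.)

```
(P1 / |plan|
    :AGENT (C1 / |company|
               :Q-MOD (N1 / |new|))
    :THEME (E1 / |establish|
               :TIME (F1 / |February|)))
```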

The source for this expression was a Japanese sentence whose literal translation is something like: as for new company, then is plan to establish in February. The tokens of the expression are drawn from our conceptual knowledge base (or ontology), called SENSUS. SENSUS is a knowledge-base skeleton of about 70,000 concepts and relations; see [Knight and Luk, 1994] for a description of how we created it from online resources.

Note that the interlingua expression above includes a syntactic relation Q-MOD ("quality-modification"). While our goal is to replace such relations with real semantic ones, we realize that semantic analysis of adjectives and nominal compounds is an extremely difficult problem, and so we are willing to pay a performance penalty in the near term.

In addition to SENSUS, we have large syntactic lexical resources for Japanese and English (roughly 100,000 stems for each), a large grammar of English inherited from the PENMAN generation system [Penman, 1989], a large semantic lexicon associating English words with any given SENSUS concept, and hand-built chunking and syntax rules for the analysis of Japanese. Sample rules can be found in [Knight et al., 1994].

We have not achieved high throughput in our semantic analyzer, primarily because we have not yet completed a large-scale Japanese semantic lexicon, i.e., mapped tens of thousands of common Japanese words onto conceptual expressions. For progress on our manual and automatic attacks on this problem, see [Okumura and Hovy, 1994] and [Knight and Luk, 1994]. So we have encountered our first knowledge gap: what to do when a sentence produces no semantic interpretations? Other gaps include incomplete dictionaries, incomplete grammar, incomplete target lexical specifications, and incomplete interlingual representations.

3 Incomplete Semantic Throughput

While semantic resources are under construction, we want to be able to test our other modules in an end-to-end setting, so we have built a shorter path from Japanese to English called the glossing path.

We made very minor changes to the semantic interpreter and rechristened it the glosser. Like the interpreter, the glosser performs a bottom-up tree walk of the syntactic parse, assigning feature structures to each node. But instead of semantic feature structures, it assigns glosses, which are potential English translations encoded as feature structures. A sample gloss looks like this:
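(The sample gloss likewise appeared as a figure; the hypothetical gloss below, with invented values, shows the general shape.)

```
((OP1 (*OR* "the new company" "new companies"))
 (OP2 (*OR* "plans" "is planning"))
 (OP3 "to establish")
 (OP4 (*OR* "in February" "at February")))
```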

In this structure, the OP1, OP2, etc. features represent sequentially ordered portions of the gloss. All syntactic and glossing ambiguities are maintained in disjunctive (*OR*) feature expressions.

We built three new knowledge resources for the glosser: a bilingual dictionary (to assign glosses to leaves), a gloss grammar (to assign glosses to higher constituents, keying off syntactic structure), and an English morphological dictionary.

The output of the glosser is a large English word lattice, stored as a state transition network. The typical network produced from a twenty-word input sentence has 100 states, 300 transitions, and billions of paths. Any path through the network is a potential English translation, but some are clearly more reasonable than others. Again we have a knowledge gap, because we do not know which path to choose.

Fortunately, we can draw on the experience of speech recognition and statistical NLP to fill this gap. We built a language model for English by estimating bigram and trigram probabilities from a large collection of 46 million words of Wall Street Journal material (available from the ACL Data Collection Initiative, CD-ROM 1). We smoothed these estimates according to class membership for proper names and numbers, and according to an extended version of the enhanced Good-Turing method [Church and Gale, 1991] for the remaining words. The resulting tables of conditional probabilities guide a statistical extractor, which applies a version of the N-best beam search algorithm [Chow and Schwartz, 1989] to identify an ordered set of "best" paths in the word lattice (i.e., the set of English sentences that are most likely according to our model). Due to heavy computational and memory requirements, we have not yet completed the trigram version of our model. But even when only bigrams are used, comparing the statistical extractor to random path extraction reveals the power of statistics to make reasonable decisions, e.g.:

(random extractor)
"planned economy ages is threadbare"

(bigram extractor)
"planned economy times are old"

The statistical model of English gives us much better glosses. A full description of our glossing system, including our use of feature unification, appears in [Hatzivassiloglou and Knight, 1995]. Interestingly, we originally built the statistical model to address knowledge gaps in generating English from interlingua, not glossing. We return to these gaps (and the statistical model) in Section 8.
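To make the extraction step concrete, here is a minimal sketch of best-path selection over such a lattice under a bigram model. This is not the system's code: it finds only the single best path with an exact dynamic program rather than the N-best beam search of [Chow and Schwartz, 1989], and the lattice, probabilities, and backoff are toy stand-ins.

```python
import math

# Toy word lattice: state -> [(word, next_state)]; state 0 is initial.
LATTICE = {
    0: [("planned", 1)],
    1: [("economy", 2)],
    2: [("ages", 3), ("times", 3)],
    3: [("is", 4), ("are", 4)],
    4: [("threadbare", 5), ("old", 5)],
}
FINAL = 5

def bigram_logprob(prev, word, bigrams, unigrams):
    """Smoothed log P(word | prev), with a crude unigram backoff."""
    if (prev, word) in bigrams:
        return math.log(bigrams[(prev, word)])
    return math.log(unigrams.get(word, 1e-6)) - 2.0  # backoff penalty

def best_path(lattice, final, bigrams, unigrams):
    """Exact 1-best search; the chart is keyed on (state, previous word)
    so that bigram context survives when paths merge at a state."""
    chart = {(0, "<s>"): (0.0, [])}
    agenda = [(0, "<s>")]
    while agenda:
        state, prev = agenda.pop()
        if state == final:
            continue
        score, words = chart[(state, prev)]
        for word, nxt in lattice.get(state, []):
            key = (nxt, word)
            s = score + bigram_logprob(prev, word, bigrams, unigrams)
            if key not in chart or s > chart[key][0]:
                chart[key] = (s, words + [word])
                agenda.append(key)
    return max(v for (st, _), v in chart.items() if st == final)[1]

# Toy probabilities standing in for the Wall Street Journal tables.
bigrams = {("economy", "times"): 0.01, ("economy", "ages"): 0.0001,
           ("times", "are"): 0.2, ("are", "old"): 0.05}
unigrams = {"planned": 1e-4, "economy": 1e-4, "times": 1e-3,
            "are": 1e-2, "old": 1e-3, "is": 1e-2}
print(" ".join(best_path(LATTICE, FINAL, bigrams, unigrams)))
# -> planned economy times are old
```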

4 Incomplete Dictionaries

Lexical incompleteness causes problems for both semantics and glossing. Numbers and dates are typical unknown words; we take a finite-state approach to recognizing and translating these. Another important class of unknown words comprises katakana loanwords and foreign proper names, which are represented in Japanese with approximate phonetic transliterations. For example, kurinton should be translated as Clinton, and suteppaa mootaa as stepper motor. The knowledge-based approach is to pack our dictionaries with every possible name and technical term, but this approach quickly leads to diminished returns. To plug the rest of the gap, we have again applied statistical techniques, this time to English spelling. Given a Japanese transliteration like kurinton, we seek an English-looking word likely to have been the source of the transliteration. "English-looking" is defined in terms of common four-letter sequences. To produce candidates, we use another statistical model of katakana transliteration, computed from an automatically aligned database of 3,000 katakana-English pairs. This model tells us that Japanese ku is sometimes a transliteration of English c (especially at the beginning of a word), sometimes of ck (especially at the end), sometimes of cu, etc., with associated probabilities. Another extractor program then delivers a reasonable transliteration, for example, preferring Clinton over Kurinton.
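To make the candidate-generation step concrete, here is a minimal sketch; the substring mappings and probabilities are invented stand-ins for the learned model, and english_score stands in for the four-letter-sequence model of English spelling:

```python
# Invented romanized-katakana -> English substring mappings, standing in
# for the model learned from the 3,000 aligned katakana-English pairs.
SUB = {
    "ku": [("c", 0.3), ("k", 0.25), ("ku", 0.25), ("cu", 0.1), ("ck", 0.1)],
    "ri": [("ri", 0.6), ("li", 0.4)],
    "n":  [("n", 1.0)],
    "to": [("to", 0.5), ("t", 0.5)],
}

def candidates(romaji):
    """Enumerate (English spelling, transliteration probability) pairs."""
    if not romaji:
        return [("", 1.0)]
    results = []
    for size in (2, 1):  # try two-letter chunks before one-letter chunks
        if len(romaji) < size:
            continue
        head, tail = romaji[:size], romaji[size:]
        for eng, p in SUB.get(head, []):
            results += [(eng + rest, p * q) for rest, q in candidates(tail)]
    return results

def best_spelling(romaji, english_score):
    """Weight transliteration probability by an 'English-lookingness'
    score (common four-letter-sequence statistics in the real system)."""
    return max(candidates(romaji), key=lambda wp: wp[1] * english_score(wp[0]))

# best_spelling("kurinton", fourgram_model) should prefer "clinton" over
# letter-faithful candidates like "kurinton".
```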
5 Incomplete Grammatical Analysis

Unlike spoken language, newspaper text is generally grammatical. However, it is frequently ungrammatical with respect to our knowledge resources at any given time. Newspaper sentences are very long, and every new text contains some unusual syntactic construction that we have not yet captured in our formal grammar. We also encounter problems with non-standard punctuation as well as wrong segmentation, part-of-speech tagging, and phrase chunking. For these reasons, we made an early decision to do all parsing, semantic interpretation, and glossing in a bottom-up fashion, dealing independently with sentence fragments when need be. But because Japanese and English word orders are so different, a fragmentary parse usually leads to a bad translation.

One method of overcoming these difficulties is to ensure a full parse with statistical context-free parsing. Low-probability rules like ADV → ADJ guarantee that a full parse will be returned [Yamron et al., 1994]. We avoided this technique because we wanted to keep our feature-based grammar, in order to have fine-tuned control over the structures we accept and assign. Inspired by [Lavie, 1994], we turned to word skipping as a grammatical gap-plugger. This approach seeks out the largest grammatical subset of words in a sentence, with the hope that the skipped words are peripheral to the core meaning of the sentence (or perhaps simply stray punctuation marks). Rather than port the LR-parsing-based techniques of [Lavie, 1994] to our bottom-up chart parser, we developed heuristic techniques for spotting suspicious words and dropping them. Our most general method is to automatically process large sets of parsed and unparsed sentences, looking for statistical differences between the two sets. For example, if a part-of-speech bigram appears frequently in unparsed sentences, this is a clue that one of the two words (or both) should be dropped. We also use grammar-specific heuristics, such as: don't drop a single noun from a noun sequence; drop phrasal boundaries only in pairs (and only when the internal material is completely parsed); etc. Skipping raises our full-parse rate from 40% to 90%.
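The POS-bigram method lends itself to a short sketch (counts, smoothing, and threshold are invented; the real system additionally applies the grammar-specific guards just described):

```python
from collections import Counter

def suspicion_table(parsed, unparsed):
    """Compare part-of-speech bigram frequencies in sentences that parsed
    against those that did not; each argument is a list of POS-tag
    sequences. Bigrams far more common among failures look suspicious."""
    good, bad = Counter(), Counter()
    for tags in parsed:
        good.update(zip(tags, tags[1:]))
    for tags in unparsed:
        bad.update(zip(tags, tags[1:]))
    return {bigram: (bad[bigram] + 1) / (good[bigram] + 1)  # smoothed odds
            for bigram in bad}

def words_to_skip(tags, table, threshold=5.0):
    """Propose positions to drop: members of suspicious POS bigrams.
    Grammar-specific guards (never drop a lone noun from a noun sequence,
    drop phrase boundaries only in pairs, etc.) would filter this list."""
    drop = set()
    for i in range(len(tags) - 1):
        if table.get((tags[i], tags[i + 1]), 0.0) > threshold:
            drop.update((i, i + 1))  # either word (or both) may be dropped
    return sorted(drop)
```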

6 Incomplete Semantic Constraints

This section and the next return to incompleteness in semantics. KBMT systems usually model world knowledge as a collection of "hard" constraints that any interlingua must satisfy. Unsatisfactory semantic candidates are pruned away, leaving a single sensible interpretation. In many cases, however, semantic constraints are too strong, and all interpretations are ruled out. Preference semantics [Wilks, 1975] was devised to handle just this problem. Within the KBMT tradition, role restrictions can be augmented with relaxable-to statements [Carlson and Nirenburg, 1990], as in: the agent of a say-event must be a person, or relaxable-to an organization.

Our approach has been to further soften the impact of semantic constraints. We assign a score (between 0 and 1) to any interlingua fragment, as follows. We first extract all stated relations between entities. To each relation, we assign scores based on domain constraints and range constraints in the conceptual knowledge base. There are five possible heuristic scores, depending on the suitability of the role filler: satisfies basic constraint (1.0); satisfies relaxed constraint, but is not mutually disjoint from concepts satisfying the basic constraint (0.8); satisfies relaxed constraint, but is mutually disjoint from the basic constraint (0.25); satisfies neither basic nor relaxed constraint (0.05); is mutually disjoint from concepts satisfying the basic or relaxed constraint (0.01). Scores for all of the relations are multiplied to yield an overall score. No interlingua expression is ever assigned a zero score.
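In code, the scheme is a small case analysis per relation, multiplied over the fragment. The sketch below assumes knowledge-base predicates (satisfies_basic, satisfies_relaxed, disjoint_from_basic, disjoint_from_relaxed) implemented against the domain and range constraints in SENSUS, and a fragment object that exposes its relation/filler pairs:

```python
def filler_score(relation, filler, kb):
    """Score one role filler against the domain/range constraints the
    conceptual knowledge base records for this relation."""
    if kb.satisfies_basic(relation, filler):
        return 1.0
    if kb.satisfies_relaxed(relation, filler):
        # 0.8 if the filler could still overlap the basic constraint,
        # 0.25 if it is provably disjoint from it
        return 0.25 if kb.disjoint_from_basic(relation, filler) else 0.8
    if kb.disjoint_from_relaxed(relation, filler):
        return 0.01  # disjoint even from the relaxed constraint
    return 0.05      # satisfies neither constraint, but not provably disjoint

def interlingua_score(fragment, kb):
    """Multiply per-relation scores; the result is never exactly zero."""
    score = 1.0
    for relation, filler in fragment.relations():
        score *= filler_score(relation, filler, kb)
    return score
```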

This approach is like language modeling, but here the basic unit is the relational/conceptual n-gram (e.g., ...).

Now there are four probability distributions to estimate, one of which is P(I), exactly the a priori probability of an interlingua expression described above. While the full statistical model of KBMT is not used in JAPANGLOSS, it is useful to view what we do (and don't do) in this light. For example, P(I | E) suggests a model of English generation that explicitly shies away from ambiguous E's, because they spread probabilities thinly across several I's. For example, if I is RENT-TO(SPEAKER, HOUSE), it is better to say I'm renting my house to someone rather than the correct but ambiguous I'm renting my house. Most language generation systems focus on accurate renditions rather than unambiguous ones, although as Pereira and Grosz [1993, p. 12] remark, this is changing:

"The coextensiveness of natural language perceivers and producers both enables and requires the language generator to reason about the generated language in reflective terms: 'how would I react if I heard what I am about to say?' This reflective aspect of language generation is essential at all levels of generation."

7 Incomplete Interlingua Expressions

Even with full semantic throughput and accurate ranking of interlingua candidates, a full account of text meaning is beyond the state of the art. It is easy to record verb tenses, but difficult to make the inferences required to ...

8 Incompleteness in Generation

We have encountered several classes of knowledge gaps in large-scale generation. Large-scale in our case means roughly 70,000 concepts/relations and 91,000 roots/phrases. First, we must anticipate incompleteness in the input specification, as described in the last section. Second, lexical syntactic specification may be incomplete: does verb V take a nominal direct object or an infinitival complement? Third, collocations may be missing. Knowledge of collocations has been successfully used in the past to increase the fluency of generated text [Smadja and McKeown, 1991]. In particular, such knowledge can be crucial for selecting prepositions (on the field versus in the end zone) and other forms. And fourth, dictionaries may not mark rare words and generally may not distinguish between near-synonyms. We constantly strive to augment our knowledge bases and lexicons to improve generation, but we also want to plug the gap with something reasonable.

Our approach is one of bottom-up many-paths generation [Knight and Hatzivassiloglou, 1995], in analogy to bottom-up all-paths parsing. If the generator cannot make an informed decision (lexical selection, complement type, spatial preposition, etc.), it pursues all possibilities. This is in contrast to many industrial-strength generators [Tomita and Nyberg, 1988; Penman, 1989] that never backtrack and usually default their difficult choices. Other generators [Elhadad, 1993] do backtrack, but still use default or random schemes for making new selections.

Our generator packs all paths into efficient word lattices as it goes, and the final output is also a word lattice. Because the generator produces the same data structure as the glosser, we can select a final output using the same extractor and language model described in Section 3.
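Since the generator and the glosser share the lattice data structure, reusing the extractor is straightforward; a sketch with invented alternatives (best_path is the toy bigram extractor sketched in Section 3):

```python
def pack_alternatives(choices):
    """Build a word lattice (state -> [(word, next_state)]) from a
    sequence of decision points, each a list of alternative renderings.
    Choices the generator cannot settle contribute parallel arcs."""
    lattice = {}
    for state, alternatives in enumerate(choices):
        lattice[state] = [(word, state + 1) for word in alternatives]
    return lattice, len(choices)  # lattice and its final state

# Hypothetical generator output with lexical selection and preposition
# choice left open, to be settled by the statistical extractor:
choices = [["the"], ["new", "novel"], ["company", "firm"], ["plans"],
           ["to"], ["establish", "launch"], ["in", "at"], ["February"]]
lattice, final = pack_alternatives(choices)
# best_path(lattice, final, bigrams, unigrams) would then select, e.g.,
# "the new company plans to establish in February".
```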
This approach combines the strengths of knowledge-based generation (e.g., generally grammatical lattice paths, long-distance agreement, parallel conjoined expressions) with statistical modeling (e.g., local dependencies, lexical constraints, common words and collocations). As in Section 3, we can compare random path extraction with n-gram extraction:

(random extractor)
"The new companies will have as a purpose launching at February."

(bigram extractor)
"The new company plans to establish in February."

(random extractor)
"A subsidiary of the Perkin Elmer Co. on the Japan bears majority of the stock."

(bigram extractor)
"A subsidiary of Perkin Elmer Co. in Japan bears a majority of the stock."

As the generator becomes more knowledgeable, its output lattices become leaner, and dependence on automatic statistical selection is reduced. However, the statistical component is still useful as long as uncertainty remains, and improvements in language modeling will continue to have a big effect on overall performance.

9 Conclusion

We have reported progress on aspects of the JAPANGLOSS newspaper MT system. In particular, we have described the integration of statistical and heuristic methods into a KBMT system. While these methods are not a panacea, they offer a way to fill knowledge gaps until better knowledge bases become available. Furthermore, they offer a way of rationally prioritizing manual tasks: if a statistical method solves your problem 90% of the time, you may not want to invest in a knowledge base. In other cases, statistics may offer only a small benefit over random or default choices; then, a more careful analysis is called for.

Acknowledgments

We would like to thank Yolanda Gil and the IJCAI reviewers for helpful comments on a draft of this paper.

A JAPANGLOSS MT Output

A.1 Interlingua-Based Sample

Citizen Watch announced on the eighteenth to establish a joint venture with Perkin Elmer Co., a major microcircuit manufacturing device manufacturer, and to start the production of the microcircuit manufacturing device. The new company plans a launching in February. The subsidiary of Perkin Elmer Co. in Japan bears a majority of the stock, and the production of the dry etching device that is used for the manufacturing of the microcircuit chip and the stepper is planned.

A.2 Glosser-Based Sample

The gun possession penalties important—national police agency plans of gun sword legal reform. The national police agency defend policies that change some of gun sword law that put the brakes on the proliferation of the gun. The 31st change plans control recovery plan of the wrong owning step currency—make the decision by the cabinet meeting of the middle of this month. In three pieces support estimate of the filing in this parliament.

References

[Brown et al., 1993] P. F. Brown, S. A. Della-Pietra, V. J. Della-Pietra, and R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), June 1993.

[Carlson and Nirenburg, 1990] L. Carlson and S. Nirenburg. World modeling for NLP. Technical Report CMU-CMT-90-121, Center for Machine Translation, Carnegie Mellon University, 1990.

[Chow and Schwartz, 1989] Y. Chow and R. Schwartz. The N-Best algorithm: An efficient search procedure for finding top N sentence hypotheses. In Proc. DARPA Speech and Natural Language Workshop, pages 199-202, 1989.

[Church and Gale, 1991] K. W. Church and W. A. Gale. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19-54, 1991.

[Elhadad, 1993] M. Elhadad. FUF: The universal unifier. User manual, version 5.2. Technical Report CUCS-038-91, Columbia University, 1993.

[Hatzivassiloglou and Knight, 1995] V. Hatzivassiloglou and K. Knight. Unification-based glossing. In Proc. IJCAI, 1995.

[Knight and Chander, 1994] K. Knight and I. Chander. Automated postediting of documents. In Proc. AAAI, 1994.

[Knight and Hatzivassiloglou, 1995] K. Knight and V. Hatzivassiloglou. Two-level, many-paths generation. In Proc. ACL, 1995.

[Knight and Luk, 1994] K. Knight and S. K. Luk. Building a large-scale knowledge base for machine translation. In Proc. AAAI, 1994.

[Knight et al., 1994] K. Knight, I. Chander, M. Haines, V. Hatzivassiloglou, E. Hovy, M. Iida, S. K. Luk, A. Okumura, R. Whitney, and K. Yamada. Integrating knowledge bases and statistics in MT. In Proc. Conference of the Association for Machine Translation in the Americas (AMTA), 1994.

[Lavie, 1994] A. Lavie. An integrated heuristic scheme for partial parse evaluation. In Proc. ACL (student session), 1994.

[Nirenburg and Frederking, 1994] S. Nirenburg and R. Frederking. Toward multi-engine machine translation. In Proc. ARPA Human Language Technology Workshop, 1994.

[Nirenburg et al., 1992] S. Nirenburg, J. Carbonell, M. Tomita, and K. Goodman. Machine Translation: A Knowledge-Based Approach. Morgan Kaufmann, San Mateo, Calif., 1992.

[NMSU/CRL et al., 1995] NMSU/CRL, CMU/CMT, and USC/ISI. The Pangloss Mark III machine translation system. Technical Report CMU-CMT-95-145, Carnegie Mellon University, 1995. Jointly issued by Computing Research Laboratory (New Mexico State University), Center for Machine Translation (Carnegie Mellon University), and Information Sciences Institute (University of Southern California). Edited by S. Nirenburg.

[Nyberg and Mitamura, 1992] E. Nyberg and T. Mitamura. The KANT system: Fast, accurate, high-quality translation in practical domains. In Proc. COLING, 1992.

[Okumura and Hovy, 1994] A. Okumura and E. Hovy. Ontology concept association using a bilingual dictionary. In Proc. ARPA Human Language Technology Workshop, 1994.

[Penman, 1989] Penman. The Penman documentation. Technical report, USC/Information Sciences Institute, 1989.

[Pereira and Grosz, 1993] F. C. N. Pereira and B. J. Grosz. Introduction. Artificial Intelligence, Special Volume on Natural Language Processing, 63(1-2), October 1993.

[Smadja and McKeown, 1991] F. Smadja and K. R. McKeown. Using collocations for language generation. Computational Intelligence, 7(4), December 1991.

[Tomita and Nyberg, 1988] M. Tomita and E. Nyberg. The GenKit and Transformation Kit user's guide. Technical Report CMU-CMT-88-MEMO, Center for Machine Translation, Carnegie Mellon University, 1988.

[White and O'Connell, 1994] J. White and T. O'Connell. Evaluation in the ARPA machine translation program: 1993 methodology. In Proc. ARPA Human Language Technology Workshop, 1994.

[Wilks, 1975] Y. A. Wilks. Preference semantics. In E. L. Keenan, editor, Formal Semantics of Natural Language. Cambridge University Press, Cambridge, UK, 1975.

[Yamron et al., 1994] J. Yamron, J. Cant, A. Demedts, T. Dietzel, and Y. Ito. The automatic component of the LINGSTAT machine-aided translation system. In Proc. ARPA Workshop on Human Language Technology, 1994.
