Refining Automatically-Discovered Lexical Relations: Combining Weak Techniques for Stronger Results
From: AAAI Technical Report WS-92-01. Compilation copyright © 1992, AAAI (www.aaai.org). All rights reserved.

Marti A. Hearst
Computer Science Division
571 Evans Hall
University of California, Berkeley
Berkeley, CA 94720
marti@cs.berkeley.edu

Gregory Grefenstette
Department of Computer Science
210 MIB
University of Pittsburgh
Pittsburgh, PA 15260
[email protected]

Abstract

Knowledge-poor corpus-based approaches to natural language processing are attractive in that they do not incur the difficulties associated with complex knowledge bases and real-world inferences. However, these kinds of language processing techniques in isolation often do not suffice for a particular task; for this reason we are interested in finding ways to combine various techniques and improve their results.

Accordingly, we conducted experiments to refine the results of an automatic lexical discovery technique by making use of a statistically-based syntactic similarity measure. The discovery program uses lexico-syntactic patterns to find instances of the hyponymy relation in large text bases. Once relations of this sort are found, they should be inserted into an existing lexicon or thesaurus. However, the terms in the relation may have multiple senses, thus hampering automatic placement. In order to address this problem we applied a term-similarity determination technique to the problem of choosing where, in an existing lexical hierarchy, to install a lexical relation. The union of these two corpus-based methods is promising, although only partially successful in the experiments run so far. Here we report some preliminary results, and make suggestions for how to improve the technique in future.

Introduction

Knowledge-poor corpus-based approaches to natural language processing are attractive in that they do not incur the difficulties associated with complex knowledge bases and real-world inferences, while they promise to offer efficient means of exploiting ever-growing quantities of on-line text. However, coarse-level language processing techniques in isolation often do not suffice for a particular task; for this reason we are interested in finding ways to combine various approaches and improve their results.

Accordingly, we conducted experiments to refine the results of an automatic lexical discovery technique by making use of a statistically-based syntactic similarity measure, and integrating them with an existing knowledge structure. The discovery program uses lexico-syntactic patterns to find instances of the hyponymy (i.e., ISA) relation in large text bases. Once relations of this sort are found, they should be inserted into an existing lexicon or thesaurus. However, the terms in the relation may have multiple senses, thus hampering automatic placement. In order to address this problem we applied a term-similarity determination technique to the problem of choosing where, in an existing lexical hierarchy, to install a lexical relation. The union of these two corpus-based methods is promising, although only partially successful in the experiments run so far.

These ideas are related to other recent work in several ways. We make use of restricted syntactic information as do Brent's (Brent 1991) verb subcategorization frame recognition technique and Smadja's (Smadja & McKeown 1990) collocation acquisition algorithm. The work reported here attempts to find semantic similarity among terms based on the contexts they tend to occur in; (Church & Hanks 1990) uses frequency of co-occurrence of content words to create clusters of semantically similar words, (Hindle 1990) uses both simple syntactic subject-verb-object frames and frequency of occurrence of content words to determine similarity among nouns, and (Calzolari & Bindi 1990) use corpus-based statistical association ratios to determine lexical information such as prepositional complementation relations, modification relations, and significant compounds. This paper presents an attempt to combine knowledge-poor techniques; (Wilks et al. 1992) discusses the potential power behind combining weak methods and describes advances achieved using this paradigm.

The next section describes in detail the problem being addressed and the two existing coarse-level language processing techniques that are to be combined. This is followed by a description of how the similarity calculations are done, the results of applying these calculations to several examples, the difficulties that arise in each case, and a sketch of some solutions for these difficulties. We then illustrate a side-effect of the integration of statistical techniques with a lexical knowledge source, followed by a brief conclusion.

The Problem: Integration of Lexical Relations

(Hearst 1992) reports a method for the automatic acquisition of the hyponymy lexical relation from unrestricted text. In this method, the text is scanned for instances of distinguished lexico-syntactic patterns that indicate the relation of interest.

For example, consider the lexico-syntactic pattern

    ... NP {, NP}* {,} or other NP ...

When a sentence containing this pattern is found (with some restrictions on the syntax to the left and the right of the pattern) it can be inferred that the NP's on the left of "or other" are hyponyms of the NP on the right (where NP indicates a simple noun phrase). From the sentence

    Bruises, wounds, broken bones or other injuries are common.

we can infer:

    hyponym(bruise, injury)
    hyponym(wound, injury)
    hyponym(broken bone, injury)

This pattern is one of several that have been identified as indicating the hyponymy relation.

This approach differs from statistical techniques in an interesting way. Both require as their primary resource a large text collection, but whereas statistical techniques try to find correlations and make generalizations based on the data of the entire corpus, only a single instance of a lexico-syntactic pattern need be found in order to have made a discovery.

Once the relation is found, it is desirable to integrate it as a part of an existing network of lexical relations. We want to develop a means to correctly insert an instance of the hyponymy relation into an existing hyponymically-structured network (hyponymy is reflexive and transitive, but not symmetric).

For our experiments we use the manually constructed thesaurus WordNet (Miller et al. 1990). In WordNet, word forms with synonymous meanings are grouped into sets, called synsets. This allows a distinction to be made between senses of homographs. For example, the noun "board" appears in the synsets {board, plank} and {board, committee}, and this grouping serves for the most part as the word's definition. In version 1.1, WordNet contains about 34,000 noun word forms, including some compounds and proper nouns, organized into about 26,000 synsets. Noun synsets are organized hierarchically [1] according to the hyponymy relation with implied inheritance and are further distinguished by values of features such as meronymy. WordNet's coverage is impressive and provides a good base for an automatic acquisition algorithm to build on.

Now, assuming we have discovered the relation hyponym(X, Y), indicating that X is a kind of Y, we wish to enter this relation into the WordNet network.

If the network is sufficiently mature, as is WordNet, we can assume that most of the highly ambiguous words are already present and appear in higher levels of the network. Therefore, most of the time we will be trying to insert a rather specific term that itself does not need to be disambiguated (i.e., it has only one main sense) as a hyponym of a term that can have one or more senses. If we assume this is indeed the case, then there are two scenarios to consider:

(Scenario 1) Each sense of Y has several child subtrees and the task is to determine which subtree X shares context with. This in turn implies which sense of Y the hyponym relation refers to.

(Scenario 2) One or more of the senses of Y have no children. Thus there are no subtrees to compare X against.

There are two considerations associated with Scenario 1:

(1a) X is not a direct descendent of Y, but belongs two or more levels down.

(1b) X belongs in a new subtree of its own, even though the correct sense of Y has one or more child subtrees.

In the work described here we address only the situation associated with Scenario 1, since our technique uses the child subtrees to determine which sense of Y to associate X with.

It has been observed (e.g., (Kelly & Stone 1975)) that the sense of a word can be inferred from the lexical contexts in which the word is found. As a (simplified) example, when 'bank' is used in its riverbank sense, it is often surrounded by words having to do with bodies of water, while when used in its financial institution sense, it appears with appropriate financial terms. The strategy we present here makes use of an extension of this idea; namely, we will look at the contexts of each subtree of the hypernym of interest, and see which subtrees' contexts coincide most closely with the contexts that the target hyponym tends to occur in.

[1] Although WordNet's hyponyms are structured as a directed network, as opposed to as a tree, for the purposes of this paper, we treat it as if it were a tree.

To restate: in order to place the hyponym relation into the network, we propose the following:

Similarity Hypothesis: when comparing the contexts that hyponym X occurs in with the contexts of the subtrees (e.g., senses) of hypernym Y, X's contexts will be found to be most similar to those of the subtree of Y in which X belongs.

How will the context comparison be done? (Grefenstette 1992b) has developed a weak technique, embodied in a program called SEXTANT, which, given a tar-

    creation, instauration
    => colonization, settlement

Our goal is to see if, by examining the syntactic contexts of these terms in a corpus of text, we can decide under which synset to place 'Harvard'.

Given a large enough text sample, SEXTANT can tell us what words are used in the most similar ways to 'Harvard'.
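As an illustration only, the Similarity Hypothesis can be sketched in a few lines of Python. The sketch assumes that the contexts of each term are available as a set of attributes, and it uses a plain Jaccard overlap as a stand-in similarity measure. The function names, the choice of measure, the sense labels, and the toy context sets are all assumptions made for this illustration; they are not SEXTANT's actual procedure or data.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of context attributes."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_subtree(x_contexts, subtree_contexts):
    """Pick the subtree of Y whose pooled contexts are most similar to X's.

    x_contexts       -- set of context attributes observed for hyponym X
    subtree_contexts -- dict mapping a subtree label (one per sense of Y)
                        to the pooled context attributes of its terms
    Returns (label of the best subtree, dict of all scores).
    """
    scores = {label: jaccard(x_contexts, ctx)
              for label, ctx in subtree_contexts.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy data, invented for this sketch (hypothetical labels and attributes).
x_contexts = {"student", "professor", "degree", "campus"}
subtree_contexts = {
    "Y sense 1": {"student", "professor", "campus", "fund"},
    "Y sense 2": {"marriage", "tradition", "law"},
}

best, scores = best_subtree(x_contexts, subtree_contexts)
print(best, scores)
```

Under the hypothesis, X would be installed beneath the sense of Y whose subtree receives the highest score. A set overlap is the simplest possible choice here; a frequency-weighted measure could be substituted in `jaccard` without changing the selection step.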