
From: AAAI Technical Report WS-92-01. Compilation copyright © 1992, AAAI (www.aaai.org). All rights reserved.

WordNet and Distributional Analysis: A Class-based Approach to Lexical Discovery

Philip Resnik*
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA
resnik@linc.cis.upenn.edu

Introduction

It has become common in statistical studies of natural language data to use measures of lexical association, such as the information-theoretic measure of mutual information, to extract useful relationships between words (e.g. [Church et al., 1989; Church and Hanks, 1989; Hindle, 1990]). For example, [Hindle, 1990] uses an estimate of mutual information to calculate what nouns a verb can take as its subjects and objects, based on distributions found within a large corpus of naturally occurring text.

Lexical association has its limits, however, since oftentimes either the data are insufficient to provide reliable word/word correspondences, or a task requires more abstraction than word/word correspondences permit. In this paper I present a generalization of lexical association techniques that addresses these limitations by facilitating statistical discovery of facts involving word classes rather than individual words. Although defining association measures over classes (as sets of words) is straightforward in theory, making direct use of such a definition is impractical because there are simply too many classes to consider. Rather than considering all possible classes, I propose constraining the set of possible word classes by using WordNet [Beckwith et al., 1991; Miller, 1990], a broad-coverage lexical/conceptual hierarchy.

Section 1 very briefly reviews mutual information as it is commonly used to discover lexical associations, and Section 2 discusses some limitations of this approach. In Section 3 I apply mutual information to word/class relationships, and propose a knowledge-based interpretation of "class" that takes advantage of the WordNet taxonomy. In Section 4 I apply the technique to a problem involving verb argument relationships, and in Sections 5 and 6 I mention some related work and suggest further applications for which such class-based investigations might prove useful.

Word/Word Relationships

Mutual information is an information-theoretic measure of association frequently used with natural language data to gauge the "relatedness" between two words x and y. The mutual information of x and y is defined as follows:

    I(x;y) = log [ Pr(x,y) / (Pr(x) Pr(y)) ]    (1)

Intuitively, the probability of seeing x and y together, Pr(x,y), gives some idea as to how related they are. However, if x and y are both very common, then it is likely that they appear together frequently simply by chance and not as a result of any relationship between them. In order to correct for this possibility, Pr(x,y) is divided by Pr(x) Pr(y), the probability that x and y would have of appearing together by chance if they were independent. Taking the logarithm of this ratio gives mutual information some desirable properties; for example, its value is respectively positive, zero, or negative according to whether x and y appear together more frequently, as frequently, or less frequently than one would expect if they were independent.[1]

Let us consider two examples of how mutual information can be used. [Church and Hanks, 1989] define the association ratio between two words as a variation on mutual information in which word y must follow word x within a window of w words. The association ratio is then used to discover "interesting associations" within a large corpus of naturally occurring text -- in this case, the 1987 Associated Press corpus. Table 1 gives some of the word pairs found in this manner. An application of such information is the discovery of interesting lexico-syntactic regularities.

* I would like to acknowledge the support of an IBM graduate fellowship, and to thank Eric Brill, Aravind Joshi, David Magerman, Christine Nakatani, and Michael Niv for their helpful comments. This research was also supported in part by the following grants: ARO DAAL03-89-C-0031, DARPA N00014-90-J-1863, NSF IRI 90-16592, and Ben Franklin 91S.3078C-1.

[1] The definition of mutual information is also related to other information-theoretic notions such as relative entropy; see [Cover and Thomas, 1991] for a clear discussion of these and related topics. See [Church et al., 1991] for a discussion of mutual information and other statistics as applied to natural language.
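Equation (1) is easy to state as runnable code. The sketch below estimates the probabilities by maximum likelihood from raw co-occurrence counts; the word pairs and counts are invented purely for illustration (they are not drawn from the corpora discussed in this paper), and base-2 logarithms are an arbitrary choice of unit.

```python
import math
from collections import Counter

def mutual_information(pair_counts, x_counts, y_counts, total):
    """Pointwise mutual information I(x;y) = log[ Pr(x,y) / (Pr(x) Pr(y)) ],
    with all probabilities estimated by maximum likelihood from counts."""
    mi = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / total
        p_x = x_counts[x] / total
        p_y = y_counts[y] / total
        mi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return mi

# Invented toy co-occurrence counts for (x, y) pairs.
pairs = Counter({("strong", "tea"): 10, ("strong", "car"): 1,
                 ("fast", "car"): 12, ("fast", "tea"): 1})
x_counts, y_counts = Counter(), Counter()
for (x, y), n in pairs.items():
    x_counts[x] += n
    y_counts[y] += n

scores = mutual_information(pairs, x_counts, y_counts, sum(pairs.values()))
# Pairs that co-occur more often than chance score positive,
# pairs that co-occur less often than chance score negative.
```

As the text observes, the sign of the score tracks whether a pair co-occurs more or less often than independence would predict; on this toy data ("strong", "tea") comes out positive and ("strong", "car") negative.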

Church and Hanks write:

  ... We believe the association ratio can also be used to search for interesting lexico-syntactic relationships between verbs and typical arguments/adjuncts .... [For example, i]t happens that set ... off was found 177 times in the 1987 AP Corpus of approximately 15 million words ... Quantitatively, I(set; off) = 5.9982, indicating that the probability of set ... off is almost 64 times greater than chance.

    Association ratio   x          y
    11.3                honorary   doctor
    11.3                doctors    dentists
    10.7                doctors    nurses
    9.4                 doctors    treating
    9.0                 examined   doctor
    8.9                 doctors    treat
    8.7                 doctor     bills

    Table 1: Some interesting associations with "doctor" (data from Church and Hanks 1989)

As a second example, consider Hindle's [1990] application of mutual information to the discovery of predicate argument relations. Unlike Church and Hanks, who restrict themselves to surface distributions of words, Hindle investigates word co-occurrences as mediated by syntactic structure. A six-million-word sample of Associated Press news stories was parsed in order to construct a collection of subject/verb/object instances. On the basis of these data, Hindle calculates a co-occurrence score (an estimate of mutual information) for verb/object pairs and verb/subject pairs. Table 2 shows some of the verb/object pairs for the verb drink that occurred more than once, ranked by co-occurrence score, "in effect giving the answer to the question 'what can you drink?'" [Hindle, 1990, p. 270].

    Co-occurrence score   verb    object
    11.75                 drink   tea
    11.75                 drink   Pepsi
    11.75                 drink   champagne
    10.53                 drink   liquid
    10.20                 drink   beer
    9.34                  drink   wine

    Table 2: High-scoring verb/object pairs for drink (this is a portion of Hindle 1990, Table 2)

Word/Word Limitations

The application of mutual information at the level of word pairs suffers from two obvious limitations:

  - the corpus may fail to provide sufficient information about relevant word/word relationships, and

  - word/word relationships, even those supported by the data, may not be the appropriate relationships to look at for some tasks.

Let us consider these limitations in turn.

First, as in all statistical applications, it must be possible to estimate probabilities accurately. Although larger and larger corpora are increasingly available, the specific task under consideration often can restrict the choice of corpus to one that provides a smaller sample than necessary to discover all the lexical relationships of interest. This can lead some lexical relationships to go unnoticed.

For example, the Brown Corpus [Francis and Kucera, 1982] has the attractive (and for some tasks, necessary) property of providing a sample that is balanced across many genres. In an experiment using the Brown Corpus, modelled after Hindle's [1990] investigation of verb/object relationships, I calculated the mutual information between verbs and the nouns that appeared as their objects.[2] Table 3 shows objects of the verb open; as in Table 2, the listing includes only verb/object pairs that were encountered more than once.

    Co-occurrence score   verb   object      count
    5.8470                open   closet      2
    5.8470                open   chapter     2
    5.7799                open   mouth       7
    5.5840                open   window      5
    5.4320                open   store       2
    5.4045                open   door        26
    5.0170                open   scene       3
    4.6246                open   season      2
    4.2621                open   church      2
    3.6246                open   minute      2
    3.2621                open   eye         7
    3.1101                open   statement   2
    1.5991                open   time        2
    1.3305                open   way         3

    Table 3: Verb/object pairs for open (with count > 1)

Attention to the verb/object pairs that occurred only once, however, led to an interesting observation. Included among the "discarded" object nouns was the following set: discourse, engagement, reply, program, conference, and session. Although each of these appeared as the object of open only once -- certainly too infrequently to provide reliable estimates -- this set, considered as a whole, reveals an interesting fact about some kinds of things that can be opened, roughly captured by the notion of communications. More generally, several pieces of statistically unreliable information at the lexical level may nonetheless capture a useful statistical regularity when combined. This observation motivates an approach to lexical association that makes such combinations possible.

A second limitation of the word/word relationship is simply this: some tasks are not amenable to treatment using lexical relationships alone.

[2] The experiment differed from that described in [Hindle, 1990] primarily in the way the direct object of the verb was identified; see Section 4 for details.

An example is the automatic discovery of verb selectional preferences for natural language systems. Here, the relationship of interest holds not between a verb and a noun, but between the verb and a class of nouns (e.g. between eat and nouns that are [+EDIBLE]). Given a table built using lexical statistics, such as Table 2 or 3, no single lexical item necessarily stands out as the "preferred" object of the verb -- the selectional restriction on the object of drink should be something like "beverages," not tea. Once again, the limitation seems to be one that can be addressed by considering sets or classes of nouns, rather than individual lexical items.

Word/Class Relationships

A Measure of Association

In this section, I present a method for investigating word/class relationships in text corpora on the basis of mutual information, using for illustration the problem of finding "prototypical" object classes for verbs. As will become evident, the generalization from word/word relationships to word/class relationships addresses the two limitations discussed in the previous section.

Let

    V = {v1, v2, ..., vl}
    N = {n1, n2, ..., nm}

be the sets of all verbs and all nouns, respectively, and let

    C = {c | c ⊆ N}

be the set of noun classes; that is, the power set of N. Since the relationship being investigated holds between verbs and classes of their objects, the elementary events of interest are members of V x C. The joint probability of a verb and a class is estimated as

    Pr(v,c) = [ SUM over n in c of count(v,n) ] / [ SUM over v' in V, n' in N of count(v',n') ]    (2)

Given v in V and c in C, define the association score

    A(v,c) = Pr(c|v) log [ Pr(v,c) / (Pr(v) Pr(c)) ]    (3)
           = Pr(c|v) I(v;c)    (4)

The association score takes the mutual information between the verb and a class, and scales it according to the likelihood that a member of the class will actually appear as the object of the verb.[3]

Coherent Classes

The search for a verb's most typical object nouns requires at most |N| evaluations of the association function, and can thus be done exhaustively. An exhaustive search among object classes is impractical, however, since the number of classes is exponential. Clearly some way to constrain the search is needed.

Let us consider the possibility of restricting the search by imposing some requirement of coherence upon the classes to be considered. For example, considering objects of open, the class {closet, locker} is more coherent than {closet, discourse} on intuitive grounds: every noun in the former class describes a repository of some kind, whereas the latter class has no such obvious interpretation.

There are two ways to arrive at a catalogue of coherent noun classes. First, one can take a "knowledge-free" approach, deriving a set of classes or a class hierarchy on the basis of distributional analysis. [Hindle, 1990] describes one distributional approach to classifying nouns, based upon the verbs for which they tend to appear as arguments. [Brown et al., 1990] and [Brill et al., 1990] propose algorithms for building word class hierarchies using n-gram distributions.[4] Approaches such as these share some of the common advantages and disadvantages of statistical methods. On the positive side, they fine-tune themselves to fit the data at hand: the classes discovered by these techniques are exactly those for which there is evidence in the particular corpus under consideration. On the negative side, there may be useful classes for which there is nonetheless too little evidence in the data -- this may result from low counts, or it may be that what makes the class coherent (e.g., some shared semantic feature) is not reflected sufficiently by lexical distributions.

A second possibility is to take a "knowledge-based" approach, constraining possible noun classes to be those that appear within a hand-crafted lexical knowledge base or semantic hierarchy. Here, too, there are known advantages and disadvantages associated with knowledge-based systems. On the positive side, a knowledge base built by hand captures regularities that may not be easily recoverable statistically, and it can be organized according to principles and generalizations that seem appropriate to the researcher regardless of what evidence there is in any particular corpus. On the negative side, the decisions made in building a knowledge base may be based upon intuitions (or theoretical biases) that are not fully supported by the data; worse, few knowledge bases, regardless of quality, are broad enough in scope to support work based upon corpora of unconstrained text.

Clearly there is no foolproof way to choose between the knowledge-free and knowledge-based approaches.

[3] Scaling the pointwise mutual information score in this fashion is relatively common in practice; cf. the score used by [Gale et al., 1992] and the "expected utility" of [Rosenfeld and Huang, 1992].

[4] Noun classification has even been investigated within a connectionist framework [Elman, 1990], although to my knowledge no one has attempted a practical demonstration using a non-trivial corpus.
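Equations (2)-(4) translate directly into code. In the sketch below the verb/object counts are invented toy data (not the Brown Corpus sample used later), and classes are represented simply as sets of nouns:

```python
import math
from collections import Counter

def association_score(verb, cls, counts):
    """A(v,c) = Pr(c|v) * I(v;c), following equations (2)-(4).
    `counts` maps (verb, noun) pairs to frequencies; `cls` is a set of nouns."""
    total = sum(counts.values())
    n_vc = sum(k for (v, n), k in counts.items() if v == verb and n in cls)
    n_v = sum(k for (v, _), k in counts.items() if v == verb)
    n_c = sum(k for (_, n), k in counts.items() if n in cls)
    if n_vc == 0:
        return 0.0
    # Maximum-likelihood estimates of Pr(v,c), Pr(v), Pr(c); Pr(c|v) = n_vc/n_v.
    i_vc = math.log2((n_vc / total) / ((n_v / total) * (n_c / total)))
    return (n_vc / n_v) * i_vc

# Invented toy counts.
counts = Counter({("drink", "tea"): 3, ("drink", "beer"): 2,
                  ("drink", "idea"): 1, ("open", "door"): 5,
                  ("open", "tea"): 1})
beverages = {"tea", "beer"}
drink_score = association_score("drink", beverages, counts)
open_score = association_score("open", beverages, counts)
```

On this toy data the beverage class scores well above zero for drink and below zero for open, mirroring the intended behaviour of the score.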

For largely practical reasons, I have chosen to adopt a knowledge-based approach, using the WordNet lexical database [Beckwith et al., 1991; Miller, 1990]. WordNet is a lexical/conceptual database constructed as robustly as possible using psycholinguistic principles. Miller and colleagues write:

  How the leading psycholinguistic theories should be exploited for this project was not always obvious. Unfortunately, most research of interest for psycholexicology has dealt with relatively small samples of the English lexicon ... All too often, an interesting hypothesis is put forward, fifty or a hundred words illustrating it are considered, and extension to the rest of the lexicon is left as an exercise for the reader. One motive for developing WordNet was to expose such hypotheses to the full range of the common vocabulary. WordNet presently contains approximately 54,000 different word forms ... and only the most robust hypotheses have survived. [Miller et al., 1990], p. 2.

Although I cannot judge how well WordNet fares with regard to its psycholinguistic aims, its noun taxonomy appears to have many of the qualities needed if it is to provide basic taxonomic knowledge for the purpose of corpus-based research in English.[5] As noted above, its coverage is remarkably broad (though proper names represent a notable exception). In addition, most easily-identifiable senses of words are distinguished; for example, buck has distinct senses corresponding to both the unit of currency and the physical object, among others, and newspaper is identified both as a form of paper and as a mass medium of communication.

Given the WordNet noun hierarchy, the definition of "coherent class" adopted here is straightforward. Let words(w) be the set of nouns associated with a WordNet class w.[6]

  Definition. A noun class c in C is coherent iff there is a WordNet class w such that words(w) ∩ N = c.

As a consequence of this definition, noun classes that are "too small" or "too large" to be coherent are excluded. For example, suppose that water, tea, beer, whiskey, and happiness are among the elements of N. The noun class {water, tea, beer} turns out to be incoherent, because there is no WordNet class that includes its three elements but fails to include whiskey. Conversely, the noun class {water, tea, happiness} is excluded because there is no WordNet class that includes all three elements: the WordNet noun hierarchy has no unique top element, so even the most general WordNet class that includes water and tea (the class (thing, [entity, thing])) fails to include happiness, which has (state, [state]), (psychol_feature, [psychol_feature]), and (abstraction, [abstraction]) as its most general superordinate categories.[7]

Experimental Results

An experiment was performed in order to discover the "prototypical" object classes for a set of common English verbs. The verbs considered are frequently-occurring verbs found in a corpus of parental speech.[8]

The counts of equation (2) were calculated by collecting a sample of verb/object pairs from the Brown Corpus, which contains approximately one million words of English text, balanced across diverse genres such as newspaper articles, periodicals, humor, fiction, and government documents.[9] The object of the verb was identified in a fashion similar to that of [Hindle, 1990], except that a set of heuristics was used to extract surface objects only from surface strings, rather than using a fragmentary parse to extract both deep and surface objects. In addition, verb inflections were mapped down to a base form and plural nouns were mapped down to their singular forms.[10] For example, the sentence John ate two shiny red apples would yield the pair (eat, apple). The sentence These are the apples that John ate would not provide a pair for eat, since apple does not appear as its surface object.

Given each verb, v, the verb's "prototypical" object class was found by conducting a search upwards in the WordNet noun hierarchy, starting with WordNet classes containing synonym-set members that appeared as objects of the verb. Each WordNet class w considered was evaluated by calculating A(v, N ∩ words(w)). (For details see Figure 1.) Classes having too low a count (fewer than five occurrences with the verb) were excluded from consideration, as were a set of classes at or near the top of the hierarchy that described concepts such as entity, psychological_feature, abstraction, state, event, and so forth.[11]

The results of this experiment are encouraging.

[5] WordNet also contains taxonomic information about verbs and adjectives, and the taxonomy goes well beyond subordinate/superordinate relations. See [Beckwith et al., 1991].

[6] Strictly speaking, WordNet as described by [Miller et al., 1990] does not have classes, but rather lexical groupings called synonym sets. By "WordNet class" I mean a pair (word, synonym-set). For example, the WordNet class (iron, [iron, branding_iron]) is a subclass of (implement, [implement]).

[7] A related lexical representation scheme being investigated independently by Paul Kogut (personal communication) is to assign to each noun and verb a vector of feature/value pairs based upon the word's classification in the WordNet hierarchy, and to classify nouns on the basis of their feature-value correspondences.

[8] I am grateful to Annie Lederer for providing these data.

[9] The version of the Brown Corpus used was the tagged corpus found as part of the Penn Treebank [Brill et al., 1990].

[10] Nouns outside the scope of WordNet that were tagged as proper names were mapped to the token pname, a subclass of the classes (someone, [person]) and (location, [location]).
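The coherence test in the Definition above can be sketched as code against a toy hypernym table standing in for the WordNet noun hierarchy; the hierarchy, class names, and helper functions below are invented for illustration:

```python
def words(cls, hyponyms):
    """Nouns at or below `cls` in a toy taxonomy (a stand-in for
    the words(w) of a WordNet class)."""
    result, stack = set(), [cls]
    while stack:
        w = stack.pop()
        children = hyponyms.get(w, [])
        if children:
            stack.extend(children)
        else:
            result.add(w)      # leaves are the nouns themselves
    return result

def is_coherent(c, nouns, hyponyms, classes):
    """c is coherent iff some taxonomy class w has words(w) & nouns == c."""
    return any(words(w, hyponyms) & nouns == c for w in classes)

# Invented mini-taxonomy mirroring the example in the text.
hyponyms = {"entity": ["beverage", "state"],
            "beverage": ["water", "tea", "beer", "whiskey"],
            "state": ["happiness"]}
nouns = {"water", "tea", "beer", "whiskey", "happiness"}
classes = list(hyponyms) + sorted(nouns)

coherent = is_coherent({"water", "tea", "beer", "whiskey"}, nouns, hyponyms, classes)
incoherent = is_coherent({"water", "tea", "beer"}, nouns, hyponyms, classes)
```

Here {water, tea, beer, whiskey} is picked out exactly by the toy beverage class, but no class picks out {water, tea, beer} without also including whiskey, so that class is rejected, just as in the text.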

    1.  objnouns <- { n | count(v,n) > 0 }

    2.  classes <- UNION over n in objnouns of { w | n in synset(w) }

    3a. maxscore <- max over w in classes of A(v, N ∩ words(w))
    3b. maxclass <- argmax over w in classes of A(v, N ∩ words(w))

    4.  threshold <- maxscore

    5.  while (classes is not empty) do
            c <- argmax over w in classes of A(v, N ∩ words(w))
            score <- A(v, N ∩ words(c))
            classes <- classes - {c}
            if (score >= threshold) then
                classes <- classes ∪ hypernyms(c)
            if (score > maxscore) then
                maxscore <- score; maxclass <- c

    6.  return(maxclass)

    Figure 1: Search algorithm for identifying a verb's "prototypical" object class

    A(v,c)   object class
    3.58     (beverage, [drink, ...])
    2.05     (alcoholic_beverage, [intoxicant, ...])

    Table 4: Object classes for drink

    A(v,c)   verb     object class
    0.16     call     (someone, [person])
    0.30     catch    (looking_at, [look])
    2.39     climb    (stair, [step])
    1.15     close    (movable_barrier, [...])
    3.64     cook     (meal, [repast])
    0.27     draw     (cord, [cord])
    1.76     eat      (nutrient, [food])
    0.45     forget   (conception, [concept])
    0.81     ignore   (question, [problem])
    2.29     mix      (intoxicant, [alcohol, ...])
    0.26     move     (article_of_commerce, [...])
    0.39     need     (helping, [aid])
    0.66     stop     (vehicle, [vehicle])
    0.34     take     (spatial_property, [...])
    0.45     work     (change_of_place, [...])

    Table 5: Some prototypical object classes
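Figure 1 can be rendered as runnable code. Everything below is a sketch under invented assumptions: the miniature taxonomy and counts stand in for WordNet and the corpus, `classes_of` and `hypernyms` are hypothetical helpers, and the `>=` threshold comparison and tie-handling reflect one reading of the figure rather than the original implementation.

```python
import math
from collections import Counter

def prototypical_class(verb, objnouns, classes_of, hypernyms, score):
    """Greedy upward search of Figure 1: seed with classes containing the
    verb's observed objects, repeatedly pop the best-scoring class, expand
    its hypernyms when it scores at least `threshold`, and return the
    best class seen overall."""
    classes = set()
    for n in objnouns:                                   # steps 1-2
        classes |= classes_of(n)
    maxclass = max(classes, key=lambda w: score(verb, w))
    maxscore = threshold = score(verb, maxclass)         # steps 3-4
    seen = set()
    while classes:                                       # step 5
        c = max(classes, key=lambda w: score(verb, w))
        s = score(verb, c)
        classes.discard(c)
        seen.add(c)
        if s >= threshold:                               # promising: climb
            classes |= hypernyms(c) - seen
        if s > maxscore:
            maxscore, maxclass = s, c
    return maxclass                                      # step 6

# Invented toy taxonomy and verb/object counts.
parents = {"tea": ["beverage"], "beer": ["beverage"], "water": ["beverage"],
           "beverage": ["substance"], "substance": ["entity"],
           "door": ["artifact"], "artifact": ["entity"]}
counts = Counter({("drink", "tea"): 3, ("drink", "beer"): 2,
                  ("drink", "water"): 2, ("see", "tea"): 5,
                  ("see", "door"): 8, ("open", "door"): 6})

def hypernyms(w):
    return set(parents.get(w, []))

def classes_of(n):          # lowest class containing each noun
    return {n}

def leaves(w):
    kids = [c for c, ps in parents.items() if w in ps]
    return {w} if not kids else set().union(*(leaves(k) for k in kids))

def score(v, w):            # A(v, words(w)) as in equation (3)
    nouns = leaves(w)
    total = sum(counts.values())
    n_vc = sum(k for (vb, n), k in counts.items() if vb == v and n in nouns)
    n_v = sum(k for (vb, _), k in counts.items() if vb == v)
    n_c = sum(k for (_, n), k in counts.items() if n in nouns)
    if n_vc == 0:
        return float("-inf")
    return (n_vc / n_v) * math.log2((n_vc / total) / ((n_v / total) * (n_c / total)))

best = prototypical_class("drink", {"tea", "beer", "water"},
                          classes_of, hypernyms, score)
```

On this toy data the search climbs from the individual nouns to the beverage class, which scores higher than any single noun, and stops short of the over-general entity class.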

    verb     object subset
    call     (pname, man, ...)
    catch    (eye)
    climb    (step, stair)
    close    (door)
    cook     (meal, supper, dinner)
    draw     (line, thread, yarn)
    eat      (egg, cereal, meal, mussel, celery, chicken, ...)
    forget   (possibility, name, word, rule, order)
    ignore   (problem, question)
    mix      (martini, liquor)
    move     (comb, arm, driver, switch, bit, furniture, ...)
    need     (assistance, help, support, service, ...)
    stop     (engine, wheel, car, tractor, machine, bus)
    take     (position, attitude, place, shape, form, ...)
    work     (way, shift)

    Table 6: Objects of the verb appearing in the "prototypical" class

[11] The categories at the top of the hierarchy were excluded primarily for computational reasons. Since instances of such high-level classes appear extremely frequently, one would expect them not to have high mutual information with any given verb. The experiment has recently been replicated using parental speech from the CHILDES corpus [MacWhinney, 1991], without excluding any classes a priori, with comparable results.

Table 4 shows the object classes discovered for the verb drink, which can be compared to the corresponding results using lexical statistics in Table 2. Table 5 shows the highest-scoring object classes for a selection of fifteen verbs chosen randomly from the sample (these are taken to be the "prototypical" object classes), and Table 6 identifies which subset of the objects of the verb fell into that "prototypical" class. Tables 5 and 6 can be evaluated on the basis of how reasonably the class answers the question "What do you typically --?" for each verb. However, the search process also provides additional information about the kinds of object that appear with the verb. For example, although (door, [door]) is the top-scoring class for open, other classes with positive scores include

  - (entrance, [entrance]): (door)

  - (mouth, [mouth]): (mouth)

  - (repository, [depository, ...]): (store, closet, locker, trunk)

  - (container, [container, ...]): (bottle, bag, trunk, locker, can, box, hamper)

  - (time_period, [time_period, ...]): (tour, round, season, spring, session, week, evening, morning, saturday)

  - (oral_communication, [speech, ...]): (discourse, engagement, relation, reply, mouth, program, conference, session)

  - (writing, [writing]): (scene, book, program, statement, bible, paragraph, chapter)

Thus, although this experiment suggests that one "prototypically" opens doors, the other object classes for open suggest a number of plausible distinctions in the way the verb is used. Some of the distinctions, such as that between the classes "repository" and "container," are not at all obvious by simple inspection of the verb's object nouns, even when ranked by mutual information.[12]

The careful reader may have noted a potential objection to this approach among the various classes listed above. Although mouth (saucy or disrespectful language) and relation (the act of telling or recounting) are both instances of oral communication, it is likely that when the pairs (open, mouth) and (open, relation) appeared in the corpus, the objects were not being used in that sense. Nonetheless, those instances counted as evidence for the purpose of calculating the association score for open and (oral_communication, [speech, ...]). In addition, this problem surfaces in an obvious way for classes containing just a single word; for example, the singleton set {mouth} was also associated with the WordNet classes (impertinence, [impudence]) and (riposte, [rejoinder]).

Although inappropriate word senses add noise to the data, it is not clear that they seriously undermine efforts to extract useful associations. Previous efforts using word/word associations (e.g. [Church et al., 1989; Hindle, 1990]) appear to have relied on the strength of large numbers to wash out inappropriate senses. Presumably (play, record) (as in John played a record on his stereo) and (beat, record) (as in John beat the school record for absenteeism) are both considered instances of record, and the word is effectively counted in all senses appropriate to the verbs with which it appears. The approach taken here is similar: so long as the appropriate class is among the classes to which a word belongs, the word's appearance in other classes should not make a significant difference. (An exception pointed out by Hindle [1990] is the case of consistent or systematic errors in parsing, or, in this case, in the heuristics used to identify the direct object.)

In summary, the experiment described in this section demonstrates the viability of using a hand-constructed conceptual hierarchy such as WordNet as a basis for computing class-based statistics. Exploring the relationships between words and classes in this fashion, rather than relationships between words, permits even low-frequency word-pairs to contribute to statistical regularities, and provides a suitable level of abstraction for tasks such as this one that go beyond purely lexical associations.

Related Work

[Yarowsky, 1992] has independently investigated an approach similar to the one described here and applied it to the problem of word sense disambiguation. He demonstrates that word classes within a hand-constructed taxonomy -- Roget's Thesaurus -- can successfully be used as word sense discriminators by calculating statistics similar to those of equations (2) and (4).

One interesting difference between the two approaches has to do with use of levels within the taxonomy. Yarowsky restricts the possible word-sense labels to Roget's numbered classes -- in effect, designating a priori which horizontal slice across the hierarchy will contain the relevant distinctions. In contrast, the search procedure defined here relies on the statistical association score to automatically find a proper level in the hierarchy: too low, and the next higher class may bring in additional relevant words; too high, and the class may have expanded too much. The advantages and disadvantages of a priori restrictions within the hierarchy warrant further consideration.

Recently, [Basili et al., 1992] have independently pointed out the limitations discussed in Section 2, writing that "the major problem with word-pairs collections is that reliable results are obtained only for a small subset of high-frequency words on very large corpora" and that "an analysis based simply on surface [i.e. word-pair] distribution may produce data at a level of granularity too fine" (p. 97). They propose using high-level semantic tags in order to perform a statistical analysis similar to the one presented here, where the tags are assigned by hand from a very small, domain-specific set of categories. Their technique is applied to the problem of acquiring selectional restrictions across a variety of syntactic relations (subject/verb, verb/object, noun/adjective, etc.). As in the present study, only surface syntactic relationships are considered.

Were an equivalent of the WordNet taxonomy available for Italian, it would be interesting to apply the technique proposed here to the same set of data considered by [Basili et al., 1992]. The most significant difference between the two approaches, as was the case for [Yarowsky, 1992], is in how nouns are grouped into classes. Although Basili et al. argue in favor of domain-dependent classification, they point out the obvious drawback of having to reclassify the corpus for new application domains.

[12] According to the American Heritage Dictionary, a repository is "a place where things may be put for safekeeping," whereas a container is "something, as a box or barrel, in which material is held or carried." In WordNet neither class subsumes the other.

53 An alternativeapproach might be to use a broad- Statistical Language Modelling coveragetaxonomy such as WordNetrather than A central problem in statistical language modelling hand-defined,hand-assigned semantic tags. Such an as used in , for example -- is approachwould represent a compromisebetween the accurate prediction of the next word on the basis of overlyfine-grained distinctions found when consid- its prior context. Abstractly, the goal is to closely eringindividual words and theextremely high-level approximate Pr(X, IX~-I), where -1 is an ab- classificationrequired if semantictagging is to be breviation for X1, ¯ ¯., X,_ 1 and each Xi ranges over donemanually. Should the classification still prove words in the . In practice, these estimates toofine grained, or unsuitablefor the application do- are frequently calculated by computing main,one could(a) ezcludefrom consideration all taxonomicclasses other than those considered rele- Pr(X.lX~ -1) ~ Pr(X. lf(X’~-l)), vantto the task,and (b) construcl, as disjunctions where f is a function that partitions the possi- of taxonomicclasses, a small set of domain-specific classesnot founddirectly in the taxonomy. ble prior contexts into equivalence classes. For the commontrigram , for example, [Grefenstetteand Hearst.,1992] have recently f(z~-1) "- Xn_2~n_l, SO that presentedan interestingcombination of knowledge- based and statisticalmethods concerning noun Pr(XnIX~-I) ~ Pr(XnlXn_~X,_I)). 
classes. They explore the possibility of combining a corpus-based lexical discovery technique (using pattern-matching to identify hyponymy relations) with a statistically-based word similarity measure, in order to discover new instances of lexical relations for extending lexical hierarchies such as WordNet.

Further Applications

In this section, I briefly consider three areas in statistical natural language processing for which a class-based approach of the kind proposed here might prove useful. Of these, only the last represents work in progress; the first two are as yet unexplored.

Computational Lexicography and Psycholexicology

For the purposes of lexicography, the work presented here can be considered a generalization of [Church and Hanks, 1989], which describes the use of a lexical association score in the identification of semantic classes and the analysis of concordances. The discussion of the object classes discovered for open in Section 4 suggests that corpus-based statistics and broad-coverage lexical taxonomies represent a powerful combination, despite the fact that neither the corpus data nor the taxonomy are free of error. To the extent that WordNet respects the lexicographer's intuitions about taxonomy, the classes uncovered as statistically relevant will provide important information about usage that might otherwise be overlooked.

In addition, the experience of applying WordNet to large corpora may be useful to researchers in psycholexicology. WordNet was designed to "expose [psycholinguistic] hypotheses to the full range of the common vocabulary," and it is likely that class-based statistics of the kind described here, calculated using large quantities of naturally occurring text, can contribute to that enterprise by uncovering places in the hierarchy where the linguist's view of lexical taxonomy diverges from common usage.

Language Modelling

In addition, probability estimates often make use of information from several different equivalence classes, either by "backing off" or by smoothing; for example, the smoothing technique of interpolated estimation [Bahl et al., 1983; Jelinek and Mercer, 1980] estimates

    Pr(x_n | x_1^{n-1}) ≈ Σ_i λ_i Pr(x_n | f_i(x_1^{n-1})),

where each f_i determines a different equivalence class, and the λ_i's weight the equivalence classes' contributions.

Several recent approaches to language modelling have extended the notion of equivalence class to include more than just the classes formed by "forgetting" all but the previous few words (e.g. equation (5)). [Bahl et al., 1989] proposes using a binary decision tree to classify prior contexts: since each node of a decision tree constitutes an equivalence class, interpolated estimation can be used to combine each value of f_i computed along the path from any leaf node to the root of the tree. [Brown et al., 1990] suggests a hierarchical partitioning of the vocabulary into equivalence classes on the basis of similar distributions: models can then be estimated by defining f(x_1^{n-1}) = c_{n-2} c_{n-1}, where each c_j is the equivalence class of x_j.

These approaches are strictly "knowledge-free," in the sense that they eschew hand-constructed information about lexical classes. However, the results of Section 4 suggest that, by using a broad-coverage lexical database like WordNet, statistical techniques can take advantage of a conceptual space structured by a hand-constructed taxonomy. A disadvantage of the purely statistical classification methods proposed in [Bahl et al., 1989] and [Brown et al., 1990] is their difficulty in classifying a word not encountered in the training data; in addition, such methods have no remedy if a word's distribution in the training data differs substantially from its distribution in the test data. These difficulties might be alleviated by an integrated approach that takes advantage of taxonomic as well as purely distributional information, treating concepts in the taxonomy as just another kind of equivalence class.
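Treating taxonomic concepts as just another kind of equivalence class can be made concrete with interpolated estimation. The following sketch combines a unigram estimate, a word-based bigram estimate, and a class-based estimate in the spirit of [Brown et al., 1990]; the toy corpus, the class assignments, and the weights λ_i are invented for illustration, and in practice the weights would be trained (e.g., on held-out data).

```python
from collections import Counter

# Toy corpus and a hand-assigned word classing standing in for a
# taxonomy; all data here are invented for illustration.
corpus = "john ate the meal mary ate the apple john opened the door".split()
word_class = {"john": "PERSON", "mary": "PERSON", "meal": "FOOD",
              "apple": "FOOD", "door": "ARTIFACT", "ate": "VERB",
              "opened": "VERB", "the": "DET"}

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
# Class-based counts: the history is reduced to the equivalence class
# of the previous word, in the spirit of f(x_1^{n-1}) = c_{n-1}.
class_bigrams = Counter((word_class[x], y) for x, y in zip(corpus, corpus[1:]))
class_hist = Counter(word_class[x] for x in corpus[:-1])

def p_unigram(w):
    # f_1 forgets the history entirely.
    return unigrams[w] / len(corpus)

def p_bigram(w, prev):
    # f_2 keeps only the previous word.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_class(w, prev):
    # f_3 keeps only the class of the previous word.
    c = word_class[prev]
    return class_bigrams[(c, w)] / class_hist[c] if class_hist[c] else 0.0

def p_interpolated(w, prev, lambdas=(0.2, 0.5, 0.3)):
    """Pr(w | history) ~ sum_i lambda_i * Pr(w | f_i(history))."""
    l1, l2, l3 = lambdas  # hand-picked here; trained in practice
    return l1 * p_unigram(w) + l2 * p_bigram(w, prev) + l3 * p_class(w, prev)

# With these counts, Pr(the | ate) interpolates a unigram estimate (3/12),
# a bigram estimate (2/2), and a class-based estimate (3/3).
```

Because each component is itself a distribution over the vocabulary (for histories observed in training), any convex combination of them is too, so the taxonomic component's weight can be raised or lowered without breaking the model.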

The method of interpolated estimation provides one straightforward way to effect the combination: the contribution of the taxonomic component will be large or small, as compared to other kinds of equivalence class, according to how much its predictions contribute to the accuracy of the language model as a whole.

Linguistic Investigation

Diathesis alternations are variations in the way that a verb syntactically expresses its arguments. For example, (1) shows an instance of the indefinite object alternation,

(1) a. John ate the meal.
    b. John ate.
(2) *The meal ate.

and example (3) shows an instance of the causative/inchoative alternation [Levin, 1989].

(3) a. John opened the door.
    b. The door opened.
(4) *John opened.

Such phenomena are of particular interest in the study of how children learn the semantic and syntactic properties of verbs, because they stand at the border of syntax and lexical semantics. There are numerous possible explanations for why verbs can and cannot appear in particular alternations, ranging from shared semantic properties of verbs within a class, to pragmatic factors, to "lexical idiosyncrasy." Statistical techniques like the one described in this paper may be useful in investigating relationships between verbs and their arguments, with the goal of contributing data to the study of diathesis alternations, and, ideally, in constructing a computational model of verb acquisition.

One interesting outcome of the experiment described in Section 4 is that the verbs participating in "implicit nominal object" diathesis alternations[13] associate more strongly with their "prototypical" object classes than do verbs for which implicit objects are disallowed. Preliminary results, in fact, show a statistically significant difference between the two groups. This suggests the following interesting question: might shared information-theoretic properties of verbs play a role in their acquisition, in the same way that shared semantic properties do?

[13] The indefinite object alternation [Levin, 1989] and the specified object alternation [Cote, 1992] -- roughly, verbs for which the direct object is understood when omitted.

Conclusions

In this paper, I have suggested a method for investigating statistical relationships between words and word classes. The approach comprises two parts: a (straightforward) application of word-based mutual information to sets of words, and a proposal for organizing those sets according to a knowledge-based conceptual taxonomy. The experiment described here demonstrates the viability of the technique, and shows how the class-based approach is an improvement on statistics that use simple lexical association.

References

Lalit R. Bahl, Frederick Jelinek, and Robert L. Mercer. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):179-190, March 1983.

Lalit R. Bahl, Peter F. Brown, Peter V. de Souza, and Robert L. Mercer. A tree-based statistical language model for natural language speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 37:1001-1008, July 1989.

Roberto Basili, Maria Teresa Pazienza, and Paola Velardi. Computational lexicons: the neat examples and the odd exemplars. In Third Conference on Applied Natural Language Processing, pages 96-103. Association for Computational Linguistics, March 1992.

Richard Beckwith, Christiane Fellbaum, Derek Gross, and George Miller. WordNet: A lexical database organized on psycholinguistic principles. In Uri Zernik, editor, Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pages 211-232. Erlbaum, 1991.

Eric Brill, D. Magerman, M. Marcus, and B. Santorini. Deducing linguistic structure from the statistics of large corpora. In DARPA Speech and Natural Language Workshop, 1990.

Peter F. Brown, Vincent J. Della Pietra, Peter V. de Souza, and Robert L. Mercer. Class-based n-gram models of natural language. In Proceedings of the IBM Natural Language ITL, pages 283-298, Paris, France, March 1990.

Kenneth Church and Patrick Hanks. Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the ACL, 1989.

K. Church, W. Gale, P. Hanks, and D. Hindle. Parsing, word associations and typical predicate-argument relations. In Proceedings of the International Workshop on Parsing Technologies, August 1989.

Kenneth Church, William Gale, Patrick Hanks, and Donald Hindle. Using statistics in lexical analysis. In Uri Zernik, editor, Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pages 116-164. Erlbaum, 1991.

Sharon Cote. Discourse functions of two types of null objects in English, January 1992. Presented at the 66th Annual Meeting of the Linguistic Society of America, Philadelphia, PA.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley, 1991.

Jeffrey Elman. Finding structure in time. Cognitive Science, 14:179-211, 1990.

W. Francis and H. Kucera. Frequency Analysis of English Usage. Houghton Mifflin Co., New York, 1982.

William Gale, Kenneth Church, and David Yarowsky. A method for disambiguating word senses in a large corpus. Technical report, AT&T Bell Laboratories, 1992.

Gregory Grefenstette and Marti Hearst. A method for refining automatically-discovered lexical relations:

Combining weak techniques for stronger results. In AAAI Workshop on Statistically-based NLP Techniques, San Jose, July 1992.

Donald Hindle. Noun classification from predicate-argument structures. In Proceedings of the 28th Annual Meeting of the ACL, 1990.

Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, Amsterdam, The Netherlands, May 1980. North-Holland.

Beth Levin. Towards a lexical organization of English verbs. Technical report, Dept. of Linguistics, Northwestern University, November 1989.

Brian MacWhinney. The CHILDES project: tools for analyzing talk. Erlbaum, 1991.

George Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. Five papers on WordNet. CSL Report 43, Cognitive Science Laboratory, Princeton University, July 1990.

George Miller. WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 1990. Special Issue.

Ronald Rosenfeld and Xuedong Huang. Improvements in stochastic language modelling. In Mitch Marcus, editor, Fifth DARPA Workshop on Speech and Natural Language, Arden House Conference Center, Harriman, NY, February 1992.

David Yarowsky. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proceedings of COLING-92, pages 454-460, Nantes, France, July 1992.
