
From: AAAI Technical Report WS-92-01. Compilation copyright © 1992, AAAI (www.aaai.org). All rights reserved.

WordNet and Distributional Analysis: A Class-based Approach to Lexical Discovery

Philip Resnik*
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA
resnik@linc.cis.upenn.edu

Introduction

It has become common in statistical studies of natural language data to use measures of lexical association, such as the information-theoretic measure of mutual information, to extract useful relationships between words (e.g. [Church et al., 1989; Church and Hanks, 1989; Hindle, 1990]). For example, [Hindle, 1990] uses an estimate of mutual information to calculate what nouns a verb can take as its subjects and objects, based on distributions found within a large corpus of naturally occurring text.

Lexical association has its limits, however, since oftentimes either the data are insufficient to provide reliable word/word correspondences, or a task requires more abstraction than word/word correspondences permit. In this paper I present a generalization of lexical association techniques that addresses these limitations by facilitating statistical discovery of facts involving word classes rather than individual words. Although defining association measures over classes (as sets of words) is straightforward in theory, making direct use of such a definition is impractical because there are simply too many classes to consider. Rather than considering all possible classes, I propose constraining the set of possible word classes by using WordNet [Beckwith et al., 1991; Miller, 1990], a broad-coverage lexical/conceptual hierarchy.

Section 1 very briefly reviews mutual information as it is commonly used to discover lexical associations, and Section 2 discusses some limitations of this approach. In Section 3 I apply mutual information to word/class relationships, and propose a knowledge-based interpretation of "class" that takes advantage of the WordNet taxonomy. In Section 4 I apply the technique to a problem involving verb argument relationships, and in Sections 5 and 6 I mention some related work and suggest further applications for which such class-based investigations might prove useful.

Word/Word Relationships

Mutual information is an information-theoretic measure of association frequently used with natural language data to gauge the "relatedness" between two words x and y. The mutual information of x and y is defined as follows:

    I(x;y) = log [ Pr(x,y) / (Pr(x) Pr(y)) ]    (1)

Intuitively, the probability of seeing x and y together, Pr(x,y), gives some idea as to how related they are. However, if x and y are both very common, then it is likely that they appear together frequently simply by chance and not as a result of any relationship between them. In order to correct for this possibility, Pr(x,y) is divided by Pr(x) Pr(y), the probability that x and y would have of appearing together by chance if they were independent. Taking the logarithm of this ratio gives mutual information some desirable properties; for example, its value is respectively positive, zero, or negative according to whether x and y appear together more frequently, as frequently, or less frequently than one would expect if they were independent.[1]

Let us consider two examples of how mutual information can be used. [Church and Hanks, 1989] define the association ratio between two words as a variation on mutual information in which word y must follow word x within a window of w words. The association ratio is then used to discover "interesting associations" within a large corpus of naturally occurring text -- in this case, the 1987 Associated Press corpus. Table 1 gives some of the word pairs found in this manner. An application of such information is the discovery of interesting lexico-syntactic regularities.

* I would like to acknowledge the support of an IBM graduate fellowship, and to thank Eric Brill, Aravind Joshi, David Magerman, Christine Nakatani, and Michael Niv for their helpful comments. This research was also supported in part by the following grants: ARO DAAL03-89-C-0031, DARPA N00014-90-J-1863, NSF IRI 90-16592, and Ben Franklin 91S.3078C-1.

[1] The definition of mutual information is also related to other information-theoretic notions such as relative entropy; see [Cover and Thomas, 1991] for a clear discussion of these and related topics. See [Church et al., 1991] for a discussion of mutual information and other statistics as applied to natural language.
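Equation (1) is easy to state as runnable code. The sketch below estimates the probabilities by maximum likelihood from raw co-occurrence counts; the word pairs and counts are invented purely for illustration (they are not drawn from the corpora discussed in this paper), and base-2 logarithms are an arbitrary choice of unit.

```python
import math
from collections import Counter

def mutual_information(pair_counts, x_counts, y_counts, total):
    """Pointwise mutual information I(x;y) = log[ Pr(x,y) / (Pr(x) Pr(y)) ],
    with all probabilities estimated by maximum likelihood from counts."""
    mi = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / total
        p_x = x_counts[x] / total
        p_y = y_counts[y] / total
        mi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return mi

# Invented toy co-occurrence counts for (x, y) pairs.
pairs = Counter({("strong", "tea"): 10, ("strong", "car"): 1,
                 ("fast", "car"): 12, ("fast", "tea"): 1})
x_counts, y_counts = Counter(), Counter()
for (x, y), n in pairs.items():
    x_counts[x] += n
    y_counts[y] += n

scores = mutual_information(pairs, x_counts, y_counts, sum(pairs.values()))
# Pairs that co-occur more often than chance score positive,
# pairs that co-occur less often than chance score negative.
```

As the text observes, the sign of the score tracks whether a pair co-occurs more or less often than independence would predict; on this toy data ("strong", "tea") comes out positive and ("strong", "car") negative.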

Church and Hanks write:

  ... We believe the association ratio can also be used to search for interesting lexico-syntactic relationships between verbs and typical arguments/adjuncts .... [For example, i]t happens that set ... off was found 177 times in the 1987 AP Corpus of approximately 15 million words ... Quantitatively, I(set; off) = 5.9982, indicating that the probability of set ... off is almost 64 times greater than chance.

    Association ratio   x          y
    11.3                honorary   doctor
    11.3                doctors    dentists
    10.7                doctors    nurses
    9.4                 doctors    treating
    9.0                 examined   doctor
    8.9                 doctors    treat
    8.7                 doctor     bills

    Table 1: Some interesting associations with "doctor" (data from Church and Hanks 1989)

As a second example, consider Hindle's [1990] application of mutual information to the discovery of predicate argument relations. Unlike Church and Hanks, who restrict themselves to surface distributions of words, Hindle investigates word co-occurrences as mediated by syntactic structure. A six-million-word sample of Associated Press news stories was parsed in order to construct a collection of subject/verb/object instances. On the basis of these data, Hindle calculates a co-occurrence score (an estimate of mutual information) for verb/object pairs and verb/subject pairs. Table 2 shows some of the verb/object pairs for the verb drink that occurred more than once, ranked by co-occurrence score, "in effect giving the answer to the question 'what can you drink?'" [Hindle, 1990, p. 270].

    Co-occurrence score   verb    object
    11.75                 drink   tea
    11.75                 drink   Pepsi
    11.75                 drink   champagne
    10.53                 drink   liquid
    10.20                 drink   beer
    9.34                  drink   wine

    Table 2: High-scoring verb/object pairs for drink (this is a portion of Hindle 1990, Table 2)

Word/Word Limitations

The application of mutual information at the level of word pairs suffers from two obvious limitations:

  - the corpus may fail to provide sufficient information about relevant word/word relationships, and

  - word/word relationships, even those supported by the data, may not be the appropriate relationships to look at for some tasks.

Let us consider these limitations in turn.

First, as in all statistical applications, it must be possible to estimate probabilities accurately. Although larger and larger corpora are increasingly available, the specific task under consideration often can restrict the choice of corpus to one that provides a smaller sample than necessary to discover all the lexical relationships of interest. This can lead some lexical relationships to go unnoticed.

For example, the Brown Corpus [Francis and Kucera, 1982] has the attractive (and for some tasks, necessary) property of providing a sample that is balanced across many genres. In an experiment using the Brown Corpus, modelled after Hindle's [1990] investigation of verb/object relationships, I calculated the mutual information between verbs and the nouns that appeared as their objects.[2] Table 3 shows objects of the verb open; as in Table 2, the listing includes only verb/object pairs that were encountered more than once.

    Co-occurrence score   verb   object      count
    5.8470                open   closet      2
    5.8470                open   chapter     2
    5.7799                open   mouth       7
    5.5840                open   window      5
    5.4320                open   store       2
    5.4045                open   door        26
    5.0170                open   scene       3
    4.6246                open   season      2
    4.2621                open   church      2
    3.6246                open   minute      2
    3.2621                open   eye         7
    3.1101                open   statement   2
    1.5991                open   time        2
    1.3305                open   way         3

    Table 3: Verb/object pairs for open (with count > 1)

Attention to the verb/object pairs that occurred only once, however, led to an interesting observation. Included among the "discarded" object nouns was the following set: discourse, engagement, reply, program, conference, and session. Although each of these appeared as the object of open only once -- certainly too infrequently to provide reliable estimates -- this set, considered as a whole, reveals an interesting fact about some kinds of things that can be opened, roughly captured by the notion of communications. More generally, several pieces of statistically unreliable information at the lexical level may nonetheless capture a useful statistical regularity when combined. This observation motivates an approach to lexical association that makes such combinations possible.

A second limitation of the word/word relationship is simply this: some tasks are not amenable to treatment using lexical relationships alone.

[2] The experiment differed from that described in [Hindle, 1990] primarily in the way the direct object of the verb was identified; see Section 4 for details.

An example is the automatic discovery of verb selectional preferences for natural language systems. Here, the relationship of interest holds not between a verb and a noun, but between the verb and a class of nouns (e.g. between eat and nouns that are [+EDIBLE]). Given a table built using lexical statistics, such as Table 2 or 3, no single lexical item necessarily stands out as the "preferred" object of the verb -- the selectional restriction on the object of drink should be something like "beverages," not tea. Once again, the limitation seems to be one that can be addressed by considering sets or classes of nouns, rather than individual lexical items.

Word/Class Relationships

A Measure of Association

In this section, I present a method for investigating word/class relationships in text corpora on the basis of mutual information, using for illustration the problem of finding "prototypical" object classes for verbs. As will become evident, the generalization from word/word relationships to word/class relationships addresses the two limitations discussed in the previous section.

Let

    V = {v1, v2, ..., vl}
    N = {n1, n2, ..., nm}

be the sets of all verbs and all nouns, respectively, and let

    C = {c | c ⊆ N}

be the set of noun classes; that is, the power set of N. Since the relationship being investigated holds between verbs and classes of their objects, the elementary events of interest are members of V x C. The joint probability of a verb and a class is estimated as

    Pr(v,c) = [ SUM over n in c of count(v,n) ] / [ SUM over v' in V, n' in N of count(v',n') ]    (2)

Given v in V and c in C, define the association score

    A(v,c) = Pr(c|v) log [ Pr(v,c) / (Pr(v) Pr(c)) ]    (3)
           = Pr(c|v) I(v;c)    (4)

The association score takes the mutual information between the verb and a class, and scales it according to the likelihood that a member of the class will actually appear as the object of the verb.[3]

Coherent Classes

The search for a verb's most typical object nouns requires at most |N| evaluations of the association function, and can thus be done exhaustively. An exhaustive search among object classes is impractical, however, since the number of classes is exponential. Clearly some way to constrain the search is needed.

Let us consider the possibility of restricting the search by imposing some requirement of coherence upon the classes to be considered. For example, considering objects of open, the class {closet, locker} is more coherent than {closet, discourse} on intuitive grounds: every noun in the former class describes a repository of some kind, whereas the latter class has no such obvious interpretation.

There are two ways to arrive at a catalogue of coherent noun classes. First, one can take a "knowledge-free" approach, deriving a set of classes or a class hierarchy on the basis of distributional analysis. [Hindle, 1990] describes one distributional approach to classifying nouns, based upon the verbs for which they tend to appear as arguments. [Brown et al., 1990] and [Brill et al., 1990] propose algorithms for building word class hierarchies using n-gram distributions.[4] Approaches such as these share some of the common advantages and disadvantages of statistical methods. On the positive side, they fine-tune themselves to fit the data at hand: the classes discovered by these techniques are exactly those for which there is evidence in the particular corpus under consideration. On the negative side, there may be useful classes for which there is nonetheless too little evidence in the data -- this may result from low counts, or it may be that what makes the class coherent (e.g., some shared semantic feature) is not reflected sufficiently by lexical distributions.

A second possibility is to take a "knowledge-based" approach, constraining possible noun classes to be those that appear within a hand-crafted lexical knowledge base or semantic hierarchy. Here, too, there are known advantages and disadvantages associated with knowledge-based systems. On the positive side, a knowledge base built by hand captures regularities that may not be easily recoverable statistically, and it can be organized according to principles and generalizations that seem appropriate to the researcher regardless of what evidence there is in any particular corpus. On the negative side, the decisions made in building a knowledge base may be based upon intuitions (or theoretical biases) that are not fully supported by the data; worse, few knowledge bases, regardless of quality, are broad enough in scope to support work based upon corpora of unconstrained text.

Clearly there is no foolproof way to choose between the knowledge-free and knowledge-based approaches.

[3] Scaling the pointwise mutual information score in this fashion is relatively common in practice; cf. the score used by [Gale et al., 1992] and the "expected utility" of [Rosenfeld and Huang, 1992].

[4] Noun classification has even been investigated within a connectionist framework [Elman, 1990], although to my knowledge no one has attempted a practical demonstration using a non-trivial corpus.
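Equations (2)-(4) translate directly into code. In the sketch below the verb/object counts are invented toy data (not the Brown Corpus sample used later), and classes are represented simply as sets of nouns:

```python
import math
from collections import Counter

def association_score(verb, cls, counts):
    """A(v,c) = Pr(c|v) * I(v;c), following equations (2)-(4).
    `counts` maps (verb, noun) pairs to frequencies; `cls` is a set of nouns."""
    total = sum(counts.values())
    n_vc = sum(k for (v, n), k in counts.items() if v == verb and n in cls)
    n_v = sum(k for (v, _), k in counts.items() if v == verb)
    n_c = sum(k for (_, n), k in counts.items() if n in cls)
    if n_vc == 0:
        return 0.0
    # Maximum-likelihood estimates of Pr(v,c), Pr(v), Pr(c); Pr(c|v) = n_vc/n_v.
    i_vc = math.log2((n_vc / total) / ((n_v / total) * (n_c / total)))
    return (n_vc / n_v) * i_vc

# Invented toy counts.
counts = Counter({("drink", "tea"): 3, ("drink", "beer"): 2,
                  ("drink", "idea"): 1, ("open", "door"): 5,
                  ("open", "tea"): 1})
beverages = {"tea", "beer"}
drink_score = association_score("drink", beverages, counts)
open_score = association_score("open", beverages, counts)
```

On this toy data the beverage class scores well above zero for drink and below zero for open, mirroring the intended behaviour of the score.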

For largely practical reasons, I have chosen to adopt a knowledge-based approach, using the WordNet lexical database [Beckwith et al., 1991; Miller, 1990]. WordNet is a lexical/conceptual database constructed as robustly as possible using psycholinguistic principles. Miller and colleagues write:

  How the leading psycholinguistic theories should be exploited for this project was not always obvious. Unfortunately, most research of interest for psycholexicology has dealt with relatively small samples of the English lexicon ... All too often, an interesting hypothesis is put forward, fifty or a hundred words illustrating it are considered, and extension to the rest of the lexicon is left as an exercise for the reader. One motive for developing WordNet was to expose such hypotheses to the full range of the common vocabulary. WordNet presently contains approximately 54,000 different word forms ... and only the most robust hypotheses have survived. [Miller et al., 1990], p. 2.

Although I cannot judge how well WordNet fares with regard to its psycholinguistic aims, its noun taxonomy appears to have many of the qualities needed if it is to provide basic taxonomic knowledge for the purpose of corpus-based research in English.[5] As noted above, its coverage is remarkably broad (though proper names represent a notable exception). In addition, most easily-identifiable senses of words are distinguished; for example, buck has distinct senses corresponding to both the unit of currency and the physical object, among others, and newspaper is identified both as a form of paper and as a mass medium of communication.

Given the WordNet noun hierarchy, the definition of "coherent class" adopted here is straightforward. Let words(w) be the set of nouns associated with a WordNet class w.[6]

  Definition. A noun class c in C is coherent iff there is a WordNet class w such that words(w) ∩ N = c.

As a consequence of this definition, noun classes that are "too small" or "too large" to be coherent are excluded. For example, suppose that water, tea, beer, whiskey, and happiness are among the elements of N. The noun class {water, tea, beer} turns out to be incoherent, because there is no WordNet class that includes its three elements but fails to include whiskey. Conversely, the noun class {water, tea, happiness} is excluded because there is no WordNet class that includes all three elements: the WordNet noun hierarchy has no unique top element, so even the most general WordNet class that includes water and tea (the class (thing, [entity, thing])) fails to include happiness, which has (state, [state]), (psychol_feature, [psychol_feature]), and (abstraction, [abstraction]) as its most general superordinate categories.[7]

Experimental Results

An experiment was performed in order to discover the "prototypical" object classes for a set of common English verbs. The verbs considered are frequently-occurring verbs found in a corpus of parental speech.[8]

The counts of equation (2) were calculated by collecting a sample of verb/object pairs from the Brown Corpus, which contains approximately one million words of English text, balanced across diverse genres such as newspaper articles, periodicals, humor, fiction, and government documents.[9] The object of the verb was identified in a fashion similar to that of [Hindle, 1990], except that a set of heuristics was used to extract surface objects only from surface strings, rather than using a fragmentary parse to extract both deep and surface objects. In addition, verb inflections were mapped down to a base form and plural nouns were mapped down to their singular forms.[10] For example, the sentence John ate two shiny red apples would yield the pair (eat, apple). The sentence These are the apples that John ate would not provide a pair for eat, since apple does not appear as its surface object.

Given each verb, v, the verb's "prototypical" object class was found by conducting a search upwards in the WordNet noun hierarchy, starting with WordNet classes containing synonym-set members that appeared as objects of the verb. Each WordNet class w considered was evaluated by calculating A(v, N ∩ words(w)). (For details see Figure 1.) Classes having too low a count (fewer than five occurrences with the verb) were excluded from consideration, as were a set of classes at or near the top of the hierarchy that described concepts such as entity, psychological_feature, abstraction, state, event, and so forth.[11]

The results of this experiment are encouraging.

[5] WordNet also contains taxonomic information about verbs and adjectives, and the taxonomy goes well beyond subordinate/superordinate relations. See [Beckwith et al., 1991].

[6] Strictly speaking, WordNet as described by [Miller et al., 1990] does not have classes, but rather lexical groupings called synonym sets. By "WordNet class" I mean a pair (word, synonym-set). For example, the WordNet class (iron, [iron, branding_iron]) is a subclass of (implement, [implement]).

[7] A related lexical representation scheme being investigated independently by Paul Kogut (personal communication) is to assign to each noun and verb a vector of feature/value pairs based upon the word's classification in the WordNet hierarchy, and to classify nouns on the basis of their feature-value correspondences.

[8] I am grateful to Annie Lederer for providing these data.

[9] The version of the Brown Corpus used was the tagged corpus found as part of the Penn Treebank [Brill et al., 1990].

[10] Nouns outside the scope of WordNet that were tagged as proper names were mapped to the token pname, a subclass of the classes (someone, [person]) and (location, [location]).
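The coherence test in the Definition above can be sketched as code against a toy hypernym table standing in for the WordNet noun hierarchy; the hierarchy, class names, and helper functions below are invented for illustration:

```python
def words(cls, hyponyms):
    """Nouns at or below `cls` in a toy taxonomy (a stand-in for
    the words(w) of a WordNet class)."""
    result, stack = set(), [cls]
    while stack:
        w = stack.pop()
        children = hyponyms.get(w, [])
        if children:
            stack.extend(children)
        else:
            result.add(w)      # leaves are the nouns themselves
    return result

def is_coherent(c, nouns, hyponyms, classes):
    """c is coherent iff some taxonomy class w has words(w) & nouns == c."""
    return any(words(w, hyponyms) & nouns == c for w in classes)

# Invented mini-taxonomy mirroring the example in the text.
hyponyms = {"entity": ["beverage", "state"],
            "beverage": ["water", "tea", "beer", "whiskey"],
            "state": ["happiness"]}
nouns = {"water", "tea", "beer", "whiskey", "happiness"}
classes = list(hyponyms) + sorted(nouns)

coherent = is_coherent({"water", "tea", "beer", "whiskey"}, nouns, hyponyms, classes)
incoherent = is_coherent({"water", "tea", "beer"}, nouns, hyponyms, classes)
```

Here {water, tea, beer, whiskey} is picked out exactly by the toy beverage class, but no class picks out {water, tea, beer} without also including whiskey, so that class is rejected, just as in the text.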

    1.  objnouns <- { n | count(v,n) > 0 }

    2.  classes <- UNION over n in objnouns of { w | n in synset(w) }

    3a. maxscore <- max over w in classes of A(v, N ∩ words(w))
    3b. maxclass <- argmax over w in classes of A(v, N ∩ words(w))

    4.  threshold <- maxscore

    5.  while (classes is not empty) do
            c <- argmax over w in classes of A(v, N ∩ words(w))
            score <- A(v, N ∩ words(c))
            classes <- classes - {c}
            if (score >= threshold) then
                classes <- classes ∪ hypernyms(c)
            if (score > maxscore) then
                maxscore <- score; maxclass <- c

    6.  return(maxclass)

    Figure 1: Search algorithm for identifying a verb's "prototypical" object class

    A(v,c)   object class
    3.58     (beverage, [drink, ...])
    2.05     (alcoholic_beverage, [intoxicant, ...])

    Table 4: Object classes for drink

    A(v,c)   verb     object class
    0.16     call     (someone, [person])
    0.30     catch    (looking_at, [look])
    2.39     climb    (stair, [step])
    1.15     close    (movable_barrier, [...])
    3.64     cook     (meal, [repast])
    0.27     draw     (cord, [cord])
    1.76     eat      (nutrient, [food])
    0.45     forget   (conception, [concept])
    0.81     ignore   (question, [problem])
    2.29     mix      (intoxicant, [alcohol, ...])
    0.26     move     (article_of_commerce, [...])
    0.39     need     (helping, [aid])
    0.66     stop     (vehicle, [vehicle])
    0.34     take     (spatial_property, [...])
    0.45     work     (change_of_place, [...])

    Table 5: Some prototypical object classes
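Figure 1 can be rendered as runnable code. Everything below is a sketch under invented assumptions: the miniature taxonomy and counts stand in for WordNet and the corpus, `classes_of` and `hypernyms` are hypothetical helpers, and the `>=` threshold comparison and tie-handling reflect one reading of the figure rather than the original implementation.

```python
import math
from collections import Counter

def prototypical_class(verb, objnouns, classes_of, hypernyms, score):
    """Greedy upward search of Figure 1: seed with classes containing the
    verb's observed objects, repeatedly pop the best-scoring class, expand
    its hypernyms when it scores at least `threshold`, and return the
    best class seen overall."""
    classes = set()
    for n in objnouns:                                   # steps 1-2
        classes |= classes_of(n)
    maxclass = max(classes, key=lambda w: score(verb, w))
    maxscore = threshold = score(verb, maxclass)         # steps 3-4
    seen = set()
    while classes:                                       # step 5
        c = max(classes, key=lambda w: score(verb, w))
        s = score(verb, c)
        classes.discard(c)
        seen.add(c)
        if s >= threshold:                               # promising: climb
            classes |= hypernyms(c) - seen
        if s > maxscore:
            maxscore, maxclass = s, c
    return maxclass                                      # step 6

# Invented toy taxonomy and verb/object counts.
parents = {"tea": ["beverage"], "beer": ["beverage"], "water": ["beverage"],
           "beverage": ["substance"], "substance": ["entity"],
           "door": ["artifact"], "artifact": ["entity"]}
counts = Counter({("drink", "tea"): 3, ("drink", "beer"): 2,
                  ("drink", "water"): 2, ("see", "tea"): 5,
                  ("see", "door"): 8, ("open", "door"): 6})

def hypernyms(w):
    return set(parents.get(w, []))

def classes_of(n):          # lowest class containing each noun
    return {n}

def leaves(w):
    kids = [c for c, ps in parents.items() if w in ps]
    return {w} if not kids else set().union(*(leaves(k) for k in kids))

def score(v, w):            # A(v, words(w)) as in equation (3)
    nouns = leaves(w)
    total = sum(counts.values())
    n_vc = sum(k for (vb, n), k in counts.items() if vb == v and n in nouns)
    n_v = sum(k for (vb, _), k in counts.items() if vb == v)
    n_c = sum(k for (_, n), k in counts.items() if n in nouns)
    if n_vc == 0:
        return float("-inf")
    return (n_vc / n_v) * math.log2((n_vc / total) / ((n_v / total) * (n_c / total)))

best = prototypical_class("drink", {"tea", "beer", "water"},
                          classes_of, hypernyms, score)
```

On this toy data the search climbs from the individual nouns to the beverage class, which scores higher than any single noun, and stops short of the over-general entity class.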

    verb     object subset
    call     (pname, man, ...)
    catch    (eye)
    climb    (step, stair)
    close    (door)
    cook     (meal, supper, dinner)
    draw     (line, thread, yarn)
    eat      (egg, cereal, meal, mussel, celery, chicken, ...)
    forget   (possibility, name, word, rule, order)
    ignore   (problem, question)
    mix      (martini, liquor)
    move     (comb, arm, driver, switch, bit, furniture, ...)
    need     (assistance, help, support, service, ...)
    stop     (engine, wheel, car, tractor, machine, bus)
    take     (position, attitude, place, shape, form, ...)
    work     (way, shift)

    Table 6: Objects of the verb appearing in the "prototypical" class

[11] The categories at the top of the hierarchy were excluded primarily for computational reasons. Since instances of such high-level classes appear extremely frequently, one would expect them not to have high mutual information with any given verb. The experiment has recently been replicated using parental speech from the CHILDES corpus [MacWhinney, 1991], without excluding any classes a priori, with comparable results.

Table 4 shows the object classes discovered for the verb drink, which can be compared to the corresponding results using lexical statistics in Table 2. Table 5 shows the highest-scoring object classes for a selection of fifteen verbs chosen randomly from the sample (these are taken to be the "prototypical" object classes), and Table 6 identifies which subset of the objects of the verb fell into that "prototypical" class. Tables 5 and 6 can be evaluated on the basis of how reasonably the class answers the question "What do you typically --?" for each verb. However, the search process also provides additional information about the kinds of object that appear with the verb. For example, although (door, [door]) is the top-scoring class for open, other classes with positive scores include

  - (entrance, [entrance]): (door)

  - (mouth, [mouth]): (mouth)

  - (repository, [depository, ...]): (store, closet, locker, trunk)

  - (container, [container, ...]): (bottle, bag, trunk, locker, can, box, hamper)

  - (time_period, [time_period, ...]): (tour, round, season, spring, session, week, evening, morning, saturday)

  - (oral_communication, [speech, ...]): (discourse, engagement, relation, reply, mouth, program, conference, session)

  - (writing, [writing]): (scene, book, program, statement, bible, paragraph, chapter)

Thus, although this experiment suggests that one "prototypically" opens doors, the other object classes for open suggest a number of plausible distinctions in the way the verb is used. Some of the distinctions, such as that between the classes "repository" and "container," are not at all obvious by simple inspection of the verb's object nouns, even when ranked by mutual information.[12]

The careful reader may have noted a potential objection to this approach among the various classes listed above. Although mouth (saucy or disrespectful language) and relation (the act of telling or recounting) are both instances of oral communication, it is likely that when the pairs (open, mouth) and (open, relation) appeared in the corpus, the objects were not being used in that sense. Nonetheless, those instances counted as evidence for the purpose of calculating the association score for open and (oral_communication, [speech, ...]). In addition, this problem surfaces in an obvious way for classes containing just a single word; for example, the singleton set {mouth} was also associated with the WordNet classes (impertinence, [impudence]) and (riposte, [rejoinder]).

Although inappropriate word senses add noise to the data, it is not clear that they seriously undermine efforts to extract useful associations. Previous efforts using word/word associations (e.g. [Church et al., 1989; Hindle, 1990]) appear to have relied on the strength of large numbers to wash out inappropriate senses. Presumably (play, record) (as in John played a record on his stereo) and (beat, record) (as in John beat the school record for absenteeism) are both considered instances of record, and the word is effectively counted in all senses appropriate to the verbs with which it appears. The approach taken here is similar: so long as the appropriate class is among the classes to which a word belongs, the word's appearance in other classes should not make a significant difference. (An exception pointed out by Hindle [1990] is the case of consistent or systematic errors in parsing, or, in this case, in the heuristics used to identify the direct object.)

In summary, the experiment described in this section demonstrates the viability of using a hand-constructed conceptual hierarchy such as WordNet as a basis for computing class-based statistics. Exploring the relationships between words and classes in this fashion, rather than relationships between words, permits even low-frequency word-pairs to contribute to statistical regularities, and provides a suitable level of abstraction for tasks such as this one that go beyond purely lexical associations.

Related Work

[Yarowsky, 1992] has independently investigated an approach similar to the one described here and applied it to the problem of word sense disambiguation. He demonstrates that word classes within a hand-constructed taxonomy -- Roget's Thesaurus -- can successfully be used as word sense discriminators by calculating statistics similar to those of equations (2) and (4).

One interesting difference between the two approaches has to do with use of levels within the taxonomy. Yarowsky restricts the possible word-sense labels to Roget's numbered classes -- in effect, designating a priori which horizontal slice across the hierarchy will contain the relevant distinctions. In contrast, the search procedure defined here relies on the statistical association score to automatically find a proper level in the hierarchy: too low, and the next higher class may bring in additional relevant words; too high, and the class may have expanded too much. The advantages and disadvantages of a priori restrictions within the hierarchy warrant further consideration.

Recently, [Basili et al., 1992] have independently pointed out the limitations discussed in Section 2, writing that "the major problem with word-pairs collections is that reliable results are obtained only for a small subset of high-frequency words on very large corpora" and that "an analysis based simply on surface [i.e. word-pair] distribution may produce data at a level of granularity too fine" (p. 97). They propose using high-level semantic tags in order to perform a statistical analysis similar to the one presented here, where the tags are assigned by hand from a very small, domain-specific set of categories. Their technique is applied to the problem of acquiring selectional restrictions across a variety of syntactic relations (subject/verb, verb/object, noun/adjective, etc.). As in the present study, only surface syntactic relationships are considered.

Were an equivalent of the WordNet taxonomy available for Italian, it would be interesting to apply the technique proposed here to the same set of data considered by [Basili et al., 1992]. The most significant difference between the two approaches, as was the case for [Yarowsky, 1992], is in how nouns are grouped into classes. Although Basili et al. argue in favor of domain-dependent classification, they point out the obvious drawback of having to reclassify the corpus for new application domains.

[12] According to the American Heritage Dictionary, a repository is "a place where things may be put for safekeeping," whereas a container is "something, as a box or barrel, in which material is held or carried." In WordNet neither class subsumes the other.

53 An alternativeapproach might be to use a broad- Statistical Language Modelling coveragetaxonomy such as WordNetrather than A central problem in statistical language modelling hand-defined,hand-assigned semantic tags. Such an as used in , for example -- is approachwould represent a compromisebetween the accurate prediction of the next word on the basis of overlyfine-grained distinctions found when consid- its prior context. Abstractly, the goal is to closely eringindividual words and theextremely high-level approximate Pr(X, IX~-I), where -1 is an ab- classificationrequired if semantictagging is to be breviation for X1, ¯ ¯., X,_ 1 and each Xi ranges over donemanually. Should the classification still prove words in the . In practice, these estimates toofine grained, or unsuitablefor the application do- are frequently calculated by computing main,one could(a) ezcludefrom consideration all taxonomicclasses other than those considered rele- Pr(X.lX~ -1) ~ Pr(X. lf(X’~-l)), vantto the task,and (b) construcl, as disjunctions where f is a function that partitions the possi- of taxonomicclasses, a small set of domain-specific classesnot founddirectly in the taxonomy. ble prior contexts into equivalence classes. For the commontrigram , for example, [Grefenstetteand Hearst.,1992] have recently f(z~-1) "- Xn_2~n_l, SO that presentedan interestingcombination of knowledge- based and statisticalmethods concerning noun Pr(XnIX~-I) ~ Pr(XnlXn_~X,_I)). 
classes. They explore the possibility of combining a corpus-based lexical discovery technique (using pattern-matching to identify hyponymy relations) with a statistically-based word similarity measure, in order to discover new instances of lexical relations for extending lexical hierarchies such as WordNet.

Further Applications

In this section, I briefly consider three areas in statistical natural language processing for which a class-based approach of the kind proposed here might prove useful. Of these, only the last represents work in progress; the first two are as yet unexplored.

Computational Lexicography and Psycholexicology

For the purposes of lexicography, the work presented here can be considered a generalization of [Church and Hanks, 1989], which describes the use of a lexical association score in the identification of semantic classes and the analysis of concordances. The discussion of the object classes discovered for open in Section 4 suggests that corpus-based statistics and broad-coverage lexical taxonomies represent a powerful combination, despite the fact that neither the corpus data nor the taxonomy are free of error. To the extent that WordNet respects the lexicographer's intuitions about taxonomy, the classes uncovered as statistically relevant will provide important information about usage that might otherwise be overlooked.

In addition, the experience of applying WordNet to large corpora may be useful to researchers in psycholexicology. WordNet was designed to "expose [psycholinguistic] hypotheses to the full range of the common vocabulary," and it is likely that class-based statistics of the kind described here, calculated using large quantities of naturally occurring text, can contribute to that enterprise by uncovering places in the hierarchy where the linguist's view of lexical taxonomy diverges from common usage.

Language Modelling

In addition, probability estimates often make use of information from several different equivalence classes, either by "backing off" or by smoothing; for example, the smoothing technique of interpolated estimation [Bahl et al., 1983; Jelinek and Mercer, 1980] estimates

    Pr(x_n | x_1^{n-1}) ≈ Σ_i λ_i Pr(x_n | f_i(x_1^{n-1})),

where each f_i determines a different equivalence class, and the λ_i's weight the equivalence classes' contributions.

Several recent approaches to language modelling have extended the notion of equivalence class to include more than just the classes formed by "forgetting" all but the previous few words (e.g. equation (5)). [Bahl et al., 1989] proposes using a binary decision tree to classify prior contexts: since each node of a decision tree constitutes an equivalence class, interpolated estimation can be used to combine each value of f_i computed along the path from any leaf node to the root of the tree. [Brown et al., 1990] suggests a hierarchical partitioning of the vocabulary into equivalence classes on the basis of similar distributions: models can then be estimated by defining f(x_1^{n-1}) = c_{n-2} c_{n-1}, where each c_j is the equivalence class of x_j.

These approaches are strictly "knowledge-free," in the sense that they eschew hand-constructed information about lexical classes. However, the results of Section 4 suggest that, by using a broad-coverage lexical database like WordNet, statistical techniques can take advantage of a conceptual space structured by a hand-constructed taxonomy. A disadvantage of the purely statistical classification methods proposed in [Bahl et al., 1989] and [Brown et al., 1990] is their difficulty in classifying a word not encountered in the training data; in addition, such methods have no remedy if a word's distribution in the training data differs substantially from its distribution in the test data. These difficulties might be alleviated by an integrated approach that takes advantage of taxonomic as well as purely distributional information, treating concepts in the taxonomy as just another kind of equivalence class.
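Treating taxonomic concepts as just another kind of equivalence class can be made concrete with interpolated estimation. The following sketch combines a unigram estimate, a word-based bigram estimate, and a class-based estimate in the spirit of [Brown et al., 1990]; the toy corpus, the class assignments, and the weights λ_i are invented for illustration, and in practice the weights would be trained (e.g., on held-out data).

```python
from collections import Counter

# Toy corpus and a hand-assigned word classing standing in for a
# taxonomy; all data here are invented for illustration.
corpus = "john ate the meal mary ate the apple john opened the door".split()
word_class = {"john": "PERSON", "mary": "PERSON", "meal": "FOOD",
              "apple": "FOOD", "door": "ARTIFACT", "ate": "VERB",
              "opened": "VERB", "the": "DET"}

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
# Class-based counts: the history is reduced to the equivalence class
# of the previous word, in the spirit of f(x_1^{n-1}) = c_{n-1}.
class_bigrams = Counter((word_class[x], y) for x, y in zip(corpus, corpus[1:]))
class_hist = Counter(word_class[x] for x in corpus[:-1])

def p_unigram(w):
    # f_1 forgets the history entirely.
    return unigrams[w] / len(corpus)

def p_bigram(w, prev):
    # f_2 keeps only the previous word.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_class(w, prev):
    # f_3 keeps only the class of the previous word.
    c = word_class[prev]
    return class_bigrams[(c, w)] / class_hist[c] if class_hist[c] else 0.0

def p_interpolated(w, prev, lambdas=(0.2, 0.5, 0.3)):
    """Pr(w | history) ~ sum_i lambda_i * Pr(w | f_i(history))."""
    l1, l2, l3 = lambdas  # hand-picked here; trained in practice
    return l1 * p_unigram(w) + l2 * p_bigram(w, prev) + l3 * p_class(w, prev)

# With these counts, Pr(the | ate) interpolates a unigram estimate (3/12),
# a bigram estimate (2/2), and a class-based estimate (3/3).
```

Because each component is itself a distribution over the vocabulary (for histories observed in training), any convex combination of them is too, so the taxonomic component's weight can be raised or lowered without breaking the model.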

The method of interpolated estimation provides one straightforward way to effect the combination: the contribution of the taxonomic component will be large or small, as compared to other kinds of equivalence class, according to how much its predictions contribute to the accuracy of the language model as a whole.

Linguistic Investigation

Diathesis alternations are variations in the way that a verb syntactically expresses its arguments. For example, (1) shows an instance of the indefinite object alternation,

(1) a. John ate the meal.
    b. John ate.
(2) *The meal ate.

and example (3) shows an instance of the causative/inchoative alternation [Levin, 1989].

(3) a. John opened the door.
    b. The door opened.
(4) *John opened.

Such phenomena are of particular interest in the study of how children learn the semantic and syntactic properties of verbs, because they stand at the border of syntax and lexical semantics. There are numerous possible explanations for why verbs can and cannot appear in particular alternations, ranging from shared semantic properties of verbs within a class, to pragmatic factors, to "lexical idiosyncrasy." Statistical techniques like the one described in this paper may be useful in investigating relationships between verbs and their arguments, with the goal of contributing data to the study of diathesis alternations, and, ideally, in constructing a computational model of verb acquisition.

One interesting outcome of the experiment described in Section 4 is that the verbs participating in "implicit nominal object" diathesis alternations[13] associate more strongly with their "prototypical" object classes than do verbs for which implicit objects are disallowed. Preliminary results, in fact, show a statistically significant difference between the two groups. This suggests the following interesting question: might shared information-theoretic properties of verbs play a role in their acquisition, in the same way that shared semantic properties do?

[13] The indefinite object alternation [Levin, 1989] and the specified object alternation [Cote, 1992] -- roughly, verbs for which the direct object is understood when omitted.

Conclusions

In this paper, I have suggested a method for investigating statistical relationships between words and word classes. The approach comprises two parts: a (straightforward) application of word-based mutual information to sets of words, and a proposal for organizing those sets according to a knowledge-based conceptual taxonomy. The experiment described here demonstrates the viability of the technique, and shows how the class-based approach is an improvement on statistics that use simple lexical association.

References

Lalit R. Bahl, Frederick Jelinek, and Robert L. Mercer. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):179-190, March 1983.

Lalit R. Bahl, Peter F. Brown, Peter V. de Souza, and Robert L. Mercer. A tree-based statistical language model for natural language speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 37:1001-1008, July 1989.

Roberto Basili, Maria Teresa Pazienza, and Paola Velardi. Computational lexicons: the neat examples and the odd exemplars. In Third Conference on Applied Natural Language Processing, pages 96-103. Association for Computational Linguistics, March 1992.

Richard Beckwith, Christiane Fellbaum, Derek Gross, and George Miller. WordNet: A lexical database organized on psycholinguistic principles. In Uri Zernik, editor, Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pages 211-232. Erlbaum, 1991.

Eric Brill, D. Magerman, M. Marcus, and B. Santorini. Deducing linguistic structure from the statistics of large corpora. In DARPA Speech and Natural Language Workshop, 1990.

Peter F. Brown, Vincent J. Della Pietra, Peter V. de Souza, and Robert L. Mercer. Class-based n-gram models of natural language. In Proceedings of the IBM Natural Language ITL, pages 283-298, Paris, France, March 1990.

Kenneth Church and Patrick Hanks. Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the ACL, 1989.

K. Church, W. Gale, P. Hanks, and D. Hindle. Parsing, word associations and typical predicate-argument relations. In Proceedings of the International Workshop on Parsing Technologies, August 1989.

Kenneth Church, William Gale, Patrick Hanks, and Donald Hindle. Using statistics in lexical analysis. In Uri Zernik, editor, Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pages 116-164. Erlbaum, 1991.

Sharon Cote. Discourse functions of two types of null objects in English, January 1992. Presented at the 66th Annual Meeting of the Linguistic Society of America, Philadelphia, PA.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley, 1991.

Jeffrey Elman. Finding structure in time. Cognitive Science, 14:179-211, 1990.

W. Francis and H. Kucera. Frequency Analysis of English Usage. Houghton Mifflin Co., New York, 1982.

William Gale, Kenneth Church, and David Yarowsky. A method for disambiguating word senses in a large corpus. Technical report, AT&T Bell Laboratories, 1992.

Gregory Grefenstette and Marti Hearst. A method for refining automatically-discovered lexical relations:

Combining weak techniques for stronger results. In AAAI Workshop on Statistically-based NLP Techniques, San Jose, July 1992.

Donald Hindle. Noun classification from predicate-argument structures. In Proceedings of the 28th Annual Meeting of the ACL, 1990.

Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, Amsterdam, The Netherlands, May 1980. North-Holland.

Beth Levin. Towards a lexical organization of English verbs. Technical report, Dept. of Linguistics, Northwestern University, November 1989.

Brian MacWhinney. The CHILDES project: tools for analyzing talk. Erlbaum, 1991.

George Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. Five papers on WordNet. CSL Report 43, Cognitive Science Laboratory, Princeton University, July 1990.

George Miller. WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 1990. Special Issue.

Ronald Rosenfeld and Xuedong Huang. Improvements in stochastic language modelling. In Mitch Marcus, editor, Fifth DARPA Workshop on Speech and Natural Language, Arden House Conference Center, Harriman, NY, February 1992.

David Yarowsky. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proceedings of COLING-92, pages 454-460, Nantes, France, July 1992.
