Towards a Representation of Idioms in WordNet

Christiane Fellbaum
Cognitive Science Laboratory, Princeton University
Rider University
Princeton, New Jersey, USA

1 Introduction

WordNet (Miller, 1995), (Fellbaum, 1998) is perhaps the most widely used electronic dictionary of English and serves as the lexicon for a variety of different NLP applications including Information Retrieval (IR), Word Sense Disambiguation (WSD), and Machine Translation (MT). Despite WordNet's large coverage, which comprises some 100,000 concepts lexicalized by approximately 120,000 word forms (strings) and is comparable to that of a collegiate dictionary, it contains relatively little figurative language. WordNet includes a number of multi-word strings, such as phrasal verbs, but many idiomatic verb phrases like smell a rat, know the ropes, and eat humble pie are missing. Idioms and metaphors abound in everyday language and are found in texts spanning many genres (see, e.g., (Jackendoff, 1997) for a numerical estimate of the frequency of idioms and fixed expressions).

Clearly, a dictionary that includes extended senses of words and phrases is likely to yield more successful NLP applications. On the one hand, no system wants to retrieve the string bucket from the idiom kick the bucket. On the other hand, MT and WSD efforts need to distinguish the sense of ropes in phrases like know/learn/teach someone the ropes from the sense meaning "strong cords"; selecting the latter sense in any of the idiomatic phrases leads to failure. An IR query is likely to be interested only in the "strong cord" reading. When this sense is to be retrieved with the aid of a lexicon intended for multiple applications, the figurative sense must be successfully recognized and excluded from a text that may contain instances of the string ropes with both meanings.

In this paper, we consider the possibility of extending WordNet to accommodate figurative meanings in the English lexicon. While much has been written on figurative language, there is no agreement on the boundary between literal and non-literal language, see e.g. (Moon, 1986). Criteria that are commonly accepted include semantic non-compositionality and syntactic constraints on internal modification (such as adjective and adverb insertion) and movement transformations. Our purpose here is not to attempt a clear delimitation or definition of non-literal language, but to examine how extended senses of words and phrases from different syntactic and lexical categories (or conforming to none of the standard categories) are compatible with the network structure of a relational lexicon like WordNet and its particular way of representing words and concepts. Our discussion will focus on, but not be limited to, idiomatic verb phrases.

2 A simple Classification

An inspection of dictionary sources such as (Boatner et al., 1975) suggests a three-fold distinction among idioms for our purposes.

3 Constructions

First, some idiomatic constructions are simply too complex to be integrated into WordNet and must be excluded at this point. We have in mind constructions of the kind studied by (Fillmore et al., 1988) and (Jackendoff, 1997). Examples are the more the merrier and she can't write a letter, let alone a novel. These structures comprise discontinuous constituents and chunks that are governed by special syntactic and semantic rules. Thus, the X-er the Y-er allows the insertion of a wide variety of adjectives. Fillmore et al. discuss let alone and show that its syntactic properties require an amazing amount of description of facts absent from the standard
grammar. A full account of these constructions goes far beyond the lexical level, and therefore we need to exclude them, at least for now, in a database like WordNet that does not include much syntax and whose relational semantics cannot accommodate the kind of semantic facts observed by Fillmore et al. and Jackendoff.

4 Idioms as a kind of polysemy

By contrast, the second kind of idiomatic structure is unproblematic for WordNet. WordNet contains not only simple verbs and nouns but also more complex verb and noun phrases like show the way and academic gown. Strings like stepping stone, kick the bucket, hit the bottle, and come out of the closet therefore correspond to categories already represented in the database, and can be included when they are considered as particular manifestations of polysemy. Polysemy in WordNet is represented by membership of the polysemous string in different synonym sets; synonym sets (synsets) in WordNet represent concepts that are lexicalized by one or more strings (synonyms). In other words, the synsets contain different word forms with the same meaning, and a word form with more than one meaning appears in as many different synsets as it has meanings.

For example, the string fish occurs as a verb in two different synsets, and thus has two distinct senses in WordNet. One expresses the concept "catch, or try to catch, seafood;" the other sense is "seek indirectly," as in the phrases fish for compliments and fish for information. Note that such a representation does not attempt to answer the question of whether the second sense of fish is indeed an "extended" one, but simply treats the two as different meanings of the same word form.

Figurative senses can be seen as homophones rather than polysemes in that there is no discernible relation between the "literal" and the "extended" senses. WordNet does not formally distinguish between polysemy and homophony but treats these two phenomena of multiple meanings alike under the label of polysemy. In all cases of polysemy, membership in two different synsets entails a different location in the semantic network and relatedness to distinct concepts for each sense. Thus, the first sense of fish is a subordinate of catch and is further related to more semantically specified senses (troponyms) including flyfish, net fish, trawl, and shrimp. The second, arguably extended, sense has as its superordinate concept the synset containing the strings search and look for. The different locations in the network of the two senses of fish, together with the difference in the kinds of noun objects they select, are the sort of information exploited in NLP applications, and they will suffice in most cases to distinguish the two senses where the senses are homophones rather than polysemes.

Some phrases consisting of more than one word can be treated in a similar manner. For example, the idiomatic verb phrases kick the bucket, chew the fat, and take a powder can be considered as single units. Their constituents never occur in an order different from the cited one because these idioms are syntactically completely frozen. They do not tolerate the insertion of an adjective or adverb, nor do they undergo passivization, clefting, or any movement transformation that would change the order of the individual strings.[1]

[1] Only the verb changes in that it shows the usual inflectional endings; this should not pose a major problem for English idioms, where the verb is virtually always the first constituent in a Verb Phrase (VP) idiom and can thus be easily recognized.

The system therefore needs only to recognize the string that is part of the lexicon. If the strings kick, bucket, powder, fat, etc., occur outside of the idiom order, they do not receive the idiomatic interpretation and must be considered as carrying different meanings.

Some compound nouns have extended senses as well, such as stepping stone, straight arrow, and square shooter. We classify these as instances of non-literal language, because the head (the rightmost noun) is not the superordinate concept for the figurative reading: a stepping stone is not a kind of stone; a straight arrow is not a type of arrow, and a square shooter is not a specific shooter. By contrast, nouns like limestone, gravestone, and gemstone, and sharpshooter and trapshooter are linked to their superordinate senses, one or more senses of stone and shooter, respectively; similarly, a broad arrow is a subordinate of arrow. Many NLP applications using WordNet for determining discourse coherence, finding malapropisms (Hirst and St-Onge, 1998), and word sense disambiguation (Voorhees, 1998), (Leacock and Chodorow, 1998) identify related word senses by means of links such as those between super- and subordinates. When searching a text, such systems could easily recognize (and discard as potentially related senses) figurative compounds such as stepping stone and straight shooter because these are not linked to nouns corresponding to their heads.[2]

[2] In this respect, idiomatic compounds resemble exocentric compounds like low-life and sea~ata, which are not kinds of lives or latum, either, nor are they found in the vicinity of these concepts in the semantic net.

Moreover, literal and figurative senses are often in very different WordNet files: an arrow (and its hyponyms broad arrow and butt shaft) are classified as noun.artifacts, while a straight arrow is found in the noun.person file.

Frozen VP idioms and metaphoric noun compounds can be integrated into the WordNet database and distinguished from literally referring expressions in many cases. But much of what is commonly considered to be figurative language presents more serious problems for a semantic network like WordNet and applications relying on its particular design. The remainder of this paper will be devoted to a discussion of the third category of idioms, which includes verb phrases like learn the ropes and hide one's light under a bushel. These cannot automatically be integrated into WordNet, but we offer some proposals for adding them to the lexicon.

5 Some challenging idioms for WordNet

The integration into WordNet of many idioms that do not fall into one of the categories discussed above is problematic for a variety of reasons.

6 Formal problems

First, there are formal problems. Some idiom strings have surface forms that do not conform to any of the syntactic categories included in WordNet. For example, many idioms must occur with a negation: the VP give a hoot loses its (figurative) meaning in the absence of negation; the same is true for the VP hold a candle to. The negation must therefore be considered part of the idioms. But a verb phrase headed by negation is not a constituent recognized in WordNet.

Consider also the string eat one's cake and have it, too: here, two verb phrases are adjoined and are often followed by an adverb. Moreover, the second clause contains a pronoun coreferent with the noun in the first clause. Again, such a string does not fit in with WordNet's entries.

Some idioms are entire sentences. Wild horses could not make me do that and the cat's got your tongue are not compatible with any of WordNet's noun, verb, adjective, or adverb components. WordNet does not contain sentences, and at present we see no way of integrating these into the lexical database. The problem should be addressed in the future, because an NLP system would simply attempt to treat each constituent in these idioms separately, with undesirable consequences.

In some cases, idioms whose syntactic shape does not correspond to any of the categories in WordNet could be accommodated nevertheless when they are synonymous with strings that are represented in an existing synset. For example, the negation-headed phrase not in a pig's eye and the clauses when hell freezes over and when the cows come home are all synonymous with never, which is included among WordNet's adverbs. If such strings are completely frozen, as they tend to be, they can be included as synonymous members of existing WordNet synsets and the fact that they do not conform to any of WordNet's syntactic categories can be ignored. Such idioms do not pose problems for automatic processing because they do not admit of any phrase-internal variation or modification.

Another formal (syntactic) problem pertains to the fact that the fixed parts of many VP idioms are not continuous. For example, a number of expressions contain nouns that resemble inalienable possessions, such as body parts, and a possessive adjective that is bound to the subject. Examples are hide one's light under a bushel, blow one's stack, and flip one's wig. In other idioms with a similar structure, the possessive is not bound to the subject but refers to another noun (got someone's number). And expressions like cook one's goose allow for both bound and unbound genitives.
These idioms cannot be treated as single strings because the genitive slot can be filled by any of the possessive adjectives, or by a noun in the case of the unbound genitive. One solution would be to enter these strings into the lexicon with a placeholder, such as a metacharacter, in place of the genitive. This would make for a somewhat infelicitous entry. But a rule could be added to a preprocessor for a syntactic tagger that allowed the placeholder to be replaced by either a pronoun from a finite list (for the bound cases) or any noun from WordNet (for the unbound cases); the preprocessor would then be able to recognize the idiom as a unit and match the WordNet entry and the actual string. Currently, we do not have a preprocessor that is able to recognize discontinuous constituents, but given the large number of VP idioms and their frequency in the language, the development of such a tool seems desirable.[3]

[3] A related phenomenon is that of phrasal verbs, many of which allow particle movement. In the cases where the verb head and the particle are not contiguous, they cannot currently be adjoined by the preprocessor and they are therefore not matched to an entry in WordNet.

7 What kinds of concepts are these?

In the previous section, we considered idioms whose syntactic form does not comply with any of the categories N(P), V(P), Adj(P), or Adverbial(P) represented in WordNet or whose syntax poses problems for the creation of a neat dictionary entry. However, such idioms could easily be added to the lexical database when they are synonymous with strings that fit into WordNet's design and organization. But many such syntactically idiosyncratic idiom strings raise a second problem having to do with their conceptual-semantic rather than their syntactic nature. They express concepts that cannot be fitted into WordNet's web structure either as members of existing synsets or as independent concepts, because there are no other lexicalized concepts to which they can be linked via any of the WordNet relations. In fact, if one examines idioms and their glosses in an idiom dictionary, one quickly realizes that almost all idioms express complex concepts that cannot be paraphrased by means of any of the standard lexical or syntactic categories. Consider such examples as fish or cut bait, cook one's/somebody's goose, and drown one's sorrows/troubles. These idioms carry a lot of highly specific semantic information that would probably get lost if they were integrated into WordNet and attached to more general concepts.

The problems for WordNet posed by syntactically or semantically idiosyncratic idioms would be reduced if these could be broken up, that is, if the individual content words in the idioms could be treated as referring expressions and be assigned meanings that are similar to concepts already represented in the lexicon. Some traditional dictionaries decompose a number of such idioms and attempt to give an interpretation to their individual parts. This may seem justifiable particularly in cases where the idioms are syntactically variable, indicating that speakers assign meanings to some of their components. For example, the American Heritage Dictionary defines one sense of the noun ice as "extreme unfriendliness or reserve." This entry seems motivated by the apparent semantic transparency of the noun (in contrast to strings like bucket in kick the bucket, which seems to have no referent at all, let alone a transparent one). But synsets of the kind ice, extreme unfriendliness or reserve seem undesirable for a computationally viable dictionary like WordNet, because ice cannot be used freely and compositionally with the proposed meanings. This is evident in sentences like the following:

(a) I felt/resented his unfriendliness/reserve/*ice.
(b) His unfriendliness/reserve/*ice melted away.
(c) Our laughter broke the unfriendliness/reserve/ice.

A language generation system (or a learner of English) relying on WordNet's lexicon could not be blocked from producing the ungrammatical sentences above if it exploits the close similarity and usage of the members of the synset. Moreover, automatic attempts at word sense disambiguation that rely on syntactic taggers could probably not identify the correct sense of ice in this phrase, because they could not recognize that the noun is a part of an idiom if the dictionary entry contains this noun in isolation, outside of its idiomatic context. Only when one entry for ice lists the specific environment (break and the definite determiner) can a program recognize the idiom and assign the proper meaning.

Consider a second example. The American Heritage Dictionary contains a sense of ropes that is glossed as "specialized procedures or details." This sense of ropes is the one in the expressions know/learn/get/teach the ropes. To assume a compositional reading here seems more justified than in the case of ice, because this idiom is more flexible than break the ice and can undergo some internal modification as well as passivization (he never learnt the ropes/he taught Fred the ropes/Fred was taught the ropes). Moreover, ropes co-occurs with more verbs than just one. In fact, the verbs for which it can serve as an argument are compatible with the meaning assigned to ropes by the American Heritage Dictionary. A word sense disambiguation system that relied on the semantics of the contexts of the ambiguous word (such as the verbs a noun co-occurs with) would probably choose the correct sense of ropes, because the contexts of "specialized procedures" or "details" do not seem to overlap with the contexts in which ropes is found with the sense of "strong cords."

Yet despite their shared verb contexts, the distribution of ropes is far narrower than that of specialized procedures or details. Again, a language generation system or a learner of English might overgenerate and produce incomprehensible sentences like I forgot the ropes or Tell me the ropes. Therefore, an optimal solution might be to enter the idiom as a string but with a placeholder instead of the verb; a separate rule in the lexicon would list the verbs that are compatible with the idiomatic reading of the string.

The proposed solution for idioms like teach/learn/get the ropes and those that contain a possessive genitive might suggest a huge amount of work. However, a survey of English idioms suggests that most are frozen and could therefore simply be entered as entire strings, without the need for specifying a list of selected verbs.

Another type of VP idiom that does not readily fit into WordNet is that whose meaning can be glossed as be or become Adj. These idioms have the form of a VP but express states: hide one's light under a bushel and hold one's tongue mean "be modest" and "be quiet," respectively; flip one's wig, blow one's stack/a fuse, and hit the roof/ceiling all mean "become angry," and get the axe means be fired/dismissed. Similarly, the phrase one's heart goes out (to) can be glossed by means of the verb feel and the adjective phrase "sorry or sympathetic (for)." Such idioms pose a problem for integration into WordNet, not because of their form but because of the kinds of concepts they express. In WordNet, verbs (including copular verbs) and adjectives are strictly separated because they express distinct kinds of concepts. This separation is of course desirable and even necessary when one deals with non-idiomatic language, where the meaning of a phrase or sentence is composed of the meanings of its individual parts. Copular or copula-like verbs like be and feel combine with a large number of adjectives and there is no point in entering specific combinations into a lexicon.[4]

[4] There are some de-adjectival verbs that express specific concepts with meanings "be or become Adjective," such as pale or redden. Idioms that express the same concepts as such verbs could be added as synonyms, but these cases are very few.

While the separation of verbs and the adjectives they select accounts for the large number of possible combinations allowed in the language, it also means that there exist no concepts like "feel sorry/sympathetic (for)" or "become angry" in WordNet, and idioms like one's heart goes out (to) and hit the roof are presently excluded from the lexicon. Yet these strings need to be added if the lexicon is to serve NLP applications of real texts, where idiomatic language is pervasive. Expressions of the kind listed above can simply be added as subordinates of be without causing a change in the structure of the lexicon. They would stretch the meaning of troponymy, the manner relation that organizes the verb lexicon, in that it is somewhat odd to state that "to be angry is to be in some manner." However, this seems to be the only way to accommodate such idioms, which express concepts of a kind not found in the literal language.

8 Summary and conclusions

We considered the nature of idiomatic expressions in the light of their potential integration into WordNet. Some idioms pose formal, syntactic problems and express complex concepts that are not expressible by means of the standard lexical and syntactic categories, including those represented in WordNet. Other idioms are formally unremarkable but express concepts
that cannot easily be connected to any of the concepts in the semantic network. Perhaps one function of idioms (and one reason for their frequency and their persistence over time) is to provide for the pre-coded lexicalized expression of complex concepts and ideas that do not exist as units in the language and would have to be composed by speakers. Their frequent occurrence in the language seems to show that many idioms refer to salient concepts and must be considered an important part of the lexicon. We have made some proposals for their integration into WordNet that should benefit in particular the kinds of NLP applications that rely on this lexical resource.

References

Maxine Tull Boatner, John Edward Gates and Adam Makkai. A dictionary of American idioms. Barron's Educational Series, Woodbury, NY, 1975.

Christiane Fellbaum. WordNet: An electronic lexical database. MIT Press, Cambridge, MA, 1998.

Charles Fillmore, Paul Kay and Catherine O'Connor. Regularity and idiomaticity in grammatical construction. In Language, 64:501-568, 1988.

Graeme Hirst and David St-Onge. Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms. In WordNet: An electronic lexical database. Christiane Fellbaum (ed.), MIT Press, Cambridge, MA, 1998.

Ray Jackendoff. The Boundaries of the Lexicon. In Idioms: Structural and Psychological Perspectives, M. Everaert, E. J. van den Linden, A. Schenk, and R. Schreuder (Eds.), Hillsdale, NJ: Erlbaum, 1995.

Ray Jackendoff. Twistin' the night away. In Language, No 73:534-559, 1997.

Claudia Leacock and Martin Chodorow. Combining Local Context and WordNet Similarity for Word Sense Identification. In WordNet: An electronic lexical database. Christiane Fellbaum (ed.), MIT Press, Cambridge, MA, 1998.

George A. Miller. WordNet: a lexical database for English. In Communications of the ACM, Vol.38, No.11:39-41, 1995.

Rosamund Moon. "Time" and idioms. In Proceedings of the EURALEX International Congress, Snell-Hornby M. (Ed.), Francke Verlag, 107-160, 1986.

Ellen Voorhees. Using WordNet for Text Retrieval. In WordNet: An electronic lexical database. Christiane Fellbaum (ed.), MIT Press, Cambridge, MA, 1998.