
Adding Dense, Weighted Connections to WORDNET

Jordan Boyd-Graber, Christiane Fellbaum, Daniel Osherson, and Robert Schapire
Princeton University

October 9, 2005

Abstract WORDNET, a ubiquitous tool for natural language processing, suffers from sparsity of connections between its component concepts (synsets). Through the use of human annotators, a subset of the connections between 1000 hand-chosen synsets was assigned a value of “evocation” representing how much the first concept brings to mind the second. These data, along with existing similarity measures, constitute the basis of a method for predicting evocation between previously unrated pairs.

Submission Type: Long Article

Topic Areas: Extending WORDNET

Author of Record: Jordan Boyd-Graber, [email protected]

Under consideration for other conferences (specify)? None


1 Introduction

WORDNET is a large electronic lexical database of English. Originally conceived as a full-scale model of human semantic organization, it was quickly embraced by the Natural Language Processing (NLP) community, a development that guided its subsequent growth and design. WORDNET has become the lexical database of choice for NLP; Kilgarriff (Kilgarriff, 2000) notes that "not using it requires explanation and justification." WORDNET's popularity is largely due to its free public availability and its broad coverage.

WORDNET already has a rich structure connecting its component synonym sets (synsets) to each other. Noun synsets are interlinked by means of hyponymy, the super-subordinate or is-a relation, as exemplified by the pair [poodle]-[dog].1 Meronymy, the part-whole or has-a relation, links noun synsets like [tire] and [car] (Miller, 1998). Verb synsets are connected by a variety of lexical entailment pointers that express manner elaborations [walk]-[limp], temporal relations [compete]-[win], and causation [show]-[see] (Fellbaum, 1998). The links among the synsets structure the noun and verb lexicons into hierarchies, with noun hierarchies being considerably deeper than those for verbs.

WORDNET appeals to the NLP community because these semantic relations can be exploited for word sense disambiguation (WSD), the primary barrier preventing the development of practical information retrieval, summarization, and language generation systems. Although most word forms in English are monosemous, the most frequently occurring ones are highly polysemous. Resolving the ambiguity of a polysemous word in a context can be achieved by distinguishing the multiple senses in terms of their links to other words. For example, the word form [club] can be disambiguated by an automatic system that considers the superordinates of the different synsets in which this word form occurs: [association], [playing card], and [stick]. It has also been noted that directly antonymous adjectives share the same contexts (Deese, 1964); exploiting this fact can help to disambiguate highly polysemous adjectives.

1 Throughout this article we will follow the convention of using a single word enclosed in square brackets to denote a synset. Thus, [dog] refers not just to the word dog but to the set – when rendered in its entirety – consisting of {dog, domestic dog, canis familiaris}.

1.1 Shortcomings of WORDNET

Statistical disambiguation methods, relying on cooccurrence patterns, can discriminate among word senses rather well, but not well enough for the level of text understanding that is desired (Schütze, 1998); exploiting sense-aware resources, such as WORDNET, is also insufficient (McCarthy et al., 2004). To improve such methods, large manually tagged training corpora are needed to serve as "gold standards." Manual tagging, however, is time-consuming and expensive, so large semantically annotated corpora do not exist. Moreover, people have trouble selecting the best-matching sense for a given word in context. (Fellbaum and Grabowski, 1997) found that people agreed with a manually created gold standard on average 74% of the time, with higher disagreement rates for more polysemous words and for verbs as compared to nouns. Confidence scores mirrored the agreement rates.

In the absence of large corpora that are manually and reliably disambiguated, the internal structure of WORDNET can be exploited to help discriminate senses, which are represented in terms of relationships to other senses. But because WORDNET's network is relatively sparse, such WSD methods achieve only limited results.

In order to move beyond these meager beginnings, one must be able to use the entire context of a word to disambiguate it; instead of looking at only neighboring nouns, one must be able to compare the relationship between any two words via a complete comparison. Moreover, the character of these comparisons must be quantitative in nature. This paper serves as a framework for the addition of a complete, directed, and weighted relationship to WORDNET. As motivation for this addition, we now discuss three fundamental limitations of WORDNET's network.

No cross-part-of-speech links  WORDNET consists of four distinct semantic networks, one for each of the major parts of speech. There are no cross-part-of-speech links.2 The lack of syntagmatic relations means that no connection can be made between entities (expressed by nouns) and their attributes (encoded by adjectives); similarly, events (referred to by verbs) are not linked to the entities with which they are characteristically associated. For example, the intuitive connections among such concepts as [traffic], [congested], and [stop] are not coded in WORDNET.

Too few relations  WORDNET's potential is limited by its small number of relations. Increasing the number of arcs connecting a given synset to other synsets not only refines the relationship between that synset and other meanings but also allows a wider range of contexts (that might contain a newly connected word) to help disambiguation. (Mel'cuk and Zholkovsky, 1998) propose several dozen lexical and semantic relations not included in WORDNET, such as "actor" ([book]-[writer]) and "instrument" ([knife]-[cut]). But many associations among words and synsets cannot be represented by clearly labeled arcs. For example, no relation proposed so far accounts for the association between pairs like [tulip] and [Holland], [sweater] and [wool], and [axe] and [tree]. It is easy to detect a relation between the members of these pairs, but the relations cannot be formulated as easily as hyponymy or meronymy. Similarly, the association between [chopstick] and [Chinese restaurant] seems strong, but this relation requires more than the kind of simple label commonly used by ontologists.

Some users of WORDNET have tried to make up for the lack of relations by exploiting the definition and/or the illustrative sentence that accompany each synset. For example, in an effort to increase the internal connectivity of WORDNET, Mihalcea and Moldovan (Mihalcea and Moldovan, 2001) automatically link each content word in WORDNET's definitions to the appropriate synset; the Princeton WORDNET team is currently performing the same task manually. But even this significant increase in arcs leaves many synsets unconnected; moreover, it duplicates some of the information already contained in WORDNET, as in the many cases where the definition contains a monosemous superordinate. Another way to link words and synsets across parts of speech is to assign them to topical domains, as in (Magnini and Cavaglia, 2000). WORDNET contains a number of such links, but the domain labels are not a well-structured set. In any case, domain labels cannot account for the association of pairs like [Holland] and [tulip]. In sum, attempts to make WORDNET more informative by increasing its connectivity have met with limited success.

2 WORDNET does contain arcs among many words from different syntactic categories that are semantically and morphologically related, such as [operate], [operator], and [operation] (Fellbaum and Miller, 2003). However, semantically related words like operation, perform, and dangerous are not interconnected in this way, as they do not share the same stem.
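The superordinate-based disambiguation strategy described above can be made concrete with a short sketch. The miniature sense inventory below is a hand-built stand-in for WORDNET (the sense names and hypernym links are our illustrative assumptions, not actual WORDNET data):

```python
# Toy sense inventory: each sense of "club" paired with its superordinate
# (hypernym). These entries are illustrative stand-ins for WORDNET synsets.
SENSES = {
    "club.n.1": "association",    # a formal organization of people
    "club.n.2": "playing_card",   # one of the four card suits
    "club.n.3": "stick",          # a heavy staff used as a weapon
}

def disambiguate(word_senses, context):
    """Pick the sense whose superordinate appears among the context words."""
    for sense, hypernym in word_senses.items():
        if hypernym in context:
            return sense
    return None

# A card-game context selects the playing-card sense of "club":
print(disambiguate(SENSES, {"ace", "spade", "playing_card", "deal"}))
```

Even this caricature shows why sparse links hurt: if the context words never coincide with one of the few stored superordinates, the method has nothing to fall back on.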

No weighted arcs  A third shortcoming of WORDNET is that its links are qualitative rather than quantitative. It is intuitively clear that the semantic distance between the members of hierarchically related pairs is not always the same. Thus, the synset [run] is a subordinate of [move], and [jog] is a subordinate of [run]. But [run] and [jog] are semantically much closer than [run] and [move]. WORDNET currently does not reflect this difference and ignores the fact that words – labels attached to concepts – are not evenly distributed throughout the semantic space covered by a language. This limitation of WORDNET is compounded in NLP applications that rely on semantic distance measures where edges are counted, e.g., (Jiang and Conrath, 1997) and (Leacock and Chodorow, 1998). Recall that adjectives in WORDNET are organized into pairs of direct antonyms (e.g., long-short) and that each member of such a pair is linked to a number of semantically similar adjectives such as [lengthy] and [elongated], and [clipped] and [telescoped], respectively. The label "semantically similar," however, hides a broad scale of semantic relatedness, as the examples indicate. Making these similarity differences explicit would greatly improve WORDNET's content and usefulness for a variety of NLP applications.

2 An Enrichment of WORDNET

To address these shortcomings, we are working to enhance WORDNET by adding a radically different kind of information. The idea is to add quantified, oriented arcs between pairs of synsets, e.g., from {car, auto} to {road, route}, from {buy, purchase} to {shop, store}, from {red, crimson, scarlet} to {fire, flame}, and also in the opposite direction. Each of these arcs will bear a number corresponding to the strength of the relationship. We chose to use the concept of evocation – how much one concept evokes or brings to mind the other – to model the relationships between synsets. [cat] brings [dog] to mind, just as [swimming] evokes [water], and [cunning] evokes [cruel]. Such association of ideas has been a prominent feature of psychological theories for a long time (Lindzey, 1936). It appears to be involved in low-level cognitive phenomena such as semantic priming in lexical decision tasks (McNamara, 1992) and high-level phenomena like diagnosing mental illness (Chapman and Chapman, 1967). Its role in the on-line disambiguation of speech and reading has been explored by (Swinney, 1979), (Tabossi, 1988), and (Rayner et al., 1983), among others.

Evocation is a meaningful variable for all pairs of synsets and seems easy for human annotators to judge. In this sense our extension of WORDNET will have no overlap with knowledge repositories like Cyc (Lenat, 1995) but can be viewed as complementary.

2.1 Collecting Ratings

We hired 20 Princeton undergraduates during the 2004-2005 academic year to rate the evocation in 120,000 pairs of synsets. The pairs were drawn randomly from all pairs defined over a set of 1000 "core" synsets compiled by the investigators. The core synsets were compiled as follows. The most frequent strings (nouns, verbs, and adjectives) from the BNC were selected. For each string, the WORDNET synsets containing this string were extracted. Two of the authors then went over the list of synsets and selected those senses of a given string that seemed the most salient and basic. The initial string is the "head word" member of the synset; the synonyms function merely to identify the concept expressed by the central string. To reflect the distribution of parts of speech in the corpus, we chose 642 nouns, 207 verbs, and 151 adjectives.

Our raters were first instructed about the evocation relation and were offered the following explications:

1. Evocation is a relation between meanings as expressed by synsets and not a relation between words; examples were provided to reinforce this point.

2. One synset evokes another to the extent that thinking about the first brings the second to mind. (Examples were given, such as the word [government] evoking [register] for the appropriate synsets including these terms.)

3. Evocation is not always a symmetrical relation (for example, [dollar] may evoke [green] more than the reverse).

4. The task is to estimate the extent to which one synset brings to mind another in the general undergraduate population of the United States; idiosyncratic evocations caused by the annotator's personal history are irrelevant.

5. It is expected that many pairs of synsets will produce no evocation at all (connections between synsets must not be forced).

6. There are multiple paths to evocation, e.g.:
[rose] - [flower] (example)
[brave] - [noble] (kind)
[yell] - [talk] (manner)
[eggs] - [bacon] (co-occurrence)
[snore] - [sleep] (setting)
[wet] - [desert] (antonymy)
[work] - [lazy] (exclusivity)
[banana] - [kiwi] (likeness)

7. In no case should evocation be influenced by the sounds of words or their orthographies (thus, [rake] and [fake] do not evoke each other on the basis of sound or spelling).

8. The integers from 0 to 100 are available to express evocation; round numbers need not be used.

Raters were familiarized with a computer interface that presented pairs of synsets (each as a list with the highest frequency word first and emphasized; we will refer to this word as the "head word"). The parts of speech corresponding to each synset in a pair were also shown. Presenting entire synsets instead of single words eliminates the risk of confusion between rival senses of polysemous words. Between the two synsets appeared a scale from 0 to 100; 0 represented "no mental connection," 25 represented "remote association," 50 represented "moderate association," 75 represented "strong association," and 100 represented "brings immediately to mind."

As final preparation, each rater was asked to annotate two sets of 500 randomly chosen pairs of synsets (distinct from the pairs to be annotated later). Both sets had been annotated in concert by two of the investigators. The first served as a training set: the response of the annotator-trainee to each pair was followed by the "gold standard" rating obtained by averaging the ratings of the investigators. The second served as a test set: no feedback was offered, and we calculated the Pearson correlation between the annotator's ratings and our own. The median correlation obtained on the test set by the 24 annotators recruited for the project was .72; none scored lower than .64.

Unbeknownst to the annotators, some pairs were presented twice, on a random basis, always in different sessions. The average correlation between first and second presentations was .70 for those annotators who generated at least 100 test-retest pairs.

2.2 Analysis of Ratings

Every pair of synsets was evaluated by at least three people (additional annotations were sometimes collected to test the consistency of annotator judgments), and as one might expect from randomly selecting pairs of synsets, most (67%) of the pairs were rated by every annotator as having no mental connection (see Figure 1). The ratings were usually consistent across different annotators; the average standard deviation for pairs where at least one rater labeled the pair as non-zero was 9.25 (on a scale from 0 to 100).

Because there is an active vein of research comparing the similarity of synsets within WORDNET, we present the Spearman rank order coefficient ρ for a variety of similarity measures. We use WordNet::Similarity (Patwardhan et al., 2004) to provide WORDNET-based measures (e.g., (Leacock and Chodorow, 1998) and (Lesk, 1986) applied to WORDNET glosses). In addition, Infomap (Peters, 2005) is used to provide the cosine between LSA vectors (Landauer and Dumais, 1997) created from the British National Corpus (BNC). For every word for which the program computes a context vector, the 2000 words closest to it are stored. We only consider pairs where both words had context vectors and one was within the 2000 closest vectors of the other. Other words can safely be assumed to have a small value for the cosine of the angle between them.
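The Spearman rank order coefficient ρ used in these comparisons can be computed directly: rank both score lists (averaging the ranks of ties, which matter here because so many evocation ratings are zero) and take the Pearson correlation of the ranks. A minimal sketch, with invented ratings standing in for the study's data:

```python
from math import sqrt

def ranks(xs):
    """Assign 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman rho = Pearson correlation of the rank-transformed data."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical evocation ratings vs. similarity scores for five pairs:
evocation  = [0, 0, 25, 50, 100]
similarity = [0.05, 0.10, 0.08, 0.40, 0.90]
print(round(spearman(evocation, similarity), 2))
```

On this toy data the coefficient comes out near 0.82; the real comparison below runs the same computation over tens of thousands of rated pairs.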

Figure 1: Logarithmic distribution of evocation ratings. (Axes: strength of evocation vs. log of density.)

Figure 2: The relationship between LSA cosine vectors and evocation for pairs that were within the 1000 closest word vectors. (Axes: cosine between LSA vectors vs. evocation.)

The Leacock-Chodorow (LC) and Path measures require connected networks to compute distance, so those values were only computed for noun-noun and verb-verb pairs. The correlations achieved by these various methods are displayed in the following table.

Metric   Subset (# Pairs)      ρ
Lesk     All (119668)          0.008
Path     Verbs (4801)          0.046
LC       Nouns (49461)         0.130
Path     Nouns (49461)         0.130
LSA      Closest 2000 (15285)  0.131

Although there is evidence of a slight monotonic increase of evocation with these similarity measures, the lack of correlation shows that there is not a strong relationship. Our results therefore demonstrate that evocation is an empirical measure of some aspect of semantic interaction not captured by these similarity methods.

For each of the similarity measures, a wide range of possible evocation values is observed across the entire gamut of the similarity range. One typical relationship is shown in Figure 2, which plots evocation against the cosine between LSA vectors. The only exception is that the similarity measures tend to do very well at identifying synsets with little evocation; for low values of similarity, the evocation is reliably low.

There are several reasons why these measures fail to predict evocation. Many of the WORDNET measures are limited to only a small subset of the synset pairs that are of interest to us; the Path and Leacock-Chodorow metrics, for instance, are usable only within the components of WORDNET that have well-defined is-a hierarchies. Although the LSA metric is free of this restriction, it is really a comparison of the relatedness of strings rather than synsets. The vector corresponding to the string fly, for example, encompasses all of its meanings for multiple parts of speech; because many of the words under consideration are polysemous, LSA could therefore suggest relationships between synsets that correspond to meanings other than the intended one.

Finally, all these measures are symmetric, but the evocation ratings are not (see Figure 3). Of the 3302 pairs where both directions between pairs were rated by annotators and where one of the ratings was non-zero, the correlation coefficient between directions was 0.457. While there is a strong symmetric component, there are many examples where asymmetry is observed.
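The asymmetry point can be illustrated with a toy computation: a symmetric measure such as the cosine cannot distinguish directions, while a directed measure such as relative entropy can. The distributions below are invented, standing in for smoothed context-word distributions of two words:

```python
from math import log, sqrt

def cosine(p, q):
    """Cosine of the angle between two vectors; symmetric in its arguments."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q)))

def relative_entropy(p, q):
    """KL divergence D(p || q) in nats; asymmetric in its arguments.
    Assumes p and q are strictly positive and sum to 1 (smooth beforehand)."""
    return sum(a * log(a / b) for a, b in zip(p, q))

p = [0.7, 0.2, 0.1]   # invented context distribution for word 1
q = [0.4, 0.4, 0.2]   # invented context distribution for word 2

assert cosine(p, q) == cosine(q, p)        # identical in both directions
print(round(relative_entropy(p, q), 3))    # D(p||q)
print(round(relative_entropy(q, p), 3))    # D(q||p) -- generally different
```

Any measure whose value is unchanged when its arguments are swapped can at best model the symmetric component of evocation; capturing the remainder requires direction-sensitive features like the relative entropy above.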

2.3 Extending Ratings

Before these data can be used for NLP applications, it must be possible to query the evocation between arbitrary synsets. Our goal is to create a means of automatically judging the evocation between synsets while avoiding the impractical task of hand-annotating the links between all 10^10 pairs of synsets. Our method attempts to leverage the disparate strengths of the measures of semantic distance discussed above in addition to measures of similarity between context vectors (culled from the BNC) for individual words.

These context vectors were created by searching the BNC for the head word of each synset and lemmatizing the results. Stop words were removed from the results, and frequencies were tabulated for words at most n words away (both right and left) from the head word found in the sentence, for n = 2, 4, 6, 8. Because the BNC is tagged, it allows us to specify the part of speech of the target words. Although the tagging does not completely alleviate the problem of multiple senses being represented by the context vectors, it does eliminate the problem when the senses have different functions.

We created a range of features from each pair of context vectors, including the relative entropy, cosine, L1 distance, L2 distance, and the number of words in the context vectors of both words (a full listing appears in the table below). Descriptive statistics for the individual context vectors were also computed. It is hoped that the latter information, in addition to relative entropy, would provide some asymmetric foundation for the prediction of evocation links.

Figure 3: The evocation observed between synset pairs in opposite directions. (Axes: evocation vs. reverse evocation.)

WORDNET-based     BNC-derived
Jiang-Conrath     Relative Entropy
Path              Mean
Lesk              Variance
Hirst-St. Onge    L1 Distance
Leacock-Chodorow  L2 Distance
Part of Speech    Correlation
                  Contextual Overlap
                  LSA-vectors Cosine
                  Frequency

These were exploited as features for the BoosTexter algorithm (Schapire and Singer, 2000), which learns how to automatically apply labels to each example in a dataset. In this case, we broke the range of evocations into five labels: {x ≥ 0, x ≥ 1, x ≥ 25, x ≥ 50, x ≥ 75}. Because there are so many ratings with a value of zero, we created a special category for those values; the other categories were chosen to correspond to the visual prompts presented to the raters during the annotation process. Another option would have been to divide up the range to have roughly equal frequencies of evocation; given the large number of zero annotations, however, this would lead to very low resolution for the higher – and more interesting – levels of evocation. Given the probabilities for membership in each of these ranges of values, we can then compute an estimate of the expected evocation from this coarse probability distribution.

We randomly held out 20% of the labeled data and trained the learning algorithm on the remaining data. Because it is reasonable to assume that different parts of speech will have different models of evocation and because WORDNET and WORDNET-derived similarity measures provide different data for different parts of speech, we trained the algorithm on each of the six pairs of parts of speech as well as on the complete, undivided dataset. The mean squared errors (the square of the predicted minus the correct level of evocation) and the sizes of the training sets for each are provided below.
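The expected-value step just described can be sketched as follows: treating the classifier's scores for the cumulative labels as probabilities, differencing adjacent cumulative probabilities yields the mass falling in each interval, and an expectation is taken over representative values for the intervals. The midpoint representatives used here are our own illustrative assumption, not a detail specified by the method:

```python
# Cumulative labels in increasing threshold order; P(x >= 0) is always 1.
THRESHOLDS = [0, 1, 25, 50, 75]
# Representative values for [0,1), [1,25), [25,50), [50,75), [75,100].
# These midpoints are an illustrative choice, not taken from the paper.
REPRESENTATIVES = [0, 13, 37.5, 62.5, 87.5]

def expected_evocation(cum_probs):
    """Expected evocation from cumulative label probabilities.

    cum_probs[i] = estimated P(evocation >= THRESHOLDS[i]); the list must be
    non-increasing, with cum_probs[0] == 1.
    """
    expectation = 0.0
    for i, rep in enumerate(REPRESENTATIVES):
        upper = cum_probs[i + 1] if i + 1 < len(cum_probs) else 0.0
        mass = cum_probs[i] - upper      # P(evocation falls in interval i)
        expectation += mass * rep
    return expectation

# A pair judged very likely to have no evocation at all:
print(expected_evocation([1.0, 0.1, 0.05, 0.0, 0.0]))
# A pair with substantial predicted evocation:
print(expected_evocation([1.0, 0.9, 0.7, 0.4, 0.1]))
```

The coarse five-way discretization keeps the classification problem tractable while still allowing a graded numeric prediction to be recovered.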

Figure 4: After training, the learning algorithm predicted a distribution of evocation on the test data consistent with the log-transformed distribution of evocations displayed in Figure 1.

Figure 5: Although most of the data are clustered around (0, 0), there are many data points for which high levels of evocation were assigned to zero-evocation data (the spike to the left) and some data of high evocation that were assigned zero levels of evocation (the line following x²). For high levels of evocation, the predictions become less accurate.

Dataset  Mean Squared Error  Training Size
AA       89.9                 2202
VV       83.2                 3861
NV       80.7                25483
AN       67.2                18471
All      63.0                95603
AV       53.8                 6022
NN       49.8                39564

A naïve algorithm would simply label all pairs as having no evocation; this would yield a 73.0 mean squared error for the complete dataset, higher than our mean squared error of 63.0 on the complete dataset and 53.8 on noun-noun pairs (as above). Not only does the algorithm perform better by this metric, but its predictions also, as one would hope, have a distribution much like that observed for the original evocation data (see Figure 4). Taken together, this confirms that we are indeed exploiting the features to create a reasonably consistent model of evocation, particularly on noun-noun pairs, which have the richest set of features.

We hope to further refine the algorithm and the feature set to improve prediction; even for the best prediction scheme, for noun-noun pairs, many pairs with zero evocation were assigned high levels of evocation (see Figure 5). It is heartening, however, to note that the algorithm successfully predicted many pairs with moderate levels of evocation.

3 Future Work

Although our work on developing a learning algorithm to predict evocation is still in its preliminary stages, a foundation is in place for creating a complete, directed, and weighted network of interconnections between synsets within WORDNET that represents the actual relationships observed by real language users. Once our automatic system has shown itself to be reliable, we intend to extend its application beyond the 1000 synsets selected for this study: first to the 5000 synsets judged central to a basic vocabulary and then to the rest of WORDNET.
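The naïve all-zero baseline described above is easy to replicate on any set of predictions: the baseline predictor outputs zero everywhere, and both predictors are scored by mean squared error. The ratings below are invented for illustration:

```python
def mse(pred, truth):
    """Mean squared error between predicted and true evocation levels."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

# Hypothetical true evocation ratings; most pairs have no evocation at all.
truth = [0, 0, 0, 0, 25, 0, 50, 0, 0, 10]
naive = [0] * len(truth)                   # always predict "no evocation"
model = [0, 0, 5, 0, 20, 0, 30, 0, 0, 5]  # a made-up learned predictor

print(mse(naive, truth))   # baseline error
print(mse(model, truth))   # should be lower to justify the learned model
```

Because zero ratings dominate, the all-zero baseline is deceptively strong, which is why beating it is a meaningful check rather than a formality.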

The real test of the efficacy of any addition to WORDNET remains how well it performs in tasks that require sense disambiguation. It is hoped that an enriched network in WORDNET will be able to improve disambiguation methods that use WORDNET as a tool.

4 Acknowledgments

The authors would like to acknowledge the support of the National Science Foundation (grant Nos. 0414072 and 0530518) for the funding to gather the initial annotations and Princeton Computer Science for fellowship support. The authors wish to especially acknowledge the work of our undergraduate annotators for their patience and hard work in collecting these data and Ben Haskell, who helped compile the initial list of synsets.

References

L. Chapman and J. Chapman. 1967. Genesis of popular but erroneous psychodiagnostic observations. Journal of Abnormal Psychology, (72):193–204.

J. Deese. 1964. The associative structure of some English adjectives. Journal of Verbal Learning and Verbal Behavior, (3):347–357.

C. Fellbaum and J. Grabowski. 1997. Analysis of a hand-tagging task. In Proceedings of the ACL/Siglex Workshop. Association for Computational Linguistics.

C. Fellbaum and G. A. Miller. 2003. Morphosemantic links in WordNet. Traitement Automatique des Langues.

C. Fellbaum, 1998. WordNet: An Electronic Lexical Database, chapter A semantic network of English verbs. MIT Press, Cambridge, MA.

J. Jiang and D. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Taiwan.

A. Kilgarriff. 2000. Review of WordNet: An Electronic Lexical Database. Language, (76):706–708.

T. Landauer and S. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, (104).

C. Leacock and M. Chodorow, 1998. WordNet: An Electronic Lexical Database, chapter Combining local context and WordNet similarity for word sense identification. MIT Press, Cambridge, MA.

D. Lenat. 1995. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11).

M. Lesk. 1986. Automatic sense disambiguation using machine-readable dictionaries. In Proceedings of SIGDOC.

G. Lindzey, editor. 1936. History of Psychology in Autobiography. Clark University Press, Worcester, MA.

B. Magnini and G. Cavaglia. 2000. Integrating subject field codes into WordNet. In Proceedings of LREC, pages 1413–1418, Athens, Greece.

D. McCarthy, R. Koeling, J. Weeds, and J. Carroll. 2004. Finding predominant senses in untagged text. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 280–287, Barcelona, Spain.

T. P. McNamara. 1992. Priming and constraints it places on theories of memory and retrieval. Psychological Review, pages 650–662.

I. Mel'cuk and A. Zholkovsky, 1998. Relational Models of the Lexicon, chapter The explanatory combinatorial dictionary. Cambridge University Press, Cambridge.

R. Mihalcea and D. Moldovan. 2001. Extended WordNet: Progress report. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources, pages 95–100, Pittsburgh, PA.

G. Miller, 1998. WordNet: An Electronic Lexical Database, chapter Nouns in WordNet. MIT Press, Cambridge, MA.

S. Patwardhan, T. Pedersen, and J. Michelizzi. 2004. WordNet::Similarity – measuring the relatedness of concepts. In Proceedings of the 19th National Conference on Artificial Intelligence, page 25.

S. Peters. 2005. Infomap NLP: An open-source package for natural language processing. Webpage, October.

K. Rayner, M. Carlson, and L. Frazier. 1983. The interaction of syntax and semantics during sentence processing: Eye movements in the analysis of semantically biased sentences. Journal of Verbal Learning and Verbal Behavior, 22(3):358–374.

R. Schapire and Y. Singer. 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168.

H. Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123.

D. Swinney. 1979. Lexical access during sentence comprehension: (Re)consideration of context effects. Journal of Verbal Learning and Verbal Behavior, (18):645–659.

P. Tabossi. 1988. Accessing lexical ambiguity in different types of sentential contexts. Journal of Memory and Language, (27):324–340.
