<<

Word Sense Disambiguation in Rogets Thesaurus

Using WordNet

Vivi Nastase and Stan Szpakowicz

School of Information Technology and Engineering

University of Ottawa

Ottawa Ontario Canada

fvnastaseszpakgsiteuottawaca

Abstract that is indicative of the of p olysemous

We want to see what results we can get by

We describ e a simple metho d of disambiguating

using as little information as p ossible for

senses in Rogets Thesaurus using information

DataSet the only context information is the base

ab out the sense of the word in WordNet We present

noun phrase in which the word app ears We will

a few variations on this metho d compare their p er

compare results obtained when no context informa

formance and discuss the results We explain why

tion is available with results obtained when the base

this type of disambiguation can b e useful

noun phrase serves as context We will rst consider

head nouns in the context of their base noun phrase

Introduction

and then head nouns and mo diers that mutually

The work presented in this pap er is part of a larger

play the role of context

research pro ject We learn the assignment of seman

Kwong has worked on aligning lexical re

tic relations to head nouns and mo diers in base

sources She used WordNet as the intermediary in

noun phrases extracted automatically from Sem

passing from a word sense in LDOCE to a word

Cor and other texts Larrick manually from

sense in Rogets She showed that it is p ossible to

Levi In the absence of morphological and

determine the sense of a word in Rogets by manu

syntactic indicators of such relations we rely on lex

ally applying a simple algorithm that uses synsets

ical resources for the semantic characterization of

hypernym sets and co ordinate terms from WordNet

words We want to compare the p erformance of dif

to a small set of words divided into test groups

ferent lexical resources available for this task We

according to the number of p ossible senses Kwong

have previously used WordNet Turning to another

exp erimented only with nouns

lexical resource Rogets Thesaurus raises the prob

Wordsense disambiguation that we prop ose han

lem of nding the prop er senses of words This is

dles nouns adjectives and adverbs from base noun

the problem we discuss here

phrases The algorithm uses the following informa

We have created two data sets for our machine

tion

learning exp eriments DataSet contains base

noun phrases extracted from Larrick and

from Rogets all the paragraphs A para

Levi DataSet contains base noun phrases

graph in Rogets is a group of words with the

extracted automatically from SemCor Op enclass

same part of sp eech split into semicolon groups

words in SemCor are annotated with WordNet

Words in paragraphs are known to b e related

senses

although no explicit semantic link exists b e

We are lo oking for a simple and fast algorithm

tween them Words in a semicolon group are

that will help us annotate DataSet with Rogets

known to b e most closely related

senses using information ab out word senses from

from WordNet all the information at onelink

WordNet We use DataSet to test our wordsense

distance from a certain word we call this a

disambiguation algorithm We will then apply

WordNet mininet

the algorithm to the semiautomatic annotation of

DataSet

nouns the words hyponyms

hypernyms meronyms and holonyms

adjectives adverbs the synonyms the

The algorithm

other sets are left empty

Previous work on selecting the right sense from Ro

words that are derived from another word

gets for words in context includes Yarowsky

w the information p ertaining to the word

The author used information from the context of

w according to its part of sp eech

words keywords in the vicinity of the target word

Here is an example of a derived word propriate part of sp eech and select the paragraph

that captures b est the semantics of the word Words

sense of adjective cosmic its synset in paragraphs from Rogets are semantically related

but the relations are not sp ecied We try to match

and the synonyms hyponyms hypernyms the information ab out the sense of the word in

meronyms and holonyms of the word cosmos WordNet with each of these paragraphs and the

one that scores b est in this matching b ecomes the

cosmic

corresp onding sense in Rogets

Pertains to noun cosmos Sense

universe existence nature creation

Let P b e the set of paragraphs in our copy of

world cosmos macrocosm

Rogets Thesaurus paragraphs in total

natural object

Let W b e the set of words from DataSet

SynonymsHypernyms Ordered by Frequency of

without sense duplicates the same word can app ear

noun cosmos

several times if every time it is asso ciated with a

Sense

dierent WordNet sense

universe existence nature creation world

cosmos macrocosm

For each w W

i

natural object

Let S n H o H r Mr H l b e the synset set of hy

Hyponyms of noun cosmos

p onyms hypernyms meronyms and holonyms

Sense

in this order extracted from WordNet

universe existence nature creation world

Let S P P b e a set of paragraphs such that

cosmos macrocosm

p S P w p

j i j

natural order

For each p S P compute the score as a

j

tuple

Meronyms of noun cosmos

jp S nj jp H oj jp H r j jp Mr j jp H l j

j j j j j

Sense

universe existence nature creation world

Arrange the results in lexicographic order of the

cosmos macrocosm

scores

HAS PART celestial body

Cho ose the sense that scored b est

heavenly body

cosmos has no holonyms

This is the basic algorithm We will exp eriment

with variations on steps and

A paragraph in Rogets can contain single words

These sets are represented as lists and we group

and phrases that we treat as sets of words In step

them in a structure that represents a mininet It

the intersection of a paragraph p from Rogets with a

will b e used in step of the algorithm presented in

set of words from WordNet will b e p erformed in two

Section The word cosmic is represented as follows

ways WordNetSet is one of S n H o H r Mr H l

x p W or dN etS et

cosmic

x p x W or dN etS et

universeexistencenaturecreation

worldcosmosmacrocosm

x p W or dN etS et

natural order

phr ase p x phr ase x W or dN etS et

natural object

In step the scores are ordered in two ways

celestial body heavenly body

S cor e is the score asso ciated with paragraph p

i i

and S cor e with paragraph p recall that a score is

j j

a tuple

We will lo ok for the o ccurrences of the word

lexicographic

alone and in phrases Stemming was not done In

S cor e S cor e k k

i j

DataSet the words were represented by their ro ot

form SemCor also contains information ab out the

s s l k

il j l

ro ot form of all the op enclass words annotated with

s S cor e s S cor e

ix i j x j

s s

ik j k

WordNet senses

sum

The Algorithm

X X

S cor e S cor e s s

i j k l

The idea of the algorithm is to consider all Rogets

s 2S cor e s 2S cor e

i j

k l

paragraphs in which a word app ears with the ap

For words w that in WordNet p ertain to or are The lexicographically ordered list of scores

derived from another word w the synset usually

p

f fr ain g

contains just the word w The algorithm will then

fmoistur e g

nd the same score for all the paragraphs

fdescent g

that contain w We have therefore decided to

fmoistening g

use the word w instead and its disambiguated

p

fstor m g

Rogets sense for the semantic characterization of w

fw ater g

fw eather gg

Here is an example run of the algorithm for

the word cosmic adjective

cosmic p ertains to noun cosmos sense

Results and Discussion

The mininet for Sense of noun cosmos

We have conducted six exp eriments using varia

tions on the main algorithm presented in Section

S n funiv er se existence natur e cr eation

w or l d cosmos macr ocosmg

In four of these exp eriments we did not lo ok for

H o fnatural orderg

o ccurrences of words in combinations even if the

H r fnatural objectg

correct sense of some of such words only o ccurs in

Mr fcelestial body heavenly body g

phrases For example sugar cane this sense of

H l fg

cane only app ears in combination with its mo dier

sugar We made this decision b ecause a paragraph

in Rogets usually contains numerous phrases Let

The subset of paragraphs S P from Rogets such

w b e a word taken from one of the ve WordNet

that cosmos S P

sets we consider If w app ears in many phrases in

a paragraph in Rogets this paragraphs score may

S P f fw hol e w hol eness integ r al ity gg

b ecome unduly high

far r ang ement reduction to order g

funiv er se omneity w hol e w or l d g We also tried lo oking for words in phrases to

conrm our hypothesis that lo oking for words alone

The lexicographically ordered list of scores

gives a b etter result The results presented in Ta

ble suggest that lo oking for words in combinations

f funiv er se g

decreases the precision of the algorithm the num

fw hol e g

b er of senses correctly marked decreases but it in

far r ang ement gg

creases the recall some senses missed b efore are

now found We also note an increase of the number

of senses correctly marked but at a tie with others

A more complex example for the noun rain in the

and of second choices

noun phrase autumnal rain

Another variation is in comparing scores We did

The mininet for Sense of noun rain

lexicographic ordering of the lists of results and also

compared the sums of the elements of these lists

S n fr ain rainfall g

While studying Rogets Thesaurus we made a use

H o fmonsoon dow npour cl oudbur st

ful observation If we lo ok for the sense of a word

del ug e w ater spout tor r ent soak er

w that app ears as the rst word in a Rogets para

dr iz z l e miz z l e show er rain shower g

graph the paragraph overlaps mostly with w s hy

H r fpr ecipitation downfal lg

p onym set in WordNet We have therefore cho

Mr fr aindr opg

sen the ordering of the information from WordNet

H l fg

presented in step of the algorithm the synset

is considered most imp ortant then hyponyms hy

p ernyms meronyms and nally holonyms Lexico

The subset of paragraphs S P from Rogets such

graphic order is more discriminating than just a sum

that r ain S P

of individual scores and our exp eriment variations

show this

S P f fstor m tur moil tur bul ence g

We also introduced a bit of context In one

fdescent decl ension decl ination g

of the exp eriments we used the mo diers as a

fw ater H O heavy water g

context for the head nouns In another exp eriment

fw eather the elementfair weather g

mo diers and head nouns served as a context for

fmoistur e humidity sap j uice g

one another We consider another set C x containing

fmoistening humidif ication g

the word and its context and the score is a tuple

fr ain r ainf al l moistur e gg

Table WordNet Rogets sense corresp ondence for the unique senses

WL WS WCL WCS WLH WLB

Words that do not app ear in Rogets

Senses that do not app ear in Rogets

Words for which all Rogets senses are a tie

Senses correctly marked

Senses marked correct but at a tie with others

Senses for which the second choice is the right one

Senses missed

Table Comparison of results

WL WS WCL WCS WLH WLB

Correct sense identied

The correct sense is rst or second sense indicated

Average number of senses

computed as follows p is a paragraph from Rogets unique senses The results presented in Table are

computed by comparison with the manually anno

tated data

jp C xj jp S nj jp H oj jp H r j jp Mr j jp H l j

Here are examples of senses from DataSet with

The context will b e considered more imp ortant their asso ciated base noun phrases

than the synset in our list of scores b ecause if we

Words that do not app ear in Rogets Thesaurus

nd the exact base noun phrase in a Rogets para

lacy adjective lacy pattern weekend noun

graph the words will have the senses we are lo oking

weekend b oredom ounder adjective oun

for

der sh

The results are presented for the following vari

ants of the algorithm and of the computation of the

Senses that do not app ear in Rogets Thesaurus

score

laser laser printer Tuesday Tuesday night

WL Lo ok for words alone order the lists of

Words for which all senses are a tie bathing

results lexicographically

bathing suit laugh laugh wrinkles pet

spray p et spray

WS Lo ok for words alone order the sum of

the elements of the result lists

Senses correctly marked cloud storm storm

cloud weather weather rep ort moist moist

WCL Lo ok for words in combinations order

air

the lists of results lexicographically

Senses marked correctly but at a tie with oth

WCS Lo ok for words in combinations order

ers tiny tiny clouds rope giant rop e report

the sum of elements of the result lists

weather rep ort space op en space

WLH Lo ok for words alone order the lists of

Senses for which the second choice is the right

results lexicographically treating the mo dier

one puried puried water plan group

as the context for the head noun

plan aquatic aquatic mammal

WLB Lo ok for words alone order the lists

Senses missed wire electric wire lightning

of results lexicographically assuming that the

lightning ro d heavy heavy storm point

head noun and the mo dier have each other as

sharp p oint

context

We have manually assigned senses from Rogets to There are dierences in the numbers of words that

the words in DataSet using information ab out the do not app ear in Rogets for the dierent variants of

word senses in WordNet and the corresp onding base the algorithm This is b ecause sometimes a word ap

noun phrases From the mo diernoun pairs in p ears in Rogets only in combinations which are not

DataSet we extracted words among them found when we lo ok for words alone for example

cane in sugar cane There are cases when a word is We want to p erform sup ervised machine learning

found in a combination but its sense is not found on base noun phrases annotated with semantic re

for example Tuesday day of the week lations while the individual words are describ ed by

In Table we show some statistical results The information coming from dierent lexical resources

p ercentages shown are computed using as a base the The correct Rogets sense can b e selected from the

number of unique senses that app ear in Rogets ac rst two senses indicated by our simple algorithm

cording to that particular variation of the algorithm with a recall of when the average number of

We have computed the accuracy separately for the senses in DataSet is This signicantly reduces

nouns in our data set We obtain a precision the amount of manual lab our necessary to annotate

average of word senses correctly and unam our corp ora with Rogets senses

biguously marked and rst sense chosen is

In our exp eriment the words were annotated with

the correct one ties may o ccur Kwong rep orts an

WordNet senses This algorithm and Kwongs pap er

average of senses accurately mapp ed The

could serve as pro of that there is a corresp ondence

average number of p ossible senses in Rogets for the

b etween WordNet and Rogets senses We can use

words in our corpus is when we lo ok for the

these two resources in a wordsense disambiguation

word alone and when we accept o ccurrences

algorithm for untagged data they should reinforce

of words in combinations The average number of

each others choice of the correct sense of a word in

senses for the words used by Kwong is

context

Even though our algorithm is very simple and op

The uses of our disambiguation mechanism could

erates in a relative semantic void one would have

b e numerous Rogets hierarchy is very homoge

liked the accuracy to b e higher than We

neous This could b e a great help in testing the sim

lo oked at our results and we have observed the fol

ilarity of words as shown by McHale When

lowing phenomena

the similarity of words is decided by edge counting

McHale shows that the homogeneous organization of

some Rogets paragraphs are quite similar and

Rogets Thesaurus gives b etter results than Word

they get the same score For the adjective tiny

Net

for example the two paragraphs in which it ap

Another use would b e to nd related words that

p ears received the same score The paragraph

have dierent parts of sp eech or using Rogets The

keywords are smal l and littlesmal l

saurus to add p ertainyms to WordNet This is b e

WordNet is oriented toward words whereas Ro

cause words in Rogets with dierent parts of sp eech

gets is oriented toward phrases Words that

are group ed together under the same headwords

app ear in many phrases in a paragraph change

We have shown that with a very simple algorithm

the score in favour of this paragraph and the

the correct sense from Rogets can b e selected from

score obtained can b e misleading

the rst two senses indicated with an average recall

of with very little context information We

We can also note that we have set a fairly high

nd this a very promising result

standard for a rather simple algorithm In the pro

cess of manually assigning correct senses from Ro

Acknowledgements

gets we consider very carefully the sense of the

word not only in the context of its base noun phrase

Pearson Education licensed to us the Rogets

but also within paragraphs to which it could b elong

Thesaurus for research purp oses Partial funding

If as we have rep eatedly observed paragraphs con

for this work comes from the Natural Sciences and

tained words that were indeed very close in meaning

Engineering Research Council of Canada

we chose to weigh complete paragraphs against each

other Sometimes our eventual choice rated second

References

or even lower on the ordered list of scored para

Christiane Fellbaum editor WordNet An

graphs

Electronic Lexical Database The MIT press

Conclusions

Betty Kirkpatrick Penguin Authorized Ro

gets Thesaurus of English Words and Phrases

We needed a simple and fast algorithm to help us an

Penguin Bo oks London New York edition

notate a list of base noun phrases with senses from

Oi Yee Kwong Aligning WordNet with

Rogets Thesaurus The present study help ed elim

Additional Lexical Resources In COLINGACL

inate two variations of our simple algorithm the

Workshop on Usage of WordNet in Natural Lan

ones that lo ok for the word in phrases They are

guage Processing Systems pages

more computationally exp ensive and the gain of at

most in recall comparing WLB b est preci N Larrick Junior Science Book of Rain Hail

sion and WCS b est recall with a loss of in Sleet and Snow Garrard Publishing Company

precision do es not justify using these two metho ds Champaign IL

J N Levi The Syntax and Semantics of

Complex Nominals Academic Press New York

Michael McHale A Comparison of WordNet

and Rogets Taxonomy for Measuring Semantic

Similarity In COLINGACL Workshop on Us

age of WordNet in Natural Language Processing

Systems

George A Miller Five Papers on WordNet

Special issue of International Journal of Lexicog

raphy

David Yarowsky Unsup ervised Word Sense

Disambiguation Rivaling Sup ervised Metho ds In

Proceedings of the rd Annual Meeting of the

Association for Computational pages

Cambridge MA