An Iterative Approach to Word Sense Disambiguation

From: FLAIRS-00 Proceedings. Copyright © 2000, AAAI (www.aaai.org). All rights reserved. An Iterative Approach to Word Sense Disambiguation Rada Mihalcea and Dan I. Moldovan Department of Computer Science and Engineering Southern Methodist University Dallas, Texas, 75275-0122 {rada, moldovan} @seas.smu.edu Abstract There are also hybrid methods that combine sev- eral sources of knowledge such as lexicon information, In this paper, we present an iterative algorithm heuristics, collocations and others (Bruce and Wiebe for WordSense Disambiguation. It combines two 1994) (Ng and Lee 1996) (Rigau, Atserias and Agirre sources of information: Word_Netand a semantic tagged corpus, for the purpose of identifying the 1997) (Mihalcea and Moldovan 1999). correct sense of the wordsin a given text. It dif- The method proposed here is a hybrid method, and fers from other standard approaches in that the uses information gathered from a MRD,namely Word- disambiguation process is performed in an itera- Net, and from a semantic tagged corpus, i.e. SemCor. tive manner:starting from free text, a set of dis- It differs from previous approaches in that it uses an it- ambiguatedwords is built, using various methods; erative approach: the algorithm has as input the set of new words are sense tagged based on their rela- nouns and verbs extracted from the input text, and in- tion to the already disambiguatedwords, and then crementally builds a set of disambiguated words. This added to the set. This iterative process allows us approach allows us to identify, with high precision, the to identify, in the original text, a set of words semantic senses for a subset of the input words. About which can be disambiguated with high precision; 55%of the verbs and nouns are disambiguated 55% of the nouns and verbs are disambiguated with a with an accuracy of 92%. precision of 92%. The algorithm presented here is part of an ongoing re- search for the purpose of integrating WSDtechniques Introduction into Information Retrieval (IR) systems. It is an im- Word Sense Disambiguation (WSD) is an open problem provement over our previous work in WSD(Mihalcea in Natural Language Processing (NLP). Its solution im- and Moldovan 1999). This method can be also used pacts other tasks such as information retrieval, machine in combination with other WSDalgorithms with the translation, discourse, reference resolution and others. purpose of fully disambiguating free text. WSDmethods can be broadly classified into four Lately, the biggest effort to incorporate WSDinto types: larger applications is performed in the field of IR. The inputs of IR systems usually consist of a question/query 1. WSDthat makes use of the information provided by and a set of documents from which the information has Machine Readable Dictionaries (MRD)(Miller et to be retrieved. This led to two main directions consid- 1994), (Agirre and Rigau 1995), (Li, Szpakowicz ered so far by researchers, for the purpose of increasing Matwin), (Leacock, Chodorow and Miller 1998); the IR performance with WSDtechniques: 2. WSDthat uses information gathered from training 1. The disambiguation of the words in the input query. on a corpus that has already been semantically dis- The purpose of this is to expand the query with sim- ambiguated (supervised training methods) (Ng ilar words, and thus to improve the recall of the Lee 1996); IR system. (Voorhees 1994), (Voorhees 1998) 3. WSDthat uses information gathered from raw cor- (Moldovan and Mihalcea 2000) proved that this tech- pora (unsupervised training methods) (Yarowsky nique can be useful if the disambiguation process is highly accurate. 1995) (Resnik 1997). 2. The disambiguation of words in the documents. 4. WSDmethods using machine learning algorithms (Schutze and Pedersen 1995) proved that sense-based (Yarowsky 1995), (Leacock, Chodorow and Miller retrieval can increase the precision of an IR system up 1998). to 7%, while a combination of sense-based and word- Copyright ©2000, American Association for Artificial based retrieval increases the precision up to 14%. Intelligence (www.aaal.org). All rights reserved. With the algorithm described in this paper, a large NATURALLANGUAGE PROCESSING 219 subset of the words in the documents can be disam- Example. The noun subcommittee has one sense de- biguated with high precision, allowing for an efficient fined in WordNet. Thus, it is a monosemousword and combined sense-based and word-based retrieval. can be marked as having sense #1. Resources PROCEDURE3. With this procedure, we are trying to WordNet(WordNet 1.6 has been used in our method) get contextual clues regarding the usage of the sense a Machine Readable Dictionary developed at Princeton of a word. For a given word Wi, at position i in the University by a group led by George Miller (Fellbaum text, form two pairs, one with the word before Wi (pair 1998). WordNet covers the vast majority of nouns, Wi-t-Wi) and the other one with the word after Wi verbs, adjectives and adverbs from the English lan- (pair Wi-Wi+l). Determiners or conjunctions cannot be guage. It has a large network of 129,504 words, or- part of these pairs. Then, we extract all the occurrences ganized in 98,548 synonymsets, called synsets. of these pairs found within the semantic tagged corpus The main semantic relation defined in WordNetis the formed with the 179 texts from SemCor. If, in all the "is a" relation; each concept subsumes more specific occurrences, the word Wi has only one sense #k, and concepts, called hyponyms, and it is subsumed by more the number of occurrences of this sense is larger than general concepts, called hypernyms. For example, the a given threshold, then mark the word Wi as having concept {machine} has the hypernym {device}, and sense #k. one of its hyponyms is {calculator, calculating Example. Consider the word approval in the text frag- machine}. ment "committee approval of", and the thresh- WordNet defines one or more senses for each word. old set to 3. The pairs formed are "committee Depending on the number of senses it has, a word can be approval’ ’ and ’ ’ approval of’ ’. No occurrences of (1) monosemous,i.e. it has only one sense, for example the first pair are found in the corpus. Instead, there are the noun interestingness, or (2) polysemous, i.e. it four occurrences of the second pair: has two or more senses, for example the noun interest "... with the appro~al#lof the FarmCredit Association..." which has seven senses defined in WordNet. "... subject to the approral#Iof the Secretary of State ..." SemCor. SemCor(Miller et al. 1993) is a corpus formed "... administrativeapproval#1 of the reclassification ... " with about 25%of the Browncorpus files; all the words "... recommendedapproval#1 of the 1-A classification ..." in SemCorare part-of-speech tagged and semantically In all these occurrences the sense of approval is sense disambiguated. In the algorithm described here, we use #1. Thus, approval is marked with sense #1. the brownl and brown2 sections of SemCor, containing PROCEDURE4. For a given noun N in the text, de- 185 files; from these, 6 files are used with the purpose termine the noun-context of each of its senses. This of testing our method; the other 179 files form a corpus noun-context is actually a list of nouns which can occur used to extract rules with procedure 3 and to determine within the context of a given sense i of the noun N. noun-contexts for procedure 4 (as described in the next In order to form the noun-context for every sense Ni, section). we determine all the concepts in the hypernym synsets of Ni. Also, using SemCor, we determine all the nouns Iterative Word Sense Disambiguation which occur within a windowof 10 words respect to Ni. All of these nouns, determined using WordNet and The algorithm presented in this paper determines, in a SemCor, constitute the noun-context of Ni. We can given text, a set of nouns and verbs which can be dis- now calculate the number of common words between ambiguated with high precision. The semantic tagging this noun-context and the original text in which the is performed using the senses defined in WordNet. noun N is found. In this section, we are going to present the various Applying this procedure to all the senses of noun N methods used to identify the correct sense of a word. will provide us with an ordering over its possible senses. Next, we present the main algorithm in which these We pick up the sense i for the noun N which: (1) is procedures are invoked in an iterative manner. the top of this ordering and (2) has the distance to the PROCEDURE1. This procedure uses a Named En- next sense in this ordering larger than a given threshold. tity (NE) component to recognize and identify person Example. The word diameter, as it appears in a names, locations, company names and others. The var- text from the aerodynamics field (Cranfield collec- ious names are recognized and tagged. Of interest for tion), has two senses. The commonwords found be- our purpose are the PER (person), ORG(group) tween the noun-contexts of its senses and the text LOC(location) tags. The words or word collocations are: for diameter#i: ( property, hole, ratio } and for marked with such tags are replaced by their role (per- diameter#2: { form}. For this text, the threshold was son, group, location) and marked as having sense #1. set to 1, and thus we pick diameter#1 as the correct Example. ’ ’Scott Hudson’ ’ is identified as a person sense (there is a difference larger than 1 between the name, thus this word group will be replaced with its number of nouns in the two sets).

Load more