Using Relevant Domains Resource for Word Sense Disambiguation
Total Page:16
File Type:pdf, Size:1020Kb
Using Relevant Domains Resource for Word Sense Disambiguation Sonia Vázquez, Andrés Montoyo German Rigau Department of Software and Computing Systems Department of Computer Languages and Systems University of Alicante Euskal Herriko Unibertsitatea Alicante, Spain Donostia, Spain {svazquez,montoyo}@dlsi.ua.es [email protected] Abstract educate ourselves. The revolution continues and one of its results is that large volumes of information will be shown in a format that is more natural for users than This paper presents a new method for Word Sense the typical data presentation formats of past computer Disambiguation based on the WordNet Domains systems. Natural Language Processing (NLP) is crucial lexical resource [4]. The underlaying working in solving these problems and language technologies hypothesis is that domain labels, such as will make an indispensable contribution to the success ARCHITECTURE, SPORT and MEDICINE provide a of information systems. natural way to establish semantic relations between Designing a system for NLP requires a large word senses, that can be used during the knowledge on language structure, morphology, syntax, disambiguation process. This resource has already semantics and pragmatic nuances. All of these different been used on Word Sense Disambiguation [5], but it linguistic knowledge forms, however, have a common has not made use of glosses information. Thus, we associated problem, their many ambiguities, which are present in first place, a new lexical resource based on difficult to resolve. WordNet Domains glosses information, named In this paper we concentrate on the resolution of the ªRelevant Domainsº. In second place, we describe a lexical ambiguity that appears when a given word has new method for WSD based on this new lexical several different meanings. This specific task is resource (ªRelevant Domainsº). And finally, we commonly referred as Word Sense Disambiguation evaluate the new method with English all-words task of (WSD). The disambiguation of a word sense is an SENSEVAL-2, obtaining promising results. ªintermediate taskº [8] and it is necessary to resolve such problems in certain NLP applications, as Machine Keywords: Word Sense Disambiguation, Translation (MT), Information Retrieval (IR), Text Computational Lexicography. Processing, Grammatical Analysis, Information Extraction (IE), hypertext navigation and so on. In 1. Introduction and motivation general terms, WSD intents to assign a definition to a selected word, in a text or a discourse, that endows it The development and convergence of computing, with a meaning that distinguishes it from all of the telecommunications and information systems has other possible meanings that the word might have in already led to a revolution in the way that we work, other contexts. This association of a word to one communicate with other people, buy news and use specific sense is achieved by acceding to two different services, and even in the way that we entertain and information sources, known as context1 and external method is evaluated with English all-words task of knowledge sources2. SENSEVAL-2, obtaining promising results. The method we propose in this paper is based on The organisation of this paper is: after this strategic knowledge (knowledge-driven WSD), that is, introduction, in section 2 we describe the new lexical the disambiguating of nouns by matching the context in resource, named Relevant Domains. In section 3, the which they appear with the information from WordNet new WSD method is presented using the Relevant lexical resource. Domains resource. In section 4, an evaluation of WSD WordNet is not a perfect resource for word-sense method is realized, and finally conclusions and an disambiguation, because it has the problem of the outline of further works are shown. fined-grainedness of WordNetÂs sense distinctions [2]. This problem causes difficulties in the performance of 2. New resource: Relevant Domains automatic word-sense disambiguation with free- running texts. Several authors [8, 3] have stated that the WordNet Domains [4] is an extension of WordNet divisions of a proposed sense in the dictionary are too 1.6 where each synset has one or more domain labels. fine for Natural Language Processing. To solve this Synsets associated to different syntactic categories can problem, we propose a WSD method for applications have the same domain labels. These domain labels are that do not require a fine granularity for senses selected from a set of about 250 hundred labels, distinctions. This method consists of labelling texts hierarchically organized in different specialization words with a domain label instead of a sense label. We levels. This new information added to WordNet 1.6., named domains to a set of words with a strong allows to connect words that belong to different semantic relation. Therefore, applying domains to subhierarchies and to include into the same domain WSD contributes with a relevant information to label several senses of the same word. Thus, a single establish semantic relations between word senses. For domain label may group together more than one word example, ªbankº has ten senses in WordNet 1.6 but sense, obtaining a reduction of the polysemy. Table 1 three of them ªbank#1º, ªbank #3º and ªbank #6º are shows an example. The word ªmusicº has six different grouped into the same domain label ªEconomyº, senses in WordNet 1.6.: four of them are grouped whereas ªbank#2º and ªbank#7º are grouped into under the MUSIC domain, causing the reduction of the domains labels ªGeographyº and ªGeologyº. polysemy from six to three senses. A lexical resource with domain labels associated to word senses is necessary for the WSD proposed Table 1. Domains associated to word ªmusicº method. Thus, a new lexical resource has been developed, named Relevant Domains obtained from Synset Domain Noun Gloss ªWordNet Domainsº [4]. an artistic form of A proposal in WSD using domains has been 05266809 Music music#1 auditory ¼ developed in [5]; they use WordNet Domains as lexical any agreeable 04417946 Acoustics music#2 resource, but from our point of view they don't make (pleasing¼ good use of glosses information. Thus, in this paper we Music, and a musical diversion; 00351993 music#3 present a new lexical resource obtained from glosses Free_time his music¼ information of WordNet Domains and a new WSD a musical 05105195 Music music#4 method that use this new lexical resource. This new composition in¼ the sounds produced 04418122 Music music#5 by singers.. 1 Context is a set of words which are around the word to punishment for one©s disambiguate along with syntactical relations, semantic categories 00755322 Law music#6 and so on. actions;¼ 2 External knowledge resources are lexical resources, as WordNet, manually developed to give valuable information for associating senses to words. In this work, WordNet Domains will be used to Then, Table 2 shows the domains associated with gloss collect examples of domains associations to the nouns of ªmusic#1º. different meanings of the words. To realize this task, WordNet Domains glosses will be used to collect the Table 2. Domains association with gloss nouns of more relevant and representative domain labels for ªmusic#1º each English word. In this way, the new resource named Relevant Domains, contains all words of Domain Noun WordNet Domains glosses, with all their domains and Music form they are organised in an ascendant way because of their Music communication relevance in domains. Music tone To collect the most representative words of a Music manner domain, we use the ªMutual Informationº formula (1) as follows: This process is realized with all the WordNet = Pr(w | D) Domains glosses to obtain all the domains associated to MI(w, D) log2 (1) Pr(w) each noun for begining with the Association Ratio W: word. calculus. Finally, we obtain a list of nouns with their D: domain. associated domains sorted by Association Ratio. With this format, the domains that appear in first positions of Intuitively, a representative word is that appears in a a noun are the most representatives. The results of the domain context most frequently. But we are interested Association Ratio for noun ªmusicº are showed in about the importance of words in a domain, that is, the Table 3. Thus, the most representative domains for most representative and common words in a domain. noun ªmusicº are: MUSIC, FREE-TIME and We can appreciate this importance with the ACOUSTICS. ªAssociation Ratioº formula: After the Association Ratio for nouns, the same process is done to obtain Association Ratio for verbs, = Pr(w | D) AR(w, D) Pr(w | D) log 2 (2) Pr(w) adjectives and adverbs. W: word. D: domain. Table 3. Association Ratio of ªmusicº Formula (2) shows ªAssociation Ratioº that is Noun Domain A.R. applied to all words with noun grammatical category music Music 0.240062 obtained from WordNet Domains glosses. Later, the music Free_time 0.093726 same process is applied to verbs, adjectives and music Acoustics 0.072362 adverbs grammatical categories. A proposal in this music Dance 0.065254 sense has been made in [6], but using Lexicography music University 0.046024 Codes of WordNet Files. music Radio 0.042735 In order to obtain Association Ratio for nouns of music Art 0.020298 WordNet Domains glosses, it is necessary to use a music Telecommunication 0.006069 parser which obtains all nouns appeared in each gloss. ¼ ¼ ¼ For this task, we use ªTree Taggerº parser [7]. For example, the gloss associated to sense ªmusic#1º is the following: ªAn artistic form of 3. WSD method auditory communication incorporating instrumental or vocal tones in a structured and continuous mannerº. The method presented here is basically about the automatic sense-disambiguation of words that appear into the context of a sentence, with their different CV = ∑ AR(W , D) ∈ (3) possible senses quite related. The context is taken from w context Figure 1 shows the context vector obtained from the the words that co-occur with the proposed word into a following text: ªThere are a number of ways in which sentence and from their relations to the word to be the chromosome structure can change, which will disambiguated.