An Optimized Lesk-Based Algorithm for Word Sense Disambiguation

Open Comput. Sci. 2018; 8:165–172
Research Article · Open Access

Eniafe Festus Ayetiran* and Kehinde Agbele

An optimized Lesk-based algorithm for word sense disambiguation

https://doi.org/10.1515/comp-2018-0015
Received May 8, 2018; accepted August 24, 2018

*Corresponding Author: Eniafe Festus Ayetiran, Department of Mathematics & Computer Science, Elizade University, Ilara Mokin, Nigeria; E-mail: [email protected]
Kehinde Agbele, Department of Mathematics & Computer Science, Elizade University, Ilara Mokin, Nigeria; E-mail: [email protected]

Abstract: Computational complexity is a characteristic of almost all Lesk-based algorithms for word sense disambiguation (WSD). In this paper, we address this issue by developing a simple and optimized variant of the algorithm using topic composition in documents, based on the theory underlying topic models. The knowledge resource adopted is the English WordNet, enriched with linguistic knowledge from Wikipedia and the SemCor corpus. Besides the algorithm's efficiency, we also evaluate its effectiveness using two datasets: a general-domain dataset and a domain-specific dataset. The algorithm achieves superior performance on the general-domain dataset and superior performance among knowledge-based techniques on the domain-specific dataset.

Keywords: optimized Lesk, distributional hypothesis, topic composition

1 Introduction

The computational complexity associated with most word sense disambiguation algorithms is one of the major reasons why they are not fully employed in many real-life applications. Agirre and Edmonds [1] identified three major approaches to word sense disambiguation: supervised, unsupervised, and knowledge-based. Supervised approaches rely on hand-tagged examples on which algorithms are trained and are known for their superior performance over unsupervised and knowledge-based approaches. However, they require large amounts of training data, which must be prepared anew for each specific case, and training is rigorous and time-consuming. Unsupervised approaches, on the other hand, are self-reliant and work without hand-tagged examples. The rigour involved in developing training sets and the need to repeat the process for different cases make supervised approaches unappealing for many real-life applications such as text categorization, information retrieval, and machine translation. Knowledge-based approaches primarily use dictionaries, thesauri, and lexical knowledge resources for word sense disambiguation, without the need for the corpus evidence that supervised approaches depend on. Knowledge-based techniques include graph-based methods, which rely on the interconnection of the semantic networks available in several lexical resources, and overlap-based methods, popularly called Lesk-based algorithms, which originate from the original Lesk algorithm [2]. Lesk-based algorithms rely on the overlap of words between the definitions of a target word and the words in its context to determine the sense of the target word.

Algorithms based on the original Lesk algorithm are a popular and effective family of knowledge-based techniques for word sense disambiguation. Several such algorithms have been developed over the years, including the adapted version, which initiated the adaptation of the algorithm to fine-grained lexical resources such as WordNet. These algorithms are generally computationally costly because of the combinatorial growth of the comparisons required among the many candidate senses that polysemous words have in different lexical resources. However, the variants of the algorithm proposed over the years have focused mainly on improving its effectiveness rather than its efficiency. A simplified variant attempts to avoid the combinatorial explosion of Lesk-based algorithms by computing overlaps between the definitions of the candidate senses of the target word and the context words; it does not, however, take into account the definitions of the senses of the context words. In agreement with [3], definitions are an important component in determining the meanings of words, since they make the distinctions among polysemous words clearer through a description of each sense of a word. This makes the simplified Lesk algorithm prone to poor coverage, and consequently poor recall, as a result of information sparsity. The main advantage of our algorithm is that it takes sense descriptions into account and computes the similarity for each candidate sense in a single operation. That is, for n senses belonging to a target word, there are exactly n comparisons. This growth rate is linear, in contrast to the exponential growth of the comparisons required by the other variants of the Lesk algorithm, with the exception of the simplified Lesk.
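To make the growth-rate claim concrete, the sketch below scores each candidate sense exactly once against a bag of context words, so n senses cost exactly n score computations. This is our illustration rather than the authors' implementation: the intersection-count scorer and the toy senses are placeholder assumptions, and the actual similarity measure is the topic-composition score described in Section 3.

from typing import Dict, Set

def score_sense(gloss_tokens: Set[str], context_tokens: Set[str]) -> int:
    """Overlap between one candidate sense's (enriched) gloss and the
    context bag of words; a plain intersection count stands in here for
    the paper's topic-composition score."""
    return len(gloss_tokens & context_tokens)

def disambiguate(candidate_glosses: Dict[str, Set[str]],
                 context_tokens: Set[str]) -> str:
    """One comparison per candidate sense: n senses -> n score
    computations (linear), unlike combinatorial variants that evaluate
    every combination of senses across the context window."""
    return max(candidate_glosses,
               key=lambda s: score_sense(candidate_glosses[s], context_tokens))

# Toy usage with two hypothetical senses of "bank":
senses = {
    "bank.n.01": {"financial", "institution", "money", "deposit"},
    "bank.n.02": {"sloping", "land", "beside", "river", "water"},
}
context = {"fisherman", "sat", "on", "river", "bank", "water"}
print(disambiguate(senses, context))  # -> bank.n.02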
In our algorithm, instead of combinatorial comparisons among candidate senses, we model the disambiguation task as a topic-document relationship based on the theory of topic models. The main idea underlying our algorithm stems from the distributional hypothesis [4], on which Lesk-based algorithms generally rely. The hypothesis states that words are similar if they appear in similar contexts. The main theoretical footing of our work is this: if the linguistic information that lexical resources provide for all the context words is modeled as a document, and the information provided for each candidate sense of the target word is modeled as a topic, then, provided the distributional hypothesis is valid, the topic representing the correct sense of the target word should have the highest topic composition in the document. Due to the information sparsity problem predominant in overlap-based methods, we follow the work of [5], which enriches the glosses of candidate senses in WordNet by extending them with their corresponding Wikipedia definitions obtained from BabelNet. We further enrich our algorithm with corpus knowledge from the SemCor corpus [6]. The organization of the paper is as follows: Section 2 discusses related work; Section 3 describes the optimized Lesk-based algorithm using topic composition; Section 4 evaluates and discusses the results; and Section 5 concludes the paper.

2 Related work

Our algorithm builds on the original Lesk algorithm [2] and its variants. Cowie et al. [7] presented a variation of the original Lesk algorithm called simulated annealing. In their work, they designated a function E that reflects the combination of word senses in a given text, whose minimum should correspond to the correct choice of senses. The words shared among the definitions of the chosen senses, counted together, give the redundancy of the text, and the function E is then defined as the inverse of this redundancy. The goal is to find a combination of senses that minimizes this function. To this end, an initial combination of senses is determined, and several iterations are then performed in which the sense of a random word in the text is replaced with a different sense; the new selection is accepted only if it reduces the value of E. The iterations stop when there is no change in the configuration of senses. The algorithm is still computationally complex, as it involves traversing a multi-path graph in search of the shortest route to a destination.
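A minimal sketch of this search, reconstructed from the description above rather than from Cowie et al.'s implementation, is given below; the redundancy measure (gloss words shared by at least two chosen senses), the data layout, and the stopping rule are our placeholder assumptions.

import random
from collections import Counter

def redundancy(choice, glosses):
    """Count gloss words shared by at least two of the chosen senses;
    higher redundancy means the chosen senses cohere better."""
    counts = Counter(tok for word, s in choice.items()
                     for tok in glosses[word][s])
    return sum(c for c in counts.values() if c > 1)

def energy(choice, glosses):
    """E is the inverse of redundancy, so minimizing E maximizes the
    coherence of the sense combination (the +1 avoids division by zero)."""
    return 1.0 / (1 + redundancy(choice, glosses))

def simulated_annealing_wsd(words, glosses, patience=1000, seed=0):
    """Start from an initial sense assignment, repeatedly re-pick the
    sense of a random word, and keep the change only if E decreases;
    stop once no improvement has been seen for `patience` trials."""
    rng = random.Random(seed)
    choice = {w: 0 for w in words}          # initial combination of senses
    e = energy(choice, glosses)
    stale = 0
    while stale < patience:
        w = rng.choice(words)
        trial = dict(choice)
        trial[w] = rng.randrange(len(glosses[w]))  # try a different sense
        e_trial = energy(trial, glosses)
        if e_trial < e:                     # accept only improvements
            choice, e = trial, e_trial
            stale = 0
        else:
            stale += 1
    return choice

Note that, as described here, only sense changes that reduce E are kept; a full simulated-annealing schedule would also occasionally accept worsening moves, a detail the summary above does not cover.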
The Adapted Lesk algorithm [8] adjusts the original Lesk algorithm to a specific lexical resource, the English WordNet, by computing the maximum overlap between the glosses of the candidate senses of a target word and the glosses of the candidate senses of the context words, including their semantic relations, in a combinatorial fashion based on previously tagged parts of speech, as discussed in [9] (a sketch of this combinatorial evaluation appears at the end of this section). In their work, a limited window size was used, considering only the immediate words before and after the target word. The algorithm takes as input an instance in which the target word occurs and produces the sense of the word based on information about it and a few immediately surrounding content words. The choice of sense is finally determined by the maximum of the cumulative scores obtained from the individual combinations of the candidate senses. Kilgarriff and Rosenzweig [10], in a simplified algorithm, use only the context words in isolation to compute similarity with the candidate senses of the target word, without recourse to the definitions of the senses of the context words. Ponzetto and Navigli [5] developed an extended version of the Lesk algorithm by enriching the glosses of WordNet senses with corresponding Wikipedia definitions, using exhaustively all words in the context window of a target word. They achieved this by first mapping WordNet senses to the corresponding Wikipedia terms. Their algorithm shows significant improvement in performance over the use of WordNet glosses in isolation. Basile et al. [3], in a similar fashion, developed another version based on a distributional semantic model using BabelNet, such that the algorithm can use all or part of the context words. Each sense in BabelNet is enriched with semantic relations using the "getRelatedMap" available in the BabelNet API. In other works, Ayetiran et al. [11] and Ayetiran and Boella [12]

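The combinatorial evaluation used by the Adapted Lesk algorithm, referenced above, can be sketched as follows. This is our illustrative reconstruction, not the code of [8]: the data layout and the intersection-count overlap are placeholder assumptions, and semantic relations and part-of-speech filtering are omitted. With k context words of m senses each, it performs on the order of n * m**k overlap computations, which is exactly the growth the optimized variant avoids.

from itertools import product
from typing import Callable, List, Set

Gloss = Set[str]

def adapted_lesk(target_senses: List[Gloss],
                 context_senses: List[List[Gloss]],
                 overlap: Callable[[Gloss, Gloss], int]) -> int:
    """Score every combination of candidate senses across the window
    and return the index of the best-scoring target sense."""
    best, best_score = 0, -1
    for i, t in enumerate(target_senses):
        for combo in product(*context_senses):       # every sense combination
            score = sum(overlap(t, c) for c in combo)  # cumulative score
            if score > best_score:
                best, best_score = i, score
    return best

# Toy usage: sense 1 of the target overlaps the river-themed context.
overlap = lambda a, b: len(a & b)
target = [{"money", "finance"}, {"river", "slope"}]
context = [[{"water", "river"}], [{"fish", "water"}]]
print(adapted_lesk(target, context, overlap))  # -> 1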