
Computación y Sistemas ISSN: 1405-5546 [email protected] Instituto Politécnico Nacional México

Wali, Wafa; Gargouri, Bilel; Hamadou, Abdelmajid Ben Sentence Similarity Computation based on WordNet and VerbNet Computación y Sistemas, vol. 21, núm. 4, 2017, pp. 627-635 Instituto Politécnico Nacional Distrito Federal, México

Available in: http://www.redalyc.org/articulo.oa?id=61553900006


Sentence Similarity Computation based on WordNet and VerbNet

Wafa Wali1, Bilel Gargouri1, Abdelmajid Ben Hamadou2

1 MIRACL Laboratory, FESGS-Sfax, Tunisia

2 MIRACL Laboratory, ISIMS-Sfax, Tunisia

[email protected], [email protected], [email protected]

Abstract. Sentence similarity computing is increasingly important in several applications, such as question answering, machine translation, and automatic abstracting systems. This paper first sums up several methods for calculating the similarity between sentences that consider semantic and syntactic knowledge. Second, it presents a new method for measuring sentence similarity that aggregates, in a linear function, three components: the lexical similarity LexSim based on the common words, the semantic similarity SemSim using the synonymous words, and the syntactico-semantic similarity SynSemSim based on the common semantic arguments, notably thematic role and semantic class. Concerning the word-based semantic similarity, a measure is computed to estimate the semantic relatedness between words by exploiting the WordNet "is a" taxonomy. Moreover, the semantic argument determination is based on the VerbNet database. The proposed method yielded competitive results compared to previously proposed measures on Li's benchmark, showing a high correlation with human ratings. Furthermore, experiments performed on the Microsoft Paraphrase Corpus showed the best F-measure values compared to other measures for high similarity thresholds.

Keywords. Sentence similarity, syntactico-semantic similarity, thematic role, semantic class, WordNet, VerbNet.

1 Introduction

Sentence similarity measures have become an important task in several applications, such as information retrieval, text classification, document clustering, machine translation, text summarization and others.

A number of different metrics for computing the similarity of sentences have been devised. Some of these, outlined as syntactic methods, measure the similarity between sentences based on their co-occurring words, such as [16], or on their similar syntactic dependencies, like [13]. These methods of assessing sentence similarity, however, do not take semantic information into account: words that have different meanings (e.g., "cancer" may be an animal or a disease) and synonymous words (e.g., car/automobile) may not be recognized.

In order to overcome the weaknesses of syntactic methods, other works investigated approaches to compute sentence similarity based on semantic information, using human-constructed lexical resources such as WordNet, like [18], [5] and [11], and/or trained by collecting the statistics of each word from unannotated or highly annotated text corpora, such as [8] and [14].

However, these sentence similarity methods based on semantic information do not directly induce a real similarity score. For this reason, some approaches, called hybrid methods, estimate the similarity between sentences based on both syntactic and semantic information: [12] and [6] take account of the semantic information and the word order information implied in the sentence, [19] considers multiple features measuring the word-overlap similarity

Computación y Sistemas, Vol. 21, No. 4, 2017, pp. 627–635 doi: 10.13053/CyS-21-4-2853 ISSN 2007-9737

and syntactic dependency, and [21] takes account of the synonymy relations between word senses and the semantic predicate based on the LMF standardized Arabic dictionary [9]. Indeed, the authors compute the sentence similarity for the Arabic language, and the proposed measure achieved a correlation of the order of 0.92.

In this paper, we are interested in generalizing the proposal of [21] to the English language using the WordNet and VerbNet databases. WordNet is used to obtain the synonyms of the sentence words, and we also benefit from the VerbNet database, because it is considered the best resource providing the adequate properties of semantic arguments in terms of semantic class and thematic role.

This paper is outlined as follows. First, we present the best-known hybrid approaches. Second, we describe the proposed method for measuring sentence similarity based on semantic arguments, notably the semantic class and the thematic role. Then, we present the benchmarks used for studying the performance of our method compared to competitive methods. After that, we describe and interpret the results obtained on the Li et al. dataset [12] and the Microsoft Paraphrase Corpus [4]. Finally, we provide a conclusion and perspectives for future research.

2 Hybrid Similarity Measures

Several hybrid methods have already been proposed to measure the similarity between sentences. Related work can roughly be classified into three major categories: word order-based similarity, part of speech-based similarity and syntactic dependency-based similarity.

2.1 Word Order-based Similarity

A method for measuring the semantic similarity between sentences based on semantic and word order information, named STATIS, was presented in [12]. First, the semantic similarity is derived from a lexical knowledge base and a corpus. Second, the method considers the impact of the word order on sentence meaning: the derived word order similarity measures the number of different words as well as the number of word pairs appearing in a different order.

The authors of [6] presented a method called Semantic Text Similarity (STS). This method determines the similarity of two sentences from a combination of semantic and syntactic information. They considered two mandatory functions (string similarity and semantic word similarity) using corpus-based measures to calculate the semantic similarity. Moreover, they took into account an optional function, the common-word order similarity, to compute the syntactic similarity.

2.2 Part of Speech-based Similarity

The authors of [1] presented an approach that combines a corpus-based semantic relatedness measure over the whole sentence with knowledge-based semantic similarity scores obtained for the words falling under the same syntactic roles in both sentences. All the scores, used as features, were fed to machine learning models, like linear regression and bagging models, to obtain a single score giving the degree of similarity between sentences.

A method named FM3S was introduced by [20]. It estimates the similarity between sentences based, first, on the semantic similarity of their words, through the separate processing of verbs and nouns, and, second, on the common word order. The method exploits an IC (Information Content)-based semantic similarity measure in the quantification of noun and verb semantic similarity, and it is the first to include nouns and verb tenses in the similarity measure between two sentences.

2.3 Syntactic Dependency-based Similarity

Oliva et al. [17] reported on a method called SyMSS to compute sentence similarity. The method considers that the meaning of a sentence is made up of the meanings of its separate words and the structural way the words are combined. The semantic information is obtained from the WordNet lexical database and the syntactic


information is obtained through a deep process that finds the syntactic structure of each sentence. With this syntactic information, SyMSS measures the semantic similarity between terms with the same syntactic role.

The authors of [3] introduced a method to assess the semantic similarity between sentences which relies on the assumption that the meaning of a sentence is captured by its syntactic constituents and the dependencies between them. They obtain both the constituents and their dependencies from a syntactic parser. The algorithm considers that two sentences have the same meaning if there is a good mapping between their chunks and if the chunk dependencies in one text are preserved in the other. Moreover, the algorithm considers that every chunk has a different importance with respect to the overall meaning of a sentence, which is computed based on the information content of the words in the chunk.

3 The Proposed Method

The suggested method aggregates three modules, namely lexical similarity, semantic similarity and syntactico-semantic similarity. Figure 1 gives an overview of the suggested method.

The lexical similarity is computed, after removing punctuation signs and determining the lemma of each word using a stemmer, from the common lemmas between sentences using the Jaccard coefficient [7]. The choice of the Jaccard coefficient is explained by its simplicity. The lexical similarity function between sentences S1 and S2 is defined as follows:

LexSim(S1, S2) = CW(S1, S2) / (WS1 + WS2 − CW(S1, S2)),   (1)

where WSi refers to the number of words of sentence Si and CW(S1, S2) is the number of common words between the pair of sentences.

The semantic similarity is measured based on semantic vectors. For this module, we first determine the joint word set that contains the distinct words of the two sentences. Then, each sentence is represented over the joint word set by a so-called semantic vector, as follows. Each element of this vector corresponds to a word in the joint word set, and its value is determined by the semantic similarity of the corresponding word to a word in the sentence. Let us take sentence S1 as an example:

— Case 1: if the word Wi of the joint word set appears in sentence S1, then the corresponding cell value of the semantic vector of S1 equals 1.

— Case 2: if Wi does not appear in sentence S1, then a semantic similarity score is calculated between Wi and each word of S1, and the cell receives the highest of these scores.

This semantic similarity score between words is calculated from the common synonyms of the two words, with the assistance of the WordNet database [15], using the Jaccard coefficient [7]. The semantic similarity between two words Wi and Wj is computed as follows:

SSim(Wi, Wj) = CS(Wi, Wj) / (SWi + SWj − CS(Wi, Wj)),   (2)

where SWi and SWj are the numbers of synonyms of each word and CS(Wi, Wj) returns the number of common synonyms between Wi and Wj.

The last step in this module is to calculate the semantic similarity between the sentences from the generated semantic vectors corresponding to S1 and S2, using the cosine similarity. The semantic similarity between two sentences S1 and S2 is computed as follows:

SemSim(S1, S2) = (Σi V1i × V2i) / (√(Σi V1i²) × √(Σi V2i²)),   (3)

where V1i and V2i are the components of the vectors V1 and V2 corresponding to S1 and S2, respectively. The cell value assigned in Case 2 above is thus that of the most similar word, i.e., the one with the highest similarity score:

Sim(Wi, S1) = max over Wj in S1 of SSim(Wi, Wj).   (4)

Using the illustrative example below, Table 1 shows the computation of SemSim(S1, S2):

S1 = "The car was destroyed by a tree".



Fig. 1. The proposed sentence similarity measure

S2 = "The falling branch crumpled the automobile".

The verbs "was destroyed" and "crumpled" are reduced to the lemma forms "destroy" and "crump". The word "falling" is eliminated.

Moreover, the syntactico-semantic similarity is computed from the common semantic arguments between the sentences, in terms of thematic role and semantic class, using the Jaccard coefficient [7]. This idea is considered original because it has not been employed in former research in the literature. The syntactico-semantic computation process starts by determining the syntactic structure of each sentence using the Stanford parser [2]. Then, the semantic predicate is defined for each sentence by means of a linguistic expert. The determination of the semantic arguments takes the relation between the syntactic structure and the semantic predicate into consideration using VerbNet.

In fact, VerbNet [10] is the largest online verb lexicon currently available for English. It is organized into verb classes extending the Levin classes through refinement and addition of subclasses to achieve syntactic and semantic coherence among the members of a class. Each verb class in VerbNet is completely described by thematic roles, selectional restrictions on the arguments, and frames consisting of a syntactic description and semantic predicates with a temporal function.

The syntactico-semantic similarity function between sentences S1 and S2 is determined as follows:

SSSim(S1, S2) = CSArg(S1, S2) / (SArgS1 + SArgS2 − CSArg(S1, S2)),   (5)

where SArgSi refers to the number of semantic arguments of sentence Si and CSArg(S1, S2) is the number of common semantic arguments between the pair of sentences.

Having applied this procedure to the example sentence pair above, the semantic arguments for S1 and S2 are respectively SArgS1 and SArgS2:

1. SArgS1 = (patient, inanimate), (instrument, inanimate),



Table 1. Example of computing SemSim(S1,S2)

                car  destroy  tree  branch  crump  automobile
S1  car          1                                      1
    destroy            1
    tree                       1
    V1           1     1       1      0       0        1
S2  branch                            1
    crump                                     1
    automobile   1                                      1
    V2           1     0       0      1       1        1

SemSim(S1, S2) = 0.5
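The computation behind Table 1 can be reproduced with the following sketch. The word-to-word similarity is reduced to a hardcoded toy synonymy test (car/automobile) rather than the WordNet-based score of equation (2):

```python
import math

def semantic_vector(sentence, joint_words, word_sim):
    """Value 1 if the joint-set word occurs in the sentence, otherwise the
    best word-to-word similarity score, as in equation (4)."""
    return [1.0 if w in sentence
            else max(word_sim(w, v) for v in sentence)
            for w in joint_words]

def cosine(v1, v2):
    """Equation (3): cosine similarity between two semantic vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) *
                  math.sqrt(sum(b * b for b in v2)))

# Toy synonymy: only car/automobile are treated as synonyms.
def word_sim(w, v):
    return 1.0 if {w, v} == {"car", "automobile"} else 0.0

s1 = ["car", "destroy", "tree"]
s2 = ["branch", "crump", "automobile"]
joint = ["car", "destroy", "tree", "branch", "crump", "automobile"]

v1 = semantic_vector(s1, joint, word_sim)  # [1, 1, 1, 0, 0, 1], as in Table 1
v2 = semantic_vector(s2, joint, word_sim)  # [1, 0, 0, 1, 1, 1]
print(cosine(v1, v2))                      # 2 / (2 * 2) = 0.5
```

The dot product of V1 and V2 is 2 and each vector has norm 2, which yields the SemSim(S1, S2) = 0.5 value reported in Table 1.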

2. SArgS2 = (experiencer, inanimate), (theme, inanimate),

3. SynSemSim(S1, S2) = 0.

Finally, the similarity between two sentences is obtained by aggregating the three components defined above, namely LexSim, SemSim and SynSemSim. The aggregation combines them in a linear function, using supervised learning, in particular the SMO (Sequential Minimal Optimization) algorithm, to tune the contribution of each component to the final score:

Sim(S1, S2) = α · A + β · B + γ · C,   (6)

where A denotes LexSim(S1, S2), B denotes SemSim(S1, S2) and C denotes SynSemSim(S1, S2). The components SemSim and SynSemSim make the main contribution to the final score, because β and γ > α.

4 Assessment Benchmarks

The sentence similarity measure proposed in the above section is evaluated in two ways. The first is the study of the correlation between the computed similarity values of sentence pairs and expert judgements. The second is the integration of the sentence similarity measure in a particular application, such as paraphrase detection.

4.1 Benchmarks

In this subsection, we present the benchmarks used by the previous measures listed in the second section: the Li et al. dataset [12] and the Microsoft Paraphrase Corpus (MSPC) [4].

4.1.1 Li et al. Dataset

This dataset, created by Li et al. [12], takes a set of 65 noun pairs, replaces the nouns with their dictionary definitions collected from the Collins Cobuild Dictionary, and had 32 human participants rate the similarity in meaning of each sentence pair on a scale of 0.0 to 4.0. The averaged similarity scores are heavily skewed toward the low end of the scale, with 46 pairs rated from 0.0 to 0.9 and 19 pairs rated from 1.0 to 4.0. To obtain a more even distribution across the similarity range, a subset of 30 sentence pairs was selected, consisting of all 19 sentence pairs rated from 1.0 to 4.0 and 11 pairs taken at equally spaced intervals from the 46 pairs rated from 0.0 to 0.9. Unlike the MSPC described below, where the task is binary classification, this dataset is used to compare the correlation with human ratings.

4.1.2 Microsoft Paraphrase Corpus (MSPC)

The Microsoft Research Paraphrase Corpus [4] consists of 5801 sentence pairs, 3900 of which were labeled as paraphrases by human annotators. The corpus is divided into a training set (4076 pairs) and a test set (1725 pairs). The average number of words per sentence (sentence length) in this corpus is 17. MSPC is by far the largest publicly available paraphrase-annotated corpus and has been used extensively over the last decade.

4.2 Evaluation Metrics

In this subsection, we present the metrics used to evaluate the performance of the hybrid approaches: Pearson's and Spearman's coefficients, used on the Li et al. dataset [12], and recall, precision and F-measure, used on the Microsoft Research Paraphrase Corpus [4].

4.2.1 Pearson's Coefficient

Pearson's coefficient indicates how well the results of a measure agree with human judgements. Pearson's coefficient, denoted r, is computed as follows:

r = (n Σ xi yi − (Σ xi)(Σ yi)) / (√(n Σ xi² − (Σ xi)²) × √(n Σ yi² − (Σ yi)²)),   (7)

where xi corresponds to the i-th element in the list of human judgements, yi corresponds to the i-th element in the list of computed sentence similarities and n corresponds to the number of sentence pairs.

4.2.2 Spearman's Coefficient

The ranking produced by the sentence similarity measure is compared to the one produced on the basis of human judgements. Spearman's coefficient, denoted p, is computed as follows:

p = 1 − (6 Σ di²) / (n(n² − 1)),   (8)

where di corresponds to the difference between the ranks of xi and yi.

4.2.3 Recall

The recall measure is the number of correctly detected paraphrases divided by the number of paraphrases in the dataset:

Recall = D / E,   (9)

where D corresponds to the number of pairs correctly annotated as paraphrases by the measure and E denotes the number of paraphrases in the dataset.

4.2.4 Precision

The precision measure is the number of correctly detected paraphrases divided by the number of returned paraphrases:

Precision = D / F,   (10)

where F corresponds to the number of pairs annotated as paraphrases by the measure.

4.2.5 F-measure

The F-measure combines the precision and recall values and expresses a trade-off between these two measures:

F-measure = 2 × (Precision × Recall) / (Precision + Recall).   (11)

5 Experiments and Results

This section presents the results obtained on the two employed benchmarks, the Li et al. dataset and MSPC. All the experiments are performed using the parameters α = 0.2, β = 0.45 and γ = 0.35, which were determined empirically with respect to equation (6).
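As an illustrative sketch of equations (5) and (6), the snippet below computes the argument-overlap similarity for the running example and then aggregates the three components with the weights reported above; the LexSim and SemSim values passed in are made-up placeholders, not outputs of the full pipeline:

```python
def syn_sem_sim(args1, args2):
    """Equation (5): Jaccard coefficient over (thematic role, semantic class) pairs."""
    common = len(args1 & args2)
    if common == 0:
        return 0.0
    return common / (len(args1) + len(args2) - common)

# Semantic arguments of the running example (Section 3).
sargs_s1 = {("patient", "inanimate"), ("instrument", "inanimate")}
sargs_s2 = {("experiencer", "inanimate"), ("theme", "inanimate")}

def sim(lex, sem, synsem, alpha=0.2, beta=0.45, gamma=0.35):
    """Equation (6): linear aggregation with the empirically tuned weights."""
    return alpha * lex + beta * sem + gamma * synsem

c = syn_sem_sim(sargs_s1, sargs_s2)     # no common argument pair -> 0.0
print(c)
print(sim(lex=0.0, sem=0.5, synsem=c))  # 0.2*0 + 0.45*0.5 + 0.35*0 = 0.225
```

Note that the thematic roles differ between the two sentences even though the semantic classes match, so the argument pairs share no element and SynSemSim is 0 for this pair, as stated in Section 3.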



Fig. 2. Comparison between the STS, FM3S and our proposed measures using F-measure values and varying the threshold θ on the MSPC dataset

Fig. 3. Precision and recall of our measure applied on the MSPC dataset
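The threshold-based evaluation behind Figures 2 and 3 can be sketched as follows. The similarity scores and gold labels are invented toy data, and a pair is predicted as a paraphrase when its score reaches the threshold θ:

```python
def evaluate(scores, gold, theta):
    """Precision, recall and F-measure (equations (9)-(11)) at threshold theta."""
    predicted = [s >= theta for s in scores]
    d = sum(p and g for p, g in zip(predicted, gold))  # correctly found paraphrases
    f = sum(predicted)                                 # pairs flagged as paraphrases
    e = sum(gold)                                      # paraphrases in the dataset
    precision = d / f if f else 0.0
    recall = d / e if e else 0.0
    fmeasure = (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)
    return precision, recall, fmeasure

# Toy similarity scores and gold paraphrase labels.
scores = [0.95, 0.85, 0.72, 0.40, 0.91]
gold   = [True, True, False, False, True]
print(evaluate(scores, gold, theta=0.9))  # (1.0, 0.666..., 0.8)
```

Raising θ typically trades recall for precision, which is why the curves in Figures 2 and 3 are reported as a function of the threshold.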

5.1 Experiment with the Li et al. Dataset

Table 2 shows the Pearson (r) and Spearman (p) correlation coefficients obtained by different measures on the Li et al. dataset. Our method achieved the best results compared to the other computational methods.

Our proposal could reach results that approximate 100%, but unfortunately 28 sentence pairs in the Li et al. dataset contain the verb



"to be", which negatively affects the contribution of the SynSemSim(S1, S2) component to the final similarity score.

Table 2. Results obtained using the Li et al. dataset

                  r      p
STATIS [12]     0.81   0.81
STS [6]         0.85   0.85
SyMSS [17]      0.76   0.71
FM3S [20]       0.76   0.79
Our proposal    0.87   0.87

5.2 Experiment with MSPC

Due to the lack of published research presenting results on the MSPC dataset, our method is compared only to the STS [6] and FM3S [20] measures. Our method provides results at thresholds θ ∈ [0.7, 1.0], as shown in Figure 2. For the paraphrase recognition task, our proposal outperforms the other measures, mainly at high thresholds θ ∈ [0.7, 1.0]. These results provide strong support for the utility of sentence features such as semantic arguments and their properties in the process of computing sentence similarity.

Our proposal yielded competitive results compared to the FM3S method on both the training data and the test data. Moreover, the F-measure values obtained with the STS approach over the same interval tend towards 0. This provides further support for the efficiency of our method.

The results illustrated in Figure 3 show the precision and recall of the proposed measure. Precision reached a peak at θ = 0.9, with a value of 0.742 for the training and test datasets. This demonstrates that the sentence pairs judged as highly similar by our measure are qualified as paraphrases in the MSPC dataset.

6 Conclusion and Future Works

The proposed measure determines the similarity between two sentences from the semantic and syntactico-semantic information they contain. The aggregation function Sim(S1, S2), presented in equation (6), combines the common words, the synonymous words and the properties of the semantic arguments in a linear way. Our proposal is based on word semantic similarity: it exploits WordNet to determine the synonyms of the words, compared using the Jaccard measure. The word semantic similarity scores are arranged in semantic vectors, from which the sentence semantic similarity is obtained using the cosine similarity. The proposed method also takes into account the common semantic argument properties, notably the semantic class and the thematic role, using the VerbNet database. Our method yielded competitive results compared to other computational methods, for instance on the Li et al. dataset.

Due to the promising performance of this measure, it can be applied to other applications, such as plagiarism detection.

References

1. Aggarwal, N., Asooja, K., & Buitelaar, P. (2012). DERI&UPM: Pushing corpus based relatedness to similarity: Shared task system description. Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, Association for Computational Linguistics, pp. 643–647.

2. Chen, D. & Manning, C. D. (2014). A fast and accurate dependency parser using neural networks. Proceedings of EMNLP, pp. 740–750.

3. Ştefănescu, D., Banjade, R., & Rus, V. (2014). A sentence similarity method based on chunking and information content. Proceedings of the 15th International Conference on Computational Linguistics and Intelligent Text Processing, LNCS Vol. 8403, Springer, pp. 442–453.

4. Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. Proceedings of the 20th International Conference on Computational Linguistics, Association for Computational Linguistics, p. 350.



5. Hirst, G. & St-Onge, D. (1997). Lexical chains as representations of context for the detection and correction of malapropisms.

6. Islam, A. & Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data (TKDD), 2(2):10.

7. Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Impr. Corbaz.

8. Jiang, J. J. & Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008.

9. Khemakhem, A., Gargouri, B., Hamadou, A. B., & Francopoulou, G. (2016). ISO standard modeling of a large Arabic dictionary. Natural Language Engineering, 22(6):849–879.

10. Kipper, K., Korhonen, A., Ryant, N., & Palmer, M. (2006). Extending VerbNet with novel verb classes. Proceedings of LREC, volume 2006, page 1. Citeseer.

11. Leacock, C. & Chodorow, M. (1998). Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database, 49(2):265–283.

12. Li, Y., McLean, D., Bandar, Z. A., O'Shea, J. D., & Crockett, K. (2006). Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8):1138–1150.

13. Mandreoli, F., Martoglia, R., & Tiberio, P. (2002). A syntactic approach for searching similarities within sentences. Proceedings of the Eleventh International Conference on Information and Knowledge Management, CIKM '02, New York, NY, USA, ACM, pp. 635–637.

14. Mihalcea, R., Corley, C., & Strapparava, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity. AAAI, volume 6, pp. 775–780.

15. Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

16. Nirenburg, S., Domashnev, C., & Grannes, D. J. (1993). Two approaches to matching in example-based machine translation. Proc. of the 5th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-93), pp. 47–57.

17. Oliva, J., Serrano, J. I., del Castillo, M. D., & Iglesias, Á. (2011). SyMSS: A syntax-based measure for short-text semantic similarity. Data & Knowledge Engineering, 70(4):390–405.

18. Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. arXiv preprint cmp-lg/9511007.

19. Šarić, F., Glavaš, G., Karan, M., Šnajder, J., & Bašić, B. D. (2012). TakeLab: Systems for measuring semantic text similarity. Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, Association for Computational Linguistics, pp. 441–448.

20. Taieb, M. A. H., Aouicha, M. B., & Bourouis, Y. (2015). FM3S: Features-based measure of sentences semantic similarity. International Conference on Hybrid Artificial Intelligence Systems, pp. 515–529.

21. Wali, W., Gargouri, B., & Hamadou, A. B. (2016). Using sentence semantic similarity to improve LMF standardized Arabic dictionary quality. Computational Linguistics and Natural Language Processing.

Article received on 26/12/2016; accepted on 23/02/2017. Corresponding author is Wafa Wali.
