
Computación y Sistemas ISSN: 1405-5546 [email protected] Instituto Politécnico Nacional México

Wali, Wafa; Gargouri, Bilel; Hamadou, Abdelmajid Ben Sentence Similarity Computation based on WordNet and VerbNet Computación y Sistemas, vol. 21, núm. 4, 2017, pp. 627-635 Instituto Politécnico Nacional Distrito Federal, México

Available in: http://www.redalyc.org/articulo.oa?id=61553900006


Sentence Similarity Computation based on WordNet and VerbNet

Wafa Wali1, Bilel Gargouri1, Abdelmajid Ben Hamadou2

1 MIRACL Laboratory, FESGS-Sfax, Tunisia

2 MIRACL Laboratory, ISIMS-Sfax, Tunisia

[email protected], [email protected], [email protected]

Abstract. Sentence similarity computing is increasingly important in several applications, such as question answering, machine translation, and automatic abstracting systems. This paper first sums up several methods for calculating the similarity between sentences that consider semantic and syntactic knowledge. Second, it presents a new method for measuring sentence similarity that aggregates, in a linear function, three components: the lexical similarity LexSim based on the common words, the semantic similarity SemSim using the synonymous words, and the syntactico-semantic similarity SynSemSim based on the common semantic arguments, notably thematic role and semantic class. Concerning the word-based semantic similarity, a measure is computed to estimate the semantic relatedness between words by exploiting the WordNet "is a" taxonomy. Moreover, the semantic argument determination is based on the VerbNet database. The proposed method yielded competitive results compared to previously proposed measures on Li's benchmark, showing a high correlation with human ratings. Furthermore, experiments performed on the Microsoft Paraphrase Corpus showed the best F-measure values compared to other measures for high similarity thresholds.

Keywords. Sentence similarity, syntactico-semantic similarity, thematic role, semantic class, WordNet, VerbNet.

1 Introduction

Sentence similarity measures have become an important task in several applications, such as information retrieval, text classification, document clustering, machine translation, text summarization and others.

A number of different metrics for computing the similarity of sentences have been devised. Some of these, outlined as syntactic methods, measure the similarity between sentences based on their co-occurring words, such as [16], or on their similar syntactic dependencies, like [13]. These methods of assessing sentence similarity, however, do not take semantic information into account: words that have different meanings (e.g., "cancer" may be an animal or a disease) and synonymous words (e.g., car/automobile) may not be recognized.

In order to overcome the weaknesses of syntactic methods, other works investigated approaches to compute sentence similarity based on semantic information, using human-constructed lexical resources such as WordNet, like [18], [5] and [11], and/or trained by collecting the statistics of each word from unannotated or highly annotated text corpora, such as [8] and [14].

However, these sentence similarity methods based on semantic information do not directly induce a real similarity score. For this reason, some approaches, called hybrid methods, estimate the similarity between sentences based on both syntactic and semantic information: [12] and [6] take account of the semantic information and the word order information implied in the sentence, [19] considers multiple features measuring the word-overlap similarity

Computación y Sistemas, Vol. 21, No. 4, 2017, pp. 627–635 doi: 10.13053/CyS-21-4-2853 ISSN 2007-9737

and syntactic dependency, and [21] takes account of the synonymy relations between word senses and the semantic predicate based on the LMF standardized Arabic dictionary [9]. Indeed, the authors compute the sentence similarity for the Arabic language, and the proposed measure achieved a correlation of the order of 0.92.

In this paper, we are interested in generalizing the proposal of [21] to the English language using the WordNet and VerbNet databases. WordNet is used to obtain the synonyms of the sentence words, and we also benefit from the VerbNet database, because it is considered the best resource providing the adequate properties of semantic arguments in terms of semantic class and thematic role.

This paper is outlined as follows. First, we present the best-known hybrid approaches. Second, we describe the proposed method for measuring sentence similarity based on semantic arguments, notably the semantic class and the thematic role. Then, we present the benchmarks used for studying the performance of our method compared to competitive methods. After that, we describe and interpret the results obtained on the Li et al. dataset [12] and the Microsoft Paraphrase Corpus [4]. Finally, we provide a conclusion and perspectives for future research.

2 Hybrid Similarity Measures

Several hybrid methods have already been proposed to measure the similarity between sentences. Related work can roughly be classified into three major categories: word order-based similarity, part of speech-based similarity and syntactic dependency-based similarity.

2.1 Word Order-based Similarity

A method for measuring the semantic similarity between sentences based on semantic and word order information, named STATIS, was presented in [12]. First, the semantic similarity is derived from a lexical knowledge base and a corpus. Second, the method considers the impact of the word order on sentence meaning: the derived word order similarity measures the number of different words as well as the number of word pairs appearing in a different order.

The authors of [6] presented a method called Semantic Text Similarity (STS). This method determines the similarity of two sentences from a combination of semantic and syntactic information. They considered two mandatory functions (string similarity and semantic word similarity) using corpus-based measures to calculate the semantic similarity. Moreover, they took into account an optional function, the common-word order similarity, to compute the syntactic similarity.

2.2 Part of Speech-based Similarity

The authors of [1] presented an approach that combines a corpus-based semantic relatedness measure over the whole sentence with knowledge-based semantic similarity scores obtained for the words falling under the same syntactic roles in both sentences. All the scores, used as features, were fed to machine learning models, like linear regression and bagging models, to obtain a single score giving the degree of similarity between sentences.

A method named FM3S was introduced by [20]. It estimates the similarity between sentences based, first, on the semantic similarity of their words, through the separate processing of verbs and nouns, and, second, on the common word order. The method exploits an IC (Information Content)-based semantic similarity measure in the quantification of noun and verb semantic similarity, and it is the first to include nouns and verb tenses in the similarity measure between two sentences.

2.3 Syntactic Dependency-based Similarity

Oliva et al. [17] reported on a method called SyMSS to compute sentence similarity. The method considers that the meaning of a sentence is made up of the meanings of its separate words and the structural way the words are combined. The semantic information is obtained from the WordNet lexical database and the syntactic


information is obtained through a deep process that finds the syntactic structure of each sentence. With this syntactic information, SyMSS measures the semantic similarity between terms with the same syntactic role.

The authors of [3] introduced a method to assess the semantic similarity between sentences which relies on the assumption that the meaning of a sentence is captured by its syntactic constituents and the dependencies between them. They obtain both the constituents and their dependencies from a syntactic parser. The algorithm considers that two sentences have the same meaning if there is a good mapping between their chunks and if the chunk dependencies in one text are preserved in the other. Moreover, the algorithm considers that every chunk has a different importance with respect to the overall meaning of a sentence, which is computed based on the information content of the words in the chunk.

3 The Proposed Method

The suggested method aggregates three modules, namely lexical similarity, semantic similarity and syntactico-semantic similarity. Figure 1 gives an overview of the suggested method.

The lexical similarity is computed, after removing punctuation signs and determining the lemma of each word using a stemmer, from the common lemmas between sentences using the Jaccard coefficient [7]. The choice of the Jaccard coefficient is explained by its simplicity. The lexical similarity function between sentences S1 and S2 is defined as follows:

LexSim(S1, S2) = CW(S1, S2) / (WS1 + WS2 − CW(S1, S2)),   (1)

where WSi refers to the number of words of sentence Si and CW(S1, S2) is the number of common words between the pair of sentences.

The semantic similarity is measured based on semantic vectors. For this module, we first determine the joint word set that contains the distinct words of the two sentences. Then, each sentence is represented over the joint word set by a so-called semantic vector, as follows. Each element of this vector corresponds to a word in the joint word set, and its value is determined by the semantic similarity of the corresponding word to a word in the sentence. Let us take sentence S1 as an example:

— Case 1: if the word Wi of the joint word set appears in sentence S1, then the corresponding cell value of the semantic vector of S1 equals 1.

— Case 2: if Wi does not appear in sentence S1, then a semantic similarity score is calculated between Wi and each word of S1, and the cell receives the highest of these scores.

This semantic similarity score between words is calculated from the common synonyms of the two words, with the assistance of the WordNet database [15], using the Jaccard coefficient [7]. The semantic similarity between two words Wi and Wj is computed as follows:

SSim(Wi, Wj) = CS(Wi, Wj) / (SWi + SWj − CS(Wi, Wj)),   (2)

where SWi and SWj are the numbers of synonyms of each word and CS(Wi, Wj) returns the number of common synonyms between Wi and Wj.

The last step in this module is to calculate the semantic similarity between the sentences from the generated semantic vectors corresponding to S1 and S2, using the cosine similarity. The semantic similarity between two sentences S1 and S2 is computed as follows:

SemSim(S1, S2) = (Σi V1i × V2i) / (√(Σi V1i²) × √(Σi V2i²)),   (3)

where V1i and V2i are the components of the vectors V1 and V2 corresponding to S1 and S2, respectively. The cell value assigned in Case 2 above is thus that of the most similar word, i.e., the one with the highest similarity score:

Sim(Wi, S1) = max over Wj in S1 of SSim(Wi, Wj).   (4)

Using the illustrative example below, Table 1 shows the computation of SemSim(S1, S2):

S1 = "The car was destroyed by a tree".



Fig. 1. The proposed sentence similarity measure

S2 = "The falling branch crumpled the automobile".

The verbs "was destroyed" and "crumpled" are reduced to the lemma forms "destroy" and "crump". The word "falling" is eliminated.

Moreover, the syntactico-semantic similarity is computed from the common semantic arguments between the sentences, in terms of thematic role and semantic class, using the Jaccard coefficient [7]. This idea is considered original because it has not been employed in former research in the literature. The syntactico-semantic computation process starts by determining the syntactic structure of each sentence using the Stanford parser [2]. Then, the semantic predicate is defined for each sentence by means of a linguistic expert. The determination of the semantic arguments takes the relation between the syntactic structure and the semantic predicate into consideration using VerbNet.

In fact, VerbNet [10] is the largest online verb lexicon currently available for English. It is organized into verb classes extending the Levin classes through refinement and addition of subclasses to achieve syntactic and semantic coherence among the members of a class. Each verb class in VerbNet is completely described by thematic roles, selectional restrictions on the arguments, and frames consisting of a syntactic description and semantic predicates with a temporal function.

The syntactico-semantic similarity function between sentences S1 and S2 is determined as follows:

SSSim(S1, S2) = CSArg(S1, S2) / (SArgS1 + SArgS2 − CSArg(S1, S2)),   (5)

where SArgSi refers to the number of semantic arguments of sentence Si and CSArg(S1, S2) is the number of common semantic arguments between the pair of sentences.

Having applied this procedure to the example sentence pair above, the semantic arguments for S1 and S2 are respectively SArgS1 and SArgS2:

1. SArgS1 = (patient, inanimate), (instrument, inanimate),



Table 1. Example of computing SemSim(S1,S2)

                car  destroy  tree  branch  crump  automobile
S1  car          1                                      1
    destroy            1
    tree                       1
    V1           1     1       1      0       0        1
S2  branch                            1
    crump                                     1
    automobile   1                                      1
    V2           1     0       0      1       1        1

SemSim(S1, S2) = 0.5
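The computation behind Table 1 can be reproduced with the following sketch. The word-to-word similarity is reduced to a hardcoded toy synonymy test (car/automobile) rather than the WordNet-based score of equation (2):

```python
import math

def semantic_vector(sentence, joint_words, word_sim):
    """Value 1 if the joint-set word occurs in the sentence, otherwise the
    best word-to-word similarity score, as in equation (4)."""
    return [1.0 if w in sentence
            else max(word_sim(w, v) for v in sentence)
            for w in joint_words]

def cosine(v1, v2):
    """Equation (3): cosine similarity between two semantic vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) *
                  math.sqrt(sum(b * b for b in v2)))

# Toy synonymy: only car/automobile are treated as synonyms.
def word_sim(w, v):
    return 1.0 if {w, v} == {"car", "automobile"} else 0.0

s1 = ["car", "destroy", "tree"]
s2 = ["branch", "crump", "automobile"]
joint = ["car", "destroy", "tree", "branch", "crump", "automobile"]

v1 = semantic_vector(s1, joint, word_sim)  # [1, 1, 1, 0, 0, 1], as in Table 1
v2 = semantic_vector(s2, joint, word_sim)  # [1, 0, 0, 1, 1, 1]
print(cosine(v1, v2))                      # 2 / (2 * 2) = 0.5
```

The dot product of V1 and V2 is 2 and each vector has norm 2, which yields the SemSim(S1, S2) = 0.5 value reported in Table 1.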

2. SArgS2 = (experiencer, inanimate), (theme, inanimate),

3. SynSemSim(S1, S2) = 0.

Finally, the similarity between two sentences is obtained by aggregating the three components defined above, namely LexSim, SemSim and SynSemSim. The aggregation combines them in a linear function, using supervised learning, in particular the SMO (Sequential Minimal Optimization) algorithm, to tune the contribution of each component to the final score:

Sim(S1, S2) = α · A + β · B + γ · C,   (6)

where A denotes LexSim(S1, S2), B denotes SemSim(S1, S2) and C denotes SynSemSim(S1, S2). The components SemSim and SynSemSim make the main contribution to the final score, because β and γ > α.

4 Assessment Benchmarks

The sentence similarity measure proposed in the above section is evaluated in two ways. The first is the study of the correlation between the computed similarity values of sentence pairs and expert judgements. The second is the integration of the sentence similarity measure in a particular application, such as paraphrase detection.

4.1 Benchmarks

In this subsection, we present the benchmarks used by the previous measures listed in the second section: the Li et al. dataset [12] and the Microsoft Paraphrase Corpus (MSPC) [4].

4.1.1 Li et al. Dataset

This dataset, created by Li et al. [12], takes a set of 65 noun pairs, replaces the nouns with their dictionary definitions collected from the Collins Cobuild Dictionary, and had 32 human participants rate the similarity in meaning of each sentence pair on a scale of 0.0 to 4.0. The averaged similarity scores are heavily skewed toward the low end of the scale, with 46 pairs rated from 0.0 to 0.9 and 19 pairs rated from 1.0 to 4.0. To obtain a more even distribution across the similarity range, a subset of 30 sentence pairs was selected, consisting of all 19 sentence pairs rated from 1.0 to 4.0 and 11 pairs taken at equally spaced intervals from the 46 pairs rated from 0.0 to 0.9. Unlike the MSPC described below, where the task is binary classification, this dataset is used to compare the correlation with human ratings.

4.1.2 Microsoft Paraphrase Corpus (MSPC)

The Microsoft Research Paraphrase Corpus [4] consists of 5801 sentence pairs, 3900 of which were labeled as paraphrases by human annotators. The corpus is divided into a training set (4076 pairs) and a test set (1725 pairs). The average number of words per sentence (sentence length) in this corpus is 17. MSPC is by far the largest publicly available paraphrase-annotated corpus and has been used extensively over the last decade.

4.2 Evaluation Metrics

In this subsection, we present the metrics used to evaluate the performance of the hybrid approaches: Pearson's and Spearman's coefficients, used on the Li et al. dataset [12], and recall, precision and F-measure, used on the Microsoft Research Paraphrase Corpus [4].

4.2.1 Pearson's Coefficient

Pearson's coefficient indicates how well the results of a measure agree with human judgements. Pearson's coefficient, denoted r, is computed as follows:

r = (n Σ xi yi − (Σ xi)(Σ yi)) / (√(n Σ xi² − (Σ xi)²) × √(n Σ yi² − (Σ yi)²)),   (7)

where xi corresponds to the i-th element in the list of human judgements, yi corresponds to the i-th element in the list of computed sentence similarities and n corresponds to the number of sentence pairs.

4.2.2 Spearman's Coefficient

The ranking produced by the sentence similarity measure is compared to the one produced on the basis of human judgements. Spearman's coefficient, denoted p, is computed as follows:

p = 1 − (6 Σ di²) / (n(n² − 1)),   (8)

where di corresponds to the difference between the ranks of xi and yi.

4.2.3 Recall

The recall measure is the number of correctly detected paraphrases divided by the number of paraphrases in the dataset:

Recall = D / E,   (9)

where D corresponds to the number of pairs correctly annotated as paraphrases by the measure and E denotes the number of paraphrases in the dataset.

4.2.4 Precision

The precision measure is the number of correctly detected paraphrases divided by the number of returned paraphrases:

Precision = D / F,   (10)

where F corresponds to the number of pairs annotated as paraphrases by the measure.

4.2.5 F-measure

The F-measure combines the precision and recall values and expresses a trade-off between these two measures:

F-measure = 2 × (Precision × Recall) / (Precision + Recall).   (11)

5 Experiments and Results

This section presents the results obtained on the two employed benchmarks, the Li et al. dataset and MSPC. All the experiments are performed using the parameters α = 0.2, β = 0.45 and γ = 0.35, which were determined empirically with respect to equation (6).
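As an illustrative sketch of equations (5) and (6), the snippet below computes the argument-overlap similarity for the running example and then aggregates the three components with the weights reported above; the LexSim and SemSim values passed in are made-up placeholders, not outputs of the full pipeline:

```python
def syn_sem_sim(args1, args2):
    """Equation (5): Jaccard coefficient over (thematic role, semantic class) pairs."""
    common = len(args1 & args2)
    if common == 0:
        return 0.0
    return common / (len(args1) + len(args2) - common)

# Semantic arguments of the running example (Section 3).
sargs_s1 = {("patient", "inanimate"), ("instrument", "inanimate")}
sargs_s2 = {("experiencer", "inanimate"), ("theme", "inanimate")}

def sim(lex, sem, synsem, alpha=0.2, beta=0.45, gamma=0.35):
    """Equation (6): linear aggregation with the empirically tuned weights."""
    return alpha * lex + beta * sem + gamma * synsem

c = syn_sem_sim(sargs_s1, sargs_s2)     # no common argument pair -> 0.0
print(c)
print(sim(lex=0.0, sem=0.5, synsem=c))  # 0.2*0 + 0.45*0.5 + 0.35*0 = 0.225
```

Note that the thematic roles differ between the two sentences even though the semantic classes match, so the argument pairs share no element and SynSemSim is 0 for this pair, as stated in Section 3.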



Fig. 2. Comparison between the STS, FM3S and our proposed measures using F-measure values and varying the threshold θ on the MSPC dataset

Fig. 3. Precision and recall of our measure applied on the MSPC dataset
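The threshold-based evaluation behind Figures 2 and 3 can be sketched as follows. The similarity scores and gold labels are invented toy data, and a pair is predicted as a paraphrase when its score reaches the threshold θ:

```python
def evaluate(scores, gold, theta):
    """Precision, recall and F-measure (equations (9)-(11)) at threshold theta."""
    predicted = [s >= theta for s in scores]
    d = sum(p and g for p, g in zip(predicted, gold))  # correctly found paraphrases
    f = sum(predicted)                                 # pairs flagged as paraphrases
    e = sum(gold)                                      # paraphrases in the dataset
    precision = d / f if f else 0.0
    recall = d / e if e else 0.0
    fmeasure = (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)
    return precision, recall, fmeasure

# Toy similarity scores and gold paraphrase labels.
scores = [0.95, 0.85, 0.72, 0.40, 0.91]
gold   = [True, True, False, False, True]
print(evaluate(scores, gold, theta=0.9))  # (1.0, 0.666..., 0.8)
```

Raising θ typically trades recall for precision, which is why the curves in Figures 2 and 3 are reported as a function of the threshold.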

5.1 Experiment with the Li et al. Dataset

Table 2 shows the Pearson (r) and Spearman (p) correlation coefficients obtained by different measures on the Li et al. dataset. Our method achieved the best results compared to the other computational methods.

Our proposal could reach results that approximate 100%, but unfortunately 28 sentence pairs in the Li et al. dataset contain the verb



"to be", which negatively affects the contribution of the SynSemSim(S1, S2) component to the final similarity score.

Table 2. Results obtained using the Li et al. dataset

                  r      p
STATIS [12]     0.81   0.81
STS [6]         0.85   0.85
SyMSS [17]      0.76   0.71
FM3S [20]       0.76   0.79
Our proposal    0.87   0.87

5.2 Experiment with MSPC

Due to the lack of published research presenting results on the MSPC dataset, our method is compared only to the STS [6] and FM3S [20] measures. Our method provides results at thresholds θ ∈ [0.7, 1.0], as shown in Figure 2. For the paraphrase recognition task, our proposal outperforms the other measures, mainly at high thresholds θ ∈ [0.7, 1.0]. These results provide strong support for the utility of sentence features such as semantic arguments and their properties in the process of computing sentence similarity.

Our proposal yielded competitive results compared to the FM3S method on both the training data and the test data. Moreover, the F-measure values obtained with the STS approach over the same interval tend towards 0. This provides further support for the efficiency of our method.

The results illustrated in Figure 3 show the precision and recall of the proposed measure. Precision reached a peak at θ = 0.9, with a value of 0.742 for the training and test datasets. This demonstrates that the sentence pairs judged as highly similar by our measure are qualified as paraphrases in the MSPC dataset.

6 Conclusion and Future Works

The proposed measure determines the similarity between two sentences from the semantic and syntactico-semantic information they contain. The aggregation function Sim(S1, S2), presented in equation (6), combines the common words, the synonymous words and the properties of the semantic arguments in a linear way. Our proposal is based on word semantic similarity: it exploits WordNet to determine the synonyms of the words, compared using the Jaccard measure. The word semantic similarity scores are arranged in semantic vectors, from which the sentence semantic similarity is obtained using the cosine similarity. The proposed method also takes into account the common semantic argument properties, notably the semantic class and the thematic role, using the VerbNet database. Our method yielded competitive results compared to other computational methods, for instance on the Li et al. dataset.

Due to the promising performance of this measure, it can be applied to other applications, such as plagiarism detection.

References

1. Aggarwal, N., Asooja, K., & Buitelaar, P. (2012). DERI&UPM: Pushing corpus based relatedness to similarity: Shared task system description. Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, Association for Computational Linguistics, pp. 643–647.

2. Chen, D. & Manning, C. D. (2014). A fast and accurate dependency parser using neural networks. Proceedings of EMNLP, pp. 740–750.

3. Ştefănescu, D., Banjade, R., & Rus, V. (2014). A sentence similarity method based on chunking and information content. Proceedings of the 15th International Conference on Computational Linguistics and Intelligent Text Processing, LNCS Vol. 8403, Springer, pp. 442–453.

4. Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. Proceedings of the 20th International Conference on Computational Linguistics, Association for Computational Linguistics, p. 350.



5. Hirst, G. & St-Onge, D. (1997). Lexical chains as representations of context for the detection and correction of malapropisms.

6. Islam, A. & Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data (TKDD), 2(2):10.

7. Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Impr. Corbaz.

8. Jiang, J. J. & Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008.

9. Khemakhem, A., Gargouri, B., Hamadou, A. B., & Francopoulou, G. (2016). ISO standard modeling of a large Arabic dictionary. Natural Language Engineering, 22(6):849–879.

10. Kipper, K., Korhonen, A., Ryant, N., & Palmer, M. (2006). Extending VerbNet with novel verb classes. Proceedings of LREC, volume 2006, page 1. Citeseer.

11. Leacock, C. & Chodorow, M. (1998). Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database, 49(2):265–283.

12. Li, Y., McLean, D., Bandar, Z. A., O'Shea, J. D., & Crockett, K. (2006). Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8):1138–1150.

13. Mandreoli, F., Martoglia, R., & Tiberio, P. (2002). A syntactic approach for searching similarities within sentences. Proceedings of the Eleventh International Conference on Information and Knowledge Management, CIKM '02, New York, NY, USA, ACM, pp. 635–637.

14. Mihalcea, R., Corley, C., & Strapparava, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity. AAAI, volume 6, pp. 775–780.

15. Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

16. Nirenburg, S., Domashnev, C., & Grannes, D. J. (1993). Two approaches to matching in example-based machine translation. Proc. of the 5th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-93), pp. 47–57.

17. Oliva, J., Serrano, J. I., del Castillo, M. D., & Iglesias, Á. (2011). SyMSS: A syntax-based measure for short-text semantic similarity. Data & Knowledge Engineering, 70(4):390–405.

18. Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. arXiv preprint cmp-lg/9511007.

19. Šarić, F., Glavaš, G., Karan, M., Šnajder, J., & Bašić, B. D. (2012). TakeLab: Systems for measuring semantic text similarity. Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, Association for Computational Linguistics, pp. 441–448.

20. Taieb, M. A. H., Aouicha, M. B., & Bourouis, Y. (2015). FM3S: Features-based measure of sentences semantic similarity. International Conference on Hybrid Artificial Intelligence Systems, pp. 515–529.

21. Wali, W., Gargouri, B., & Hamadou, A. B. (2016). Using sentence semantic similarity to improve LMF standardized Arabic dictionary quality. Computational Linguistics and Natural Language Processing.

Article received on 26/12/2016; accepted on 23/02/2017. Corresponding author is Wafa Wali.
