WSL: Sentence Similarity Using Semantic Distance Between Words

Naoko Miura, Tomohiro Takagi
Meiji University, Japan
1-1-1 Higashi-Mita, Tama-ku, Kawasaki-shi, Kanagawa 214–8571
E-mail: {n_miura,takagi}@cs.meiji.ac.jp

Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 128–131, Denver, Colorado, June 4-5, 2015. © 2015 Association for Computational Linguistics

Abstract

A typical social networking service contains huge amounts of data, and analyzing this data at the level of the sentence is important. In this paper, we describe our system for a SemEval2015 semantic textual similarity task (task2). We present our approach, which uses edit distance to consider word order, and introduce word appearance in context. We report the results from SemEval2015.

1 Introduction

The Internet, particularly sites related to social networking services (SNS), contains a vast array of information used for a variety of purposes. The vector space model is conventionally used for natural language processing. This model creates vectors on the basis of frequency of word appearance and co-occurring words, without taking word order into account. When it comes to short texts, word co-occurrence is rare (or even non-existent), and the number of words is often less than in a typical newspaper article. Because the average SNS contains data consisting mostly of short sentences, the vector space model is not the best choice.

In this work, we describe a system we developed and submitted to SemEval2015. In the proposed system, we compute sentence similarity using edit distance to consider word order along with the semantic distance between words. We also introduce word appearance in context.

The rest of this paper is organized as follows. Section 2 reviews related work, and in Section 3 we present the three systems we submitted for SemEval2015. In Section 4, we discuss the results of our evaluation at SemEval2015. We conclude in Section 5 with a brief summary.

2 Related Work

Recent research has introduced the lexical database as a dictionary to analyze short texts (Aziz et al., 2010). Aziz uses a set of similar noun phrases, similar verb phrases, and common words to compute sentence similarity. Li combines word order with semantic similarity between words derived from a hierarchical semantic knowledge base (Li et al., 2006). There are currently a few hierarchical semantic knowledge bases available, one of which is WordNet (Miller, 1995). WordNet contains 155,287 words and 117,659 synsets that, as of 2012, were organized into the lexical categories of nouns, verbs, adjectives, and adverbs (WordNet Statistics, 2014). All synsets have semantic relations to other synsets. An example for nouns is shown in Fig. 1.

Fig. 1. Hierarchical semantic knowledge base (Li et al., 2006).

Li proposed a formula to measure the similarity $s(w_1, w_2)$ between words $w_1$ and $w_2$ as

$$ s(w_1, w_2) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}, \tag{1} $$

where $l$ is the shortest path length between $w_1$ and $w_2$ and $h$ is the depth of the subsumer of $w_1$ and $w_2$ in WordNet. For example, consider the path between "boy" and "girl" in Fig. 1. The shortest path is boy-male-person-female-girl, which has length 4, so $l = 4$. The subsumer of "boy" and "girl" is "person, human...", so the depth of this synset is $h$. In hierarchical semantic nets, words at the upper layers have a more general meaning and less similarity than words at the lower layers. Li sets $\alpha = 0.2$ and $\beta = 0.45$.

Not only the similarity between words but also word order is important. For example, the two sentences "a dog bites Mike" and "Mike bites a dog" consist of the same words, but the meanings are very different. In this case, we use vectors such that when each vector completely matches, the sentence similarity is high. Our approach is based on edit distance to take word order into account, combined with semantic similarity between words.

3 System Details

The proposed system uses edit distance to take word order into account. It also uses the impact of word appearance in each context.

In this paper, we describe sentence $S_1$ as $S_1 = \{a_1, a_2, \ldots, a_n\}$ and sentence $S_2$ as $S_2 = \{b_1, b_2, \ldots, b_m\}$. $S_1$ consists of $n$ words and $S_2$ consists of $m$ words. $a_i$ is the $i$-th word of $S_1$ and $b_j$ is the $j$-th word of $S_2$. We describe the similarity $Sim(S_1, S_2)$ between $S_1$ and $S_2$ within the range of 0 (no relation) to 1 (semantic equivalence).

3.1 Edit Distance

Edit distance is a way of computing the dissimilarity between two strings. Conventionally, the distance is computed over sequences of characters with three kinds of operations (substitution, insertion, deletion). Our approaches, however, operate on sequences of words. Here, we describe the two kinds of edit distance extended in our system.

3.1.1 Levenshtein Distance

The Levenshtein distance between $S_1$ and $S_2$ ($|S_1| = n$, $|S_2| = m$) is $L(n, m)$, where, for $0 \le i \le n$ and $0 \le j \le m$,

$$ L(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min \begin{cases} L(i-1,j) + 1 \\ L(i,j-1) + 1 \\ L(i-1,j-1) + c_1(a_i,b_j) \end{cases} & \text{otherwise.} \end{cases} \tag{2} $$

The indicator function $c_1(a_i, b_j)$ is defined as

$$ c_1(a_i,b_j) = \begin{cases} 0 & (a_i = b_j) \\ 1 & (a_i \neq b_j) \end{cases} \tag{3} $$

3.1.2 Jaro-Winkler Distance

The Jaro distance between $S_1$ and $S_2$ ($|S_1| = n$, $|S_2| = m$) is $d_j$:

$$ d_j = \begin{cases} 0 & (q = 0) \\ \dfrac{1}{3}\left(\dfrac{q}{n} + \dfrac{q}{m} + \dfrac{q - t}{q}\right) & (q \neq 0) \end{cases} \tag{4} $$

where $q$ is the number of matching words between $S_1$ and $S_2$. We consider two words as matching when they are the same and their positions are not farther apart than $\left\lfloor \frac{\max(n,m)}{2} \right\rfloor - 1$. $t$ is half the number of transpositions.

The Jaro-Winkler distance is $d_w$:

$$ d_w = \begin{cases} d_j & \text{if } d_j < 0.7 \\ d_j + (k \cdot p \cdot (1 - d_j)) & \text{otherwise,} \end{cases} \tag{5} $$

where $k$ is the number of words the two sentences have in common at the start of the sentence. $p$ is a constant and is usually set to $p = 0.1$.

3.2 Semantic Distance

We borrow our approach to compute similarity between words from Li (Li et al., 2006) (Eq. (1)). It can be used for both nouns and verbs because both are organized into hierarchies. However, it is not available for adjectives and adverbs, which are not organized into hierarchies. Therefore, in addition to Eq. (1), when $w_1 \in synset_A$ and $w_2 \in synset_B$, we define semantic similarity between words that are adverbs or adjectives as

$$ s(w_1,w_2) = \begin{cases} 1 & (synset_A = synset_B) \\ 0 & (synset_A \neq synset_B) \end{cases} \tag{6} $$

That is, $s(w_1, w_2)$ is 1 if $w_1$ and $w_2$ are in the same synset.

Conventionally, edit distance is calculated on the basis of match or mismatch between words and ignores how similar two words are. With this approach, if two different words have the same meaning (e.g., "fall" and "autumn"), edit distance treats them as a mismatch. We address this issue by introducing semantic similarity between words as distance.

(a) Levenshtein distance

We rewrite Eq. (3) as

$$ c_1(a_i,b_j) = 1 - s(a_i,b_j) \tag{7} $$

We propose a measure for the sentence similarity of $S_1$ and $S_2$, $Sim(S_1, S_2)$, as

$$ Sim(S_1,S_2) = 1.0 - \frac{L(n,m)}{\max(n,m)} \tag{8} $$

(b) Jaro-Winkler distance

We rewrite the Jaro distance $d_j$ defined by Eq. (4) as

$$ d_j = \begin{cases} 0 & (q' = 0) \\ \dfrac{1}{3}\left(\dfrac{q'}{n} + \dfrac{q'}{m} + \dfrac{q' - t}{q'}\right) & (q' \neq 0) \end{cases} \tag{9} $$

We define $q'$ in Eq. (10). $q'$ indicates the sum of all semantic similarities between words in $S_1$ and $S_2$, that is, the sum of $c_2(a_i, b_j)$ over $1 \le i \le n$, $1 \le j \le m$. Further, $t$ was originally calculated only over words that match exactly ($a_i = b_j$); in our proposed method we change this condition to $s(a_i, b_j) > 0.5$ to take the semantic similarity of words into account.

3.3 The Impact of Word Appearance in Context

One issue arises when we compute $Sim(S_1, S_2)$. Let us consider two sentences: "I ate an apple" and "I hate an apple". These sentences indicate opposite meanings. However, except for "ate" and "hate", both sentences consist of the same words and have the same word order. Therefore, the method described above (Eq. (8)) computes $Sim(S_1, S_2)$ as high, even though we judge that the sentences have opposite meanings because of "ate" and "hate". For this reason, we introduce conditional probability to estimate word appearance for each context and extract the probabilities from a corpus as training data. Further, we apply this word appearance to the semantic similarity (Eq. (1)) as a weight.

Let us show an example. $P(\text{I} \mid S_2)$, $P(\text{ate} \mid S_2)$, $P(\text{an} \mid S_2)$, and $P(\text{apple} \mid S_2)$ are the probabilities that the words of $S_1$ appear in context $S_2$. We define $S^{*}$ as the set of nouns, verbs, adjectives, and adverbs in $S$ (e.g., when sentence $S$ is "It is a dog", $S^{*}$ is {"is", "dog"}). We measure each word appearance $weight(w)$ in context $S$ as:

$$ P(w \mid S) = \frac{doc_{w,S^{*}}}{doc_{S^{*}}} \tag{13} $$

$$ weight(w) = \frac{1}{1 + e^{-P(w \mid S)}}, \tag{14} $$

where $doc_{w,S^{*}}$ is the number of documents that contain both $w$ and $S^{*}$ and $doc_{S^{*}}$ is the number of documents that contain $S^{*}$.
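As an illustration of Eq. (1), the following minimal sketch evaluates Li's word similarity from a given path length l and subsumer depth h. The fraction in Eq. (1) is the hyperbolic tangent of βh, so the sketch uses `math.tanh`. The function name `li_similarity` is hypothetical, and looking up l and h in WordNet is deliberately left out; the depth used in the example call is an assumed value, not one taken from Fig. 1.

```python
import math

ALPHA = 0.2   # weight on path length, as set by Li et al. (2006)
BETA = 0.45   # weight on subsumer depth, as set by Li et al. (2006)


def li_similarity(path_length: float, subsumer_depth: float,
                  alpha: float = ALPHA, beta: float = BETA) -> float:
    """Eq. (1): exp(-alpha * l) * tanh(beta * h).

    path_length    -- l, shortest path length between the two words
    subsumer_depth -- h, depth of their subsumer in the hierarchy
    """
    return math.exp(-alpha * path_length) * math.tanh(beta * subsumer_depth)


# "boy" and "girl": l = 4 as in the text; h = 2 is an assumed depth
# chosen only for illustration.
example = li_similarity(path_length=4, subsumer_depth=2)
```

The score shrinks as the path gets longer and grows as the subsumer sits deeper in the hierarchy, which matches the observation that words in upper layers are more general and less similar.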
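The word-level Levenshtein distance of Eq. (2), with the substitution cost of Eq. (7) and the sentence similarity of Eq. (8), can be sketched as below. This is a minimal sketch under stated assumptions rather than the submitted system: the function names, the `toy_similarity` synonym table standing in for the WordNet-based s(a, b), and the value returned for two empty sentences are all hypothetical.

```python
from typing import Callable, Sequence

WordSim = Callable[[str, str], float]


def exact_match(a: str, b: str) -> float:
    """Similarity that reproduces the indicator cost of Eq. (3)."""
    return 1.0 if a == b else 0.0


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for s(a, b): one hand-written synonym pair."""
    synonyms = {frozenset({"fall", "autumn"})}
    return 1.0 if a == b or frozenset({a, b}) in synonyms else 0.0


def word_levenshtein(s1: Sequence[str], s2: Sequence[str],
                     sim: WordSim = exact_match) -> float:
    """Eq. (2) over words, with substitution cost 1 - s(a_i, b_j) (Eq. (7))."""
    n, m = len(s1), len(s2)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if min(i, j) == 0:
                table[i][j] = float(max(i, j))
            else:
                cost = 1.0 - sim(s1[i - 1], s2[j - 1])
                table[i][j] = min(table[i - 1][j] + 1,        # deletion
                                  table[i][j - 1] + 1,        # insertion
                                  table[i - 1][j - 1] + cost) # substitution
    return table[n][m]


def sentence_similarity(s1: Sequence[str], s2: Sequence[str],
                        sim: WordSim = exact_match) -> float:
    """Eq. (8): 1 - L(n, m) / max(n, m)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty sentences count as identical
    return 1.0 - word_levenshtein(s1, s2, sim) / longest
```

With `exact_match`, "I ate an apple" and "I hate an apple" differ by one substitution out of four words, which is the high score that Section 3.3 sets out to correct.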
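The baseline word-level Jaro and Jaro-Winkler distances of Eqs. (4) and (5) can be sketched as follows, using exact word matches only (the semantic variant of Eq. (9) depends on Eq. (10) and is not attempted here). The function names are hypothetical. Capping the common prefix k at 4 words is an assumption borrowed from the usual character-level Jaro-Winkler definition so that the result stays at or below 1; the text above does not state a cap.

```python
from typing import Sequence


def jaro_words(s1: Sequence[str], s2: Sequence[str]) -> float:
    """Eq. (4): Jaro distance over words with exact matching."""
    n, m = len(s1), len(s2)
    if n == 0 or m == 0:
        return 0.0
    # Words match only if equal and at most this many positions apart.
    window = max(max(n, m) // 2 - 1, 0)
    used1 = [False] * n
    used2 = [False] * m
    q = 0
    for i, word in enumerate(s1):
        for j in range(max(0, i - window), min(m, i + window + 1)):
            if not used2[j] and s2[j] == word:
                used1[i] = used2[j] = True
                q += 1
                break
    if q == 0:
        return 0.0
    matched1 = [w for w, flag in zip(s1, used1) if flag]
    matched2 = [w for w, flag in zip(s2, used2) if flag]
    # t is half the number of matched words that appear in a different order.
    t = sum(x != y for x, y in zip(matched1, matched2)) / 2
    return (q / n + q / m + (q - t) / q) / 3


def jaro_winkler_words(s1: Sequence[str], s2: Sequence[str],
                       p: float = 0.1, max_prefix: int = 4) -> float:
    """Eq. (5): boost d_j by the shared sentence prefix when d_j >= 0.7."""
    d_j = jaro_words(s1, s2)
    if d_j < 0.7:
        return d_j
    k = 0
    for x, y in zip(s1, s2):
        if x != y or k == max_prefix:  # max_prefix is an assumed cap
            break
        k += 1
    return d_j + k * p * (1 - d_j)
```

For "a dog bites Mike" and "Mike bites a dog" the matching window is one position, so only "bites" matches and the score stays low, reflecting the changed word order.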
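The word appearance weight of Eqs. (13) and (14) can be sketched on a toy corpus as below. Several points are assumptions made for illustration: Eq. (14) is read as the logistic function of P(w | S); a document "contains S*" when it contains every word of S*; each document is represented as a set of lowercased words; S* is supplied by the caller rather than obtained from a part-of-speech tagger; and the probability is taken as 0 when no document contains S*. All names are hypothetical.

```python
import math
from typing import Iterable, List, Set


def appearance_probability(word: str, content_words: Set[str],
                           documents: Iterable[Set[str]]) -> float:
    """Eq. (13): doc_{w,S*} / doc_{S*}, counted over a document collection."""
    docs_with_context: List[Set[str]] = [
        doc for doc in documents if content_words <= doc
    ]
    if not docs_with_context:
        return 0.0  # assumption: unseen context gives probability 0
    docs_with_both = sum(1 for doc in docs_with_context if word in doc)
    return docs_with_both / len(docs_with_context)


def appearance_weight(probability: float) -> float:
    """Eq. (14), read here as 1 / (1 + exp(-P(w | S)))."""
    return 1.0 / (1.0 + math.exp(-probability))


# Toy corpus, invented for illustration only.
corpus = [
    {"i", "hate", "an", "apple"},
    {"an", "apple", "a", "day"},
    {"i", "ate", "an", "apple", "pie"},
]

# S* for "I hate an apple", assuming a tagger kept "hate" and "apple".
context = {"hate", "apple"}
p_ate = appearance_probability("ate", context, corpus)
w_ate = appearance_weight(p_ate)
```

Under this reading the weight ranges from 0.5 (the word never co-occurs with the context) up to about 0.73 (it always does), so a word such as "ate" that is unlikely in the context of "I hate an apple" contributes less to the weighted similarity.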
