
JOURNAL OF SOFTWARE, VOL. 8, NO. 6, JUNE 2013 1451

Assessing Sentence Similarity Using WordNet-based Similarity

Hongzhe Liu Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing, China Email: [email protected]

Pengfei Wang Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing, China Email: [email protected]

Abstract—Sentence similarity assessment is key to most NLP applications. This paper presents a means of calculating the similarity between very short texts and sentences without using an external corpus of literature. The method uses WordNet, a common-sense knowledge base, and human intuition. Results were verified through experiments performed on two sets of selected sentence pairs. We show that this technique compares favorably to other word-based similarity measures and is flexible enough to allow the user to make comparisons without any additional dictionary or corpus information. We believe that this method can be applied in a variety of text knowledge representation and discovery applications.

Keywords—Sentence similarity, WordNet, word similarity, natural language processing

I. INTRODUCTION

Many natural language processing applications require that the similarity between very short text paragraphs or sentences be calculated quickly and reliably. The samples, usually pairs of sentences, are considered similar if they are judged to have the same meaning or discuss the same subject. A method that can automatically calculate semantic similarity scores is much more valuable than simple lexical matching for applications such as question answering (QA), information extraction (IE), multi-document summarization, and evaluation of machine translation (MT). Most existing word-similarity-based measures rely not only on an ontology knowledge base like WordNet, but also on large text corpora which serve as additional knowledge resources. However, in many applications, especially domain-based applications, large corpora cannot be expected to be readily available. Many applications store ontology relations in a relational database, which does not fully represent the rich relations embedded in the original text collections. In these cases, the similarities between short texts and sentences have to be extracted from the limited representations in the database only. In this paper, we focus on the challenge of assessing sentence similarities by using the structural information inherent in a given ontology structure.

Any means of assessing text semantic similarity must account for the fact that text has structure. In this project, we began with an approximate model, which we later adapted into a means of assessing sentence-level similarity based on word-level similarity derived from the sentence. We used four similarity measures in our experiments to capture sentence similarities from different aspects.

II. RELATED WORK

The existing sentence similarity evaluation methods can be grouped into five categories: word overlap measures, TF-IDF measures, word-similarity-based measures, word-order-based measures, and combined measures [1]–[6], [14]. Among these, word-based similarity measures provided the best human correlation [15], [16], [17]. There are many means of assessing word-to-word similarity, including distance-oriented measures, knowledge-based measures, and measures from information theory. The main methods of determining word similarity, however, are those proposed by Leacock and Chodorow, Lesk, Wu and Palmer, Resnik, Lin, and Jiang and Conrath [7]–[12]; see [15] for a detailed evaluation of WordNet-based measures of semantic distance. Here, we used sentence semantic similarity measures based on word similarity. We focused our attention on a structure-based word similarity measure (RNCVM) [13], which is based on WordNet. This approach does not require a training corpus or other additional information, which is very important when building new applications, and it produces high-quality results with good human correlation.

III. OUR PROPOSED APPROACH

A. WordNet Structure

In WordNet [20], information is presented in synsets, clusters of words that are considered synonyms or otherwise logically similar. WordNet also includes descriptions of the relationships in the synsets. A given

© 2013 ACADEMY PUBLISHER doi:10.4304/jsw.8.6.1451-1458

word may appear in more than one synset, which is logical considering that words can also appear in multiple parts of speech. The words in each synset are grouped so that they are interchangeable in certain contexts.

The WordNet pointers indicate both lexical and semantic relationships between words. Lexical relationships concern word form, and semantic relationships concern meaning. These include, but are not limited to, hypernymy/hyponymy (hyponymy/troponymy), antonymy, entailment, and meronymy/holonymy.

• Nouns and verbs are organized into hierarchies based on the hypernymy/hyponymy or hyponymy/troponymy relationships between synsets. Some verbs are not organized in hierarchies; these are connected by antonym and similar relationships, like adjectives.

• Adjectives are organized into both head and satellite synsets, each organized around antonymous pairs of adjectives. These two adjectives are considered head synsets. Satellite synsets include words whose meaning is similar to that of a single head adjective.

• Nouns and adjectives that are derived from verbs, and adverbs that are derived from adjectives, have pointers indicating these relations.

Figure 1. WordNet structure

B. Computing Word Similarity

Before introducing our method for calculating sentence similarity, we first describe how word similarity is calculated; this calculation will be used later in our study.

1) Word Similarity in One Hierarchy

We now introduce how to compute the similarity of two words in one hierarchy.

Density and similarity

With regard to tree density, it can be observed that the densities in different parts of the hierarchy are different. The greater the density, the closer the distance between the nodes. For example, the 'plant' section of the knowledge base is very dense, with individual nodes having up to three or four hundred children (collections of generally unpronounceable plant species); it can be argued that the distance between nodes in such a section of the structure should be very small relative to other, less dense regions. That is, in Fig. 2, the similarity value of the left part should be less than the similarity value of the right part of the hierarchy.

Figure 2. Local density effect

Depth and similarity

The deeper in the hierarchy two nodes are located, the higher their similarity. The foundation is that the distance shrinks as one descends the hierarchy, since differentiation is based on finer and finer details [5]. That is, in Fig. 3, the value of sim(C1, C2) should be less than the value of sim(C3, C4).

Figure 3. Depth effect

Path length and similarity

A semantic network includes concepts (usually nouns or noun phrases) that are linked to one another by named relations, for example, the hyper/hyponym relation (the 'is-a' relation) and the hol/meronym relation (the 'part-of' relation). If the semantic network is linked only by the taxonomic 'is-a' relation, it is generally called an 'is-a' semantic network or 'is-a' hierarchy. In this kind of semantic network, a parent concept is more generalized than its child concepts, while a child 'is a kind of' its parent concept.

Rada et al. [19] pointed out that the assessment of similarity in a semantic network can in fact be thought of as involving just the taxonomic 'is-a' relation, and that the simplest measure of the distance between two elemental concept nodes, A and B, is the shortest path that links A and B, i.e., the minimum number of edges that separate A and B. But Jiang and Conrath [11] then pointed out that, in a more realistic scenario, the distances


between any two adjacent nodes are not necessarily equal. It is therefore necessary to consider that the edge connecting two nodes should be weighted. To determine the edge weight automatically, certain aspects should be considered in the implementation. Most of these are typically related to the structural characteristics of a hierarchical network. Some conceivable features are: local network density (the number of child links that span out from a parent node), the depth of a node in the hierarchy, the type of link, and finally, perhaps most important of all, the strength of an edge link. From Rada et al. and from Jiang and Conrath, we can at least state that if a shorter path is contained within a longer path in an 'is-a' taxonomy, the concept node pair with the shorter path between them has greater concept similarity than the pair with the longer path between them. That is, in Fig. 4, the value of sim(C0, C3) should be less than the value of sim(C0, C1).

Figure 4. Path length effect

The premise of our method

The next two definitions are the premise of our method: Definition 1 defines what a hierarchy is, and Definition 2 provides a mechanism through which similarity between concepts can be measured.

Definition 1 (Concept hierarchical model). Denote by H(N, E) a rooted tree, where N is the set of concept nodes (corresponding to the concepts) in the tree and E is the set of edges between the parent/child pairs in H. The semantic coverage of the child concept nodes is a partition of the semantic coverage of their parent concept node.

Figure 5. The concept node types illustration

A concept node is a parent of another concept node if it is one step higher in the hierarchy and closer to the root concept node. Each concept node in a tree has zero or more child concept nodes, which are one step below their parent concept node in the hierarchy. Sibling concept nodes share the same parent concept node. A concept node has at most one parent concept node. Concept nodes that do not have any children are called leaf concept nodes. The topmost concept node in the hierarchy is called the root concept node. Being the topmost concept node, the root concept node has no parent, and it is the symbol of the universe. All concept nodes (except the root concept node) can be reached from the root concept node by following the edges and concept nodes on a path, and all the concept nodes on that path compose the ancestor concept nodes of that concept node. All concept nodes below a particular concept node are called descendants of that concept node. Fig. 5 illustrates the concept node types.

The concept hierarchical model is the premise of our method, and our similarity computation derives from cosine similarity, which relies on the orthogonality of its components, so the semantic coverage of the concept nodes should be independent. We therefore require that the semantic coverage of the child concept nodes be a partition (instead of a covering) of the semantic coverage of their parent concept node. That is, the concepts subsumed by sibling concept nodes are non-overlapping; the relationship between two siblings is captured only through their ancestor concept nodes.

Definition 2 (Concept vector). Given a concept hierarchy model H(N, E) with n concept nodes, the concept vector of a concept node Ci in this hierarchy has n dimensions. Concept node Ci's concept vector is denoted Ci = (v_i,1, v_i,2, ..., v_i,n), where v_i,1, v_i,2, ..., v_i,n are the dimension values of concepts C1, C2, ..., Cn relative to concept Ci. Given two concept nodes and their concept vectors Ci and Cj, their similarity is computed with (1):

sim(Ci, Cj) = (Ci · Cj) / (|Ci| |Cj|)    (1)

Identifying the concept vectors for the concept nodes in the hierarchy

In the traditional corpus-based method, the weight of a concept (the frequency of the concept) is derived from a large text corpus. We discuss a given hierarchy without a large corpus from which to extract frequency information. Therefore, we need mechanisms to derive the weights of concept nodes from the hierarchy itself. Essentially, our concept vectors capture the semantic information inherent but hidden within the structure of the hierarchy, which is the most challenging part of our work.

Consider the document-document similarity computation: documents are represented as vectors, and each dimension of the vector corresponds to a separate term. If a term occurs in a document, its value in the vector is non-zero. Usually a document is represented as a vector, and the frequencies of a cluster of terms appearing in the document are used as the dimension values. Vector operations can be used to compare document-document


similarity. Here, in a concept hierarchy model, the dimension values of each concept can be obtained only from the hierarchical structure. From observation, the density information of each concept node is inherent in, and hidden within, the hierarchy.

Definition 3 (Local density). The density of the root concept node in a given concept hierarchy model is equal to 1; the density of any other concept node equals the number of sibling concept nodes of that concept node plus 1 (itself).

Definition 3 defines the situation of uniform concept node local density. If sibling concept nodes have different densities, these can be obtained from a large text corpus using the traditional method, as in reference [8], where frequencies of concepts in the WordNet taxonomy were estimated using noun frequencies from the Brown Corpus of American English, a large (more than 1,000,000 words) collection of text ranging from news articles to science fiction. Each noun that occurred in the corpus was counted as an occurrence of each taxonomic class containing it. But, as mentioned above, such text corpora are usually hard to obtain in many domain-specific applications (for example, biology and medicine) and in applications that rely on relational databases. Even when a large text corpus is available, this kind of method is slow due to the huge amount of text-statistics work, so we choose to use the uniform density values of Definition 3 as a substitute for the real distribution values.

Figure 6. Tree example to show concept density

Fig. 6 provides a sample concept hierarchy. It shows how the concepts in the hierarchy share their local density. The density of the root concept node C1 is 1, the densities of C2 and C3 are 2, and the densities of C4, C5, and C6 are 3.

Relevancy-nodes-based concept vector

In the document-document similarity computation, a cluster of terms appearing in the document is used as the dimension values. Given a concept node in the hierarchy, its ancestor concept nodes subsume its attributes, and its descendant concept nodes inherit them. So, in addition to the concept node itself, its ancestor and descendant concept nodes are relevant to that concept node, and we use them as the "terms" in our structure.

Definition 4 (Relevancy nodes). Given a concept node in the hierarchy, the concept node itself together with its ancestor and descendant concept nodes compose its relevancy nodes.

Consider that the vector space model's approach originates in document-document similarity. The presumption is that, given a certain number of terms, the frequencies of these terms in a document can be used as vectors to compute query-document similarity. In our situation, the density information of a node's relevancy nodes is used as a vector to compute internode similarity.

Definition 5 (Relevancy-nodes-based concept vectors for HCT). Given an HCT with n concept nodes, the concept vector of Ci is denoted Ci = (v_i,1, v_i,2, ..., v_i,n), and v_i,j (i = 1, 2, ..., n; j = 1, 2, ..., n) is the dimension value corresponding to concept node Cj relative to the particular concept node Ci, defined as follows using (2):

v_i,j = d_j  (if Cj is a relevancy node of concept Ci)
v_i,j = d_j  (if i = j)    (2)
v_i,j = 0    (otherwise)

where d_j is the local density of concept node Cj.

For example, for concept node C2 in Fig. 6, the concept node C2 itself, its ancestor concept node C1, and its descendant concept nodes C4, C5, and C6 compose C2's relevancy nodes. Their local densities d2, d1, d4, d5, and d6 are used as C2's dimension values. C3 is not a relevancy node of C2, so its dimension value in concept vector C2 is 0. If we list the dimensions of each concept vector in the order of the tree's breadth-first traversal, we have C2's concept vector C2 = (1, 2, 0, 3, 3, 3). Similarly, C1's concept vector is C1 = (1, 2, 2, 3, 3, 3), C3's concept vector is C3 = (1, 0, 3, 0, 0, 0), C4's concept vector is C4 = (1, 2, 0, 3, 0, 0), C5's concept vector is C5 = (1, 2, 0, 0, 3, 0), and C6's concept vector is C6 = (1, 2, 0, 0, 0, 3).

The similarity between any pair of concept nodes can then be computed. For example, the similarity between C3 and C4 can be computed with (3):

sim(C3, C4) = (C3 · C4) / (|C3| |C4|) = 0.085    (3)

2) Word Similarity in the WordNet Structure

The semantic similarity of two words w1 and w2 is expressed as sim(w1, w2). This value can be found through analysis of a lexical knowledge base, such as WordNet, in which words are organized into synonym sets (synsets).

• If the two target words are identical or in the same synset, then their similarity is 1.

• If one of the two target words is not in WordNet,


their similarity is 0. ""plug,""into,""the,""set,""watch,""pay,""television,""ins  If the two target words are not in the same ert,""provide,""by,""service"} synset but both in the WordNet hierarchy, then their word similarity is computed based on their Vector forming relevancy nodes’ local density in the hierarchy The purpose of forming T, which contains all the

as introduced in section III.B, Reference [13] words from T1 and T2, is to create semantic vectors for T1

has more articulation about this method. and T2.The vector derived from the joint word set is

 If one word is in the synset hierarchy ( in most called the lexical semantic vector and is denoted by V1

of the nouns and verbs structure) and the other is and V2.The dimension of the lexical semantic vector is in another part of the WordNet semantic net, the number of words in this joint word set. The value of

then the user must determine if there is any each entry in V1 or V2 represents the semantic similarity

relationship between them based on the ofanywordinV1 or V2 toanywordineitherofour

WordNet semantic nets, If there is, their sentences. For word wi inset T, we would compute a

similarity value is c. For example, in Fig. 1, if similarity score for wi and every word in T1 as described

there is a “derived from” relation between above. The word to which wi was most similar would be “verb” and “adjective”, the similarity value of used to give the value of the corresponding vector entry.

the target verb and adjective words is set to c. Entry value of V1:  If neither of the two words is in the synset {("a,""a"): 1.0, hierarchy ( not in most of the nouns and verbs ("by,""a"): 0.0, structure), then the user must determine there is ("cable,""cable"): 1.0, any relationship between them based on the ("card,""card"): 1.0, WordNet semantic nets, if there is, their ("consumer,""consumer"): 1.0, similarity value is c. For example, in Fig. 1, if ("descrambling,""descrambling"): 1.0, there is a “pertaining to” relation between ("from,""from"): 1.0, “adverb” and “adjective”, the similarity value of ("get,""get"): 1.0, the target adverb and adjective words is set to c. ("have,""get"): 1.0, ("insert,""set"): 0.935 C. Computing Sentence Similarity ("into,""into"): 1.0, We use the two example sentences to illustrate our ("operator,""operator"): 1.0, method of calculating sentence level similarities. ("pay,""have"): 0.836 S1: Consumers would still have to get a descrambling ("plug,""plug"): 1.0, security card from their cable operator to plug into the ("provide,""set"): 0.999 set. ("security,""security"): 1.0, S2: To watch pay television, consumers would insert ("service,""set"): 0.996 into the set a security card provided by their cable ("set,""set"): 1.0, service. ("still,""still"): 1.0, ("television,""cable"): 0.999 Sentence preprocessing ("the,""the"): 1.0, For each data set, WordNet is used as a resource for ("their,""their"): 1.0, calculating word similarity. In order to use , we ("to,""to"): 1.0, first transform sentences S1 and S2 to their bag-of-words ("watch,""set"): 0.857 representation, T1 and T2, respectively. ("would,""would"): 1.0} T1={wordsinS1} Entry value of V2: T2={wordsinS2} {("a,""a"): 1.0, We also form a superset T, which is the union of T1 ("by,""by"): 1.0, and T2:T=T1∪T2. 
("cable,""cable"): 1.0, In our example, T1,T2 and their union T can be ("card,""card"): 1.0, represented as: ("consumer,""consumer"): 1.0, ("descrambling,""a"): 0.0, T1= ("from,""a"): 0.0, {"consumer,""would,""still,""have,""to,""get,""a,""descra ("get,""pay"): 0.807 mble,""security,""card,""from,""their,""cable,""operator," ("have,""pay"): 0.836 "to,""plug,""into,""the,""set"} ("insert,""insert"): 1.0, T2= ("into,""into"): 1.0, {"to,""watch,""pay,""television,""consumer,""would,""in ("operator,""card"): 0.0, sert,""into,""the,""set,""a,""security,""card,""provide,""b ("pay,""pay"): 1.0, y,""their,""cable,""service"} ("plug,""set"): 0.995 T= ("provide,""provide"): 1.0, {"consumer,""would,""still,""have,""get,""a,""descrambl ("security,""security"): 1.0, e,""security,""card,""from,""their,""cable,""operator,""to, ("service,""service"): 1.0,


("set,""set"): 1.0, Recall= TP/(TP + FN) ("still,""television"): 0.792 F-measure= (1 + β)PR/(βP+R) ("television,""television"): 1.0, =2PR/(P+R) ("the,""the"): 1.0, (when β=1, precision and recall have the same weight) ("their,""their"): 1.0, ("to,""to"): 1.0, We evaluated the results in terms of accuracy, ("watch,""watch"): 1.0, precision, recall, and F-measure. ("would,""would"): 1.0} B. Data Sets  V 1 =(1.0 ,0.0 ,1.0 ,1.0 ,1.0 ,1.0 ,1.0 ,1.0 ,1.0 ,0.935 ,1.0 ,1. The Microsoft Research paraphrase corpus (MSRP) 0 ,0.836 ,1.0 ,0.999 ,1.0 ,0.996 ,1.0 ,1.0 ,0.999 ,1.0 ,1.0 ,1. data set is a known dataset for ground truth of sentence 0 ,0.857 ,1.0) similarity calculation, and it includes 1,725 test pairs  taken from Internet news articles [18]. Each sentence pair V 2 =(1.0,1.0,1.0,1.0,1.0,0.0,0.0 ,0.807 ,0.836 ,1.0 ,1.0 is judged by two human assessors whether they are ,0.0 ,1.0 ,0.995 ,1.0 ,1.0 ,1.0 ,1.0 ,0.792 ,1.0 ,1.0 ,1.0 ,1.0 semantically equivalent or not. Overall, 67% of the total ,1.0 ,1.0 ) sentence pairs are judged to be the positive examples. We give a smaller weight δ(0<δ<1) to articles, Semantically equivalent sentences may contain either prepositions, and conjunctions in the sentences due to identical information or the same information with minor their prevalence of usage, we set δ=0.2. differences in detail according to the principal agents and  ' the associated actions in the sentences. Sentence that V 1 =(0.2,0.0,1.0,1.0,1.0,1.0,0.2,1.0,1.0,0.935, describes the same event but is a superset of the other is 0.2 ,1.0 ,0.836 ,1.0 ,0.999 ,1.0 ,0.996 ,1.0 ,0.2 ,0.999 ,0.2 , considered to be a dissimilar pair. Note that this rule is 0.2,0.2,0.857,0.2) similar to other text entailment task. '  C Analysis of Results =( 0.2 ,0.2 ,1.0 ,1.0 ,1.0 , 0.0 ,0.0 ,0.807 ,0.836 ,1.0 , V 2 Fig. 7,8 and 9 depict the accuracy, precision, and recall 0.2 ,0.0 ,1.0 ,0.995 ,1.0 ,1.0 ,1.0 ,1.0 ,0.792 ,1.0 ,0.2 ,0.2 , values of our method. 
Table I shows the sentence 0.2 ,1.0 ,0.2) similarity performance on the MSRP data (C=0.2. δ=0.2). So, the cosine similarity between S1 and S2 is computed from (4):

  V 1V 2 sim(S1,S2) = = 0.907 (4)   V 1 V 2

The cosine similarity between S1 and S2 is computed, and the result is 0.907 . The intuition behind this calculation is to map semantic meanings contained in the union set sentence to each of the original sentence. In this way, we quantify how the semantic meaning of the sentence is conveyed in the two lexical realization formats. In the next section, we will empirically verify our approach on a large corpus.

IV. EXPERIMENTAL EVALUATION A. Evaluation Criteria x-axis: similarity threshold (alpha) We look into four different evaluation measures, which y-axis: Precision values of our method(percent(%)) capture different aspects of semantic similarity. TP: Number of sentences predicted to be similar Figure 7. Precision values of our method with different similarity sentences that actually are similar. threshold (alpha) TN: Number of sentences predicted to be dissimilar sentences that actually are dissimilar FP: Number of sentences predicted to be similar that are actually dissimilar FN: Number of sentences predicted to be dissimilar that are actually similar

Accuracy=(TP + TN)/(TP + TN+ FP + FN) Precision= TP/(TP + FP)
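The joint word set, semantic vectors, δ down-weighting, and cosine of Section III.C can be sketched as below. This is an illustrative sketch, not the paper's implementation: the word_sim stub scores only exact matches (a hypothetical stand-in for the WordNet relevancy-node measure of Section III.B), and the function-word list and example sentences are ours, so the resulting score is only for illustration.

```python
import math

# Hypothetical stand-in for the WordNet-based word similarity of Section III.B:
# 1.0 for identical words, else 0.0. The real method also scores non-identical
# word pairs through their relevancy-node concept vectors.
def word_sim(w1, w2):
    return 1.0 if w1 == w2 else 0.0

# Assumed list of function words; the paper down-weights articles,
# prepositions, and conjunctions by delta = 0.2.
FUNCTION_WORDS = {"a", "an", "the", "to", "on", "from", "into", "by", "and", "or"}
DELTA = 0.2

def semantic_vector(joint, bag):
    """For each word of the joint set T, take its best similarity to any word
    in the sentence's bag, then scale function-word entries by DELTA."""
    vec = []
    for w in joint:
        best = max(word_sim(w, t) for t in bag)
        vec.append(best * DELTA if w in FUNCTION_WORDS else best)
    return vec

def sentence_sim(s1, s2):
    t1, t2 = s1.lower().split(), s2.lower().split()
    joint = sorted(set(t1) | set(t2))      # the joint word set T = T1 ∪ T2
    v1 = semantic_vector(joint, t1)
    v2 = semantic_vector(joint, t2)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm                      # Eq. (4): weighted cosine

print(round(sentence_sim("the cat sat on the mat",
                         "the cat lay on the mat"), 3))  # → 0.675
```

With a real WordNet-backed word_sim, near-synonyms such as "sat"/"lay" would receive a non-zero score and the sentence-level value would rise accordingly.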


TABLE I. PERFORMANCE OF THE SENTENCE SIMILARITY METHODS ON THE MSRP DATA SET

Method                  Accuracy   Precision   Recall   F-measure
J & C                   69.3       72.2        87.1     79.0
L & C                   69.5       72.4        87.0     79.0
Lesk                    69.3       72.4        86.6     78.9
Lin                     69.3       71.6        88.7     79.2
W & P                   69.0       70.2        92.1     80.0
Resnik                  69.0       69.0        96.4     80.4
Ours (alpha = 0.737)    72.0       73.8        90.2     81.1

Note: The results of related methods are taken from Mihalcea et al. [5]
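The metrics reported in Table I follow the formulas of Section IV.A: predictions are obtained by thresholding the similarity score at alpha, and the four counts then yield accuracy, precision, recall, and F-measure (β = 1). A sketch, with made-up scores and gold labels for illustration:

```python
def evaluate(scores, labels, alpha):
    """Threshold similarity scores at alpha, then compute the four metrics
    of Section IV.A (beta = 1, so F = 2PR / (P + R))."""
    preds = [s >= alpha for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    tn = sum((not p) and (not l) for p, l in zip(preds, labels))
    fp = sum(p and (not l) for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Made-up similarity scores and gold paraphrase labels, for illustration only.
scores = [0.91, 0.75, 0.80, 0.73, 0.40, 0.60]
labels = [True, False, True, False, False, True]
print(evaluate(scores, labels, alpha=0.737))
```

Sweeping alpha over a grid and re-running this evaluation reproduces the kind of threshold curves shown in Figs. 7–9.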

In addition to word order, the related methods require that the specificity of words be taken into account, so a higher weight is given to specific words and a lower weight to the similarity of generic concepts. The specificity of each word is generally determined using the inverse document frequency (idf). We took a first rough cut at this problem and then attempted to model the semantic similarity of texts as a function of purely the semantic similarity of the component words.

We evaluated these results in terms of accuracy, precision, recall, and F-measure. Most methods yield good results for precision or recall, but very few do so for both. The average related-method accuracy was 69.2%, while ours climbed as high as 72% when we used 0.737 as the similarity threshold. When the related methods' precision values were 69.0–72.4, their recall tended to be 86.6–96.4; the recall of our method was 0.928–0.988 at the same level of precision, which was higher than that of the related methods. When the related methods' recall values were 86.6–96.4, their precision was 69.0–72.4; the precision of our method was 0.706–0.748 for the same recall, which was also higher than that of the related methods. We also obtained the highest F-measure value (81%) of any of the methods.

Figure 8. Accuracy values of our method (percent) versus similarity threshold (alpha)
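For completeness, the relevancy-node machinery of Section III.B (Definitions 3–5 with the cosine of Eq. (1)), on which the word similarities in these experiments rest, can be sketched as follows. The tree shape used here (C2 and C3 as children of C1; C4, C5, and C6 as children of C2) is our assumption, inferred from the densities stated for Fig. 6; the computed vectors for C1, C2, and C4 match the running example of Section III.B.

```python
import math

def concept_vectors(tree, root):
    """Definition 5: for each node, a vector over all nodes in BFS order whose
    entry j is the local density of node j (Definition 3) if node j is a
    relevancy node (the node itself, an ancestor, or a descendant), else 0."""
    parent_of = {c: p for p, cs in tree.items() for c in cs}
    order = [root]                       # BFS order of the nodes
    for node in order:
        order.extend(tree.get(node, []))
    # Definition 3: root density 1; otherwise number of siblings + 1 (itself).
    density = {n: 1 if n == root else len(tree[parent_of[n]]) for n in order}

    def ancestors(n):
        out = set()
        while n in parent_of:
            n = parent_of[n]
            out.add(n)
        return out

    def descendants(n):
        out, stack = set(), list(tree.get(n, []))
        while stack:
            m = stack.pop()
            out.add(m)
            stack.extend(tree.get(m, []))
        return out

    return {n: [density[m] if m in ({n} | ancestors(n) | descendants(n)) else 0
                for m in order]
            for n in order}

def sim(u, v):
    """Equation (1): cosine of two concept vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Assumed shape of the Fig. 6 hierarchy.
tree = {"C1": ["C2", "C3"], "C2": ["C4", "C5", "C6"]}
vec = concept_vectors(tree, "C1")
print(vec["C2"])                              # [1, 2, 0, 3, 3, 3]
print(round(sim(vec["C1"], vec["C2"]), 3))    # → 0.943
```

Any node pair can be scored this way, which is what the experiments above do (through WordNet's noun and verb hierarchies) before the sentence-level cosine is formed.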

Figure 9. Recall values of our method (percent) versus similarity threshold (alpha)

V. IMPLICATIONS AND CONCLUSION

In this paper, we propose an unsupervised approach to automatically calculating sentence-level similarities based on word-level similarities, without using any external knowledge from other training corpora. The main contributions of our work to the field are as follows:

1. A means of automatically computing the similarity values of any two words in WordNet is defined.

2. A means of computing sentence similarity based on the word composition of each sentence is proposed.

Experiments prove that our method compares favorably to other word-similarity-based measures. This method allows the user to make comparisons between sentences based solely on ontological structure, without requiring any additional dictionary or corpus information. In this way, our approach can be applied


directly to any domain application. We believe this is a very attractive feature in building new applications.

ACKNOWLEDGEMENTS

This project was supported by the Project of Construction of Innovative Teams and Teacher Career Development for Universities and Colleges under Beijing Municipality (CIT&TCD20130513), the Funding Project for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality (Grant No. PHR201108419), and the National Natural Science Foundation of China (Grant Nos. 60972045 and 61271369).

REFERENCES

[1] Metzler D., Bernstein Y., Croft W., Moffat A., Zobel J. (2005) Similarity measures for tracking information flow. In Proceedings of CIKM, 517–524.
[2] Ponzetto S. P., Strube M. (2007) Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research, 30, 181–212.
[3] Allan J., Bolivar A., Wade C. (2003) Retrieval and novelty detection at the sentence level. In Proceedings of SIGIR'03, 314–321.
[4] Hoad T., Zobel J. (2003) Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology, 54(3), 203–215.
[5] Mihalcea R., Corley C., Strapparava C. (2006) Corpus-based and knowledge-based measures of text semantic similarity. In Proceedings of AAAI 2006, Boston, July.
[6] Chukfong H., Masrah A. A. M., Rabiah A. K., Shyamala C. D. (2010) Word sense disambiguation-based sentence similarity. In Proceedings of the 23rd International Conference on Computational Linguistics, 418–426.
[7] Leacock C., Chodorow M. (1998) Combining local context and WordNet sense similarity for word sense identification. In WordNet: An Electronic Lexical Database, 265–283.
[8] Resnik P. (1999) Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 95–130.
[9] Banerjee S., Pedersen T. (2003) Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of IJCAI, 805–810.
[10] Wu Z., Palmer M. (1994) Verb semantics and lexical selection. In Proceedings of ACL, 133–138.
[11] Jiang J., Conrath D. (1997) Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of COLING.
[12] Lin D. (1998) An information-theoretic definition of similarity. In Proceedings of ICML, 296–304.
[13] Hongzhe L., Hong B., De X. (2011) Concept vector for similarity measurement based on hierarchical domain structure. Computing and Informatics, 30, 1001–1021.
[14] Li Y., Bandar Z., McLean D. (2003) An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering, 15(4), 871–882.
[15] Budanitsky A., Hirst G. (2006) Evaluating WordNet-based measures of semantic distance. Computational Linguistics, 32(1), 13–47.
[16] Palakorn A., Xiaohua H., Xiajiong S. (2008) The evaluation of sentence similarity measures. In Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, 305–316.
[17] Aminul I., Diana I. (2008) Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data (TKDD), 2(2).
[18] Dolan W., Quirk C., Brockett C. (2004) Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics.
[19] Rada R., Mili H., Bicknell E., Blettner M. (1989) Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1), 17–30.
[20] WordNet website: http://wordnet.princeton.edu/ (accessed 2011-5-1).

Hong-zhe Liu received her Ph.D. from the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China, and her M.S. degree in computer science from California State University, USA, in 1999. She is now an Assistant Professor in the Computer Science Department, Beijing Union University, and vice director of the Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing, China. Her research interests include semantic computing, artificial intelligence, and distributed systems.

Peng-fei Wang is a master's student in software engineering at the Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing, China. He received his bachelor's degree in electronic information engineering from Beijing Union University, Beijing, China, in 2012. His research interests include semantic computing, artificial intelligence, and distributed systems.
