
ISSN 2319-8885 Vol.07,Issue.04, April-2018,

Pages:0758-0761

www.ijsetr.com

Computing Semantic Similarity of Concepts in Knowledge Graphs

G. RAGHUNATH1, DR. K. VIJAYALAKSHMI2

1 PG Scholar, Dept of Computer Science, S.V.U College of CM & CS, S.V University, Tirupati, AP, India.
2 Assistant Professor, Dept of Computer Science, S.V University, Tirupati, AP, India.

Abstract: This paper presents a method for measuring the semantic similarity between concepts in Knowledge Graphs (KGs) such as WordNet and DBpedia. Previous work on semantic similarity methods has focused either on the structure of the semantic network between concepts (e.g., path length and depth) or only on the Information Content (IC) of concepts. We propose a semantic similarity method, namely wpath, that combines these two approaches by using IC to weight the shortest path length between concepts. Conventional corpus-based IC is computed from the distribution of concepts over a textual corpus, which requires preparing a domain corpus containing annotated concepts and has a high computational cost. As instances are already extracted from textual corpora and annotated by concepts in KGs, graph-based IC is proposed to compute IC based on the distribution of concepts over instances. Through experiments performed on well-known similarity datasets, we show that the wpath semantic similarity method produces a statistically significant improvement over other semantic similarity methods. Moreover, in a real category classification evaluation, the wpath method shows the best performance in terms of accuracy and F score.

Keywords: Semantic Similarity, Semantic Relatedness, Information Content, Knowledge Graph, WordNet, DBpedia.

I. INTRODUCTION
With the increasing popularity of the linked data initiative, many public Knowledge Graphs (KGs) have become available, such as Freebase, DBpedia and YAGO, which are novel semantic networks recording millions of concepts, entities and their relationships. Typically, the nodes of KGs consist of a set of concepts C1, C2, ..., Cn representing conceptual abstractions of things, and a set of instances I1, I2, ..., Im representing real-world entities. Following Description Logic terminology, knowledge bases contain two types of axioms: a set of axioms called the terminology box (TBox), which describes constraints on the structure of the domain, similar to the conceptual schema in a database setting, and a set of axioms called the assertion box (ABox), which asserts facts about concrete situations, like data in a database setting. Axioms describing the concept hierarchies of the KG are usually referred to as ontology classes (TBox), while axioms about entity instances are usually referred to as ontology instances (ABox). There is an example of a KG using the above notions. Concepts of the TBox are constructed hierarchically and classify entity instances into different types (e.g., actor or movie) through a special semantic relation, rdf:type. Concepts and hierarchical relations (e.g., is-a) compose a concept taxonomy, which is a concept tree where nodes denote the concepts and edges denote the hierarchical relations. The hierarchical relations between concepts specify that a concept Ci is a kind of concept Cj (e.g., actor is a person). Apart from hierarchical relationships, concepts can have other semantic relationships among them (e.g., actor plays in a movie). Note that the tiny KG is a simplified example from DBpedia for illustration, with DBpedia entities and their types mapped to the example KG.

The lexical database WordNet has been conceptualized as a conventional semantic network of the lexicon of English. WordNet can be viewed as a concept taxonomy where nodes denote WordNet synsets, each representing a set of words that share one common sense (synonyms), and edges denote the hierarchical relations of hypernymy and hyponymy (the relation between a sub-concept and a super-concept) between synsets. Recent efforts have transformed WordNet so that it can be accessed and applied as a concept taxonomy in KGs, by converting the conventional representation of WordNet into a novel linked data representation. For example, KGs such as DBpedia, YAGO and BabelNet have integrated WordNet and used it as part of their concept taxonomy to categorize entity instances into different types.

Such integration of conventional lexical resources and novel KGs has provided new opportunities to facilitate many different Natural Language Processing (NLP) and Information Retrieval (IR) tasks, including Word Sense Disambiguation (WSD), Named Entity Disambiguation (NED), query interpretation and document modeling, to name a few. Those KG-based applications rely on the knowledge of concepts, instances and their relationships. In this work, we mainly exploit the concept-level knowledge, while the instance-level knowledge is used to support the concept knowledge. More specifically, we focus on the problem of computing the semantic similarity between concepts in KGs.
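The notions of TBox, ABox and concept taxonomy introduced in this section can be illustrated with a minimal Python sketch. It only shows one possible in-memory layout; the root concept "thing", the relation name "plays_in" and the instance identifiers are hypothetical placeholders and are not taken from DBpedia.

```python
# Minimal sketch of a KG split into TBox (concept level) and ABox (instance level).
# All identifiers below ("thing", "instance_1", "plays_in", ...) are hypothetical.

# TBox: is-a edges of the concept taxonomy (child concept -> parent concept).
IS_A = {"actor": "person", "person": "thing", "movie": "thing"}

# TBox: a non-hierarchical semantic relation between concepts.
CONCEPT_RELATIONS = [("actor", "plays_in", "movie")]

# ABox: facts about concrete instances, typed through rdf:type.
ABOX = [
    ("instance_1", "rdf:type", "actor"),
    ("instance_2", "rdf:type", "movie"),
    ("instance_1", "plays_in", "instance_2"),
]


def superconcepts(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def types_of(instance):
    """Return the asserted types of an instance plus the types inherited via is-a."""
    result = []
    for subject, predicate, obj in ABOX:
        if subject == instance and predicate == "rdf:type":
            for concept in superconcepts(obj):
                if concept not in result:
                    result.append(concept)
    return result


print(types_of("instance_1"))  # ['actor', 'person', 'thing']
```

Storing the taxonomy as a child-to-parent mapping keeps the sketch short because it assumes a tree; a taxonomy with multiple inheritance would need a mapping from each concept to a set of parents.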

Copyright @ 2018 IJSETR. All rights reserved.
International Journal of Scientific Engineering and Technology Research Volume.07, IssueNo.04, April-2018, Pages: 0758-0761

According to the evaluation experiments on word similarity datasets, compared with the previous state-of-the-art semantic similarity methods, the wpath method results in a statistically significant improvement of the correlation between computed similarity scores and human judgements. The proposed graph-based IC has been shown to be as effective as the corpus-based IC, so that it could be used as a substitute for corpus-based IC in KGs. Furthermore, in order to evaluate the performance of semantic similarity methods on real application datasets, we have applied semantic similarity metrics to the aspect category classification task of the restaurant domain. The evaluation results of semantic similarity based category classification have shown that the wpath semantic similarity method has the best accuracy and F-measure score. In summary, this paper considers the problem of measuring semantic similarity between concepts in KGs. The main contributions of this work may be summarized as below.
• We propose a method for measuring the semantic similarity between concepts in KGs.
• We propose a method to compute IC based on the specificity of concepts in KGs.
• We evaluate the proposed methods on gold standard word similarity datasets.
• We evaluate the semantic similarity methods in aspect category classification.

II. SEMANTIC SIMILARITY
A relatively large number of semantic similarity metrics have previously been proposed in the literature. Among them, there are mainly two types of approaches to measuring semantic similarity, namely corpus-based approaches and knowledge-based approaches. Corpus-based semantic similarity metrics are based on models of distributional similarity learned from large text collections, relying on word distributions. Two words will have a high distributional similarity if their surrounding contexts are similar. Only the occurrences of words are counted in the corpus, without identifying the specific meaning of words or detecting the semantic relations between words. Since corpus-based approaches consider all kinds of lexical relations between words, they mainly measure semantic relatedness between words. On the other hand, knowledge-based semantic similarity methods are used to measure the semantic similarity between concepts based on semantic networks of concepts. This section briefly reviews corpus-based approaches and knowledge-based semantic similarity metrics that have shown good performance in NLP or IR applications.

A. Corpus-based Approaches
Corpus-based approaches measure the semantic similarity between concepts based on the information gained from large corpora. Following this idea, some works exploit concept associations such as Pointwise Mutual Information or Normalised Google Distance, while other works represent the concept meanings in high-dimensional vectors, such as Explicit Semantic Analysis. Recent works based on distributed techniques consider advanced computational models such as GLOVE, representing the words or concepts with low-dimensional vectors. The co-occurrence information of words with the same surrounding context would cause a wide variety of words to be considered as related. Since corpus-based approaches mainly rely on the contextual information of words, they usually measure the general semantic relatedness between words rather than the specific semantic similarity that depends on hierarchical relations. Furthermore, corpus-based semantic similarity methods represent concepts as words without clarifying their different meanings (word senses). Compared to knowledge-based approaches relying on KGs, corpus-based approaches normally have better coverage because their computational models can be effectively applied to various and updated corpora.

III. THE PROPOSED METHODS
The main idea of the wpath semantic similarity method is to encode both the structure of the concept taxonomy and the statistical information of concepts. Furthermore, in order to adapt corpus-based IC methods to structured KGs, graph-based IC is proposed to compute IC based on the distribution of concepts over instances in KGs. Consequently, using the graph-based IC in the wpath semantic similarity method can represent both the specificity and the hierarchical structure of the concepts in a KG. This section presents the wpath semantic similarity method for measuring semantic similarity between concepts in KGs and describes the proposed method to compute the graph-based IC of concepts based on KGs.

A. WPath Semantic Similarity
The knowledge-based semantic similarity metrics mentioned in the previous section are mainly developed to quantify the degree to which two concepts are semantically similar, using information drawn from the concept taxonomy or IC. Metrics take as input a pair of concepts and return a numerical value indicating their semantic similarity. Many applications rely on this similarity score to rank the similarity between different pairs of concepts. Taking a fragment of the WordNet concept taxonomy and the concept pairs (beef, lamb) and (beef, octopus), applications require similarity metrics to give a higher similarity value to sim(beef, lamb) than to sim(beef, octopus), because the concepts beef and lamb are kinds of meat while the concept octopus is a kind of seafood. The semantic similarity scores of some concept pairs computed by the semantic similarity methods have been tabulated. It can be seen in that table how the row of concept pair (beef, lamb) has higher similarity scores than the row of concept pair (beef, octopus).

B. Graph-Based Information Content
Conventional corpus-based IC requires preparing a domain corpus for the concept taxonomy and then computing IC from the domain corpus offline. The inconvenience lies in the high computational cost and the difficulty of preparing a domain corpus. More specifically, in order to compute corpus-based IC, the concepts in the taxonomy need to be mapped to the words in the domain corpus. Then the appearances of concepts are counted and the IC values for concepts are generated. In this way, the additional domain corpus preparation and offline computation may prevent the application of those semantic similarity methods relying on IC values (e.g., res, lin, jcn and wpath) to KGs, especially when the domain corpus is insufficient or the KG is frequently updated. Since KGs have already mined structural knowledge from textual corpora, we present a convenient graph-based IC computation method for computing the IC of concepts in a KG based on the instance distributions over the concept taxonomy. The graph-based IC is proposed to take direct advantage of KGs while retaining the idea of corpus-based IC representing the specificity of concepts. In consequence, the IC-based semantic similarity methods such as res, lin, jcn and the proposed wpath can compute the similarity score between concepts relying directly on the KG.

IV. EXPERIMENTS
The goal of our experiments is to evaluate the proposed semantic similarity method and graph-based IC in KGs. However, to the best of our knowledge, there is currently no standard method or dataset to evaluate the performance of semantic similarity methods and IC computation models for concepts in KGs. Therefore, the commonly used word similarity datasets are used to evaluate the proposed semantic similarity method and graph-based IC based on WordNet and DBpedia. Moreover, the semantic similarity methods are evaluated in an aspect category classification task in order to evaluate their performance in a real application. This section presents the datasets, implementation and evaluation, and provides a brief discussion of the obtained experimental results.

In order to evaluate the wpath semantic similarity method and graph-based IC, word similarity datasets have been processed and split into Word-Noun, Word-Graph and Word-Type. For evaluating the wpath similarity metric, the Word-Noun task was created by mapping words in word similarity datasets to WordNet noun concepts. The performance of graph-based IC is compared to that of corpus-based IC within the similarity metrics wpath, res, lin and jcn. To compute graph-based IC, words in word similarity datasets need to be mapped to DBpedia concepts. However, many words are not used as concepts in DBpedia, such as noon, madhouse or lad, to name a few. In consequence, in order to compare wpath and res, the Word-Graph task was created by mapping the LCS of word pairs to DBpedia concepts, while the Word-Type task was created by mapping the word pairs to DBpedia concepts for comparing wpath, lin and jcn. The more detailed dataset split criteria are described below:
• Word-Noun: Word pairs are chosen from all the original word similarity datasets if both words can be mapped to WordNet concepts, while unmapped word pairs are removed from the datasets. We run all the semantic similarity methods based on WordNet and corpus-based IC. This task evaluates the performance of semantic similarity methods.
• Word-Graph: Word pairs are further chosen from the datasets if both words can be mapped to WordNet concepts and the LCS of the mapped concepts is one of the extracted 68423 WordNet concepts which are used as DBpedia types. Apart from running all the semantic similarity methods based on corpus-based IC, we also use the graph-based IC computed from DBpedia in the res method and the proposed wpath method. This task is able to evaluate the performance of the graph-based IC used in the semantic similarity methods res and wpath. This task is chosen because both res and wpath rely only on the IC of the LCS.
• Word-Type: Word pairs are chosen if both words can be mapped to the extracted 68423 WordNet concepts used as entity types in DBpedia. We treat those mapped word pairs as DBpedia types. Then, all the semantic similarity methods are run using both corpus-based IC and graph-based IC. This task is able to evaluate the performance of graph-based IC used in the semantic similarity methods lin, jcn and wpath.

V. CONCLUSIONS AND FUTURE WORK
Measuring the semantic similarity of concepts is a crucial component in many applications, as presented in the introduction. In this paper, we propose the wpath semantic similarity method, which combines path length with IC. The basic idea is to use the path length between concepts to represent their difference, while using IC to capture the commonality between concepts. The experimental results show that the wpath method has produced a statistically significant improvement over other semantic similarity methods. Furthermore, graph-based IC is proposed to compute IC based on the distribution of concepts over instances. The experimental results show that the graph-based IC is effective for the res, lin and wpath methods and has performance similar to that of the conventional corpus-based IC. Moreover, graph-based IC has a number of benefits, since it does not require a corpus and enables online computing based on available KGs. Based on the evaluation of a simple aspect category classification task, the proposed wpath method has also shown the best performance in terms of accuracy and F score. We evaluated the proposed method on word similarity datasets and a simple classification task using the most established evaluation method. Further evaluation of semantic similarity methods in other applications that consider taxonomical relations could be useful and can be one of our future works. Furthermore, this paper mainly discussed semantic similarity rather than general semantic relatedness. Therefore, another line of future work could be studying the combination of knowledge-based methods with corpus-based methods for semantic relatedness. Finally, since we combined WordNet and DBpedia in this paper, we would further explore using the proposed approaches for measuring entity similarity and relatedness in KGs.

VI. REFERENCES
[1] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Freebase: a collaboratively created graph database for structuring human knowledge,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008, pp. 1247–1250.
[2] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann, “DBpedia - a crystallization point for the web of data,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 7, no. 3, pp. 154–165, 2009, The Web of Data.
[3] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum, “YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia (extended abstract),” in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, ser. IJCAI ’13. AAAI Press, 2013, pp. 3161–3165.
[4] I. Horrocks, “Ontologies and the semantic web,” Commun. ACM, vol. 51, no. 12, pp. 58–67, Dec. 2008. [Online]. Available: http://doi.acm.org/10.1145/1409360.1409377
[5] G. A. Miller, “WordNet: a lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[6] R. Navigli and S. P. Ponzetto, “BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network,” Artificial Intelligence, vol. 193, pp. 217–250, 2012.
[7] E. Hovy, R. Navigli, and S. P. Ponzetto, “Collaboratively built semi-structured content and artificial intelligence: The story so far,” Artificial Intelligence, vol. 194, pp. 2–27, 2013, Artificial Intelligence, Wikipedia and Semi-Structured Resources.
[8] R. Navigli, “Word sense disambiguation: A survey,” ACM Computing Surveys (CSUR), vol. 41, no. 2, p. 10, 2009.
[9] A. Moro, A. Raganato, and R. Navigli, “Entity linking meets word sense disambiguation: a unified approach,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 231–244, 2014.
[10] J. Hoffart, S. Seufert, D. B. Nguyen, M. Theobald, and G. Weikum, “KORE: Keyphrase overlap relatedness for entity disambiguation,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ser. CIKM ’12. New York, NY, USA: ACM, 2012, pp. 545–554.
[11] I. Hulpus, N. Prangnawarat, and C. Hayes, “Path-based semantic relatedness on linked data and its use to word and entity disambiguation,” in International Semantic Web Conference, 2015.
[12] J. Pound, I. F. Ilyas, and G. Weddell, “Expressive and flexible access to web-extracted data: A keyword-based structured query language,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’10. New York, NY, USA: ACM, 2010, pp. 423–434.

Author’s Profile:
G. Raghunath received a Bachelor's degree (B.Sc. Computers) from Yogi Vemana University, Kadapa, in 2012-2015, and is pursuing a Master of Computer Applications at Sri Venkateswara University, Tirupati, in 2015-2018. Research interests are in the field of Computer Science, in the area of Data Mining.

Dr. K. Vijaya Lakshmi is an Assistant Professor in the Department of Computer Science, Sri Venkateswara University, Tirupati. Her research interests are in the areas of Databases, Software Engineering and Data Mining.
