Linking Wordnet to Dbpedia

Linking WordNet to DBpedia

Aynaz Taheri Mehrnoush Shamsfard Computer Engineering Dept., Shahid Computer Engineering Dept., Shahid Beheshti University, Tehran, Iran Beheshti University, Tehran, Iran [email protected] [email protected]

DBpedia knowledge extraction framework ex- Abstract tracted its knowledge from Wikipedia and con- verted itself as a crystallization point for the web of data (Bizer, Lehmann, et al., 2009). DBpedia currently has knowledge for more than 3.6 mil- In this paper we present the process of lion things about persons, places, music, films, matching two important datasets in Link- video games and etc (“DBpedia,” n.d.). It also ing Open Data (LOD): DBpedia and contains information for these things in different WordNet 3.0. DBpedia plays the main languages. Most of the dataset publishers try to role in the LOD cloud. It is an influential link their datasets to DBpedia. In (Cyganiak, knowledge base and consists of over one 2010) you can see the mass of links to DBpedia. billion pieces of information about mil- Some of links from DBpedia to these datasets are lions of things. On the other hand Prince- available in (“Interlinking DBpedia,” 2011) and ton WordNet is the most important lexical one of these datasets is WordNet (W3C). ontology. WordNet 3.0 in RDF is also one WordNet (Fellbaum,1998) is an electronic lex- of the resources in Linking Open Data. ical database that is designed in Princeton Uni- Certainly, linking these two common da- versity for English language. WordNet uses syn- tasets has impressive effects in various as- onymous sets, called synset. The latest version of pects of consuming them. In this paper the WordNet contains 155,287 words organized in methodology of matching, statistical in- 117,659 synsets (“WordNet 3.0 database,” n.d.). formation and some beneficial use cases WordNet includes nouns, adjectives, verbs and of the matching is explained. adverbs. Synsets in WordNet are connected to each other with semantic relation such as: syn- 1 Introduction onymy, antonymy, hyponymy, hypernymy, meronymy, troponymy and etc. Nowadays increasing the amount of linked data Lexical ontologies like WordNet are important in Linking Open Data project is not the only resources in natural language processing (NLP). challenge of publishing linked data; rather, map- They are used in various tasks and applications, ping and linking the linked data resources are especially where semantic processing is evolved also equally important and can improve the ef- such as question answering, machine translation, fective consuming of linked data resources. text understanding, information retrieval and ex- Without these links, we confront with isolated traction, knowledge acquisition and semantic islands of datasets, which could not exploit search engines (Shamsfard, 2008). Integration of knowledge of each other. The fourth rule of pub- Princeton WordNet and DBpedia could improve lishing linked data in (Bizer, Heath, et al., 2009) the semantic processing. Princeton Wordnet has explains the necessity of linking URIs to each been mapped to most of the WordNets developed other. Therefore, extension of datasets without for other languages in the world. So, WordNets interlinking them is against the Linked Data of these languages could be linked to DBpedia principles. The importance of this issue increased via Princeton WordNet and the result of Word- our motivation of doing mapping between two Net to DBpedia matching will affect NLP in dif- core datasets of Linking Open Data. ferent languages. DBpedia is a significant knowledge base. At the present time, WordNet is available in

344 the Linking Open Data cloud. There are two da- 4. Only noun synsets of WordNet are considered tasets in the LOD cloud which represents Word- in current links of DBpedia to WordNet. We are Net in the form of linked data. One of them is going to find equivalents of verb and adjective WordNet (W3C1) (Assem et al., 2006) that is the synsets in WordNet to properties in DBpedia too. OWL/RDF representation of Princeton WordNet The rest of the paper is organized as follows: 2.0 and the other one is WordNet (VUA 2 ) Section 2 presents our methodology of matching (“Wordnet 3.0 in RDF” 2010) that is the RDF ver- WordNet 3.0 in RDF to DBpedia. Section 3 ex- sion of WordNet 3.0. WordNet (VUA) is plains statistical information about the result of mapped to WordNet (W3C) and DBpedia is also mapping. Section 4 describe some use cases and linked to WordNet (W3C). advantages of the mapping WordNet3.0 to Each synset in WordNet (VUA) has an URI. DBpedia. Section 5 discusses evaluation of the Synsets are derefrencable by their URIs and via results and section 6 provides some conclusion HTTP protocol. Instances of synsets have also about this paper. URIs. There are specified patterns for URIs of synsets and instances. The word “instance” is depicted in URIs of instances. 2 Methodology of Matching Currently, DBpedia has 467101 links to We are going to find coreferent URIs in Word- WordNet (W3C). But there are shortcomings in Net (VUA) and DBpedia. This process is also these links. We are going to cover these defects known as entity matching, object resolution, ob- in our matching: ject consolidation, entity identification, identity 1. There are only hypernymy relations from in- recognition, identity disambiguation or instance stances of DBpedia to noun synsets of WordNet matching. In recent years many efforts have been in current links and there is no relation from done for making tools, softwares and frame- WordNet to DBpedia. It is considerable that works for detecting coreferent URIs. One of these relations only represent a kind of instantia- these important products is Silk framework (Volz tion. An example of these relations is following: et al., 2009). Silk is a link discovery framework for the web of data that uses a declarative lan- < http://dbpedia.org/property/wordnet_type> guage (Silk-LSL) for specifying which types of entities. DBpedia has counseled utilizing this In the above example, it is demonstrated that tool for generating links from other datasets to ‘White_house’ is an instance of ‘building’ synset DBpedia (“Interlinking DBpedia,” 2011). In our in WordNet and nothing more. Whereas, Word- work, we do not use Silk because it needs all Net has information about ‘white_house’. There kinds of relations that might be discovered be- is a synset in WordNet 2.0 with this URI: tween entities to be described by user before- “http://www.w3c.org/2006/03/wn/wn20/instance hand. We instead, apply our approach for gene- s/synset-White_house-noun-2”. Matching the rating links between two datasets. WordNet synset and the correspondent one in Our methodology consists of two main phases: DBpedia is desirable for us. There are many syn- preliminary, supplementary. sets in WordNet which have equivalents in 1. Preliminary Phase: DBpedia. Discovering this type of relations is In this phase, we use a terminological method for one of our motivations for doing this project. comparing synsets in WordNet and entities in 2. There is another kind of relation that detecting DBpedia. The terminological method is applied it between WordNet and DBpedia is beneficial. in three steps: We find instantiation or hypernymy relations • Matching instances of WordNet to instances between noun synsets of WordNet and DBpedia and classes of DBpedia: classes. It is important to know a synset of WordNet belongs to which concept from the At the first step, equivalent instances in viewpoint of DBpedia. WordNet and DBpedia are found. Instances’ 3. There are many properties in DBpedia. There synsets in WordNet (UVA) are available apart from other noun synsets. Thus, in the first step is no link from these properties to equivalents in WordNet. we discover all equivalent URIs from these two sets. After finding these equivalences, the next step is to detect the correspondences be- 1 World Wide Web Consortium tween Wordnet instances and DBpedia 2 Vrije University Amsterdam

345 classes. are created directly from Wikipedia infobox There are instantiation relations between in- properties with no regard to DBpedia ontolo- stances and their classes in DBpedia. In fact, gy. There are some properties in these three the types of instances are described with these kinds that seem to be equivalent. The next ex- relations. This relation in DBpedia is ample represents that there are two properties represented with: in DBpedia that specify “Language” property: “http://www.w3.org/1999/02/22-rdf-syntax- < http://dbpedia.org/ontology/language> : an ns#type”. Since, we found equivalences rela- object property tions between instances in WordNet and : a DBpedia, and on the other hand there are in- property stantiation or hypernymy relations between In all of the three steps, similarity computing instances and their classes in DBpedia, so it is method is a token-based distance computing. In possible to represent the type of WordNet in- this method a string is considered as a bag of stances in DBpedia. Figure 1 indicates this words (Euzenat,2007). In our matching method process. the label of entities in DBpedia, the label of the WordNet synsets in their URIs, the label of WordNet DBpedia senses in a synset and the description (gloss) of a rdf:type synset in WordNet are transformed to bags of words. But before this transformation we nor- Noun Synsets Classes malize strings and remove stop words. After pro- ducing the bags of words, a measure for estima- rdf:type tion of similarity between WordNet synset and DBpedia entity is applied (1). owl: sameAs X: a bag of words including words in the label of Instance Synsets Instances synset Figure 1. Process of matching at the first step in the Y: a bag of words including words in the label of first phase DBpedia entity or its comment • Matching noun synsets of WordNet to classes S: a bag of words including words in the label of in DBpedia: senses of synset Some noun synsets of WordNet have equiva- G: a bag of words including words in the glos- lents in classes of DBpedia. For example, both sary of synset of the datasets have knowledge about “Lan- X I Y + Y I (XUSUG) δ(X, Y) = (1) guage”. So, the “synset-Language-noun-1” X + Y synset in WordNet is the same as “Language” class in DBpedia. 2. Supplementary Phase: found in the previous phase. So, we are sure that ly correct. But these results are not accurate nec- essarily and maybe there are correspondences • Matching noun, verb and adjective synsets of that don’t have entities with the same identity WordNet to properties in DBpedia: The aim and there are only lexical similarities between of this phase is to recognize properties of them. The purpose of this phase is to refine the DBpedia and their equivalents in WordNet. result of matching in the previous phase. In other Properties play the predicate role in a triple. words, a method for URI disambiguating is de- So, detecting equivalent synsets with proper- scribed in this phase. We used hierarchal struc- ties is advantageous. There are three kinds of ture of WordNet and DBpedia for disambigua- properties in DBpedia: owl:ObjectProperty, tion. ‘wnschema:instanceOf’ and owl:DatatypeProperty and property. ‘wn20schema:hyponymOf’ relations are used for owl:ObjectProperty and owl:DatatypeProperty gaining an understanding of taxonomic structure are properties in ontology of DBpedia but the in WordNet. ‘wnschema:instanceOf’ relation third kind of properties are independent from denotes a relation between an instance synset and DBpedia ontology and there are no structural a noun synset. Two relations from DBpedia are and hierarchical relations between them (“The also utilized for determining the taxonomic struc- DBpedia Data Set,” 2011). These properties ture. These two relations are ‘rdf:type’ and

346 ‘rdfs:subClassOf’. The first one denotes a rela- 3 Statistical Information of the Match- tion between an instance and the class that be- ing Outcome long to and the second one expresses a relation between two classes. These relations are used as In table 1 the result of WordNet to DBpedia is sources for disambiguation. represented. In the first column, the kinds of syn- Some structure based techniques are presented sets are denoted. In the second column the kinds in (Euzenat and Shvaiko, 2007). One of them is of relations that are discovered and in the third Wu-Palmer similarity measure (Wu and Palmer, column elements of DBpedia in matching are 1994). This similarity measure is used in the represented. second phase of our methodology. Consider the following URIs: 4 Use Cases URI1: http://purl.org/vocabularies/princeton/wn3 Influences of the matching consequences are 0/s ynset-Pluto-noun-1 clearly perceptible in the Natural Language URI2: http://purl.org/vocabularies/princeton/wn3 Processing and Semantic Processing domains. 0/synset-Pluto-noun-2 We categorize use cases of WordNet 3.0 to URI3: http://purl.org/vocabularies/princeton/wn3 DBpedia matching in three groups: 0/synset-Pluto-noun-3 • Enriching WordNet: Princeton WordNet can URI4: http://dbpedia.org/resource/Pluto be enriched with more relations. Properties in URI1, URI2 and URI3 from WordNet are lex- DBpedia can be used for finding more relations ically similar to each other and the first phase in Wordnet. finds them equal to URI4 from DBpedia. While, • Developing Formal Ontologies: Princeton URI1 has ‘wnschema:instanceOf’ relation with WordNet is a lexical ontology and is far from a ‘fictional_character’ and is a cartoon character. formal ontology. The types of relations in Word URI2 has ‘wnschema:instanceOf’ relation with Net are restricted to synonymy, antonymy, hypo- ‘Greek_deity’ and is the god of the underworld nymy, hypernymy, meronymy, troponymy. For in ancient mythology. URI3 has ‘wnsche- moving toward a formal ontology, it is necessary ma:instanceOf’ relation with ‘outer_planet’ and to augment the relations between synsets. With is a small planet. URI4 has ‘rdf:type’ relation utilizing interlinking of WordNet and DBpedia, with ‘Planet’ class of DBpedia ontology. The it is possible to discovering relations between matching of URI1 and URI2 with URI4 is ob- synsets of WordNet via their correspondent enti- viously wrong. ties in DBpedia. Due to the fact that DBpedia is We use Wu-Palmer measure and assess the an important knowledge base and have informa- similarity of taxonomic structure of URIs in tion for entities in the form of properties. These WordNet. In the former example, the similarity properties can be exploited for making WordNet of (fictional_character, Planet) is 0.48 and the a formal ontology. similarity of (Greek_deity, Planet) is 0.46. These • Semantic Search: In semantic search, disam- similarities are less than our threshold. There- biguating words is a main challenge. Taking ad- fore, the matching of URI1 and URI2 with URI4 vantage of WordNet to DBpedia matching, could are excluded from the results. help disambiguating through the large amounts All of the correspondences with structural si- of information about entities and properties in milarity less than the threshold are removed from DBpedia. the result of matching. • Finding more instances for WordNet synsets: The result of matching is available at: DBpedia is greater than WordNet in the number http://step1.nlplab.sbu.ac.ir/wordnetdbpedia/matc hing.aspx of instances. We discovered equivalent relations

Subject(WordNet) Predicate Object(DBpedia) Number of Matching Instances owl:SameAs Instances 27923 Instances rdf:type Classes 18555 Noun, Verb, Adjective owl:equivalentProperty Object Property 583 Noun, Verb, Adjective owl:equivalentProperty DataType Property 438 Noun, Verb, Adjective owl:equivalentProperty Property 10379 Noun owl:equivalentClass Classes 344 Table 1. Result of Matching

347 between some synsets of WordNet and classes of Christian Bizer, Jens Lehmann, Georgi Kobilarov, DBpedia. So we can apply the instances of Soren Auer, Christian Becker, Richard Cyganiak equivalent class for the instantiation of the syn- and Sebastian Hellmann. 2009. DBpedia- A crys- set. tallization point for the Web of Data. J. Web Sem. 7(3): 154-165. 5 Evaluation Christian Bizer, Tom Heath and Tim Berners-Lee. 2009. Linked Data-The Story So Far, Int. J. Se- We evaluated a subset of matching result ma- mantic Web Inf. Syst. 5(3) :1-22. nually. This subset contained 500 members. The value of obtained precision is 0.92. Computation Richard Cyganiak and Anja Jentzsch. The Linking of recall is not possible for this project as there is Open Data cloud diagram. (2010). Retrieved April, no golden standard and manual extraction of all 2011, from http://richard.cyganiak.de/2007/10/lod/ possible links between these two sets is almost impossible. DBpedia. (2011). Retrieved April, 2011, from http://dbpedia.org/About . 6 Conclusions and Future Work Jerome Euzenat and Pavel Shvaiko. 2007. Ontology matching. Springer-Verlag, Berlin Heidelberg. In this paper we discussed about matching two important datasets in LOD cloud: DBpedia and Christiane Fellbaum. 1998. WordNet: An Electron- WordNet. Shortcomings about the current links ic Lexical Database. MIT Press, Cambridge, form DBpedia to WordNet presented and the MA. necessity of generating more different kinds of Interlinking DBpedia with other Data Sets. (2011). links between them is explained. Interlinking Retrieved April, 2011, from http://wiki.dbpedia. these two datasets can improve applications on org/ Interlinking. natural language processing and semantic processing; furthermore WordNet is also impres- Mehrnoush Shamsfard, Akbar Hesabi, Hakimeh Fa- sionable from the matching result and can move daie, Niloofar Mansoory, Ali Famian, Somayeh toward an enriched lexical ontology or even a Bagherbeigi, Elham Fekri, Maliheh Monshizadeh, formal ontology. S. Mostafa Assi. 2010. Semi Automatic Develop- ment of FarsNet: The Persian WordNet, In Pro- Future work will focus on mapping WordNets ceedings of the 5th Global WordNet Conference, of other languages especially those with less re- Mumbai, India. sources to linked data. Linking WordNets to DBpedia is possible via the outcome of Princeton The DBpedia Data Set. (2011). Retrieved April, 2011, WordNet to DBpedia matching. For example from http://wiki.dbpedia.org/Datasets. FarsNet, the Persian wordnet (Shamsfard, et al., 2010) is a good candidate for this mapping. Julius Volz, Christian Bizer, Martin Gaedke, Georgi FarsNet to Princeton WordNet mapping is avail- Kobilarov. 2009. Silk – A Link Discovery Frame- able, so matching FarsNet to DBpedia is possi- work for the Web of Data. In Proceedings of the ble. After linking FarsNet to DBpedia, we are 2nd Workshop about Linked Data on the Web. going to extend relations in FarsNet with utiliz- Madrid, Spain. ing DBpedia. Accordingly, FarsNet will be con- WordNet 3.0 database statistics. Retrieved April, nected to LOD cloud and causes improvements 2011, from http://wordnet.princeton.edu/ word- in Persian language semantic processing. net/man/wnstats.7WN.html.

References Wordnet 3.0 in RDF. (2010). Retrieved April, 2011, from Mark V. Assem, Aldo Gangemi and Guus Schreiber. http://semanticweb.cs.vu.nl/lod/wn30/ 2006. Conversion of WordNet to a standard RDF/OWL representation. In proceedings of the Zhibiao Wu and Martha Stone Palmer. 1994. Verb 5th International Conference on Language Re- Semantics and Lexical Selection. In proceedings of sources and Evaluation, Genoa, Italy. the 32nd Annual Meeting of the Association for Computational Linguistics (ACL), 133–138, Las Mark V. Assem, Aldo Gangemi and Guus Schreiber. Cruces. RDF/OWL Representation of WordNet. (2006). Retrieved April, 2011, from http://www.w3.org/TR/ wordnet-rdf/

348