Tomasz Boiński, Adrian Ambrożewicz WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS Julian Szymański

DBpedia As A Formal – An Evaluation

TOMASZ BOINSKI´ ADRIAN AMBROZEWICZ˙ JULIAN SZYMANSKI´ Gdansk´ University of Technology Gdansk´ University of Technology Gdansk´ University of Technology Faculty of Electronics, Faculty of Electronics, Faculty of Electronics, Telecommunications and Informatics Telecommunications and Informatics Telecommunications and Informatics Narutowicza Street 11/12 Narutowicza Street 11/12 Narutowicza Street 11/12 80-233 Gdansk´ 80-233 Gdansk´ 80-233 Gdansk´ POLAND POLAND POLAND [email protected] [email protected] [email protected]

Abstract: DBpedia is widely used by researchers as a mean of accessing in a standardized way. In this paper it is characterized from the point of view of questions answering system. Simple implementation of such system is also presented. The paper also characterizes alternatives to DBpedia in form of OpenCyc and YAGO knowledge bases. A comparison between DBpedia and those knowledge bases is presented.

Key–Words: Knowledge base, Natural language processing

1 Introduction pointed out.

Growing amount of data available across the Inter- net is a viable source of knowledge. The problem 2 DBpedia lies in the form of knowledge representation and ways of accessing it. Recently many techniques became DBpedia aims at transforming Wikipedia articles into available to structuralize this knowledge [18, 14, 17]. an RDF compatible database with specified ontology. A few Knowledge Bases also emerges. Among the It’s a database based on so called “semantic stack”. most often used is OpenCyc [5], DBpedia [12] and DBpedia entries are created by automatic conversion YAGO [16]. All those resources create new possibil- of Wikipedia articles into triples in RDF format. Cur- ities enabling creation of different types natural lan- rently transformation includes: guage processing based solutions [1, 8, 15]. Currently • Title, DBpedia is one of the most widely used knowledge base. During research however this knowledge base • Abstract, is taken as granted without further analysis of its pros and cons. Other alternatives are often not taken into • Geo coordinates, account at all. In this paper we describe and evaluate aforemen- • Categories, tioned knowledge bases for use in natural language based system. The main focus of • Pictures, the Paper is placed on DBpedia however as it seems • Links, to the most popular. Unfortunately some of its weak- nesses were found out during our tests. The paper also • Info-boxes. presents a short characteristic of a simple question an- swering system and shortly describes simple imple- mentation used for testing DBpedia knowledge base. 2.1 The structure of this paper is as follows. Sec- The most information is gathered from info-boxes that tion 2 describes the DBpedia from the point of view are directly converted into standard “subject, predi- od question answering system. Section 3 shortly de- cate, object” triples. Info-boxes however are manda- scribes our simple implementation of such system and tory for automatic conversion. If an article does not in Section 4 alternatives to DBpedia are presented. have a info-box than it will not appear in the DBpe- Section 5 compares DBpedia with YAGO and finally dia [7]. Currently only around a half of the Wikipedia in Section 6 some conclusions and observations are articles has an info-box.

E-ISSN: 2224-3402 161 Volume 12, 2015 Tomasz Boiński, Adrian Ambrożewicz WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS Julian Szymański

• The highest homogeneity is characteristic to en- Table 1: Estimated number of instances for main sub- tries describing well defined and unchangeable jects in version 3.9 of DBPedia terms like city, country, music record, language.

Subject No of instances • Information stored in records related to each Persons 832 000 other is consistent and coherent, Places 639 000 • In other cases is observed that heterogeneity is Works of art 372 000 dependent on the domain of the concept – for ex- Organizations 209 000 ample persons were described in different way Species 226 000 dependent on their profession. Sicknesses 5 600 • Each category has its own separate ontology in- compatible with other ontologies, however cur- rently there is work being done that should elim- Table 2: Number of connections to other knowledge inate this simple property to ontology mapping bases and standardize the description of a concept. Still most of the entities are not transformed to this Knowledge base No of connections new dbpedia-owl ontology. This requires man- flickr wrappr 4 000 000 ual work done by humans. Freebase 3 900 000 GeoNames 425 000 • There is a low number of relations between en- LinkedGeoData 104 000 tries – in most cases records are connected with each other only by links to common categories. OpenCyc 27 000 UMBEL 900 000 5 200 000 3 Simple Question Answering Sys- WordNet 470 000 tem 2 900 000 instances YAGO 41 000 000 properties We decided to implement a simple question answer- ing system based on question templates that uses DB- pedia as a knowledge base. Its current version al- lows asking questions to the DBPediia database in Table 1 presents estimated number of instances two ways. The first of these involves the creation of for main subjects in version 3.9 of DBPedia. queries in the form of triples of type ¡entity, property, Besides automatic conversion new records are property value, [¡property, property value¿, ...]. This added by the community. The community also maps approach allows the generation of simple queries with DBpedia entries with other semantic knowledge bases a single entity and describing it set the properties of (Table 2). the selected values. The second way is to generate As it can be seen DBpedia is a large database and simple queries in natural language formulated accord- it is under constant development. ing to predefined template. This way a user can ask a question in form of ”which, entity, has, property, property value [and property, the value of property, [ 2.2 DBpedia Ontology and ... ]]. Sample question is ”Which country has Early versions of DBpedia had no ontology. Currently Australian Dollar currency and language English”. In it’s structure is extended by such. DBpedia ontology both cases it is possible to define a single entity and has 529 classes which form a subsumption hierarchy its properties. and are described by 2,333 different properties [6]. In both cases the system worked surprisingly Unfortunately data extracted from Wikipedia not al- well allowing prefetching property values in the back- ways is mapped into the ontology. Especially that ex- ground based on user input which simplifies query tracted using old algorithms need further verification construction. The main difficulty is to transform nat- and correction. ural language formed question into a SPARQL query. We performed a preliminary check of the most The consistency of knowledge available in DBpedia is popular concepts in terms of type and homogeneity unfortunately quite low. Replies for the same queries of available information. The came to the following but with different subjects were not comparable as conclusions: there is inconsistency in terms of knowledge available.

E-ISSN: 2224-3402 162 Volume 12, 2015 Tomasz Boiński, Adrian Ambrożewicz WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS Julian Szymański

E.g. different persons are characterized by differ- filled with other types of assertions and is extended ent properties depending on their occupation, achieve- with knowledge from more specific domains. Open- ments etc. DBpedia would require some kind of stan- 2.0 System contains: dardization in this matter. The DBpedia SPARQL endpoint worked well • Cyc ontology containing all concepts and asser- when it was available, but unfortunately it provided tions, a lot of downtime proving it unusable for produc- tion or even extensive testing. Local installment using • Reasoning module – Cyc Engine, Apache Jena [19] was thus necessary. • Knowledge Base browser – OpenCyc Browser. In future we would like to expand the possibili- ties of the system by introducing other types of queries • Links between Cyc concepts and WordNet [9] that can be forwarded to the knowledge base. The first synsets, task is thus the extension of ways to ask questions in natural language by a better analysis of the wording • Links between Cyc and FOAF [4] concepts, of the question that should go beyond the pre-defined templates. It is therefore necessary to allow other • Links between Cyc concepts and Wikipedia [21] forms of questioning (which, where, whose, when, and DBpedia [12] articles, etc.) and other forms of queries (e.g. ”In Which coun- try you can pay with Australian Dollar and speak En- • English names in canonical and conjugated form, glish” or omission explicitly expressed entity ”Where • Documentation, you can pay using Australian Dollar and speak En- glish”). The next task is to add the ability to use the • CycL [13] language specification, property values that are not labeled directly in DBpe- dia resource (e.g. Australian Dollar rather than Aus- • Cyc API. tralian Dollar). In the longer term the students should try to introduce complex queries in form of ”Which country has currency that was Introduced on February 4.1.1 Cyc Knowledge Base 14th 1966 and language English”. OpenCyc is shipped with a knowledge base contain- ing over 300 000 concepts that are connected with 15 000 relations using 3 million of taxonomical as- 4 Other Knowledge Bases sertions. New assertions are constantly added to the database, mainly as a result of knowledge inference. 4.1 OpenCyc All concepts in knowledge base are treated as key- Cyc [10] is a rule based with common words of CycL [13] language which is a formal repre- sense knowledge base. It contains knowledge consist- sentation of Cyc knowledge. ing of very high number of concepts and rules defined The knowledge in Cyc is organized as a pyra- around those concepts that describe everyday real life mid. The top concept is Thing, which is a top con- situations, like shopping, relations between people, cept of Cyc ontology. Directly below Thing abstract time, space, abstract concepts etc. The database is concepts and knowledge about possible relations is also extended by knowledge about grammar and lexis located. The concepts become more and more spe- which allows natural language processing. cific towards the base of the pyramid. At the bot- Cyc knowledge base is utilized in a complex rea- tom lies specific knowledge in form of facts and data soner [5], which allows performing knowledge pro- (like names) which are connected with concepts in the cessing similar to that of a human. knowledge base. The Cyc System is available in three flavors: The knowledge base is divided into multiple micro-theories that share common assumptions. Ev- • full, commercial – Cyc, ery micro-theory is focused on given knowledge do- main, level of detail or time frame. Such micro-theory • free, but limited – OpenCyc, mechanism allow keeping in knowledge base seem- • intended for research projects – ResearchCyc. ingly contradictory assertions – each micro-theory have to has consistent assertions but the whole knowl- OpenCyc differs from Cyc by a limited knowl- edge base can have contradictory ones. Such organ- edge base. It has over 300 000 concepts connected isation of knowledge speeds up reasoning, increases with 3 million assertions. Those assertions are mainly scalability of the solution and allows usage of differ- of taxonomical type. Full Cyc knowledge base is ent reasoners for different knowledge types.

E-ISSN: 2224-3402 163 Volume 12, 2015 Tomasz Boiński, Adrian Ambrożewicz WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS Julian Szymański

4.1.2 Reasoning Engine language Wikipedia. Those links, however limited OpenCyc is shipped with integrated reasoning engine in number, allow extracting non trivial links between that allows users to create new assertions and add data Wikipedia articles using Cyc knowledge base and rea- to the database. The database recognizes two types of soning capabilities. basic assertions: 4.1.6 Summary • facts – Tom is a student. OpenCyc is an extensive and formalized knowledge • rules – If a person X is a student than that person base. Unfortunately in the free version it mainly pro- has an index. vides taxonomical relations which is not enough. Be- ing a lite version of a commercial product it provides New facts can be generated by the knowledge powerful system with somewhat limited knowledge base using resolution. When a new fact is added the base that hinders potential uses of the System. It also reasoning engine infers new facts derived from the provides limited interoperability with widely used so- new one and adds them to the database. New facts lutions like WordNet and Wikipedia what limits its in- can also be added as a by-product of a query. teroperability.

4.1.3 CycL language 4.2 YAGO CycL is an LISP based language used to represent YAGO [16] is a knowledge base developed by Max- knowledge in Cyc. First order logic was extended Planck-Institut fur¨ Informatik in Saarbrucken.¨ It is by elements of Second order logic, skolemization and developed strictly by a dedicated team which takes nonmonotonic reasoning. The CycL dictionary is great care in assuring the quality of stored informa- composed of terms, that can be divided to: tion. Currently it holds 450 million facts about 10 million entities. • constants, In 2012 an extended version, called YAGO2, • variables, was released. It introduced full compatibility with RDF/OWL by providing the knowledge base in Turtle • non-atomic terms, format, the resources were combined into a subsets like Core, GeoNames etc. and the entities were ex- • micro-theories. tended by knowledge about domains. The terms are connected into sentences that form assertions that are stored in the knowledge base. Sen- 4.2.1 Data Sources tences in CycL are also used to form queries to Cyc YAGO Knowledge Base is automatically generated reasoning engine. based on the following data sources: • WordNet 4.1.4 NLP Subsystem This is currently the most actively developed part of – YAGO class hierarchy is directly derived Cyc. It is used to interpret facts and questions formed from WordNet synset hierarchy, using English language. The NLP System is com- – Most of the entities stored in YAGO are posed of a lexicon, syntactic parser and semantic in- correlated with WordNet synsets, those cor- terpreter. The lexicon contains information about syn- relations are also visible in construction of tax and semantics of English language. The seman- YAGO ontology. tic parser is used to create sentences out of symbols within the lexicon. Such sentences are transformed • GeoNames [20] into CycL syntax and verified by the semantic inter- – Entities with identical geographical names preter. are grouped, – Wikipedia entries describing location that 4.1.5 Wikipedia Links are described in GeoNames are directly OpenCyc 2.0 was extended to provide links to transform into YAGO entities, Wikipedia articles. Some of the constants are part – GeoNames classes are mapped into their of wikipediaArticleName and wikipediaArticleURL WordNet counterparts. predicates which connects them with proper article. Currently there are 19103 links defined to English • Wikipedia

E-ISSN: 2224-3402 164 Volume 12, 2015 Tomasz Boiński, Adrian Ambrożewicz WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS Julian Szymański

– Information from Wikipedia info boxes are • DBpedia defines concrete classes like “Writer” mapped into YAGO facts, or “Musician”, whereas YAGO states only a fact – Facts are generated based on Wikipedia cat- that somebody “is a creator”. The concrete in- egories: formation about form of creation can be derived indirectly from type of the work related to given ∗ Aggregation categories (“born in person, 1990”) are parsed and interpreted in YAGO, • YAGO distinguishes incomplete dates and allows ∗ Leaves in Wikipedia categories struc- their comparison with full dates whereas DBpe- ture are prioritized, dia does not, ∗ The articles and categories are mapped • YAGO does not contain cycles in relations into WordNet synsets. whereas DBpedia does,

In general YAGO knowledge base is thus an ex- • Information stored in DBpedia is more detailed tension of WordNet structure with knowledge from than in YAGO but is domain dependent and in- Wikipedia. consistent across different topics. Both YAGO and DBpedia interchange knowl- 4.2.2 Quality edge. Data from YAGO is being integrated into DBpe- dia, whereas DBpedia serves as YAGO’s access point The quality of YAGO knowledge base is verified con- to world of Linked Data [2, 3]. stantly using questionnaires regarding selected sam- YAGO and DBpedia knowledge base can be ac- ples of data [11]. The measured accuracy is near 95%. cessed using online SPARQL endpoints. They are Furthermore tests shown that 87% of GeoNames en- located at http://lod2.openlinksw.com/ tries were mapped to WordNet synsets with 94% ac- sparql and http://dbpedia.org/sparql curacy. respectively. Unfortunately both of them are unreli- able thus requiring local installments of both systems. 5 Differences between YAGO and DBpedia 6 Conclusion Both YAGO and DBpedia are based on similar as- DBpedia is an interesting initiative that allows treat- sumptions and realize similar goals. The main differ- ing Wikipedia as a semi-formalized knowledge base, ence lies in the way of knowledge acquisition. In gen- in spite of its constantly changing and inconsistent eral DBpedia is based on community effort of map- knowledge base, due to larger amount of data avail- ping knowledge directly from Wikipedia and YAGO able. Unfortunately it proved to be unreliable mainly is based on stable heuristics. YAGO is developed by due to lack of formalization and consistency. a formal group of developers dedicated to the project Needles to say, all three presented systems are whereas DBpedia is developed by a rather loose com- widely used by researchers coping with natural lan- munity. guage processing. OpenCyc is the most formalized The main differences between those systems are: and mature of the solutions but in free version pro- vides limited knowledge in terms of relations between • DBpedia has its own ontology containing 529 entities. YAGO and DBpedia interlinks with other classes while YAGOs ontology is based on widely used solutions like WordNet and Wikipedia al- WordNet taxonomy and has 350 000 classes, lowing a formalized gateway to use that resources and create and interoperable system. Currently YAGO is • YAGO provides basic information through rela- more formalized than DBpedia but provides far less tions defined in yagoSchema, DBpedia has no information and currently it seems to be a better solu- such mechanism, tion. DBpedia efforts on standardizing and formaliz- ing its structure can however change this situation and • DBpedia mappings of info-boxes are performed due to larger amount of data available, when backed quite loosely compared to YAGO, attributes are by formal and consistent structure, can provide a bet- mapped “as is”, ter solution. In the near future we plan on migrating our simple • YAGO’s quality is controlled by manual quality implementation of question answering system to uti- tests on preselected samples of data, lize YAGO database and through its formalized struc-

E-ISSN: 2224-3402 165 Volume 12, 2015 Tomasz Boiński, Adrian Ambrożewicz WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS Julian Szymański

ture link with DBpedia. Those two systems interlink [12] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, with each other so it might seem a proper solution that D. Kontokostas, P. N. Mendes, S. Hellmann, would allow combine YAGO formalisms with DBpe- M. Morsey, P. van Kleef, S. Auer and C. Bizer, dia extensive database. That however needs a verifi- DBpedia – A Large-scale, Multilingual Knowl- cation. edge Base Extracted from Wikipedia, Journal, 2014. [13] B. D. Lenat and R. V. Guha, The evolution of References: CycL, the Cyc representation language, ACM [1] M. Aono, and M. H. Seddiqui, Scalability in SIGART Bulletin, 2(3), pp. 84–87, 1991. ontology instance matching of large semantic [14] J. Rzeniewicz, J. Szymanski´ and W. Duch, knowledge base. In AIKED 2010 Proceedings of Adaptive algorithm for interactive question- the 9th WSEAS International Conference on Ar- based search, In Intelligent Information Process- tificial Intelligence, and ing VI, pp. 186–195, Springer, 2012. Data Bases, pp. 378–383, 2010. [15] M. D. Seddiqui and A. Masaki, Ontology in- [2] T. Berners-Lee, Linked data-design issues stance matching by considering semantic link (2006), http://www.w3.org/DesignIssues/Lin- cloud, 9th WSEAS Int. Conf. on Applications of kedData.html, [Online; accessed 10-March- Computer Engineering, 2010. 2014]. [16] F. M. Suchanek, G. Kasneci and G. Weikum, [3] C. Bizer, T. Heath, and T. Berners-Lee, Linked Yago: a core of semantic knowledge, In Pro- data-the story so far, International journal on se- ceedings of the 16th international conference on mantic web and information systems, 5(3), pp. World Wide Web, pp. 697–706, ACM, 2007. 1–22, 2009. [17] J. Szymanski´ and W. Duch, Annotating words using wordnet semantic glosses, In Neural In- [4] D. Brickley and L. Miller, The Friend of a Friend formation Processing, pp. 180–187, Springer, (FOAF) project, 2000. 2012. [5] J. Curtis, G. Matthews and D. Baxter, On the [18] J. Szymanski´ and W. Duch, Information retrieval effective use of Cyc in a question answering with semantic memory model, Cognitive Sys- system, In Proc Workshop on Knowledge and tems Research, 14(1), pp. 84–100, 2012. Reasoning for Answering Questions, pp. 61–70, [19] The Apache Software Foundation, Apache Jena, 2005. 2014, http://jena.apache.org/, [Online; accessed [6] DBpedia, The DBpedia Ontology (3.9), 2014, 10-March-2014]. http://wiki.dbpedia.org/Ontology, [Online; ac- [20] M. Wick and B. Vatant, The geonames geo- cessed 10-March-2014]. graphical database, http://geonames.org, [On- [7] DBpedia, The DBpedia Ontology (3.9), 2014, line; accessed 10-March-2014]. http://wiki.dbpedia.org/gsoc2013/ideas/OntologyCheck[21], Wikipedia, About Wikipedia – Wikipe- [Online; accessed 11-June-2014]. dia, The Free Encyclopedia, 2013, [8] A. Diosteanu, L. A. Cotfas, A. Smeureanu and http://en.wikipedia.org/w/index.php?title=About S. D. Dumitrescu, Natural language processing Wikipedia&oldid=582531310, [Online; ac- applied in itinerary recommender systems. In cessed 10-March-2014]. Proceedings of the 10th WSEAS international conference on Applied computer and applied computational science, pp. 260–265, World Sci- entific and Engineering Academy and Society (WSEAS), 2011. [9] C. Fellbaum, WordNet: An electronic lexical database, 1998. [10] D. Foxvog, Cyc, In Theory and Applications of Ontology: Computer Applications, pp. 259–278, Springer, 2010. [11] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum, YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia, Arti- ficial Intelligence, 194, pp. 28–61, 2013.

E-ISSN: 2224-3402 166 Volume 12, 2015