Tomasz Boiński, Adrian Ambrożewicz WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS Julian Szymański DBpedia As A Formal Knowledge Base – An Evaluation TOMASZ BOINSKI´ ADRIAN AMBROZEWICZ˙ JULIAN SZYMANSKI´ Gdansk´ University of Technology Gdansk´ University of Technology Gdansk´ University of Technology Faculty of Electronics, Faculty of Electronics, Faculty of Electronics, Telecommunications and Informatics Telecommunications and Informatics Telecommunications and Informatics Narutowicza Street 11/12 Narutowicza Street 11/12 Narutowicza Street 11/12 80-233 Gdansk´ 80-233 Gdansk´ 80-233 Gdansk´ POLAND POLAND POLAND [email protected] [email protected] [email protected] Abstract: DBpedia is widely used by researchers as a mean of accessing Wikipedia in a standardized way. In this paper it is characterized from the point of view of questions answering system. Simple implementation of such system is also presented. The paper also characterizes alternatives to DBpedia in form of OpenCyc and YAGO knowledge bases. A comparison between DBpedia and those knowledge bases is presented. Key–Words: Knowledge base, Natural language processing 1 Introduction pointed out. Growing amount of data available across the Inter- net is a viable source of knowledge. The problem 2 DBpedia lies in the form of knowledge representation and ways of accessing it. Recently many techniques became DBpedia aims at transforming Wikipedia articles into available to structuralize this knowledge [18, 14, 17]. an RDF compatible database with specified ontology. A few Knowledge Bases also emerges. Among the It’s a database based on so called “semantic stack”. most often used is OpenCyc [5], DBpedia [12] and DBpedia entries are created by automatic conversion YAGO [16]. All those resources create new possibil- of Wikipedia articles into triples in RDF format. Cur- ities enabling creation of different types natural lan- rently transformation includes: guage processing based solutions [1, 8, 15]. Currently • Title, DBpedia is one of the most widely used knowledge base. During research however this knowledge base • Abstract, is taken as granted without further analysis of its pros and cons. Other alternatives are often not taken into • Geo coordinates, account at all. In this paper we describe and evaluate aforemen- • Categories, tioned knowledge bases for use in natural language based question answering system. The main focus of • Pictures, the Paper is placed on DBpedia however as it seems • Links, to the most popular. Unfortunately some of its weak- nesses were found out during our tests. The paper also • Info-boxes. presents a short characteristic of a simple question an- swering system and shortly describes simple imple- mentation used for testing DBpedia knowledge base. 2.1 Knowledge Acquisition The structure of this paper is as follows. Sec- The most information is gathered from info-boxes that tion 2 describes the DBpedia from the point of view are directly converted into standard “subject, predi- od question answering system. Section 3 shortly de- cate, object” triples. Info-boxes however are manda- scribes our simple implementation of such system and tory for automatic conversion. If an article does not in Section 4 alternatives to DBpedia are presented. have a info-box than it will not appear in the DBpe- Section 5 compares DBpedia with YAGO and finally dia [7]. Currently only around a half of the Wikipedia in Section 6 some conclusions and observations are articles has an info-box. E-ISSN: 2224-3402 161 Volume 12, 2015 Tomasz Boiński, Adrian Ambrożewicz WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS Julian Szymański • The highest homogeneity is characteristic to en- Table 1: Estimated number of instances for main sub- tries describing well defined and unchangeable jects in version 3.9 of DBPedia terms like city, country, music record, language. Subject No of instances • Information stored in records related to each Persons 832 000 other is consistent and coherent, Places 639 000 • In other cases is observed that heterogeneity is Works of art 372 000 dependent on the domain of the concept – for ex- Organizations 209 000 ample persons were described in different way Species 226 000 dependent on their profession. Sicknesses 5 600 • Each category has its own separate ontology in- compatible with other ontologies, however cur- rently there is work being done that should elim- Table 2: Number of connections to other knowledge inate this simple property to ontology mapping bases and standardize the description of a concept. Still most of the entities are not transformed to this Knowledge base No of connections new dbpedia-owl ontology. This requires man- flickr wrappr 4 000 000 ual work done by humans. Freebase 3 900 000 GeoNames 425 000 • There is a low number of relations between en- LinkedGeoData 104 000 tries – in most cases records are connected with each other only by links to common categories. OpenCyc 27 000 UMBEL 900 000 Wikidata 5 200 000 3 Simple Question Answering Sys- WordNet 470 000 tem 2 900 000 instances YAGO 41 000 000 properties We decided to implement a simple question answer- ing system based on question templates that uses DB- pedia as a knowledge base. Its current version al- lows asking questions to the DBPediia database in Table 1 presents estimated number of instances two ways. The first of these involves the creation of for main subjects in version 3.9 of DBPedia. queries in the form of triples of type ¡entity, property, Besides automatic conversion new records are property value, [¡property, property value¿, ...]. This added by the community. The community also maps approach allows the generation of simple queries with DBpedia entries with other semantic knowledge bases a single entity and describing it set the properties of (Table 2). the selected values. The second way is to generate As it can be seen DBpedia is a large database and simple queries in natural language formulated accord- it is under constant development. ing to predefined template. This way a user can ask a question in form of ”which, entity, has, property, property value [and property, the value of property, [ 2.2 DBpedia Ontology and ... ]]. Sample question is ”Which country has Early versions of DBpedia had no ontology. Currently Australian Dollar currency and language English”. In it’s structure is extended by such. DBpedia ontology both cases it is possible to define a single entity and has 529 classes which form a subsumption hierarchy its properties. and are described by 2,333 different properties [6]. In both cases the system worked surprisingly Unfortunately data extracted from Wikipedia not al- well allowing prefetching property values in the back- ways is mapped into the ontology. Especially that ex- ground based on user input which simplifies query tracted using old algorithms need further verification construction. The main difficulty is to transform nat- and correction. ural language formed question into a SPARQL query. We performed a preliminary check of the most The consistency of knowledge available in DBpedia is popular concepts in terms of type and homogeneity unfortunately quite low. Replies for the same queries of available information. The came to the following but with different subjects were not comparable as conclusions: there is inconsistency in terms of knowledge available. E-ISSN: 2224-3402 162 Volume 12, 2015 Tomasz Boiński, Adrian Ambrożewicz WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS Julian Szymański E.g. different persons are characterized by differ- filled with other types of assertions and is extended ent properties depending on their occupation, achieve- with knowledge from more specific domains. Open- ments etc. DBpedia would require some kind of stan- Cyc 2.0 System contains: dardization in this matter. The DBpedia SPARQL endpoint worked well • Cyc ontology containing all concepts and asser- when it was available, but unfortunately it provided tions, a lot of downtime proving it unusable for produc- tion or even extensive testing. Local installment using • Reasoning module – Cyc Inference Engine, Apache Jena [19] was thus necessary. • Knowledge Base browser – OpenCyc Browser. In future we would like to expand the possibili- ties of the system by introducing other types of queries • Links between Cyc concepts and WordNet [9] that can be forwarded to the knowledge base. The first synsets, task is thus the extension of ways to ask questions in natural language by a better analysis of the wording • Links between Cyc and FOAF [4] concepts, of the question that should go beyond the pre-defined templates. It is therefore necessary to allow other • Links between Cyc concepts and Wikipedia [21] forms of questioning (which, where, whose, when, and DBpedia [12] articles, etc.) and other forms of queries (e.g. ”In Which coun- try you can pay with Australian Dollar and speak En- • English names in canonical and conjugated form, glish” or omission explicitly expressed entity ”Where • Documentation, you can pay using Australian Dollar and speak En- glish”). The next task is to add the ability to use the • CycL [13] language specification, property values that are not labeled directly in DBpe- dia resource (e.g. Australian Dollar rather than Aus- • Cyc API. tralian Dollar). In the longer term the students should try to introduce complex queries in form of ”Which country has currency that was Introduced on February 4.1.1 Cyc Knowledge Base 14th 1966 and language English”. OpenCyc is shipped with a knowledge base contain- ing over 300 000 concepts that are connected with 15 000 relations using 3 million of taxonomical as- 4 Other Knowledge Bases sertions. New assertions are constantly added to the database, mainly as a result of knowledge inference. 4.1 OpenCyc All concepts in knowledge base are treated as key- Cyc [10] is a rule based expert system with common words of CycL [13] language which is a formal repre- sense knowledge base. It contains knowledge consist- sentation of Cyc knowledge.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages6 Page
-
File Size-