Wikidata Through the Eyes of Dbpedia
Total Page:16
File Type:pdf, Size:1020Kb
Undefined 1 (2009) 1–5 1 IOS Press Wikidata through the Eyes of DBpedia Ali Ismayilov a Dimitris Kontokostas b Sören Auer a Jens Lehmann a and Sebastian Hellmann b a University of Bonn, Enterprise Information Systems and Fraunhofer IAIS e-mail: [email protected] | [email protected] | [email protected] b Universität Leipzig, Institut für Informatik, AKSW e-mail: {lastname}@informatik.uni-leipzig.de Abstract. DBpedia is one of the earliest and most prominent provenance information. Using the Wikidata data nodes of the Linked Open Data cloud. DBpedia extracts and model as a base, different RDF serializations are provides structured data for various crowd-maintained infor- possible. mation sources such as over 100 Wikipedia language edi- Schema Both schemas of DBpedia and Wikidata are tions as well as Wikimedia Commons by employing a ma- community curated and multilingual. The DBpe- ture ontology and a stable and thorough Linked Data pub- lishing lifecycle. Wikidata, on the other hand, has recently dia schema is an ontology based in OWL that emerged as a user curated source for structured information organizes the extracted data and integrates the which is included in Wikipedia. In this paper, we present how different DBpedia language editions. The Wiki- Wikidata is incorporated in the DBpedia eco-system. Enrich- data schema avoids direct use of RDFS or OWL ing DBpedia with structured information from Wikidata pro- terms and redefines many of them, e.g. wkd:P31 vides added value for a number of usage scenarios. We out- defines a local property similar to rdf:type. line those scenarios and describe the structure and conver- There are attempts to connect Wikidata properties sion process of the DBpediaWikidata (DBW) dataset. to RDFS/OWL and provide alternative exports of Keywords: DBpedia, Wikidata, RDF Wikidata data. Curation All DBpedia data is automatically extracted from Wikipedia and thus, is a read-only dataset. 1. Introduction Wikipedia editors are, as a side-effect, the actual curators of the DBpedia knowledge base but due In the past decade, several large and open knowl- to the semi-structured nature of Wikipedia, not edge bases were created. A popular example, DB- all content can be extracted and errors may oc- pedia [6], extracts information from more than one cur. Wikidata on the other hand has its own direct 1 hundred Wikipedia language editions and Wikimedia data curation interface called WikiBase, which is Commons [9] resulting in several billion facts. A more based on the MediaWiki framework. recent effort, Wikidata [10], is an open knowledge base Publication Both DBpedia and Wikidata publish datasets for building up structured knowledge for re-use across in a number of Linked Data ways, including Wikimedia projects. Both resources overlap as well as datasets dumps, dereferencable URIs and SPARQL complement each other as described in the high-level endpoints. overview below. Coverage While DBpedia covers most Wikipedia lan- guage editions, Wikidata unifies all language edi- Identifiers DBpedia uses human-readable Wikipedia tions and defines resources beyond Wikipedia. article identifiers to create IRIs for concepts in There has not yet been a thorough qualitative and each Wikipedia language edition. Wikidata on the quantitative comparison in terms of content but other hand uses language-independent numeric the following two studies provide a good compar- identifiers. ison overview [3,7]. Structure DBpedia starts with RDF as a base data model while Wikidata developed its own data model, which provides better means for capturing 1http://wikiba.se/ 0000-0000/09/$00.00 c 2009 – IOS Press and the authors. All rights reserved 2 F. Author et al. / Data Freshness DBpedia is a static, read-only dataset that is updated periodically. An exception is DB- 1 wkdt:Q42 wkdt:P26s wkdt:Q42Sb88670f8-456b -3ecb-cf3d-2bca2cf7371e. pedia Live (available for English, French and Ger- 2 wkdt:Q42Sb88670f8-456b-3ecb-cf3d-2 man). On the other hand, Wikidata has a direct bca2cf7371e wkdt:P580q wkdt: editing interface where people can create, update VT74cee544. or fix facts instantly. However, there has not yet 3 wkdt:VT74cee544 rdf:type :TimeValue.; been a study that compares whether facts entered 4 :time "1991-11-25"^^xsd:date; 5 :timePrecision "11"^^xsd:int; : in Wikidata are more up to date than data entered preferredCalendar wkdt:Q1985727. in Wikipedia (and thus, transitively in DBpedia). 6 wkdt:Q42Sb88670f8-456b-3ecb-cf3d-2 bca2cf7371e wkdt:P582q wkdt: We argue that as a result of this complementarity, VT162aadcb. aligning both efforts in a loosely coupled way would 7 wkdt:VT162aadcb rdf:type :TimeValue; produce an improved resource and render a number of 8 :time "2001-5-11"^^xsd:date; benefits for the end users. Wikidata would have an al- 9 :timePrecision "11"^^xsd:int; : ternate DBpedia-based view of it’s data and an addi- preferredCalendar wkdt:Q1985727. tional data distribution channel. End users would have Listing 1: RDF serialization for the fact: Douglas more options for choosing the dataset that fits better in Adams’ (Q42) spouse is Jane Belson (Q14623681) their needs and use cases. from (P580) 25 November 1991 until (P582) 11 May The remainder of this article is organized as follows. 2001. Extracted from [10] Figure 3 Section 2 provides an overview of Wikidata and DB- pedia. Following, Section 3 provides a rational for the design decision that shaped DBW, while Section 4 de- is a collection of entity pages. There are three types tails the technical details for the conversion process. A of entity pages: items, property and query. Every item description of the dataset is provided in Section 5 fol- page contains labels, short description, aliases, state- lowed with statistics in Section 6. Access and sustain- ments and site links. As depicted in Figure 1, each ability options are detailed in Section 7 and Section 8 statement consists of a claim and one or more op- discusses possible use cases of DBW. Finally, we con- tional references. Each claim consists of a property- clude in Section 9. value pair, and optional qualifiers. Values are also di- vided into three types: no value, unknown value and custom value. The “no value” marker means that there 2. Background is certainly no value for the property, the “unknown value” marker means that the property has some value, Wikidata Wikidata is a community-created knowl- but it is unknown to us and the “custom value ” which edge base to manage factual information of Wikipedia provides a known value for the property. and its sister projects operated by the Wikimedia Foun- DBpedia The semantic extraction of information dation [10]. In other words, Wikidata’s goal is to be the from Wikipedia is accomplished using the DBpedia central data management platform of Wikipedia. As of Information Extraction Framework (DIEF) [6] . DIEF April 2016, Wikidata contains more than 20 million is able to process input data from several sources pro- items and 87 million statements2 and has more than vided by Wikipedia. The actual extraction is performed 6,000 active users.3 In 2014, an RDF export of Wiki- by a set of pluggable Extractors, which rely on cer- data was introduced [2] and recently a few SPARQL tain Parsers for different data types. Since 2011, DIEF endpoints were made available as external contribu- is extended to provide better knowledge coverage for tions as well as an official one later on.4 5 Wikidata internationalized content [5] and further eases the in- tegration of different Wikipedia language editions as 2https://tools.wmflabs.org/wikidata-todo/ well as Wikimedia Commons [9]. stats.php 3https://stats.wikimedia.org/wikispecial/ EN/TablesWikipediaWIKIDATA.htm 4https://query.wikidata.org/ 3. Challenges and Design Decisions 5@prefix wkdt:< http : ==wikidata:org=entity= > . 6https://commons.wikimedia.org/wiki/File: In this section we describe the design decisions we Wikidata_statement.svg took to shape the DBpediaWikidata (DBW) dataset F. Author et al. / 3 Fig. 1. Wikidata statements, image taken from Commons 6 while maximising compatibility, (re)usability and co- step towards data integration and fusion. Most compa- herence. nies (e.g. Google, Yahoo, Bing, Samsung) keep these datasets hidden; however, our approach is to keep all New IRI minting The most important design deci- the DBpedia data open to the community for reuse and sion we had to take was whether to re-use the exist- feedback. ing Wikidata IRIs or minting new IRIs in the DBpedia namespace. The decision dates back to 2013, when this Ontology design, reification and querying The DB- project originally started and after lengthy discussions pedia ontology is an ontology developed and main- we concluded that minting new URIs was the only vi- tained since 2006, has reached a stable state and 7 able option. The main reason was the impedance mis- has 375 different datatypes and units.8 The Wikidata match between Wikidata data and DBpedia as both schema on the other hand is quite new and evolv- projects have minor deviations in conventions. Thus, ing and thus, not as stable. Simple datatype support creating new IRIs allows DBpedia to make local asser- in Wikidata started from the beginning of the project tions on Wikidata resources without raising too many but units were only introduced at the end of 2015. In concerns. addition, Wikidata did not start with RDF as a first Re-publishing minted IRIs as linked data Since 2007, class citizen. There were different RDF serializations there has been extensive tool development by the DB- of Wikidata data and in particular different reification pedia community to explore and exploit DBpedia data techniques. For example the RDF we get from content 9 10 through the DBpedia ontology. By publishing DBW, negotiation is still different from the RDF dumps we enable tools that are designed to consume the DB- and the announced reification design [2].