
Semantic Web 1 (2016) 1–5
IOS Press

Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

Editor(s): Jie Tang, Tsinghua University, China Solicited review(s): Zhigang Wang, Beijing Normal University, China; Anonymous; Sebastian Mellor, Newcastle University, U.K.

Michael Färber ∗,∗∗, Basil Ell, Carsten Menne, Achim Rettinger ∗∗∗, and Frederic Bartscherer Karlsruhe Institute of Technology (KIT), Institute AIFB, 76131 Karlsruhe, Germany

Abstract. In recent years, several noteworthy large, cross-domain, and openly available knowledge graphs (KGs) have been created. These include DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Although extensively in use, these KGs have not been subject to an in-depth comparison so far. In this survey, we provide criteria according to which KGs can be analyzed, and we analyze and compare the above-mentioned KGs along these criteria. Furthermore, we propose a framework for finding the most suitable KG for a given setting.

Keywords: Knowledge Graph, Quality, Data Quality Metrics, Comparison, DBpedia, Freebase, OpenCyc, Wikidata, YAGO

1. Introduction

The vision of the Semantic Web is to publish and query knowledge on the Web in a semantically structured way. According to Guns [21], the term "Semantic Web" was already being used in fields such as Educational Psychology before it became prominent in Computer Science. Freedman and Reynolds [19], for instance, describe "semantic webbing" as organizing information and relationships in a visual display. Berners-Lee presented his idea of using typed links as a vehicle of semantics for the first time at the World Wide Web Fall 1994 Conference under the heading "Semantics," and under the heading "Semantic Web" in 1995 [21].

The idea of a Semantic Web was introduced to a wider audience by Berners-Lee in 2001 [10]. According to his vision, the traditional Web as a Web of Documents should be extended to a Web of Data, where not only documents and links between documents, but any entity (e.g., a person or organization) and any relation between entities (e.g., isSpouseOf) can be represented on the Web.

When it comes to realizing the idea of the Semantic Web, knowledge graphs (KGs) are currently seen as one of the most essential components. The term "Knowledge Graph" was coined by Google in 2012 and is intended for any graph-based knowledge base.

We define a Knowledge Graph as an RDF graph. An RDF graph consists of a set of RDF triples where each RDF triple (s, p, o) is an ordered set of the following RDF terms: a subject s ∈ U ∪ B, a predicate p ∈ U, and an object o ∈ U ∪ B ∪ L. An RDF term is either a URI u ∈ U, a blank node b ∈ B, or a literal l ∈ L. U, B, and L are pairwise disjoint. We denote the system that hosts a KG g with hg.

*Corresponding author. E-mail: [email protected].
**This work was carried out with the support of the German Federal Ministry of Education and Research (BMBF) within the Software Campus project SUITE (Grant 01IS12051).
***The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 611346.
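The RDF graph definition above can be sketched in code. The following is a minimal illustration, not part of the survey; the term sets and the toy graph are invented for demonstration:

```python
# Minimal sketch of the RDF graph definition: a KG is a set of triples
# (s, p, o) with s ∈ U ∪ B, p ∈ U, and o ∈ U ∪ B ∪ L.
# All URIs, blank nodes, and literals below are invented toy data.
U = {"ex:Obama", "ex:spouseOf", "ex:Michelle", "ex:birthDate"}  # URIs
B = {"_:b1"}                                                    # blank nodes
L = {'"1961-08-04"'}                                            # literals

# U, B, and L must be pairwise disjoint.
assert U.isdisjoint(B) and U.isdisjoint(L) and B.isdisjoint(L)

def is_valid_triple(s, p, o):
    """Check the typing constraints on an RDF triple."""
    return s in U | B and p in U and o in U | B | L

g = {
    ("ex:Obama", "ex:spouseOf", "ex:Michelle"),
    ("ex:Obama", "ex:birthDate", '"1961-08-04"'),
}
assert all(is_valid_triple(*t) for t in g)
```

Note that a literal may only appear in object position, which is exactly what the typing constraint on s rules out.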

1570-0844/16/$27.50 © 2016 – IOS Press and the authors. All rights reserved

Further, we define several sets that are used in formalizations throughout the paper:

– Cg denotes the set of classes in g, defined as Cg := {x | (x, rdfs:subClassOf, o) ∈ g ∨ (s, rdfs:subClassOf, x) ∈ g ∨ (x, rdf:type, owl:Class) ∈ g ∨ (s, wdt:P31, x) ∈ g ∨ (s, wdt:P279, x) ∈ g}
– Ig denotes the set of instances in g, defined as Ig := {s | (s, rdf:type, o) ∈ g}
– Pg^imp denotes the set of all implicitly defined properties in g, defined as Pg^imp := {p | (s, p, o) ∈ g}
– Rg denotes the set of all URIs used in g, defined as Rg := {x | ((x, p, o) ∈ g ∨ (s, x, o) ∈ g ∨ (s, p, x) ∈ g) ∧ x ∈ U}

Note that knowledge about the knowledge graphs analyzed in the context of this survey was taken into account when defining these sets. These definitions may not be appropriate for other KGs. Furthermore, the sets' extensions would be different when assuming a certain semantic (e.g., RDF, RDFS, or OWL-LD). Under the assumption that all entailments under one of these semantics were added to a KG, the definition of each set could be simplified and the extensions would be of larger cardinality. However, in this work we did not derive entailments.

In this survey, we focus on those KGs having the following aspects:

1. The KGs are freely accessible and freely usable within the Linked Open Data (LOD) cloud. Linked Data refers to a set of best practices1 for publishing and interlinking structured data on the Web, defined by Berners-Lee [8] in 2006. Linked Open Data refers to the Linked Data which "can be freely used, modified, and shared by anyone for any purpose."2 The aim of the Linking Open Data community project3 is to publish RDF data sets on the Web and to interlink these data sets.
2. The KGs should cover general knowledge (often also called cross-domain or encyclopedic knowledge) instead of knowledge about special domains such as biomedicine.

Thus, out of scope are KGs which are not openly available, such as the Google Knowledge Graph4 and the Google Knowledge Vault [14]. Excluded are also KGs which are only accessible via an API but which are not provided as dump files (see WolframAlpha5 and the Facebook Graph API6), as well as KGs which are not based on Semantic Web standards at all or which comprise only unstructured or weakly structured knowledge collections (e.g., The World Factbook of the CIA7).

For selecting the KGs for analysis, we regarded all datasets which were registered at the online dataset catalog http://datahub.io8 and which were tagged as "crossdomain". Besides that, we took further data sets into consideration which fulfilled the above-mentioned requirements (i.e., Wikidata). Based on that, we selected DBpedia, Freebase, OpenCyc, Wikidata, and YAGO as KGs for our comparison.

In this paper, we give a systematic overview of these KGs in their current versions and discuss how the knowledge in these KGs is modeled, stored, and can be queried. To the best of our knowledge, such a comparison between these widely used KGs has not been presented before. Note that the focus of this survey is not the life cycle of KGs on the Web or in enterprises. We can refer in this respect to [5]. Instead, the focus of our KG comparison is on data quality, as this is one of the most crucial aspects when it comes to considering which KG to use in a specific setting.

Furthermore, we provide an evaluation framework for users who are interested in using one of the mentioned KGs in a research or industrial setting, but who are inexperienced in which KG to choose for their concrete settings.

The main contributions of this survey are:

1. Based on existing literature on data quality, we provide 34 data quality criteria according to which KGs can be analyzed.

1See http://www.w3.org/TR/ld-bp/, requested on Apr 5, 2016.
2See http://opendefinition.org/, requested on Apr 5, 2016.
3See http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData, requested on Apr 5, 2016.
4See http://www.google.com/insidesearch/features/search/knowledge.html
5See http://products.wolframalpha.com/api/
6See https://developers.facebook.com/docs/graph-api
7See https://www.cia.gov/library/publications/the-world-factbook/
8This catalog is also used for registering Linked Open Data datasets.

2. We calculate key statistics for the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.
3. We analyze DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along the mentioned data quality criteria.9
4. We propose a framework which enables users to find the most suitable KG for their needs.

The survey is organized as follows:

– In Section 2 we describe the data quality dimensions which we later use for the KG comparison, including their subordinated data quality criteria and corresponding data quality metrics.
– In Section 3 we describe the selected KGs.
– In Section 4 we analyze the KGs along several key statistics as well as the data quality metrics introduced in Section 2.
– In Section 5 we present our framework for assessing and rating KGs according to the user's setting.
– In Section 6 we present related work on (linked) data quality criteria and on key statistics for KGs.
– In Section 7 we conclude the survey.

2. Data Quality Assessment w.r.t. KGs

Everybody on the Web can publish information. Therefore, a data consumer does not only face the challenge of finding a suitable data source, but is also confronted with the issue that data on the Web can differ very much regarding their quality. Data quality can thereby be viewed not only in terms of accuracy, but in multiple other dimensions. In the following, we introduce concepts regarding the data quality of KGs in the Linked Data context, which are used in the following sections. The data quality dimensions are then exposed in Section 2.1.

Data quality (DQ) – in the following interchangeably used with information quality – is defined by Juran [29] as fitness for use. This means that data quality is dependent on the actual use case.

One of the most important and foundational works on data quality is that of Wang et al. [40]. They developed a framework for assessing the data quality of data sets in the database context. In this framework, Wang et al. distinguish between data quality criteria, data quality dimensions, and data quality categories.10 In the following, we reuse these concepts for our own framework, which has the particular focus on the data quality of KGs in the context of Linked Open Data.

A data quality criterion (Wang et al. also call it "data quality attribute") is a particular characteristic of data w.r.t. its quality and can be either subjective or objective. Examples of subjectively measurable data quality criteria are the trustworthiness and the reputation of a KG in its entirety. Examples of objective data quality criteria are accuracy and completeness (see [39] and also Section 2.1).

In order to measure the degree to which a certain data quality criterion is fulfilled for a given KG, each criterion is formalized and expressed in terms of a function with the value range [0, 1]. We call this function the data quality metric of the respective data quality criterion.

A data quality dimension – in the following just called dimension – is a main aspect of how data quality can be viewed. A data quality dimension comprises one or several data quality criteria [40]. For instance, the criteria syntactic validity of RDF documents, syntactic validity of literals, and semantic validity of triples form the accuracy dimension.

Data quality dimensions and their respective data quality criteria are further grouped into data quality categories. Based on empirical studies, Wang et al. specified four categories:

– Criteria of the category of the intrinsic data quality are independent of the use case context; by means of these criteria, one can check whether the information reflects the reality and is logically consistent.
– Criteria of the category of the contextual data quality cannot be considered in general, but must be assessed depending on the application context of the data consumer.
– Criteria of the category of the representational data quality reveal in which form the information is available.
– Criteria of the category of the accessibility data quality determine how the data can be accessed.

Since its publication, the presented framework of Wang et al. has been extensively used, either in its original version or in an adapted or extended version. Bizer [11] and Zaveri [42] worked on data quality in the Linked Data context.

9The data and detailed evaluation results for both the key statistics and the metric evaluations are available online at http://km.aifb.kit.edu/sites/knowledge-graph-comparison/.
10The quality dimensions are defined in [40], the sub-classification into parameters/indicators in [39, p. 354].
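The relationship between criteria, metrics, dimensions, and categories can be sketched as a small data model. The following is only an illustration of that structure; the toy KG, the metric m_labeled, and the dimension/category names are invented assumptions, not definitions from the survey:

```python
# Sketch of the concepts above: a data quality metric is a function that
# maps a KG to a score in [0, 1]; a dimension groups one or several
# criteria (here: their metric functions); categories group dimensions.
# The toy KG and the metric below are invented for illustration.
from typing import Callable, Dict, Set, Tuple

Triple = Tuple[str, str, str]
KG = Set[Triple]
Metric = Callable[[KG], float]

def m_labeled(g: KG) -> float:
    """Toy metric: fraction of subjects that carry an rdfs:label."""
    subjects = {s for (s, p, o) in g}
    if not subjects:
        return 1.0  # empty denominator: metric evaluates to 1
    labeled = {s for (s, p, o) in g if p == "rdfs:label"}
    return len(labeled & subjects) / len(subjects)

example_dimension: Dict[str, Metric] = {"labeling": m_labeled}
example_category: Dict[str, Dict[str, Metric]] = {"example": example_dimension}

g: KG = {
    ("ex:Obama", "rdfs:label", "Barack Obama"),
    ("ex:Obama", "ex:birthDate", "1961-08-04"),
    ("ex:Berlin", "ex:population", "3500000"),
}
score = example_category["example"]["labeling"](g)
print(score)  # 0.5: one of the two subjects has a label
```

The point of the sketch is only the shape of the framework: every concrete criterion later in the survey is realized as such a function with range [0, 1].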

Bizer and Zaveri make the following adaptations to Wang et al.'s framework:

– Bizer [11] compared the work of Wang et al. [40] with other works in the area of data quality. He thereby complements the framework with the dimensions consistency, verifiability, and offensiveness.
– Zaveri [42] follows Wang et al. [40], but introduces licensing and interlinking as additional dimensions.

In this article, we use the dimensions as defined by Wang et al. [40] and as extended by Bizer [11] and Zaveri [42]. I.e., besides Wang et al.'s dimensions, we incorporate consistency and verifiability into our framework11 and extend the category of accessibility by the dimensions license and interlinking, as those data quality dimensions become additionally relevant in the Linked Data context.

2.1. Criteria Weighting

When applying our framework to compare KGs, the single metrics and dimensions can be weighted differently so that the needs and requirements of the users can be taken into account. In the following, we first formalize the idea of weighting the different metrics. We then present the criteria and the corresponding metrics of our framework.

Given are a knowledge graph g, a set of criteria C = {c1, ..., cn}, a set of metrics M = {m1, ..., mn}, and a set of weights W = {w1, ..., wn}. Each metric mi corresponds to the criterion ci, and mi(g) ∈ [0, 1], where a value of 0 defines the minimum fulfillment degree of a knowledge graph regarding a quality criterion and a value of 1 the maximum fulfillment degree. Furthermore, each criterion ci is weighted by wi.

The fulfillment degree h(g) ∈ [0, 1] of a KG g is then the weighted normalized sum of the fulfillment degrees w.r.t. the criteria c1, ..., cn:

h(g) = ( Σ_{i∈[1,n]} wi · mi(g) ) / ( Σ_{j∈[1,n]} wj )

Based on the quality dimensions introduced by Wang et al. [40], we now present the DQ criteria and metrics as used in our KG comparison. Note that some of the criteria have already been introduced by others, as outlined in Section 6.

2.2. Intrinsic Category

"Intrinsic data quality denotes that data have quality in their own right." [40, p. 6] This kind of data quality can therefore be assessed independently from the context. The intrinsic category embraces the three dimensions accuracy, trustworthiness, and consistency, which are defined in the following subsections. The dimensions believability, objectivity, and reputation, which are separate dimensions in Wang et al.'s classification system [40], are subsumed by us into the dimension trustworthiness.

2.2.1. Accuracy

Criterion definition. Accuracy is "the extent to which data are correct, reliable, and certified free of error." [40, p. 31]

Discussion. Accuracy is intuitively an important dimension of data quality. Previous work on data quality has mainly analyzed only this aspect [40]. Hence, accuracy has often been used as a synonym for data quality [34]. Bizer [11] highlights in this context that accuracy is an objective dimension and can only be applied to verifiable statements.

Batini et al. [6] distinguish syntactic and semantic accuracy: Syntactic accuracy describes the formal compliance to syntactic rules without reviewing whether the value reflects the reality. Semantic accuracy determines whether the value is semantically valid, i.e., whether the value is true. Based on the classification of Batini et al., we can define the metric for accuracy as follows:

Metric definition. The metric for the dimension accuracy is determined by

1. the syntactic validity of RDF documents,
2. the syntactic validity of literals, and
3. the semantic validity of triples.

The fulfillment degree of a KG g w.r.t. the dimension accuracy is measured by the metrics msynRDF, msynLit, and msemTriple, which are defined as follows.

Syntactic validity of RDF documents. The syntactic validity of RDF documents is an important requirement for machines to interpret an RDF document completely and correctly. Hogan et al. [26] suggest using standardized tools for creating RDF data. The authors state that in this way normally only few syntax errors occur, despite the complex syntactic representation of RDF/XML. The RDF can be validated by an RDF validator such as the W3C RDF validator.12

msynRDF(g) =
  1  if all RDF documents are valid
  0  otherwise

Syntactic validity of literals. The aspect of syntactic validity of literals focuses on the adherence to syntactic restrictions w.r.t. literals by means of syntactic rules [20,31]. Syntactic rules can be written in the form of regular expressions. For instance, it can be verified whether a literal representing a date follows the ISO 8601 specification.

msynLit(g) = |{(s, p, o) ∈ g | o ∈ L ∧ synValid(o)}| / |{(s, p, o) ∈ g | o ∈ L}|

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.

Semantic validity of triples. The semantic validity of triples is introduced to evaluate whether the meanings of triples with literal values in object position in the KG are semantically correct. A triple is semantically correct if it is also available from a trusted source (e.g., the Name Authority File) or if the stated property can be measured or perceived by us directly.

msemTriple(g) = |{(s, p, o) ∈ g | o ∈ L ∧ semValid(s, p, o)}| / |{(s, p, o) ∈ g | o ∈ L}|

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.

2.2.2. Trustworthiness

Definition. Trustworthiness is defined as "the degree to which the information is accepted to be correct, true, real, and credible" [42]. It is used as a collective term for believability, reputation, objectivity, and verifiability. These aspects were defined by Wang et al. [40] and Naumann [34, p. 32] as follows:

– Believability: Believability is "the extent to which data are accepted or regarded as true, real, and credible" [40, p. 31].
– Reputation: Reputation is "the extent to which data are trusted or highly regarded in terms of their source or content" [40, p. 32].
– Objectivity: Objectivity is "the extent to which data are unbiased (unprejudiced) and impartial" [40, p. 32].
– Verifiability: Verifiability is "the degree and ease with which the data can be checked for correctness" [34, p. 32].

Discussion. In summary, believability considers the subject (data consumer) side; reputation takes the general, social view on trustworthiness; objectivity considers the object (data provider) side, while verifiability focuses on the possibility of verification.

Trustworthiness has been discussed as follows:

– According to Naumann [34, p. 34], believability is the "expected accuracy" of a data source.
– The essential difference of believability to accuracy is that for believability, data is trusted without verification [11]. Thus, believability is closely related to the reputation of a data set.
– According to Naumann [34], the objectivity of a data source is strongly related to its verifiability: The more verifiable a data source or statement is, the more objective it is. The authors of this paper would not go so far, since also biased statements could be verifiable.
– Heath et al. [24, p. 52] emphasize that it is essential for trustworthy applications to be able to verify the origin of data.
– In the context of relational databases, Buneman et al. [13] differentiate between why provenance and where provenance: Why provenance answers the question why a particular data item is stored in the database. Where provenance answers the question from where the information originated. In this paper, we focus on the where provenance, as this aspect is frequently covered in the database and Linked Data area.

Metric. We define the metric for the data quality dimension trustworthiness as a combination of trustworthiness metrics on both KG and statement level. The fulfillment degree of a KG g w.r.t. the dimension trustworthiness is measured by the metrics mgraph, mfact, and mNoVal, which are defined as follows.

Trustworthiness on KG level. The measure of trustworthiness on KG level exposes a basic indication of the trustworthiness of the KG: In this assessment, (i) the method of data curation and (ii) the method of data insertion (respectively, the data source) is taken into account. Regarding the method of data curation, we distinguish between manual and automated methods. Regarding the data insertion, we can differentiate between 1. whether the data is entered by experts (of a specific domain), 2. whether the knowledge comes from volunteers contributing in a community, and 3. whether the knowledge is extracted automatically from a data source. This data source can itself be either structured, semi-structured, or unstructured. We assume that a closed system, where experts or other registered users feed knowledge into a system, is less vulnerable to harmful behavior of users than an open system, where data is curated by a community. Therefore, we assign the values of the metric for trustworthiness on KG level as follows:

mgraph(hg) =
  1     manual data curation, manual data insertion in a closed system
  0.75  manual data curation and insertion, both by a community
  0.5   manual data curation, data insertion by user OR automated knowledge extraction from structured data sources
  0.25  automated data curation, data insertion by automated knowledge extraction from structured data sources
  0     automated data curation, data insertion by automated knowledge extraction from unstructured data sources

Note that a user may use different values instead of the 1, 0.75, etc. This also applies for subsequent metrics.

Trustworthiness on statement level. The fulfillment of trustworthiness on statement level is determined by an evaluation whether a provenance vocabulary is used. By means of a provenance vocabulary, the source of statements can be stored. Storing source information on statement level is an important precondition to verify statements easily w.r.t. accuracy. The most widely used ontologies for storing provenance information are the Dublin Core terms13 with properties such as dcterms:provenance and dcterms:source, and the W3C PROV Ontology14 with properties such as prov:wasDerivedFrom.

mfact(g) =
  1    provenance on statement level is used
  0.5  provenance on resource level is used
  0    otherwise

Indicating unknown and empty values. If the data model of the considered KG supports the representation of unknown and empty values, more complex statements can be represented. For instance, empty values enable to represent that a person has no children, and unknown values enable to represent that the birth date of a person is not known. This kind of higher explanatory power of a KG increases the trustworthiness of the KG:

mNoVal(g) =
  1    unknown and empty values are used
  0.5  unknown or empty values are used
  0    otherwise

2.2.3. Consistency

Definition. Consistency implies that "two or more values [in a data set] do not conflict with each other" [32].

Discussion. Due to the high variety of data providers in the Web of Data, a user must expect data inconsistencies. Data inconsistencies may be caused by (i) different information providers, (ii) different levels of knowledge, and (iii) different views of the world [11].

In OWL, schema restrictions can be divided into class restrictions and relation restrictions [7]. Class restrictions refer to classes. For instance, one can specify via owl:disjointWith that two classes have no common instance. Relation restrictions refer to the usage of relations. They can be classified into value constraints and cardinality constraints. Value constraints determine the range of relations. owl:someValuesFrom, for instance, specifies that at least one value of a relation belongs to a certain class. We also consider the generic "constraints" rdfs:domain and rdfs:range here. Cardinality constraints limit the number of times a relation may exist per resource. Moreover, via owl:FunctionalProperty and owl:InverseFunctionalProperty, global cardinality constraints can be specified. Functional relations permit at most one value per resource (e.g., the birth date of a person); inverse functional relations specify that a value should only occur once per resource (e.g., an id).

Metric. We can measure the data quality dimension consistency by means of (i) whether schema constraints are checked during the insertion of new statements into the KG and (ii) whether the already existing statements in the KG are consistent with the specified class and relation constraints. The fulfillment degree of a KG g w.r.t. the dimension consistency is measured by the metrics mcheckRestr, mconClass, and mconRelat, which are defined as follows.

Check of schema restrictions during insertion of new statements. Checking the schema restrictions during the insertion of new statements can help to reject facts that would render the KG inconsistent. Such simple checks are often done on the client side in the user interface. For instance, the application checks whether data with the right data type is inserted. Due to the dependency on the actually inserted data, the check needs to be custom-designed. Simple rules are applicable; however, inconsistencies can still appear if no suitable rules are available. Examples of consistency checks are: checking the expected data types of literals; checking whether the entity to be inserted has a valid entity type (checking the rdf:type relation) and whether the assigned classes of the entity are disjoint, i.e., contradicting each other (utilizing owl:disjointWith relations).

mcheckRestr(hg) =
  1  schema restrictions are checked
  0  otherwise

Consistency of statements w.r.t. class constraints. This metric is intended to measure the degree to which the instance data are consistent with the class restrictions (e.g., owl:disjointWith) specified on the schema level. We evaluate the criterion as follows. Let CC be the set of all class constraints, defined as CC := {(c1, c2) | (c1, owl:disjointWith, c2) ∈ g}. Furthermore, let cg(e) be the set of all classes of e in g, defined as cg(e) := {c | (e, rdf:type, c) ∈ g}.

mconClass(g) = |{(c1, c2) ∈ CC | ¬∃e : (c1 ∈ cg(e) ∧ c2 ∈ cg(e))}| / |CC|

In case of an empty set of class constraints CC, the metric should evaluate to 1.

Consistency of statements w.r.t. relation constraints. This metric is intended to measure the degree to which the instance data are consistent with the relation restrictions specified on the schema level (e.g., rdfs:range and owl:FunctionalProperty). We evaluate this criterion as follows:

mconRelat(g) = ( mconRelatRange(g) + mconRelatFunct(g) ) / 2

Let RC be the set of all relation constraints, defined as RC := RCr ∪ RCf, where RCr := {(p, d) | (p, rdfs:range, d) ∈ g ∧ isDatatype(d)} and RCf := {(p, d) | (p, rdf:type, owl:FunctionalProperty) ∈ g ∧ (p, rdfs:range, d) ∈ g ∧ isDatatype(d)}.

mconRelatRange(g) = |{(s, p, o) ∈ g | ∃(p, d) ∈ RCr : datatype(o) = d}| / |{(s, p, o) ∈ g | ∃(p, d) ∈ RCr}|

mconRelatFunct(g) = |{(s, p, o) ∈ g | ∃(p, d) ∈ RCf : ¬∃(s, p, o2) ∈ g : o ≠ o2}| / |{(s, p, o) ∈ g | ∃(p, d) ∈ RCf}|

In case of an empty set of relation constraints (RCr or RCf), the respective metric should evaluate to 1.

2.3. Contextual Category

Contextual data quality "highlights the requirement that data quality must be considered within the context of the task at hand" [40, p. 6]. This category contains the three dimensions (i) relevancy, (ii) completeness, and (iii) timeliness. Wang et al.'s further dimensions in this category, appropriate amount of data and value-added, are considered by us as being a part of completeness.

2.3.1. Relevancy

Definition of dimension. Relevancy is "the extent to which data are applicable and helpful for the task at hand" [40, p. 31].

Discussion. According to Bizer [11], relevancy is an important quality dimension, since the user is confronted with a variety of potentially relevant information on the Web.

Definition of metric. The dimension relevancy is determined by the criterion creating a ranking of statements.15 The fulfillment degree of a KG g w.r.t. the dimension relevancy is measured by the metric mRanking, which is defined as follows.

Creating a ranking of statements. By means of this criterion one can determine whether the KG supports a ranking of statements by which the relative relevancy of statements among each other can be expressed. For instance, given the Wikidata entity "Barack Obama" (wdt:Q76) and the relation "position held", "President of the United States of America" has a "PreferredRank" (until this year), while older positions which he no longer holds are ranked with "NormalRank".

mRanking(g) =
  1  ranking of statements supported
  0  otherwise

Note that this criterion refers to a property of the KG and not to a property of the system that hosts the KG.

2.3.2. Completeness

Definition of dimension. Completeness is "the extent to which data are of sufficient breadth, depth, and scope for the task at hand" [40, p. 32]. We include the following aspects of Wang et al. in this dimension:

– Appropriate amount of data: Appropriate amount of data is "the extent to which the quantity or volume of available data is appropriate" [40, p. 32].
– Value-added: Value-added is "the extent to which data are beneficial and provide advantages from their use" [40, p. 31].

Discussion. Pipino et al. [35] divide completeness further into

1. schema completeness, i.e., the extent to which classes and relations are not missing,
2. column completeness, i.e., the extent to which values of relations on instance level are not missing, and
3. population completeness, i.e., the extent to which the KG contains all entities.

The completeness dimension is context-dependent and therefore belongs to the contextual category, because whether a KG is seen as complete depends on the use case scenario, i.e., on the given KG and on the information need of the user. As exemplified by Bizer [11], a list of German stocks is complete for an investor who is interested in German stocks, but it is not complete for an investor who is looking for an overview of the European stocks. The completeness is, hence, only assessable by means of a concrete use case at hand and with the help of a defined gold standard.

Definition of metric. We stick to the above-mentioned distinction of Pipino et al. [35] and define the metric for completeness by means of the criteria schema completeness, column completeness, and population completeness. The fulfillment degree of a KG g w.r.t. the dimension completeness is measured by the metrics mcSchema, mcCol, and mcPop, which are defined as follows.

Schema completeness. By means of the schema completeness criterion, one can determine the completeness of the schema w.r.t. classes and attributes [35]. The schema is assessed by means of a gold standard. This gold standard consists of facts which a user deems relevant. In this paper, we exemplify the gold standard with a typical set of cross-domain facts. It comprises (i) basic classes such as persons and locations in different granularities and (ii) basic attributes such as birth date and number of inhabitants. We define the schema completeness mcSchema as the ratio of the number of classes and attributes of the gold standard existing in g, noclat, and the number of classes and attributes in the gold standard, noclatg.

11Consistency is treated by us as a separate dimension. Verifiability is treated within the dimension trustworthiness, where it is evaluated if provenance vocabulary is used in the KG. The offensiveness of KG facts is not considered by us, as it is hard to make an objective evaluation in this regard.
12See http://www.w3.org/RDF/Validator, requested on Feb 29, 2016.
13See http://purl.org/dc/terms/.
14See http://www.w3.org/ns/prov#.

mcSchema(g) = noclat / noclatg

15 We do not consider the relevancy of literals, as there is no ranking of literals provided for the considered KGs.
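The ratio above can be made concrete with a short sketch. The gold-standard classes and attributes below are toy placeholders, not the actual cross-domain gold standard used in this paper:

```python
# Sketch of the schema completeness metric: the share of gold-standard
# classes and attributes that are present in the KG's schema.
# All identifiers are illustrative toy values.

def schema_completeness(kg_schema_terms, gold_standard):
    """Gold-standard classes/attributes found in the KG (noclat),
    divided by the size of the gold standard (noclatg)."""
    if not gold_standard:
        return 1.0
    found = sum(1 for term in gold_standard if term in kg_schema_terms)
    return found / len(gold_standard)

gold = {"dbo:Person", "dbo:Place", "dbo:birthDate", "dbo:populationTotal"}
kg_schema = {"dbo:Person", "dbo:Place", "dbo:birthDate", "dbo:elevation"}
print(schema_completeness(kg_schema, gold))  # 0.75
```

In practice, the KG's schema terms would be collected from its ontology, e.g., via a SPARQL query over the class and property declarations.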

Column Completeness By means of the column completeness criterion, one can determine the degree to which the attributes of a class, which are defined on the schema level, exist on the instance level of the KG [35]. Assume that K = {k1, ..., kn} is the set of all considered classes and R = {r1, ..., rm} the set of considered relations. Then H = {(k, r) ∈ (K × R) | k ∈ Cg ∧ r ∈ Pg^imp} is the set of all combinations of k and r which can occur on the instance level based on the schema information. We measure the column completeness by evaluating whether the relation r which was defined for a class k on schema level is used on instance level (e.g., the class dbo:Person in DBpedia determines the relation dbo:birthDate). The column completeness mcCol is defined as the average, over the pairs in H, of the ratio of the number of instances having class k and a value for the relation r, nokr, and the number of all instances having class k, nor:

mcCol(g) = (1/n) · Σ_{(k,r) ∈ H} nokr / nor

Note that there are also relations which do not need to exist for all entities of the relation's dedicated entity type. For instance, not all people need to have a relation :hasChild or :deathDate.16 For measuring the column completeness, we selected only those relations for an assessment where a value of the relation typically exists for each entity type the relation is dedicated for.

Population Completeness The population completeness metric determines the extent to which the considered KG covers the basic population [35]. The assessment of the completeness of the basic population is performed by means of a gold standard, which covers both well-known resources (called "short head", e.g., the n largest cities in the world according to the number of inhabitants) and little-known resources (called "long tail", e.g., municipalities in Germany). We take all resources of our gold standard equally into account:

mcPop(g) = Number of resources in gold standard existing in g / Number of resources in gold standard

2.3.3. Timeliness

Definition of dimension. Timeliness is "the extent to which the age of the data is appropriate for the task at hand" [40, p. 32].

Discussion. Timeliness does not describe the creation date of a statement, but instead the time range since the last update or the last verification of the statement [34]. Due to the easy way of publishing data on the Web, data sources can be kept up-to-date more easily than traditional isolated data sources. This results in advantages for the consumer of Web data [34]. How timeliness is measured depends on the application context: For some situations years are sufficient, while in other situations one may need days [34].

Metric. The dimension timeliness is determined by the criteria timeliness frequency of the KG, specification of the validity period, and specification of the modification date of statements. The fulfillment degree of a KG g w.r.t. the dimension timeliness is measured by the metrics mFreq, mValidity, and mChange which are defined as follows.

Timeliness frequency of the KG The timeliness frequency of a KG indicates how fast the KG is updated. We consider the KG RDF export here and differentiate between continuous updates, where the updates are performed immediately, and discrete KG updates, where the updates take place in discrete time intervals. If the RDF export files of the KG are provided with varying timeliness frequencies (i.e., updating intervals), we consider the online version of the KG, since in the context of Linked Data it is sufficient that URIs are dereferenceable.

mFreq(g) = 1 if continuous updates, 0.5 if discrete periodic updates, 0 otherwise

Specification of the validity period of statements Specifying the validity period of statements makes it possible to temporally limit the validity of statements. Regarding this criterion, we measure whether the KG supports the specification of start and, possibly, end dates of statements by means of providing suitable forms of representation.

mValidity(g) = 1 if the specification of validity periods is supported, 0 otherwise

16 For an evaluation about the prediction which relations are of this nature, see [1].
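The column completeness metric described above can be sketched over a toy triple set. Triples are (subject, predicate, object) tuples; class membership is modelled via a "type" predicate, and all identifiers are illustrative only:

```python
# Sketch of the column completeness metric m_cCol over toy triples.

def column_completeness(triples, pairs):
    """Average, over (class k, relation r) pairs in H, of the fraction
    no_kr / no_r of k's instances that have a value for r."""
    ratios = []
    for klass, rel in pairs:
        instances = {s for (s, p, o) in triples if p == "type" and o == klass}
        if not instances:
            continue  # no instances of k: the pair contributes no ratio
        with_value = {s for (s, p, o) in triples if p == rel and s in instances}
        ratios.append(len(with_value) / len(instances))
    return sum(ratios) / len(ratios) if ratios else 1.0

triples = [
    ("alice", "type", "Person"),
    ("bob", "type", "Person"),
    ("alice", "birthDate", "1980-01-01"),
]
print(column_completeness(triples, [("Person", "birthDate")]))  # 0.5
```

Against a real KG, the two set comprehensions would correspond to SPARQL COUNT queries per (class, relation) pair.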

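Two of the timeliness criteria just defined, the specification of validity periods and of modification dates, can be probed mechanically by checking whether a KG uses any relation expressing them. The ex: validity-period properties below are hypothetical placeholders (each KG models validity periods differently, e.g., Wikidata via qualifiers); the modification-date properties follow the schema.org and DCMI terms named in the text:

```python
# Sketch for the validity-period and modification-date criteria: probe a
# toy triple set for relations expressing temporal validity or the last
# modification. VALIDITY_PROPS is an assumption for illustration.

VALIDITY_PROPS = {"ex:startDate", "ex:endDate"}
MODIFICATION_PROPS = {
    "http://schema.org/dateModified",
    "http://purl.org/dc/terms/modified",
}

def m_validity(triples):
    """1 if the KG states validity periods for statements, else 0."""
    return 1.0 if any(p in VALIDITY_PROPS for (_, p, _) in triples) else 0.0

def m_change(triples):
    """1 if the KG states modification dates for statements, else 0."""
    return 1.0 if any(p in MODIFICATION_PROPS for (_, p, _) in triples) else 0.0

kg = [
    ("ex:Obama_position", "ex:endDate", "2017-01-20"),
    ("ex:Obama_position", "http://purl.org/dc/terms/modified", "2015-10-01"),
]
print(m_validity(kg), m_change(kg))  # 1.0 1.0
```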
Specification of the modification date of statements The modification date discloses the point in time of the last verification of a statement. The modification date is typically represented via the relations schema:dateModified17 and dcterms:modified.

mChange(g) = 1 if g supports the specification of modification dates for statements, 0 otherwise

2.4. Representational Data Quality

Representational data quality "contains aspects related to the format of the data [...] and meaning of data" [40, p. 21]. This category contains the two dimensions (i) ease of understanding (i.e., regarding the human-readability) and (ii) interoperability (i.e., regarding the machine-readability). The dimensions interpretability, representational consistency, and concise representation as proposed by Wang et al. [40] are additionally considered by us as being part of the dimension interoperability.

2.4.1. Ease of Understanding

Definition of dimension. The ease of understanding is "the extent to which data are clear without ambiguity and easily comprehended" [40, p. 32].

Discussion. This dimension focuses on the understandability of a data source by a human data consumer. In contrast, the dimension interoperability focuses on technical aspects. The understandability of a data source (here: KG) can be improved by means such as self-explanatory labels and literals in multiple languages.

The dimension ease of understanding is determined by the criteria description of resources, labels in multiple languages, provisioning of an understandable RDF serialization, and self-describing URIs. The fulfillment degree of a KG g w.r.t. this dimension is measured by the metrics mDescr, mLang, muSer, and muURI which are defined as follows.

Description of resources Heath et al. [24,27] suggest describing resources in a human-understandable way, e.g. via rdfs:label or rdfs:comment. Within our framework, the criterion is measured as follows: Given a sample of resources, we identify the number of resources for which at least one label or one description is provided (e.g., via rdfs:label, rdfs:comment, or schema:description), nodesc, and the number of all considered resources, nocr.

mDescr(g) = nodesc / nocr

Labels in multiple languages Resources in the KG are described in a human-readable way via labels, e.g. rdfs:label or skos:prefLabel.18 The characteristic feature of skos:prefLabel is that this kind of label has to be used at most once per resource; in contrast, rdfs:label has no cardinality restrictions, i.e., it can be used several times for a given resource. Labels are usually provided in English as the "basic language". The metric for the criterion labels in multiple languages determines whether labels in languages other than English are provided in the KG.

mLang(g) = 1 if labels are provided in English and at least one other language, 0 otherwise

Understandable RDF serialization RDF/XML is the recommended RDF serialization format of the W3C. However, due to its syntax, RDF/XML documents are hard to read for humans. The understandability of RDF data by humans can be increased by providing RDF in other, more human-understandable serialization formats such as N3, N-Triples, and Turtle. We measure this criterion by evaluating the serialization formats supported during the dereferencing of resources.

muSer(hg) = 1 if such RDF serializations are available, 0 otherwise

Note that data conversions into another RDF serialization format are easy to perform.

17 Using the namespace http://schema.org/.
18 Using the namespace http://www.w3.org/2004/02/skos/core#.
19 See https://www.w3.org/TR/cooluris, requested Mar 1, 2016.

Self-describing URIs Self-descriptive URIs contribute to a better human-readability of KG data. Sauermann et al.19 recommend to use short, memorable URIs in the Semantic Web context, which are easier to understand and memorize by humans compared
to URIs such as wdt:Q1040. The criterion self-describing URIs is dedicated to evaluating whether self-describing URIs or generic IDs are used for the identification of resources.

muURI(g) = 1 if self-describing URIs are always used, 0.5 if self-describing URIs are partly used, 0 otherwise

2.4.2. Interoperability

Interoperability is another dimension of the representational data quality category and subsumes Wang et al.'s aspects interpretability, representational consistency, and concise representation.

Definition of dimension. We define interoperability along the subsumed dimensions of Wang et al.:

– Interpretability: Interpretability is "the extent to which data are in appropriate language and units and the data definitions are clear" [40, p. 32].
– Representational consistency: Representational consistency is "the extent to which data are always presented in the same format and are compatible with previous data" [40].
– Concise representation: Concise representation is "the extent to which data are compactly represented without being overwhelming" [40, p. 32].

Discussion regarding interpretability. In contrast to the dimension understandability, which focuses on the understandability of KG RDF data towards the user (data consumer), interpretability focuses on the representation from a technical perspective. An example is the evaluation whether the considered KG uses blank nodes. According to Heath et al. [24, p. 17], blank nodes should be avoided in the Linked Data context, since they complicate the integration of multiple data sources and since they cannot be linked by resources of other data sources.

Discussion regarding representational consistency. In the context of Linked Data, it is best practice to reuse existing vocabulary for the creation of own RDF data. In this way, the potential of reuse can be exploited without any preparation [24, p. 61].

Discussion regarding concise representation. Heath et al. [24, p. 17] made the observation that the RDF features (i) RDF reification, (ii) RDF collections and RDF containers, and (iii) blank nodes are not very widely used in the Linked Open Data context. Those features should be avoided according to Heath et al. in order to simplify the processing of data on the client side. Even the querying of the data via SPARQL may get complicated if reification, RDF collections, and RDF containers are used. We can clarify that reification is necessary for stating additional information on statement level. Alternatively, additional relations are necessary (e.g., dbo:populationAsOf in DBpedia) which implicitly refer to other statements (e.g., statements of dbo:populationTotal).

Definition of metric. The dimension interoperability is determined via the criteria

– Avoiding blank nodes and RDF reification
– Provisioning of several serialization formats
– Using external vocabulary
– Interoperability of proprietary vocabulary

The fulfillment degree of a KG g w.r.t. the dimension interoperability is measured by the metrics mReif, miSerial, mextVoc, and mpropVoc which are defined as follows.

Avoiding blank nodes and RDF reification Using RDF blank nodes, RDF reification, RDF containers, and RDF lists is often considered as ambivalent: On the one hand, these RDF features are not very common and they complicate the processing and querying of RDF data [27][24, p. 17]. On the other hand, they are necessary in certain situations, e.g., when statements about statements should be made. We measure the criterion by evaluating whether blank nodes and RDF reification are used.

mReif(g) = 1 if no blank nodes and no reification are used, 0.5 if either no blank nodes or no reification are used, 0 otherwise

Provisioning of several serialization formats The interpretability of the RDF data of a KG is increased if, besides the serialization standard RDF/XML, further serialization formats are supported for URI dereferencing.

miSerial(hg) = 1 if RDF/XML and further formats are supported, 0.5 if only RDF/XML is supported, 0 otherwise

Using external vocabulary Using a common vocabulary for representing and describing the KG data leads to representing resources and relations between resources in the Web of Data in a unified way. This increases the interoperability of data [27][24, p. 61] and allows an easy data integration. We measure the criterion of using an external vocabulary by setting the frequency of used external vocabulary in proportion to the number of all triples in the KG:

mextVoc(g) = Number of all triples in g that use external vocabularies / Number of all triples in g

Interoperability of proprietary vocabulary Linking on schema level means to link the proprietary vocabulary to external vocabulary. Proprietary vocabulary are classes and relations which were defined in the KG itself. The interlinking to external vocabulary guarantees a high degree of interoperability [24, p. 83]. We measure the interlinking on schema level by calculating the ratio to which classes and relations have at least one equivalency link to an external vocabulary (via owl:sameAs, owl:equivalentProperty, or owl:equivalentClass) in order to indicate the equivalence to classes and relations that exist external to the KG:

mpropVoc(g) = |{x | x ∈ Ig ∪ Pg ∪ Cg ∧ ∃(x, p, o) ∈ g : (o ∈ U ∧ o ∈ Rg^ext ∧ p ∈ Peq)}| / |Ig ∪ Pg ∪ Cg|

where Peq = {owl:sameAs, owl:equivalentProperty, owl:equivalentClass} and Rg^ext consists of all URIs in Rg which are external to the KG g, which means that hg is not responsible for resolving these URIs.

2.5. Accessibility Category

Accessibility data quality refers to aspects of how data can be accessed. This category contains the three dimensions

– accessibility,
– licensing, and
– interlinking.

Wang's dimension "access security" is considered by us as being not relevant in the Linked Open Data context, as we only take open data sources into account. In the following, we go into details of those data quality dimensions:

2.5.1. Accessibility

Definition of dimension. Wang et al.'s term accessibility results in the additional aspects availability, response time, and data request. Those single aspects are defined as follows:

1. Accessibility is "the extent to which data are available or easily and quickly retrievable" [40, p. 32].
2. Availability "of a data source is the probability that a feasible query is correctly answered in a given time range" [34]. According to Naumann [34], the availability is an important quality aspect for data sources on the Web, since in case of integrated systems (with federated queries) usually all data sources need to be available in order to execute the query. There can be different influencing factors regarding the availability of data sources, such as the day time, the worldwide distribution of servers, the planned maintenance work, and the caching of data. Linked Data sources can be available as SPARQL endpoints (for performing complex queries on the data) and via HTTP URI dereferencing, so that we need to consider both possibilities.
3. Response time characterizes the delay between the point in time when the query was submitted and the point in time when the query response is received [11, p. 21]. Note that the response time is dependent on factors such as the query, the size of the indexed data, the data structure, the used triple store, the hardware, and so on. Therefore, we do not consider the response time in our evaluations.
4. In the context of Linked Data, data requests can be made (i) on SPARQL endpoints, (ii) on RDF dumps (export files), and (iii) via dereferencing of Linked Data URIs.

We define the metric for the dimension accessibility by means of metrics for the following criteria:

– Dereferencing possibility of resources
– Availability of the KG
– Provisioning of public SPARQL endpoints
– Provisioning of an RDF export
– Support of content negotiation
– Linking HTML sites with RDF serialization
– Provisioning of metadata about a KG

The fulfillment degree of a KG g w.r.t. the dimension accessibility is measured by the metrics mDeref, mAvai, mSPARQL, mExport, mNegot, mHTML_RDF, and mMeta which are defined as follows.

Dereferencing possibility of resources One of the Linked Data principles [9] is the dereferencing possibility of resources: URIs must be resolvable via HTTP requests, and useful information in RDF should be returned thereby. We assess the dereferencing possibility of resources R by means of taking a sample of URIs: For each of the URIs, the HTTP response status code is analyzed, and it is evaluated whether useful RDF data is returned:

mDeref(hg) = |dereferencable(Rg)| / |Rg|

Availability of the KG The availability of the KG criterion indicates the uptime of the KG. It is an essential criterion in the context of Linked Data, since in case of an integrated or federated query mostly all data sources need to be available [34]. We measure the availability of a KG by monitoring the ability of dereferencing URIs over a period of time. This monitoring process can be done with the help of a monitoring tool such as Pingdom.20

mAvai(hg) = Number of successful requests / Number of all requests

Provisioning of public SPARQL endpoints SPARQL endpoints allow the user to perform complex queries (including potentially many entities and relations) on the KG. This criterion indicates whether an official SPARQL endpoint is publicly available. There might be additional restrictions of this SPARQL endpoint, such as a maximum number of requests per time slice or a maximum runtime of a query. However, we do not measure these restrictions here.

mSPARQL(hg) = 1 if a SPARQL endpoint is publicly available, 0 otherwise

Provisioning of an RDF export If there is no public SPARQL endpoint available, or the restrictions of this endpoint are so strict that the user does not use it, an RDF export data set (RDF dump) can often be downloaded. This data set can then be used to set up a local, private SPARQL endpoint. The criterion here indicates whether an RDF export data set is officially available:

mExport(hg) = 1 if an RDF export is available, 0 otherwise

Support of content negotiation Content negotiation (CN) allows the server to return RDF documents during the dereferencing of resources in the desired RDF serialization format. The HTTP protocol allows the client to specify the desired content type (e.g., RDF/XML) in the HTTP request and the server to specify the returned content type in the HTTP response header (e.g., application/rdf+xml). In this way, the desired and the provided content type are matched as far as possible. It can happen that the server does not provide the desired content type. Moreover, it may happen that the server returns an incorrect content type. This may lead to the fact that serialized RDF data is not processed further; an example is RDF data which is declared as text/plain [24, p. 73]. Hogan et al. [26] therefore propose to let KGs return the most specific content type possible. We measure the support of content negotiation by dereferencing resources with different RDF serialization formats as desired content type and by comparing the content type of the HTTP request with the content type of the HTTP response.

mNegot(hg) = 1 if CN is supported and correct content types are returned, 0.5 if CN is supported but wrong content types are returned, 0 otherwise

Linking HTML sites with RDF serialization Heath et al. [24, p. 74] suggest linking any HTML description of a resource to RDF serializations of this resource in order to make the discovery of corresponding RDF data easier (for Linked Data-aware applications). For that reason, the so-called Autodiscovery pattern can be included in the HTML header, consisting of the phrase link rel=alternate, the indication of the provided RDF content type, and a link to the RDF document.21 We measure the linking of HTML websites to RDF (resource representation) files by evaluating whether HTML websites contain a

link as described:

mHTML_RDF(hg) = 1 if link rel=alternate is used, 0 otherwise

Provisioning of metadata about a KG In the light of the Semantic Web vision, where agents select and make use of appropriate data sources on the Web, also the meta-information about KGs needs to be available in a machine-readable format. The two important mechanisms to specify metadata about KGs are (i) using semantic sitemaps and (ii) using the VoID vocabulary22 [24, p. 48]. For instance, the URI of the SPARQL endpoint can be assigned via void:sparqlEndpoint and the RDF export URL can be specified by void:dataDump. The metadata can be added as additional facts to the KG or provided as a separate VoID file. We measure the provisioning of metadata about a KG by evaluating whether machine-readable metadata about the KG are available. Note that the provisioning of licensing information in a machine-readable format (which is also a meta-information about the KG) is considered in the data quality dimension license later on.

mMeta(g) = 1 if machine-readable metadata are available, 0 otherwise

2.5.2. License

Definition of dimension. Licensing is defined as "the granting of permission for a consumer to re-use a dataset under defined conditions" [42].

Discussion. The publication of licensing information regarding the KGs is important to use the KGs without legal concerns, especially in commercial settings. Creative Commons (CC)23 published several standard licensing contracts which define rights and obligations. These contracts are also popular in the Linked Data context. The most frequent licenses for Linked Data are CC-BY, CC-BY-SA, and CC0 [28]. CC-BY24 requires specifying the source of the data; CC-BY-SA25 requires in addition that, if the data is published, it is published under the same legal conditions; CC026 defines the respective data as public domain and without any restrictions. Noteworthy is that most data sources in the Linked Open Data cloud do not provide any licensing information [28], which makes it difficult to use the data in commercial settings. Even if data is published under CC-BY or CC-BY-SA, the data is often not used, since companies refer to uncertainties regarding these contracts.

Definition of metric. The dimension license is determined by the criterion provisioning machine-readable licensing information. The fulfillment degree of a KG g w.r.t. the dimension license is measured by the metric mmacLicense which is defined as follows.

Provisioning machine-readable licensing information Licenses define the legal frameworks under which the KG data may be used. Providing machine-readable licensing information allows users and applications to be aware of the license and to use the data of the KG in accordance with the legal possibilities [27][24, pp. 52]. Licenses can be specified in RDF via relations such as cc:license27, dcterms:license, or dcterms:rights. The licensing information can be specified either in the KG as additional facts or separately in a VoID file. We measure the criterion by evaluating whether licensing information is available in a machine-readable format:

mmacLicense(g) = 1 if machine-readable licensing information is available, 0 otherwise

2.5.3. Interlinking

Definition of dimension. Interlinking is the extent "to which entities that represent the same concept are linked to each other, be it within or between two or more data sources" [42].

Discussion. According to Bizer et al. [12], DBpedia established itself as a hub in the Linked Data cloud due to its intensive interlinking with other KGs. This interlinking is usually established on the instance level via owl:sameAs links. However, according to Halpin et al. [22], those owl:sameAs links do not always interlink identical entities in reality. According to the authors, one reason might be that the KGs provide entries in different granularity: For instance, the DBpedia entry of Berlin (in the sense of the capital) links via owl:sameAs relations to three different entries in the KG GeoNames, namely (i) Berlin, the capital,28 (ii) Berlin, the state,29 and (iii) Berlin, the city.30 Moreover, owl:sameAs relations are often created automatically by some mapping function. Mapping errors, hence, lead to a precision of less than 100%.

Definition of metric. The dimension interlinking is determined by the criteria

– Interlinking on instance level
– Validity of external URIs

The fulfillment degree of a KG g w.r.t. the dimension interlinking is measured by the metrics mInst and mURIs which are defined as follows.

Interlinking via owl:sameAs The fourth Linked Data principle according to Berners-Lee is the interlinking of data resources so that the user can explore further information. According to Hogan et al. [27], the interlinking has a side effect: it does not only connect otherwise isolated KGs, but the number of incoming links of a KG also indicates the importance of the KG in the Linked Open Data cloud. We measure the interlinking on instance level by calculating the ratio to which instances have at least one owl:sameAs link to an external knowledge graph:

mInst(g) = |{x ∈ Ig | ∃(x, owl:sameAs, y) ∈ g}| / |Ig|

Validity of external URIs The considered KG may contain outgoing links referring to RDF resources or Web documents (non-RDF data). The linking to RDF resources is usually done by owl:sameAs, owl:equivalentProperty, and owl:equivalentClass relations. Web documents are linked by relations such as foaf:homepage and foaf:depiction. Linking to external resources always entails the problem that those links may become invalid over time. This can have different causes, such as that the URI is not available anymore. We measure the validity of external URIs by evaluating the URIs from a URI sample set w.r.t. whether there is a timeout, a client error (HTTP response 4xx), or a server error (HTTP response 5xx). If an RDF resource is linked, we also consider the content type in the header of a corresponding HTTP response, as owl:sameAs statements become invalid if the linked URI is not available anymore and, for instance, the server forwards to the homepage.

mURIs(g) = |{x ∈ A | resolvable(x)}| / |A|

where A = {y | ∃(x, owl:sameAs, y) ∈ g : (x ∈ Rg \ (Cg ∪ Pg) ∧ internalg(x) ∧ externalg(y))}. In case of an empty set A, the metric should evaluate to 1.

21 An example is .
22 See the namespace http://www.w3.org/TR/void.
23 See http://creativecommons.org/, requested Mar 1, 2016.
24 See https://creativecommons.org/licenses/by/4.0/, requested Mar 1, 2016.
25 See https://creativecommons.org/licenses/by-sa/4.0/, requested Mar 1, 2016.
26 See http://creativecommons.org/publicdomain/zero/1.0/, requested Mar 3, 2016.
27 Using the namespace http://creativecommons.org/ns#.
28 http://www.geonames.org/2950159/berlin.html
29 http://www.geonames.org/2950157/land-berlin.html
30 http://www.geonames.org/6547383/berlin-stadt.html

2.6. Conclusion

We provide 34 criteria classified in 11 dimensions which can be applied to assess KGs w.r.t. data quality. Those dimensions themselves are grouped into four categories.

– Intrinsic category
∗ Accuracy
∗ Syntactic Validity of RDF documents
∗ Syntactic Validity of Literals
∗ Semantic Validity of Triples
∗ Trustworthiness
∗ Trustworthiness on KG level
∗ Trustworthiness on statement level
∗ Using unknown and empty values
∗ Consistency
∗ Check of schema restrictions during insertion of new statements
∗ Consistency of statements w.r.t. class constraints
∗ Consistency of statements w.r.t. relation constraints
– Contextual category
∗ Relevancy
∗ Creating a ranking of statements
∗ Completeness
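The instance-level interlinking ratio discussed in the interlinking dimension can be sketched over toy triples; treating every URI outside the KG's own namespace as external is a simplifying assumption for illustration:

```python
# Sketch of the interlinking metric: the fraction of instances that have
# at least one owl:sameAs link to a resource outside the KG's namespace.
# The KG data and the prefix-based internal/external test are toy
# assumptions, not the paper's actual procedure.

OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

def m_inst(triples, instances, internal_prefix):
    if not instances:
        return 1.0
    linked = {
        s for (s, p, o) in triples
        if p == OWL_SAMEAS and s in instances
        and not o.startswith(internal_prefix)  # count external targets only
    }
    return len(linked) / len(instances)

triples = [
    ("http://ex.org/Berlin", OWL_SAMEAS, "http://sws.geonames.org/2950159/"),
    ("http://ex.org/Tokyo", "http://ex.org/label", "Tokyo"),
]
instances = {"http://ex.org/Berlin", "http://ex.org/Tokyo"}
print(m_inst(triples, instances, "http://ex.org/"))  # 0.5
```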

∗ Schema Completeness
∗ Column Completeness
∗ Population Completeness
∗ Timeliness
∗ Timeliness frequency of the KG
∗ Specification of the validity period of statements
∗ Specification of the modification date of statements
– Representational data quality
∗ Ease of understanding
∗ Description of resources
∗ Labels in multiple languages
∗ Understandable RDF serialization
∗ Self-describing URIs
∗ Interoperability
∗ Avoiding blank nodes and RDF reification
∗ Provisioning of several serialization formats
∗ Using external vocabulary
∗ Interoperability of proprietary vocabulary
– Accessibility category
∗ Accessibility
∗ Dereferencing possibility of resources
∗ Availability of the KG
∗ Provisioning of public SPARQL endpoints
∗ Provisioning of an RDF export
∗ Support of content negotiation
∗ Linking HTML sites with RDF serialization
∗ Provisioning of metadata about a KG
∗ License
∗ Provisioning machine-readable licensing information
∗ Interlinking
∗ Interlinking via owl:sameAs
∗ Validity of external URIs

3. Selection of KGs

We consider the following knowledge graphs for our comparative evaluation:

– DBpedia: DBpedia31 is the most popular and prominent KG in the LOD cloud [4]. The project was initiated by researchers from the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software. Since the first public release in 2007, DBpedia is updated roughly once a year.32 DBpedia is created from automatically-extracted structured information contained in Wikipedia, such as from infobox tables, categorization information, geo-coordinates, and external links. Due to its role as the hub of LOD, DBpedia contains many links to other datasets in the LOD cloud such as Freebase, OpenCyc, UMBEL,33 GeoNames,34 Musicbrainz,35 CIA World Factbook,36 DBLP,37 Project Gutenberg,38 DBtune Jamendo,39 Eurostat,40 Uniprot,41 and Bio2RDF.42 DBpedia is used extensively in the Semantic Web research community, but is also relevant in commercial settings: companies use it to organize their content, such as the BBC [30] and the New York Times [36]. The version of DBpedia we analyzed is 2015-04.
– Freebase: Freebase43 is a KG announced by Metaweb Technologies, Inc. in 2007 and acquired by Google Inc. on July 16, 2010. In contrast to DBpedia, Freebase had provided an interface that allowed end-users to contribute to the KG by editing structured data. Besides user-contributed data, Freebase integrated data from Wikipedia, NNDB,44 FMD,45 and MusicBrainz.46 Freebase uses a proprietary graph model for storing also complex statements. Freebase shut down its services on June 30, 2015. Wikimedia Deutschland and Google integrate Freebase data into Wikidata via the Primary Sources Tool.47 Further information about the migration from Freebase to Wikidata is provided in [37]. We analyzed the latest Freebase version as of March 2015.
– OpenCyc: The Cyc48 project started in 1984 as part of the Microelectronics and Computer Technology Corporation. The aim of Cyc is to store (in a machine-processable way) millions of common sense facts such as "Every tree is a plant." While the focus of Cyc in the first decades was on inferencing and reasoning, more recent work puts a focus on human interaction, such as building question answering systems based on Cyc. Since Cyc is proprietary, a smaller version of the KG called OpenCyc49 was released under the open source Apache license. In July 2006, ResearchCyc50 was published for the research community, containing more facts than OpenCyc. The version of OpenCyc we analyzed is 2012-05-10.
– Wikidata: Wikidata51 is a project of Wikimedia Deutschland which started on October 30, 2012. The aim of the project is to provide data which can be used by any Wikimedia project, including Wikipedia itself. Wikidata does not only store facts, but also the corresponding sources, so that the validity of facts can be checked. Labels, aliases, and descriptions of entities in Wikidata are provided in almost 400 languages. Wikidata is a community effort, i.e., users collaboratively add and edit information. Also, the schema is maintained and extended based on community agreements. In the near future, Wikidata will grow due to the integration of Freebase data. The version of Wikidata we analyzed is 2015-10.
– YAGO: YAGO – Yet Another Great Ontology – has been developed at the Max Planck Institute for Computer Science in Saarbrücken since 2007. YAGO comprises information extracted from Wikipedia (e.g., categories, redirects, infoboxes), WordNet [17] (e.g., synsets, hyponymy), and GeoNames.52 The version of YAGO we analyzed is YAGO3.53

31 See http://dbpedia.org
32 There is also DBpedia live, which started in 2009 and which is updated when Wikipedia is updated. See http://live.dbpedia.org/.
33 See http://umbel.org/
34 See http://www.geonames.org/
35 See http://musicbrainz.org/
36 See https://www.cia.gov/library/publications/the-world-factbook/
37 See http://www.dblp.org
38 See https://www.gutenberg.org/
39 See http://dbtune.org/jamendo/
40 See http://eurostat.linked-statistics.org/
41 See http://www.uniprot.org/
42 See http://bio2rdf.org/
43 See http://freebase.com/
44 See http://www.nndb.com
45 See http://www.fashionmodeldirectory.com/
46 See http://musicbrainz.org/
47 See https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool, requested on Apr 8, 2016.
48 See http://www.cyc.com/
49 See http://www.opencyc.org/
50 See http://research.cyc.com/
51 See http://wikidata.org/.

4. Comparison of KGs

4.1. Key Statistics

In the following, we give an overview of the statistical commonalities and differences of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. We thereby use the following key statistics:

– Number of triples
– Number of classes
– Distribution of classes w.r.t. the number of their corresponding instances
– Coverage of classes with at least one instance per class
– Covered domains w.r.t. entities
– Number of distinct predicates
– Number of instances
– Number of entities per class
– Number of distinct subjects
– Number of distinct objects

In Section 6.2, we provide an overview of related work w.r.t. those key statistics.

4.1.1. Number of Triples and Statements54

Number of triples. The number of triples (see Table 1) differs considerably between the KGs: Freebase is the largest KG with over 3.1B triples, while OpenCyc is the smallest KG with only 2.5M triples. The large size of Freebase can be traced back to the fact that many data sources were integrated into this KG.

Size differences between DBpedia and YAGO. As both DBpedia and YAGO were created by extracting semantically-structured information from Wikipedia, the significant difference between their sizes (in terms of triples) is noteworthy. YAGO integrates the statements from different language versions of Wikipedia in one single KG, while for the canonical DBpedia data set (which is used in our evaluations) solely the English Wikipedia is used. Besides that, YAGO contains contextual information and detailed provenance information. Contextual information is, for instance, the anchor texts of all links within Wikipedia where the respective entity is linked. For that, the relation hasWikipediaAnchorText (330M triples in total) is used. The provenance information of single statements is stored in a reified form. In particular, the
relations extractionSource (161.2M triples) and 52See www.geonames.org/ 53See http://www.mpi-inf.mpg.de/departments/ databases-and-information-systems/research/ 54Measured via the SPARQL query select count(*) yago-naga/yago/downloads/ where ?s ?p ?o. 18 M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO extractionTechnique (176.2M triples) are used 100 for that. Influence of reification on the number of triples. 80 As we will see in Section 4.2.8 in more detail, YAGO 60 and Wikidata use reification for data modeling. In case of YAGO, however, the number of triples is 40 due to that only increased to a very limited extend, Coverage in % since YAGO data is provided in N-Quads.55 The ad- 20 ditional column (in comparison to triples), the IDs, by which the triple become uniquely identified, are nor- mally considered as comment and therefore are not YAGO imported into the triple store. For Wikidata, things DBpediaFreebaseOpenCyc Wikidata are different: The Wikidata RDF export is stored in Fig. 1. Coverage of classes with at least one instance per class; per N-Triples format, so that reification has a great im- KG. pact on the number of instantiations of statements, and therefore, on the number of triples in total. In trast, is very small, since it is created manually, based our evaluation, we counted 74,272,190 instances of the on the mostly used infobox templates in Wikipedia. In class wdo:Statement.56 The instantiations make total, the DBpedia KG contains 444,895 classes which up about a tenth of all triples. are connected via rdfs:subClassOf to a taxon- 4.1.2. Classes and Domains omy. Those additional classes originate from the im- Classes Methods for counting classes ported YAGO categories and therefore use the names- http://dbpedia.org/class/yago/ The number of classes can be calculated in different pace . 
ways: Classes can be identified via rdfs:Class and Ratio of classes which are used in the KG Fig- owl:Class relations, or via rdfs:subClassOf ure 1 shows the coverage of classes with at least relations (plus the most general class).57 Since Free- one instance per KG. Interestingly, none of the KGs base does not provide any class hierarchy with rdfs:- achieves a very high coverage. YAGO performs best with 82.6% coverage, although it contains the highest subClassOf relations and since Wikidata does not number of classes. This can be traced back to the fact mark classes explicitly as classes, but uses instead only that YAGO classes were created by heuristics which “subclass of” (wdt:P279) relations, the method of select Wikipedia categories with multiple instances. calculating the number of classes depends on the con- OpenCyc (with 6.5%) and Wikidata (5.4%) come last sidered KG. in the ranking. Especially Wikidata contains relatively Ranking of KG w.r.t. number of classes Our eval- many classes (second most in our ranking above), out uations revealed that YAGO contains the highest num- of which only few of them are used at all. ber of classes of all considered KGs; DBpedia, in con- Figure 2 shows the distribution of classes regarding trast, has the fewest (see Table 1). the number of the corresponding instances. Note the Number of classes in YAGO and DBpedia How logarithmic scale on the axis of ordinates. We can rec- does it come to this gap between DBpedia and YAGO ognize an almost linear curve progression for all KGs w.r.t. the number of classes? The YAGO category tax- except for DBpedia. For DBpedia, the line drops at onomy, which serves as set of classes, is automati- around class 250 exponentially. This means that only cally created by the Wikipedia categories and the set a small part of the already small DBpedia ontology is of WordNet synsets. The DBpedia ontology, in con- used in reality (i.e., on the instance level). 
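The triple and class counts above can be reproduced on any of the published dumps. The following is a minimal sketch of ours, not the tooling used in the paper (which queries a triple store via SPARQL, cf. footnote 54); the N-Triples fragment and its URIs are purely illustrative:

```python
# Toy N-Triples fragment (illustrative URIs); a real run would stream the KG dump.
NTRIPLES = """\
<http://ex.org/Artist> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://ex.org/Artist> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://ex.org/Person> .
<http://ex.org/alice> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://ex.org/Artist> .
"""

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
OWL_CLASS = "<http://www.w3.org/2002/07/owl#Class>"
SUBCLASS = "<http://www.w3.org/2000/01/rdf-schema#subClassOf>"

# Naive split: sufficient for URI-only triples; literals containing spaces
# would require a real N-Triples parser.
triples = [line.rstrip(" .").split(" ", 2)
           for line in NTRIPLES.splitlines() if line.strip()]

# Number of triples, the equivalent of SELECT COUNT(*) WHERE { ?s ?p ?o }
n_triples = len(triples)

# Two of the class-counting methods of Section 4.1.2:
# (a) classes marked explicitly via owl:Class ...
explicit_classes = {s for s, p, o in triples if p == RDF_TYPE and o == OWL_CLASS}
# (b) ... or classes taking part in rdfs:subClassOf relations
hierarchy_classes = {t for s, p, o in triples if p == SUBCLASS for t in (s, o)}

print(n_triples, len(explicit_classes), len(hierarchy_classes))
```

On this toy fragment the script prints 3 1 2: method (a) finds only ex:Artist, while method (b) additionally finds ex:Person via the class hierarchy, which illustrates why the reported class counts depend on the counting method.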
Fig. 2. Distribution of classes w.r.t. the number of their corresponding instances; shown per KG.

Domains. Tartir [38] proposed to measure the covered domains by determining the class importance, i.e., the proportion of instances belonging to one or more subclasses of the respective domain in comparison to the number of all instances. In our work, however, we decided to evaluate the coverage of domains concerning the classes per KG via manual assignments of the mostly used classes to the domains people, media, organizations, geography, and biology (see our website for examples of classes per domain and per KG). This list of domains was created by aggregating the most frequent domains in Freebase.

The manual domain assignment is necessary to obtain a consistent classification of the classes into the domains across all considered KGs. In Freebase, instances can appear in various domains. For instance, an intersection of the classes /music/artist and /people/person is obvious.

We retrieve the number of unique instances per domain and per KG via SPARQL queries, given the most frequently used classes per KG and domain at hand. Figure 4 shows the number of unique entities per domain in the different KGs. As the reader can see in Figure 3, we obtained a coverage of about 80% for all KGs except Wikidata. Via these coverage scores we measure the performance of the method for selecting the classes, i.e., the reach of our evaluation. The coverage is calculated as the ratio of the number of unique entities of all considered domains of a KG divided by the number of all entities of this KG.58 If the ratio were 1.0, we could assign all instances of a KG to one of the domains.

Fig. 3. Coverage of entities: Number of unique entities of all five domains divided by the number of all entities in the KG.

Fig. 4. Number of entities per domain.

Figure 5 shows the relative coverage of each domain in each KG, i.e., the proportion of the number of instances of each domain of each KG to the total number of instances of the KG. A value of 100% means that all instances reside in one single domain. In case of Freebase, 77% of all instances are in the media domain. The statements of those instances notably originate from data imports such as from MusicBrainz.org. In DBpedia and YAGO, the domain of people is the largest domain (50% and 34%, respectively). Peculiar is also the higher coverage of YAGO regarding the geography domain compared to DBpedia. As one reason for that, we can point out the data import of GeoNames into YAGO.

Fig. 5. Relative number of entities per domain.

55 The idea of N-Quads is based on the assignment of triples to different graphs. YAGO uses N-Quads to identify statements per ID.
56 Using the namespace http://wikidata.org/ontology#.
57 The number of classes in a KG may also be calculated by taking all entity type relations (rdf:type and, in case of Wikidata, "instance of" (wdt:P31)) on the instance level into account. However, this would result only in a lower-bound estimation, as those classes which have no instances are not considered.
58 We used the number of unique entities of all domains and not the sum of the entities measured per domain, since entities may be in several domains at the same time.

4.1.3. Relations and Predicates

In this work, we differentiate between relations and predicates: Relations (interchangeably used with "properties") are (proprietary) vocabulary defined on the schema level (i.e., the T-Box). In contrast, we use predicates to denote the connections on the instance level (i.e., the A-Box), such as rdf:type, owl:sameAs, etc. It is important to distinguish the key statistics for relations and for predicates, since otherwise those relations which are not used in the KG, i.e., not used on the instance level, would remain unconsidered during the identification of the unique predicates.

Identification of Relations and Predicates. We identify the relations of each KG by means of classifying proprietary relations in the KG namespace via rdf:Property, rdfs:Property, and via OWL classes such as owl:FunctionalProperty, and in the special case of Wikidata via the class wdo:Property. The set of unique predicates per KG is determined by using the terms on the predicate position of N-Triples (and making them unique).
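The distinction between schema-level relations and actually used predicates can be sketched as follows (our own illustration with made-up URIs, not the paper's code); the set difference directly yields the relations that are never used on the instance level:

```python
# Toy N-Triples fragment (illustrative URIs); real KG dumps would be streamed.
NTRIPLES = """\
<http://ex.org/spouse> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Property> .
<http://ex.org/birthPlace> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Property> .
<http://ex.org/alice> <http://ex.org/spouse> <http://ex.org/bob> .
"""

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
RDF_PROPERTY = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#Property>"

# Naive split; sufficient here because all terms are URIs without spaces.
triples = [line.rstrip(" .").split(" ", 2)
           for line in NTRIPLES.splitlines() if line.strip()]

# Relations: resources declared as properties on the schema level (T-Box)
relations = {s for s, p, o in triples if p == RDF_TYPE and o == RDF_PROPERTY}
# Predicates: distinct terms occurring in the predicate position (A-Box usage)
predicates = {p for s, p, o in triples}
# Declared on the schema level but never used on the instance level
unused_relations = relations - predicates

print(len(relations), len(predicates), sorted(unused_relations))
```

Here, ex:birthPlace is declared as a property but never occurs in the predicate position, so it counts as an unused relation, which is the situation reported below for about 70% of the Freebase relations and 99.2% of the OpenCyc relations.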

Fig. 6. Frequency of the usage of the relations per KG, grouped by (i) zero occurrences, (ii) 1–500 occurrences, and (iii) more than 500 occurrences in the respective KG.

Key statistics regarding relations and predicates. We use the method described above for identifying all relations of the KGs on the schema level.

Ranking regarding predicates. Regarding the predicates, Freebase exhibits by far the highest number of unique predicates (784,977) among the KGs (see Table 1). OpenCyc shows only 165 unique predicates, which is the lowest value in this comparison.

Freebase. Freebase exhibits a high number of relations. This can be traced back to the integration of several databases. About a third of all relations (23,728 out of 70,902 relations) in Freebase are declared as inverses of other relations (via owl:inverseOf).59 However, as visible in Figure 6, only 3,825 (5%) relations are used more than 500 times, and about 70% are not used at all. Freebase also shows a large number of different predicates, but 95% (743,377 predicates) are used only once.

Wikidata. Wikidata provides a relatively small set of relations and predicates. Note in this context that, despite the fact that Wikidata is curated by a community (just like Freebase), Wikidata community members cannot arbitrarily insert new predicates, as was possible in Freebase. Instead, predicates first need to be proposed and then get accepted by the community only if certain criteria are met.60 One of those criteria is that the new predicate is presumably used at least 100 times. Note further that the Wikidata RDF export contains the supplementaries v, s, q, and r for the reification of statements: "s" indicates that the term in the object position is a statement, "v" refers to a value, "r" refers to a reference, and "q" to a qualifier. Relations are always used together with these supplements.

DBpedia. The set of DBpedia relations is also quite limited. The DBpedia KG contains, however, 58,776 instances of the class rdf:Property within the namespace dbp.61 Those are the non-mapping-based properties, i.e., properties originating from the unfiltered, generic InfoboxExtractor. The names of the mapping-based properties and of the non-mapping-based properties are not aligned62 and sometimes also overlap.63 The difference between the 59K relations instantiated via rdf:Property and the 60K different predicates results from using external vocabulary such as owl:sameAs.

YAGO. For YAGO, we measure 106 distinct relations in the namespace yago: and 88,736 distinct predicates. Similar to DBpedia, the large discrepancy can be explained by the fact that the infobox relations are not declared as relations on the schema level.

Relations of DBpedia vs. YAGO. Although relations are curated manually for YAGO and DBpedia, the size of the set of relations differs significantly. Hoffart et al. [25] mention the following reasons for that:

1. Relations in the DBpedia ontology are partially very special and fine-grained. For instance, there exists the relation dbo:aircraftFighter between dbo:MilitaryUnit and dbo:MeanOfTransportation. YAGO, in contrast, uses only the generic relation yago:hasCreated. We can confirm this observation insofar as about half of the relations of the DBpedia ontology are not used at all and only a quarter of the relations is used more than 500 times (see Figure 6).

59 An example are the inverse relations /music/artist/album and /music/album/artist.
60 See https://www.wikidata.org/wiki/Wikidata:Property_proposal.
61 See http://dbpedia.org/property/.
62 E.g., the DBpedia ontology contains dbo:birthName for the name of a person, while the non-mapping-based property set contains dbp:name, dbp:firstname, and dbp:alternativeNames.
63 E.g., dbp:alias and dbo:alias.

2. The DBpedia ontology allows several relations for birth dates (e.g., dbo:birthDate and dbo:birthYear), while in YAGO only the relation yago:birthOnDate is used. Incomplete date specifications (e.g., when only the year is known) are specified by wildcards ("#").

3. YAGO has no relations explicitly specified as inverse, such as between dbo:parent and dbo:child.

4. YAGO uses the SPOTL(X) format for extending the triple format with specifications for time and location. Hence, no dedicated relations such as dbo:distanceToLondon or dbo:populationAsOf are necessary.

Predicates of DBpedia vs. YAGO. YAGO also differs significantly from DBpedia in the number of predicates. YAGO contains more predicates, since infobox attributes from the various language versions of Wikipedia are aggregated into one KG.64 DBpedia, in contrast, provides localized KGs.

OpenCyc. Compared to the 18,028 defined relations of OpenCyc, we measured only 164 distinct predicates. 99.2% of the relations were never used. We assume that those relations are used just within Cyc.

4.1.4. Instances and Entities

We distinguish between entities and instances: Instances are resources having an entity type (i.e., having an rdf:type relation). Entities are those instances which are in addition represented as real-world objects. Instances of wdo:Statement etc. are therefore not regarded as entities.

Evaluation method. We identify the different instances by retrieving the subjects of all triples where rdf:type is the predicate. We identify the different entities from the set of instances which in addition belong to owl:Thing in DBpedia and YAGO, to freebase:common.topic in Freebase, and to wdo:Item in Wikidata. In OpenCyc, cyc:Individual corresponds to owl:Thing, but not all entities are classified in this way. We estimate the set of entities in OpenCyc by considering all classes which comprise more than 300 instances and which contain entities.65 In the same way, we selected the classes manually for the analysis in Section 4.1.2; we obtained 41,029 unique instances out of 242,383, which corresponds to a coverage of 17%.

64 The language of each attribute is encoded in the URI, e.g., infobox/de/fläche and infobox/en/areakm.
65 Examples for those classes are "individual" (cyc:Mx4rvVjaApwpEbGdrcN5Y29ycA), "movie" (cyc:Mx4rv973YpwpEbGdrcN5Y29ycA), and "city" (cyc:Mx4rvVjnZ5wpEbGdrcN5Y29ycA) in the namespace cyc: http://sw.opencyc.org/concept/. This selection method neglects abstract classes such as "type of objects" (cyc:Mx4rvWXYgJwpEbGdrcN5Y29ycA).

Ranking w.r.t. the number of instances. Wikidata comprises the highest number of instances, OpenCyc the fewest (see Figure 7).

Fig. 7. Number of instances per KG.

Ranking w.r.t. the number of entities. Table 1 shows the ranking of KGs regarding the number of entities. Freebase contains by far the highest number of entities (about 49.9M). OpenCyc is at the bottom with only 41,029 identified entities.

Wikidata exposes relatively many instances in comparison to the entities, since it uses reification and since it stores the Wikipedia sitelinks. Via the instantiation of statements via wdo:Statement, over 74M instances are created.

Freebase also exposes relatively many instances in comparison to the entities, since each node is represented by an M-ID and by an unchangeable GUID. In the RDF export of Freebase, about 35.8M GUID nodes are instantiated as xsd:integer.

Differences regarding entities. The reason why the KGs show quite different numbers of entities lies in the sources of the KG knowledge. We illustrate this with the music domain as an example:

1. Freebase was created mainly from data imports (e.g., imports from MusicBrainz.com). Therefore, the domain of media and especially song release tracks are covered very well: 77% of all entities are in the media domain (see Section 4.1.2), out of which 42% are release tracks (/music/release_track).

Therefore, Freebase contains albums and release tracks of both English and non-English languages. For instance, regarding the English language, the album "Thriller" from Michael Jackson and the single "Billie Jean" are there, as well as rather unknown songs from the "Thriller" album such as "The Lady in My Life". Regarding non-English languages, Freebase contains, for instance, songs and albums from Helene Fischer such as "Lass' mich in dein Leben" and "Zaubermond"; also rather unknown songs such as "Hab' den Himmel berührt" can be found.

2. In case of DBpedia, the English Wikipedia is the source of information. In the English Wikipedia, many albums and singles of English artists are covered (e.g., the album "Thriller" and the single "Billie Jean"). Rather unknown songs such as "The Lady in My Life" are not covered in Wikipedia. For many non-English artists such as the German singer Helene Fischer, no music albums and no singles are contained in the English Wikipedia. In the corresponding language version of Wikipedia (and the language-specific DBpedia version), this information is often available (e.g., the album "Zaubermond" and the song "Lass' mich in dein Leben"), but not the rather unknown songs such as "Hab' den Himmel berührt".

3. For YAGO, the same situation as for DBpedia holds, with the difference that YAGO in addition imports entities from the different language versions of Wikipedia and also imports data from sources such as GeoNames. However, the above-mentioned works of Helene Fischer are not in the YAGO KG, although the song "Lass' mich in dein Leben" has existed in the German Wikipedia since May 2014.66

4. Wikidata is supported by the community and contains music albums of English and non-English artists, even if they do not exist in Wikipedia, such as "The Lady in My Life". Note, however, that Wikidata does not provide all artists' works, such as those of Helene Fischer.

5. OpenCyc has only few entities in comparison to the other KGs, as it mainly consists of a taxonomy.

Average number of entities per class. Figure 8 shows the average number of entities per class. Obvious is the difference between DBpedia and YAGO (despite the similar number of entities): The reason for that is that the number of classes in the DBpedia ontology is small (as it is created manually) and in YAGO large (as it is created automatically).

Fig. 8. Average number of entities per class per KG.

4.1.5. Subjects and Objects

Evaluation method. We measure the number of unique subjects by counting the unique resources on the subject position of N-Triples, excluding blank nodes. Analogously, we measure the number of unique objects by counting the unique resources on the object position of N-Triples, excluding literals. We also measure the number of blank nodes and the number of literals separately.

Ranking of KGs w.r.t. unique subjects. Figure 9 shows the number of unique subjects per KG. YAGO contains the highest number of different subjects, while OpenCyc contains the fewest. Blank nodes are used in Wikidata and OpenCyc.

Fig. 9. Number of unique subjects and objects per KG.

Peculiarities. The high number of unique subjects in YAGO compared to its number of instances is surprising (see Figure 9 and Table 1). This is caused by the representation form of the provenance information: For that, IDs are used on the subject position of N-Triples and lead to 308M unique (additional) subjects.67 In the RDF export of YAGO, those IDs are commented out in order to ensure compatibility with Turtle/N-Triples.

Wikidata provides a ratio of unique subjects to instances of about 1. A reason for that might be that each subject resource gets instantiated, e.g., as entity (wdo:Item), statement (wdo:Statement), or sitelink (wdo:Article). Especially the 45M interwiki links (linking to Wikipedia etc.) contribute to the high ratio, since the target URI of the Wikimedia project is written on the subject position instead of the object position, as it is common for links like owl:sameAs.68

Ranking of KGs regarding the number of objects. Figure 9 also displays the number of unique objects per KG. Freebase shows the highest score in this regard, OpenCyc again the lowest. Noteworthy is especially the difference between DBpedia and YAGO. One reason for the higher value of DBpedia might be its considerably higher number of external links via owl:sameAs (29.0M vs. 3.8M links).

4.1.6. Summary of Key Statistics

Based on the evaluation results presented in the last subsections, we can highlight the following insights:

– Number of triples and statements: Freebase is the largest KG in terms of number of triples, OpenCyc is the smallest. As the Wikidata RDF export is stored in N-Triples format, reification has a great impact on the number of instantiations of statements and, hence, of triples (about a tenth of all triples).

– Number of classes: None of the KGs achieves a high coverage value regarding the classes (classes with at least one instance in the KG). YAGO obtains the best coverage, despite its high number of classes. This can be traced back to the heuristics used for selecting the classes. Wikidata contains relatively many classes, but only few of them are used at all. Also regarding DBpedia, only a small part of the already small, manually created DBpedia ontology is actually instantiated.

– Coverage of domains: In DBpedia and YAGO, the domain people is the largest (50% and 34%, respectively). YAGO shows a high coverage regarding the geography domain due to imports such as GeoNames. 77% of all instances in Freebase are in the media domain due to imports such as MusicBrainz.

– Relations and Predicates: Many relations and predicates are rarely used in the respective KGs: Only 5% of the Freebase relations are used more than 500 times and about 70% are not used at all; 95% of the predicates are used only once. In DBpedia, half of the relations of the DBpedia ontology are not used at all and only a quarter of the relations is used more than 500 times. For OpenCyc, 99.2% of the relations are not used. We assume that they are used within Cyc only.

– Instances and Entities: Freebase contains by far the highest number of entities and also covers well those entities which are primarily known only in non-English speaking countries. Wikidata exposes relatively many instances in comparison to the entities, since it uses reification and since it stores the Wikipedia sitelinks. YAGO provides a high number of unique subjects, as it contains provenance information. DBpedia has a high number of unique objects, likely due to its high number of external links such as owl:sameAs.

Table 1
Summary of key statistics

                                        DBpedia      Freebase       OpenCyc    Wikidata     YAGO
Number of triples                       411,885,960  3,124,791,156  2,412,520  748,530,833  1,001,461,792
Number of classes                       736          53,092         116,822    302,280      569,751
Number of relations                     58,776       70,902         18,028     1,874        106
Unique predicates                       60,231       784,977        165        4,839        88,736
Number of entities                      4,298,433    49,947,799     41,029     18,697,897   5,130,031
Number of instances                     20,764,283   115,880,761    242,383    142,213,806  12,291,250
Avg. number of entities per class       5,840.3      940.8          0.35       61.9         9
Unique subjects without blank nodes     31,391,413   125,144,313    239,264    142,213,806  331,806,927
Number of blank nodes                   0            0              21,833     64,348       0
Unique non-literals in object position  83,284,634   189,466,866    423,432    101,745,685  17,438,196
Unique literals in object position      161,398,382  1,782,723,759  1,081,818  308,144,682  682,313,508

66 YAGO3 is based on the Wikipedia dump of 2014-06-26; see http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/archive/.
67 For the provenance information, the relations yago:extractionSource and yago:extractionTechnique are used, leading to 308,619,422 unique subjects such as yago:id_6jg5ow_115_lm6jdp.
68 An example triple is https://en.wikipedia.org/wiki/Hapiness schema:about wdt:Q8.

4.2. Data Quality Analysis

We now present the results obtained by applying the DQ metrics (as introduced in Section 2.1) to the KGs.

4.2.1. Accuracy

Syntactic validity of RDF documents. Evaluation method. For evaluating the syntactic validity of RDF documents, we dereference the resource "Hamburg" (as a resource sample) in each KG. In the case of DBpedia, YAGO, Wikidata, and OpenCyc, RDF/XML serializations of the resource are available, which can be validated by the official W3C RDF
Table 2 resentative relations, so that a meaningful coverage is Syntactic validity of RDF documents guaranteed. DB FB OC WD YA The KG OpenCyc is not taken into account for this criterion. Although OpenCyc comprises 1,081,818 lit- msynRDF 1 1 1 1 1 erals in total, these literals are essentially labels and de- msynLit 0.99 1 1 0.62 1 scriptions (given via rdfs:label and rdfs:com- msemT riple 1 1 1 1 1 ment) and are not bounded to special formats. Hence, OpenCyc has no syntactic invalid literals and is as- idator.69 Freebase only provides a Turtle serialization. signed the metric value 1. We evaluate the syntactic validity of this Turtle docu- As far as a literal with data type is given, its syntax is ment by verifying if the document can be loaded into verified with the help of the function RDFDatatype an RDF model of the Apache Jena Framework.70 .isValid(String) of the Apache Jena frame- Result. All considered KGs provide the syntac- work. Thereby, standard data types such as xsd:date tic validity of RDF documents. However, in case of can be validated easily. The flexible validation method YAGO and Wikidata, the RDF Validator declares the allows also the validation of literals with different data used language codes as invalid, since the validator types for a relation. In DBpedia, for instance, data evaluates language codes in accordance with ISO-639. for the relation dbo:birthdate is stored both as The criticized language codes are, in contrast, con- xsd:gYear and as xsd:date. If the literal has no tained in the newer standard ISO 639-3, so that they type or if it is of type xsd:String, the literal is are actually valid. evaluated by a regular expression, which was created manually (see below, depending on the relation con- Syntactic validity of literals Evaluation method. We evaluate the syntactic validity of literals by means of sidered). 
For each of the three relations selected for the the relations date of birth, number of inhabitants, and evaluation, we created a sample of 1M literal values International Standard Book Number (ISBN), as they per KG, as long as the respective KG contains so many cover different domains – namely, people, cities, and literals. books – and as they can be found in all KGs. In gen- Evaluation results. eral, domain knowledge is needed for selecting rep- Date of Birth For Wikidata, DBpedia, and Freebase, all verified literal values (1M per KG) were syntac- tically correct. Surprisingly, the Jena Framework as- 69 https://w3.org/RDF/Validator/ See , requested on sessed data values with a negative year (i.e., B.C.; e.g., Mar 2, 2016. 70See https://jena.apache.org/, requested Mar 2, “-600” for xsd:gYear) as invalid, despite the correct 2016. syntax. 26 M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

For YAGO, we detected 519,049 syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. For instance, the birth date of yago:Socrates is specified as "470-##-##", which does not correspond to the correct syntax of xsd:date. Obviously, the syntactic invalidity of literals is accepted by the YAGO KG publishers in order to keep the number of relations low.71

Number of inhabitants. For DBpedia, YAGO, and Wikidata, we evaluated the syntactic validity of the number of inhabitants by means of the data types xsd:nonNegativeInteger, xsd:decimal, and xsd:integer of the typed literals. In Freebase, no data type is specified. Therefore, we evaluated the values by means of a regular expression which allows only the decimals 0-9, period, and comma. The values for the number of inhabitants were valid in all KGs.

ISBN. The ISBN is an identifier for books and magazines. The identifier can occur in various formats: with or without preceding "ISBN", with or without delimiters, and with 10 or 13 digits. Gupta72 released a regular expression for validating ISBNs in their different forms, which we use in our experiments.

In Freebase, 698,736 ISBN numbers are available. Out of them, 38 were assessed as syntactically incorrect. Typical mistakes were too long numbers73 and wrong prefixes.74 In case of Wikidata, 18 of the 11,388 ISBN numbers were syntactically invalid. However, some invalid numbers have already been corrected since the time of evaluation. This indicates that the Wikidata community does not only care about inserting new data, but also about increasing the quality of the given KG data. In case of YAGO, we could only find 400 triples with the relation yago:hasISBN. Seven of the literals in the object position were syntactically incorrect. For DBpedia, we evaluated 24,184 literals. 7,419 of them were assessed as syntactically incorrect. In many cases of the incorrect literals, comments next to the ISBN numbers in the info-boxes of Wikipedia led to inaccurate extraction of the data, so that the comments are either extracted as additional ISBN facts75 or together with the actual ISBN numbers as coherent strings.76

Semantic validity of triples. Evaluation method. Evaluating the semantic validity is hard, even if a random sample set is evaluated manually, e.g., via crowd-sourcing [2]. Kontokostas et al. [31] propose a test-driven evaluation of Linked Data quality where test cases are automatically instantiated based on schema constraints or semi-automatically enriched schemata. The authors propose a test-case generator for measuring, among other things, the ratio of valid domains and ranges of relations. For instance, an interval specifies the valid height of a person, and all triples which lie outside of this interval are evaluated manually. Beyond that, literals such as ISBN numbers can be evaluated w.r.t. their semantic validity automatically by means of their check sum. This method requires the considered literals to be consistently formatted.

In this work, we use a semi-automatic method: First, rules for evaluating the semantic validity of the person names, the birth dates, and the number of inhabitants are created. Literals which do not match the rules are then evaluated manually.

Person names. Evaluation method. We evaluated the semantic validity of person names77 by means of a sample of 100,000 literals. The literals were selected in DBpedia based on foaf:name, and in YAGO based on yago:hasFamilyName for the family name and yago:hasGivenName for the given name. For Freebase, Wikidata, and OpenCyc, rdfs:label was used in combination with the constraint that the resource is an instance of a person class.78 Literals with more than two characters were assessed as valid; literals with two or fewer characters were evaluated manually.

Evaluation result. All considered KGs scored well in this evaluation. Out of the 100,000 evaluated person names from DBpedia, 73 literals had two or fewer characters. However, 64 of them consist of Chinese characters and one literal is an alias.79 In total, 8 invalid literals were identified.80 The corresponding errors can be traced back to errors in the DBpedia extraction framework. For YAGO, we found 158 literals with two or fewer characters. Our manual investigation revealed, however, that all of them are correct names. The Wikidata sample contained 110 short literals. They consist of non-Arabic letters and the names are presumably correct. For Freebase, 241 literals were marked as invalid. 214 of them were written in non-Latin letters. The remaining literals are often stage names such as "MB" (m/0x90hz6).

Dates of birth. For evaluating the dates of birth, we reused the dates-of-birth sample from the syntactic validity evaluation. Here, we achieved similar results to those of the evaluation of the person names. Interestingly, we found 36 wrong dates in YAGO and 30 wrong dates in Wikidata, such as February 31.

Number of inhabitants. Getting to know the actual number of inhabitants, and thereby semantically validating the number of inhabitants, is very hard, since different sources may state different values. We therefore used a valid domain range for assessing the number of inhabitants. In our evaluation, all literal values passed the validity check here.81

Conclusions. Since we measured low numbers of semantically invalid literals in the relatively large sample sets (100,000 values for each KG and attribute), the metric function returns 1 for all KGs (Table 2). This corresponds to the highest fulfillment score. During our evaluation, we identified several noteworthy errors (which could be used for improving the extraction systems; see the detailed analysis in our wiki). However, a qualitatively better evaluation is only achievable by a manual evaluation – as it was performed for YAGO2 by the YAGO developer team. In this evaluation, assessing 4,412 statements manually resulted in an accuracy of 98.1% (with weighted averaging: 95%).82 The measured accuracy values were generalized per relation and are stored in the KG as additional facts via the relation yago:hasConfidence.

Table 3
Measured values for the dimension Trustworthiness.

         DB   FB   OC    WD    YA
mgraph  0.5  0.5    1  0.75  0.25
mfact   0.5    1    0     1     1
mNoVal    0    1    0     1     0

4.2.2. Trustworthiness

The values of the metrics are shown in Table 3.

Trustworthiness on KG level. Regarding the trustworthiness of a KG in general, we can differentiate between the method of how new data is inserted into the KG and the method of how existing data is curated. The KGs differ considerably w.r.t. these dimensions. Cyc is edited (expanded and modified) exclusively by a dedicated expert group. The free version, OpenCyc, is derived from Cyc, and only the data of a local mirror can be modified by the data consumer. Wikidata is also curated and expanded manually, but by volunteers of the Wikidata community. Wikidata allows importing data from external sources such as Freebase.83 However, new data is not just inserted, but needs to be approved by the community. Freebase was also curated by a community of volunteers. In contrast to Wikidata, the proportion of data imported automatically is considerably higher and new data does not need community approval. The knowledge of both DBpedia and YAGO is extracted from Wikipedia, but DBpedia differs from YAGO w.r.t. the community involvement: Any user can engage in the mapping of the Wikipedia infobox templates to the DBpedia ontology in the mapping wiki of DBpedia84 and in the development of the DBpedia extraction framework.

In total, OpenCyc obtains the highest value here, followed by Wikidata.

71 In order to model the dates to the extent they are known, further relations would be necessary, such as using wasBornOnYear with range xsd:gYear and wasBornOnYearMonth with range xsd:gYearMonth, and using them only if the whole date is not known.
72 See http://howtodoinjava.com/regex/java-regex-validate-international-standard-book-number-isbns/, requested on Mar 1, 2016.
73 E.g., the 16-digit ISBN 9789780307986931 (m/0pkny27).
74 E.g., prefix 294 instead of 978 regarding 2940045143431.
75 An example is dbr:Prince_Caspian.
76 An example is "ISBN 0755111974 (hardcover edition)" for dbr:My_Family_and_Other_Animals.
77 We used person names instead of ISBN numbers, since we wanted to check the semantic validity via a checksum automatically. However, this requires that the literal values to be checked have a common format. This is not given for ISBN numbers in the considered KGs, as our analyses have shown.
78 I.e., of type freebase:people.person in case of Freebase and of type "human" (wdt:Q5) in case of Wikidata.

79 "Mo" was used for dbr:Maurice_Smith_(kickboxer).
80 Among them, "or" for dbr:Yunreng, "*" for dbr:Carmine_Falcone, and "()" for dbr:Blagovest_Sendov.
81 In the sample, there were also cities with no inhabitants. However, those cities (such as "Miklarji" (m/0bbx9zd)) indeed do not have inhabitants.
82 See http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/statistics/, requested on Mar 3, 2016.
83 Note that imports from Freebase require the approval of the community (see https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool). Besides that, there are bots which import automatically (see https://www.wikidata.org/wiki/Wikidata:Bots/de).
84 See http://mappings.dbpedia.org/, requested on Mar 3, 2016.
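The checksum-based semantic validation of ISBNs mentioned in the evaluation-method discussion above can be sketched as follows, assuming the literals have already been normalized to bare digit strings (which, as noted, is not given in the considered KGs):

```python
def isbn10_valid(isbn: str) -> bool:
    """ISBN-10 check: the weighted sum 10*d1 + 9*d2 + ... + 1*d10 must be
    divisible by 11. The last character may be 'X', standing for 10."""
    if len(isbn) != 10:
        return False
    total = 0
    for i, ch in enumerate(isbn):
        if ch == "X" and i == 9:
            digit = 10
        elif ch.isdigit():
            digit = int(ch)
        else:
            return False
        total += (10 - i) * digit
    return total % 11 == 0

def isbn13_valid(isbn: str) -> bool:
    """ISBN-13 check: digits weighted alternately 1 and 3 must sum to a
    multiple of 10."""
    if len(isbn) != 13 or not isbn.isdigit():
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(isbn))
    return total % 10 == 0
```

A 16-digit literal such as the Freebase error 9789780307986931 fails immediately on the length check, before any checksum arithmetic.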

Trustworthiness on statement level. We defined the metric for trustworthiness on statement level via evaluating whether provenance information is stored for statements in the KG. The picture is mixed: DBpedia uses the relation prov:wasDerivedFrom from the W3C PROV-O ontology to represent the sources of the entities and their statements. However, as the sources are Wikipedia articles in all cases (e.g., http://en.wikipedia.org/wiki/Hamburg for dbr:Hamburg) and since all DBpedia entities have a corresponding Wikipedia page, this provenance information is trivial and the fulfillment degree is, hence, of a rather formal nature. Furthermore, in the DBpedia ontology, prov:wasDerivedFrom and dcterms:source triples are used to connect the ontology with the DBpedia mapping wiki.85

Freebase uses a proprietary vocabulary for representing provenance: via Compound Value Types (CVTs), relations of higher degree can be expressed [37]. For the relation /location/statistical_region.population, for instance, a concrete value and a corresponding source (/measurement_unit/dated_integer/source) can be stored via an intermediate node.

YAGO uses its own vocabulary to indicate the source of information. Interestingly, YAGO stores per statement both the source (via yago:extractionSource; e.g., the Wikipedia article) and the extraction technique (via yago:extractionTechnique; e.g., "Infobox Extractor" or "CategoryMapper"). The number of indications about sources is 161M and, hence, many times over the number of instances in the KG. The reason for that is that in YAGO the source is stored per statement.

In Wikidata, several relations can be used for referring to sources, such as wdt:P143 ("imported from"), wdt:P248 ("stated in"), and wdt:P854 ("reference URL").86 Note that "imported from" relations are used for automatic imports, but that statements with such a reference are not regarded as sourced ("data is not sourced").87 To source data, the other relations, "stated in" and "reference URL", can be used. The number of all stored references in Wikidata88 is 971,915. Based on the number of all statements,89 74,272,190, this corresponds to a coverage of 1.3%. Note, however, that not every statement in Wikidata requires a reference according to the Wikidata guidelines. In order to be able to state how many references are de facto missing, a manual evaluation would be necessary. However, such an evaluation would presumably be highly subjective.

OpenCyc differs from the other KGs in that it uses neither an external vocabulary nor a proprietary vocabulary for storing provenance information.

Indicating unknown and empty values. This criterion highlights the subtle data models of Wikidata and Freebase in comparison to the data models of the other KGs: Wikidata and Freebase allow for storing unknown values and empty values (e.g., that "Elizabeth I of England" (wdt:Q7207) had no children). However, in the Wikidata RDF export such statements are only indirectly available, since they are represented via blank nodes and via the relation owl:someValuesFrom.

In YAGO, non-exact dates are representable via wildcards (e.g., "1940-##-##", if only the year is known). Note, however, the invalidity of such strings as date literals (see Section 4.2.1). Unknown dates are not supported by YAGO.

Table 4
Check of consistencies.

              DB    FB   OC   WD    YA
mcheckRestr    0     1    0    1     0
mconClass   0.88     1   <1    1  0.33
mconRelat   0.99  0.45    1    0  0.99

4.2.3. Consistency

Check of schema restrictions during insertion of new statements. The values of the metric mcheckRestr regarding the restrictions during the insertion of new statements are shown in Table 4. The Web interfaces of Freebase and Wikidata verify during the insertion of new statements by the user whether the input is compatible with the respective data type. For instance, for the relation "date of birth" (wdt:P569), Wikidata expects the input of a date in a syntactically valid form.
DBpedia, OpenCyc, and YAGO have no schema restrictions during insertion.

Consistency of statements w.r.t. class constraints. Evaluation method. For evaluating the consistency of class constraints, we used the relation owl:disjointWith, since this is the only relation which is used by more than half of the considered KGs. We only considered direct instantiations here: If there is, for instance, the triple (dbo:Agent, owl:disjointWith, dbo:Place), then there should be no instances ?s with ?s rdf:type dbo:Agent and simultaneously ?s rdf:type dbo:Place.

Evaluation results. The scores are shown in Table 4. Freebase and Wikidata do not specify any constraints with owl:disjointWith. Hence, those two KGs have no inconsistencies w.r.t. class restrictions, so that we assign the metric value 1 here. In case of OpenCyc, 5 out of the 27,112 class restrictions are inconsistent. DBpedia contains 24 class constraints. Three out of them are inconsistent. For instance, over 1,200 instances exist which are both a dbo:Agent and a dbo:Place. YAGO contains 42 constraints, dedicated mainly to WordNet classes.

Consistency of statements w.r.t. relation constraints. Evaluation method. Here we used the relations rdfs:range and owl:FunctionalProperty, as they are used in more than every second considered KG. We only consider datatype properties, since consistencies regarding object properties would require distinguishing between the Open World assumption and the Closed World assumption. In the following, we consider the fulfillment degree for the two mentioned relations separately:

Range. Wikidata does not use any rdfs:range restrictions. Within the Wikidata data model, there is wdo:propertyType, but this does not indicate the exact allowed data type of a relation (e.g., wdo:propertyTypeTime can represent a year or an exact date). On the talk pages of the relations on https://wikidata.org, users can indicate the allowed entity types via "One of" statements. However, this is not part of the Wikidata data model.

DBpedia obtains the highest measured value w.r.t. rdfs:range. Most inconsistencies here are due to inconsistent data types. For instance, the relation dbo:birthDate requires the data type xsd:date. In about 20.3% of those relations, the data type xsd:gYear is used, though.

Also YAGO, Freebase, and OpenCyc mainly contain inconsistencies because they use data types on the schema level which are not used consistently on the instance level. For instance, YAGO specifies the data types yago:yagoURL and yago:yagoISBN in its schema. On the instance level, however, either no data type is used or the unspecific data type xsd:string.

FunctionalProperty. If a relation is instantiated as owl:FunctionalProperty, this relation should be used at most once per resource. The restriction via owl:FunctionalProperty is used by all KGs except Wikidata. On the talk pages about the relations on the Wikidata online platform, users can specify this kind of cardinality restriction via setting the relation to "single"; however, this is not part of the Wikidata data model. The other KGs mostly comply with the usage restrictions of owl:FunctionalProperty. Noteworthy is that in Freebase 99.9% of the inconsistencies obtained here are caused by the usage of the relations freebase:type.object.name and freebase:common.notable_for.display_name.

Table 5
Ranking of statements.

          DB   FB   OC   WD   YA
mRanking   0    0    0    1    0

4.2.4. Relevancy

Creating a ranking of statements. Only Wikidata supports the creation of a ranking of statements (see Table 5): Each statement is ranked and can either reach the value "preferred rank" (wdo:PreferredRank), "normal" (wdo:NormalRank), or "deprecated" (wdo:DeprecatedRank). The preferred rank corresponds to the up-to-date value or the consensus of the Wikidata community w.r.t. this statement. Freebase does not provide any ranking of statements, entities, or relations. However, the meanwhile shut-down Freebase Search API provided a ranking for resources.90

4.2.5. Completeness

We evaluated the completeness by means of a created gold standard, which is available online.91 The gold standard comprises 41 classes and 22 relations and is based on the domains people, media, organizations, geography, and biology, as introduced in Section 4.1.2. The classes are geared to corresponding WordNet synsets.

Schema completeness. The criterion schema completeness focuses on the completeness of the KG schema, i.e., w.r.t. its classes and relations [35].

85 See http://mappings.dbpedia.org, requested Mar 3, 2016.
86 All source relations are instances of Q18608359.
87 See https://www.wikidata.org/wiki/Property:P143, requested Mar 3, 2016.
88 ?s a wdo:Reference
89 ?s a wdo:Statement
90 See https://developers.google.com/freebase/v1/search-cookbook#scoring-and-ranking, requested Mar 4, 2016.
91 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison/.
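The direct-instantiation check for owl:disjointWith described in Section 4.2.3 above can be sketched with toy data; the entity dbr:SomeVenue is invented for illustration and does not come from the evaluated KGs:

```python
def disjointness_violations(types_of, disjoint_pairs):
    """Return resources that are direct instances of two classes
    declared disjoint via owl:disjointWith."""
    violations = set()
    for resource, classes in types_of.items():
        for a, b in disjoint_pairs:
            if a in classes and b in classes:
                violations.add(resource)
    return violations

# Toy rdf:type assertions mirroring the dbo:Agent / dbo:Place example.
types_of = {
    "dbr:Hamburg": {"dbo:Place"},
    "dbr:SomeVenue": {"dbo:Agent", "dbo:Place"},  # hypothetical inconsistent entity
}
disjoint_pairs = {("dbo:Agent", "dbo:Place")}
```

Running the check on the toy data flags only the resource that instantiates both disjoint classes; indirect instantiation via subclass reasoning is deliberately out of scope, as in the evaluation.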

Table 6
Results from the completeness evaluation.

                DB    FB    OC    WD    YA
mcSchema      0.91  0.76  0.92     1  0.95
mcColumn      0.40  0.43     0  0.29  0.33
mcPop         0.93  0.94  0.48  0.99  0.89
mcPop (short)    1     1  0.82     1  0.90
mcPop (long)  0.86  0.88  0.14  0.98  0.88

Evaluation method. We evaluate the schema completeness by means of the previously introduced gold standard.

Evaluation results. The evaluation results are shown in Table 6 and are discussed in the following:

DBpedia: 1. Classes: The DBpedia ontology was created manually and covers all domains well. However, it is incomplete in the details and therefore appears superficial. For instance, within the domain of plants, the DBpedia ontology does not use the class "tree", but the class "ginkgo", which is a subclass of trees. As a reason for such gaps and incorrect modeling, we can mention the fact that the ontology is created by means of the most frequently used infobox templates in Wikipedia.

2. Relations: Relations are considerably well covered in the DBpedia ontology, as our evaluation shows (coverage of 0.91). Some missing relations or modeling failures are due to the Wikipedia info-box characteristics. For example, to represent the sex of a person, the existing relation foaf:gender seems to fit. However, it is only modeled in the ontology as belonging to the class language and not used on the instance level. The sex of a person is often not explicitly mentioned in the Wikipedia info-boxes, but implicitly mentioned in the category names (e.g., "American male singers"). While DBpedia does not exploit this knowledge, YAGO uses it and provides statements with the relation yago:hasGender.

YAGO: 1. Classes: To create the set of classes, Wikipedia categories are extracted and connected to WordNet synsets. The schema completeness is already covered by the WordNet classes.

2. Relations: The YAGO schema does not contain many distinct, but rather abstract relations, which can be understood in different senses. The abstract relation names often make it difficult to infer the meaning; often the meaning of relations is given only after considering the corresponding classes (domain and range; e.g., yago:created in the sense of "the director of the movie"). The relation yago:wasCreatedOnDate can be used reasonably both for the foundation year of a company and for the publication date of a movie. We can state that expanding the YAGO schema by further relations appears reasonable.

Freebase: 1. Classes: Freebase lacks a class hierarchy, and subclasses of classes are often in different namespaces (e.g., the class people /people/person has the subclasses artists /music/artist and sportsmen /sports/pro_athlete), which makes it difficult to find suitable subclasses and super classes. Noticeable is also that the biology domain contains no subclasses. This is due to the fact that families are represented as entities (e.g., tree /m/07j7r or ginkgo m/0htd3). The ginkgo tree is then not classified as tree, but by a generic class /biology/organism_classification.

2. Relations: According to our gold standard, Freebase is "relation-complete". Since a given entity can be described by relations from different namespaces, many relations are generally available.

Wikidata: Wikidata is complete both w.r.t. classes and relations. Besides frequently used generic classes such as "human" (wdt:Q5) for people, also special classes exist. Interesting is also that Wikidata covers all relations of the gold standard, even though it contains considerably fewer relations (1,874 vs. 70,802). The Wikidata methodology to let users propose new relations, to discuss their coverage and reach, and finally to approve or disapprove the relations, seems to be appropriate.

OpenCyc: The ontology of OpenCyc was created manually and covers both generic and specific classes such as social groups and "LandTopographicalFeature". We can measure that OpenCyc is complete w.r.t. the classes. Regarding the relations, OpenCyc lacks some relations of the gold standard, such as the number of pages or the ISBN of books.

Column completeness. Evaluation method. During the creation of the gold standard, we ensured that we select only those relations for which a value typically exists (e.g., there is no death date for living people). We measure the column completeness by calculating the mean of all found class-relation combinations which can occur on the instance level based on the schema information. In total, we created 25 combinations based on the gold standard.

Results. Table 6 shows the results of our evaluation. DBpedia and Freebase score well. Despite the high

number of 3M represented people, Wikidata achieves a high coverage of birth dates (70.3%) and of the sex (94.1%). YAGO obtains a coverage of 63.5% for the sex. DBpedia, in contrast, does not contain this relation in its ontology. If we consider the DBpedia relation dbp:gender, which originates from the generic info-box extractor, this leads to a coverage of only 0.25% (5,434 people). We can note, hence, that the extraction of data out of the Wikipedia categories would be a further fruitful data source for DBpedia. Freebase surprisingly shows a high coverage (92.7%) of the authors of modeled books, given the basic population of 1.7M books. Note, however, that not only single books are modeled under /book/book, but also other entries, such as a description of the Lord of the Rings (m/07bz5). ISBNs are covered in 63.4% of the cases. OpenCyc breaks ranks, as it contains mainly taxonomic knowledge, so that no values for the considered relations are stored in the KG.

Table 7
Population completeness regarding the different domains.

Domains          DB    FB    OC    WD    YA
People            1     1  0.45     1     1
Media          0.75  0.80  0.40  0.95  0.85
Organizations     1     1  0.35     1     1
Geography         1     1  0.80     1     1
Biology        0.90  0.90  0.40     1  0.60

Population completeness. Evaluation method. For evaluating the population completeness, we reuse the gold standard introduced above. For each domain, five classes were selected, and for each class two well-known entities (called "short head") and two rather unknown entities (called "long tail") were chosen based on the selection criterion described in the following. Hereby, only entities were considered which already existed in 2010, so that this criterion is measured independently from the timeliness.

The well-known entities for the different domains were chosen "world-wide" and without temporal restrictions. To take the most popular entities per domain, we used quantitative statements. For instance, to select well-known sportsmen, we ranked them by the number of won Olympic medals; to select the most popular mountains, we ranked the mountains by their heights. To select the rather unknown entities for the domains, we considered entities in the context of both Germany and a specific year. For instance, regarding the sportsmen, we selected German sportsmen active in the year 2010, such as Maria Höfl-Riesch. The selection of rather unknown entities in the domain of biology is based on the IUCN Red List of Threatened Species.92,93

Since in our gold standard each of the five domains contains five classes, and since for each class two well-known and two rather unknown entities were selected, 100 entities were evaluated in total (see our website).

Evaluation results. The results of our evaluation are shown in Table 6. DBpedia, Freebase, and Wikidata are complete w.r.t. well-known entities. YAGO lacks some well-known entities of our gold standard, although some of them are represented in Wikipedia. One reason for that fact is that those Wikipedia entities are not imported into YAGO for which a WordNet class exists. For instance, there is no "Great White Shark" entity, only a WordNet class yago:wordnet_great_white_shark_101484850.

Not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities. The reason for that is that general knowledge, which the considered KGs want to capture, mainly covers well-known entities.

Noteworthy in particular is the high population completeness degree of Wikidata also for long-tail entities. This is a result of the central storage of interwiki links between different Wikimedia projects (especially between the different Wikipedia language versions) in Wikidata: A Wikidata entry is automatically added to Wikidata as soon as a new entity is added in one of the many Wikipedia language versions. Note, however, that in this way often English labels for the entities are missing.

Table 8
Timeliness of KGs.

           DB   FB    OC   WD    YA
mFreq     0.5    0  0.25    1  0.25
mValidity   0    1     0    1     1
mChange     0    1     0    0     0

92 See http://www.iucnredlist.org, requested Apr 2, 2016.
93 Note that selecting entities by their importance or popularity is hard in general, and that also other popularity measures, such as PageRank scores, may be taken into account.
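The column- and population-completeness computations described above reduce to simple ratio calculations. The sketch below uses made-up counts and placeholder entities, not the actual gold standard:

```python
def column_completeness(combinations):
    """Mean, over class-relation combinations, of the fraction of
    instances that actually carry a value; each pair is
    (instances_with_value, total_instances)."""
    ratios = [filled / total for filled, total in combinations if total > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0

def population_completeness(gold_entities, kg_entities):
    """Fraction of gold-standard entities contained in the KG."""
    return sum(e in kg_entities for e in gold_entities) / len(gold_entities)

# Made-up example: two class-relation combinations, and a short-head /
# long-tail split as in the evaluation above (placeholder entity names).
short_head = ["Mount Everest", "Michael Jackson"]
long_tail = ["Maria Höfl-Riesch", "Miklarji"]
kg = {"Mount Everest", "Michael Jackson", "Maria Höfl-Riesch"}
```

With these toy inputs, the short-head coverage is 1.0 and the long-tail coverage is 0.5, mirroring the general finding above that well-known entities are covered better than rather unknown ones.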

4.2.6. Timeliness

Timeliness frequency of the KG. Wikidata provides the highest fulfillment score for this criterion. Modifications in Wikidata are immediately visible via the browser and via HTTP URI dereferencing. Hence, Wikidata falls into the category of continuous timeliness. Besides that, an RDF export is provided on a roughly monthly basis. The DBpedia KG is created about once a year and is not modified in the meantime. In the last 36 months (as of February 2016), three DBpedia versions have been published (April 2015, September 2014, and September 2013).94 Besides the static DBpedia, DBpedia live95 is continuously updated by tracking changes in Wikipedia in real-time.96 OpenCyc and YAGO have been updated less than once per year. The last OpenCyc version dates from May 2012. YAGO3 was published in 2015, YAGO2 in 2011, and the interim version YAGO2s in 2013. Both KGs are being developed further, but no exact dates of the next versions are known. Freebase had been updated continuously until its close-down in March 2015.97

Specification of the validity period of statements. In YAGO, Freebase, and Wikidata, the temporal validity period of statements (e.g., the term in office of a president) can be specified (see the values for mValidity in Table 8). In YAGO, this is done via the relations yago:occursSince, yago:occursUntil, and yago:occursOnDate; in Wikidata, via the relations "start time" (wdt:P580) and "end time" (wdt:P582). In Freebase, Compound Value Types (CVTs) are used to represent higher-degree relations [37] such as /government/government_position_held.

Specification of the modification date of statements. Freebase keeps the modification dates of statements in the KG: Via the relation freebase:freebase/valuenotation/is_reviewed, the date of the last review of facts can be represented. Noteworthy is that in the DBpedia ontology the attribute dcterms:modified is used to state the date of the last revision of the DBpedia ontology. In case of Wikidata, HTTP URI dereferencing returns the latest modification date of a resource (and not of a statement) via schema:dateModified. Also, the date of the last verification can be returned.

Table 9
Comprehensibility of KGs.

        DB    FB   OC   WD   YA
mDesc 0.70  0.97    1   <1    1
mLang    1     1    0    1    1
muSer    1     1    0    1    1
muURI    1   0.5    1    0    1

4.2.7. Ease of Understanding

Description of resources. Evaluation method. We measured the extent to which resources are described per KG by submitting SPARQL requests. Regarding the labels, we considered rdfs:label for all KGs. Regarding the descriptions, the corresponding relations differ from KG to KG: DBpedia, for instance, uses rdfs:comment and dcelements:description, while Freebase uses freebase:common.topic.description.98

Evaluation result. For all KGs, the rule applies that in case there is no label available, usually there is also no description available. The current metric could therefore (without significant restrictions) be applied to rdfs:label occurrences only.

YAGO, Wikidata, and OpenCyc contain a label for almost every entity. In Wikidata, the entities without any label are of an experimental nature and are most likely not used.99

Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Our manual investigations suggest that statements of higher degree are modeled by means of intermediate nodes which have no labels, so that reification is not necessary.100 Due to the self-describing URIs of DBpedia (URIs are derived from the titles of the English Wikipedia articles), the meaning of DBpedia resources is admittedly mostly obvious.
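The label-coverage measurement above (in the study done via SPARQL requests against each KG) can be illustrated with a toy triple set; the intermediate node dbr:Nayim__1 is an assumed example of a label-less node of the kind described in footnote 100:

```python
def label_coverage(triples):
    """Fraction of distinct subjects that have at least one rdfs:label."""
    subjects = {s for s, p, o in triples}
    labeled = {s for s, p, o in triples if p == "rdfs:label"}
    return len(labeled) / len(subjects) if subjects else 0.0

# Toy data: a labeled entity plus an unlabeled intermediate node
# (predicate and object values are illustrative).
triples = [
    ("dbr:Hamburg", "rdfs:label", "Hamburg"),
    ("dbr:Hamburg", "rdfs:comment", "City in northern Germany"),
    ("dbr:Nayim__1", "dbo:team", "dbr:Real_Zaragoza"),
]
```

On the toy data, only one of the two subjects carries a label, giving a coverage of 0.5; label-less intermediate nodes of this kind are what depresses the DBpedia mDesc value in Table 9.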

94 The online version for browsing and HTTP lookups has always been the latest RDF dump version.
95 See http://live.dbpedia.org/, requested on Mar 4, 2016.
96 However, DBpedia live does not provide the full range of relations as DBpedia does.
97 See https://plus.google.com/109936836907132434202/posts/bu3z2wVqcQc, requested Mar 4, 2016.
98 Human-readable resource descriptions may also be represented by other relations [16]. However, we focused on those relations which are commonly used in the considered KGs.
99 For instance, "Q5127809" represents a game for the Nintendo Entertainment System, but there is no further information available for an identification of the entity.
100 E.g., http://dbpedia.org/page/Nayim links via dbo:careerStation to 10 entities of his career stations.

Labels in multiple languages Evaluation method. Here we measure whether the KGs contain labels (rdfs:label) in other languages than English. This is done by means of the language annotations of the literal values such as "@de".

Evaluation results. According to this evaluation method, DBpedia provides labels in 13 languages. Further languages are provided in the localized DBpedia versions. YAGO integrates statements of the different language versions of Wikipedia into one KG. Therefore, it provides labels in 326 different languages. Freebase and Wikidata also provide a lot of languages (244 and 395 languages, respectively). Contrary to the other KGs, OpenCyc only provides labels in English.

We also measured the coverage of selected languages in the KGs by means of the language annotations.101 This was done for the languages English, French, German, Spanish, and Italian. Our results show that DBpedia, YAGO, and Freebase achieve a high coverage regarding the English language. However, only YAGO has a coverage of over 10% for German. In contrast to those KGs, Wikidata shows a coverage regarding the English language of only 54.6%, but a coverage of over 30% for further languages such as German and French. Wikidata is, hence, not only the most diverse KG in terms of languages, but also has the highest coverage regarding non-English languages.

Understandable RDF serialization The provisioning of understandable RDF serializations in the context of URI dereferencing leads to a better understandability for human data consumers. DBpedia, YAGO, and Wikidata provide N-Triples and N3/Turtle serializations. Freebase, in contrast, only provides a Turtle serialization. OpenCyc only provides RDF/XML, which is regarded as not easily understandable. We provide an overview of all provided serialization formats of the different KGs in Section 4.2.8.

Self-describing URIs We can see here two different paradigms of URI usage: On the one hand, DBpedia, OpenCyc, and YAGO rely on human-readable URIs and therefore achieve the full fulfillment score. In DBpedia and YAGO, the URIs of the entities are determined by the corresponding English Wikipedia article. The mapping to the English Wikipedia is thus trivial. In case of OpenCyc, two RDF exports are provided: one using generic and one using self-describing URIs. The self-describing URIs are thereby derived from the rdfs:label values of the resources.

On the other hand, Wikidata and Freebase (the latter in part) rely on IDs: Wikidata uses Q-IDs for resources and P-IDs for relations. Freebase uses M-IDs for identifying resources. Entities can alternatively be identified by self-describing keys (e.g., /en/Michael_Jackson instead of /m/09889g). In the RDF export, those self-describing keys are only provided as literals of the relation /key/en (or in other languages with the corresponding language code). Classes and relations are identified via self-describing URIs.102

Table 10
Interoperability of KGs.

           DB    FB    OC    WD    YA
mReif      1     0.5   0.5   0     0.5
miSerial   1     0     0.5   1     1
mextVoc    0.61  0.11  0.41  0.68  0.13
mpropVoc   0.15  0     0.51  >0    0

4.2.8. Interoperability

Avoiding blank nodes and RDF reification RDF reification allows representing further information about single RDF triples. Wikidata makes extensive use of RDF reification. Due to that, SPARQL queries for Wikidata data are more complex than SPARQL queries for KGs without reification. YAGO and Freebase also use reification, e.g., in order to store provenance information. However, most of the statements are usable without reification. Blank nodes are non-dereferencable, anonymous resources. They are used by the Wikidata and OpenCyc data models.

Provisioning of several serialization formats DBpedia, YAGO, and Wikidata fulfill the criterion of provisioning several RDF serialization formats to the full extent, as they provide data during the URI dereferencing alternatively in RDF/XML or other serialization formats. DBpedia and YAGO provide (besides N3/Turtle, N-Triples, and JSON) further RDF serialization formats via their SPARQL endpoints, such as JSON-LD, Microdata, and CSV.
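The label-coverage measurement described above can be sketched in a few lines; the toy label set and the helper below are our own illustration, not the evaluation code used for the study.

```python
# Sketch (our own illustration, not the paper's evaluation code): computing
# per-language label coverage from language-tagged rdfs:label literals.
labels = [
    ("ex:Berlin", "Berlin", "en"),
    ("ex:Berlin", "Berlin", "de"),
    ("ex:Paris", "Paris", "en"),
    ("ex:Paris", "Paris", "fr"),
    ("ex:Q1", "thing", "en"),
]

def language_coverage(labels):
    """Fraction of entities having at least one label per language tag."""
    entities = {s for s, _, _ in labels}
    langs = {lang for _, _, lang in labels}
    return {lang: len({s for s, _, l in labels if l == lang}) / len(entities)
            for lang in langs}

cov = language_coverage(labels)
print(sorted(cov.items()))  # 'en' fully covered; 'de' and 'fr' at one third
```

Note that, as footnote 101 points out, literals without a language annotation simply contribute no language information to such a measurement.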

101 Note that literals such as rdfs:label do not necessarily have language annotations. In those cases, no language information is available.
102 E.g., /music/album for the class music albums and /people/person/date_of_birth for the relation date of birth of a person.

Freebase is the only KG providing RDF only in Turtle format.

Using external vocabulary This criterion indicates how often external vocabulary is used in comparison to proprietary vocabulary. For that, for each KG we divide the occurrence number of triples with external relations by the number of all relations in this KG. DBpedia uses 37 distinct external relations from 8 different vocabularies, while the other KGs mainly restrict themselves to the external vocabularies of RDF, RDFS, and OWL. In accordance with that, DBpedia is covered by external vocabulary to a large extent (ratio of 0.610).

Also, Wikidata reveals a high external vocabulary ratio. We can mention two obvious reasons for that fact: 1. Wikidata is provided in a huge variety of languages, leading to 85M rdfs:label values and 140M schema:description values. 2. Wikidata makes extensive usage of reification. Out of the 140M triples used for instantiations via rdf:type, about 74M (i.e., about half) are used for instantiations of statements, i.e., for reification.

Interoperability of proprietary vocabulary Evaluation method. This criterion determines the degree to which URIs of proprietary vocabulary are linked to external vocabulary via equivalency relations. For each KG, we measure which classes and relations are linked via owl:sameAs,103 owl:equivalentClass (in Wikidata: wdt:P1709), and owl:equivalentProperty (in Wikidata: wdt:P1628) to external vocabulary. Other relations such as rdfs:subPropertyOf could be taken into account; however, in this work we only consider equivalency relations. Regarding DBpedia, we only consider the DBpedia ontology. Freebase only provides owl:sameAs links in the form of a separate RDF file, but these links are only on instance level. YAGO contains 553,768 owl:equivalentClass links to classes under the namespace http://dbpedia.org/class/yago. However, as the YAGO class hierarchy was imported into DBpedia, we do not count these links as external links for YAGO.

Regarding its classes, DBpedia reaches a relatively high interlinking degree of about 48.4%, using FOAF, Wikidata, Schema.org, and DUL104 as linking targets. Regarding its relations, DBpedia links to Wikidata and schema.org,105 but reaches only a relation linking coverage of 6.3%. The reason for that lies in the fact that many of the relations in the DBpedia ontology are not or rarely used (see Section 4.1.3).

Regarding the classes, Wikidata provides links mainly to DBpedia. Considering all Wikidata classes, only about 0.1% of all Wikidata classes are linked to equivalent external classes. This may be the result of the high number of classes in Wikidata in general. Regarding the relations, Wikidata provides links in particular to FOAF and schema.org relations and achieves here a linking coverage of 2.1%. Although this is low, important relations are linked.106

In OpenCyc, about half of all classes exhibit at least one external linking via owl:sameAs.107 Internal links to resources of sw.cyc.com, the commercial version of OpenCyc, were ignored in our evaluation. The considered classes are mainly linked to FOAF, UMBEL, DBpedia, and linkedmdb.org, the relations mainly to FOAF, DBpedia, Dublin Core Terms, and linkedmdb.org. The relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs (see, for instance, Medelyan et al. [33]).

105 Examples are dbo:birthDate linked to "date of birth" (wdt:P569) and schema:birthDate.
106 Such as rdf:type and rdfs:subClassOf.

Table 11
Accessibility of KGs.

            DB    FB    OC    WD    YA
mDeref      1     1     0.44  0.41  1
mAvai       <1    0.73  <1    <1    1
mSPARQL     1     1     0     1     0
mExport     1     1     1     1     1
mNegot      0.5   1     0     1     0
mHTML_RDF   1     1     1     1     0
mMeta       1     0     0     0     1

4.2.9. Accessibility

Dereferencing possibility of resources Evaluation method.
We measured the dereferencing possibilities of resources by means of trying to dereference URIs containing the fully-qualified domain name of the KG. For that, we randomly selected 15,000 URIs in the subject, predicate, and object position using the SPARQL option "ORDER BY RAND()". We submitted HTTP requests with the HTTP accept header field set to application/rdf+xml in order to perform content negotiation.

Evaluation results DBpedia and OpenCyc dereferenced all URIs successfully and returned appropriate RDF data, so that they fulfilled this criterion completely. For DBpedia, 45,000 URIs were analysed, for OpenCyc only 30,116 due to the small number of distinct predicates. Almost the same picture for YAGO: no notable errors during dereferencing could be measured.

For Wikidata, which also contains only few unique predicates, we could analyse 34,812 URIs. URIs of Wikidata relations with supplements (e.g., the supplement "s" as in wdt:P1024s is used for relations referring to a statement), as they were recently introduced, could not be dereferenced. Furthermore, the blank nodes for reification on the subject and object position cannot be dereferenced.

Regarding Freebase, nearly all URIs on subject and object position of triples could be dereferenced. A few resources were repeatedly not dereferencable (HTTP server error 503; e.g., m/0156q). Interestingly, server errors also appear while browsing Freebase, so that data is partially not available.108 Regarding the predicate position, many URIs are not dereferencable due to server errors (503) or due to unknown URIs (404).109 If a large number of Google API requests are performed (as in case of Freebase requests), an API key is necessary. In our experiments, the access was blocked after a few thousand requests. The API key would increase the daily limit up to 100,000 requests. In conclusion, without an API key, the Freebase KG is usable in limited form only.

Availability of the KG Evaluation method We measured the availability of KGs with the monitoring web service Pingdom.110 For each KG, an uptime test was set up, which checks the availability of a certain resource for successful URI dereferencing (i.e., returning HTTP status code 200 OK) every minute over the time range of 60 days (Dec 18, 2015–Feb 15, 2016).

Evaluation result The online version of YAGO showed the worst availability performance. Here, the outages occurred on a regular basis and lasted long (see the diagram on our website). While the other KGs showed almost no outages and were again online after some minutes on average, YAGO was on average 3.5 hours offline per outage. In the given time range, four outages took longer than one day. Based on these insights, we recommend using a local version of YAGO for time-critical queries.

Availability of a public SPARQL endpoint The SPARQL endpoints of DBpedia and YAGO are provided by a Virtuoso server, the Wikidata SPARQL endpoint via Blazegraph. Freebase and OpenCyc do not provide an official SPARQL endpoint. However, an endpoint for the MQL query language for the Freebase KG is available.

Especially regarding the Wikidata SPARQL endpoint we observed certain limitations: The maximum execution time per query is set to 30 seconds; there is no limitation regarding the number of returned rows, but the frontend of the endpoint crashed in case of large result sets.111 Although public SPARQL endpoints need to be prepared for inefficient queries, the time limit of Wikidata may impede the execution of reasonable queries.

Provisioning of an RDF export All considered KGs provide RDF exports as downloadable files. The format of the data differs from KG to KG. Mostly, data is provided in N-Triples and Turtle format.

Support of content negotiation We measure the support of content negotiation regarding the serialization formats RDF/XML, N3/Turtle, and N-Triples. OpenCyc does not provide any content negotiation; only RDF/XML is supported as content type. Therefore, OpenCyc does not fulfill the current criterion of providing content negotiation.

103 As OpenCyc contains owl:sameAs also on the schema level, we consider this relation also here.
104 See http://www.ontologydesignpatterns.org/ont/dul/DUL.owl.
107 OpenCyc makes no difference between instance and schema level for owl:sameAs. The OWL primer states "The built-in OWL property owl:sameAs links an individual to an individual", but also "The owl:sameAs statements are often used in defining mappings between ontologies", see https://www.w3.org/TR/owl-ref/#sameAs-def.
108 See http://www.freebase.com/m/0156q, the page about "Berlin", requested on Mar 5, 2016 and Apr 7, 2016.
109 Besides that, a URL forwarding is performed from http://rdf.freebase.com/ns/m.03hrz to http://www.googleapis.com/freebase/v1/rdf.
110 See https://www.pingdom.com, requested on Mar 2, 2016. The HTTP requests of Pingdom are executed by various servers so that caching is prevented.
111 Querying up to 1.5M rows was possible in our experiments.
112 http://dbpedia.org/resource/[resource]
113 https://www.wikidata.org/entity/[Resource]
114 http://www.yago-knowledge.org/resource/[resource]
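The dereferencing checks described in this subsection boil down to content negotiation plus HTTP status handling. A minimal sketch (the URI is only an example; no request is sent unless the request object is actually opened):

```python
# Sketch of the dereferencing check described above: ask for RDF via content
# negotiation and classify the HTTP outcome. The URI is only an example; no
# network request is performed here.
import urllib.request

def rdf_request(uri):
    """Build a request asking for RDF/XML via the Accept header."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})

def classify(status):
    """Bucket an HTTP status code as in the evaluation above."""
    if status == 200:
        return "ok"
    if 400 <= status < 500:
        return "client error"   # e.g. 404 for unknown predicate URIs
    if 500 <= status < 600:
        return "server error"   # e.g. the 503 responses observed for Freebase
    return "other"

req = rdf_request("http://dbpedia.org/resource/Berlin")
print(req.get_header("Accept"), classify(503))
```

Passing such a request to urllib.request.urlopen would then exercise the server's content negotiation, with the returned Content-Type revealing whether RDF or HTML came back.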

The endpoints for DBpedia112, Wikidata113, and YAGO114 correctly return the appropriate RDF serialization format or the corresponding HTML representation. Freebase115, however, currently does not provide any content negotiation or HTML resource representation. Currently, only text/plain is returned as content type.

Noteworthy is also that regarding the N-Triples serialization YAGO and DBpedia expect the accept header text/plain and not application/n-triples. This is due to the usage of Virtuoso as endpoint. For DBpedia, the forwarding to http://dbpedia.org/data/[resource].ntriples does not work; instead, the HTML representation is returned.

Linking HTML sites with RDF serialization All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations by means of links in the HTML headers.

Provisioning of metadata about the KG For this criterion we measured which KG metadata is available (such as in the form of a VoID file116). DBpedia integrates the VoID vocabulary directly in its KG117 and provides information such as the SPARQL endpoint URL and the number of all triples. OpenCyc reveals the current KG version number via owl:versionInfo. For YAGO, Freebase, and Wikidata no meta information could be found.

Table 12
Provisioning machine-readable licensing information of KGs.

             DB  FB  OC  WD  YA
mmacLicense  1   0   0   1   0

4.2.10. License

Provisioning machine-readable licensing information DBpedia and Wikidata provide licensing information about their KG data in machine-readable form. For DBpedia, this is done in the ontology via cc:license118 and links to either CC-BY-SA119 or the GNU Free Documentation License (GNU FDL)120. Wikidata embeds licensing information during the dereferencing of resources in the RDF document via cc:license and a link to the CC0 license.121 YAGO and Freebase do not provide machine-readable licensing information. However, their data is published under the CC-BY license.122 OpenCyc embeds licensing information into the RDF document during dereferencing, but not in machine-readable form (only as plaintext with further information under rdfs:comment).

Table 13
Linking via owl:sameAs of KGs.

       DB    FB    OC    WD       YA
mInst  0.59  0.02  0.44  0 (.65)  0.31
mURIs  0.93  0.95  0.89  0.96     0.96

4.2.11. Interlinking

Linking via owl:sameAs For this metric, we queried all subjects of owl:sameAs triples in each KG, where the resource in the object position is out of the domain of the KG (i.e., an "external" resource). In case of Wikidata, proprietary identifier relations were taken into account as well. The reason for that is that Wikidata does not provide any owl:sameAs links, but instead identical entities in other data sources are stored via these identifiers; i.e., owl:sameAs links can be created via URI patterns.

DBpedia and Wikidata achieved the best results w.r.t. this metric. In DBpedia, there are about 12M instances with at least one owl:sameAs link. Links to localized DBpedia versions (e.g., de.dbpedia.org) were counted as internal links and, hence, not considered here. In total, 59.2% of the instances have at least one owl:sameAs link. We can therefore confirm the statement by Bizer et al. [12] that DBpedia has established itself as a hub in the Linked Data cloud.

In Wikidata, no owl:sameAs links are provided, and also no corresponding proprietary relation is available. Instead, Wikidata uses proprietary relations for instance equivalencies. Identifiers are instances of the class "Wikidata property representing a unique identifier" (wdt:Q19847637). The M-ID of a Freebase instance is then stored via the relation "Freebase identifier" (wdt:P646) as literal value (e.g., "/m/01x3gpk").

115 http://rdf.freebase.com/ns/[resource]
116 See https://www.w3.org/TR/void/, requested on Apr 7, 2016.
117 See http://dbpedia.org/void/page/Dataset, requested on Mar 5, 2016.
118 Using the namespace http://creativecommons.org/ns#.
119 http://creativecommons.org/licenses/by-sa/3.0/
120 http://www.gnu.org/copyleft/fdl.html.
121 See http://creativecommons.org/publicdomain/zero/1.0/.
122 See http://creativecommons.org/licenses/by/3.0.
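The interlinking metric just described can be sketched as follows, with invented toy data; as above, targets on localized subdomains count as internal links.

```python
# Sketch (toy data, not the paper's evaluation code): the share of instances
# with at least one owl:sameAs link whose target lies outside the KG's own
# domain; localized subdomains such as de.dbpedia.org count as internal.
from urllib.parse import urlparse

SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

def external_linking_ratio(instances, triples, kg_domain):
    def is_external(uri):
        host = urlparse(uri).netloc
        return not (host == kg_domain or host.endswith("." + kg_domain))
    linked = {s for s, p, o in triples if p == SAMEAS and is_external(o)}
    return len(linked & set(instances)) / len(instances)

instances = ["http://dbpedia.org/resource/Berlin",
             "http://dbpedia.org/resource/Nayim"]
triples = [
    ("http://dbpedia.org/resource/Berlin", SAMEAS,
     "http://sws.geonames.org/2950159/"),
    ("http://dbpedia.org/resource/Berlin", SAMEAS,
     "http://de.dbpedia.org/resource/Berlin"),
]
print(external_linking_ratio(instances, triples, "dbpedia.org"))  # 0.5
```

For Wikidata, the same counting scheme applies after its proprietary identifier literals have been expanded into URIs, as the text explains.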

So far, 426 distinct identifiers are maintained this way. The identifiers need to be transformed into valid URIs during the RDF export. The browser interface of Wikidata already transforms the identifiers. Counting at most one identifier per resource, we obtain 12,151,147 resources and, hence, a coverage of 65%. However, although the links provide relevant contents, an RDF representation of the resources is not always available; instead, the representation is often in HTML.

Validity of external URIs Regarding the dimension accessibility, we already analyzed the dereferencing possibility of resources using the KG namespace. Now we analyse the external links of the KG. This includes owl:sameAs links as well as links to non-RDF-based Web resources (e.g., via foaf:homepage). We measure errors such as timeouts, client errors (HTTP response 4xx), and server errors (HTTP response 5xx).

The external links are valid in most cases for all KGs: All KGs obtain a metric value between 0.89 and 0.96. OpenCyc contains mainly external links to non-RDF-based Web resources at wikipedia.org and w3.org. Despite a few invalid links, the links are valid.

Regarding the owl:sameAs links, YAGO and Freebase achieve high metric values. This is due to the fact that YAGO links mainly to DBpedia and GeoNames and Freebase mainly to Wikidata. The corresponding resources are highly available there.

For Wikidata, the relation "reference URL" (wdt:P854), which states provenance information among other relations, belongs to the links linking to external Web resources. Here we were able to resolve 95.5% of the 2,451 URIs without errors. DBpedia contains provenance information via the relation prov:wasDerivedFrom. Since almost all links refer to Wikipedia, 99% of the resources are available.

Noticeable is that DBpedia and OpenCyc contain many owl:sameAs links to domains which do not exist anymore (e.g., http://rdfabout.com, http://www4.wiwiss.fu-berlin.de/factbook/, and http://wikicompany.org). One solution for such invalid links might be to remove them if they have been invalid for a certain time span.

4.3. Bottom Line

We now summarize the evaluation results of Section 4.2:

– Syntactic Validity of RDF documents All KGs provide syntactically valid RDF documents.
– Syntactic Validity of Literals In general, the KGs achieve good scores regarding the syntactic validity of literals. Although OpenCyc comprises over 1M literals in total, these literals are mainly labels and descriptions which are not formatted in a special format. For YAGO, we detected about 500K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. Obviously, the syntactic invalidity of literals is accepted by the publishers in order to keep the number of triples low. In case of Wikidata, some few invalid literals such as ISBN values have been corrected since the time of evaluation. This indicates that knowledge in Wikidata is curated continuously. For DBpedia, comments next to the values to be extracted (such as ISBN) in the info-boxes of Wikipedia led to inaccurately extracted values.
– Semantic Validity of Triples All considered KGs scored well regarding this metric. Note, however, that a qualitatively better evaluation is achievable by a manual evaluation.
– Trustworthiness on KG level Based on the way data is imported and curated, OpenCyc and Wikidata can be trusted the most.
– Trustworthiness on statement level Here, especially good values are achieved for Freebase, Wikidata, and YAGO. YAGO stores per statement both the source and the extraction technique, which is unique among the KGs. Wikidata also supports storing the source of information, but only 1.3% of the statements have provenance information attached to them. Note, however, that not every statement in Wikidata requires a reference and that it is hard to evaluate which statements lack such a reference.
– Using unknown and empty values Wikidata and Freebase allow the indication of empty values; Wikidata also allows storing which values are unknown.
– Check of schema restrictions during insertion of new statements Since Freebase and Wikidata are editable by community members, simple consistency checks are made during the insertion of new facts in the user interface.
– Consistency of statements w.r.t. class constraints Freebase and Wikidata do not specify any class constraints via owl:disjointWith, while the other KGs do.

– Consistency of statements w.r.t. relation constraints The inconsistencies of all KGs regarding the range indications of relations are mainly due to inconsistent data types and due to missing instantiations of the instances. In order to achieve a higher instantiation ratio, instances could be instantiated by the respective classes as soon as those resources occur in relations where domain and, respectively, range are defined. Regarding the constraint of functional properties: The relation owl:FunctionalProperty is used by all KGs except Wikidata; in most cases the KGs comply with the usage restrictions of this relation.
– Creating a ranking of statements Only Wikidata supports a ranking of statements. This is in particular worthwhile in case of statements which are only temporally limited in their validity.
– Schema Completeness The DBpedia ontology was created manually and covers all domains well. However, it is incomplete in many details. The relations are considerably well covered in the DBpedia ontology. The YAGO classes are connected to WordNet synsets. Some YAGO relations (e.g., yago:created) are ambiguous and can therefore be understood in different senses. Freebase does not use a class hierarchy and classes are grouped into different domains, i.e., it is sometimes difficult to find related classes if they are not in the same domain. The Freebase relations are complete w.r.t. our gold standard, since they are widely applicable and since many of them are available. Wikidata covers all relations of the gold standard, even though it contains considerably fewer relations than Freebase, for instance. This is due to the Wikidata process of approving relations before adopting them. OpenCyc is complete w.r.t. the classes; it lacks some relations of the gold standard.
– Column Completeness DBpedia and OpenCyc show the best column completeness values, i.e., many entities have values for relations which are defined on the schema level.
– Population Completeness Not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities.
– Timeliness frequency of the KG Only Wikidata achieves the highest fulfillment score for this criterion, as it provides continuously updated data for URI dereferencing and an RDF export on a monthly basis.
– Specification of the validity period of statements In YAGO, Freebase, and Wikidata the temporal validity period of statements (e.g., term in office) can be specified.
– Specification of the modification date of statements Only Freebase keeps the modification dates of statements. Wikidata provides the modification date during URI dereferencing.
– Description of resources YAGO, Wikidata, and OpenCyc contain a label for almost every entity. Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Manual investigations suggest that especially statements of higher degrees such as career stations of people are not modeled via blank nodes, but via intermediate nodes for which no labels are provided.
– Labels in multiple languages YAGO, Freebase, and Wikidata support hundreds of languages regarding their stored labels. Only OpenCyc contains labels merely in English. While DBpedia, YAGO, and Freebase show a high coverage regarding the English language, Wikidata, in contrast, does not have such a high coverage regarding English, but instead covers other languages considerably. It is, hence, not only the most diverse KG in terms of languages, but also the KG which contains the most labels for languages other than English.
– Understandable RDF serialization DBpedia, YAGO, and Wikidata provide several understandable RDF serialization formats. Freebase only provides the understandable format RDF/Turtle. OpenCyc relies only on RDF/XML, which is seen as not easily readable for humans.
– Self-describing URIs We can find mixed paradigms regarding the URI generation: DBpedia, YAGO, and OpenCyc rely on human-readable URIs, while Wikidata and Freebase (in part, i.e., for resources) use identifiers.
– Avoiding blank nodes and RDF reification Wikidata, YAGO, and Freebase are the KGs which use reification. This is important when querying and reusing the data.

– Provisioning of several serialization formats Freebase is the only KG providing data in the serialization format RDF/Turtle only.
– Using external vocabulary DBpedia and Wikidata show high degrees of external vocabulary usage. Regarding DBpedia, the RDF, RDFS, and OWL vocabulary is used. Wikidata has a high external vocabulary ratio, since there are many language labels and descriptions (modeled via rdfs:label and schema:description). Also, due to instantiations of statements for reification purposes, rdf:type is used a lot.
– Interoperability of proprietary vocabulary While almost every second class in DBpedia is linked to external classes, only 6.3% of all relations have links to external relations, i.e., relations not within the DBpedia namespace. Note that many of the DBpedia relations are not or rarely used. Wikidata shows a very low interlinking degree of classes to external classes and of relations to external relations. While the introduction of new relations needs to be approved, classes can be introduced immediately. We see, however, no influence of this fact on the number of links to external classes and relations. Surprising is that half of all OpenCyc classes exhibit at least one owl:sameAs link.
– Dereferencing possibility of resources Resources in DBpedia, OpenCyc, and YAGO can be dereferenced without considerable issues. Wikidata has introduced relations with supplements and also uses blank nodes. Those kinds of terms are not dereferencable. For Freebase we measured a quite considerable amount of dereferencing failures due to server errors and unknown URIs. Note also that Freebase requires an API key for many API requests.
– Availability of the KG While all other KGs showed almost no outages, YAGO shows a noteworthy instability regarding its online availability. We measured 109 outages for YAGO in the given time interval, taking on average 3.5 hours.
– Provisioning of public SPARQL endpoints Noteworthy is here that the Wikidata SPARQL endpoint has a maximum execution time per query of 30 seconds. This might be a bottleneck for some queries.
– Provisioning of an RDF export Mostly, RDF export data of the KGs is provided in N-Triples and Turtle format.
– Support of content negotiation OpenCyc does not support any content negotiation; only RDF/XML is provided. Freebase currently only returns text/plain as content type.
– Linking HTML sites with RDF serialization All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations.
– Provisioning of metadata about a KG Only DBpedia and OpenCyc integrate metadata about the KG in some form. DBpedia has the VoID vocabulary integrated, while OpenCyc reveals the current KG version as machine-readable metadata.
– Provisioning machine-readable licensing information Only DBpedia and Wikidata provide licensing information about their KG data in machine-readable form.
– Interlinking via owl:sameAs DBpedia and Wikidata provide the highest number of instance equivalency relations (owl:sameAs links in DBpedia and the proprietary relations with literals in Wikidata). Based on the resource interlinkage, these KGs may be called Linked Data hubs among the considered KGs.
– Validity of external URIs The links to external Web resources are valid in most cases for all KGs. OpenCyc contains mainly external links to non-RDF-based Web resources at Wikipedia and w3.org. DBpedia and OpenCyc contain many owl:sameAs links to domains which do not exist anymore; those links could be deleted.

5. KG Recommendation Framework

We now propose a framework for selecting the most suitable KG for a given setting. The goal of using the framework is to obtain a concrete recommendation or an appropriate preselection of a given set of knowledge graphs G = {g1, ..., gn}. For that, the user needs to go through the steps as depicted in Figure 10.

In Step 1, the preselection criteria and the weights for the criteria are specified. The preselection criteria can be both quality criteria and general criteria and need to be selected dependent on the use case. The timeliness frequency of the KG is an example for a quality criterion. The license under which a KG is provided is an example for a general criterion. If the active curation of the KG is a criterion given from the requirements' analysis, Freebase can be excluded, since this KG is not curated anymore.

Step 1: Requirements Analysis the criteria consistency [11], licensing and interlink- • Identification of the ing [42]. Further, we reintroduce the dimensions trust- • set of preselection criteria: P worthiness and interoperability as a collective term for • 푤 푓표푟 푒푎푐ℎ 푐 ∈ 퐶 weighting of the DQ criteria: 푖 푖 multiple dimensions. As many criteria and metrics are rather abstract, Step 2: Preselection based on the Preselection Criteria we select and develop criteria w.r.t. their applicabil- ity to cross-domain knowledge graphs of the linked • Set of KGs that fulfill the preselection criteria: G푃 open data cloud. Table 15 shows an overview of the most relevant papers that were considered during lit- Step 3: Quantitative Assessment of the KGs erature review that have similarities with criteria in • Evaluate our framework stating generic guidelines (e.g. for data • DQ criteria: 푚푖 푔 푓표푟 푒푎푐ℎ 푐푖 ∈ 퐶 publishers [24]) and introducing both criteria with cor- • KGs: ℎ 푔 푓표푟 푒푎푐ℎ 푔 ∈ 퐺 푃 responding metrics (e.g. [18,27]) and the criteria with- 퐺푅푒푠푢푙푡 = 푔 ∈ 퐺푃 ℎ 푔 = max(ℎ 푔 )} 푔 out metrics (e.g. [35,26]). In total, 27 of 34 criteria Step 4: Qualitative Assessment of the Result are supported by concepts described in earlier papers. Furthermore, we introduce seven new criteria mgraph, Fig. 10. Proposed Selection Process when using our framework. mNoV al, mcheckRstr, mRanking, mF req, mV alidity and mAvai. In the following, we give an overview of After weighting the criteria and dimensions, those which criteria have been presented so far and in which KGs are neglected which do not fulfill the preselection novel contexts we apply them. criteria. Pipino et al. introduce the criteria schema complete- Afterwords, the fulfillment degrees of the remaining ness, column completeness and population complete- KGs are calculated and the KG with the highest fulfill- ness in the context of databases [35]. We introduce ment degree is selected. 
metrics and apply them, to the best of our knowl- Finally, in a last step the result can be assessed edge, the first time on DBpedia and other cross-domain w.r.t. qualitative aspects (considering our quantitative knowledge graphs. assessments of the KGs in Section 4.2) and, if neces- OntoQA [38] introduces criteria and corresponding sary, another KG can be selected for the actual setting. metrics that can be used for the analysis of ontologies. Table 14 shows an example of how to use our frame- Besides simple statistical figures such as the average work for ranking KGs: Given the values of all metrics of instances per class, Tartir et al. introduce also crite- for the KGs as determined in our evaluation (see Sec- ria and metrics similar to our quality criteria descrip- tion 4.2), we can calculate a total score for each KG. tion of resources ( mDescr) and column completeness This total score can either be based on an unweighted (mcCol). Based on a large-scale crawl of the semantic averaging of the single metric values per KG, or be web, Hogan et al. [26] analyze quality problems based based on a weighting as specified by the user (see last on frequent errors in publishing data in RDF. Later, column in the table). Based on the total scores, the user Hogan et al. [27] introduce further criteria and met- can either take the KG with the highest score, or make rics based on linked data guidelines for data publish- a qualitative assessment of the KGs in addition (Step ers [24]. Whereas Hogan et al. crawl and analyze many 4). knowledge graphs, we analyze a selected set of knowl- edge graphs in more detail. Heath et al. [24] provide guidelines for linked data 6. Related Work but do not introduce criteria or metrics for the assess- ment of linked data quality. Still, the guidelines can 6.1. Linked Data Quality Criteria be easily translated into relevant criteria and metrics, e.g. „Do you refer to additional access methods“ leads Zaveri et al. 
[42] provide a conceptual framework to the criteria provisioning of public SPARQL end- for quality assessment of linked data based on the sys- points ( mSP ARQL) and provisioning of an RDF ex- tematic review of quality criteria and metrics which are port (mExport) or „Do you map proprietary vocabu- assigned to quality dimensions and categories based lary terms to other vocabularies?“ leads to the criteria on the framework of Wang et al. [40]. Our framework interoperability of proprietary vocabulary ( mpropV oc). is also based on Wang’s dimensions and extended by Similar metrics to our framework that are based on M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO 41
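Steps 1 and 2 of the proposed selection process (Fig. 10) can be sketched in a few lines: preselection criteria are modeled as boolean predicates over KG descriptions, and any KG failing a predicate is excluded. This is a minimal illustration, not the paper's implementation; the property names are our own assumptions, the SPARQL-endpoint flags follow the mSPARQL row of Table 14, and the curation flag follows the observation in the text that Freebase is no longer curated.

```python
# Step 1-2 sketch: preselect KGs via boolean criteria (illustrative only).
kg_properties = {
    "DBpedia":  {"actively_curated": True,  "sparql_endpoint": True},
    "Freebase": {"actively_curated": False, "sparql_endpoint": False},
    "OpenCyc":  {"actively_curated": True,  "sparql_endpoint": False},
    "Wikidata": {"actively_curated": True,  "sparql_endpoint": True},
    "YAGO":     {"actively_curated": True,  "sparql_endpoint": True},
}

# P: the preselection criteria chosen during the requirements analysis
preselection = [
    lambda kg: kg["actively_curated"],
]

# G_P: the KGs that fulfill all preselection criteria
G_P = [name for name, props in kg_properties.items()
       if all(criterion(props) for criterion in preselection)]

print(G_P)  # Freebase is excluded, since it is not curated anymore
```

Adding further predicates (e.g., requiring a public SPARQL endpoint) narrows G_P accordingly before the quantitative assessment of Step 3.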

Table 14
Framework with an example weighting which would be reasonable for a user setting as given in [30].

Dimension               Metric        DBpedia   Freebase  OpenCyc   Wikidata  YAGO      Example User Weighting w_i
Accuracy                msynRDF       1         1         1         1         1         1
                        msynLit       0.994     1         1         1         0.624     1
                        msemTriple    1         1         1         1         1         1
Trustworthiness         mgraph        0.5       0.5       1         0.75      0.25      1
                        mfact         0.5       1         0         1         1         2
                        mNoVal        0         1         0         1         0         1
Consistency             mcheckRestr   0         1         0         1         0         1
                        mconClass     0.875     1         0.999     1         0.333     1
                        mconRelat     0.991     0.45      1         0         0.992     1
Relevancy               mRanking      0         0         0         1         0         1
Completeness            mcSchema      0.905     0.762     0.921     1         0.952     1
                        mcCol         0.402     0.425     0         0.285     0.332     1
                        mcPop         0.93      0.94      0.48      0.99      0.89      3
Timeliness              mFreq         0.5       0         0.25      1         0.25      3
                        mValidity     0         1         0         1         1         1
                        mChange       0         1         0         0         0         1
Ease of understanding   mDescr        0.704     0.972     1         0.9999    1         3
                        mLang         1         1         0         1         1         2
                        muSer         1         1         0         1         1         1
                        muURI         1         0.5       1         0         1         2
Interoperability        mReif         1         0.5       0.5       0         0.5       1
                        miSerial      1         0         0.5       1         1         2
                        mextVoc       0.61      0.108     0.415     0.682     0.134     2
                        mpropVoc      0.15      0         0.513     0.001     0         1
Accessibility           mDeref        1         0.437     1         0.414     1         2
                        mAvai         0.9961    0.9998    1         0.9999    0.7306    2
                        mSPARQL       1         0         0         1         1         1
                        mExport       1         1         1         1         1         0
                        mNegot        0.5       0         0         1         1         1
                        mHTML_RDF     1         1         0         1         1         0
                        mMeta         1         0         1         0         0         1
Licensing               mmacLicense   1         0         0         1         0         1
Interlinking            mInst         0.592     0.018     0.443     0         0.305     2
                        mURIs         0.929     0.954     0.894     0.957     0.956     1

Unweighted Average                    0.708     0.605     0.498     0.738     0.625
Weighted Average                      0.718     0.575     0.516     0.742     0.646
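The total scores in the last rows of Table 14 are plain averages of the metric values, either unweighted or weighted by the user-specified w_i. The following sketch applies Step 3 of the selection process to a small excerpt of the table (the rows mgraph, mfact, and mcPop, with their example weights); it illustrates the computation only and therefore does not reproduce the full-table totals.

```python
# Step 3 sketch: fulfillment degree h(g) as the (weighted) average of
# the metric values m_i(g). Values and weights are an excerpt of Table 14.
metrics = {
    # metric: ({KG: m_i(g)}, w_i)
    "mgraph": ({"DBpedia": 0.5,  "Freebase": 0.5,  "OpenCyc": 1.0,
                "Wikidata": 0.75, "YAGO": 0.25}, 1),
    "mfact":  ({"DBpedia": 0.5,  "Freebase": 1.0,  "OpenCyc": 0.0,
                "Wikidata": 1.0,  "YAGO": 1.0},  2),
    "mcPop":  ({"DBpedia": 0.93, "Freebase": 0.94, "OpenCyc": 0.48,
                "Wikidata": 0.99, "YAGO": 0.89}, 3),
}

def h(kg: str, weighted: bool = False) -> float:
    """Fulfillment degree of a KG: (weighted) average over all metrics."""
    pairs = [(values[kg], weight if weighted else 1)
             for values, weight in metrics.values()]
    return sum(v * w for v, w in pairs) / sum(w for _, w in pairs)

kgs = ["DBpedia", "Freebase", "OpenCyc", "Wikidata", "YAGO"]
best = max(kgs, key=lambda kg: h(kg, weighted=True))  # G_Result on this excerpt
```

On this excerpt, as in the full table, Wikidata obtains the highest weighted score; the qualitative assessment of Step 4 may still lead the user to a different choice.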

Table 15
Overview of Related Work regarding Data Quality Criteria for KGs. For each of our criteria (msynRDF, msynLit, msemTriple, mfact, mconClass, mconRelat, mcSchema, mcCol, mcPop, mChange, mDescr, mLang, muSer, muURI, mReif, miSerial, mextVoc, mpropVoc, mDeref, mSPARQL, mExport, mNegot, mHTML_RDF, mMeta, mmacLicense, mInst, and mURIs), the table marks which of the related works [35], [38], [26], [24], [18], [20], [27], [41], [2], and [31] cover a similar criterion.

Similar metrics to our framework that are based on these guidelines can also be found in other frameworks [27,18].

Flemming [18] introduces a framework for the quality assessment of linked data that measures the linked data quality based on a sample of a few RDF documents obtained through dereferencing. Based on a systematic literature review, criteria and metrics are introduced. Notably, Flemming introduces the criteria labels in multiple languages (mLang) and validity of external URIs (mURIs) for the first time. The framework is evaluated on a sample of RDF documents of DBpedia. In contrast, we evaluate the whole knowledge graph.

SWIQA [20] is a quality assessment framework that introduces criteria and metrics for the dimensions accuracy, completeness, timeliness, and uniqueness. Notably, the dimension accuracy is described by the criteria syntactic and semantic validity according to Batini et al. [6], and the dimension completeness is described by schema completeness, column completeness, and population completeness according to Pipino et al. [35].

TripleCheckMate [2] is a framework for linked data quality assessment using a crowdsourcing approach for the manual validation of facts.
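Flemming's criterion validity of external URIs (mURIs), mentioned above, can be approximated by a purely syntactic check over a sample of external links. The following is our own minimal sketch (not Flemming's original implementation), with arbitrary example URIs:

```python
from urllib.parse import urlparse

def is_valid_http_uri(uri: str) -> bool:
    """Accept only URIs with an http(s) scheme and a non-empty host."""
    parsed = urlparse(uri)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# Ratio of syntactically valid URIs in a sample of external links
links = [
    "http://dbpedia.org/resource/Karlsruhe",
    "https://www.wikidata.org/entity/Q1040",
    "htp:/broken-link",
]
validity_ratio = sum(is_valid_http_uri(u) for u in links) / len(links)
```

A full implementation of such a metric would additionally dereference the URIs and inspect the HTTP response codes; the syntactic check alone only catches malformed links.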

Building on the TripleCheckMate approach [2], Zaveri et al. [41] and Acosta et al. [3] analyze both syntactic and semantic accuracy as well as the consistency of the data of DBpedia.

Kontokostas et al. [31] present a framework for the test-driven evaluation of linked data quality that is inspired by the paradigm of test-driven software development. The framework introduces 17 SPARQL templates for tests that can be used for analyzing knowledge graphs w.r.t. accuracy and consistency. Notably, the tests can also evaluate external constraints that exist due to the usage of external vocabulary, e.g., that foaf:mbox is an inverse-functional relation. The framework is evaluated on a set of knowledge graphs including DBpedia.

6.2. Comparing KGs by Key Statistics

Duan et al. [15], Tartir et al. [38], and Hassanzadeh et al. [23] can be mentioned as the most similar related work regarding the evaluation of the key statistics as presented in Section 4.1.

Duan et al. [15] analyze the structuredness of RDF data sets and describe them by simple statistical figures that are calculated based on RDF dumps in N-Triples serialization. In contrast to that, we use SPARQL queries to obtain the figures, thus not limiting ourselves to the N-Triples serialization of knowledge graphs. Duan et al. claim that simple statistical figures are not sufficient to analyze the structuredness and differences of RDF data sets and introduce a coherence metric. Accordingly, we not only analyze simple statistical figures but further analyze the data quality of knowledge graphs. Duan et al. provide statistics on DBpedia, YAGO2, UniProt, and multiple benchmark datasets.

OntoQA [38] introduces metrics that can be used for the analysis of ontologies, e.g., class richness, which is defined as the ratio of the number of classes with instances to the number of classes without instances. Tartir et al. do not analyze knowledge graphs but a set of ontologies, e.g., SWETO, TAP, and GlycO. Notably, the statistical measures are often not the same but at least similar; e.g., both Duan et al. and Tartir et al. calculate the ratio of instances per class, whereas we consider entities as the base line.

Tartir et al. [38] and Hassanzadeh et al. [23] analyze the coverage of domains. Tartir et al. introduce the measure importance as the number of instances per class and their subclasses. In our case, we cannot use this approach as Freebase has no hierarchy. Hassanzadeh et al. analyze the coverage of domains by listing the most frequent classes with the highest number of instances in a table. This gives only little overview of the covered domains, as entities can be instances of multiple classes in the same domain, e.g., dbo:Place and dbo:PopulatedPlace. Thus, we adapt the idea of Hassanzadeh et al.: we manually map the most frequent classes to domains such as people and geography and then delete duplicate entities within a certain domain, i.e., if an entity is instantiated both as dbo:Place and dbo:PopulatedPlace, the entity will be counted only once in the domain geography.

7. Conclusion

Freely available knowledge graphs (KGs) have not been in the focus of any extensive comparative study so far. In this survey, we defined aspects according to which KGs can be analyzed. We analyzed and compared DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along these aspects and proposed a framework and process to enable readers to find the most suitable KG for their settings.

References

[1] M. Acosta, E. Simperl, F. Flöck, and M.-E. Vidal. HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing. In K-CAP 2015: The 8th International Conference on Knowledge Capture. ACM, 2015.
[2] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, S. Auer, and J. Lehmann. Crowdsourcing linked data quality assessment. In The Semantic Web – ISWC 2013, pages 260–276. Springer, 2013.
[3] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, F. Flöck, and J. Lehmann. Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study. Semantic Web, 2016.
[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC 2007/ASWC 2007, pages 722–735. Springer, 2007.
[5] S. Auer, J. Lehmann, A.-C. Ngonga Ngomo, and A. Zaveri. Introduction to Linked Data and Its Lifecycle on the Web. In Reasoning Web. Semantic Technologies for Intelligent Data Access, volume 8067 of Lecture Notes in Computer Science, pages 1–90. Springer, 2013.
[6] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino. Methodologies for Data Quality Assessment and Improvement. ACM Computing Surveys, 41(3):16:1–16:52, 2009.
[7] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, and P. F. Patel-Schneider. OWL Web Ontology Language Reference. https://www.w3.org/TR/2004/REC-owl-ref-20040210, 2004. [Online; accessed 06-Apr-2016].

[8] T. Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].
[9] T. Berners-Lee. Linked Data Is Merely More Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].
[10] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 284(5):29–37, 2001.
[11] C. Bizer. Quality Driven Information Filtering: In the Context of Web Based Information Systems. VDM Publishing, 2007.
[12] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.
[13] P. Buneman, S. Khanna, and T. Wang-Chiew. Why and where: A characterization of data provenance. In Database Theory – ICDT 2001, pages 316–330. Springer, 2001.
[14] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 601–610. ACM, 2014.
[15] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and oranges: a comparison of RDF benchmarks and real RDF datasets. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 145–156. ACM, 2011.
[16] B. Ell, D. Vrandečić, and E. Simperl. Labels in the Web of Data. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011), pages 162–176. Springer, 2011.
[17] C. Fellbaum. WordNet – An Electronic Lexical Database. MIT Press, 1998.
[18] A. Flemming. Qualitätsmerkmale von Linked-Data-veröffentlichenden Datenquellen (Quality characteristics of linked data publishing datasources). Diploma thesis, Humboldt University of Berlin, http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/diploma_seminar_thesis/Diplomarbeit_Annika_Flemming.pdf, 2011.
[19] G. Freedman and E. Reynolds. Enriching basal reader lessons with semantic webbing. Reading Teacher, 33(6):677–684, 1980.
[20] C. Fürber and M. Hepp. SWIQA – A Semantic Web Information Quality Assessment Framework. In Proceedings of the 19th European Conference on Information Systems (ECIS 2011), volume 15, page 19, 2011.
[21] R. Guns. Tracing the Origins of the Semantic Web. Journal of the American Society for Information Science and Technology, 64(10):2173–2181, 2013.
[22] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data. In The Semantic Web – ISWC 2010, pages 305–320. Springer, 2010.
[23] O. Hassanzadeh, M. J. Ward, M. Rodriguez-Muro, and K. Srinivas. Understanding a large corpus of web tables through matching with knowledge bases: an empirical study. In Proceedings of the 10th International Workshop on Ontology Matching collocated with the 14th International Semantic Web Conference (ISWC 2015), 2015.
[24] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1):1–136, 2011.
[25] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.
[26] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In Proceedings of the WWW2010 Workshop on Linked Data on the Web (LDOW 2010), volume 628 of CEUR Workshop Proceedings, 2010.
[27] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of linked data conformance. Web Semantics: Science, Services and Agents on the World Wide Web, 14:14–44, 2012.
[28] P. Jain, P. Hitzler, K. Janowicz, and C. Venkatramani. There's No Money in Linked Data. http://corescholar.libraries.wright.edu/cse/240, 2013. [Online; accessed 20-Jul-2015].
[29] J. M. Juran, F. Gryna, and R. Bingham. Quality Control Handbook. McGraw-Hill, New York, 1974.
[30] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference (ESWC 2009), pages 723–737. Springer, 2009.
[31] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, pages 747–758. ACM, 2014.
[32] M. Mecella, M. Scannapieco, A. Virgillito, R. Baldoni, T. Catarci, and C. Batini. Managing data quality in cooperative information systems. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 486–502. Springer, 2002.
[33] O. Medelyan and C. Legg. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Wikipedia and Artificial Intelligence: An Evolving Synergy, Papers from the 2008 AAAI Workshop, page 65, 2008.
[34] F. Naumann. Quality-Driven Query Answering for Integrated Information Systems, volume 2261 of Lecture Notes in Computer Science. Springer, 2002.
[35] L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data quality assessment. Communications of the ACM, 45(4):211–218, 2002.
[36] E. Sandhaus. Semantic Technology at the New York Times: Lessons Learned and Future Directions. In Proceedings of the 9th International Semantic Web Conference (ISWC 2010), Part II, pages 355–355. Springer, 2010.
[37] T. Pellissier Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The Great Migration. In Proceedings of the 25th International Conference on World Wide Web (WWW 2016), 2016.
[38] S. Tartir, I. B. Arpinar, M. Moore, A. P. Sheth, and B. Aleman-Meza. OntoQA: Metric-based ontology quality analysis. In IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, 2005.
[39] R. Y. Wang, M. P. Reddy, and H. B. Kon. Toward quality data: An attribute-based approach. Decision Support Systems, 13(3):349–372, 1995.
[40] R. Y. Wang and D. M. Strong. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4):5–33, 1996.
[41] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, pages 97–104. ACM, 2013.
[42] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality assessment for linked data: A survey. Semantic Web, 7(1):63–93, 2015.