
Semantic Web 1 (2015) 1–5 1 IOS Press

A Comparative Survey of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

Michael Färber ∗,∗∗, Basil Ell, Carsten Menne, and Achim Rettinger Karlsruhe Institute of Technology (KIT), Institute AIFB, 76131 Karlsruhe, Germany

Abstract. In recent years, several noteworthy large, crossdomain and openly available knowledge graphs (KGs) have been created. These include DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Although extensively in use, these KGs have not been subject to an in-depth comparison so far. In this survey, we first define aspects according to which KGs can be analyzed. Next, we analyze and compare the above mentioned KGs along those aspects and finally propose a method for finding the most suitable KG for a given setting.

Keywords: Knowledge Graph, Comparison, DBpedia, Freebase, OpenCyc, Wikidata, YAGO

* Corresponding author. E-mail: [email protected].
** This work was carried out with the support of the German Federal Ministry of Education and Research (BMBF) within the Software Campus project SUITE (Grant 01IS12051).

1570-0844/15/$27.50 © 2015 – IOS Press and the authors. All rights reserved
M. Färber et al. / A Comparative Survey of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

1. Introduction

The idea of the Semantic Web is that of publishing and querying knowledge on the Web in a semantically structured way. According to Guns [27], the term “Semantic Web” was already being used in fields such as educational psychology before it became prominent in computer science. Freedman and Reynolds [25], for instance, describe “semantic webbing” as organizing information and relationships in a visual display. Berners-Lee presented his idea of using typed links as a vehicle of semantics for the first time at the World Wide Web Fall 1994 Conference under the heading “Semantics,” and under the heading “Semantic Web” in 1995 [27]. The idea of a Semantic Web was introduced to a wider audience by Berners-Lee in 2001 [11]. According to his vision, the traditional Web as a Web of Documents should be extended to a Web of Data where not only documents and links between documents, but any entity (e.g., a person or organization) and any relation between entities (e.g., isSpouseOf) can be represented on the Web.

When it comes to realizing the idea of the Semantic Web, knowledge graphs (KGs) are currently seen as one of the most essential components. We define a knowledge graph as a knowledge base (KB) (defined as the combination of an ontology and instances of the classes in the ontology [59, p. 739]) consisting of a large number of facts about entities. Besides domain-specific KGs, general (i.e., encyclopedic or crossdomain) knowledge is often covered in openly available KGs, as DBpedia exemplifies. This makes KGs widely applicable: not only a small set of users (as in the case of expert systems) benefits from the stored structured knowledge (e.g., via specific search interfaces of expert systems), but any person with access to the Web can benefit, e.g., by using Web search functionalities where semantic queries against a KG extend traditional information retrieval queries on documents.

In this survey, we focus on those KGs (i) that are freely accessible and freely usable, (ii) that incorporate the Semantic Web standards to some extent, such as modeling with RDF¹ and querying with SPARQL,² and (iii) that do not cover special domains such as the biomedical domain, but instead cover general knowledge (often also called crossdomain or encyclopedic knowledge).

Thus, out of scope are KGs which are not openly available, such as the Google Knowledge Graph,³ the Google Knowledge Vault [21], and the Facebook Graph,⁴ as well as KGs which are not based on Semantic Web standards at all or are only accessible via an API (see WolframAlpha⁵). Also excluded are unstructured or weakly structured knowledge collections.

For selecting the KGs for analysis, we regarded all datasets which were registered at the online dataset catalog http://datahub.io⁶ and which were tagged as “crossdomain”. Besides that, we took further datasets into consideration which fulfilled the above-mentioned requirements (e.g., Wikidata). In total, we nominated DBpedia, Freebase, OpenCyc, Wikidata, and YAGO as KGs for our comparison.

In this paper, we give a systematic overview of these KGs in their current versions and discuss how the facts of these KGs are modeled, stored, and queried. Note that the focus of this survey is not the life cycle of KGs on the Web or in enterprises. For this topic, we refer to [8].

Besides juxtaposing the characteristics of the KGs, we provide a recipe for users who are interested in using one of the mentioned KGs in a research or industrial setting, but who are unsure which KG to choose for their concrete setting.

The main contributions of this survey are:

1. We define 35 aspects (characteristics) according to which KGs can be analyzed.
2. We analyze DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along these aspects.
3. We propose a checklist which enables users to find the most suitable KG for their needs.

The organization of this survey is as follows:

– In Section 2 we describe the genesis of semantic data models and provide a definition for both semantic data models and graph models, since KGs are realizations of both models.
– In Section 3 we describe aspects by which knowledge graphs can be analyzed.
– In Section 4 we describe the knowledge graphs we analyze.
– In Section 5 we analyze the knowledge graphs along the aspects listed in Section 3.
– In Section 6 we present a guideline to assess the knowledge graphs according to the user’s setting.
– In Section 7 we outline current limitations of KGs.
– In Section 8 we glance over the possible future of the KGs and of the Semantic Web.
– In Section 9 we conclude the survey.

¹ See http://www.w3.org/RDF/ (accessed June 16, 2015).
² See http://www.w3.org/TR/rdf-sparql-query/ (accessed June 16, 2015).
³ See http://www.google.com/insidesearch/features/search/knowledge.html
⁴ See https://developers.facebook.com/docs/graph-
⁵ See http://products.wolframalpha.com/api/
⁶ This catalog is also used for registering Linked Data datasets.

2. Semantic Data Models and Graph Models

Two data model types are especially relevant with respect to KBs and, hence, to KGs: semantic data models and graph data models. In this section, we first describe the genesis of semantic data models and show how both semantic data models and graph data models have been defined. For an in-depth introduction to semantic data models and graph data models, the interested reader is referred to [47] and [5], respectively.

2.1. Genesis of Semantic Data Models

The evolution from database (DB) design toward KB design is coupled with increasing abstraction layers. In the early stages of DB design, data models were conceptually close to the physical layer of data storage. After the basic physical models, hierarchical models [63] became prominent. They were superseded conceptually by network models [62] and later by relational models [17]. Contrary to the hierarchical and network models, the relational model was no longer located on the primitive record level, but between the physical and logical level, although still affiliated to the record level.

The first essential landmarks of semantic data modeling arose in the mid-seventies, when databases increasingly supported the user’s view on the data. Then, new paradigms enriched semantic data modeling over time. The following concepts are notable in this respect:

1. The idea of data independence (see [16]) states that data is not modeled in the way required by the storage architecture, but according to the user’s application (i.e., having entities and relationships among them). This application-oriented paradigm was accompanied by the emergence of new programming languages such as C (developed between 1969 and 1973) and Smalltalk, which were more abstract than traditional, data-storage-oriented languages such as COBOL.
2. The idea of semantic “injection”/enrichment (see for instance [54] and [56]) states the following: although semantics was encoded into data models only on a low level (i.e., for single data items) up to the mid-seventies, new approaches considered semantic relations between data items and how to model interrelational dependencies. By implementing rules which are based on these interrelational dependencies, consistency checks were made possible. We can mention the following notable modeling approaches as steps in the evolution of semantic data modeling:
   (a) Schmid [54] introduced the idea to model basic semantic properties that entities of a certain class (e.g., person) may have, as well as relationships that entities of certain classes may have (e.g., the has-spouse relationship).
   (b) Smith [56] introduced generalization and aggregation as new forms of abstraction: a) Generalization is used to express similarities (see Figure 1a) and is modeled between classes: one class (e.g., carnivores) is a subclass of another class (e.g., animals) and shares properties with the superclass. b) An aggregation is the composition of an object from a set of objects. The aggregation class (see Class in Figure 1b) stands as a whole unit in place of its components (in Figure 1b, Instructor and Course).
   (c) Brodie [14] introduced classification and association relationships as further modeling approaches: a) Classification means to assign a class to an entity (e.g., Markus is a person). b) An association is a relationship between classes and describes the connections between classes in terms of the shared semantics and structure. Every aggregation or composition is an association.
3. Later, in the 1980s, object-oriented models [36] appeared: data was considered as collections of objects of specific classes. In parallel to the rise of object-oriented models, graph models appeared. With this model, users were able to represent the inherent graph structure of data.
4. Afterwards, other models such as semi-structured models [15] and the XML model [13] were proposed.
5. The Resource Description Framework (RDF) was originally published by the W3C as a recommendation in 1999 [40] and in a new version in 2004 [37]. In 2014, RDF 1.1 [20] was published. RDF forms the basis of the semantic graph model as we consider it in this survey.

Fig. 1. Examples for generalization and aggregation. (a) Generalization: the class Animals with the subclasses Carnivores and Herbivores. (b) Aggregation: the aggregation class Class composed of Instructor and Course.

2.2. Definition of the Semantic Data Model

According to Hammer [29], semantic data models are characterized as data models adhering to the following principles:

1. A database is a collection of entities that correspond to actual objects in the application environment.
2. Entities in the database are organized into classes.
3. Classes may be interconnected.
4. Entities and classes are characterized by relations (called attributes by Hammer) and relations may interconnect entities.
5. Relations can be derived from other relations via entailment.

Well-known examples of the semantic data model are the entity-relationship model [16] and RDF.⁷ Further examples are the IFO model [1] and SDM [30].

⁷ See http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/ (accessed July 22, 2015).
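The five principles attributed to Hammer in Section 2.2 can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions: the class `SemanticDatabase`, its method names, and the entities `leo` and `anna` are hypothetical and do not stem from any of the surveyed KGs; the subclass hierarchy is assumed to be acyclic.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticDatabase:
    """Toy store illustrating the five principles (all names are hypothetical)."""
    classes: dict = field(default_factory=dict)    # class name -> set of entities (principle 2)
    subclass_of: set = field(default_factory=set)  # (subclass, superclass) pairs (principle 3)
    facts: set = field(default_factory=set)        # (subject, relation, object) triples (principle 4)

    def add_entity(self, entity, cls):
        # Principles 1 and 2: an entity stands for a real-world object and belongs to a class.
        self.classes.setdefault(cls, set()).add(entity)

    def add_subclass(self, sub, sup):
        # Principle 3: classes are interconnected, here via generalization.
        self.classes.setdefault(sub, set())
        self.classes.setdefault(sup, set())
        self.subclass_of.add((sub, sup))

    def add_fact(self, subject, relation, obj):
        # Principle 4: relations interconnect entities.
        self.facts.add((subject, relation, obj))

    def instances_of(self, cls):
        # Principle 5: class membership is derived along subclass links
        # (assumes an acyclic hierarchy).
        result = set(self.classes.get(cls, set()))
        for sub, sup in self.subclass_of:
            if sup == cls:
                result |= self.instances_of(sub)
        return result

    def symmetric_closure(self, relation):
        # Principle 5: derive the inverse direction of a relation declared symmetric.
        derived = {(o, r, s) for (s, r, o) in self.facts if r == relation}
        return self.facts | derived


db = SemanticDatabase()
db.add_subclass("Carnivore", "Animal")
db.add_entity("leo", "Carnivore")
db.add_entity("markus", "Person")
db.add_entity("anna", "Person")
db.add_fact("markus", "has-spouse", "anna")

animals = db.instances_of("Animal")
spouse_facts = db.symmetric_closure("has-spouse")
```

Here `animals` contains `leo` although it was only asserted to be a carnivore, and `spouse_facts` additionally contains the derived triple with `anna` as subject. Both are simple instances of entailment; real KGs typically delegate such derivations to a reasoner rather than to hand-written methods.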

2.3. Definition of the Graph Data Model

When deciding on a data model, the choice typically depends on (i) the modeling domain, (ii) the end users, and (iii) the hardware and software constraints. Codd [18] distinguishes three components each data model possesses: 1) the set of data structure types, 2) the set of operators or inference rules, and 3) the set of integrity rules. Angles and Gutierrez [5] define the graph data model along this categorization:

1. Data structure types: In the graph data model, instance data and/or schema information is represented by graphs, or by data structures which generalize the notion of a graph.
2. Operators and inference rules: Data is manipulated via graph transformation operators (see [28]).
3. Integrity rules: Rules can be constructed which ensure data consistency.

In graph data models, the semantics comes into play as follows: (i) graph nodes are interpreted as entities or values; (ii) typed relations between nodes are interpreted as facts about the involved entities; and (iii) a schema is introduced by assigning types to instances and introducing relations among classes. The information focus supported by the graph data model therefore encompasses the schema, the instances, and the relations.

Established graph models are GOOD [28], GMOD [4], G-Log [46], Gram [3], and RDF. They all use a labeled graph for modeling both the schema and the instance level.

KGs, as we consider them in this survey, are realizations of both semantic data models and graph models, since the KGs are characterized by having a set of entities, a set of classes, a set of relations between entities, and a set of relations between entities and classes.

3. Criteria for Comparison

Several works compared semantic data models:

– Brodie [14] categorized semantic data models into: (i) classical models, (ii) mathematical models, (iii) irreducible data models, (iv) static semantic hierarchy models, and (v) dynamic semantic hierarchy models.
– Tsichritzis and Lochovsky [64] categorized semantic data models into: (i) traditional models, (ii) entity-relationship models, (iii) binary models, (iv) semantic network models, and (v) infological data models.
– Hull and King [31] compared semantic data models according to the features they provide, such as aggregation, grouping, printable, object-valued, and multi-valued.
– Kerschberg et al. [35] analyzed data models according to mathematical foundations, terminology, and semantic levels of abstraction, and distinguished between graph theoretic and set theoretic models.

These approaches consider neither current KGs with their data models nor ontologies. Also, the criteria used for comparing semantic data models are very abstract, since a wide range of data models are compared against each other. Besides these works, there are approaches which explicitly analyze (and sometimes assess) ontologies, but not KGs. Some of them are mentioned in the following (see also [12] for an overview of ontology evaluation):

– Tartir et al. [61] introduced the approach OntoQA, by which ontology schemas and their populations (i.e., KBs) can be analyzed through a set of metrics so that key characteristics of an ontology schema can be highlighted. The analysis focused on numerically expressible characteristics of ontologies, and results regarding the ontologies SWETO, TAP, and GlycO were presented.
– Lozano-Tello et al. [42] proposed ONTOMETRIC, which allows users to measure the suitability of existing ontologies with regard to the requirements of their systems. ONTOMETRIC presents a generic methodology and does not analyze specific ontologies.
– Vrandecic et al. [66] differentiated between structural and ontological metrics and provided principal means for the definition of metrics that take the semantics of the ontology appropriately into account.
– Poveda-Villalon et al. [48] present a tool called OOPS!, by which an RDF document describing an ontology can be analyzed. Potential pitfalls that could lead to modeling errors are then presented to the user.

Since these approaches focus only on ontologies, we cannot compare the used datasets. Also, the criteria for comparison differ from ours, since we do not only focus on the schema. To the best of our knowledge, a systematic comparison of openly available knowledge graphs has not been carried out so far. Therefore, we systematically analyze and compare knowledge graphs according to aspects of the following categories:

– General information: What general properties does the KG have?
– Format and representation: How are facts represented, stored, and queried?
– Genesis and usage: How was the KG created and how is it used?
– Entities: How are entities represented and described in the KG?
– Relations: How are relations represented and described in the KG?
– Schema: What are the features of the schema of the KG?
– Particularities: What particularities (special features) does the underlying data model of the KG have?

In the following, we list the criteria we use for comparing the different KGs, grouped by the categories mentioned.

3.1. General Information

We use the following aspects to collect general information about the KGs:

1. Homepage: The URL where the KG can be accessed.
2. Current version: The version of the knowledge base we consider in this survey.
3. Languages: What languages (e.g., English) are used in the KG on schema and instance level?
4. Covered domains: Which domains are covered by the KG? Are there any fields where the KB is only rudimentarily populated?
5. License: Under which license is the content of the KG provided?

3.2. Format and Representation

For comparing the different approaches for representing, storing, and querying knowledge, we use the following aspects:

1. Fact representation: Facts can be represented as triples, quadruples, or similar.
2. Dataset formats: The data storage format (e.g., JSON) in which data is provided.
3. Dynamicity: Is the KG updated continuously (dynamic KG) or are only fixed versions of the KG offered (static KG)?
4. HTTP lookup: Is machine-readable information about resources available via live HTTP lookup (i.e., querying on demand in order to follow the Linked Data principles [9], so that no export functionality or file download is needed)?
5. RDF export: Is data available as RDF export, either via files or via a SPARQL endpoint?
6. Software for data storage: Which software is used for storing and querying the KG?
7. Query language (online): Each KG may provide one or several query languages in which queries against the KG are formulated.
8. Size of schema and instance graph: How many classes and relations are in the KG, how many facts, and how many unique instances?

3.3. Genesis and Usage

Where the stored facts in the KGs come from and where they are applied is addressed by these aspects:

1. Provenance of facts: Is the KG content derived from unstructured or semi-structured data by information extraction techniques or is it gathered manually by users and/or bots?
2. Quality assurance of facts: Are there any restrictions or constraints regarding the quality of stored knowledge? If the correctness of facts is ensured, how and with what precision is this performed?
3. Software projects: Which software projects make use of the KG?
4. Influence on other LOD datasets: Which other KG-building initiatives take the KG as a starting point?

3.4. Entities

The following aspects address the characteristics of the entities in the KGs:

1. Entity reference: What kind of IDs are used to refer to entities?
2. LOD registration: Is the dataset registered at http://datahub.io as part of the Linked Open Data (LOD) cloud?⁸
3. LOD linkage: Are entities linked to entities of other KGs in the LOD cloud?
4. Entity relevance: Is the ordering or ranking of entities according to some function such as a relevance function supported?
5. Description of entities: Are entities human-readably described within the KG, e.g., via textual descriptions? What format is used for that?

⁸ The Linked Open Data (LOD) cloud is a collection of datasets published on the Web following the Linked Data principles [9]. The LOD cloud project originated from the W3C Linking Open Data project (see http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData) and currently consists of almost 1000 datasets; see http://lod-cloud.net/.

3.5. Relations

The following aspects address the characteristics of the relations in the KGs:

1. Relation reference: What kind of IDs are used to refer to relations?
2. Relation relevance: Is the ordering or ranking of relations according to relevance supported? In this way, relations can be declared as more important, for instance since they are more relevant to most users than other relations.
3. Description of relations: Are relations human-readably described within the KG, e.g., via textual descriptions? What format is used for that?

3.6. Schema

The characteristics of the schema of the different KGs can be addressed by the following aspects:

1. Schema restrictions: Is a fixed schema used or can the schema be extended by users?
2. Schema constraints: Are there any schema constraints which need to be observed? May the KG contain data that is inconsistent regarding the schema? For example, if, according to the (logical) schema constraints, an entity may only occur once as subject of a certain relation such as has-spouse, but within the KG it occurs several times as subject with different objects (which are explicitly defined as different entities), then the KG contains data which violates the schema.
3. Hierarchy and network of relations: Does the KG contain relations among relations, e.g., a taxonomy of relations (sub-relation, super-relation) or other types of relations (e.g., inverse relation)?
4. External vocabulary: Is external vocabulary (classes or relations that belong to existing schemas, e.g., schemas from datasets in the LOD cloud) reused in the KG?
5. Description of concepts: Are concepts described within the KG, e.g., via textual descriptions? What format is used for that?
6. Forms of abstraction: Is classification, generalization, aggregation, or association supported?
7. Data types: Which data types are used in the KG?

3.7. Particularities

Here, particularities of the analyzed KGs are identified which are not covered by the other aspects. Aspects anticipated before the analysis were:

1. Temporal aspects: Facts that may change over time (e.g., a country’s president) may be annotated according to the time when the fact was valid (e.g., time interval). Furthermore, other time-related information about a fact may be stored, such as the point in time when the fact was added to the KG or when the fact was updated.
2. Source of facts: Is it stored where the knowledge in the KG was retrieved from (e.g., the document it was extracted from)?
3. Reification: Is it possible to represent statements about statements? Reification here means to have a means for referring to a statement via an identifier, thus enabling the formulation of statements about statements.

4. Selection of KGs

We consider the following knowledge graphs for our comparative evaluation:

– DBpedia: DBpedia⁹ is the most popular and prominent KG in the LOD cloud [7]. The project was initiated by researchers from the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software. Since the first public release in 2007, DBpedia has been updated roughly once a year.¹⁰ DBpedia is created from automatically extracted structured information contained in Wikipedia, such as from infobox tables, categorization information, geo-coordinates, and external links. Due to its role as the hub of LOD, DBpedia contains many links to other datasets in the LOD cloud such as Freebase, OpenCyc, UMBEL,¹¹ GeoNames,¹² Musicbrainz,¹³ CIA World Factbook,¹⁴ DBLP,¹⁵ Project Gutenberg,¹⁶ DBtune Jamendo,¹⁷ Eurostat,¹⁸ Uniprot,¹⁹ and Bio2RDF.²⁰ DBpedia is used extensively in the Semantic Web research community, but is also relevant in commercial settings: companies such as the BBC [38] and the New York Times [52] use it to organize their content.
– Freebase: Freebase²¹ is a KG announced by Metaweb Technologies, Inc. in 2007 and was acquired by Google Inc. on July 16, 2010. In contrast to DBpedia, Freebase provided an interface that allowed end-users to contribute to the KG by editing structured data. Besides user-contributed data, Freebase integrated data from Wikipedia, NNDB,²² FMD,²³ and MusicBrainz.²⁴ Freebase uses a proprietary graph model which can also store complex statements. On December 16, 2014, the Freebase team announced that Freebase will shut down its services on June 30, 2015. Wikimedia Deutschland and Google plan to integrate Freebase data into Wikidata in the near future (a tool for that is to be developed by August 2015) and to close the Freebase website three months later at the earliest.²⁵
– OpenCyc: The Cyc²⁶ project started in 1984 as part of the Microelectronics and Computer Technology Corporation. The aim of Cyc is to store (in a machine-processable way) millions of common sense facts such as “Every tree is a plant.” While the focus of Cyc in the first decades was on inferencing and reasoning, more recent work puts a focus on human interaction, such as building question answering systems based on Cyc. Since Cyc is proprietary, a smaller version of the KG called OpenCyc²⁷ was released under the open source Apache license. In July 2006, ResearchCyc²⁸ was published for the research community, containing more facts than OpenCyc.
– Wikidata: Wikidata²⁹ is a project of Wikimedia Deutschland which started on October 30, 2012. The aim of the project is to provide data which can be used by any Wikimedia project, including Wikipedia. Wikidata does not only store facts, but also the corresponding sources, so that the validity of facts can be checked. Labels, aliases, and descriptions of entities in Wikidata are provided in more than 350 languages. Wikidata is a community effort, i.e., users collaboratively add and edit information. Also, the schema is maintained and extended based on community agreements. In the near future, Wikidata will grow due to the integration of Freebase data.
– YAGO: YAGO (Yet Another Great Ontology) has been developed at the Max Planck Institute for Computer Science in Saarbrücken since 2007. YAGO comprises information extracted from Wikipedia (e.g., categories, redirects, infoboxes), WordNet [23] (e.g., synsets, hyponymy), and GeoNames.³⁰ As of March 24, 2015, YAGO3 is available.³¹

⁹ See http://dbpedia.org
¹⁰ There is also DBpedia live, which started in 2009 and which is updated when Wikipedia is updated. See http://live.dbpedia.org/.
¹¹ See http://umbel.org/
¹² See http://www.geonames.org/
¹³ See http://musicbrainz.org/
¹⁴ See https://www.cia.gov/library/publications/the-world-factbook/
¹⁵ See http://www.dblp.org
¹⁶ See https://www.gutenberg.org/
¹⁷ See http://dbtune.org/jamendo/
¹⁸ See http://eurostat.linked-statistics.org/
¹⁹ See http://www.uniprot.org/
²⁰ See http://bio2rdf.org/
²¹ See http://freebase.com/
²² See http://www.nndb.com
²³ See http://www.fashionmodeldirectory.com/
²⁴ See http://musicbrainz.org/
²⁵ See https://plus.google.com/u/0/109936836907132434202/posts/bu3z2wVqcQc and https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase.
²⁶ See http://www.cyc.com/
²⁷ See http://www.opencyc.org/
²⁸ See http://research.cyc.com/
²⁹ See http://wikidata.org/.
³⁰ See www.geonames.org/
³¹ See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/

5. Comparison

In Tables 1–7 we summarize our comparison of the knowledge graphs listed in Section 4 using the aspects given in Section 3. An online version is also available at http://kg-wiki.org. In the following subsections, we provide a detailed analysis in terms of the different aspects of the introduced evaluation categories.

5.1. Comparison of General KB Information

The following findings are notable with regard to general information of the considered KGs (cf. Table 1; information about the homepage and the considered KG version is not discussed any further):

– Language support: Most KGs either only support the English language (such as OpenCyc), or other languages than English are added on top. For Freebase and YAGO, entity and property labels for additional languages are provided. Remarkable in this context is Wikidata in terms of the number of languages supported and in terms of its language-agnostic KG model (cf. the identifiers for entities and relations, which intentionally consist of a character and a number).
– Covered domains: Since we restricted our KG choice to crossdomain KGs, all considered KGs contain general knowledge, for instance general information about instances of persons such as Barack Obama. Besides specific domains such as the biomedical domain, knowledge consisting of class relationships (such as “A human has two legs” or “A child is a human”) and linguistic knowledge (relationships between linguistic concepts, such as the relationship between “to compose” and “to write”) were also excluded from this survey.
  Although the scope of the considered KGs is broad and unrestricted in nature, we can make statements about the “relative filling degrees” (in terms of number of entities or number of statements) with respect to particular parts of the KGs and, hence, about the maturity of the considered KGs:
  Firstly, Wikidata is still in a start-up phase in the sense that not all subdomains (indicated by the classes) are covered in depth. Wikidata is especially well populated in fields such as “Person” and biological entities, but provides only rudimentary information about entities in fields such as society. All other KGs can be classified as mature, since they have not only existed for a rather long time, but are also well positioned in all general domains.
  Secondly, OpenCyc can be seen as mature, but consists of much schema information and is, in terms of the entities and hence in the sense of a KG, rather a collection of entities belonging to different classes. Hence, OpenCyc is predestined for reasoning, but not so much for entity retrieval purposes.
– License: Data of all considered KGs except Wikidata is licensed under the Creative Commons Attribution 3.0 license,³² which means that users are allowed to use the data in private and commercial settings and to modify the data. In the case of Wikidata, all structured data of the main namespace and the property namespace is licensed under Creative Commons CC0,³³ while text of all other namespaces of Wikidata is available under the Creative Commons Attribution/Share-Alike License.³⁴ The Creative Commons CC0 license allows providers to waive as many rights as legally possible and is used especially for databases.
  In summary, we can state that all considered KGs can be used without expenses, but in return appropriate credit has to be given and the same license has to be used for further usage.
  Interesting in the context of KG licenses is the study of Jain et al. [32], who studied the applicability of well-known Linked Data datasets for commercial applications. The authors conclude that the crucial point is not the technical issues of deploying and using Linked Data datasets, but legal aspects. Often, the license under which a Linked Data dataset can be reused is not specified by the data providers.

³² See https://creativecommons.org/licenses/by/3.0/.
³³ See https://creativecommons.org/publicdomain/zero/1.0/.
³⁴ See https://creativecommons.org/licenses/by-sa/3.0/ and https://www.wikidata.org/wiki/Wikidata:Database_download/en#License.

5.2. Comparison of Format and Representation

– Fact representation: The KGs DBpedia and OpenCyc store facts as single triples and do not regard additional meta-information about facts such as the confidence of the triples being correct or temporal information related to the facts (e.g., the validity time).

Table 1: Comparison of the KGs regarding their general information.

Homepage
– DBpedia: http://dbpedia.org
– Freebase: http://freebase.com (a)
– OpenCyc: http://opencyc.org
– Wikidata: http://wikidata.org
– YAGO3: http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/

Current version
– DBpedia: DBpedia 2015-04
– Freebase: continuously updated until Mar 31, 2015
– OpenCyc: OpenCyc 4.0
– Wikidata: continuously updated since Oct 2012
– YAGO3: YAGO3

Languages
– DBpedia: “Main” DBpedia is English (properties etc.), but localized versions are available in 125 languages (localized are textual descriptions such as rdfs:label, rdfs:comment, and dbpedia-owl:abstract; there are also links to local versions of Wikipedia)
– Freebase: human readable IDs are in English, but every entity and property has an i18n in many languages
– OpenCyc: English
– Wikidata: almost every language (by community), even dialects
– YAGO3: all entity names are from English Wikipedia, some rdfs:label values have different languages

Covered domains
– DBpedia: General knowledge
– Freebase: General knowledge, very broad, sometimes deep
– OpenCyc: Common sense
– Wikidata: General knowledge
– YAGO3: General knowledge

License (content)
– DBpedia: Creative Commons Attribution-ShareAlike 3.0, GNU Free Documentation License
– Freebase: Creative Commons Attribution Only
– OpenCyc: Creative Commons Attribution 3.0
– Wikidata: Creative Commons CC0 1.0 Universal (CC0 1.0 Public Domain Dedication)
– YAGO3: Creative Commons Attribution 3.0

(a) Google announced to close Freebase on June 30, 2015. However, currently (July 30, 2015) it is still available.

Table 2: Comparison of the KGs regarding format and representation. The table compares DBpedia, Freebase, OpenCyc, Wikidata, and YAGO3 along the aspects fact representation, dataset formats, dynamicity, HTTP lookup, RDF export, software for data storage, query language (online), and size of schema and instance graph.

Fact representation
– DBpedia: Triple
– Freebase: Triple with confidence values
– OpenCyc: Triple
– Wikidata: Every entity has multiple statements. Each statement has one claim and one or more references. The claim consists of a property and a value, accompanied by qualifiers.
– YAGO3: “SPOTL tuple” (SPO triple with time and location)

Dynamicity
– DBpedia: Static
– Freebase: Continuously updated
– OpenCyc: Static
– Wikidata: Continuously updated
– YAGO3: Static

Size of schema and instance graph
– DBpedia: 4.58 Mio entities, 685 classes, 1,079 object properties, 1,600 data type properties, 116 specialized data type properties
– Freebase: 1.9 Bio triples
– OpenCyc: 239k terms, 2 mio triples, 47k links to DBpedia
– Wikidata: 63.2 Mio statements
– YAGO3: more than 10 Mio entities, more than 120 Mio facts

(a) Wikidata RDF exports: https://tools.wmflabs.org/wikidata-exports/rdf/

In contrast to these KGs stand Freebase, YAGO, and Wikidata: For each triple, Freebase also stores a confidence value. The authors of YAGO use the so-called SPOTL(X) tuples for representing spatio-temporally enhanced facts (with the elements subject (S), predicate (P), object (O), time (T), and location (L)). For Wikidata, a model is used where each statement consists of a claim that something is the case and a list of references providing evidence for that claim.

– Dataset formats: For all considered KGs, data is available in RDF format: Data from DBpedia and Freebase is available in the form of RDF files,³⁵ data from OpenCyc is available as OWL files,³⁶ and YAGO is available as both TSV files and RDF files.³⁷ Wikidata has a special position here: It is developed on the basis of Wikibase,³⁸ a proprietary data model for Wikidata. This data model is per se not based on the RDF format. However, unofficial RDF export files of Wikidata³⁹ (and some SPARQL endpoints⁴⁰) are provided.

– Dynamicity: Many KGs are static in the sense that they are not continuously updated. One reason is that some KGs such as DBpedia are created by computationally expensive information extraction processes. Therefore, DBpedia is static; however, DBpedia live, a derived version of DBpedia, is continuously updated. For that, Wikipedia provides an OAI-PMH update stream, by means of which 84 articles are analyzed per minute.⁴¹ Freebase and Wikidata are dynamic KGs, since their data is maintained by a user community. In these KGs, even the schema is extended by the users.

– HTTP Lookup: For all considered KGs, data is made available via HTTP lookups on demand: Given a resource (of a KG which is part of the LOD cloud) identified by an HTTP URI, data about this resource can be obtained by dereferencing this URI.⁴² Typically, the returned information is made available using W3C standards such as RDF. The idea of dereferencing is a crucial point of the Semantic Web vision: In this way, agents can traverse the LOD graph (i.e., follow links within and across single LOD datasets) and gather the information which they need and which is available in the LOD cloud. HTTP lookups on demand are possible for all KGs considered in this survey, thus allowing for data exports.

– RDF export: Besides the HTTP lookup availability, data from the KGs is also made available as files. The idea here is that the KG data can also be downloaded and processed locally instead of being retrieved via HTTP lookups on demand; this includes working on the files directly or importing the data into an appropriate database such as a triple store. In this way, queries can be set up and the hardware load is on the client side.

– Data storage software: Data is stored using different systems: While DBpedia uses Virtuoso Universal Server⁴³ and its available RDF dumps can be loaded into any triple store (such as Virtuoso or 4store⁴⁴), all other considered KGs (Freebase, OpenCyc, Wikidata, and YAGO3) are, due to the different data models they use internally, based on proprietary software systems. However, the provided RDF dumps of these KGs can be loaded into any triple store.

– Query language (online): Although all considered KGs support RDF as data format and are available online for HTTP lookups, not all online versions of the KGs are offered with a SPARQL endpoint: Only DBpedia and YAGO are queriable in this way.⁴⁵ For Wikidata, several unofficial SPARQL endpoints are available.⁴⁶ Clients who want to query the KGs Freebase and OpenCyc by means of SPARQL, and therefore rely on a W3C Recommendation instead of a proprietary query language, need to load the provided RDF data into a triple store. This procedure is suggested by the authors of many KGs in case of permanent, extensive querying, since in this way the users provide the hardware resources themselves. Furthermore, it should be noted that in case of OpenCyc the language CycL⁴⁷ was developed to enable expressive reasoning.

– Size of schema and instance graph: In our analysis, we only took KGs into consideration which are already widely used and which are representative for open, semantically structured datasets on the Web. All considered KGs are therefore large, and a comparison of size per se is not reasonable, since the KGs often cover different domains to a different extent or emphasize different levels of knowledge. Note that the numbers given in the table with respect to the KG size are not directly comparable: A complex fact may be represented and counted as one statement in one KG, but represented and counted as multiple statements in another KG. As already outlined above under coverage, all considered KGs fulfill the requirements of being a KG. Outstanding are the KGs Wikidata and OpenCyc: While Wikidata is not mature in all areas, but very focused on instances, the primary focus of OpenCyc is schema information; however, it contains many instances and is therefore numbered among the KGs.

³⁵ See http://wiki.dbpedia.org/Downloads2014, https://developers.google.com/freebase/data, http://tools.wmflabs.org/wikidata-exports/rdf/.
³⁶ See http://sw.opencyc.org/.
³⁷ See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/.
³⁸ https://www.mediawiki.org/wiki/Wikibase/DataModel
³⁹ See http://tools.wmflabs.org/wikidata-exports/rdf/ and [22].
⁴⁰ See https://www.wikidata.org/wiki/Wikidata:Data_access/en.
⁴¹ See http://wiki.dbpedia.org/online-access/DBpediaLive.
⁴² See http://tools.ietf.org/html/rfc3986#section-1.2.2 for more information about dereferencing URIs.
⁴³ See http://virtuoso.openlinksw.com/
⁴⁴ See http://4store.org/
⁴⁵ See http://dbpedia.org/sparql and http://lod2.openlinksw.com/sparql.

5.3. Comparison of Genesis and Usage

– Provenance of facts: For covering knowledge about general domain entities, as done primarily by DBpedia, Freebase, Wikidata, and YAGO, Wikipedia content is exploited to some extent with the help of tools. For creating a more formal-logical representation of knowledge, experts need to be consulted, as the case of Cyc/OpenCyc demonstrates. In the case of Freebase, the provenance data stored for facts includes, for example, the IDs of the users that added the facts. For Wikidata, references can be attached to each statement (consisting, e.g., of a property and a value, such as (country, Germany)); these references reveal the source, and therefore indirectly the trustworthiness, of the statement.

– Quality assurance: The quality assurance of facts can be aligned with the two general ways of fact provisioning (see also the aspect Provenance of facts): (i) Knowledge for the KG is extracted automatically from a source such as Wikipedia. In that case, no quality assurance check is implemented, but a posteriori evaluations confirmed a sufficiently high average accuracy across the KG YAGO [43]. (ii) Knowledge is gathered by user contributions. In those cases (see Freebase, Cyc, and Wikidata) no fact consistency checks are applied, but the correctness is based on the trustworthiness of the contributors. In general it is not possible to prioritize one of these two ways a priori. Using solely approach (i) is only feasible if the information is already available in semi-structured formats (as in case of Wikipedia-DBpedia), so that the proportion of incorrect facts in the KG is kept small.

– Software projects: All considered KGs are exploited in many ways in research projects of universities and in industry, so that we only present projects which are commonly known in the community. Notable is in particular the project IBM Watson,⁴⁸ which uses several of the considered KGs (namely, DBpedia and YAGO so far). For a description of applications of Linked Data in industry in general, we refer to the use cases listed by the W3C.⁴⁹

– Influence on other LOD datasets: Data of the single KGs has been reused in other data sources of the LOD cloud, especially in datasets which focus more on the integration of multiple datasets than on building a genuinely own knowledge base (see UMBEL⁵⁰ and BabelNet⁵¹ as examples).

⁴⁶ See https://www.wikidata.org/wiki/Wikidata:Data_access for an overview and http://wdqs-beta.wmflabs.org/ for an example.
⁴⁷ See http://www.cyc.com/documentation/ontologists-handbook/cyc-basics/syntax-/.
⁴⁸ See http://www.aaai.org/Magazine/Watson/watson.php
⁴⁹ See http://www.w3.org/2001/sw/sweo/public/UseCases.
⁵⁰ See http://www.umbel.org/.
⁵¹ See http://babelnet.org/.

Table 3: Comparison of the KGs regarding genesis and usage.

Provenance of facts
  DBpedia: Automatically extracted from Wikipedia
  Freebase: Initially information is gathered from Wikipedia, MusicBrainz etc.; new information by algorithms, the Freebase team, and the community
  OpenCyc: Filled by experts
  Wikidata: Data is maintained by users and bots
  YAGO3: Wikipedia, Wordnet, Geonames gazetteer
Quality ensurance of facts
  DBpedia: quality depends on Wikipedia content and on extraction algorithms/template mappings
  Freebase: the trusted “Freebase experts” keep an eye on changes, scripts look for incorrect data
  OpenCyc: no
  Wikidata: Users should only add verifiable information from sources such as books, scientific publications, or newspaper articles; as in the original Wikipedia, data is controlled by the community
  YAGO3: no ensurance, evaluation of correctness for YAGO2 > 95 %
Software projects
  DBpedia: DBpedia Spotlight (a), Wikipedia Miner (b), IBM Watson (d)
  Freebase: Google Knowledge Vault, Bing
  OpenCyc: Terrorism Knowledge Base
  Wikidata: some first prototypes
  YAGO3: YAGO NAGA, IBM Watson (d), Broccoli (c)
Influence on other LOD datasets
  Datasets named: DBpedia, Freebase, Wikidata, YAGO, YAGO3, UMBEL, SUMO (e)
(a) See http://spotlight.dbpedia.org/
(b) See http://wikipedia-miner.cms.waikato.ac.nz/
(c) See broccoli.informatik.uni-freiburg.de
(d) See http://www.ibm.com/smarterplanet/us/en/ibmwatson/
(e) See http://www.adampease.org/OP/
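The fact representations compared in Section 5.2 can be made concrete with a few plain data structures. The following is a minimal sketch, not the internal format of any of the KGs: the class and field names are illustrative assumptions, the reference URL is a placeholder, and only the elements named in the text (triple, confidence value, SPOTL elements, claim with qualifiers and references) are modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Triple:
    """Plain subject-predicate-object fact (the DBpedia/OpenCyc style)."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class ScoredTriple(Triple):
    """Triple plus a confidence value (the Freebase style)."""
    confidence: float = 1.0


@dataclass(frozen=True)
class SpotlTuple(Triple):
    """SPO triple extended by time (T) and location (L), as in YAGO's SPOTL tuples."""
    time: Optional[str] = None
    location: Optional[str] = None


@dataclass
class Statement:
    """Wikidata-style statement: a claim (property, value, qualifiers) plus references."""
    subject: str
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)
    references: list = field(default_factory=list)

    def as_triple(self) -> Triple:
        # Dropping qualifiers and references loses the meta-information.
        return Triple(self.subject, self.prop, self.value)


# Illustrative values only; the reference URL is a placeholder.
statement = Statement(
    subject="Q1040",
    prop="country",
    value="Germany",
    references=["http://example.org/some-source"],
)
flattened = statement.as_triple()
```

Flattening a statement to a triple, as `as_triple` does, illustrates why sizes counted in "statements" and in "triples" are not directly comparable across KGs.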

In this context, the relation owl:sameAs is worth mentioning: This relation indicates that the two resources which are linked by an owl:sameAs relation refer to the same real-world object, even though they might have different URIs in the different datasets. In this way, further information about the resources can be retrieved from linked datasets and no information needs to be copied to other datasets.

The most important sets of owl:sameAs relations between the considered KGs are as follows (see also Figure 2): DBpedia and YAGO link to each other; DBpedia links in addition to Freebase and Wikidata; Wikidata links to Freebase; OpenCyc links to DBpedia; DBpedia links to Cyc, but these links are broken.⁵²

Fig. 2. owl:sameAs relations between the considered KGs (nodes: DBpedia, YAGO, Wikidata, Freebase, OpenCyc).

⁵² HTTP requests of URIs with the domain http://sw.cyc.com result in a DNS error, but these URIs are dereferencable if the domain is replaced by http://sw.opencyc.org.

5.4. Comparison of Entities

– Entity reference: Since all considered KGs were shaped by the vision of the Semantic Web, entities do not only have unique IDs, but URIs by which they can be referred to. While DBpedia, Freebase, Cyc, and YAGO provide human-readable IDs, Freebase and Cyc additionally operate with opaque URIs. Wikidata only provides entity IDs which consist of “Q” followed by a number in order to be language-agnostic. The labels for the entities are stored in Wikidata separately. As outlined by Berners-Lee in 1998 in “Cool URIs don’t change” [10], URIs should be designed with three things in mind: simplicity, stability, and manageability. In the context of KGs where each entity has a URI, well-chosen URIs become even more important. Sauermann and Cyganiak [53] present so-called 303 URIs (which redirect via the HTTP status code 303 See Other) and hash URIs (which contain a fragment identifier) for the Semantic Web. Both forms have their advantages and disadvantages.

– LOD registration: Publishing a dataset according to the Linked Data principles already implies that this dataset is part of the LOD cloud. Besides that, there are Linked Open Data registration portals such as http://datahub.io where LOD datasets can be registered and, hence, found quickly. All KGs considered in this survey are published in RDF and are part of the LOD cloud. Besides that, until the submission of this survey, all considered KGs except Wikidata were also registered at http://datahub.io as part of the LOD cloud.⁵³

– LOD linkage: Most of the considered KGs link their entities to entities of other datasets in the LOD cloud. Remarkable here are DBpedia and Freebase in terms of their high degree of connectivity with other LOD datasets. DBpedia is justifiably called the hub of the LOD cloud [41,50].

– Entity relevance: In some scenarios it is helpful to rank or order entities based on some importance and/or relevance score (e.g., to find the most well-known football players or politicians). In the past, several approaches were presented which calculate scores for entities. However, currently only Freebase provides relevance scores for entities; they were created by using the link counts in Freebase and Wikipedia.⁵⁴

– Description of entities: It can be difficult for users to figure out which entity is meant by a given ID or URI, especially if the ID is a mostly numerical value due to the constraint that the knowledge representation be language-independent. In such cases, a textual description of the entities is important. While some KGs offer textual descriptions via special properties (see DBpedia and Freebase) or fields within the data model (as in case of Wikidata), YAGO does not offer any entity description and OpenCyc offers one only for a fraction of the entities.

Table 4: Comparison of the KGs regarding entities.

Entity reference
  DBpedia: URI (based on Wikipedia page title), e.g., http://dbpedia.org/resource/Karlsruhe
  Freebase: MID and sometimes ID (MID is permanent, consistent with changes, but not human-readable as IDs), e.g., /m/0qb1z
  OpenCyc: unique ID, English ID, e.g., T6OkHdRS-eUiqO5n8NA1g, Karlsruhe
  Wikidata: unique ID, e.g., Q1040
  YAGO3: URI (based on Wikipedia page title), e.g., http://yago-knowledge.org/resource/Karlsruhe
LOD registration
  DBpedia: yes
  Freebase: yes
  OpenCyc: yes
  Wikidata: no
  YAGO3: yes
LOD linkage
  DBpedia: Links to Freebase, OpenCyc, UMBEL, GeoNames, YAGO, Wikidata, etc.
  Freebase: Links to BBC Music, Geospecies
  OpenCyc: Links to DBpedia
  Wikidata: links to Freebase and Musicbrainz
  YAGO3: Links to DBpedia
Entity relevance
  DBpedia: no
  Freebase: Entities have relevance scores (calculated by link counts in Freebase and Wikipedia)
  OpenCyc: no
  Wikidata: no
  YAGO3: no
Description of entities
  DBpedia: Yes, via property http://dbpedia.org/ontology/abstract
  Freebase: yes, via property /common/topic/description; often there is an image /common/topic/image
  OpenCyc: Some entities have a textual description in comment
  Wikidata: Description field for every entity, no special property
  YAGO3: no

Table 5: Comparison of the KGs regarding relations.

Relation reference
  DBpedia: URI (two prefixes: property and ontology)
  Freebase: ID and MID (see entity referencing in Table 4)
  OpenCyc: Phrase (e.g., IsA)
  Wikidata: ID
  YAGO3: URI
Relation relevance
  DBpedia: no
  Freebase: not mentioned
  OpenCyc: no
  Wikidata: yes, e.g., by date
  YAGO3: no
Description of relations
  DBpedia: some properties have a label, a link to a description
  Freebase: yes, /common/topic/description, mostly one sentence; e.g., via /film/producer/films_executive_produced: “Films this person produced and served as an executive producer on”
  OpenCyc: no
  Wikidata: yes, specific field in every property
  YAGO3: All properties have a property “hasGloss” with a pattern of usage
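Because owl:sameAs is symmetric and transitive, a set of pairwise links such as those described for Figure 2 induces groups of URIs that all denote the same real-world object. The sketch below groups URIs with a small union-find structure. It is an illustration only: the DBpedia and YAGO URIs are the Karlsruhe examples from Table 4, while the Wikidata and Freebase namespaces and the example.org pair are assumptions made for the example.

```python
def sameas_clusters(links):
    """Group URIs connected by owl:sameAs links (treated as symmetric and transitive)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving keeps trees shallow
            uri = parent[uri]
        return uri

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


DBPEDIA = "http://dbpedia.org/resource/Karlsruhe"
YAGO = "http://yago-knowledge.org/resource/Karlsruhe"
WIKIDATA = "http://www.wikidata.org/entity/Q1040"   # namespace is an assumption
FREEBASE = "http://rdf.freebase.com/ns/m.0qb1z"     # namespace is an assumption

links = [
    (DBPEDIA, YAGO),
    (DBPEDIA, FREEBASE),
    (WIKIDATA, FREEBASE),
    ("http://example.org/a", "http://example.org/b"),  # unrelated placeholder pair
]
clusters = sameas_clusters(links)
```

In this example Wikidata and YAGO end up in the same cluster although no direct link between them is given, which is exactly the effect of following owl:sameAs links across datasets.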

5.5. Comparison of Relations

– Relation reference: Analogously to entity references, relations in the KGs are represented as IDs or URIs.

– Relation relevance: In this respect, Wikidata is particularly noteworthy. Since each statement stored in Wikidata has some meta-information such as a timestamp attached to it, properties of items in Wikidata can be ordered with respect to these meta-information aspects.

– Description of relations: Freebase, Wikidata, and YAGO provide a textual description for all relations, so that the meaning of properties can be derived. YAGO stores how the properties are expressed in free text, while the other KGs provide custom-built textual descriptions. DBpedia provides a label and a link to a definition only for some properties. Cyc and BabelNet do not provide any description at all.

5.6. Comparison of Schema

– Schema restrictions: In KGs where the data is derived via automated extraction methods, both the set of classes and the set of relations are fixed. In case of KGs where end-users can contribute (see Wikidata and Freebase) the schema is fixed, but can be extended.

– Schema constraints: Constraints regarding the schema become relevant when new facts are added to the KG (see integrity constraints and data consistency as conceptual keystone of any graph data model according to [18]). None of the considered KGs uses significant constraints when facts are added: In case of DBpedia the type is fixed during the mapping process; no further constraints are given. For YAGO, a type checker and constraint checker is provided. Wikidata has no constraint check tool, but users can report constraint violations. OpenCyc has no constraints since the facts are created manually by experts. No information about constraints was found in case of Freebase.

– Hierarchy and network of relations: Only OpenCyc implements a hierarchy of relations [19]. Notable is, however, that DBpedia properties are automatically extracted from Wikipedia, leading to many properties whose meaning is not given,⁵⁵ remains unclear,⁵⁶ or which semantically overlap with other properties.⁵⁷

– External vocabulary: DBpedia and YAGO use vocabularies from other datasets (DBpedia uses owl, xsd, rdfs, rdf, foaf, dc, skos, umbel⁵⁸; YAGO uses skos, umbel, rdfs, and rdf), while Freebase and Cyc only use their own vocabulary. Wikidata also uses its own vocabulary, but sometimes links to external vocabulary via the “equivalent property” property.

– Description of classes: DBpedia, Freebase, Cyc, and Wikidata provide human-readable descriptions of their classes (DBpedia uses dbpedia-owl:abstract and rdfs:comment; YAGO only uses rdfs:label, no description; Freebase uses the relation /common/topic/description; OpenCyc has a comment relation; Wikidata provides a description, which is exported in the RDF dumps as schema:description).

– Forms of abstraction: As outlined in Section 2.1, classification, generalization, aggregation, and association are among the most important methods to model in a more abstract way. All considered KGs support the modeling of generalization, classification, and association. Freebase and Wikidata also support aggregation.

– Data types: The KGs either do not support any data types for literal values, but just store strings (as in case of Cyc), or they support simple data types such as a subset of the XML Schema (see DBpedia, Freebase, Wikidata, and YAGO; a typical data type is xsd:integer). The highest number of data types is used by DBpedia⁵⁹ and Freebase.⁶⁰

⁵³ See http://datahub.io/group/lodcloud.
⁵⁴ See http://wiki.freebase.com/wiki/Search_Cookbook#Scoring_and_Ranking
⁵⁵ An example for that is http://dbpedia.org/property/s.
⁵⁶ An example for that is http://dbpedia.org/property/useEw%25_
⁵⁷ An example for that is http://dbpedia.org/property/develop, http://dbpedia.org/property/developer, and http://dbpedia.org/property/develops.
⁵⁸ See http://lov.okfn.org/dataset/lov/vocabs/dbpedia-owl
⁵⁹ See http://mappings.dbpedia.org/index.php/DBpedia_Datatypes.
⁶⁰ See https://wikidata.org/wiki/Special:ListDatatypes.

Table 6: Comparison of the KGs regarding the schema.

Schema restrictions
  DBpedia: yes, DBpedia ontology; in addition non-mapping-based properties with no fixed domain
  Freebase: yes, fixed schema is used, but users are allowed to edit and expand the schema
  OpenCyc: properties are fixed
  Wikidata: users are allowed to edit and expand the schema
  YAGO3: not mentioned, >560k classes
Schema constraints
  DBpedia: inconsistent data may occur
  Freebase: not mentioned
  OpenCyc: no constraints
  Wikidata: users can report constraint violations
  YAGO3: type checker and constraint checker provided
Network or hierarchy of relations
  DBpedia: no, but many relations where the meaning is fuzzy or unclear or which overlap with other relations
  OpenCyc: yes
External vocabulary
  DBpedia: yes, RDFS, FOAF, OWL, YAGO, UMBEL concepts
  Freebase: no
  OpenCyc: no
  Wikidata: yes, just as in entities, an “equivalent property” property links to external vocabulary
  YAGO3: yes, WordNet, RDFS, OWL
Description of classes
  DBpedia: DBpedia ontology for classes in OWL
  Freebase: yes, classes are described in /common/topic/description
  OpenCyc: often there is a comment for a collection
Forms of abstraction
  DBpedia: generalization, classification, association
  Freebase: generalization, classification, aggregation, association
  OpenCyc: generalization, classification, association
  Wikidata: generalization, classification, aggregation, association
  YAGO3: generalization, classification, association
Data types
  DBpedia: xsd data types, different other data types for currencies etc. (a)
  Freebase: no, but sometimes data types such as datetime, int, float, media_type, etc.
  OpenCyc: not used, only simple data types
  Wikidata: such as time, timezone, coordinates
  YAGO3: simple data types such as xsd:integer
(a) See http://mappings.dbpedia.org/index.php/DBpedia_Datatypes
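To illustrate what a schema constraint check at insertion time can look like, the following minimal sketch tests a new fact against a domain and a range declared for its property. It is not the type checker or constraint checker of YAGO or of any other KG discussed here; the schema, the entity types, and the function name are hypothetical.

```python
# Hypothetical schema: property -> (required type of subject, required type of object).
SCHEMA = {"birthPlace": ("Person", "Place")}

# Hypothetical type assertions for a few entities.
ENTITY_TYPES = {
    "Barack_Obama": {"Person"},
    "Honolulu": {"Place"},
}


def check_fact(fact, schema, entity_types):
    """Return a list of constraint violations for a (subject, property, object) fact."""
    subject, prop, obj = fact
    if prop not in schema:
        return [f"unknown property: {prop}"]
    domain, range_ = schema[prop]
    problems = []
    if domain not in entity_types.get(subject, set()):
        problems.append(f"{subject} is not a {domain}")
    if range_ not in entity_types.get(obj, set()):
        problems.append(f"{obj} is not a {range_}")
    return problems


valid = check_fact(("Barack_Obama", "birthPlace", "Honolulu"), SCHEMA, ENTITY_TYPES)
swapped = check_fact(("Honolulu", "birthPlace", "Barack_Obama"), SCHEMA, ENTITY_TYPES)
```

A KG that only lets users report violations after the fact would run such a check over stored facts instead of rejecting them on insertion.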

5.7. Comparison of Particularities

Besides general KG information and information about the storage of instances, relations, and the schema of the KGs, there are several aspects of data modeling which can be found only in the models of distinct KGs. These aspects are:

– Temporal aspects: There are three types of temporal aspects which can be attached to facts in a KG (see [57]): the valid time, i.e., the point in time or time span during which the fact is valid; the insertion time, i.e., the time the fact is or was inserted into the KG; and a relevance time aspect, i.e., the point in time or time interval which is relevant for the user’s application. Only two of the KGs support the storage of temporal aspects besides pure facts. The data model of Wikidata allows users to store the time interval in which the statement holds true. In this way, facts which are valid only for specific time spans, such as election periods, can be stored. YAGO also supports storing temporal information attached to the fact, such as the occurrence date.

– Source of facts: For reasons of traceability the source (reference) of facts is stored together with the facts. Knowing where the facts are derived from might be important for the user to assess the validity and trustworthiness of the fact. Wikidata and YAGO are the only KGs where storing the source of facts is both supported by the data model and used by the users.⁶¹ The other KGs do not store the source. In case of DBpedia the source of facts is obvious, namely Wikipedia, and does not need to be stored. Cyc is created completely by experts. It can be assumed that the source of facts is not stored here, since all facts are considered reliable.

– Reification: RDF reification was intended as a mechanism for making provenance statements and other statements about RDF triples [67]. Regarding our KGs, only Cyc and YAGO use reification to some extent: Cyc allows the reification of literals. In case of YAGO, time and location are attached to facts by reification.

6. Assessment of KGs

Based on Tables 1 – 7, we created a matrix (see Table 8) in which the most important aspects in which the KGs differ (extracted from Tables 1 – 7) are formulated as yes-no questions. These questions serve the purpose of guiding users who want to choose among the KGs those that best fit their purposes.

For assessing the KGs, a score is calculated for each KG. This “fulfillment score” can be calculated as the number of questions for which the answer for the desired KG matches the answer of the KG in question. More sophisticated scoring functions are also possible, in which the matching regarding specific questions is weighted higher. In the end, the KG which has achieved the highest score is the KG which is favored by this framework.

7. Limitations of KGs

Peckham and Maryanski [47] argued that semantic data models will be used widely when they perform sufficiently well for real-world settings (especially in enterprises). We can argue that there are already some KG applications and many Linked Open Data datasets available. Examples where Linked Open Data is used are the BBC,⁶² Best Buy,⁶³ and the German National Library.⁶⁴

However, there are several limitations of the KGs and, hence, of the Semantic Web in its current uptake, which became apparent during the analysis of the considered KGs and which we therefore would like to emphasize:

1. Domain specificity limitation: During the process of selecting the KGs for comparison, it became apparent that either many KBs in the LOD cloud are highly aligned to the general domain Wikipedia covers (since Wikipedia is used as knowledge source in these cases) or (i) the KBs focus on specific domains (cf. the lexical databases WordNet and BabelNet) and/or (ii) cover more schema information than instance information, so that they cannot be called KGs any more (cf. the common sense KBs ConceptNet
⁶¹ It can be noted that many facts of Wikidata are derived from Wikipedia, so that in many cases the Wikipedia URL is the only source. For YAGO, the source of facts is provided in a separate file.
⁶² See http://www.bbc.co.uk/ontologies
⁶³ See http://www.bestbuy.com/
⁶⁴ See http://www.dnb.de/DE/Service/DigitaleDienste/LinkedData/linkeddata_node.html

Table 7: Comparison of the KGs regarding their particularities.

Temporal aspects
  DBpedia: no
  Freebase: not mentioned
  OpenCyc: no
  Wikidata: yes, the valid time of facts (e.g., the population for different points in time)
  YAGO3: yes, e.g., the time of occurrence
Source of facts
  DBpedia: no (all from Wikipedia)
  Freebase: not mentioned
  OpenCyc: no
  Wikidata: yes, mostly
  YAGO3: yes (in file yagoSources)
Reification
  DBpedia: no
  Freebase: Currently not exploited
  OpenCyc: reification of literals
  Wikidata: no, decision against it
  YAGO3: yes, time and location attached via reification

Table 8: Decision matrix for KG selection.

Columns: DBpedia, Freebase, OpenCyc, Wikidata, YAGO3.
Rows (yes-no questions):
1. Is the available KG continuously updated (no fixed versions)?
2. Are other languages than English supported?
3. Is it rather an instance KG than a schema KG?
4. Is data available via HTTP lookup?
5. Is an official SPARQL endpoint provided?
6. Is some data quality level ensured (manually or via manual approvement)?
7. Is the KG part of the LOD cloud according to http://datahub.io?
8. Do the entities have any ordering or ranking?
9. Are entity descriptions available?
10. Do the relations have any ordering or ranking?
11. Are relation descriptions available?
12. Is the occurrence time of facts stored?
13. Is the source of facts stored?
14. Is reification supported and done in practice?
15. Are concepts descriptions available?
16. Are any standard data types used for literals?
(a) Continuous updates are available for DBpedia live, but not for DBpedia.
(b) Via HTTP lookups the user can only retrieve the labels and the Wikipedia categories of Wikidata items.
(c) An evaluation of the quality was only performed for YAGO2, but not for YAGO3.
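The fulfillment score described in Section 6 can be sketched in a few lines: count the questions on which a KG's answer matches the answer desired by the user, optionally weighting some questions higher. The function names, the weights, and the sample answers for the placeholder KGs "KG_A" and "KG_B" are hypothetical and are not taken from Table 8.

```python
def fulfillment_score(desired, kg_answers, weights=None):
    """Sum the weights (default 1.0) of all questions where the KG matches the desired answer."""
    weights = weights or {}
    return sum(
        weights.get(question, 1.0)
        for question, wanted in desired.items()
        if kg_answers.get(question) == wanted
    )


def rank_kgs(desired, matrix, weights=None):
    """Return (KG, score) pairs, best score first; ties are ordered by KG name."""
    scores = {
        kg: fulfillment_score(desired, answers, weights)
        for kg, answers in matrix.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical user requirements, keyed by question number of the decision matrix.
desired = {1: True, 5: True, 13: True}

# Hypothetical answers for two placeholder KGs (not the values of Table 8).
matrix = {
    "KG_A": {1: False, 5: True, 13: False},
    "KG_B": {1: True, 5: False, 13: True},
}

unweighted = rank_kgs(desired, matrix)
weighted = rank_kgs(desired, matrix, weights={5: 3.0})
```

With equal weights "KG_B" matches two of the three requirements and is ranked first; once question 5 is weighted three times as high, "KG_A" overtakes it, which shows how the weighting changes the recommendation.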

and UMBEL). Although a considerable number of freely available KGs focus on specific domains, most domains with potential use cases are not covered by KGs. Regarding the prediction by Peckham and Maryanski [47], according to which semantic data models are used mainly for the management of scientific, engineering, and manufacturing data, we can state: this data exists, but most of it is not semantically enriched and/or not available in RDF, and where it is, there are not many attributes. Future projects on the so-called “Industry 4.0” (including concepts of cyber-physical systems, the Internet of Things, and the Internet of Services) aim at changing this.

2. Limitations of modeling time aspects: Concepts for modeling dynamic and/or temporal aspects within semantic (graph) data models have been developed since the emergence of semantic data models (see [58,57] for an overview of temporal database theory). Snodgrass and Ahn [57] distinguish between transaction time (the time when data is stored), valid time (the time during which the data is useful or valid), and user-defined time (additional time information to be stored) as the dimensions of representing temporal data in databases. Despite research on temporal data models, dynamic and temporal data models have not become prevalent so far. Even today, most semantic graph data models encompass neither the temporal characteristics of knowledge facts nor the spatio-temporally grounded representation of events. Also, Rula et al. [51] showed that the amount of temporal information available in the Linked Open Data cloud is still very small. One reason might be that adding temporal aspects multiplies the number of statements and may also complicate the situation for users and software developers, since the queries they write become more complex. Keeping things simple, and thus neglecting temporal aspects, is an often-chosen mantra for building scalable environments such as KBs.

3. KG Population: The Semantic Web suffers from the difficulties of transforming text and other, mainly unstructured data into RDF. Today's KGs already have some potential and can be applied to many settings; however, they depend on the supply of structured data from external sources. [...] tools and [...] tools are the key for building KGs.

4. Limitations regarding the Linked Open Data cloud (partly based on [34]):

(a) Lack of Conceptual Description of Datasets: In order to identify the domain a specific LOD dataset covers, a human expert is needed. There is currently no standard mechanism or dataset description interface which states that, e.g., MusicBrainz is about music-related information while Geonames is about geographical information. As a result, there is no overview of which datasets exist and which can be used in a certain setting. There are some attempts [49,2] to describe LOD datasets,65 but they focus not on the conceptual or semantic level, but on statistical information or a prose description. Since the LOD cloud consists of datasets which were published under the Linked Data principles, nobody knows the complete picture of the LOD cloud. Even the well-known LOD cloud diagram66 is only a particular perspective on the Web of Data, and many other valid perspectives are possible.67 Other approaches automatically assess, annotate, and index linked datasets, e.g., by extracting topic annotations for arbitrary Linked Data datasets [24]; but these tools have not yet become widely used.

(b) Lack of LOD Schema Alignment: Links between LOD datasets are almost exclusively on the level of instances. There are only a few approaches or good practices for mapping concepts at the schema level of the LOD cloud. Although ontology matching has been widely studied in the Semantic Web area and its tools usually produce strict mappings between concepts such as equivalence and subsumption, the situation in the case of Linked Data is difficult: even though concepts may have a strong [...], the concepts are not necessarily equivalent. One example of an inconsistency is that dbpedia:Actor denotes professional actors, while the concept movie:actor of LinkedMDB68 means a person who plays a role in a movie, but who is not a stage actor. The UMBEL ontology was developed to connect schemas used by LOD datasets. However, UMBEL does not take the individual usage patterns of the concepts into account [45]. Notable approaches for finding schema-level links between LOD datasets are provided by Nikolov et al. [45] and Jain et al. [33].

(c) Lack of Expressivity: Publishing Linked Data in the LOD cloud is done for rapid data releases and in reliance on the Web of Data as a washing machine (cleaning data over time) [6]. However, in Linked Data the rich features of OWL are rarely used. Although entities can be interlinked between datasets with owl:sameAs relations, there is no automatic constraint check or reasoning about whether the entities in different datasets contain incoherent information. The city of Berlin, for instance, can have a different population size in DBpedia and in Geonames, and this inconsistency is not resolved. This task remains a burden for the data consumers.

65 See also the LOD data catalogs http://datahub.io and http://linkeddatacatalog.dws.informatik.uni-mannheim.de.
66 See http://lod-cloud.net/.
67 See http://lod-cloud.net/.
68 See http://www.linkedmdb.org/.

8. Outlook

Future work on KGs and Semantic Web technologies might focus on the following areas:

– There are new approaches to modeling knowledge in a semantically structured form, against the background of what has been learned from 15 years of [...]. One example of such a new approach is the design and use of so-called ontology design patterns [26]: an ontology design pattern is a reusable solution to a recurrent modeling problem. The focus is on reusing existing components, since ontologies and ontology components have been reused only to a very limited extent so far.

– There might be new forms of KGs and KBs which do not focus on the storage of entities and their relations, but instead on other things such as events [39,55,65]. Papers published in recent years indicate that event-centric KGs will become more important and also widely applicable.

– Recently, there has been noticeable progress towards constructing KGs automatically. This is necessary, since constructing and/or populating KGs with the help of experts or of open communities does not scale to the extent needed for most applications. For instance, in Freebase the place of birth relation was missing for 71% of all people instances, although this relation was mandatory according to the schema [68]. Also, Suh et al. [60] showed that the growth of Wikipedia has been slowing down. Consequently, automatic knowledge base construction (AKBC) methods have attracted more attention [44]. Noteworthy in this context is the approach of statistical inference in KGs. Predictive models are trained on known facts from the KG. From these models, unknown facts are derived and compared to “noisy” facts extracted from external, often unstructured sources such as the Web. New facts are added to the KG if they are supported by both models with a certain confidence. This methodology is, for instance, used in Google’s Knowledge Vault project [21].

9. Conclusion

Freely available knowledge graphs (KGs) have not been the focus of any extensive comparative study so far. In this survey, we defined aspects according to which KGs can be analyzed. We analyzed and compared DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along these aspects and proposed a checklist to enable readers to find the most suitable KG for their settings. We discussed the essential issues current KGs are confronted with and took a brief look at the possible future of the Semantic Web.

References

[1] S. Abiteboul and R. Hull. IFO: A Formal Semantic Database Model. ACM Trans. Database Syst., 12(4):525–565, Nov. 1987.
[2] K. Alexander. Describing Linked Datasets – On the Design and Usage of voiD, the ’Vocabulary Of Interlinked Datasets’. WWW 2009 Workshop: Linked Data on the Web, 2009.
[3] B. Amann and M. Scholl. Gram: A Graph Data Model and Query Languages. In Proceedings of the ACM Conference on Hypertext, ECHT ’92, pages 201–211, New York, NY, USA, 1992. ACM.

[4] M. Andries, M. Gemis, J. Paredaens, I. Thyssens, and J. V. den Bussche. Concepts for Graph-Oriented Object Manipulation. In Proceedings of the 3rd International Conference on Extending Database Technology: Advances in Database Technology, EDBT ’92, pages 21–38, London, UK, 1992. Springer-Verlag.
[5] R. Angles and C. Gutierrez. Survey of Graph Database Models. ACM Computing Surveys, 40(1):1:1–1:39, Feb. 2008.
[6] S. Auer. Creating Knowledge out of Interlinked Data: Making the Web a Data Washing Machine. In Proceedings of the International Conference on Web Intelligence, Mining and Semantics, WIMS ’11, pages 4:1–4:8, New York, NY, USA, 2011. ACM.
[7] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC 2007/ASWC 2007, pages 722–735. Springer, 2007.
[8] S. Auer, J. Lehmann, A.-C. Ngonga Ngomo, and A. Zaveri. Introduction to Linked Data and Its Lifecycle on the Web. In Reasoning Web. Semantic Technologies for Intelligent Data Access, volume 8067 of Lecture Notes in Computer Science, pages 1–90. Springer Berlin Heidelberg, 2013.
[9] T. Berners-Lee. Linked Data – Design issues. http://www.w3.org/Designissues/LinkedData.html. accessed May 15, 2015.
[10] T. Berners-Lee. Cool URIs don’t change. Technical report, World Wide Web Consortium, 1998.
[11] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):29–37, May 2001.
[12] J. Brank, M. Grobelnik, and D. Mladenic. A survey of ontology evaluation techniques. In Proceedings of the conference on data mining and data warehouses (SiKDD 2005), pages 166–170, 2005.
[13] T. Bray, J. Paoli, and C. M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0, W3C Recommendation 10. http://www.w3.org/TR/1998/REC-xml-19980210. accessed July 31, 2015.
[14] M. L. Brodie. On the Development of Data Models. In M. L. Brodie, J. Mylopoulos, and J. W. Schmidt, editors, On Conceptual Modelling, Topics in Information Systems, pages 19–47. Springer New York, 1984.
[15] P. Buneman. Semistructured Data. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’97, pages 117–121, New York, NY, USA, 1997. ACM.
[16] P. P.-S. Chen. The Entity-relationship Model: Toward a Unified View of Data. ACM Trans. Database Syst., 1(1):9–36, Mar. 1976.
[17] E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Commun. ACM, 13(6):377–387, June 1970.
[18] E. F. Codd. Data Models in Database Management. SIGPLAN Not., 16(1):112–114, June 1980.
[19] J. Curtis, J. Cabral, and D. Baxter. On the Application of the Cyc Ontology to Word Sense Disambiguation. In FLAIRS Conference, pages 652–657, 2006.
[20] R. Cyganiak, D. Wood, and M. Lanthaler. RDF 1.1 Concepts and Abstract Syntax. 2014. accessed July 30, 2015.
[21] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 601–610, New York, NY, USA, 2014. ACM.
[22] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. Vrandečić. Introducing Wikidata to the Linked Data Web. In Proceedings of the 13th International Semantic Web Conference (ISWC’14), LNCS. Springer, 2014.
[23] C. Fellbaum. WordNet – An Electronic Lexical Database. MIT Press, 1998.
[24] B. Fetahu, S. Dietze, B. P. Nunes, D. Taibi, and M. A. Casanova. Profiling of Linked Datasets using Structured Descriptions. In The 12th International Semantic Web Conference (ISWC2013), 2013.
[25] G. Freedman and E. Reynolds. Enriching basal reader lessons with semantic webbing. Reading Teacher, 33(6):677–684, 1980.
[26] A. Gangemi. Ontology Design Patterns for Semantic Web Content. In Y. Gil, E. Motta, V. Benjamins, and M. Musen, editors, The Semantic Web – ISWC 2005, volume 3729 of Lecture Notes in Computer Science, pages 262–276. Springer Berlin Heidelberg, 2005.
[27] R. Guns. Tracing the Origins of the Semantic Web. Journal of the American Society for Information Science and Technology, 64(10):2173–2181, 2013.
[28] M. Gyssens, J. Paredaens, and D. van Gucht. A Graph-oriented Object Database Model. In Proceedings of the 9th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’90, pages 417–424, New York, NY, USA, 1990. ACM.
[29] M. Hammer and D. McLeod. Database Description with SDM: A Semantic Database Model. ACM Trans. Database Syst., 6(3):351–386, Sept. 1981.
[30] M. Hammer and D. McLeod. The Semantic Data Model: A Modelling Mechanism for Data Base Applications. In Proceedings of the 1978 ACM SIGMOD International Conference on Management of Data, SIGMOD ’78, pages 26–36, New York, NY, USA, 1978. ACM.
[31] R. Hull and R. King. Semantic Database Modeling: Survey, Applications, and Research Issues. ACM Comput. Surv., 19(3):201–260, Sept. 1987.
[32] P. Jain, P. Hitzler, K. Janowicz, and C. Venkatramani. There’s No Money in Linked Data. http://corescholar.libraries.wright.edu/cse/240, 2013. accessed July 20, 2015.
[33] P. Jain, P. Hitzler, A. P. Sheth, K. Verma, and P. Z. Yeh. Ontology Alignment for Linked Open Data. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web - Volume Part I, ISWC’10, pages 402–417, Berlin, Heidelberg, 2010. Springer-Verlag.
[34] P. Jain, P. Hitzler, P. Z. Yeh, K. Verma, and A. P. Sheth. Linked Data Is Merely More Data. In AAAI Spring Symposium: linked data meets artificial intelligence, volume 11, 2010.
[35] L. Kerschberg, A. C. Klug, and D. Tsichritzis. A Taxonomy of Data Models. In Systems for Large Data Bases, pages 43–64. North Holland & IFIP, 1976.
[36] W. Kim. Object-Oriented Databases: Definition and Research Directions. IEEE Transactions on Knowledge and Data Engineering, 2(3):327–341, 1990.
[37] G. Klyne and J. J. Carroll. Resource Description Framework (RDF): Concepts and Abstract Syntax. http://www.w3.org/TR/2004/REC-rdf-concepts-20040210, 2004. accessed July 20, 2015.
[38] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009 Heraklion, pages 723–737, Berlin, Heidelberg, 2009. Springer-Verlag.
[39] E. Kuzey, J. Vreeken, and G. Weikum. A Fresh Look on Knowledge Bases: Distilling Named Events from News. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, CIKM ’14, pages 1689–1698, New York, NY, USA, 2014. ACM.
[40] O. Lassila and R. R. Swick. Resource Description Framework (RDF) Model and Syntax Specification. http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/, 1999. accessed July 4, 2015.
[41] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2), 2012.
[42] A. Lozano-Tello and A. Gómez-Pérez. Ontometric: A method to choose the appropriate ontology. Journal of Database Management, 2(15):1–18, 2004.
[43] F. Mahdisoltani, J. Biega, and F. M. Suchanek. YAGO3: A Knowledge Base from Multilingual Wikipedias. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015.
[44] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A Review of Relational Machine Learning for Knowledge Graphs: From Multi-Relational Link Prediction to Automated Knowledge Graph Construction. arXiv preprint arXiv:1503.00759, 2015.
[45] A. Nikolov, V. Uren, E. Motta, and A. de Roeck. Overcoming Schema Heterogeneity between Linked Semantic Repositories to Improve Coreference Resolution. In A. Gómez-Pérez, Y. Yu, and Y. Ding, editors, The Semantic Web, volume 5926 of Lecture Notes in Computer Science, pages 332–346. Springer Berlin Heidelberg, 2009.
[46] J. Paredaens, P. Peelman, and L. Tanca. G-Log: A Graph-Based Query Language. IEEE Trans. on Knowl. and Data Eng., 7(3):436–453, June 1995.
[47] J. Peckham and F. Maryanski. Semantic Data Models. ACM Comput. Surv., 20(3):153–189, Sept. 1988.
[48] M. Poveda-Villalón, M. Suárez-Figueroa, and A. Gómez-Pérez. Did You Validate Your Ontology? OOPS! In E. Simperl, B. Norton, D. Mladenic, E. Della Valle, I. Fundulaki, A. Passant, and R. Troncy, editors, The Semantic Web: ESWC 2012 Satellite Events, volume 7540 of Lecture Notes in Computer Science, pages 402–407. Springer Berlin Heidelberg, 2015.
[49] B. Quilitz and U. Leser. Querying Distributed RDF Data Sources with SPARQL. In Proceedings of the 5th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC’08, pages 524–538, Berlin, Heidelberg, 2008. Springer-Verlag.
[50] M. A. Rodriguez. A graph analysis of the Linked Data cloud. arXiv preprint arXiv:0903.0194, 2009. accessed July 31, 2015.
[51] A. Rula, M. Palmonari, A. Harth, S. Stadtmüller, and A. Maurino. On the Diversity and Availability of Temporal Information in Linked Open Data. In The Semantic Web – ISWC 2012, volume 7649 of Lecture Notes in Computer Science, pages 492–507. Springer Berlin Heidelberg, 2012.
[52] E. Sandhaus. Semantic Technology at the New York Times: Lessons Learned and Future Directions. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web - Volume Part II, ISWC’10, pages 355–355, Berlin, Heidelberg, 2010. Springer-Verlag.
[53] L. Sauermann and R. Cyganiak. Cool URIs for the Semantic Web. W3C Note, http://www.w3.org/TR/2008/NOTE-cooluris-20081203/, Dec. 2008. accessed July 10, 2015.
[54] H. A. Schmid and J. R. Swenson. On the Semantics of the Relational Data Model. In Proceedings of the 1975 ACM SIGMOD International Conference on Management of Data, SIGMOD ’75, pages 211–223, New York, NY, USA, 1975. ACM.
[55] R. Segers, P. Vossen, M. Rospocher, L. Serafini, E. Laparra, and G. Rigau. ESO: a Frame based Ontology for Events and Implied Situations. Proceedings of Maplex2015, 2015.
[56] J. M. Smith and D. C. P. Smith. Database Abstractions: Aggregation and Generalization. ACM Trans. Database Syst., 2(2):105–133, June 1977.
[57] R. Snodgrass and I. Ahn. A Taxonomy of Time Databases. In Proceedings of the 1985 ACM SIGMOD International Conference on Management of Data, SIGMOD ’85, pages 236–246, New York, NY, USA, 1985. ACM.
[58] R. T. Snodgrass. Temporal databases. In Theories and Methods of Spatio-Temporal Reasoning in Geographic Space, volume 639 of Lecture Notes in Computer Science, pages 22–64. Springer Berlin Heidelberg, 1992.
[59] S. Staab and R. Studer. Handbook on Ontologies. Springer Publishing Company, Incorporated, 2nd edition, 2009.
[60] B. Suh, G. Convertino, E. H. Chi, and P. Pirolli. The Singularity is Not Near: Slowing Growth of Wikipedia. In Proceedings of the 5th International Symposium on Wikis and Open Collaboration, WikiSym ’09, pages 8:1–8:10, New York, NY, USA, 2009. ACM.
[61] S. Tartir, I. B. Arpinar, M. Moore, A. P. Sheth, and B. Aleman-Meza. OntoQA: Metric-based ontology quality analysis. In IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, 2005.
[62] R. W. Taylor and R. L. Frank. CODASYL Data-Base Management Systems. ACM Comput. Surv., 8(1):67–103, Mar. 1976.
[63] D. C. Tsichritzis and F. H. Lochovsky. Hierarchical Data-Base Management: A Survey. ACM Comput. Surv., 8(1):105–123, Mar. 1976.
[64] D. C. Tsichritzis and F. H. Lochovsky. Data Models. Prentice Hall Professional Technical Reference, 1982.
[65] P. Vossen, T. Caselli, and Y. Kontzopoulou. Storylines for structuring massive streams of news. ACL-IJCNLP 2015, pages 40–49, 2015.
[66] D. Vrandečić and Y. Sure. How to Design Better Ontology Metrics. In E. Franconi, M. Kifer, and W. May, editors, The Semantic Web: Research and Applications, volume 4519 of Lecture Notes in Computer Science, pages 311–325. Springer Berlin Heidelberg, 2007.
[67] E. R. Watkins and D. A. Nicole. Named Graphs as a Mechanism for Reasoning About Provenance. In X. Zhou, J. Li, H. Shen, M. Kitsuregawa, and Y. Zhang, editors, Frontiers of WWW Research and Development - APWeb 2006, volume 3841 of Lecture Notes in Computer Science, pages 943–948. Springer Berlin Heidelberg, 2006.
[68] R. West, E. Gabrilovich, K. Murphy, S. Sun, R. Gupta, and D. Lin. Knowledge Base Completion via Search-based Question Answering. In Proceedings of the 23rd International Conference on World Wide Web, WWW ’14, pages 515–526, New York, NY, USA, 2014. ACM.