A Statistical Comparison of Current Knowledge Bases

Michael Färber Achim Rettinger Institute AIFB Institute AIFB Karlsruhe Institute of Technology (KIT) Karlsruhe Institute of Technology (KIT) Karlsruhe, Germany Karlsruhe, Germany [email protected] [email protected]

ABSTRACT Hence, our main contributions in this paper are: In the last years, many knowledge bases have been developed and used in real-world applications. These include DBpedia, • We calculate a variety of statistical measurements on Wikidata, and YAGO which all cover general knowledge and the widely used KBs DBpedia, Wikidata, and YAGO. therefore similar topics. In this poster, we present statistical • We give an analysis regarding these results. measurements on these KBs. Our experiments reveal that despite that fact that these KBs cover the same domains • We make our framework for statistical analysis of KBs to a considerable amount, they diﬀer from each other sig- available for the public2 so that other KBs can be easily niﬁcantly w.r.t. their graph-based structure and ontological integrated. aspects. The remainder of this paper is organized as follows: First we Categories and Subject Descriptors give an overview of related work of semantic graph analysis. H.4 [Information Systems Applications]: Miscellaneous; We then introduce the KBs which we selected for our analy- D.2.8 [Software Engineering]: Metrics—complexity mea- sis, and provide details regarding the current versions of the sures, performance measures KBs. We then present the results of applying several graph- based and semantics-based metrics on the KB datasets in Keywords question. After discussing particularities of our analysis in Section 3, we conclude in Section 4. Knowledge Bases, Knowledge Graphs, Statistics, Metrics

1. INTRODUCTION 2. COMPARISON OF KNOWLEDGE BASES In the last years, several knowledge bases (KBs) have been 2.1 Related Work developed and found their way into industrial applications. Firstly, some work on the analyis of the graph structure of Although KBs have been used a lot, to the best of our knowl- the (HTML) Web has been carried out. Early studies of Web edge, comparative studies on the statistical characteristics of topology were published already in the 1990s (see, e.g., [2]). KBs are very limited so far. This is in particular true for the In 2000, Broder et al. [3] found out that the structure of the KBs DBpedia, Wikidata, and YAGO. These KBs are freely Web can be modeled in the shape of a bow tie. Rather re- available and do not cover a specific domain, but general cently, Donato et al. [4] developed some models which were knowledge in general. In this paper, we focus on these KBs brought into accordance with their crawl dataset regarding and exhibit their particularities w.r.t. their structural and some characteristics such as the power law distribution for ontological conditions. Based on the fact that these KBs are degree. – from a conceptual point of view – directed graphs consist- Secondly, related work has been carried out on the analysis ing of RDF triples,1 we come up with simple graph-based of the Linked Open Data (LOD) cloud:3 Rodriguez [9], for and RDF-based metrics such as in-degree, out-degree and a instance, analyzed the graph of data sources in the LOD variety of other metrics. Given the results of these metrics, cloud. Among other things, he concluded that, despite the we can gain a better insight into the particularities of these general assumption of the LOD cloud being a crowded“ravel”, current KBs and learn to what extent they differ from each the LOD cloud can be disaggregated into a component around other. DBpedia and another component around DBLP.4 Gueret et 1 al. [6] confirmed that observation, but added a third com- See http://www.w3.org/RDF/. ponent around UniProt.5 Thirdly, a few analyses of single ontologies [11, 7, 5] were made – as we do it in this paper: Theoharis et al. [11] focus on power-law degree distributions. According to them, 2The implementation of the framework is available for down- load at http://www.aifb.kit.edu/web/KB-Statistics. 3See http://lod-cloud.net. 4See http://dblp.uni-trier.de. DBLP contains biblio- graphical information and is not domain-independent. 5See http.//www.uniprot.org.

18 ontologies exhibit power law degree distributions as soon as the integration of Freebase data.12 Our experiments they have a sufficient number of predicates or classes. In on Wikidata are based on the Wikidata simple state- this paper, we also calculate degree distributions and exam- ments dataset from February 2015.13 ine whether they follow a power-law. Hoser et al. [7] applied • YAGO: YAGO14 – Yet Another Great Ontology – social network analysis on the two ontologies SWRC6 and has been developed at the Max Planck Institute for Suggested Upper Merged Ontology (SUMO).7 According to Computer Science in Saarbrucken since 2007. YAGO the authors, eigenvalue analysis provides deep insights into ¨ comprises information extracted from the Wikipedia, the structure and focus of the ontology. In our work, in con- WordNet15, and GeoNames.16 As of March 24, 2015, trary, we do not take eigenvectors into consideration. In the YAGO3 is available, which we use in our experiments. context of describing and evaluating a benchmark generator Since the YAGO3 data set was not available in triple for Linked Data, Duan et al. [5] used measurements such as format at the time of the experiments, we transformed indegree and number of distinct subjects/objects of specific the available tsv files into the triple format. KBs such as DBpedia and YAGO (as of 2011). Their work is therefore mostly related to our work. Duan et al. found out that there is a bad fit between the degree distribution 2.3 Analysis of the Knowledge Bases of the Semantic Web benchmark and curated Linked Data 2.3.1 Number of Triples datasets. They propose a new metric called coherence since Comparing the number of triples in the different KBs (see the existing graph-based metrics do not make a point about Figure 1a), we can see that YAGO has much more triples the quality of a KB. However, as we see in our experiments, than DBpedia or Wikidata. One reason for that might be this metric is not properly applicable for our KBs. that in case of YAGO (and Wikidata) there was only one dataset with all covered languages given (containing labels 2.2 Overview of the Knowledge Bases in different languages), while for DBpedia we could restrict In the following, we shortly describe the different KBs which the KB to the English language. Wikidata is rather small, we analyze in the following sections. We focus on these three since knowledge stored in Wikidata was not extracted from KBs since they cover general, cross-domain knowledge and one text corpus – as in case of DBpedia –, but created by similar topics. users of the Wikidata community. 8 • DBpedia: DBpedia is the most popular and promi- 2.3.2 Disk Space nent KB in the LOD cloud [1]. Since the first public As visible in Figure 1b and as expectedly, the measured disk release in 2007, DBpedia is updated roughly once a 9 space is directly correlated to the number of triples. Figure year. DBpedia is created from automatically-extracted 1c shows the relative disk space. Interesting is here the fact structured information contained in the Wikipedia, such that – despite the relatively small number of triples – Wiki- as from infobox tables, categorization information, geo- data requires much less disk space than the other KBs. The coordinates, and external links. Due to its role as the reason for that is that Wikidata uses non-human readable hub of Linked Open Data, DBpedia contains many URIs (such as http://wikidata.org/entity/Q1040) while links to other datasets in the LOD cloud. DBpedia is the other KBs rely on human-readable URIs (e.g., http:// used extensively in the Semantic Web research com- dbpedia.org/resource/Karlsruhe and http://yago.org/ munity, but is also relevant in commercial settings: resource/Karlsruhe). In case of Wikidata, the human- companies use it to organize their content, such as the readable labels for entities and properties are stored sep- BBC [8] and the New York Times [10]. In our experi- arately. ments, we use the latest version of DBpedia, which is 10 DBpedia 2014. 2.3.3 Number of Distinct Subjects and Number of • Wikidata: Wikidata11 started on October 30, 2012 Distinct Objects as a project of Wikimedia Deutschland. The aim of Comparing the number of distinct subjects across the KBs the project is to provide data which can be used by in question (see Figure 1d) and the number of distinct ob- any Wikimedia project, including Wikipedia. Wiki- jects (see Figure 1e), it becomes apparent that DBpedia has data does not only store facts, but also the corre- relatively few distinct subjects, but instead more distinct sponding sources, so that the validity of facts can be objects. In other words: The set of resources with outgoing checked. Labels, aliases, and descriptions for entities edges is significantly smaller than the set of resources with in Wikidata are provided in more than 350 languages. incoming edges (ratio 1 : 1.6). YAGO, in contrast, has the Wikidata is a community effort, i.e., users collabora- opposite characteristic (ratio 21 : 1). Figure 1f and 1g show tively add and edit information. Also, the schema is the ratio of the set of distinct subjects/objects w.r.t. to the maintained and extended based on community agree- entire set of resources in the KBs. Notable is that in case of ments. In the near future, Wikidata will grow due to YAGO, only to relatively few resources is linked. 6 See http://ontobroker.semanticweb.org/ontologies/ 12See https://plus.google.com/u/0/ swrc-onto-2001-12-11.oxml. 109936836907132434202/posts/bu3z2wVqcQc 7 See http://www.ontologyportal.org. 13See http://tools.wmflabs.org/wikidata-exports/rdf/ 8See http://dbpedia.org. exports/20150223/. 9There is also DBpedia live which is updated when 14See http://www.mpi-inf.mpg.de/departments/ Wikipedia is updated. See http://live.dbpedia.org. databases-and-information-systems/research/ 10See our website for a list of the dump files used in our yago-naga/yago/downloads/ experiments. 15See https://wordnet.princeton.edu. 11See http://wikidata.org. 16See www.geonames.org.

19 8 10 8 x 10 Number of triples in each KB x 10 Disk Space Relative Diskspace x 10 Number of subjects 12 16 160 3.5

14 140 10 3 12 120 2.5 8 10 100 2 6 8 80 1.5 6 60 4

Number of triples 1 4 40 Number of subjects

2 Disk Space usage [Byte] 2 20 0.5 Relative Diskspace [Byte per Triple] 0 0 0 0 DBpedia 2014 Wikidata YAGO3 DBpedia 2014 Wikidata YAGO3 DBpedia 2014 Wikidata YAGO3 DBpedia 2014 Wikidata YAGO3

(a) Number of triples (b) Disk space used (c) Relative disk space (d) Number of distinct subjects

7 Number of objects Subject Resource Ratio Object Resource Ratio 4 x 10 x 10 Number of properties 5 100 100 10

4 80 80 8

3 60 60 6

2 40 40 4 Number of objects 1 20 20 Number of properties 2 Object Resource Ratio [%] Subject Resource Ratio [%]

0 0 0 0 DBpedia 2014 Wikidata YAGO3 DBpedia 2014 Wikidata YAGO3 DBpedia 2014 Wikidata YAGO3 DBpedia 2014 Wikidata YAGO3

(e) Number of distinct objects (f) Subject Resource Ratio (g) Object Resource Ratio (h) Number of distinct properties

5 4 x 10 Number of classes x 10 Avg no. of instances per class Average indegree Average indegree with literals 5 3.5 70 20

3 60 4 15 2.5 50

3 2 40 10 2 1.5 30 Average indegree Average indegree

Number of classes 1 20 5 1 0.5 10

0 Average number of instances per class 0 0 0 DBpedia 2014 Wikidata YAGO3 DBpedia 2014 Wikidata YAGO3 DBpedia 2014 Wikidata YAGO3 DBpedia 2014 Wikidata YAGO3

(i) Number of distinct classes (j) Average number of in- (k) Average indegree (l) Average indegree with liter- stances per class als Indegree Distribution Average outdegree Outdegree Distribution 8 10 10 20 10 DBPedia 2014 DBPedia 2014

Wikidata 8 Wikidata 6 YAGO3 10 YAGO3 10 15

6 10 4 10 10 4 10

Number of nodes 2 Number of nodes

10 Average outdegree 2 5 10

0 0 10 10 0 2 4 6 8 0 0 2 4 6 8 10 10 10 10 10 DBpedia 2014 Wikidata YAGO3 10 10 10 10 10 Indegree Outdegree (m) Indegree distribution (n) Average outdegree (o) Outdegree distribution

Figure 1: Statistics for the three KBs DBpedia, Wikidata, and YAGO.

2.3.4 Number of Distinct Properties 2.3.5 Number of Distinct Classes From our analysis regarding the number of distinct proper- For calculating the number of distinct classes (see Figure 1i), ties (see Figure 1h) we can derive that the used Wikidata we iterated over all instances contained in the KB datasets RDF version contains only around 1,323 distinct properties. and took the objects of the relation rdf:type.19 Although The reason for that is that properties are carefully intro- DBpedia often contains several classes according to this class- duced by the Wikidata community and go through an ex- assignment method, we only retrieved 526 distinct classes. tensive discussion process before they are released for usage. The small number in case of Wikidata can be justified again DBpedia contains many properties. However, they are very by the community approach of Wikidata. YAGO has a as- heterogeneous and the non-mapping-based properties17 (i.e., tonishing number of distinct classes since YAGO is mainly properties which were extracted not based on human-defined an ontology, i.e., containing class-based information such as mappings, but solely as they appeared in the info-boxes in the classes of the WordNet taxonomy. This last fact be- Wikipedia) are often very noisy.18 A similar situation holds comes apparent in Figure 1j where the average number of for YAGO. instances per class is visualized. 2.3.6 Indegree 17I.e. properties having the URI prefix Comparing the average indegree (defined as the average num- http://dbpedia.org/property/. ber of inlinks per node; see Figure 1k) where no triples with 18There are, for instance, 53,930 triples with the property http://dbpedia.org/property/s in DBpedia 2014 which 19Standing for http://www.w3.org/1999/02/ has obviously no meaning. 22-rdf-syntax-ns#type.

20 literals (values) on the object position were considered and 4. CONCLUSIONS comparing the average indegree where triples with literals A measurement how current knowledge bases such as DB- were considered in addition (see Figure 1l), we can see that pedia, Wikidata, and YAGO look like and how they are in general (i.e., for all KBs) the average indegree with liter- structured, is to a large extent missing. In this paper, we als is much lower than the average indegree where no literals presented a (freely available) framework for statistical anal- were counted. The indegree for DBpedia and Wikidata is ysis of KBs where any KB with triple format can easily be roughly the same. One reason might be that a considerable integrated. We calculated a variety of statistical measure- amount of Wikidata was taken from Wikipedia. It can be ments on the KBs DBpedia, Wikidata, and YAGO, since assumed that YAGO has a higher average indegree than DB- they all cover general knowledge and are used in many ap- pedia and Wikidata, since YAGO comprises many different plications. Our investigations revealed that all current KBs ontologies. performed very differently w.r.t. the presented metrics. The indegree distribution diagram (see Figure 1m) shows almost ideal logarithmic decreases of the number of nodes Acknowledgement for all considered KBs. This is especially interesting since all This work was carried out with the support of the German KBs were created in different ways: automatically extracted Federal Ministry of Education and Research (BMBF) within from Wikipedia (DBpedia), partly created by the commu- the Software Campus project SUITE (Grant 01IS12051). nity (Wikidata), or composed of several sources which were used partly automatically, partly manually (YAGO). In the 5. REFERENCES light of the figure we can also confirm that the power law [1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, is still applicable to the indegree distribution of semantic R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a graphs such as the considered KBs. Web of Open Data. In Proceedings of the 6th ISWC and 2nd ASWC, pages 722–735. Springer, 2007. 2.3.7 Outdegree [2] T. Bray. Measuring the Web. In Proceedings of the Considering the average outdegree for each KB (defined as Fifth International World Wide Web Conference on the average number of outlinks per node; see Figure 1n), we Computer Networks and ISDN Systems, pages can see that nodes in the DBpedia knowledge graph have the 993–1005. Elsevier Science Publishers B. V., 1996. highest number of outgoing links on average. Wikidata con- [3] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, tains currently some domains of knowledge which are repre- S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. sented very densely (such as persons) while other domains Graph structure in the web. Computer networks, are rarely covered yet. On average, however, Wikidata per- 33(1):309–320, 2000. forms similarly as YAGO w.r.t. the average outdegree of [4] D. Donato, L. Laura, S. Leonardi, and S. Millozzi. nodes.20 The Web As a Graph: How Far We Are. ACM Trans. The average outdegree of the KBs (see Figure 1o) suggest Internet Technol., 7(1), Feb. 2007. – as in the case of the average indegree – a power law dis- [5] S. Duan, A. Kementsietsidis, K. Srinivas, and tribution. However, if the outdegree is low, the power law O. Udrea. Apples and Oranges: A Comparison of distribution is broken. This confirms the theory of [11] which RDF Benchmarks and Real RDF Datasets. In states that a sufficient number of predicates or classes is nec- Proceedings of the 2011 ACM SIGMOD, pages essary for observing a power law distribution. 145–156, New York, NY, USA, 2011. ACM. [6] C. Guéret, S. Wang, and S. Schlohbach. The Web of 3. LESSONS LEARNED Data is a Complex System – First Insight into Its According to Theoharis et al. [11], ontologies exhibit power Multi-Scale Network Properties. In Proceedings of the law degree distributions as soon as they have a sufficient European Conference on Complex Systems, pages number of predicates or classes. Based on our experiments, 1–12, 2010. we can confirm that for the KBs we considered. [7] B. Hoser, A. Hotho, R. Jäschke, C. Schmitz, and Duan et al. [5] stated that “traditional” graph analysis met- G. Stumme. Semantic Network Analysis of Ontologies. rics such as the degree or the number of classes are not suit- In Y. Sure and J. Domingue, editors, The Semantic able when KBs should be compared. Given our experimen- Web: Research and Applications, pages 514–529. tal results, we can confirm that to a certain extent. Duan Springer Berlin Heidelberg, 2006. et al. proposed a new metric called coherence metric where [8] G. Kobilarov et al. Media Meets Semantic Web – How the “filling degree” of all entities of the different classes is the BBC Uses DBpedia and Linked Data to Make calculated and aggregated. This might be a good indicator, Connections. In Proceedings of the 6th ESWC, pages however, the calculation for our KBs is tricky, since we of- 723–737, Berlin, Heidelberg, 2009. Springer. ten do not know the set of possible properties an entity of [9] M. A. Rodriguez. A graph analysis of the Linked Data a specific class is able to have. Iterating over all existing cloud. arXiv preprint arXiv:0903.0194, 2009. properties of entities of this class is problematic since the [10] E. Sandhaus. Semantic Technology at the New York KBs are often very noisy (different properties use the same Times: Lessons Learned and Future Directions. In meaning, different object types are used for the same prop- Proceedings of the 9th ISWC, pages 355–355, Berlin, erty, etc.) and the considered KBs may contain multiple Heidelberg, 2010. Springer-Verlag. classes per instance. [11] Y. Theoharis, Y. Tzitzikas, D. Kotzinos, and V. Christophides. On Graph Features of Semantic 20The outlier where the outdegree is 108 can be traced back to the fact that Wikidata contains many blank nodes with Web Schemas. IEEE Trans. on Knowl. and Data a high outdegree. Eng., 20(5):692–702, 2008.