Wikidata from a Research Perspective - a Systematic Mapping Study of Wikidata

Home , Data quality, Infobox, Knowledge graph, Wikidata

Wikidata from a Research Perspective - A Systematic Mapping Study of Wikidata

Mariam Farda-Sarbas Claudia Müller-Birn [email protected] [email protected] Human Centered Computing | Freie Universität Berlin

ABSTRACT and can be accessed using SPARQL5. Wikidata is designed a) to be Wikidata is one of the most edited knowledge bases which contains open for anyone to edit (with or without Wikidata account), b) to structured data. It serves as the data source for many projects in the allow conflicting ideas to coexist, c) to provide data in a language in- Wikimedia sphere and beyond. Since its inception in October 2012, dependent form, d) to be controlled by the contributing community, it has been increasingly growing in term of both its community e) to provide data with references, and f) to continuously evolve to and its content. This growth is reflected by an expanding number address the emerging needs of the users and contributors [72]. of research focusing on Wikidata. Our study aims to provide a gen- As a multilingual knowledge base, Wikidata can provide data in eral overview of the research performed on Wikidata through a any context and language as long as it is available. Thus, the pro- systematic mapping study in order to identify the current topical vided data is already being used by other projects such as, WDAqua- 6 7 8 coverage of existing research as well as the white spots which need core , WikiGenomes and Open Street Map and the increasing further investigation. In this study, 67 peer-reviewed research from number of contributors and contributions show the rising interest journals and conference proceedings were selected, and classified in Wikidata. into meaningful categories. We describe this data set descriptively At the same time, the research community interest on Wikidata by showing the publication frequency, the publication venue and has accumulated recently, and this is an indication of its growing the origin of the authors and reveal current research focuses. These popularity. Numerous studies have explored Wikidata from various especially include aspects concerning data quality, including ques- angles, such as its internal structure, including both, data and com- tions related to language coverage and data integrity. These results munity, from a data perspective by looking at its completeness and indicate a number of future research directions, such as, multilin- coverage, from a engineering perspective by looking at the needed gualism and overcoming language gaps, the impact of plurality tools, and by an application perspective by providing case studies in on the quality of Wikidata’s data, Wikidata’s potential in various using Wikidata for projects in medicine, linguistics, or geography. disciplines, and usability of user interface. However, this research seems to be scattered over different research fields in disciplines and it is challenging to develop a mental map KEYWORDS of the existing state of the art of research. Motivated by this obser- vation,we conducted our study, which summarize and reflects on Wikidata, mapping study, research classification the insights of existing research and give an overall overview of what studies have been carried out so far, and what topics needs to 1 INTRODUCTION be explored in future research. Wikidata, the sister project of Wikipedia, is a collaborative knowl- A systematic mapping study provides a "map" of a research area. edge base (KB)1, which is openly accessible and contains human- It helps to shape research directions by revealing existing topics readable, as well as, machine-readable data. Wikidata was launched which aid to identify white spots [18]9. It is a way of getting an in October 2012 and since then, it has been one of the most often overall overview of the research performed in an area of interest edited knowledge bases with around 20,000 active users2. The main and classify the relevant research to get a better understanding of arXiv:1908.11153v2 [cs.DL] 18 Nov 2019 goal behind Wikidata’s development is to provide structured data which areas have been covered so far and provide a baselines to for Wikimedia projects to overcome the data inconsistencies of assist new research efforts37 [ ]. In our mapping study, we summa- Wikipedia’s language versions. Wikidata is designed in a way that rize what have been researched so far about Wikidata, when, from anyone can edit, browse, consume, and reuse the data in a fully which origins and where they were published. We also identify multilingual form [71]. which aspects of Wikidata has got more attention in the research Wikidata’s content is also stored as a knowledge graph (KG)3. community and which aspects are not yet given much efforts to Thus, the data are provided in a structured form in the RDF4 format study, by classifying and categorizing existing research from Octo- ber 2012 to June 2018. Based on the search results from academic 1A knowledge base is a centralized repository of data, which stores data in any form, such as in a tabular or graph format. 5 2 SPARQL is the abbrviation for Simple Protocol and RDF Query Language. https://www.wikidata.org/wiki/Wikidata:Statistics 6 3 A question answering service for RDF knowledge bases [15]. We refer to knowledge graph when we mean a knowledge base which stores the data 7A community created for consuming and curation gene annotation data [16]. in graph format. 8 4 A peer production project that creates an editable global map [40]. RDF is the abbreviation for Resource Description Framework. 9A mapping study differs from a systematic literature review insofar that the later tackles a specific research question49 [ ], therefore, a mapping study can be seen as a 2019. pre-study of a systematic literature review. ,, Farda-Sarbas and Müller-Birn search engines, i.e. ACM, Springer Link, DBLP and Google Scholar, Table 1: Search results from academic search engines. we identified 1,497 search results. All papers were screened and when needed, read in more detail for a more accurate decision for Search Engines Search First After inclusion of the papers in the final data set. Finally, all needed infor- Results Screening Selection mation was extracted from the final set of 67 papers to answer the ACM 53 53 21 research questions as listed in the following section. With this map- DBLP 68 44 14 ping study, we make the following contributions: (1) we provide an SpringerLink 379 329 21 overall overview of the current state of Wikidata research, (2) we Google Scholar 997 699 11 identify the research areas of Wikidata where research needs to be Total 1,497 1,125 67 deepen, and (3) we suggest future research areas. This article is organized as follows. In Section 2, we explain our approach and research method in more detail. Then in Section 3, which did not support “month+year” format). The search strategy we present our findings, and discuss and reflect them in Section 4. for this study is an automated search using digital libraries. We In the final Section 5, we provide a conclusion on our research. obtained Wikidata research from the ACM Digital Library (ACM DL)10, the Springer Link Digital Library (Springer Link)11, and the 2 RESEARCH METHOD Digital Bibliography & Library Project (DBLP)12. ACM DL and DBLP are bibliography search engines specifically for Computer Our research method is motivated by our goal to provide a general Science. Although, Springer Link provides results from a broader overview of the field by identifying the topics that are well-studied range of fields such as, social sciences and humanities, we decided and derive the open spots in research [18]. Mapping studies are to extend the scope of the search in order to achieve a more holistic insofar a suitable instrument since they provide the ground and image of the current state of research on Wikidata from different directions of the future research as well as educate the members of disciplines. Thus, we included search results from Google Scholar a community [36]. Search Engine (Google Scholar) as well13. Our study adopts guidelines for systematic mapping studies The ACM DL searches keywords everywhere in the text, and only which are defined by Petersen et50 al.[ ]. In the following, we annual date settings are possible. We received 53 articles. Springer describe every step to ensure that our results are comprehensible. Link was also searched with the same keyword and time range as Next, we introduce our research questions which frame the main ACM and returned 379 results. The search interface on DBLP does goals of the study and inform the data collection process. not provide a time range selection, however, it returned the results 2.1 Research Questions from 2012 till now, which resulted in 68 papers. Google Scholar Search Engine was searched through Harzing’s Publish or Perish14 In this study we want to provide an overview of Wikidata from software with the same criteria. The number of search results from a research perspective. The peer production system Wikipedia, Google Scholar was 997. This large number is caused by the fact for example, has already drawn research from a myriad of disci- that Google Scholar returns technical reports, white papers and plines [47] and the question is, whether we have the same situation theses as well. The total number of articles in the first stage was in the context of Wikidata. Our research is guided by the following 1,497 (cp. Table 1). questions: RQ1 What high-quality research has been conducted with Wiki- 2.3 Criteria Exclusion and Inclusion data as a major topic or data source? We defined inclusion criteria to find the most relevant research RQ2 What types of research have been published, when (year) papers. The defined criteria for exclusion are duplicates, results in and where (journals or conferences)? languages other than English and results which are not published RQ3 What are the origins of the research (which countries, and in journals or conference proceedings, such as, websites, reports institutions)? and data sets, theses and books. RQ4 Which aspects of Wikidata are covered by considered re- In a first step, we already excluded 160 non-English search results search and which aspects are still to be studied? (145 from Google Scholar and 15 from Springer Link), second, 132 In the following, we describe in more detail, how and where we duplicates (results which were received by more than one search en- searched articles, which papers we included or excluded respec- gine), and third, 80 non-papers (citations, refworks, reports, datasets tively, and finally what categories we derived from the articles. and books). Thus, the remaining 1,125 search results were subject to an inclusion process (cp. Figure 1). 2.2 Search Process and Data Sources 10ACM DL is available at: https://dl.acm.org/. Data collection is a crucial step in any research since findings are the 11Springer Link is available at: https://link.springer.com/. direct result of the gathered data. We defined the needed keywords 12DBPL is available at: https://dblp.uni-trier.de. which is a first step for searching literature. As the noun “Wikidata” 13Semantic Scholar (https://www.semanticscholar.org) is another source for Wikidata research papers, however, the filtering mechanism of this system was functioning is only used as the name of the structured data source so far, and unexpectedly and the results were not reproducible. Although we contacted the Se- has no further meanings, the search string was simply selected as manticScholar team, the issue could not be solved, and thus, this search engine was “Wikidata” in order to identify a broad range of related literature. not included in the study. 14Harzing’s "Publish or Perish" provides an interface to use Google Scholar and export Similarly, as Wikidata was launched in October 2012, the time all results in a number of formats. In this study we used the CSV format. The software range was defined from 10/2012 till 06/2018 (some search engines is available from http://www.harzing.com/pop.htm. Wikidata from a Research Perspective - A Systematic Mapping Study of Wikidata , ,

Table 2: Classification of Wikidata research papers.

Selection Apply inclusion Apply inclusion Applying search Remove non- Remove dupli- Remove non- Remove theses Selected papers or exclusion or exclusion Process on databases English results cates papers and short papers for the study based on titles based on content Sub-categories Labels Papers Sum

1500 1000 Community- Design Decision 3

100 # of included articles 1497 1337 1205 1125 917 86 67 67 oriented WD Community 5 14 10 log (10) Research Multilingualism 6 -10 # of excluded articles 160 132 80 208 831 19 -100 Engineering-oriented Enhancement Features 4 -1000 9 -1500 Research Vandalism Detection 5 Medical & Biological 4 Application Figure 1: Article selection process and the number of in- Data 7 Use Cases cluded search results. Linguistics 3 Knowledge Comparison of KGs 7 Graph Oriented Common issues of KGs 3 15 In a second step, we included research papers only, if they are Research Wikidata as Linked 5 published in academic journals or conference proceedings and are Data Provider full research papers with at least five pages. The latter criterion Data-oriented Data Quality Issues 9 22 is based on the reasoning that articles with four or less pages are Research considered as short papers, and usually are posters, position or Tools & Datasets 13 demonstration papers. After applying the aforementioned criteria on the remaining 1,125 articles, 208 articles were excluded by read- 30 27 15 ing the titles and, another 833 papers were excluded after reading 25 the abstracts, because they were not focused on Wikidata. After 20 17 reading the articles in more detail, another 17 could be identified 15 12 10 as (bachelor and master) theses and short papers. 5 5 5 In total, a majority of the 1,497 found articles were excluded and 1 0 June 2012 2013 2014 2015 2016 2017 only 67 papers remained in our sample. The reason for exclusion 2018 of this large number of search results were that we carried out a full text search of the term Wikidata. The search engines returned Figure 2: Frequency of publications per year. results which contained this term, even if it was used only once. As we intended to include only papers which focus solely on Wikidata, we had to exclude a large number of results. Another reason was more insights about each research, and therefore, we read the arti- that Google Scholar returned results which were not only papers. cle in more detail by focusing on the findings. The tools used for Our further discussion is based on these 67 articles. The resulting data extraction and analysis are Zotero and Microsoft Excel. data set is available on Zotero16. All papers are also listed in the references section. The ones marked with asterisk, are references 2.5 Research Paper Classification that are not part of the mapping study. At this stage, we read the abstract, introduction and conclusion parts of all articles to get more insights about each research for cat- 2.4 Data extraction egorization. In a number of cases, further sections of the papers had Within the data extraction part of our study, we specified what to be read to get a better understanding of the scope and the topic data we want to extract from our data set. Having a uniform data of the paper. After analyzing 67 papers, we manually categorized extraction form reduces both, bias and internal validity threats. We the papers as shown in Table 2. In order to define the categories, developed a data extraction form, to answer the research questions we started with descriptive labels for each paper. After reading of this study as stated in Section 2.1.We extracted title, author(s), a few papers, we tried to identify categories at a higher level of abstract, date, publisher to answer RQ1, RQ2 and RQ4. However, abstraction. We compared our categories throughout the reading RQ3 required manual extraction of the institutions and countries process to make sure that our coding scheme stays consistent. where the first author of the paper had performed the research. We focused on institutions rather than the first authors themselves, be- 3 OVERVIEW OF DATA SET cause some authors published by different institutions. One author, In this section, we describe the resulting dataset of articles in more for example, published a research paper from institution A, and details. First, we look at the frequency of publications, second, we later joined institution B and published there. It would be difficult determine the publication venues and third, we identify where the to select one institution as the origin of that author. RQ4 required research was published. Finally, we look at the geographical origin of the Wikidata research. 15The article excluded here were not caught automatically and detected after reading, which were either books, tutorials or teaching material on semantic web technologies, welcome notes of conference or workshop proceedings, blog posts or studies on 3.1 Frequency of Publication Wikipedia, DBpedia or YAGO for instance. The dataset can be shared on request. 16All papers are available in Zotero: https://www.zotero.org/groups/2212336/wikidata_ The majority (39, 58%) of the included 67 research papers, are re- study. cent researches from 01/2017 till 06/2018 (cp. Figure 2). This is an ,, Farda-Sarbas and Müller-Birn

Netherlands 1 Italy 4 3.4 Research Topics of Wikidata France 4 Germany 22 UK 7 Denmark 4 Spain 1 Austria 3 The main research topics of Wikidata which are obtained after Portugal 1 classification, are listed in Table 2 and explained in Section 4. While, there exist a number of research topics in Wikidata, there is still Russia 1 potential, as we show in Section 5, for usage of Wikidata for a USA 7 China 2 variety of purposes. Thailand 1 4 FINDINGS Brazil 2 Australia 1 Chile 4 In this section, we describe the state of the art in each of the defined Argentina 1 Tunisia 1 categories. The first section (cp. Section 4.1) comprises articles that look at Wikidata from a community perspective, the second section (cp. Section 4.2) contains articles from engineering perspective, the third section (cp. Section 4.3) focuses on usage of Wikidata in Figure 3: Research contributions from countries and conti- certain fields, the fourth section (cp. Section 4.4) discusses Wikidata nents. from a KG perspective, and final section (cp. Section 4.5) consists research on the data perspective. indication that Wikidata has gained more awareness in the research 4.1 Community-oriented Research community. Starting from 2012, except for 2013, this number has Is Wikidata just another peer production system? The research expanded each year. Considering this growth and the number of in this category reflects on Wikidata’s goals and features, exist- studies until June 2018 (12), the number of research articles on ing design decisions (esp. multilingualism), analyzes the Wikidata Wikidata are expected to reach between 40-50 by the end of 2018. community and their participation patterns. 4.1.1 Design Decision. The research articles in this section provide 3.2 Publishers and Publication Types mainly an overview on Wikidata and introduce its features and The most popular publishers for Wikidata research are Springer design principles. with 21 articles and ACM with 19 articles, and the most popu- One of the first articles on Wikidata is by Vrandečić, whomo- lar journal for publishing Wikidata research articles is The Se- tivates the need for integrating existing structured data from the mantic Web Journal. Among the 67 papers of this study, most various Wikipedia language versions into one single repository of them (53, 79%) are published as conference papers and the in order to overcome existing data inconsistencies [71]. The main resthttps://www.overleaf.com/project/5c3f2935235d8259ff21db4e are distinguishing features of Wikidata according to Vrandečić et al. journal articles (14, 21%). Thus, conference proceedings are the most are, being available internationally and support for multilingualism, popular publication type in Wikidata. storing links to facts as a secondary database, and the ability to store The most popular conferences where Wikidata research was contradictory facts to represent knowledge diversity [72]. Voß [70] 17 presented are the TheWebConf (The Web Conference) , ISWC discusses extraction and classification of knowledge organization (International Semantic Web Conference), OpenSym (The Interna- systems based on Wikidata. tional Symposium on Open Collaboration), ESWC (Extended Se- mantic Web Conference)18, WSDM (ACM International Conference 4.1.2 Participation Patterns of the Community. This section reflects on Web Search and Data Mining) and MTSR (Research Conference the efforts made to understand who are Wikidata’s contributors on Metadata and Semantics Research). and what participation patterns do they follow. Steiner [63], for example, develops an application which is capable of monitoring 3.3 Geographical Origins of Research real-time edit activity of all language versions of Wikipedia and Wikidata. We found that Europe (70%) is the dominating contributor in Wiki- Müller-Birn et al. [43] analyze the contribution patterns of the data research, with Germany being the leading country and United Wikidata community to better understand whether Wikidata com- Kingdom the second. America (20%) has also contributed in research munity participation pattern follows a peer-production approach focusing on Wikidata, with the US having the most contributions like Wikipedia, or a collaborative ontology engineering approach. (cp. Figure 3). The study also describes the characteristics of the Wikidata com- Regarding the most active contributions of institutions, the find- munity as, registered users, anonymous (not registered or logged ings show that University of Southampton has had the most contri- in) and bots. Based on the results of this study, Cuong et al. [13] butions (7 articles), following by the Chile University (4 articles). study the dynamics of Wikidata community participation process, University of Lyon, TU Dresden and TU Denmark have the same to know how the participation patterns of the community change level of contributions (3 articles) on the third place, while the other over time. Piscopo et al. [54] extends this line of research by study- contributions come across German Universities mainly. ing the participation patterns of Wikidata community members, from being an editor to becoming a community member and inves- 17Formarly known as International Conference on World Wide Web (WWW) tigate on how these patterns evolve. In another study Piscopo et 18Formarly known as European Semantic Web Symposium (ESWS) al. [53] analyzed the relationship between group composition of Wikidata from a Research Perspective - A Systematic Mapping Study of Wikidata , , bots, and humans (registered or anonymous) and the item quality the data quality and trustworthiness of a KB. In the following, we in Wikidata. In their research, they focussed on the knowledge provide an overview of research that focuses on detecting vandalism base but highlighted the importance of considering the knowledge and other efforts for making Wikidata more robust. graph in future research. In their study, Heindorf et al. [29] present a new machine learning- based approach for the automatic detection of vandalism in Wiki- 4.1.3 Multilingualism. Multilingualism is one of the design princi- data. Sarbadani et al. [61] develop a vandalism detection mechanism ples of Wikidata. Wikidata stores data in a language independent for Wikidata by adapting methods from the Wikipedia vandalism form and aims to provide data to anyone, anywhere in the world. detection literature and extending it to Wikidata’s structured knowl- This section comprises studies that focus primarily on this design edge base. The mechanism used identifies damaging changes and principle. classifies edits as vandalism in real time, using a machine classifica- Samuel [60] describes the multilingual collaborative ontology tion strategy. development process in Wikidata by explaining the development The ACM International Conference on Web Search and Data process of a new property and its major steps from being proposed Mining, held the competition for developing vandalism detection to get approved by the community and finally translated to other mechanisms for Wikidata, the WSDM Cup 2017.21 The main goal of languages. Kaffee et al.35 [ ] study the languages covered by Wiki- this competition was to develop a model for detecting malicious or data. Their results suggest that most of the labels and descriptions similarly damaging edits. As a result of participation in WSDM Cup on Wikidata are only available in a small number of languages like, 2017 both, Crescenzi et al. [12] and Grigorev et al. [25], presented English, Dutch, French, German, Spanish, Italian, and Russian. This their own vandalism detection mechanisms22. Crescenzi et al. [12] stands in contrast to the majority of languages which have close reflected on previous work of Heindorf et al.[29], an automatic data to no coverage. Kaffee et al. in another studies,33 [ , 34], investi- mining approach for vandalism detection in Wikidata. Grigorev et gate the generation of open domain Wikipedia summaries from al. [25] present an approach based on a linear classification model, Wikidata in “underserved languages” to overcome uneven content which according to authors, is faster compared to other existing distribution. Ta et al. [65] propose a mechanism to enrich Wikidata approaches. multilingual content by retrieving “semantic relations based on The evaluation of the proposed vandalism detection approaches alignment between info-box properties and Wikidata properties in at the WSDM Cup 2017, is done in [28]. Heindorf et al. [28] evaluate various languages”. Sáez et al. [64] investigate the development of their four baseline approaches23 along the five submissions. The “fully automatic methods” where info-boxes for Wikipedia can be study finds that the best approach is a semi-automatic scenarios generated from Wikidata descriptions. “where newly arriving revisions are ranked for manual review” is 4.2 Engineering-oriented Research from [12], while, the best approach in a fully automatic detection scenario “where the decision whether or not to revert a given This section contains all articles that suggest approaches and fea- revision is left with the classifier” is the baseline approach by the tures that enhance Wikidata’s functionality. These features are Wikidata Vandalism Detector (WDVD) system [29]. programmed for two main purposes: first, for improving the quality by adding new data or by interlinking with other sources, and second, for vandalism detection. 4.3 Application Use Cases From the beginning, Wikidata received many attentions from mem- 4.2.1 Enhancement Features. Wikidata’s functionality has evolved bers of various research fields. Many articles described possible use gradually with the needs of the community. This section contains cases for utilizing Wikidata as a central data hub, as we see in the research that proposes approaches that ease the process of adding next section. data to Wikidata either manually, or by using external data sources. Zangerle et al. [76] evaluate recommender algorithms, which as- 4.3.1 Medical and Biological Data. Recently, especially medical sist Wikidata contributors in the process of data insertion through and biological projects have started using Wikidata as a backend property recommendation. Pellissier Tanon et al. [48] introduces data source, to facilitate data exchange, mapping, and consumption. the Primary Sources Tool19 to facilitate the migration of the con- Mitraka et al. [41], for example, propose the usage of Wikidata for tent from Freebase to Wikidata. Sergieh et al. [42] propose an ap- addressing the crucial challenges in disseminating and integrating proach to bridge the missing linguistic information gap of Wiki- knowledge in life sciences contexts, by linking genes, drugs and data by aligning Wikidata with FrameNet20 lexicon. Hachey et diseases. Pfundner et al. [51] have specified an automated process to al. [26] present a neural network model for mapping structured and unstructured data and investigate the generation of Wikipedia biographic summary sentences from Wikidata.

4.2.2 Vandalism Detection. Wikidata provides data for Wikipedia 21More more information, please check https://www.wsdm-cup-2017.org/. and other Wikimedia projects; thus, the integrity and correctness 22The competition received five submissions: 1) Buffaloberry by Crescenzi et12 al.[ ], of data is of high importance. Vandalism detection is, therefore, an 2) Conkerberry by Grigorev [25], 3) Loganberry by Zhu et al. [77], 4) Honeyberry by Yamazaki et al. [6], and 5) Riberry by Yu et al. [7]. We included two of the submissions essential aspect of a knowledge repository and directly influences only, because [77] and [6] are short papers and [7] is not published. 23The four baseline approaches are: 1) Wikidata Vandalism Detector (WDVD) approach 19For more information, please check https://github.com/google/primarysources. from [29], 2) FILTER, a second baseline which contains trained data from 01.05.2013 20FrameNet is a lexical database of the English language. For more information, please to 30.04.2016, 3) ORES, the re-implementation of the approach in [61], and 4) META, a check https://framenet.icsi.berkeley.edu/. combination of all approaches in [27]. ,, Farda-Sarbas and Müller-Birn integrate data from ONC’s24 high priority DDI25 list into Wikidata. 4.4.1 Wikidata as Linked Data Provider. We summarize all articles The authors aim to integrate the data from ONC into Wikidata that propose approaches for storing Wikidata’s structured data in and then use Wikidata to display the integrated data in articles of RDF and on the other hand, suggest how projects in Wikimedia’s different Wikipedia language versions. Burgataller-Muehlbacher ecosystem can use the RDF data. et al. [10] import all human and mouse genes, and all human and Erxleben et al. [17] argue that despite being the data platform mouse proteins into Wikidata to improve the state of biological in the Wikimedia ecosystem, Wikidata provides its data not in data, and facilitate data management and data dissemination using RDF, which affects Wikidata’s popularity in the Semantic Web the WDQS of Wikidata. community negatively. Thus, the authors propose an RDF encoding Although, Wikidata is greatly being used in bioinformatics, it for Wikidata and introduce a tool 28 for creating such RDF file is still a challenging task for biologists to use it efficiently. One exports. Similarly, Hernández et al. [30] compare various options for major issue is for example, that the “structured query languages reifying RDF triples from Wikidata, and building on that study the like SPARQL are not commonly part of a researcher’s toolkit”. Thus, efficiency of various database engines for querying Wikidata [31]. E. Putman et al. [16] describe WikiGenomes, a web application From 2014, Wikidata stored its data in RDF based on the Erxleben based on Wikidata, that facilitates the “consumption and curation et al.-mapping [17] and provided the data via an SPARQL endpoint, of genomic data by the entire biomedical researcher community”. the Wikidata Query Service (WDQS)29. Bielefeldt et al. [8] analyzed WikiGenomes provides access to the centralized biomedical data the access logs from SPARQL endpoint and separate the bot-based and a simple user interface for non-developer biologists. from human-based traffic. As expected, the human part is smaller and shows clear trends, e.g. correlated to time of day, in comparison 4.3.2 Linguistics. Wikidata is also used in the linguistics field, ei- to the bot-based part which is “highly volatile and seems unpre- ther as a dictionary, or proposing further approaches for linking dictable even on larger time scales”. lexical datasets or relation extraction. Yang et al. [74] uses the data for improving Wikipedia. They Turki et al. [68] propose to adopt Wikidata as a dictionary which discuss that KGs can help machines to analyze plain texts, and can be used across multiple dialects of the Arabic language. The propose a Relation Linking System for Wikidata (RLSW) which authors emphasize that the Arabic language has many dialects and links the Wikidata KG to data in plain text format in Wikipedia. these dialects are not all mutually intelligible, and each one of them has its morphological and phonological and even semantic and lex- 4.4.2 Comparison of KGs. Next, we discuss articles which compare ical particularities. The study explains how it is possible to convert Wikidata with other general domain knowledge graphs. Ringler Wikidata into a multilingual multidialectal dictionary for Arabic di- & Paulheim [59], for example, study DBpedia, Freebase, OpenCyc, alects and describes how Wikidata (as a multilingual multidialectal Wikidata and YAGO knowledge graphs to find similarities and dif- dictionary for Arabic dialects) can be used by computational linguis- ferences of these KGs. Färber et al. compare in their research, KGs tics and computer scientists in the Natural Language Processing from a data quality perspective ( [20, 21]). Razniewski et al. [58] of the varieties of the Arabic language. Nielsen et al. [44] describe discuss the challenges of asserting completeness in KGs, and outline an ongoing effort for linking ImageNet26 WordNet27 synsets to possible solutions. The authors propose a framework for finding Wikidata. Yu et al. [75] present a new approach for meronym rela- the most suitable KG for a given setting. Abian et al. [1] compare tions extraction in Wikidata, which is, building a 13-dimensional Wikidata and DBpedia structured data sources, based on the criteria feature vector for each hyperlink to be classified with different clas- defined in the main data quality frameworks. In a similar study, sification algorithms, based on all 13 different three-node motifs. Thakkar et al. [66] compare DBpedia and Wikidata from a quality The high interest of this community might have one driver for the assurance perspective and have found that based on the majority development of the Wikibase Lexeme extension which allows for of relevant metrics, the quality of Wikidata is higher than DBpedia. modeling lexical entities. From 2018, Wikidata includes this new Data quality of Wikidata has also been studied from a KG perspec- type of data: words, phrases, and sentences. tive, as in study from Gad-elrab et al. [22] which discuss that KGs like DBpedia, Freebase, YAGO and Wikidata are inevitably incom- 4.4 Knowledge Graph Oriented Research plete. To address this, the authors analyze the former approach of data correlations and propose a method to overcome the problems Wikidata is maintained by an active community of contributors with former approach. who create a large amount of structured data. The knowledge base relies on the MediaWiki infrastructure. At the meantime, Wikidata’s 4.4.3 Common Issues of KGs. Ismayilov et al. [32] describe the structured data is stored in RDF and is accessible through SPARQL. integration of Wikidata into the DBpedia Data Stack in order to Wikidata belongs, therefore, to a group of other general purpose use Wikidata through DBpedia extractors. In their study, Chekol knowledge graphs, such as DBpedia, YAGO, and Cyc. & Stuckenschmidt [11] discuss that KGs, such as YAGO, Wikidata, NELL, and DBpedia, already contain temporal data (facts together 24The Office of the National Coordinator (abbrev. ONC) for Health Information Tech- with their validity time). The authors propose a “bitemporal” model nology is a division of United States’ Department of Health and Human Services. for knowledge graphs, to record the data extraction time from other 25DDI stands for Drug-Drug Interaction, i.e. the effect change of one drug on body by another drug. sources. Currently, only NELL records this time, while, Wikidata 26“ImageNet is an image dataset organized according to the WordNet hierarchy” (http: only contains the time which is valid about a fact. In another study, //image-net.org/. 27WordNet is a large lexical database of English and contains and groups nouns, verbs, 28For more information, please check https://www.mediawiki.org/wiki/Wikidata_ adjectives and adverbs in the form of sets of cognitive synonyms (synsets). For more Toolkit. information https://wordnet.princeton.edu. 29The WDQS is available here https://query.wikidata.org/. Wikidata from a Research Perspective - A Systematic Mapping Study of Wikidata , ,

Krötzsch discusses the modern knowledge representation technolo- et al. [3] introduce a tool that harmonizes street names from Open- gies and their advantages in information management, such as StreetMap30 and the entities they refer to in Wikidata. Another description logics, and their contribution to knowledge graphs, and study, Thornton et al. [67] explore the potential of Wikidata to motivates Wikidata as a use case [39]. serve as a technical metadata repository and how it provides dis- tinct advantages for usage in the domain of digital preservation. 4.5 Data-oriented Research There are also datasets which were developed based on Wikidata for different purposes. Nielsen et45 al.[ ] construct a dataset contain- This section contains research which make use of data from the ing pairs of digital photos of objects for a multi-modal knowledge Wikidata knowledge base and the knowledge graph. Some papers representation. Klein et al. [38] develop “Wikidata Human Gender belong to KB and some to KG, while all focus on their defined Indicators” (WHGI), a biographic dataset, to monitor gender related category. We organized the papers in two categories: (1) data quality issues. Spitz et al. [62] present an approach for constructing a net- quality aspects of Wikidata, and (2) the development of new tools work of locations from Wikipedia by computing the similarity of and datasets. locations based on their distances and linking it to Wikidata as a knowledge source. 4.5.1 Data Quality. The research highlighted in this section, is concerned with improving the data quality of the knowledge graph 5 DISCUSSION by providing tools for Wikidata’s knowledge base. Our mapping study shows an increase in the number of published Prasojo et al. [56] discuss “COOL-WD”, a tool for supporting research articles per year, which indicates the growing interest the completeness lifecycle of Wikidata and allow to produce and of the research community on Wikidata (cp. Section 3.1). The ar- consume completeness data by “data completion tracking, complete- ticles have a prevalence of computer science articles which we ness analytics, and query completeness assessment.” Augenstein [4] expected from the chosen databases which are mainly Computer discusses that KBs are far from complete and proposes informa- Science related (ACM, Springer Link, DBLP). However, by includ- tion extraction methods to populate missing knowledge from Web ing Google Scholar, we expected to identify more research from pages to KBs. Galárraga et al. [23] investigate “different signals to disciplines such as sociology or communication science. Unfortu- identify the areas where the knowledge base is complete” and exper- nately, our results suggest that this approach was less successful. iment in Wikidata and YAGO to generate completeness information However, as other peer production communities Wikidata provides automatically. Razniewski et al. [57] introduce the problems and a valuable opportunity to deepen our understanding of existing limitations of properties in Wikidata and propose entity-specific community practice. It might be interesting, for example, to ex- property ranking for Wikidata. Ahmeti et al. [2] and Balaraman plore existing difference to Wikipedia. Furthermore, within vari- et al. [5] propose and develop Recoin, a relative completeness tool ous Wikipedia language versions, there is still a resistance to use for evaluating completeness of entities in Wikidata. Recoin uses Wikidata. Further research is needed, to better understand existing information from the class structure of the knowledge graph, in reservations. Another interesting less studied aspect in Wikidata order to recommend possible properties for an item on the Wikidata is the existing human-bot-collaboration [43]. Wikidata might be, user interface. Brasileiro et al. [9] discuss the quality of taxonomic besides Wikipedia, an interesting use case to better understand the hierarchies in Wikidata to have a consistent data model and repre- social-technical infrastructure of a peer production community. sentation schema. Piscopo et al. ([52, 55]) analyze Wikidata quality Our results suggest that research on Wikidata seems to be en- from the provenance perspective, the relevance and authoritative- tirely concentrated on specific institutions, such as the University of ness of Wikidata external references. Southampton or the Universidad de Chile, or countries, for example, Germany and USA (cp. Section 3.3). It might be the origin of Wiki- 4.5.2 Tools & Datasets. This category contains research that re- data as a European project initiated by members of the Semantic sulted in the development of new tools, which mainly use Wikidata Web community which causes that research on Wikidata is more as a backend data source. Ontodia [73], for example, is a ”simple popular in Europe. We wonder, how this western perspective on and free online OWL and RDF diagramming tool“. Scholia [46] is knowledge representation might exclude other understandings of a tool for handling scientific bibliographic information through knowledge. For example, the indigenous peoples give their knowl- Wikidata, and NECKaR [24] is a named entities classifier based on edge orally from generation to generation. Research, which deals Wikidata, which provide also a Wikidata-based named entity data with the question of how this knowledge or the potential occurrence set. Ferrada et al. [19] present a new web interface for IMGpedia of such knowledge can be represented, would undoubtedly be use- dataset which can query more than 6 million images of IMGpe- ful to achieve the aim for becoming a global universal knowledge dia through Wikidata, while, Diefenbach et al. ([14, 15]), present base, which can be used by anyone for any purpose [72]. and discuss WDAqua-core which is a new Questions Answering While there have been studies on the multilingualism aspect of component which uses DBpedia and Wikidata. Veen et al. [69] use Wikidata, the data is still not present in every language. Current Wikidata to improve access to the collection of Dutch historical findings show that there are some dominant languages (e.g. English, newspapers. French, German, Spanish), while, many other languages as ‘under- More recently, some effort has been investigated to synchronize served’ (cp. Section 4.1.3). This indicates that, although, there have the data beween OpenStreetMap and Wikidata. Leyh et al. [40] discuss the opportunities and challenges of Wikidata as a central integration facility by interlinking it with OpenStreetMap. Almeida 30For mire information please check: https://www.openstreetmap.org/ ,, Farda-Sarbas and Müller-Birn been some efforts in addressing the issue of uneven languages dis- classification steps50 [ ]. We addressed theoretical validity by in- tribution, further studies are needed to overcome the language gap cluding peer-reviewed articles only, and to reduce bias, the results in Wikidata. Furthermore, these studies focus on the descriptions were carefully checked by the second author. Interpretive validity is and labels of an item. It might be interesting to understand better achieved when the conclusions are the result of the given data [50]. when Wikidata’s data model fails because a one-to-one relationship One of the threats in drawing conclusion is researcher bias, which between two words from different languages is not possible. has been controlled by the second author review. Repeatability can Continuous evolution is one of the design decisions of Wiki- be achieved by providing detailed information of each research step data, which means Wikidata grows with its community and tasks, and the data. We provide all our data, the search log on github and and new features are deployed incrementally [72]. The findings the final article sample on Zenodo.31 suggest only little research on improving the usability of the user interface. User studies concerning aspects such as the learnability 7 CONCLUSION or explainability are still rare on Wikidata. From the authors own In this mapping study, we have provided an overview of existing experiences on conducting Wikidata workshops, it can be said, that research about Wikidata. We identified existing research topics in people struggle with understanding Wikidata’s central concepts, this field and described potential new research topics for future for example, the difference between a class and an instance. It seems studies. The literature was collected from digital libraries and aca- that Wikidata has still untapped potential in becoming accessible demic search engines, and the selected papers were categorized for non-technical experts. based on research focus relevance. Research publications acceler- Many efforts are made to sustain and improve the quality and ated every year which is an indication of the interest of research completeness of data in Wikidata (cp. Section 4.5.1). One issue in community. Most of the research contributions come from Europe this context is, for example, the handling of vandalism and data so far, thus, Wikidata is still predominantly used and studied from integrity. In the context of data quality, we call for more research a Western perspective. This affects, for example, multilingualism on the effects of plurality, i.e., the co-existence of contradictory and knowledge diversity in Wikidata. Although, data in Wikidata is information, in order to enhance the trustworthiness of Wikidata available in various languages simultaneously, this applies only for content. However, if anyone can add contradictory information, a selected number of languages, i.e. many languages have very few further research is needed to provide such mechanisms in the user or no coverage in Wikidata. Thus, future directions of research on interface as well in the WDQS for providing this information in a Wikidata could be to: a) focus on multilingualism aspect of Wiki- possible format. data and overcome language gaps, b) study knowledge diversity As opposed to Wikidata, Wikipedia is studied from a variety and the effect of plurality on data trustworthiness of Wikidata, c) of disciplines, such as, humanities (e.g, history, literature, philoso- research on improvement of the usability of user interface, and d) phy), logic and mathematics, natural sciences (biology, chemistry), investigate the usage of Wikidata in various disciplines and study social sciences (e.g. communications, education, economics, law, it from non-technical perspective. journalism) and interdisciplinary (anthropology, computer science, health, industrial ecology and information science) [14]. While, ACKNOWLEDGMENTS Wikidata has the competence to be used in different disciplines, This part will be provided in the camera-ready version. the investigations are needed to find out whether Wikidata can be beneficial in the same areas where Wikipedia was used. Even though our study reveals the usage of Wikidata in various contexts, REFERENCES [1] D. Abián, F. Guerra, J. Martínez-Romanos, and Raquel Trillo-Lado. 2018. Wikidata the uses cases come from the biomedical domain and linguistics and DBpedia: A Comparative Study. In Semantic Keyword-Based Search on mainly (cp. Section 4.3). It might be valuable to see more use cases Structured Data Sources, Julian Szymański and Yannis Velegrakis (Eds.). Vol. 10546. from other disciplines, such as social sciences or humanities. It Springer International Publishing, Cham, 142–154. https://doi.org/10.1007/ 978-3-319-74497-1_14 might be valuable, for example, to use Wikidata in educational or [2] Albin Ahmeti, Simon Razniewski, and Axel Polleres. 2014. Assessing the museum settings. Completeness of Entities in Knowledge Bases. In The Semantic Web: ESWC 2017 Satellite Events (Lecture Notes in Computer Science). Springer, Cham, 7–11. https://doi.org/10.1007/978-3-319-70407-4_2 [3] Paulo Dias Almeida, Jorge Gustavo Rocha, Andrea Ballatore, and Alexander Zipf. 2016. Where the Streets Have Known Names. In Computational Science and Its 6 LIMITATIONS OF RESEARCH Applications – ICCSA 2016 (Lecture Notes in Computer Science). Springer, Cham, 1–12. https://doi.org/10.1007/978-3-319-42089-9_1 In this section, we discuss a number of validity threats this study [4] Isabelle Augenstein. 2014. Joint Information Extraction from the Web Using is subject to. To overcome descriptive validity (research design) Linked Data. In The Semantic Web – ISWC 2014 (Lecture Notes in Computer Science). Springer, Cham, 505–512. https://doi.org/10.1007/978-3-319-11915-1_32 threats adequate information, we provided all from data collec- [5] Vevake Balaraman, Simon Razniewski, and Werner Nutt. 2018. Recoin: Relative tion to data analysis as detailed as possible in the constrains of Completeness in Wikidata. In Companion Proceedings of the The Web Confer- the publication format. Further, the research was designed in way ence 2018 (WWW ’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1787–1792. https: to minimize the number of missing literature by including results //doi.org/10.1145/3184558.3191641 from Google Scholar. Although, Google Scholar results contained [6] *Tomoya Yamazaki, Mei Sasaki, Naoya Murakami, Takuya Makabe, and Hiroki many irrelevant results (e.g. citations), it was meaningful never- Iwasawa. 2017. Ensemble Models for Detecting Wikidata Vandalism with Stacking - Team Honeyberry Vandalism Detector at WSDM Cup 2017. In WSDM Cup 2017 theless, because we captured 11 articles not found in the other Notebook Papers. Cambridge, UK. libraries. Theoretical validity is achieved by capturing the most relevant literature and controlling bias in the data extraction and 31This information will be included in the final version of this paper. Wikidata from a Research Perspective - A Systematic Mapping Study of Wikidata , ,

[7] *Tuo Yu, Yiran Zhao, Xiaoxiao Wang, Yiwen Xu, Huajie Shao, Yuhang Wang, [26] Ben Hachey, Will Radford, and Andrew Chisholm. 2017. Learning to generate Xin Ma, and Dipannita Dey. 2017. Vandalism Detection Midpoint Report—The one-sentence biographies from Wikidata. In Proceeding of the 15th Conference Riberry Vandalism Detector at WSDM Cup 2017. (2017). University of Illinois at of the European Chapter of the Association for Computational Linguistics, Vol. 1. Urbana?Champaign Student Report, not published. LongPress, valencia, Spain, 633–642. [8] Adrian Bielefeldt, Julius Gonsior, and Markus Krötzsch. 2018. Practical Linked [27] *Stefan Heindorf, Martin Potthast, Hannah Bast, Björn Buchhold, and Elmar Data Access via SPARQL: The Case of Wikidata. Lyon, France, 10. Haussmann. 2017. WSDM Cup 2017: Vandalism Detection and Triple Scoring. In [9] Freddy Brasileiro, João Paulo A. Almeida, Victorio A. Carvalho, and Giancarlo Proceedings of the Tenth ACM International Conference on Web Search and Data Guizzardi. 2016. Applying a multi-level modeling theory to assess taxonomic Mining (WSDM ’17). ACM, New York, NY, USA, 827–828. https://doi.org/10. hierarchies in Wikidata. In Proceedings of the 25th International Conference Com- 1145/3018661.3022762 panion on World Wide Web. International World Wide Web Conferences Steering [28] Stefan Heindorf, Martin Potthast, Gregor Engels, and Benno Stein. 2017. Overview Committee, 975–980. of the Wikidata Vandalism Detection Task at WSDM Cup 2017. arXiv:1712.05956 [10] Sebastian Burgstaller-Muehlbacher, Andra Waagmeester, Elvira Mitraka, Julia [cs] abs/1712.05956 (Dec. 2017). http://arxiv.org/abs/1712.05956 arXiv: Turner, Tim Putman, Justin Leong, Chinmay Naik, Paul Pavlidis, Lynn Schriml, 1712.05956. and Benjamin M. Good. 2016. Wikidata as a semantic framework for the Gene [29] Stefan Heindorf, Martin Potthast, Benno Stein, and Gregor Engels. 2016. Van- Wiki initiative. Database 2016 (2016). dalism detection in wikidata. In Proceedings of the 25th ACM International on [11] Melisachew Wudage Chekol and Heiner Stuckenschmidt. 2018. Towards Prob- Conference on Information and Knowledge Management. ACM, 327–336. abilistic Bitemporal Knowledge Graphs. In Companion Proceedings of the The [30] Daniel Hernández, Aidan Hogan, and Markus Krötzsch. 2015. Reifying RDF: Web Conference 2018 (WWW ’18). International World Wide Web Conferences What works well with wikidata? SSWS@ ISWC 1457 (2015), 32–47. Steering Committee, Republic and Canton of Geneva, Switzerland, 1757–1762. [31] Daniel Hernández, Aidan Hogan, Cristian Riveros, Carlos Rojas, and Enzo Zerega. https://doi.org/10.1145/3184558.3191637 2016. Querying Wikidata: Comparing SPARQL, Relational and Graph Databases. [12] Rafael Crescenzi, Marcelo Fernandez, Federico A. Garcia Calabria, Pablo Albani, In The Semantic Web – ISWC 2016. Springer, Cham, 88–103. https://doi.org/10. Diego Tauziet, Adriana Baravalle, and Andrés Sebastián D’Ambrosio. 2017. A 1007/978-3-319-46547-0_10 Production Oriented Approach for Vandalism Detection in Wikidata. In WSDM [32] Ali Ismayilov, Dimitris Kontokostas, Sören Auer, Jens Lehmann, and Sebastian Cup 2017 Notebook Papers. arxiv.org, Cambridge, UK. Hellmann. 2015. Wikidata through the Eyes of DBpedia. arXiv:1507.04180 [cs] [13] To Tu Cuong and Claudia Müller-Birn. 2016. Applicability of Sequence Anal- (July 2015). http://arxiv.org/abs/1507.04180 arXiv: 1507.04180. ysis Methods in Analyzing Peer-Production Systems: A Case Study in Wiki- [33] Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Fred- data. In Social Informatics. Springer, Cham, 142–156. https://doi.org/10.1007/ erique Laforest, Jonathon Hare, and Elena Simperl. 2018. Learning to Generate 978-3-319-47874-6_11 Wikipedia Summaries for Underserved Languages from Wikidata. Proceedings [14] Dennis Diefenbach, Kamal Singh, and Pierre Maret. 2017. WDAqua-core0: A of the 2018 Conference of the North American Chapter of the Association for Com- Question Answering Component for the Research Community. In Semantic Web putational Linguistics: Human Language Technologies, Volume 2 (Short Papers) 2 Challenges (Communications in Computer and Information Science). Springer, (2018), 640–645. https://doi.org/10.18653/v1/N18-2101 Cham, 84–89. https://doi.org/10.1007/978-3-319-69146-6_8 [34] Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, [15] Dennis Diefenbach, Kamal Singh, and Pierre Maret. 2018. WDAqua-core1: A Frédérique Laforest, Jonathon Hare, and Elena Simperl. 2018. Mind the (Lan- Question Answering Service for RDF Knowledge Bases. In Companion Proceedings guage) Gap: Generation of Multilingual Wikipedia Summaries from Wikidata of the The Web Conference 2018 (WWW ’18). International World Wide Web for ArticlePlaceholders. In The Semantic Web (Lecture Notes in Computer Science). Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, Springer, Cham, 319–334. https://doi.org/10.1007/978-3-319-93417-4_21 1087–1091. https://doi.org/10.1145/3184558.3191541 [35] Lucie-Aimée Kaffee, Alessandro Piscopo, Pavlos Vougiouklis, Elena Simperl, [16] Tim E. Putman, Sebastien Lelong, Sebastian Burgstaller, Andra Waagmeester, Leslie Carr, and Lydia Pintscher. 2017. A Glimpse into Babel: An Analysis of Colin Diesh, Nathan Dunn, Monica Munoz-Torres, Gregory S. Stupp, Chunlei Wu, Multilinguality in Wikidata. In Proceedings of the 13th International Symposium Andrew I. Su, and Benjamin Good. 2017. WikiGenomes: An open web application on Open Collaboration (OpenSym ’17). ACM, New York, NY, USA, 14:1–14:5. for community consumption and curation of gene annotation data in Wikidata. https://doi.org/10.1145/3125433.3125465 Database: The Journal of Biological Databases and Curation 2017 (March 2017). [36] *Barbara Kitchenham, Pearl Brereton, and David Budgen. 2010. The Educational https://doi.org/10.1093/database/bax025 Value of Mapping Studies of Software Engineering Literature. In Proceedings of the [17] Fredo Erxleben, Michael Günther, Markus Krötzsch, Julian Mendez, and Denny 32Nd ACM/IEEE International Conference on Software Engineering - Volume 1 (ICSE Vrandečić. 2014. Introducing Wikidata to the linked data web. In International ’10). ACM, New York, NY, USA, 589–598. https://doi.org/10.1145/1806799.1806887 Semantic Web Conference. Springer, 50–65. [37] *Barbara A. Kitchenham, David Budgen, and O. Pearl Brereton. 2011. Using * [18] Michael Felderer and Jeffrey C. Carver. 2018. Guidelines for Systematic Mapping Mapping Studies As the Basis for Further Research - A Participant-observer Case Studies in Security Engineering. arXiv:1801.06810 [cs] (Jan. 2018). http://arxiv. Study. Inf. Softw. Technol. 53, 6 (2011), 638–651. https://doi.org/10.1016/j.infsof. org/abs/1801.06810 arXiv: 1801.06810. 2010.12.011 [19] Sebastián Ferrada, Nicolás Bravo, Benjamin Bustos, and Aidan Hogan. 2018. [38] Maximilian Klein, Harsh Gupta, Vivek Rai, Piotr Konieczny, and Haiyi Zhu. Querying Wikimedia Images Using Wikidata Facts. In Companion Proceedings 2016. Monitoring the Gender Gap with Wikidata Human Gender Indicators. In of the The Web Conference 2018 (WWW ’18). International World Wide Web Proceedings of the 12th International Symposium on Open Collaboration (OpenSym Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, ’16). ACM, New York, NY, USA, 16:1–16:9. https://doi.org/10.1145/2957792. 1815–1821. https://doi.org/10.1145/3184558.3191646 2957798 [20] Michael Färber, Frederic Bartscherer, Carsten Menne, and Achim Rettinger. 2016. [39] Markus Krötzsch. 2017. Ontologies for Knowledge Graphs?. In Proceedings of the Linked data quality of dbpedia, freebase, opencyc, wikidata, and yago. Semantic 30th International Workshop on Description Logics (CEUR Workshop Proceedings), Web Preprint (2016), 1–53. Vol. Vol-1879. CEUR-WS. org, France. [21] Michael Färber, Basil Ell, Carsten Menne, and Achim Rettinger. 2015. A Com- [40] Werner Leyh and Homero Fonseca Filho. 2017. Interlinking Standardized Open- parative Survey of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic StreetMap Data and Citizen Science Data in the OpenData Cloud. In Advances in Web 1 (2015) (2015), 26. Human Factors and Systems Interaction (Advances in Intelligent Systems and Com- [22] Mohamed H. Gad-Elrab, Daria Stepanova, Jacopo Urbani, and Gerhard Weikum. puting). Springer, Cham, 85–96. https://doi.org/10.1007/978-3-319-60366-7_9 2016. Exception-Enriched Rule Learning from Knowledge Graphs. In The Semantic [41] Elvira Mitraka, Andra Waagmeester, Sebastian Burgstaller-Muehlbacher, Lynn M Web – ISWC 2016 (Lecture Notes in Computer Science). Springer, Cham, 234–251. Schriml, Andrew I Su, and Benjamin M Good. 2015. Wikidata: A platform for https://doi.org/10.1007/978-3-319-46523-4_15 data integration and dissemination for the life sciences and beyond. bioRxiv (Jan. [23] Luis Galárraga, Simon Razniewski, Antoine Amarilli, and Fabian M. Suchanek. 2015). https://doi.org/10.1101/031971 2017. Predicting Completeness in Knowledge Bases. In Proceedings of the Tenth [42] Hatem Mousselly Sergieh and Iryna Gurevych. 2016. Enriching Wikidata with ACM International Conference on Web Search and Data Mining (WSDM ’17). ACM, Frame Semantics. In Proceedings of the 5th Workshop on Automated Knowledge New York, NY, USA, 375–383. https://doi.org/10.1145/3018661.3018739 Base Construction. https://doi.org/10.18653/v1/W16-1306 [24] Johanna Geiß, Andreas Spitz, and Michael Gertz. 2017. NECKAr: A Named [43] Claudia Müller-Birn, Benjamin Karran, Janette Lehmann, and Markus Luczak- Entity Classifier for Wikidata. In Language Technologies for the Challenges of Rösch. 2015. Peer-production system or collaborative ontology engineering effort: the Digital Age (Lecture Notes in Computer Science). Springer, Cham, 115–129. What is Wikidata?. In Proceedings of the 11th International Symposium on Open https://doi.org/10.1007/978-3-319-73706-5_10 Collaboration. ACM, 20. [25] Alexey Grigorev, Association for Computing Machinery, and Special Interest [44] Finn Årup Nielsen. 2018. Linking ImageNet WordNet Synsets with Wikidata. In Group on Information Retrieval (Eds.). 2017. Large-Scale Vandalism Detection Companion Proceedings of the The Web Conference 2018 (WWW ’18). International with Linear Classifiers The Conkerberry Vandalism Detector at WSDM .Cup 2017 World Wide Web Conferences Steering Committee, Republic and Canton of Association for Computing Machinery, New York, NY. OCLC: 946557248. Geneva, Switzerland, 1809–1814. https://doi.org/10.1145/3184558.3191645 ,, Farda-Sarbas and Müller-Birn

[45] Finn Årup Nielsen and Lars Kai Hansen. 2018. Inferring visual semantic similarity NY, USA, 25:1–25:7. https://doi.org/10.1145/2641580.2641613 with deep learning and Wikidata: Introducing imagesim-353. In Proceedings of the [64] Tomás Sáez and Aidan Hogan. 2018. Automatically Generating Wikipedia Info- First Workshop on Deep Learning for Knowledge Graphs and Semantic Technologies boxes from Wikidata. In Companion Proceedings of the The Web Conference 2018 (DL4KGS), Vol. Vol-2106. Department of Applied Mathmatics and Computer (WWW ’18). International World Wide Web Conferences Steering Committee, Science, Technical University of Denmark, Greece. Republic and Canton of Geneva, Switzerland, 1823–1830. https://doi.org/10. [46] Finn Årup Nielsen, Daniel Mietchen, and Egon Willighagen. 2017. Scholia, 1145/3184558.3191647 Scientometrics and Wikidata. In The Semantic Web: ESWC 2017 Satellite Events [65] Thang Hoang Ta and Chutiporn Anutariya. 2014. A Model for Enriching (Lecture Notes in Computer Science). Springer, Cham, 237–259. https://doi.org/10. Multilingual Wikipedias Using Infobox and Wikidata Property Alignment. 1007/978-3-319-70407-4_36 In Semantic Technology. Springer, Cham, 335–350. https://doi.org/10.1007/ [47] *Chitu Okoli, Mohamad Mehdi, Mostafa Mesgari, Finn Årup Nielsen, and Arto 978-3-319-15615-6_25 Lanamäki. 2012. The People’s Encyclopedia Under the Gaze of the Sages: A [66] Harsh Thakkar, Kemele M. Endris, Jose M. Gimenez-Garcia, Jeremy Debattista, Systematic Review of Scholarly Research on Wikipedia. SSRN Electronic Journal Christoph Lange, and Sören Auer. 2016. Are Linked Datasets Fit for Open- (2012). https://doi.org/10.2139/ssrn.2021326 domain Question Answering? A Quality Assessment. In Proceedings of the 6th [48] Thomas Pellissier Tanon, Denny Vrandečić, Sebastian Schaffert, Thomas Steiner, International Conference on Web Intelligence, Mining and Semantics (WIMS ’16). and Lydia Pintscher. 2016. From freebase to wikidata: The great migration. In ACM, New York, NY, USA, 19:1–19:12. https://doi.org/10.1145/2912845.2912857 Proceedings of the 25th international conference on world wide web. International [67] Katherine Thornton, Euan Cochrane, Thomas Ledoux, Bertrand Caron, and Carl World Wide Web Conferences Steering Committee, 1419–1428. Wilson. 2017. Modeling the Domain of Digital Preservation in Wikidata. In In [49] *Kai Petersen, Robert Feldt, Shahid Mujtaba, and Michael Mattsson. 2008. Sys- Proceedings of ACM International Conference on Digital Preservation. iPres’17, tematic Mapping Studies in Software Engineering. In Proceedings of the 12th Kyoto, Japan. International Conference on Evaluation and Assessment in Software Engineer- [68] H. Turki, D. Vrandecic, H. Hamdi, and I. Adel. 2017. Using WikiData as a Multi- ing (EASE’08). BCS Learning & Development Ltd., Swindon, UK, 68–77. http: lingual Multi-dialectal Dictionary for Arabic Dialects. In 2017 IEEE/ACS 14th //dl.acm.org/citation.cfm?id=2227115.2227123 International Conference on Computer Systems and Applications (AICCSA). 437– [50] *Kai Petersen, Sairam Vakkalanka, and Ludwik Kuzniarz. 2015. Guidelines for 442. https://doi.org/10.1109/AICCSA.2017.115 conducting systematic mapping studies in software engineering: An update. [69] Theo van Veen, Juliette Lonij, and Willem Jan Faber. 2016. Linking Named Information and Software Technology 64 (Aug. 2015), 1–18. https://doi.org/10. Entities in Dutch Historical Newspapers. In Metadata and Semantics Research 1016/j.infsof.2015.03.007 (Communications in Computer and Information Science). Springer, Cham, 205–210. [51] Alexander Pfundner, Tobias Schönberg, John Horn, Richard D. Boyce, and https://doi.org/10.1007/978-3-319-49157-8_18 Matthias Samwald. 2015. Utilizing the Wikidata system to improve the quality [70] Jakob Voß. 2016. Classification of Knowledge Organization Systems with of medical content in Wikipedia in diverse languages: a pilot study. Journal of Wikidata. In Proceedings of the 15th European Networked Knowledge Organi- medical Internet research 17, 5 (2015). zation Systems Workshop (NKOS 2016), Vol. Vol-1676. CEUR-WS. org, Hannover. [52] Alessandro Piscopo, Lucie-Aimée Kaffee, Chris Phethean, and Elena Simperl. 2017. https://doi.org/10.5281/zenodo.61767 Provenance Information in a Collaborative Knowledge Graph: An Evaluation [71] Denny Vrandecic. 2013. The rise of Wikidata. IEEE Intelligent Systems 28, 4 (2013), of Wikidata External References. In The Semantic Web – ISWC 2017 (Lecture 90–95. Notes in Computer Science). Springer, Cham, 542–558. https://doi.org/10.1007/ [72] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative 978-3-319-68288-4_32 knowledgebase. Commun. ACM 57, 10 (2014), 78–85. [53] Alessandro Piscopo, Chris Phethean, and Elena Simperl. 2017. What Makes [73] Gerhard Wohlgenannt, Nikolay Klimov, Dmitry Mouromtsev, Daniil a Good Collaborative Knowledge Graph: Group Composition and Quality in Razdyakonov, Dmitry Pavlov, and Yury Emelyanov. 2017. Using Word Wikidata. In Social Informatics (Lecture Notes in Computer Science). Springer, Embeddings for Visual Data Exploration with Ontodia and Wikidata. In Joint Cham, 305–322. https://doi.org/10.1007/978-3-319-67217-5_19 Proceedings of BLINK2017: 2nd International Workshop on Benchmarking Linked [54] Alessandro Piscopo, Christopher Phethean, and Elena Simperl. 2017. Wikidatians Data and NLIWoD3: Natural Language Interfaces for the Web of Data (CEUR are born: paths to full participation in a collaborative structured knowledge base. Workshop Proceedings), Vol. Vol-1932. CEUR-WS. org, Vienna, Austria. In Proceedings of the 50th Hawaii International Conference on System Sciences. [74] Xi Yang, Shiya Ren, Yuan Li, Ke Shen, Zhixing Li, and Guoyin Wang. 2017. University of Hawaii, 4354–4363. https://eprints.soton.ac.uk/413008/ Relation Linking for Wikidata Using Bag of Distribution Representation. In [55] Alessandro Piscopo, Pavlos Vougiouklis, Lucie-Aimée Kaffee, Christopher Natural Language Processing and Chinese Computing. Springer, Cham, 652–661. Phethean, Jonathon Hare, and Elena Simperl. 2017. What Do Wikidata and https://doi.org/10.1007/978-3-319-73618-1_55 Wikipedia Have in Common?: An Analysis of Their Use of External References. In [75] Xue-lu Yu and Lin Qiao. 2017. Meronymy Relation Extraction Based on 3-Motif Proceedings of the 13th International Symposium on Open Collaboration (OpenSym in Wikidata. DEStech Transactions on Computer Science and Engineering 0, cnsce ’17). ACM, New York, NY, USA, 1:1–1:10. https://doi.org/10.1145/3125433.3125445 (2017). https://doi.org/10.12783/dtcse/cnsce2017/8915 [56] Radityo Eko Prasojo, Fariz Darari, Simon Razniewski, and Werner Nutt. 2016. [76] Eva Zangerle, Wolfgang Gassler, Martin Pichl, Stefan Steinhauser, and Günther Managing and consuming completeness information for wikidata using COOL- Specht. 2016. An Empirical Evaluation of Property Recommender Systems for WD. CEUR-WS. org. Wikidata and Collaborative Knowledge Bases. In Proceedings of the 12th Inter- [57] Simon Razniewski, Vevake Balaraman, and Werner Nutt. 2017. Doctoral Advisor national Symposium on Open Collaboration (OpenSym ’16). ACM, New York, NY, or Medical Condition: Towards Entity-Specific Rankings of Knowledge Base USA, 18:1–18:8. https://doi.org/10.1145/2957792.2957804 * Properties. In Advanced Data Mining and Applications (Lecture Notes in Computer [77] Qi Zhu, Hongwei Ng, Liyuan Liu, Ziwei Ji, Bingjie Jiang, Jiaming Shen, and Science). Springer, Cham, 526–540. https://doi.org/10.1007/978-3-319-69179-4_37 Huan Gui. 2017. Wikidata Vandalism Detection - The Loganberry Vandalism [58] Simon Razniewski, Fabian Suchanek, and Werner Nutt. 2016. But What Do Detector at WSDM Cup 2017. In Proceedings of the WSDM Cup 2017: Vandalism We Actually Know? Association for Computational Linguistics, 40–44. https: Detection and Triple Scoring, Vol. abs/1712.06922. arxiv, Cambridge, UK. http: //doi.org/10.18653/v1/W16-1308 //arxiv.org/abs/1712.06922 arXiv: 1712.06922. [59] Daniel Ringler and Heiko Paulheim. 2017. One Knowledge Graph to Rule Them All? Analyzing the Differences Between DBpedia, YAGO, Wikidata & co.. In KI 2017: Advances in Artificial Intelligence (Lecture Notes in Computer Science). Springer, Cham, 366–372. https://doi.org/10.1007/978-3-319-67190-1_33 [60] John Samuel. 2017. Collaborative Approach to Developing a Multilingual On- tology: A Case Study of Wikidata. In Metadata and Semantic Research (Com- munications in Computer and Information Science). Springer, Cham, 167–172. https://doi.org/10.1007/978-3-319-70863-8_16 [61] Amir Sarabadani, Aaron Halfaker, and Dario Taraborelli. 2017. Building Auto- mated Vandalism Detection Tools for Wikidata. In Proceedings of the 26th Interna- tional Conference on World Wide Web Companion (WWW ’17 Companion). Interna- tional World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1647–1654. https://doi.org/10.1145/3041021.3053366 [62] Andreas Spitz, Johanna Geiß, and Michael Gertz. 2016. So far away and yet so close: augmenting toponym disambiguation and similarity with text-based networks. ACM Press, 1–6. https://doi.org/10.1145/2948649.2948651 [63] Thomas Steiner. 2014. Bots vs. Wikipedians, Anons vs. Logged-Ins (Redux): A Global Study of Edit Activity on Wikipedia and Wikidata. In Proceedings of The International Symposium on Open Collaboration (OpenSym ’14). ACM, New York,