Representing COVID-19 Information in Collaborative Knowledge Graphs: a Study of Wikidata
Total Page:16
File Type:pdf, Size:1020Kb
Representing COVID-19 information in collaborative knowledge graphs: a study of Wikidata Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University, Country Open review(s): Name Surname, University, Country a b Houcemeddine Turki (ORCID: 0000-0003-3492-2014), Mohamed Ali Hadj Taieb (ORCID: 0000-0002-2786-8913), Thomas c d e Shafee (ORCID: 0000-0002-2298-7593), Tiago Lubiana (ORCID: 0000-0003-2473-2313), Dariusz Jemielniak (ORCID: 0000-0002- b f 3745-7931), Mohamed Ben Aouicha (ORCID: 0000-0002-2277-5814), Jose Emilio Labra Gayo (ORCID: 0000-0001-8907-5348), g h i,* Mus‟ab Banat (ORCID: 0000-0001-9132-3849), Diptanshu Das (ORCID: 0000-0002-7221-5022), Daniel Mietchen (ORCID: 0000- 1 0001-9488-1870), on behalf of WikiProject COVID-19 a Faculty of Medicine of Sfax, University of Sfax, Sfax, Tunisia b Faculty of Sciences of Sfax, University of Sfax, Sfax, Tunisia c La Trobe University, Melbourne, Victoria, Australia d Computational Systems Biology Laboratory, University of São Paulo, São Paulo, Brazil e Department of Management in Networked and Digital Societies, Kozminski University, Warsaw, Poland f Web Semantics Oviedo (WESO) Research Group, University of Oviedo, Spain g Faculty of Medicine, Hashemite University, Zarqa, Jordan h Institute of Child Health (ICH), Kolkata, India h Medica Superspecialty Hospital, Kolkata, India i School of Data Science, University of Virginia, Charlottesville, Virginia, United States of America Abstract. Information related to the COVID-19 pandemic ranges from biological to bibliographic and from geographical to genetic. Wikidata is a vast interdisciplinary, multilingual, open collaborative knowledge base of more than 88 million entities connected by well over a billion relationships and is consequently a web-scale platform for broader computer-supported coop- erative work and linked open data. Here, we introduce four aspects of Wikidata that make it an ideal knowledge base for in- formation on the COVID-19 pandemic: its flexible data model, its multilingual features, its alignment to multiple external da- tabases, and its multidisciplinary organization. The structure of the raw data is highly complex, so converting it to meaningful insight requires extraction and visualization, the global crowdsourcing of which adds both additional challenges and opportuni- ties. The created knowledge graph for COVID-19 in Wikidata can be visualized, explored and analyzed in near real time by specialists, automated tools and the public, for decision support as well as educational and scholarly research purposes via SPARQL, a semantic query language used to retrieve and process information from databases saved in Resource Description Framework (RDF) format. Keywords: Public health surveillance, Wikidata, Knowledge graph, COVID-19, SPARQL, Community curation, FAIR data, Linked Open Data * Corresponding author. E-mail: [email protected]. 1 Project Member: Project members: Jan Ainali, Susanna Ånäs, Erica Azzellini, Mus‟ab Banat, Mohamed Ben Aouicha, Diptanshu Das, Lena Denis, Rich Farmbrough, Daniel Fernández-Álvarez, Konrad Foerstner, Jose Emilio Labra Gayo, Mohamed Ali Hadj Taieb, James Hare, Alejandro González Hevia, David Hicks, Netha Hussain, Jinoy Tom Jacob, Dariusz Jemielniak, Krupal Kasyap, Will Kent, Samuel Klein, Jasper J. Koehorst, Martina Kutmon, Antoine Logean, Tiago Lubiana, Andy Mabbett, Kimberli Mäkäräinen, Bodhisattwa Mandal, Daniel Mietchen, Nandana Mihindukulasooriya, Mahir Morshed, Peter Murray-Rust, Finn Årup Nielsen, Mike Nolan, Shay Nowick, Julian Leonardo Paez, João Alexandre Peschanski, Alexander Pico, Lane Rasberry, Mairelys Lemus-Rojas, Diego Saez-Trumper, Magnus Sälgö, John Samuel, Peter J. Schaap, Jodi Schneider, Thomas Shafee, Nick Sheppard, Adam Shorland, Ranjith Siji, Michal Josef Špaček, Ralf Stephan, Andrew I. Su, Hilary Thorsen, Houcemeddine Turki, Lisa M. Verhagen, Denny Vrandečić, Andra Waagmeester, and Egon Willighagen. 1. Introduction efficiently queried using SPARQL2, a semantic query language dynamically extracting triple information The COVID-19 pandemic is complex and multifa- from large-scale knowledge graphs. Here, we present ceted and touches on almost every area of current life. how various types of data related to the COVID-19 Coordinating efforts to systematize and formalize pandemic are currently represented in Wikidata knowledge about COVID-19 in a computable form is thanks to the flexible structure of the database and key in accelerating our response to the pathogen [1]. how useful visualizations for different subsets of the There are already attempts at community-based on- data linked to COVID-19 within the Wikidata know- tologies of COVID-19 knowledge and data [2], as ledge base can be generated. In this way, we can un- well as efforts to aggregate expert data [3]. Many derstand facets of how the machine world „sees‟ Wi- open data initiatives are started spontaneously [4-6]. kidata‟s COVID-19 knowledge. However, we believe that more attention in the re- In this research paper, we describe the data model search world is needed for Wikidata, in particular as of Wikidata in general and in the context of COVID- a platform for COVID-19 knowledge graph visual 19 pandemic (Section 2). Then, we give an overview representation. The interconnected, interdisciplinary, of the language support (Section 3) and database and international nature of the pandemic makes Wi- alignment (Section 4) of COVID-19 information in kidata a well-suited knowledge base to collate and Wikidata. Subsequently, we present a snapshot of make sense of this information [7]. how the COVID-19 knowledge graph of Wikidata Wikidata is a large-scale, collaborative, open- can be used to support computer applications, partic- licensed, multilingual knowledge base that is both ularly the SPARQL-based visualization of multidis- human- and machine-readable. Notably, it is availa- ciplinary information about COVID-19 (Section 5). ble in the standardized RDF (Resource Description Finally, we discuss the outcomes of the open devel- Framework) format, where data is organized into opment of the COVID-19 knowledge graph in Wiki- entities named items and the relationships that con- data (Section 6) and we draw conclusions and future nect them, named properties [8]. Wikidata is a peer directions for this research paper (Section 7). production project, developed under the umbrella of the Wikimedia Foundation. Similarly to Wikipedia, it relies on collaborative development design and is 2. Data Model both a-hierarchical and largely uncoordinated [9]. As a result, it develops entirely organically, basing on In Wikidata, each concept has an item (a human, the editor community‟s consensus. This community disease, drug, city, etc), is assigned a unique identifi- develops ontologies and typologies used in the data- er (Q-number; brown in Fig. 1), and optionally a la- base. This spontaneous approach is both a blessing bel, description and aliases in multiple languages and a curse: on the one hand, it makes methodical (yellow in Fig. 1). The true richness of the know- planning of the whole structure and its granularity ledge base comes from the connections between the very difficult, if not impossible [10]. On the other, it items: statements in the form of RDF triples (subject- provides for an unparalleled flexibility and versatility predicate-object) where the subject is the respective of uses. item, the predicate is a Wikidata property (red in Fig. One of Wikidata‟s key strengths is that each item 1), and the object is another Wikidata item or piece can be understood by both machines and humans as it of information (blue in Fig. 1). The properties that represents data in the form of triples [8]. However, relate items are similarly each assigned an identifier where a computer can easily hold the entire know- (P-number). Some properties relate a Wikidata item ledge base in its memory at once, the same is ob- as the object and can be taxonomic (e.g. instance of viously not true for a human. Since we still rely on [P31], subclass of [P279], and part of [P361]) or non- human interpretation to extract meaning out of com- taxonomic (e.g. significant person [P3342], drug plex data, it is necessary to pass that data from ma- used for treatment [P2176], and symptoms [P780]). chine-to-human in an intuitive manner [11]. The Conversely, other properties can have an object that main way of doing this is by visualising some subset is a value (e.g. number of cases [P1603]), date (e.g. of the data: effectively, the human eye acts as the point in time [P585]), URL (e.g. official website input interface with the greatest bandwidth. Because [P856]), string (e.g. official name [P1448]), or exter- Wikidata is available in the RDF format, it can be 2 The recursive acronym for "SPARQL Protocol and RDF Query Language", the current version of which is SPARQL 1.1. nal identifier (e.g. Library of Congress authority ID themselves are also organised as RDF triples [7]. [P244]). Each statement can be given further detail This comes together to create an integrated network and specificity via qualifiers (black in Fig. 1) or of 88 million items interlinked by over a billion provenance via references (purple in Fig. 1), which statements. Fig. 1. Data Structure of a Wikidata item. The simple, consistent structure of a Wikidata item makes it both human- and machine-readable. Each Wikidata item has a unique identifier (Brown). They can have labels, descriptions and aliases in multiple languages (Yellow). They then include any number of statements having predicates (Red), objects (Blue), qualifiers (Black) and references (Purple) where the subject is the item. Finally, where additional Wikimedia resources are available about an item‟s topic, those are listed (Green). Source: https://www.wikidata.org/wiki/Q84263196, available at: https://w.wiki/auF. In the context of the COVID-19 pandemic, and as of 20 August 2020; blue in Fig. 2). The collabora- building on pilot work that was started at the onset of tive work to populate and curate this data has been the Zika pandemic [12], an ontological database largely accomplished by WikiProject COVID-19 3 , representing all the aspects of the outbreak has been launched in March 2020 [13].