Representing COVID-19 information in collaborative knowledge graphs: a study of Wikidata

Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University, Country Open review(s): Name Surname, University, Country

a b Houcemeddine Turki (ORCID: 0000-0003-3492-2014), Mohamed Ali Hadj Taieb (ORCID: 0000-0002-2786-8913), Thomas c d e Shafee (ORCID: 0000-0002-2298-7593), Tiago Lubiana (ORCID: 0000-0003-2473-2313), Dariusz Jemielniak (ORCID: 0000-0002- b f 3745-7931), Mohamed Ben Aouicha (ORCID: 0000-0002-2277-5814), Jose Emilio Labra Gayo (ORCID: 0000-0001-8907-5348), g h i,* Mus‟ab Banat (ORCID: 0000-0001-9132-3849), Diptanshu Das (ORCID: 0000-0002-7221-5022), Daniel Mietchen (ORCID: 0000- 1 0001-9488-1870), on behalf of WikiProject COVID-19 a Faculty of Medicine of Sfax, University of Sfax, Sfax, Tunisia b Faculty of Sciences of Sfax, University of Sfax, Sfax, Tunisia c La Trobe University, Melbourne, Victoria, Australia d Computational Systems Biology Laboratory, University of São Paulo, São Paulo, Brazil e Department of Management in Networked and Digital Societies, Kozminski University, Warsaw, Poland f Web Semantics Oviedo (WESO) Research Group, University of Oviedo, Spain g Faculty of Medicine, Hashemite University, Zarqa, Jordan h Institute of Child Health (ICH), Kolkata, India h Medica Superspecialty Hospital, Kolkata, India i School of Data Science, University of Virginia, Charlottesville, Virginia, United States of America

Abstract. Information related to the COVID-19 pandemic ranges from biological to bibliographic and from geographical to genetic. Wikidata is a vast interdisciplinary, multilingual, open collaborative knowledge base of more than 88 million entities connected by well over a billion relationships and is consequently a web-scale platform for broader computer-supported coop- erative work and linked open data. Here, we introduce four aspects of Wikidata that make it an ideal knowledge base for in- formation on the COVID-19 pandemic: its flexible data model, its multilingual features, its alignment to multiple external da- tabases, and its multidisciplinary organization. The structure of the raw data is highly complex, so converting it to meaningful insight requires extraction and visualization, the global crowdsourcing of which adds both additional challenges and opportuni- ties. The created knowledge graph for COVID-19 in Wikidata can be visualized, explored and analyzed in near real time by specialists, automated tools and the public, for decision support as well as educational and scholarly research purposes via SPARQL, a semantic query language used to retrieve and process information from databases saved in Resource Description Framework (RDF) format.

Keywords: Public health surveillance, Wikidata, Knowledge graph, COVID-19, SPARQL, Community curation, FAIR data, Linked Open Data

* Corresponding author. E-mail: [email protected]. 1 Project Member: Project members: Jan Ainali, Susanna Ånäs, Erica Azzellini, Mus‟ab Banat, Mohamed Ben Aouicha, Diptanshu Das, Lena Denis, Rich Farmbrough, Daniel Fernández-Álvarez, Konrad Foerstner, Jose Emilio Labra Gayo, Mohamed Ali Hadj Taieb, James Hare, Alejandro González Hevia, David Hicks, Netha Hussain, Jinoy Tom Jacob, Dariusz Jemielniak, Krupal Kasyap, Will Kent, Samuel Klein, Jasper J. Koehorst, Martina Kutmon, Antoine Logean, Tiago Lubiana, Andy Mabbett, Kimberli Mäkäräinen, Bodhisattwa Mandal, Daniel Mietchen, Nandana Mihindukulasooriya, Mahir Morshed, Peter Murray-Rust, Finn Årup Nielsen, Mike Nolan, Shay Nowick, Julian Leonardo Paez, João Alexandre Peschanski, Alexander Pico, Lane Rasberry, Mairelys Lemus-Rojas, Diego Saez-Trumper, Magnus Sälgö, John Samuel, Peter J. Schaap, Jodi Schneider, Thomas Shafee, Nick Sheppard, Adam Shorland, Ranjith Siji, Michal Josef Špaček, Ralf Stephan, Andrew I. Su, Hilary Thorsen, Houcemeddine Turki, Lisa M. Verhagen, Denny Vrandečić, Andra Waagmeester, and Egon Willighagen. 1. Introduction efficiently queried using SPARQL2, a semantic query language dynamically extracting triple information The COVID-19 pandemic is complex and multifa- from large-scale knowledge graphs. Here, we present ceted and touches on almost every area of current life. how various types of data related to the COVID-19 Coordinating efforts to systematize and formalize pandemic are currently represented in Wikidata knowledge about COVID-19 in a computable form is thanks to the flexible structure of the database and key in accelerating our response to the pathogen [1]. how useful visualizations for different subsets of the There are already attempts at community-based on- data linked to COVID-19 within the Wikidata know- tologies of COVID-19 knowledge and data [2], as ledge base can be generated. In this way, we can un- well as efforts to aggregate expert data [3]. Many derstand facets of how the machine world „sees‟ Wi- open data initiatives are started spontaneously [4-6]. kidata‟s COVID-19 knowledge. However, we believe that more attention in the re- In this research paper, we describe the data model search world is needed for Wikidata, in particular as of Wikidata in general and in the context of COVID- a platform for COVID-19 knowledge graph visual 19 pandemic (Section 2). Then, we give an overview representation. The interconnected, interdisciplinary, of the language support (Section 3) and database and international nature of the pandemic makes Wi- alignment (Section 4) of COVID-19 information in kidata a well-suited knowledge base to collate and Wikidata. Subsequently, we present a snapshot of make sense of this information [7]. how the COVID-19 knowledge graph of Wikidata Wikidata is a large-scale, collaborative, open- can be used to support computer applications, partic- licensed, multilingual knowledge base that is both ularly the SPARQL-based visualization of multidis- human- and machine-readable. Notably, it is availa- ciplinary information about COVID-19 (Section 5). ble in the standardized RDF (Resource Description Finally, we discuss the outcomes of the open devel- Framework) format, where data is organized into opment of the COVID-19 knowledge graph in Wiki- entities named items and the relationships that con- data (Section 6) and we draw conclusions and future nect them, named properties [8]. Wikidata is a peer directions for this research paper (Section 7). production project, developed under the umbrella of the Wikimedia Foundation. Similarly to , it relies on collaborative development design and is 2. Data Model both a-hierarchical and largely uncoordinated [9]. As a result, it develops entirely organically, basing on In Wikidata, each concept has an item (a human, the editor community‟s consensus. This community disease, drug, city, etc), is assigned a unique identifi- develops ontologies and typologies used in the data- er (Q-number; brown in Fig. 1), and optionally a la- base. This spontaneous approach is both a blessing bel, description and aliases in multiple languages and a curse: on the one hand, it makes methodical (yellow in Fig. 1). The true richness of the know- planning of the whole structure and its granularity ledge base comes from the connections between the very difficult, if not impossible [10]. On the other, it items: statements in the form of RDF triples (subject- provides for an unparalleled flexibility and versatility predicate-object) where the subject is the respective of uses. item, the predicate is a Wikidata property (red in Fig. One of Wikidata‟s key strengths is that each item 1), and the object is another Wikidata item or piece can be understood by both machines and humans as it of information (blue in Fig. 1). The properties that represents data in the form of triples [8]. However, relate items are similarly each assigned an identifier where a computer can easily hold the entire know- (P-number). Some properties relate a Wikidata item ledge base in its memory at once, the same is ob- as the object and can be taxonomic (e.g. instance of viously not true for a human. Since we still rely on [P31], subclass of [P279], and part of [P361]) or non- human interpretation to extract meaning out of com- taxonomic (e.g. significant person [P3342], drug plex data, it is necessary to pass that data from ma- used for treatment [P2176], and symptoms [P780]). chine-to-human in an intuitive manner [11]. The Conversely, other properties can have an object that main way of doing this is by visualising some subset is a value (e.g. number of cases [P1603]), date (e.g. of the data: effectively, the human eye acts as the point in time [P585]), URL (e.g. official website input interface with the greatest bandwidth. Because [P856]), string (e.g. official name [P1448]), or exter- Wikidata is available in the RDF format, it can be 2 The recursive acronym for "SPARQL Protocol and RDF Query Language", the current version of which is SPARQL 1.1. nal identifier (e.g. Library of Congress authority ID themselves are also organised as RDF triples [7]. [P244]). Each statement can be given further detail This comes together to create an integrated network and specificity via qualifiers (black in Fig. 1) or of 88 million items interlinked by over a billion provenance via references (purple in Fig. 1), which statements.

Fig. 1. Data Structure of a Wikidata item. The simple, consistent structure of a Wikidata item makes it both human- and machine-readable. Each Wikidata item has a unique identifier (Brown). They can have labels, descriptions and aliases in multiple languages (Yellow). They then include any number of statements having predicates (Red), objects (Blue), qualifiers (Black) and references (Purple) where the subject is the item. Finally, where additional Wikimedia resources are available about an item‟s topic, those are listed (Green). Source: https://www.wikidata.org/wiki/Q84263196, available at: https://w.wiki/auF.

In the context of the COVID-19 pandemic, and as of 20 August 2020; blue in Fig. 2). The collabora- building on pilot work that was started at the onset of tive work to populate and curate this data has been the Zika pandemic [12], an ontological database largely accomplished by WikiProject COVID-19 3 , representing all the aspects of the outbreak has been launched in March 2020 [13]. Indeed, WikiProject represented in Wikidata. There are three main items COVID-19 itself has a Wikidata item (Q87748614) that form the core of this structure (red in Fig. 2): and items are linked to this using the property on COVID-19 (Q84263196), SARS-CoV-2 (Q82069695), focus list of Wikimedia project (P5008). This Wiki- and COVID-19 pandemic (Q81068910). Those three Project has built heavily on the lessons learnt from core COVID-19-related Wikidata items have rela- the pilot work during the Zika pandemic [12]. tively simple links to one another. Mainly that These COVID-19-related items are linked to their SARS-CoV-2 causes COVID-19, which itself has respective classes or types using instance of [P31] or had the downstream effect of the COVID-19 pan- subclass of [P279] relations, and they are linked be- demic. tween each other using non-taxonomic relations de- These three core items then link out to a vast array of items related to all aspects of the disease, its causative virus, and the resulting pandemic (>17,000 3 https://www.wikidata.org/wiki/Wikidata:WikiProject_COVID-19 fining knowledge about various and multi- didates (Q28051899), scholarly articles disciplinary aspects of COVID-19 (Fig. 2). Biomedi- (Q13442814) and COVID-19 dashboards cal relations between Wikidata items can be assigned (Q90790055). This graph is augmented by biomedi- nature of statement (P5102) or sourcing circums- cal, geographical and other information already tances (P1480) qualifiers to state the status (e.g. offi- available in Wikidata and representing an important cial, hypothesis and de facto) and the occurrence overview of clinical knowledge [7, 13]. This action probability (e.g. rarely, possibly and often) of the allows the inclusion of other classes to COVID-19 described semantic relation. The network of these knowledge graphs, including genes (Q7187), proteins items and relations forms a high-scale knowledge (Q8054), and biological processes (Q2996394) as graph for COVID-19, where the main Wikidata items well as the definition of semantic relations between are COVID-19 pandemic (Q81068910), SARS-CoV-2 COVID-19-related items and other Wikidata items. (Q82069695) and COVID-19 (Q84263196) and This, consequently, allows the expansion of the cov- where the mainly developed Wikidata classes are erage of COVID-19 information in Wikidata and a disease outbreaks (Q3241045) in regions such as better characterization of COVID-19-related items. continents, sovereign states, and constituent states, COVID-19 tracing apps (Q89288125), vaccine can-

Fig. 2. Simplified skeleton of the data model of COVID-19 information on Wikidata. The three main COVID-related items (the „C3 items‟)4 are represented in red, selected classes of items related to these are shown in blue, with the relations between them represented as arrows. The number of statements relating to each item from the relevant class is indicated next to the item (In the case of scholarly articles, relations to each of the three COVID-related items is indicated by colour). Relation types regularly used to define items within Wikidata classes are omit- ted (e.g. chromosome [P1057] for human genes), as of 20 August 20205, available at: https://w.wiki/auD , license: CC BY 4.0.

4 COVID and C3 stand for any subset of {COVID-19 [Q84263196], SARS-CoV-2 [Q82069695], COVID-19 pandemic [Q81068910]}. 5 Source queries: https://w.wiki/Ypc, https://w.wiki/Ypd, https://w.wiki/Ype, https://w.wiki/Ypg, https://w.wiki/Yph, and https://w.wiki/Ypi. Given the distinctive feature of knowledge graphs [P8010], number of clinical tests [P8011], and num- that assigns triples to items where the object is not an ber of hospitalized cases [P8049]) proving the flex- item [14], Wikidata items are assigned an important ibility of the knowledge base. To keep records of the number of non-relational statements for various pur- progress of the COVID-19 pandemic over time, each poses. On the one hand, items are assigned their iden- statistical statement is assigned a point in time [P585] tifiers in external databases, including semantic re- relation as a qualifier. These epidemiological state- sources, using human efforts and tools such as ments are retrieved from CC0 databases such as the Mix‟n‟match [15]. These links make Wikidata a key COVID-19 DataHub database 6 and are linked to part of the open data ecosystem, not only contribut- them as references. These statements can be used to ing its own items and internal links, but also bridging automatically infer other measures that are not sup- between other open databases (Fig. 3). Wikidata ported by Wikidata but give a full overview of the therefore supports alignment between disparate epidemiology of COVID-19: let c be the total num- knowledge bases and, consequently, semantic data ber of confirmed cases at a given day Z when the integration [16] and federation [15] in the context of epidemiological evaluation takes place, d the number the linked open data cloud [17]. Such statements also of confirmed deaths until that day, r the number of permit the enrichment of Wikidata items with data confirmed recoveries by that day, h the number of from external databases when these resources are confirmed hospitalized cases on that day, t the num- updated, particularly in relation with the regular ber of clinical tests until that day. On the basis of changes of the multiple characteristics of COVID-19. these values (which could all be represented in Wiki- Examples of Wikidata properties used to define ex- data if matters related to the coverage and conflicts of ternal identifiers can be found in Table 1. information from multiple sources are solved), the following measures can be inferred:  Confirmed active cases (noted v) = c – (d + r)  Confirmed recovery rate (noted a) = r / c  Confirmed patient-days (noted p) =∑h if all infection days are represented

 New confirmed cases (noted ncZ) = cZ - cZ-1  New confirmed deaths (noted ndZ) = dZ - dZ-1  New clinical tests (noted ntZ) = tZ - tZ-1  New confirmed recoveries (noted nrZ) = rZ - rZ-1. This set of COVID-19 information is integrated in- to Wikidata using human efforts, the QuickState- ments tool7, the Wikidata API8, and bots mainly writ- ten in Python (e.g. CovidDatahubBot 9), which ex- plains its quantity and coverage [7].

Fig. 3. Wikidata in the Linked Open Data Cloud. Databases indi- cated as circles (with Wikidata indicated as „WD‟), with grey lines Table S1 linking databases in the network if their data is aligned, as of 20 August 2020 (available at: https://w.wiki/bYM, license: CC BY Examples of Wikidata properties used to define non-relational 4.0). statements

On the other hand, disease outbreak items for the Wikidata Name Description COVID-19 pandemic are assigned statistical state- ments to outline the evolution of the epidemiological 6 https://datahub.io/core/covid-19 status of different countries. The properties used to 7 QuickStatements (QS) is a web service that can modify Wiki- define these statements are shown in Table 1 and data, based on a simple text commands: https://quickstatements.toolforge.org/ include data about the morbidity, the mortality, the 8 testing and the clinical management of COVID-19 at An application programming interface (API) is a machine- friendly interface of a web service that can be used to feed another the level of continents, countries and constituent computer program with needed information. The Wikidata API is states. Some of the Wikidata properties used to store available at https://www.wikidata.org/w/api.php this epidemiological information have been created 9 in response to COVID-19 (e.g. Number of recoveries https://www.wikidata.org/wiki/Wikidata:Requests_for_permission s/Bot/CovidDatahubBot ID P8011 number of clini- cumulative number of clinical tests cal tests Properties for the alignment with scholarly databases P8049 number of hospi- number of cases that are hospitalized P496 ORCID iD identifier for a researcher (Open Re- talized cases searcher and Contributor ID) P3488 minimal incuba- minimal time between an infection and P1153 Scopus Author identifier for an author in the Scopus tion period in the onset of disease symptoms in in- ID bibliographic database humans fected humans P214 VIAF ID identifier for the Virtual International P3487 maximal incuba- maximal time between an infection and Authority File database tion period in the onset of disease symptoms in in- P7859 WorldCat Identi- entity on WorldCat for authority con- humans fected humans ties ID trol of authors‟ data P3492 basic reproduc- number of infections caused by one tion number infection within an uninfected popula- P1053 ResearcherID identifier for a researcher in a system for scientific authors, primarily used in tion Web of Science Properties for the alignment with clinical language resources and encyclopedias P494 ICD-10 identifier in the ICD catalogue codes 3. Language Representation for diseases - Version 10 Thanks to its multilingual and language- P672 MeSH tree code Medical Subject Headings (MeSH) codes are an index and thesaurus for independent data model as well as its link with vari- the life sciences (≠ MeSH ID, P486) ous biomedical ontologies and knowledge bases, Wi- kidata‟s biomedical language coverage in English, P1417 Encyclopædia identifier for an article in the online Britannica Online version of Encyclopædia Britannica French, German and Dutch is comparable to other 10 ID semantic resources such as SNOMED-CT , Babel- MeSH11, and ICD-1012 [7]. Despite the recent origin P486 MeSH descriptor identifier for Descriptor or Supplemen- ID tary concept in the Medical Subject of the COVID-19 pandemic, Wikidata‟s coverage on Headings controlled vocabulary the matter is already quite granular, with the main P3098 Clinical- identifier in the ClinicalTrials.gov three COVID items linked to 17,000 other items via Trials.gov Iden- database 55,000 relations at the time of writing. The degree of tifier translation of that information is interestingly high P6680 MeSH term ID identifier of a "MeSH term" (Medical with an important representation of the concepts in Subject Headings) more than 50 languages (Fig. 4E). However, this coverage varies between languages, with English as P6694 MeSH concept identifier of a Medical Subject Head- ID ings concept the unsurprising front-runner in items with COVID as the object, since many of those items are journal Properties for the non-relational characterization of Wikidata items articles with untranslated titles (Fig. 4A). The names P569 date of birth date on which the subject was born of the properties that link them (Fig. 4B,D) have much more even coverage, as do items with COVID P856 official website URL of the official homepage of an item (current or former) as the subject (Fig. 4C). This linguistic coverage is P1603 number of cases cumulative number of confirmed, prob- able and suspected occurrences 10 http://www.snomed.org/snomed-ct/sct-worldwide (Accessed on November 11, 2019): SNOMED-CT supports English, French, P1120 number of deaths total (cumulative) number of people Danish, Dutch, Spanish, Swedish, and Lithuanian. who died since start as a direct result of 11 https://lhncbc.nlm.nih.gov/project/babelmesh-and-pico- an event or cause linguist (Accessed on November 11, 2019): BabelMeSH supports Arabic, Chinese, Dutch, English, French, German, Italian, Japa- P3457 Case fatality rate proportion of patients who die of a nese, Korean, Portuguese, Russian, Spanish, and Swedish. particular medical condition out of all 12 ICD-10: International Classification of Diseases, 10th Revi- who have this condition within a given sion [18]: ICD-10 supports Arabic, Chinese, English, French, time frame (equal to the quotient of the Russian, Spanish, Albanian, Armenian, Azeri, Basque, Bulgarian, number of cases by the number of Catalan, Croatian, Czech, Danish, Dutch, Estonian, Persian, Fin- deaths as stated in a given day) nish, German, Greek, Hungarian, Icelandic, Italian, Japanese, P8010 Number of reco- number of cases that recovered from Korean, Latvian, Lithuanian, Macedonian, Mongolian, Norwegian, veries disease Polish, Portuguese, Serbian, Slovak, Slovenian, Swedish, Thai, Turkish, Turkmen, Ukrainian, and Uzbek. less uneven than other biomedical semantic resources 7 languages, LOINC 14 supports 13 languages, Ba- [19, 20] and is in line with efforts of generating mul- belMeSH [26] supports 13 languages, and ICD-10 tilingual language resources to be used for natural supports 42 languages. The support of other natural language processing purposes in clinical contexts languages can also be explained by the use of bots [21]. that extract multilingual terms representing clinical concepts based on natural language processing tech- niques and machine learning 15 [27] and by the in- volvement of research institutions and scientists speaking these languages, particularly German and Dutch, in adding biomedical information to Wikidata [28, 29]. These factors are not the only ones behind the lan- guage coverage of medical Wikidata relations, as the distribution of medical Wikidata labels, particularly for diseases‟ class, seems to be linked in part to the number of speakers of each language among the edi- tors of Wikidata [30], to the overall number of Wiki- data labels for each language [31] and to the number of medical Wikipedia articles for each language [32] as shown in Table 2. These correlations can be put to a test by querying Wikidata to find out the current status of the editing of this knowledge graph and of Wikipedia, as shown in Table 3. Despite several differences like the higher visibility of Asian languages (particularly Arabic and Chinese) in that table, the query results largely match the literature-derived data reported in Table 2.

Fig. 4. Language representation of COVID-19-related statements. A-D) Language coverage for items and properties used in state- ments when either the object or subject is one of the three COVID- related items (as per figure 2; note: log y-axis). The eight most common languages in Wikidata are shown: en=English, fr=French, de=German, es=Spanish, zh=Chinese, ar=Arabic, ja=Japanese, ru=Russian. E) Extended data for panels A-D, with the top 150 languages ordered by usage for each item (note: languages not necessarily in same order for each), as of August 15, 2020 (availa- ble at: https://w.wiki/auE, license: CC BY 4.0; live data: https://w.wiki/Yj$, https://w.wiki/Yk3, https://w.wiki/Yk5, and https://w.wiki/Yk6)

The better coverage of English is explained in part by the higher support of this language in both bio- medical language resources [22] and Wikipedia [23]. Cooperation with publishers such as Cochrane has a Hungarian, Italian, Japanese, Norwegian, Polish, Portuguese, significant effect on English Wikipedia coverage, too Romanian, Russian, Serbian, Slovenian, Spanish, and Swedish. 14 https://loinc.org/international/ (Accessed on August 13, [24]. The significant coverage of other languages like 2020): LOINC supports Chinese, Dutch, Estonian, English, French, French, Spanish, German, Chinese and Swedish in German, Greek, Italian, Korean, Portuguese, Russian, Spanish, and Medical Wikidata can be explained by their support Turkish. by major biomedical multilingual databases: ICPC-2 15 An example of such a Wikidata bot can be Edoderoobot 2, 13 which is specifically working on labelling, thereby translating [25] supports 24 languages , SNOMED-CT supports structured data into prose in the respective language. Further in- formation about this bot can be found at 13 ICPC-2 supports Afrikaans, Basque, Chinese, Croatian, Da- https://www.wikidata.org/wiki/Wikidata:Requests_for_permission nish, Dutch, English, Finnish, French, German, Greek, Hebrew, s/Bot/Edoderoobot_2.

Table 2 Languages ranked according to various variables, based on data from the literature: Number of medical Wikipedia articles, number of Wikida- ta labels, number of native speakers, and number of Wikidata users. Style code: Italic for languages appearing in all four lists; bold for those appearing in only one. Medical Wikipedia, 2013 [32] Wikidata labels, 2017 [31] Population, 2019 [33] Wikidata users, 2018 [30] Rank Language Number of Language Rate of Language Native Language medical labels speakers articles (millions) 1 English 29072 English 11.04% Chinese 1323 English 2 German 7761 Dutch 6.47% Spanish 463 French 3 French 6372 French 6.02% English 369 German 4 Spanish 6367 German 5.08% Hindi 342 Spanish 5 Polish 5999 Spanish 4.07% Arabic 335 Italian 6 Italian 5677 Italian 3.9% Bengali 228 Russian 7 Portuguese 5269 Swedish 3.89% Portuguese 227 Dutch 8 Russian 4832 Russian 3.54% Russian 154 Japanese 9 Dutch 4391 Cebuano 2.21% Japanese 126 Danish 10 Japanese 4303 Bengali 1.94% Western Pun- 82.5 Portuguese jabi

Table 3 Languages ranked according to various variables, based on Wikidata queries (as of August 11, 2020). The Medical Wikipedia query yields Wikipedia articles associated with Wikidata items that have a Disease Ontology ID (P699) or are in the tree of any of the following classes: medicine (Q11190), disease (Q12136), medical procedure (Q796194) or medication (Q12140). The Medical Wikidata labels query yields labels of Wikidata items that have a Disease Ontology ID (P699) or a MeSH Desccriptor ID (P486) or are in the tree of any of the same four classes. The Wikipedia and Wikidata users column provides a snapshot from the Wikidata dashboard that lists Wikidata users who also edit Wikipedia by number of such users per Wikipedia language. Style code: Italic for languages appearing in all three lists; bold for those appear- ing in only one. Medical Wikipedia articles Medical Wikidata labels Wikipedia and Wikidata users https://w.wiki/Z6a https://w.wiki/Z6h https://w.wiki/Z6W Rank Language Number of med- Language Number of la- Language Number of users ical articles bels 1 English 16670 English 65986 English 9600 2 German 8911 French 37053 French 2580 3 Arabic 8596 German 22432 German 2490 4 French 7258 Spanish 21505 Spanish 2330 5 Spanish 6979 Arabic 18581 Russian 1790 6 Italian 6498 Italian 18074 Italian 1430 7 Polish 6071 Japanese 17992 Chinese 1120 8 Portuguese 5652 Dutch 17985 Japanese 1090 9 Russian 5564 Chinese 17462 Portuguese 979 10 Japanese 4651 Russian 17165 Arabic 688

Similarly, the current representation of COVID-19 cal Wikidata is highly valuable to ameliorate and Wikidata items in natural languages seems to corre- increase the language coverage of Wikidata in medi- late with Wikipedia pages, edits and pageviews re- cal domains. However, this language coverage of the lated to COVID-19 pandemic as shown in Table 4. studied medical relations of Wikidata does not corre- Consequently, encouraging the contribution by late with the number of speakers of each natural lan- speakers of under-resourced and unrepresented lan- guage as shown in Table 2. This finding aligns with guages to medical Wikipedia projects16 and to Medi- previous research showing that the overall language distribution of Wikidata labels does not seem to cor- relate with the number of speakers of each language 16 Current efforts to enhance the coverage and language support [31]. of medical knowledge in Wikipedia are mainly driven by Wikime- dia Medicine. For further information, please refer to https://meta.wikimedia.org/wiki/Wiki_Project_Med. An example of the initiatives under this umbrella is the Special Wikipedia on languages of India. An explanation of this project can be found Awareness Scheme for The Healthcare Affiliates project, focused at https://en.wikipedia.org/wiki/Wikipedia:SWASTHA.

Table 4 Languages ranked according to various COVID-19-related variables (as of August 13, 2020). The COVID Wikidata content query sorts lan- guages by the number of labels of Wikidata items with a direct link to and/or from any of the core COVID-19 items - Q84263196 (COVID-19), Q81068910 (COVID-19 pandemic) and Q82069695 (SARS-CoV-2) - excluding items about humans (3131) or scholarly publications (40164). The COVID Wikipedia pages query filters those Wikidata items for associated Wikipedia articles and sorts languages by the number of such articles. The values in the COVID Wikipedia edits column represent the revision counts per Wikipedia language as taken from the dashboard listing Wikimedia projects by total number of revisions to COVID-19-related articles. The COVID-19 pandemic Wikipedia pageviews column represents daily average user traffic (averaged since January 1, 2020) to the article about the COVID-19 pandemic in the respective language. Style code: Italic for languages appearing in all four lists; bold for those appearing in only one. COVID Wikidata content COVID Wikipedia pages COVID Wikipedia edits COVID-19 pandemic Wikipe- https://w.wiki/ZSq https://w.wiki/ZSt https://covid- dia pageviews da- https://w.wiki/ZTG ta.wmflabs.org/perProjectN oHumans Rank Language Number Language Number of Language Number of Language Average of labels articles edits daily page- views 1 English 1429 English 561 English 250306 English 52872 2 Dutch 785 Arabic 517 German 126359 Russian 41246 3 Arabic 623 German 431 French 42029 Spanish 37722 4 Catalan 579 Portuguese 427 Chinese 41545 Chinese 27598 5 German 561 Korean 408 Spanish 30869 German 20707 6 French 517 Chinese 396 Arabic 19963 Italian 8490 7 Japanese 503 Vietnamese 392 Russian 18719 French 7959 8 Chinese 483 French 379 Japanese 11508 Portuguese 7648 9 Portuguese 463 Spanish 370 Ukrainian 10599 Japanese 5227 10 Spanish 433 Indonesian 363 Hebrew 10386 Arabic 4300

updated on a daily basis [8, 35, 36]. The output using such Wikidata properties can be adapted as an open 4. Database alignment license framework for the automatic evaluation and learning of knowledge graph alignment approaches As shown in the “Data model” section, Wikidata [8, 37] and for the integration of scholarly knowledge items are linked to their equivalents in other semantic [38]. databases using statements where the property pro- In the circumstances of the COVID-19 outbreak, a 19 vides details about a given resource and the object is SPARQL query has been formulated to analyze the the external identifier of the item in the aligned data- integration of external identifiers in Wikidata. This base. Similarly to Wikidata items, these database query succeeded in returning the main aligned exter- alignment properties are defined by labels, descrip- nal resources to the set of scholarly articles and clini- tions and aliases in various languages and by state- cal trials, of diseases, of symptoms, of drugs, of hu- ments describing logical conditions for their usage mans, of sovereign states, of genes, of proteins, and including formatting constraints and allowed values of other items related to the ongoing COVID-19 pan- of subject classes [34]. As of September 1, 2020, demic in Wikidata. This confirms the centrality of 530217 out of 787718 Wikidata properties are used to Wikidata within the linked open data cloud (Fig. 3) state the external identifiers of the Wikidata items. and consequently the usefulness of Wikidata to dy- These properties ensure the interoperability between namically integrate various types of semantic data in Wikidata and other databases and consequently the the context of the disease outbreak [17]. regular enrichment of Wikidata with detailed infor- As shown in Table 5, scholarly articles and clinical mation from online ontologies and knowledge graphs trials have been linked to numerous external identifi- ers, particularly the Digital Object Identifier (DOI), the PubMed ID, the Dimensions Publication ID, the 17 For the updated count of the properties defining external PubMed Central ID (PMCID) and the Clinical- identifiers, refer to https://w.wiki/ayn. 18 For the updated count of all the properties, refer to https://w.wiki/ayo. 19 https://w.wiki/auR Trials.gov Identifier. Most of these identifiers are P8150 COVIDWHO ID 11718 added thanks to WikiProject WikiCite aiming to add support of bibliographic information on Wikidata P2536 Sandbox-External identifier 6790 [39-41]. The current representation of external iden- tifiers for the scientific literature in Wikidata seems P8299 Semantic Scholar corpus ID 4612 to be similar to the general one for the bibliographic P3098 ClinicalTrials.gov Identifier 246 data in the knowledge graph. As of September 3, 20 2020, 36208373 scholarly articles are currently P818 arXiv ID 47 represented in Wikidata. 31425586 of which have PubMed IDs and 25896956, 6016452, and 346114 P2880 NIOSHTIC-2 ID 23 scientific publications respectively have DOIs,

PubMed Central IDs and ArXiv IDs. However, this As for the diseases and symptoms related to Wikidata coverage of the availability of COVID-19- COVID-19, Wikidata interestingly assigned them related publications in external research databases external identifiers in the main biomedical semantic does not seem to fully represent full records of CO- 28 29 databases such as MeSH, ICD-10 , and UMLS as VID-19 literature in aligned resources. In fact, 51007 well as in open lexical databases like OBO Foundry COVID-19-related records are available on ontologies (e.g. Human Phenotype Ontology) and PubMed21, 117583 COVID-19 publications are avail- Freebase as shown in Table 6. This is mainly due to able on Dimensions 22 , 53001 are accessible on the use of machine learning algorithms to align these PubMed Central23, 213000 records are available on major online biomedical resources to Wikipedia ar- Semantic Scholar24, 3232 records are found at Clini- ticles and consequently to Wikidata items [43]. The calTrials.gov25, 2017 records are available on arXiv representation of open license resources is particular- ID26, and 53 records are reachable on NIOSHTIC-227 ly explained by the use of these databases to form the as of September 3, 2020. This lack of support is core of the biomedical knowledge in Wikidata mainly explained by the method of development of through mass uploads and timely updates [44]. Dis- scientific metadata on Wikidata that is based on la- eases and symptoms are also assigned external iden- tent crowdsourcing of information from multiple tifiers linking them to several online encyclopedias sources through bots and human efforts and not on (e.g. eMedicine, Encyclopedia Britannica, and Med- the real-time screening of the external scholarly re- linePlus) and to non-medical databases such as scho- sources [41, 42]. 30 larly repositories (e.g. JSTOR ) and bibliographic Table 5 databases (e.g. Microsoft Academic31). This can be Main Wikidata properties used to represent the external identifiers explained by the efforts of WikiProject WikiCite to of scholarly articles and clinical trials (as of August 31, 2020). align topic pages in research databases to Wikidata Wikidata ID Wikidata Property Count items so that the active members of this project can easily extract topics of research publications from P356 DOI 45101 source databases and assign them to the correspond- ing Wikidata items using main subject [P921] rela- P698 PubMed ID 42294 tions [39]. The links between diseases and symptoms P6179 Dimensions Publication ID 16944 to online first-class encyclopedias is not restricted to the context of COVID-19 pandemic [44] and is rather P932 PMCID 12590 established to provide further specialized readings to users about every concerned Wikidata item [45] and to allow comparison of medical data quality between 20 https://scholia.toolforge.org/ 21 Wikipedia and other encyclopedias [32]. https://pubmed.ncbi.nlm.nih.gov/?term=COVID-19 22 https://tinyurl.com/y6kwrdth 23 https://www.ncbi.nlm.nih.gov/pmc/?term=COVID-19 24 https://www.semanticscholar.org/search?q=COVID- 19&sort=relevance 25 https://clinicaltrials.gov/ct2/results?cond=COVID- 28 International Classification of Diseases, Tenth Revision 19&term=&cntry=&state=&city=&dist= (https://www.who.int/classifications/icd/en/) 26 https://arxiv.org/search/?query=COVID- 29 Unified Medical Language System 19&searchtype=all&source=header (https://www.nlm.nih.gov/research/umls/index.html) 27 https://www2a.cdc.gov/nioshtic-2/Buildqyr.asp?S1=COVID- 30 https://www.jstor.org/ 19&Submit=Search 31 https://academic.microsoft.com/ Table 6 (e.g. Google Play) and disease outbreaks (e.g. Sub- Main Wikidata properties used to represent the external identifiers reddit) as shown in Table 8. The alignment to various of diseases and symptoms (as of August 31, 2020). metadata databases like VIAF36, WorldCat37, Library of Congress and IMDb38 is motivated by the mass Wikidata ID Wikidata Property Diseases Symptoms import of authority control data for the interoperabili- count count ty between library metadata and for the prevention of P672 MeSH tree code 40 12 the duplication of items including book authors, ac- tors and films [45, 46]. The alignment of Wikidata P2892 UMLS CUI 38 11 items about sovereign states and humans to corres- ponding topic pages and user pages in social net- P494 ICD-10 32 8 working services () and question answering P4229 ICD-10-CM32 32 1 forums (Quora and Reddit) are mainly done to track the effect of the information provided by Wikimedia P3827 JSTOR topic ID 32 10 projects, particularly Wikipedia, on online communi- ties [47] or to retrieve information about items in P6366 Microsoft Aca- 29 11 social media and consequently to support the topic demic ID modelling of the coverage of COVID-19 pandemic in P493 ICD-933 26 5 social networks [48]. The sum of these database alignments are useful to integrate new non-clinical P673 eMedicine ID 24 2 information to Wikidata, to allow correlations be- tween epidemiological data and non-medical infor- P1417 Encyclopedia 23 7 mation about countries, individuals, masterpieces and Britannica Online disease outbreaks such as geopolitical, software pro- ID gramming and economic data, and to provide further readings about the concerned items [38]. P486 MeSH descriptor 23 9 Table 7 ID Main Wikidata properties used to represent the external identifiers P646 Freebase ID 21 10 of humans and sovereign states (as of August 31, 2020).

P3841 Human Phenotype 18 9 Wiki- Wikidata Property Sovereign Humans Ontology ID data states ID P604 MedlinePlus ID 19 9 P214 VIAF ID 159 654 P508 BNCF34 Thesau- 17 7 rus ID P7859 WorldCat Identities ID 146 548

P1296 Gran Enciclopedia 10 7 P244 Library of Congress authority 125 458 Catalana ID ID

P8408 KBpedia35 ID 16 7 P213 ISNI39 100 443

P646 Freebase ID 124 379 The matching between Wikidata items and online encyclopedias and non-medical resources is P2002 Twitter username 16 353 not only restricted to disease and symptoms but also 40 to humans and sovereign states (e.g. VIAF, WorldCat, P227 GND ID 125 308 Library of Congress and GeoNLP) as shown in Table 7 and to films (e.g. IMDb), computer applications 36 Virtual International Authority File (http://viaf.org/) 37 https://www.worldcat.org/ 32 International Classification of Diseases, Tenth Revision, 38 Internet Movie Database (https://www.imdb.com/) Clinical Modification 39 https://isni.org/ 33 International Classification of Diseases, Ninth Revision 40 Gemeinsame Normdatei (German National Library, Germa- 34 Biblioteca Nazionale Centrale di Firenze (Central National ny), Library of Florence, Italy) https://www.dnb.de/DE/Professionell/Standardisierung/GND/gnd_ 35 https://kbpedia.org/ node.html P345 IMDb ID 274 nesis of coronavirus and mainly COVID-19 through the analysis of the physiological features of infected P268 Bibliothèque nationale de 177 269 taxons [51]. The sum of these biomedical alignments France ID is developed using human edits and computer tools

41 thanks to large initiatives to develop open ontological P269 IdRef ID 84 265 databases for curating advanced molecular biology P998 DMOZ42 ID 158 data such as WikiGenomes [29] and Gene Wiki [16] and is enhanced in the context of the current pandem- P3417 Quora topic ID 141 73 ic through the contributions of WikiProject COVID- 19 [13]. P1417 Encyclopedia Britannica On- 138 53 Table 8 line ID Main Wikidata properties used to represent the external identifiers P5400 GeoNLP ID 128 for other Wikidata classes (as of August 31, 2020).

P349 National Diet Library ID 127 54 Wikidata Class Wikidata ID Wikidata Property Count

P4801 LoC MARC43 vocabularies ID 126 drug [Q11173] P6689 MassBank accession 44 ID

Concerning drugs, proteins, genes and taxons, Wi- drug [Q11173] P4964 SPLASH49 31 kidata items are mainly assigned external identifiers protein [Q8054] P638 PDB structure ID 31 in the major knowledge graphs for pharmacology 44 45 (e.g. MassBank ), for biodiversity (e.g. IRMNG ), film [Q11424] P345 IMDb ID 25 for genomics (e.g. Entrez Gene) and for proteomics (e.g. PDB 46) and are rarely linked to non-medical film [Q11424] P2603 Kinopoisk film ID 23 databases or to encyclopedias as shown in Table 8. The lack of alignment between these biomedical Wi- film [Q11424] P7177 Cinestaan film ID 22 kidata items and their equivalents in social web ser- disease outbreak P3984 subreddit 22 vices is clearly explicated by the better interest of the [Q3241045] users of social media to the health policies and epi- demiology of COVID-19 rather than the therapeutic protein [Q8054] P637 RefSeq50 protein ID 18 options and molecular aspects related to the disease committee group P8433 Swedish Riksdag 18 [49]. The most important interest in matching these motion document ID concepts in Wikidata to graph databases (e.g. Mass- [Q97695005] bank, PDB, and KEGG47) and semi-structured data- bases (e.g. Guide to Pharmacology) for bioinformat- film [Q11424] P2529 ČSFD51 film ID 17 ics rather than online encyclopedias is due to the bet- 52 ter availability of genomic and proteomic informa- drug [Q11173] P267 ATC code 17 tion in these specialized semantic resources [44, 50]. protein [Q8054] P352 UniProt protein ID 16 The alignment of taxon items in Wikidata to biodi- 48 versity knowledge graphs (e.g. NCBI taxonomy and protein [Q8054] P5458 Guide to Pharmacol- 15 IRMNG) is to permit the discussion of the pathoge- ogy Target ID

COVID-19 app P7771 PersonalData.IO ID 14 41 Identifiants et Référentiels pour l‟enseignement supérieur et [Q89288125] la recherche (Identifiers and credentials for higher education and research in France) gene [Q7187] P351 Entrez Gene ID 12 42 Directory Mozilla (https://dmoz-odp.org/) 43 https://www.loc.gov/marc/ 44 https://massbank.eu/MassBank/ 45 Interim Register of Marine and Nonmarine Genera 49 Spectral Hash Identifier (https://splash.fiehnlab.ucdavis.edu/) (https://www.irmng.org/) 50 NCBI Reference Sequence Database 46 Protein Data Bank (https://www.rcsb.org/) (https://www.ncbi.nlm.nih.gov/refseq/) 47 Kyoto Encyclopedia of Genes and Genomes 51 Česko-Slovenská filmová databáze (Czech-Slovak Film Da- (https://www.genome.jp/kegg/) tabase, https://www.csfd.cz/) 48 National Center for Biotechnology Information 52 Anatomical Therapeutic Chemical (ATC) Classification Sys- (https://www.ncbi.nlm.nih.gov/) tem (https://www.whocc.no/atc_ddd_index/) COVID-19 app P3418 Google Play Store 12 virus, and the resulting pandemic (extended list, sup- [Q89288125] app ID plementary table S2). Sample SPARQL queries are available at supplementary table S1. This section will gene [Q7187] P2393 NCBI locus tag 11 present examples across different aspects of COVID- 19, adapted from five main sources to which we have macromolecular P7718 Complex Portal ac- 11 56,57,58,59,60 complex cession ID contributed substantially . [Q22325163] 5.1. Biological and clinical aspects protein fragment P638 PDB structure ID 11 [Q78782478] A simple demonstration of Wikidata‟s encoding of drug [Q11173] P231 CAS Registry 53 9 SARS-CoV-2‟s basic biology is in its genetics (Fig. Number 5) and resulting symptoms (Fig. 6). The viral genome contains 11 genes that encode 30 proteins (and va- drug [Q11173] P715 DrugBank ID 9 riants), which are currently known to interact with over 170 different human proteins. Although there drug [Q11173] P665 KEGG ID 9 are two genome browsers based on Wikidata [29, 53], drug [Q11173] P638 PDB structure ID 9 neither yet display the SARS-CoV-2 genome. SPARQL visualizations provide a broader way to drug [Q11173] P652 UNII54 9 explore biomedical knowledge about the studied vi- rus and the related infectious disease. As the know- protein [Q8054] P705 Ensembl protein ID 8 ledge graph grows, this is allowing linking together COVID-19 app P3861 App Store app ID 8 complex knowledge on biochemistry (e.g. genes and [Q89288125] (global) proteins), biology (e.g. host taxa), clinical medicine (e.g. interventions) [44]. Such queries can be ex- drug [Q11173] P595 Guide to Pharmacol- 8 panded by considering the qualifiers that modulate ogy Ligand ID biomedical statements. These qualifiers allow the assignment of weights to assumptions according to drug [Q11173] P6366 Microsoft Academic 8 ID their importance and certainty. For instance, some treatments are indicated as hypothetical, or symptoms disease outbreak P3479 Omni topic ID 7 are listed as rare, as defined by their nature of state- [Q3241045] ment [P5102] or sourcing circumstances [P1480] qualifiers, with references to back these up (live data: taxon [Q16521] P5055 IRMNG ID 6 https://w.wiki/bmJ). taxon [Q16521] P685 NCBI taxonomy ID 6

5. Visualizing facets of COVID-19 via SPARQL 56 WikiProject COVID-19 (WPCOVID) queries: extracts from the query collection of Wikidata‟s WikiProject COVID-19; https://www.wikidata.org/wiki/Wikidata:WikiProject_COVID- The flexible data model of Wikidata enables it to 19/Queries be highly multidisciplinary, including information 57 SARS-CoV-2-Queries: extracts from the book “Wikidata ranging from medical to geopolitical to social aspects Queries around the SARS-CoV-2 virus and pandemic” [52]; of the pandemic. Given the breadth of Wikidata‟s https://egonw.github.io/SARS-CoV-2-Queries/ 58 SPEED queries: extracts from the Wikidata-based epidemio- COVID-19-related information (examples in sup- logical surveillance dashboard for COVID-19 pandemic in Tunisia. plementary figure S1), extracting specific subsets of It was partially built upon COVID-19 Wikidata dashboard; that information using SPARQL55 can illustrate dif- https://sites.google.com/view/covid19-dashboard 59 ferent aspects of the COVID-19 disease, its causative Scholia queries: queries underlying COVID-19-related visua- lizations from the Wikidata-based scholarly profiling tool Scholia [39]; https://scholia.toolforge.org/ 53 https://www.cas.org/support/documentation/chemical- 60 Covid-19 Summary queries: queries visualizing COVID-19 substances information in Wikidata linked to the epidemiological information 54 Unique Ingredient Identifier (https://fdasis.nlm.nih.gov/srs/) of the outbreak and to the characteristics of the infected famous 55 Technical documentation about SPARQL can be found at people; https://public.paws.wmcloud.org/User:99of9/Covid- https://en.wikibooks.org/wiki/SPARQL. 19.ipynb

Fig. 5. SARS-CoV-2 interactions with the human proteome (available at: https://w.wiki/c3D, license: CC BY 4.0). Proteins encoded by SARS- CoV-2 genes (note that some genes encode multiple proteins) and the currently known human protein interaction partners (live data: https://w.wiki/beR).

Fig. 6. Symptoms of COVID-19 and similar conditions (available at: https://w.wiki/byX, license: CC BY 4.0). A) Currently listed symptoms of COVID-19, with qualifiers indicating their frequency. (live data: https://w.wiki/N8f). B) Other medical conditions sorted by the number of shared symptoms with COVID-19. (live data: https://w.wiki/bqV; adapted from https://scholia.toolforge.org/disease/Q84263196)

5.2. Epidemiology B) Age distribution of notable persons who have died of COVID- 19 (blue), compared to the death age distribution for people who were born after 1901 (green), calculated from individual dates of Wikidata also contains the necessary information birth and death (live data: https://w.wiki/be7 and to calculate common epidemiology data for different https://w.wiki/but). countries, such as mortality per day per capita, and case number to mortality rate correlation. In some cases this is stored as aggregate data, such as the case mortality rate [P3457] statements for regional epi- 5.3. Research outputs demics stored as numeric data (Fig. 7A), whereas others common visualisations can be calculated from A large portion of Wikidata is dedicated to publi- scratch from granular data such as the individual date cation metadata and citation links. There are several of birth [P569] and of death [P570] of notable indi- ways to investigate the relevant topics in publications viduals deceased from COVID-19 (Figure 7B). Al- regarding COVID-19. Firstly, topic keywords can be though this reflects the age distribution of COVID extracted directly from the titles of articles with CO- mortality, it is also influenced by the demographics VID-19 as a main topic (Fig. 8A). This is a useful of persons sufficiently notable to have Wikidata and rapid first-approximation of topics covered by items. In some cases summary data is also time- those publications, extracted as plain text, and can be resolved, allowing inquiry of its change over time expanded upon by querying the main topics to con- over time (supplementary figure S2), capturing fea- cepts in Wikidata using the property main subject tures not depicted in several statistical predictions of (P921). This property acts analogously to Medical the epidemiological evolution of COVID-19 out- Subject Headings (MeSH) descriptors [58], though breaks [54] and clearly seen in other data sources, far more extensive. Those statements allow broader such that mortality peaks at the beginning of a dis- querying of the literature as a network via co- ease outbreak [55]. Wikidata‟s granularity and colla- occurrence of topics as the main subject of articles borative editing have also made it highly up to date (Fig. 8B)61. This enables rapid traversal and faceting on queries such as the most recent death of notable of the literature on topics in addition to the traditional persons due to COVID-19, a result likely difficult to links made by tracing citations [59], such as extract- achieve with other datasets (supplementary figure ing common pharmacological and non- S3), and mirroring Wikipedia‟s well-known rapid pharmacological interventions (live data: response to updating information on deaths [56, 57]. https://w.wiki/N8i). The „WikiCite‟ project is work- ing on importing the citation network into Wikidata to make a fully open citation network (Fig. S4) [60]. Because Wikidata is agnostic to the exact type of research output, its structure is equally suited to representing information on research publications, preprints (Fig. S5), clinical trials (Fig. S6) or com- puter applications (Fig. S7). However, preprints are not yet thoroughly covered in Wikidata, a limitation for this context as preprints have become particularly important during the rapid pace of COVID-19 re- search [61]. Further, Wikidata‟s rich biographical and institutional data makes extracting information on authors or others easy (Fig. S8), and eventually other contributors [62].

Fig. 7. Summary epidemiological data on the COVID-19 pandem- ic (available at: https://w.wiki/byW, license: CC BY 4.0). A) Cor- relation between the current number of cases and mortality rates in every country, calculated from numeric summary data for each region. Countries coloured randomly (live data: https://w.wiki/bf$). 61 https://ts404.shinyapps.io/topicnetwork 5.4. Societal aspects

Further emphasising the multidisciplinary nature of Wikidata, there are also significant social aspects of the pandemic contained in the knowledge base. This includes simple collation of information, such as regional official COVID websites, and unofficial but common hashtags (Fig. S9), or relevant images under Creative Commons licenses (Fig. S10). It also in- cludes more cross-disciplinary information, such as companies that have reported bankruptcy, with the pandemic recorded as the main cause (Fig. 9), or the locations of those working on COVID (Fig. S8B). However, this also exemplifies how misleading miss- ing data can be: Wikidata currently has highly incon- sistent coverage of companies that are not publicly listed, which heavily biases the results. Standardised methods to audit and validate Wikidata‟s content on various topics are still under investigation and devel- opment [34].

Fig. 8. COVID-19 publication topics (available at: https://w.wiki/byV, license: CC BY 4.0). A) Common words and word combinations (ngrams) in the titles of publications (live data: Fig. 9. Bankrupt publicly listed businesses due to the COVID-19 https://w.wiki/cFu). B) Co-occurrence of topics in publications pandemic (available at: https://w.wiki/byY, license: CC BY 4.0). with one of the COVID-related items as a topic, with ribbon A) Tabular output of SPARQL query B) Bankruptcies per month widths proportional to the number of publications sharing those C) ratios of different industries associated with bankrupt compa- topics (log scale). Topics coloured by group as determined by nies. (live data:https://w.wiki/cG6). louvain clustering, topics shared in fewer than 5 publications omit- ted (interactive version: https://csisc.github.io/WikidataCOVID19SPARQL/Fig8B.html; live data: https://w.wiki/bww). 6. Discussion

Many knowledge graphs have been recently de- veloped to represent various types of COVID-19 in- formation, including government responses [4], epi- demiology [63], clinical data [3], scholarly outputs These factors have facilitated Wikidata‟s rapid and outcomes [64], economic impacts [65], physi- growth since its creation in 2012 - likely surpassing opathology [1], social networking [66] among other 90 million items in the coming months - into a richly features related to the COVID-19 pandemic. The interconnected and interdisciplinary network [7, 34]. semantic databases are either built using a combina- In the context of COVID-19 outbreak, Wikidata has tion of human efforts and crowdsourcing techniques proven its efficiency to represent multiple facets of [4] or through the automatic extraction of informa- the pandemic ranging from biomedical information tion using natural language processing techniques to social impacts, surpassing other integrated seman- from scholarly publications about the outbreak, par- tic graphs such as CIDO [2], COVID-19 data hub ticularly the Covid-19 Open Research Dataset [6]. [67], COVID-19 Living Data 63 [70] and Know- Despite the importance of these provided resources, ledge4COVID-19 64 [71] that only combine two to they tend to cover a narrow aspect of the disease. three distinct features of the pandemic as shown in That is why initiatives have been launched by closed the “data model” and “Visualizing facets of COVID- groups of data curators to create and sustain plat- 19 via SPARQL” sections. This large-scale informa- forms to integrate COVID-19 knowledge such as tion is supported in multiple languages as explained CIDO [2] and COVID-19 data hub [67]. These cen- in the “language representation” section and is tralized semantic databases allowed the combination matched to its equivalents in other semantic databas- of many divergent factors of COVID-19 (e.g., clini- es as revealed by “database alignment” section. cal data with genomics, and epidemiological data Moreover, SPARQL as a semantic query language with social interactions) to achieve more complex has been proved as useful to enable deeper insights scientific interpretations about SARS-CoV-2 virus into the different facets of the multidisciplinary CO- and the pandemic and consequently to support more VID-19 information in Wikidata, characterized as big advanced decision making related to the outbreak [2, data by its volume, variety, velocity and veracity [72, 67]. However, due to the sharply growing nature of 73]. This confirms previous findings about the im- the scholarly literature about coronaviruses [68], the portance of querying COVID-19 semantic resources COVID-19 knowledge has become difficult to follow such as CIDO [2] to compare clinical information and manage by a restricted community of scientists. with other types of COVID-19 information and con- Whereas most knowledge graphs tend to be spe- sequently to generate deeper characteristics of the cialized and closed, Wikidata deliberately takes a disease [74]. The advantage of applying SPARQL to multidisciplinary, multilingual position in the linked extract and visualize COVID-19 information from a open data ecosystem. It is this breadth, combined generalized knowledge graphs such as WIkidata with its interoperability, that makes it unique when compared to knowledge graphs developed for amongst even other user-generated collaborative the pandemic like CIDO [2] is that open knowledge projects. Indeed, uniquely suited to topics such as the graphs also include non-COVID-19 information such COVID-19 pandemic [13, 44]. In comparison to oth- as economic, industrial, climatic and social facts that er resources like DBPedia, Wikidata is not just edited can be used to generate summary information to ex- by machines and built from data automatically ex- plain the reasons behind the dynamics of the studied tracted from textual resources like Wikipedia [69]. pandemic. Wikidata is enriched and adjusted as well by a com- Despite the advantages of collaborative editing and munity of over 23000 active human users on a daily free reuse of open knowledge graphs like Wikidata to basis62 and is released under the CC0 license allow- support and enrich COVID-19 information, these two ing the free and unconditioned reuse and interopera- features have several drawbacks related to data quali- bility of its information in other systems and datasets ty and legal concerns. It is true that the use of fully and consequently the growth of interest of many open licenses (CC0 or Public domain) in centralized people in using, enriching and adjusting it [34]. By knowledge graphs removes all legal barriers to their being highly multilingual, its human-readability ex- reuse in other knowledge graphs or to drive know- tends beyond just English to support international ledge-based systems and encourages the development contribution [7, 34]. Also, its flexible editing policy of intelligent support to tasks related to COVID-19. and RDF structure permit the easy creation of new However, The application of CC0 license on these classes, properties and data models to rapidly support databases causes them not to integrate information emerging data topics [7, 34]. 63 https://covid-nma.com/ 62 https://www.wikidata.org/wiki/Special:Statistics 64 https://devpost.com/software/covid-19-kg for semantic resources and datasets with partially mainly the ones from the linked open data cloud [7] open licenses (e.g. CC BY and MIT) as these licenses but also on the use of validation schemas in ShEx68, require either the attribution of the source work to of logical constraints implemented in SPARQL and authors or the use of the same license to process the of bot edits to verify the structure and accuracy of data [75, 76]. This situation seems analogous to the COVID-19 information [34, 83]. The data validation group O red blood cells status as universal donor but infrastructure of Wikidata seems to be in accordance restricted recipient [77]. with the latest updates in knowledge graph evaluation Although collaborative editing contributed to the and refinement techniques [84, 85] and explains in development of large-scale information about all part the reasons behind the robustness of the data aspects of the disease, there are currently still signifi- model of COVID-19 information in this open know- cant gaps and biases in the dataset that can lead to ledge graph. misleading results if not interpreted with caution. For example, the COVID-19 outbreaks on cruise 65 and naval66 ships are more covered in Wikipedia than in 7. Conclusion Wikidata (as well as in many other online resources). Another example can be the distorted representation In this research paper, we demonstrate the ability of scholarly citations in Wikidata, since no scalable of open and collaborative knowledge graphs such as workflows currently exist for their systematic cura- Wikidata to represent and integrate all the multidis- tion. Although many of these gaps are rapidly being ciplinary aspects of the COVID-19 information and addressed and closed over time, errors of omission to generate summary visualizations about the infec- and bias are inevitable to some extent. Such deficien- tious disease using SPARQL. As an open semantic cies can only be solved by applying algorithms that resource in the RDF format, Wikidata has become a assess data completeness of an item included in a hub for COVID-19 knowledge. The insertion of in- given class within open knowledge graphs by com- formation in the Linked Open Data format provides paring it with other items of the same class [78, 79] the flexibility to integrate data from many facets of or through enhancing the use of knowledge graph COVID-19 data with non-COVID-19 data. By its learning techniques to enrich, sustain and update multilingual structure, these inputs are contributed to open knowledge graphs from textual databases like (and reused by) people all over the world, with dif- scholarly publications [80] and electronic health ferent backgrounds. Effectively, the Wikiproject records [81]. Moreover, collaborative editing can COVID-19 has made COVID-19 knowledge more cause several inaccuracies in the declaration of FAIR: Findable, Accessible, Interoperable and Reus- statements in open knowledge graphs disregarding able. An important partner for Wikidata FAIRness is the metadata standards of the knowledge bases [82]. the SPARQL query service. The SPARQL endpoint, These inconsistencies can persist particularly when combined with community-contributed data visuali- the database and the largely growing scholarly litera- zation tools, provides a human-friendly interface. As ture about COVID-19 is managed by a limited num- shown here, SPARQL visualizations are an entry- 67 ber of administrators and can consequently cause point for deeper insights into COVID-19, both re- matters about the trustworthiness of the reuse of data garding the biomedical facets of this still new disease, [82]. However, critical problems related to structural as well as into the societal details of the pandemic. deficiencies in defining statements or to the inclusion Even though COVID-19 knowledge is ab- of mistaken data in open knowledge graphs seem to undant on Wikidata, there is always room for future happen less frequently in Wikidata [35]. This is not improvement. As a collaborative endeavour, Wikida- only due to the involvement of more contributors in ta (and the WikiProject COVID-19 specially) is like- Wikidata than in other open knowledge graphs [35] ly to become richer with time. By the collective ef- or on the reliance of this knowledge base on second- forts of contributors, we hope that the database will ary databases that are timely curated and enriched, grow in quality and coverage, supporting other types of information, such as the outcomes of the ongoing 65 https://en.wikipedia.org/wiki/COVID- COVID-19-related research efforts. As Wikidata is 19_pandemic_on_cruise_ships 66 https://en.wikipedia.org/wiki/COVID- 19_pandemic_on_naval_ships 68 The validation schemas for COVID-19 information in Wiki- 67 As of September 13, 2020, there are only 63 Wikidata admin- data are currently available at istrators as shown at https://www.wikidata.org/wiki/Wikidata:WikiProject_COVID- https://www.wikidata.org/wiki/Special:Statistics. 19/Data_models. community-oriented and broadly themed, virtually P. Sloan Foundation under grant number G-2019- any researcher can take advantage of its knowledge, 11458. We thank the Wikidata community, Egon and contribute to it. SPARQL queries can comple- Willighagen (Maastricht University, Netherlands), ment and enrich research publications, providing Toby Hudson (University of Sydney, Australia), both an overview of domain-specific knowledge for Adam Shorland (Wikimedia Deutschland, Germany), original research, as well as serving as the base for and Mahir Morshed (University of Illinois at Urbana- systematic reviews or scientometric studies. Of note, Champaign, United States of America) for useful SPARQL queries can be inserted into living publica- comments and discussions about the topic of this tions, which can keep up to date with the advance- research paper. ments both in human knowledge and its coverage on Wikidata. References

8. Author statements [1] D. Domingo-Fernández, S. Baksi, B. Schultz, Y. Gadiya, R. Karki, T. Raschka, et al. COVID-19 Knowledge Graph: a computable, multi-modal, cause-and-effect knowledge model Conflict of interest: All authors of this paper are of COVID-19 pathophysiology. BioRxiv (2020). active members of WikiProject Medicine, the com- doi:10.1101/2020.04.14.040667. munity curating clinical knowledge in Wikidata, and [2] Y. He, H. Yu, E. Ong, Y. Wang, Y. Liu, A. Huffman, et al. of WikiProject COVID-19, the community develop- CIDO, a community-based ontology for coronavirus disease knowledge and data integration, sharing, and analysis. Scien- ing multidisciplinary COVID-19 information in Wi- tific data, 7.1 (2020), p. 181. doi:10.1038/s41597-020-0523-6. kidata. DJ is a non-paid voluntary member of the [3] M. Ostaszewski, A. Mazein, M. E. Gillespie, I. Kuperstein, A. Board of Trustees of the Wikimedia Foundation, the Niarakis, H. Hermjakob, et al. COVID-19 Disease Map, build- non-profit publisher of Wikipedia and Wikidata. ing a computational repository of SARS-CoV-2 virus-host in- teraction mechanisms. Scientific data, 7.1 (2020)., p. 136. Data availability: Source files for most of the doi:10.1038/s41597-020-0477-8. tables featured in this work are available at [4] A. Desvars-Larrive, E. Dervic, N. Haug, T. Niederkrotenthaler, https://github.com/csisc/WikidataCOVID19SPARQL J. Chen, A. Di Natale, et al. A structured open dataset of gov- under the CC0 license [86]. Figures involved in this ernment interventions in response to COVID-19. Scientific data, 7.1 (2020), p. 285. doi:10.1038/s41597-020-00609-9. research study are available at https://w.wiki/bnW [5] Y. Liu, W. K. B. Chan, Z. Wang, J. Hur, J. Xie, H. Yu, and Y. mostly under the CC BY 4.0 License and their source He. Ontological and bioinformatic analysis of anti-coronavirus SPARQL queries69 are linked in the figure legends. drugs and their Implication for drug repurposing against CO- Internet archive links for the cited URLs are made VID-19. Preprints (2020), p. 2020030413. doi:10.20944/preprints202003.0413.v1. available at https://doi.org/10.5281/zenodo.4022591 [6] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. thanks to ArchiveNow [87]. Eide, et al. CORD-19: The Covid-19 Open Research Dataset. arXiv preprint arXiv:2004.10706 (2020). https://www.ncbi.nlm.nih.gov/pmc/articles/pmc7251955/. [7] H. Turki, T. Shafee, M. A. Hadj Taieb, M. Ben Aouicha, D. 9. Acknowledgements Vrandečić, D. Das, and H. Hamdi. Wikidata: A large-scale collaborative ontological medical database. Journal of Bio- The work done by Houcemeddine Turki, Mo- medical Informatics, 99 (2019), p. 103292. doi:10.1016/j.jbi.2019.103292. hamed Ali Hadj Taieb and Mohamed Ben Aouicha [8] D. Vrandečić and M. Krötzsch. Wikidata: a free collaborative was supported by the Ministry of Higher Education knowledgebase. Communications of the ACM, 57.10 (2014), and Scientific Research in Tunisia (MoHESR) in the pp. 78-85. doi:10.1145/2629489. framework of Federated Research Project [9] D. Jemielniak. Common knowledge?: An ethnography of Wikipedia, Stanford: Stanford University Press, 2014. PRFCOV19-D1-P1. The work done by Jose Emilio ISBN:978-0804789448. Labra Gayo was partially funded by the Spanish [10] P. Konieczny. Adhocratic governance in the Internet age: A Ministry of Economy and Competitiveness (Society case of Wikipedia. Journal of Information Technology & Poli- challenges: TIN2017-88877-R). The work done by tics, 7.4 (2010), pp. 263-283. doi:10.1080/19331681.2010.489408. Daniel Mietchen was supported in part by the Alfred [11] A. Kirk. Data visualisation: A handbook for data driven de- sign, Sage, 2016. ISBN:9781473966314. [12] S. Ekins, D. Mietchen, M. Coffee, T. P. Stratton, J. S. Freun- 69 The source code of the SPARQL queries used in this work dlich, L. Freitas-Junior, et al. Open drug discovery for the Zi- are also made available at ka virus. F1000Research, 2016 (2016), p. 5:150. doi: https://web.archive.org/web/20200914223401/https://www.wikidat 10.12688/f1000research.8013.1. a.org/wiki/Wikidata:WikiProject_COVID- 19/Queries/SPARQL_Study. [13] A. Waagmeester, E. L. Willighagen, A. I. Su, M. Kutmon, J. E. Evaluation, 54 (2019), pp. 385-418. doi:10.1007/s10579-019- Labra Gayo, D. Fernández-Álvarez, et al. A protocol for add- 09453-9. ing knowledge to Wikidata, a case report. BioRxiv (2020). [28] A. Waagmeester, L. Schriml, and A. I. Su. Wikidata as a doi:10.1101/2020.04.05.026336. linked-data hub for Biodiversity data. Biodiversity Information [14] L. Ehrlinger and W. Wöß. Towards a Definition of Know- Science and Standards, 3 (2019), p. e35206. ledge Graphs. CEUR Workshop Proceedings, 1695 (2016), pp. doi:10.3897/biss.3.35206. 1-4. [29] T. E. Putnam, S. Lelong, S. Burgstaller-Muehlbacher, A. [15] S. Malyshev, M. Krötzsch, L. González, J. Gonsior, and A. Waagmeester, C. Diesh, N. Dunn, et al. WikiGenomes: an Bielefeldt. Getting the most out of Wikidata: Semantic tech- open web application for community consumption and cura- nology usage in wikipedia‟s knowledge graph, in: Interna- tion of gene annotation data in Wikidata. Database, 2017 tional Semantic Web Conference, Springer, Cham, 2018, pp. (2017), p. bax025. doi:10.1093/database/bax025. 376-394. doi:10.1007/978-3-030-00668-6_23. [30] Kaffee, L.-A., & Simperl, E. (2018). Analysis of Editors' [16] S. Burgstaller-Muehlbacher, A. Waagmeester, E. Mitraka, J. Languages in Wikidata, in: Proceedings of the 14th Interna- Turner, T. Putman, J. Leong, et al. Wikidata as a semantic tional Symposium on Open Collaboration, ACM, 2018, p. 21. framework for the Gene Wiki initiative. Database, 2016 doi:10.1145/3233391.3233965. (2016), p. baw015. doi:10.1093/database/baw015. [31] Kaffee, L. A., Piscopo, A., Vougiouklis, P., Simperl, E., Carr, [17] J. Debattista, C. Lange, S. Auer, and D. Cortis. Evaluating the L., & Pintscher, L. (2017). A glimpse into babel: An analysis quality of the LOD cloud: An empirical investigation. Seman- of multilinguality in Wikidata, in: Proceedings of the 13th In- tic Web, 9.6 (2018), pp. 859-901. doi:10.3233/SW-180306. ternational Symposium on Open Collaboration, ACM, 2017, p. [18] N. Jetté, H. Quan, B. Hemmelgarn, S. Drosler, C., D.-G. 14. doi:10.1145/3125433.3125465. Maass, et al. The development, evolution, and modifications [32] J. M. Heilman and A. G. West. Wikipedia and medicine: of ICD-10: challenges to the international comparability of quantifying readership, editors, and the significance of natural morbidity data. Medical Care, 48.12 (2010), pp. 1105-1110. language. Journal of Medical Internet Research, 17.3 (2015), doi:10.1097/MLR.0b013e3181ef9d3e. p. e62. doi:10.2196/jmir.4069. [19] F. Liu, P. Fontelo, and M. Ackerman BabelMeSH: develop- [33] D. M. Eberhand, G. F. Simons, and C. D. Fennig. Ethnologue: ment of a cross-language tool for MEDLINE/PubMed, in: Languages of the World, SIL International, Dallas, Texas, AMIA Annual Symposium Proceedings, American Medical In- 2019. formatics Association, 2006, p. 1012. [34] H. Turki, D. Jemielniak, M. A. Hadj Taieb, J. E. Labra Gayo, [20] A. Henriksson, M. Skeppstedt, M. Kvist, M. Duneld, and M. M. Ben Aouicha, M. Banat, T. Shafee, E. Prud'Hommeaux, T. Conway. Corpus-driven terminology development: populating Lubiana, D. Das, and D. Mietchen. Using logical constraints Swedish SNOMED CT with synonyms extracted from elec- to validate information in collaborative knowledge graphs: a tronic health records, in: Proceedings of the 2013 Workshop study of COVID-19 on Wikidata. Zenodo (2020). on Biomedical Natural Language Processing, 2013, pp. 36-44. doi:10.5281/zenodo.4008358. [21] G. De Melo and G. Weikum. Towards universal multilingual [35] M. Färber, F. Bartscherer, C. Menne, and A. Rettinger. Linked knowledge bases, in: Principles, Construction, and Applica- data quality of DBpedia, Freebase, OpenCyc, Wikidata, and tions of Multilingual Wordnets. Proceedings of the 5th Global YAGO. Semantic Web, 9.1 (2018), pp. 77–129. WordNet Conference (GWC 2010), Narosa Publishing, New doi:10.3233/SW-170275. Delhi, India, 2010, pp. 149-156. [36] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. [22] F. Freitas, S. Schulz, and E. Moraes. Survey of current termi- Vrandečić (2014). Introducing Wikidata to the linked data nologies and ontologies in biology and medicine. Reciis, 3.1 web, in: International semantic web conference, Springer, (2009), pp. 7-18. doi:10.3395/reciis.v3i1.239en. Cham, 2014, pp. 50-65. doi:10.1007/978-3-319-11964-9_4. [23] T. Shafee, G. Masukume, L. Kipersztok, D. Das, M. [37] P. Ristoski, G. K. D. De Vries, and H. Paulheim. A collection Häggström, and J. Heilman. Evolution of Wikipedia‟s medical of benchmark datasets for systematic evaluations of machine content: past, present and future. J Epidemiol Community learning on the semantic web, in: International Semantic Web Health, 71.11 (2017), pp. 1122-1129. doi:10.1136/jech-2016- Conference, Springer, Cham, 2016, pp. 186-194. 208601. doi:10.1007/978-3-319-46547-0_20. [24] D. Jemielniak, G. Masukume, and M. Wilamowski. The most [38] D. Mietchen, G. Hagedorn, E. Willighagen, M. Rico, A. influential medical journals according to Wikipedia: quantita- Gómez-Pérez, E. Aibar, K, Rafes, C. Germain, A. Dunning, L. tive analysis. Journal of medical Internet research, 21.1 Pintscher, and D. Kinzler. Enabling open science: Wikidata (2019), e11429. doi:10.2196/11429. for research (Wiki4R). Research Ideas and Outcomes, 1 [25] Rodgers, R. C., Sherwin, Z., Lamberts, H., & Okkes, I. M. (2015), p. e7573. doi:10.3897/rio.1.e7573. (2004). ICPC Multilingual Collaboratory: a Web-and Un- [39] F. Å. Nielsen, D. Mietchen, and E. Willighagen. Scholia, icode-based system for distributed editing/translating/viewing scientometrics and Wikidata, in: European Semantic Web of the multilingual International Classification of Primary Conference, Springer, Cham, 2017, pp. 237-259. Care, in: Medinfo 2004, IOS Press, 2004, pp. 425-429. doi:10.1007/978-3-319-70407-4_36. doi:10.3233/978-1-60750-949-3-425. [40] D. Mietchen. State of WikiCite in 2020, in: Workshop on [26] P. Fontelo, F. Liu, S. Leon, A. Abrahamane, and M. Acker- Open Citations and Open Scholarly Metadata 2020, 2020. man. PICO Linguist and BabelMeSH: development and par- doi:10.5281/zenodo.4019954. tial evaluation of evidence-based multilanguage search tools [41] L. Wyatt, P. Ayers, M. Proffitt, D. Mietchen, E. Seiver, A. for MEDLINE/PubMed, in: Medinfo 2007: Proceedings of the Stinson, D. Taraborelli, C. Virtue, J. Tud, and J. Curiel. Wiki- 12th World Congress on Health (Medical) Informatics, IOS Cite Annual Report, 2019–20. Zenodo (2020). Press, 2007, p. 817. doi:10.5281/zenodo.3869809. [27] A. R. Terryn, V. Hoste, and E. Lefever. In no uncertain terms: [42] D. Taraborelli, L. Pintscher, D. Mietchen, and S. Rodlund. a dataset for monolingual and multilingual automatic term ex- WikiCite 2017 report. Figshare (2017). traction from comparable corpora. Language Resources and doi:10.6084/m9.figshare.5648233. [43] A. Rahimi, T. Baldwin, and K. Verspoor. WikiUMLS: Align- [61] M. S. Majumder and K. D. Mandl. Early in the epidemic: ing UMLS to Wikipedia via Cross-lingual Neural Ranking. impact of preprints on global discourse about COVID-19 arXiv preprint arXiv:2005.01281 (2020). transmissibility. The Lancet Global Health, 8.5 (2020), pp. [44] A. Waagmeester, G. Stupp, S. Burgstaller-Muehlbacher, B. M. e627-e630. doi:10.1016/S2214-109X(20)30113-3. Good, G. Malachi, O. L. Griffith, et al. Wikidata as a know- [62] J. Nunn, T. Shafee, S. Chang, R. Stephens, J. Elliott, S. Oliver, ledge graph for the life sciences. eLife, 9 (2020), p. e52614. et al. Standardised Data on Initiatives-STARDIT: Alpha Ver- doi:10.7554/eLife.52614. sion. OSF Preprints (2019). doi:10.31219/osf.io/5q47h. [45] M. Klein and A. Kyrios. VIAFbot and the integration of li- [63] B. Xu, M. U. Kraemer, and Data Curation Group. Open access brary data on Wikipedia. Code4lib journal, 22 (2013). epidemiological data from the COVID-19 outbreak. The Lan- [46] S. Allison-Cassin and D. Scott. Wikidata: a platform for your cet Infectious Diseases, 20.5 (2020), pp. 534. library‟s linked open data. Code4Lib Journal, 40 (2018). doi:10.1016/S1473-3099(20)30119-5. [47] N. Vincent, I. Johnson, and B. Hecht. Examining Wikipedia [64] M. Y. Jaradeh, A. Oelen, K. E. Farfar, M. Prinz, J. D'Souza, G. with a broader lens: Quantifying the value of Wikipedia's rela- Kismihók, et al. Open research knowledge graph: next genera- tionships with other large-scale online communities, in: Pro- tion infrastructure for semantic scholarly knowledge, in: Pro- ceedings of the 2018 CHI Conference on Human Factors in ceedings of the 10th International Conference on Knowledge Computing Systems, ACM, 2018, pp. 1-13. Capture, 2019, pp. 243-246. doi:10.1145/3360901.3364435. doi:10.1145/3173574.3174140. [65] L. Bellomarini, M. Benedetti, A. Gentili, R. Laurendi, D. [48] L. Ciechanowski, D. Jemielniak, and P. A. Gloor. TUTORI- Magnanimi, A. Muci, and E. Sallinger. COVID-19 and Com- AL: AI research without coding: The art of fighting without pany Knowledge Graphs: Assessing Golden Powers and Eco- fighting: Data science for qualitative researchers. Journal of nomic Impact of Selective Lockdown via AI Reasoning. arXiv Business Research, 117 (2020), pp. 322-330. preprint arXiv:2004.10119 (2020). doi:10.1016/j.jbusres.2020.06.012. [66] D. Dimitrov, E. Baran, P. Fafalios, R. Yu, X. Zhu, M. Zloch, [49] C. Ordun, S. Purushotham, and E. Raff. Exploratory analysis and S. Dietze. TweetsCOV19--A Knowledge Base of Seman- of covid-19 tweets using topic modeling, umap, and digraphs. tically Annotated Tweets about the COVID-19 Pandemic. ar- arXiv preprint arXiv:2005.03082 (2020). Xiv preprint arXiv:2006.14492 (2020). [50] J. W. Huss III, C. Orozco, J. Goodale, C. Wu, S. Batalov, T. J. [67] E. Guidotti and D. Ardia. COVID-19 data hub. Journal of Vickers, et al. A gene wiki for community annotation of gene Open Source Software, 5.51 (2020), p. 2376. function. PLoS Biol, 6.7 (2008), p. e175. doi:10.21105/joss.02376. [51] R. Page. Towards a biodiversity knowledge graph. Research [68] D. Kagan, J. Moran-Gilad, and M. Fire. Scientometric trends Ideas and Outcomes, 2 (2016), p. e8767. for coronaviruses and other emerging viral infections. GigaS- doi:10.3897/rio.2.e8767. cience, 9.8 (2020), p. giaa085. [52] Addshore, D. Mietchen, and E. Willighagen. Wikidata Que- doi:10.1093/gigascience/giaa085. ries around the SARS-CoV-2 virus and pandemic. Zenodo [69] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, (2020). doi:10.5281/zenodo.3977414. P. N. Mendes, et al. DBpedia–a large-scale, multilingual [53] M. Manske, U. Böhme, C. Püthe, and M. Berriman. GeneDB knowledge base extracted from Wikipedia. Semantic Web, 6.2 and Wikidata. Wellcome open research, 4 (2019), p. 114. (2015), pp. 167-195. doi:10.3233/SW-140134. doi:10.12688/wellcomeopenres.15355.2. [70] S. Juul, N. Nielsen, P. Bentzer, A. A. Veroniki, L. Thabane, A. [54] L. Chaari and O. Golubnitschaja. Covid-19 pandemic by the Linder, et al. Interventions for treatment of COVID-19: a pro- “real-time” monitoring: the Tunisian case and lessons for tocol for a living systematic review with network meta- global epidemics in the context of 3PM strategies. EPMA analysis including individual patient data (The LIVING journal, 11.2 (2020), pp. 133-138. doi:10.1007/s13167-020- Project). Systematic Reviews, 9 (2020), p. 108. 00207-0. doi:10.1186/s13643-020-01371-0. [55] Z. Zhang, W. Yao, Y. Wang, C. Long and X. Fu. Wuhan and [71] E. Iglesias, S. Jozashoori, D. Chaves-Fraga, D. Collarana, and Hubei COVID-19 mortality analysis reveals the critical role of M. E. Vidal. SDM-RDFizer: An RML Interpreter for the Effi- timely supply of medical resources. The Journal of infection, cient Creation of RDF Knowledge Graphs. arXiv preprint ar- 81.1 (2020), p. 147. doi:10.1016/j.jinf.2020.03.018. Xiv:2008.07176 (2020). [56] B. C. Keegan and J. R. Brubaker. 'Is' to 'Was' Coordination [72] P. Hitzler and K. Janowicz. Linked Data, Big Data, and the and Commemoration in Posthumous Activity on Wikipedia 4th Paradigm. Semantic Web, 4.3 (2013), pp. 233-235. Biographies, in: Proceedings of the 18th ACM Conference on doi:10.3233/SW-130117. Computer Supported Cooperative Work & Social Computing, [73] H. Sebei, M. A. Hadj Taieb, and M. Ben Aouicha. Review of ACM, 2015, pp. 533-546. doi:10.1145/2675133.2675238. social media analytics process and big data pipeline. Social [57] B. C. Keegan and C. Tan. A Quantitative Portrait of Wikipe- Network Analysis and Mining, 8.1 (2018), p. 30. dia's High-Tempo Collaborations during the 2020 Coronavirus doi:10.1007/s13278-018-0507-0. Pandemic. arXiv preprint arXiv:2006.08899 (2020). [74] A. S. Dadzie, and M. Rowe. Approaches to visualising linked [58] H. Turki, M. A. Hadj Taieb, and M. Ben Aouicha. MeSH data: A survey. Semantic Web, 2.2 (2011), pp. 89-124. qualifiers, publication types and relation occurrence frequency doi:10.3233/SW-2011-0037. are also useful for a better sentence-level extraction of bio- [75] G. Hagedorn, D. Mietchen, R. A. Morris, D. Agosti, L. Penev, medical relations. Journal of biomedical informatics, 83 W. G. Berendsohn, and D. Hobern. Creative Commons li- (2018), pp. 217-218. doi:10.1016/j.jbi.2018.05.011. censes and the non-commercial condition: Implications for the [59] X. Hu, R. Rousseau, and J. Chen. On the definition of forward re-use of biodiversity information. ZooKeys, 150 (2011), p. and backward citation generations. Journal of Informetrics, 127. doi:10.3897/zookeys.150.2189. 5.1 (2011), pp. 27-36. doi:10.1016/j.joi.2010.07.004. [76] L. Penev, D. Mietchen, V. Chavan, G. Hagedorn, V. Smith, D. [60] A. Boccone and R. Rivelli. The bibliographic metadata in Shotton, et al. Strategies and guidelines for scholarly publish- Wikidata: Wikicite and the «Bibliothecae.it» case study. Bib- ing of biodiversity data. Research Ideas and Outcomes, 3 liothecae.it, 8.1 (2019), pp. 227-248. doi:10.6092/issn.2283- (2017), p. e12431. doi:10.3897/rio.3.e12431. 9364/9503. [77] D. M. Lublin. Universal RBCs. Transfusion, 40.11 (2000), pp. 1285-1289. doi:10.1046/j.1537-2995.2000.40111285.x. [78] F. Darari, S. Razniewski, R. E. Prasojo, and W. Nutt. Enabling fine-grained RDF data completeness assessment, in: Interna- tional Conference on Web Engineering, Springer, Cham, 2016, pp. 170-187. doi:10.1007/978-3-319-38791-8_10. [79] V. Balaraman, S. Razniewski, and W. Nutt. Recoin: relative completeness in Wikidata, in: Companion Proceedings of the The Web Conference 2018, ACM, 2018, pp. 1787-1792. doi:10.1145/3184558.3191641. [80] Y. Zhang, H. Lin, Z. Yang, J. Wang, S. Zhang, Y. Sun, and L. Yang. A hybrid model based on neural networks for biomedi- cal relation extraction. Journal of biomedical informatics, 81 (2018), 83-92. doi:10.1016/j.jbi.2018.03.011. [81] M. Rotmensch, Y. Halpern, A. Tlimat, S. Horng, and D. Son- tag. Learning a health knowledge graph from electronic medi- cal records. Scientific reports, 7.1 (2017), 1-11. doi:10.1038/s41598-017-05778-z. [82] L. M. Schriml, M. Chuvochina, N. Davies, E. A. Eloe-Fadrosh, R. D. Finn, P. Hugenholtz, et al. COVID-19 pandemic reveals the peril of ignoring metadata standards. Scientific data, 7.1 (2020), 188. doi:10.1038/s41597-020-0524-5. [83] M. Farda-Sarbas, H. Zhu, M. F. Nest, and C. Müller-Birn. Approving automation: analyzing requests for permissions of bots in Wikidata, in: Proceedings of the 15th International Symposium on Open Collaboration, 2019, pp. 1-10. doi:10.1145/3306446.3340833. [84] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality assessment for linked data: A survey. Semantic Web, 7.1 (2016), pp. 63-93. doi:10.3233/SW-150175. [85] H. Paulheim. Knowledge graph refinement: A survey of ap- proaches and evaluation methods. Semantic Web, 8.3 (2017), pp. 489-508. doi:10.3233/SW-160218. [86] H. Turki, M. A. Hadj Taieb, T. Shafee, T. Lubiana, D. Jemiel- niak, M. Ben Aouicha, J. E. Labra Gayo, M. Banat, D. Das, and D. Mietchen. csisc/WikidataCOVID19SPARQL: Data about Wikidata coverage of COVID-19. Zenodo (2020). doi:10.5281/zenodo.4022591. [87] M. Aturban, M. Kelly, S. Alam, J. A. Berlin, M. L. Nelson, and M. C. Weigle. ArchiveNow: Simplified, Extensible, Mul- ti-Archive Preservation, in: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, ACM, 2018, pp. 321-322. doi:10.1145/3197026.3203880.

Country (SPEED) Z30 Daily evolution of the number of COVID-19 Recoveries by Country (SPEED) 10. Supplementary Data Z31 Daily evolution of the COVID-19 Recovery Rate by Coun- try (SPEED) 10.1. Supplementary tables Z32 Daily evolution of the number of COVID-19 Cases in a given country (SPEED, SARS-CoV-2-Queries) Z33 Daily evolution of the number of COVID-19 Deaths in a Table S1 given country (SPEED, SARS-CoV-2-Queries) List of sample queries on COVID-19 overview, international situa- Z34 Daily evolution of the number of COVID-19 Clinical Tests tion, international daily epidemiological evolution, Tunisian daily in a given country (SPEED) epidemiological evolution, Tunisian governorate-level situation, Z35 Daily evolution of the number of COVID-19 Recoveries in Tunisian correlations, and Worldwide correlations a given country (SPEED) Z36 Daily evolution of the COVID-19 Mortality Rate in a given Table available as Query/COVID-19.xlsx in country (SPEED) http://doi.org/10.5281/zenodo.4022591. Z37 Daily evolution of the COVID-19 Positive Clinical Test Rate in a given country (SPEED) Table S2 Z38 Daily evolution of the COVID-19 Recovery Rate in a given country (SPEED) List of the tasks fulfilled by the SPARQL queries for the visualiza- Z39 Daily evolution of the number of COVID-19 Cases by tion of the COVID-19 information in Wikidata administrative subdivision of a given country (SPEED) Task Description Z40 Daily evolution of the number of COVID-19 Deaths by Genomic data and clinical knowledge administrative subdivision of a given country (SPEED) Z1 Symptoms of COVID-19 (SPEED, SARS-CoV-2-Queries) Z41 Daily evolution of the COVID-19 Mortality Rate by ad- Z2 Potential treatments of COVID-19 (SPEED) ministrative subdivision of a given country (SPEED) Z3 Linnean Taxonomy of SARS-CoV-2 (SPEED) Z42 Daily evolution of the number of COVID-19 New Cases Z4 All SARSr viruses (SARS-CoV-2-Queries) (SPEED) Z5 Coronaviruses that infect humans (SARS-CoV-2-Queries) Z43 Daily evolution of the number of COVID-19 New Deaths Z6 All betacoronaviruses (SARS-CoV-2-Queries, WPCOVID) (SPEED) Z7 All coronaviruses (SARS-CoV-2-Queries) Z44 Daily evolution of the number of COVID-19 New Clinical Z8 Comparing viruses with SARS-CoV-2 (SARS-CoV-2- Tests (SPEED) Queries) Z45 Daily evolution of the number of COVID-19 New Recove- Z9 NCBI Taxonomy IDs of coronaviruses (SARS-CoV-2- ries (SPEED) Queries) Z46 Daily evolution of the number of COVID-19 Active Cases Z10 SARS-CoV-2 genomes (SARS-CoV-2-Queries) (SPEED) Z11 SARS-CoV-2 genes (SARS-CoV-2-Queries) Z47 Daily evolution of the number of COVID-19 Clinical Tests Z12 SARS-CoV-2 proteins (SARS-CoV-2-Queries) by Laboratory in a given country (SPEED) Z13 SARS-CoV-2 protein complexes (SARS-CoV-2-Queries) Z48 Number of COVID-19 Cases by administrative subdivision Z14 SARSr genes (SARS-CoV-2-Queries) of a given country (SPEED) Z15 SARSr proteins (SARS-CoV-2-Queries) Z49 Number of COVID-19 Deaths by administrative subdivi- Z16 Human coronavirus‟ genes (SARS-CoV-2-Queries) sion of a given country (SPEED) Z17 Human coronavirus‟ proteins (SARS-CoV-2-Queries) Z50 COVID-19 Mortality Rate by administrative subdivision of Z18 Coronavirus‟ proteins interacting with human proteins a given country (SPEED) (SARS-CoV-2-Queries) Z51 Number of COVID-19 Cases per Capita by administrative Z19 Biological process for the pathogenesis of coronaviruses subdivision of a given country (SPEED) (SARS-CoV-2-Queries) Z52 Number of COVID-19 Deaths per Capita by administrative Z20 Antibodies for the coronaviruses (SARS-CoV-2-Queries) subdivision of a given country (SPEED) Z21 Vaccines for the coronaviruses (SARS-CoV-2-Queries) Z53 Number of COVID-19 Cases per Area by administrative Z22 Drugs for the coronaviruses (SARS-CoV-2-Queries) subdivision of a given country (SPEED) Z23 COVID-19, COVID-19 pandemic and SARS-CoV-2 in the Z54 Number of COVID-19 Deaths per Area by administrative context of the Wikidata knowledge graph (Scholia) subdivision of a given country (SPEED) Epidemiology Z55 Current Epidemiological Status in a given country Z24 Daily evolution of the global number of COVID-19 cases (SPEED) (SARS-CoV-2-Queries, WPCOVID, COVID-19 Sum- Z56 Number of COVID-19 Clinical Tests by Laboratory in a mary) given country (SPEED) Z25 Daily evolution of the number of COVID-19 Cases by Z57 Map of Affected Countries (SPEED, WPCOVID) Country (SPEED) Z58 Number of COVID-19 Cases by Country (SPEED, Z26 Daily evolution of the number of COVID-19 Deaths by WPCOVID) Country (SPEED) Z59 Number of COVID-19 Cases per 100000 inhabitants by Z27 Daily evolution of the COVID-19 Mortality Rate by Coun- Country (SPEED) try (SPEED) Z60 Number of COVID-19 Deaths by Country (SPEED) Z28 Daily evolution of the number of COVID-19 Clinical Tests Z61 Number of COVID-19 Deaths per 100000 inhabitants by by Country (SPEED) Country (SPEED) Z29 Daily evolution of the COVID-19 Positive Test Rate by Z62 COVID-19 Mortality rates by Country (SPEED) Z63 Number of COVID-19 Clinical Tests by Country (SPEED) Country Population (SPEED) Z64 Number of COVID-19 Clinical Tests per 100000 inhabi- Z95 Country-level number of COVID-19 Deaths in function of tants by Country (SPEED) Country Area (SPEED) Z65 Number of COVID-19 Recoveries by Country (SPEED) Z96 Country-level number of COVID-19 Deaths in function of Z66 Number of COVID-19 Recoveries per 100000 inhabitants Country Density Rate (SPEED) by Country (SPEED) Z97 Country-level COVID-19 Mortality Rate in function of Z67 Famous COVID-19 Victims (SPEED, WPCOVID, CO- Country Population (SPEED) VID-19 Summary) Z98 Country-level COVID-19 Mortality Rate in function of Z68 Age distribution of Famous COVID-19 Victims (COVID- Country Area (SPEED) 19 Summary) Z99 Country-level COVID-19 Mortality Rate in function of Z69 Field of work of Famous COVID-19 Victims (COVID-19 Country Population Density Rate (SPEED) Summary) Z100 Duration between first case and first death based on num- Z70 Place of birth of Famous COVID-19 Victims (COVID-19 ber of cases and number of deaths in a given country Summary) (SARS-CoV-2-Queries) Z71 Number of COVID-19 Cases per area by Country (SPEED, Z101 Lockdowns due to the COVID-19 pandemic (WPCOVID) COVID-19 Summary) Research outputs and computer applications Z72 Number of COVID-19 Deaths per area by Country Z102 Scholarly publications about COVID-19 pandemic and (SPEED) SARS-CoV-2 (SPEED, SARS-CoV-2-Queries, WPCO- Z73 Number of COVID-19 Clinical Tests per area by Country VID, Scholia) (SPEED) Z103 Tools and Resources about COVID-19 pandemic by type Z74 Number of COVID-19 Recoveries per area by Country (SPEED) (SPEED) Z104 Tools and Resources about COVID-19 pandemic (SPEED) Z75 Number of COVID-19 Cases in function of the number of Z105 Tools and Resources about COVID-19 pandemic by pub- clinical tests in a given country (SPEED) lisher (SPEED) Z76 Number of COVID-19 Deaths in function of the number of Z106 Tools and Resources about COVID-19 pandemic by li- cases in a given country (SPEED) cense (SPEED) Z77 COVID-19 Mortality Rate in function of the number of Z107 Tools and Resources about COVID-19 pandemic by field cases in a given country (SPEED) of work (SPEED) Z78 Number of COVID-19 cases in an administrative subdivi- Z108 Clinical trials about COVID-19 pandemic (SARS-CoV-2- sion of a given country in function of population (SPEED) Queries) Z79 Number of COVID-19 cases in an administrative subdivi- Z109 Scholarly publications about the virus transmission of sion of a given country in function of area (SPEED) coronaviruses (SARS-CoV-2-Queries) Z80 Number of COVID-19 cases in an administrative subdivi- Z110 Scholarly publications about the SARS-CoV-2 genes sion of a given country in function of population Density (SARS-CoV-2-Queries) Rate (SPEED) Z111 Scholarly publications about the SARS-CoV-2 proteins Z81 Number of COVID-19 deaths in an administrative subdivi- (SARS-CoV-2-Queries) sion of a given country in function of population (SPEED) Z112 Scholarly publications about coronaviruses (SARS-CoV-2- Z82 Number of COVID-19 deaths in an administrative subdivi- Queries) sion of a given country in function of area (SPEED) Z113 Scholarly publications about human coronaviruses (SARS- Z83 Number of COVID-19 deaths in an administrative subdivi- CoV-2-Queries) sion of a given country in function of population Density Z114 Contact tracing protocols related to the COVID-19 pan- Rate (SPEED) demic (WPCOVID) Z84 COVID-19 Mortality Rate in an administrative subdivision Z115 Scholarly publications about COVID-19 pandemic and of a given country in function of population (SPEED) SARS-CoV-2 by year (Scholia) Z85 COVID-19 Mortality Rate in an administrative subdivision Z116 Research scientists mostly publishing scholarly publica- of a given country in function of area (SPEED) tions about COVID-19 pandemic and SARS-CoV-2 (Scho- Z86 COVID-19 Mortality Rate in an administrative subdivision lia) of a given country in function of population Density Rate Z117 Collaboration network of the research scientists working on (SPEED) COVID-19 pandemic and SARS-CoV-2 (Scholia) Z87 Number of COVID-19 new cases in a given country in Z118 Topics of the scholarly publications about COVID-19 function of number of old cases (SPEED) pandemic and SARS-CoV-2 (Scholia) Z88 Global number of COVID-19 Cases in function of the Z119 Co-occurring topic graph of the scholarly publications global number of clinical tests (SPEED) about COVID-19 pandemic and SARS-CoV-2 (Scholia) Z89 Global number of COVID-19 Deaths in function of the Z120 Map of cities and countries evocated by the scholarly pub- global number of cases (SPEED) lications about COVID-19 pandemic and SARS-CoV-2 Z90 COVID-19 Global Mortality Rate in function of the global (Scholia) number of cases (SPEED) Z121 Research scientists mostly cited by the scholarly publica- Z91 Country-level number of COVID-19 Cases in function of tions about COVID-19 pandemic and SARS-CoV-2 (Scho- Country Population (SPEED) lia) Z92 Country-level number of COVID-19 Cases in function of Z122 Venues and series mostly publishing research works about Country Area (SPEED) the COVID-19 pandemic and SARS-CoV-2 (Scholia) Z93 Country-level number of COVID-19 Cases in function of Z123 Most cited research publications about COVID-19 pandem- Country Population Density Rate (SPEED) ic and SARS-CoV-2 (Scholia) Z94 Country-level number of COVID-19 Deaths in function of Z124 Map of institutions publishing research works about CO- VID-19 pandemic and SARS-CoV-2 (Scholia) Z125 Citation network of research countries working on COVID- 19 pandemic and SARS-CoV-2 (Scholia) Z126 Awards received by authors who published on COVID-19 pandemic and SARS-CoV-2 (Scholia) Z127 Scholarly publications about COVID-19 and SARS-CoV-2 with missing main subject [P921] values (SARS-CoV-2- Queries, WPCOVID) Other Z128 Images from Wikimedia Commons about COVID-19 pan- demic and SARS-CoV-2 (SPEED) Z129 COVID-19 Factbook (SPEED) Z130 Bankrupt businesses due to the COVID-19 pandemic (WPCOVID) Z131 Properties used to model COVID-19 knowledge in Wikida- ta (WPCOVID)

10.2. Supplementary figures

This section of the supplementary data includes additional array of visualizations that were not able to fit in the main text but that exemplify the diversity of additional valuable information that can be ex- tracted out of the Wikidata knowledge base.

Fig. S2. Epidemiological data for Tunisia. The SPEED website was set up as a COVID-19 data dashboard for Tunisia (Available at: https://w.wiki/cQC). A) Daily mortality rate from COVID-19 in Tunisia (live data: https://w.wiki/N2p). B) Tunisian governorate- level cases (live data: https://w.wiki/N9Y). C) Daily Evolution of Clinical tests by laboratory in Tunisia (live data: https://w.wiki/NEb).

Fig. S1. Snapshot of the extended graph of the three main COVID items and the statements for which they are the subject. Linked items demonstrate the variety of topics for which the three main COVID items (indicated in red) are the subject and present a small subset of the classes indicated in Fig. 2. (Available at: https://w.wiki/cPa, live data: https://tinyurl.com/y2ddm3n4, Access Date: August 19, 2020)

Figure S4. Partial citation network within Wikidata (Available at: https://w.wiki/cQV). The citation network around COVID-19 is currently rather incomplete but part of the larger, ongoing Wiki- Cite project to represent all citation data within Wikidata as a fully open citation network. A) publications cited from C3 papers (live data: https://w.wiki/b$h) B) authors most frequently cited by C3 papers (live data: https://w.wiki/b$i).

Figure S3. People listed in Wikidata deceased due to COVID-19 (Available at: https://w.wiki/cQK). A) As tabular output, ranked by date of death (live data: https://w.wiki/Mgv). B) Portrait images available under a CC BY-compatible license, ranked by how well- described the depicted individuals are in Wikidata (number of identifiers + statements + sitelinks) (live data: https://w.wiki/bzJ). C) as bubble diagram of professions (live data: https://w.wiki/bTz).

Figure S5. Most common publication venues for C3-themed pa- pers (published and preprint). Even with Wikidata‟s currently incomplete coverage of articles hosted on preprint servers, they are clearly a significant location for COVID-related publications (Available at: https://w.wiki/cQX, live data: https://w.wiki/bd$).

Fig. S6. Information regarding clinical trials on interventions to treat COVID-19 (Available at https://w.wiki/cQb, live data: https://w.wiki/bav)

Fig. S9. Online resource locations for information on COVID-19 regional outbreaks (Available at: https://w.wiki/cQo). A) Official websites (live data: https://w.wiki/bdt). B) Main hashtags (live data: https://w.wiki/bds)

Fig. S7. Computer applications and their types (Available at: https://w.wiki/cQg, live data: https://w.wiki/NVp)

Fig. S10. COVID-related images based on structured data (Availa- ble at: https://w.wiki/cQt). Images in wikimedia commons used to Fig. S8. Information on authors of articles on COVID-related be organised solely by a hierarchical category structure. Since topics (Available at: https://w.wiki/cQh). A) Awards most fre- 2019, structured data can be associated with images via Wikidata quently received by authors of C3 papers (live data: statements. A) Images from Wikimedia Commons about COVID- https://w.wiki/ban), B) Map of organizations associated with 19 pandemic and SARS-CoV-2 with a CC-BY-compatible license works about C3 with institutions that have published a single paper (live data: https://w.wiki/Zsn). B) Images of face masks used dur- on the topic in green, those that have published 1-10 in orange, and ing COVID-19 pandemic with a CC-BY-compatible license (live those having published >10 in blue (live data: https://w.wiki/cG4). data: https://w.wiki/bzG).