Wikidata As a Knowledge Graph for the Life Sciences
Total Page:16
File Type:pdf, Size:1020Kb
Washington University School of Medicine Digital Commons@Becker Open Access Publications 2020 Wikidata as a knowledge graph for the life sciences Andra Waagmeester Malachi Griffith Obi L. Griffith et al. Follow this and additional works at: https://digitalcommons.wustl.edu/open_access_pubs FEATURE ARTICLE SCIENCE FORUM Wikidata as a knowledge graph for the life sciences Abstract Wikidata is a community-maintained knowledge base that has been assembled from repositories in the fields of genomics, proteomics, genetic variants, pathways, chemical compounds, and diseases, and that adheres to the FAIR principles of findability, accessibility, interoperability and reusability. Here we describe the breadth and depth of the biomedical knowledge contained within Wikidata, and discuss the open-source tools we have built to add information to Wikidata and to synchronize it with source databases. We also demonstrate several use cases for Wikidata, including the crowdsourced curation of biomedical ontologies, phenotype-based diagnosis of disease, and drug repurposing. ANDRA WAAGMEESTER†, GREGORY STUPP†, SEBASTIAN BURGSTALLER- MUEHLBACHER, BENJAMIN M GOOD, MALACHI GRIFFITH, OBI L GRIFFITH, KRISTINA HANSPERS, HENNING HERMJAKOB, TOBY S HUDSON, KEVIN HYBISKE, SARAH M KEATING, MAGNUS MANSKE, MICHAEL MAYERS, DANIEL MIETCHEN, ELVIRA MITRAKA, ALEXANDER R PICO, TIMOTHY PUTMAN, ANDERS RIUTTA, NURIA QUERALT-ROSINACH, LYNN M SCHRIML, THOMAS SHAFEE, DENISE SLENTER, RALF STEPHAN, KATHERINE THORNTON, GINGER TSUENG, ROGER TU, SABAH UL-HASAN, EGON WILLIGHAGEN, CHUNLEI WU AND ANDREW I SU* Introduction advance efforts by the open-data community to Integrating data and knowledge is a formidable build a rich and heterogeneous network of scien- challenge in biomedical research. Although new tific knowledge. That knowledge network could, scientific findings are being discovered at a in turn, be the foundation for many computa- *For correspondence: asu@ rapid pace, a large proportion of that knowl- tional tools, applications and analyses. scripps.edu edge is either locked in data silos (where inte- Most data- and knowledge-integration initia- † These authors contributed gration is hindered by differing nomenclature, tives fall on either end of a spectrum. At one equally to this work data models, and licensing terms; end, centralized efforts seek to bring multiple Competing interests: The Wilkinson et al., 2016) or locked away in free- knowledge sources into a single database (see, authors declare that no text. The lack of an integrated and structured for example, Mungall et al., 2017): this competing interests exist. version of biomedical knowledge hinders effi- approach has the advantage of data alignment Funding: See page 12 cient querying or mining of that information, according to a common data model and of thus preventing the full utilization of our accu- enabling high performance queries. However, Reviewing editor: Peter mulated scientific knowledge. centralized resources are difficult and expensive Rodgers, eLife, United Kingdom Recently, there has been a growing emphasis to maintain and expand (Chandras et al., 2009; Copyright Waagmeester et al. within the scientific community to ensure all sci- Gabella et al., 2018), at least in part because of This article is distributed under entific data are FAIR – Findable, Accessible, bottlenecks that are inherent in a centralized the terms of the Creative Interoperable, and Reusable – and there is a design. Commons Attribution License, which permits unrestricted use growing consensus around a concrete set of At the other end of the spectrum, distributed and redistribution provided that principles to ensure FAIRness (Wilkinson et al., approaches to data integration result in a broad the original author and source are 2019; Wilkinson et al., 2016). Widespread landscape of individual resources, focusing on credited. implementation of these principles would greatly technical infrastructure to query and integrate Waagmeester et al. eLife 2020;9:e52614. DOI: https://doi.org/10.7554/eLife.52614 1 of 15 Feature Article Science Forum Wikidata as a knowledge graph for the life sciences across them for each query. These approaches In previous work, we seeded Wikidata with lower the barriers to adding new data by content from public and authoritative sources of enabling anyone to publish data by following structured knowledge on genes and proteins community standards. However, performance is (Burgstaller-Muehlbacher et al., 2016) and often an issue when each query must be sent to chemical compounds (Willighagen et al., 2018). many individual databases, and the performance Here, we describe progress on expanding and of the system as a whole is highly dependent on enriching the biomedical knowledge graph the stability and performance of each individual within Wikidata, both by our team and by others component. In addition, data integration in the community (Turki et al., 2019). We also requires harmonizing the differences in the data describe several representative biomedical use cases on how Wikidata can enable new analyses models and data formats between resources, a and improve the efficiency of research. Finally, process that can often require significant skill we discuss how researchers can contribute to and effort. Moreover, harmonizing differences in this effort to build a continuously-updated and data licensing can sometimes be impossible. community-maintained knowledge graph that Here we explore the use of Wikidata (www. epitomizes the FAIR principles. wikidata.org; Vrandecˇic´, 2012; Mora- Cantallops et al., 2019) as a platform for knowl- edge integration in the life sciences. Wikidata is The Wikidata Biomedical an openly-accessible knowledge base that is Knowledge Graph editable by anyone. Like its sister project Wiki- The original effort behind this work focused on pedia, the scope of Wikidata is nearly boundless, creating and annotating Wikidata items for with items on topics as diverse as books, actors, human and mouse genes and proteins (Burgstal- historical events, and galaxies. Unlike Wikipedia, ler-Muehlbacher et al., 2016), and was subse- Wikidata focuses on representing knowledge in quently expanded to include microbial reference a structured format instead of primarily free genomes from NCBI RefSeq (Putman et al., text. As of September 2019, Wikidata’s knowl- 2017). Since then, the Wikidata community edge graph included over 750 million state- (including our team) has significantly expanded ments on 61 million items (tools.wmflabs.org/ the depth and breadth of biological information wikidata-todo/stats.php). Wikidata was also the within Wikidata, resulting in a rich, heteroge- first project run by the Wikimedia Foundation neous knowledge graph (Figure 1). Some of the (which also runs Wikipedia) to have surpassed key new data types and resources are described one billion edits, achieved by a community of below. 12,000 active users, including 100 active compu- Genes and proteins: Wikidata contains items for over 1.1 million genes and 940 thousand pro- tational ‘bots’ (Figure 1—figure supplement 1). teins from 201 unique taxa. Annotation data on As a knowledge integration platform, Wiki- genes and proteins come from several key data- data combines several of the key strengths of bases including NCBI Gene (Agarwala et al., the centralized and distributed approaches. A 2018), Ensembl (Zerbino et al., 2018), UniProt large portion of the Wikidata knowledge graph (UniProt Consortium, 2019), InterPro is based on the automated imports of large (Mitchell et al., 2019), and the Protein Data structured databases via Wikidata bots, thereby Bank (Burley et al., 2019). These annotations breaking down the walls of existing data silos. include information on protein families, gene Since Wikidata is also based on a community- functions, protein domains, genomic location, editing model, it harnesses the distributed and orthologs, as well as links to related com- efforts of a worldwide community of contribu- pounds, diseases, and variants. tors, including both domain experts and bot Genetic variants: Annotations on genetic var- developers. Anyone is empowered to add new iants are primarily drawn from CIViC (http://www. statements, ranging from individual facts to civicdb.org), an open and community-curated large-scale data imports. Finally, all knowledge database of cancer variants (Griffith et al., 2017). in Wikidata is queryable through a SPARQL Variants are annotated with their relevance to dis- query interface (query.wikidata.org/), which also ease predisposition, diagnosis, prognosis, and enables distributed queries across other Linked drug efficacy. Wikidata currently contains 1502 Data resources. items corresponding to human genetic variants, Waagmeester et al. eLife 2020;9:e52614. DOI: https://doi.org/10.7554/eLife.52614 2 of 15 Feature Article Science Forum Wikidata as a knowledge graph for the life sciences found in taxon (1,247) taxon 2,600,217 has active ingredient / Global Biodiversity Information Facility ID: 2,058,609 found in taxon (581,407) therapeutic use active ingredient in (236) pharmaceutical product Encyclopedia of Life ID: 1,354,013 ortholog (3,711,264) 803 2,731 IRMNG ID: 1,214,539 iNaturalist taxon ID: 569,998 MeSH ID: 608 RxNorm CUI: 2,046 ITIS TSN: 533,003 found in taxon (795,773) gene ChEBI ID: 478 European Medicines Agency product number: 1,068 IPNI plant ID: 488,933 MeSH Code: 443 NCBI Taxonomy ID: 471,220 1,176,028 Freebase ID: 422 drugmedical used for condition treatment treated (1,022)