Alliance of Genome Resources Portal: Unified Model Organism
Total Page:16
File Type:pdf, Size:1020Kb
Nucleic Acids Research, 2019 1 doi: 10.1093/nar/gkz813 Alliance of Genome Resources Portal: unified model organism research platform The Alliance of Genome Resources Consortium*,† Downloaded from https://academic.oup.com/nar/advance-article-abstract/doi/10.1093/nar/gkz813/5573549 by guest on 30 September 2019 Received July 31, 2019; Revised September 03, 2019; Editorial Decision September 11, 2019; Accepted September 19, 2019 ABSTRACT six MODs: the Mouse Genome Database, the Rat Genome Database, the Zebrafish Information Network, FlyBase, The Alliance of Genome Resources (Alliance) is a WormBase and the Saccharomyces Genome Database. consortium of the major model organism databases These MOD resources extensively curate data for the bud- and the Gene Ontology that is guided by the vision of ding yeast (Saccharomyces cerevisiae), worm (Caenorhab- facilitating exploration of related genes in human and ditis elegans), fruit fly (Drosophila melanogaster), zebrafish well-studied model organisms by providing a highly (Danio rerio), mouse (Mus sp.), rat (Rattus norvegicus) as integrated and comprehensive platform that enables well as many closely related taxa. The GOC provides GO researchers to leverage the extensive body of ge- annotations for all taxa. To make the case for long-term netic and genomic studies in these organisms. Initi- sustainability, the Alliance is working to make MOD and ated in 2016, the Alliance is building a central portal GOC infrastructure more efficient by sharing computa- (www.alliancegenome.org) for access to data for the tional pipelines and software development. The Alliance of Genome Resources Portal, reported on primary model organisms along with gene ontology here, results from the development and maintenance of a data and human data. All data types represented in software platform and shared modular infrastructure de- the Alliance portal (e.g. genomic data and phenotype signed for the coordination of data harmonization activi- descriptions) have common data models and work- ties across the model organism databases. The MODs and flows for curation. All data are open and freely avail- GOC remain the focus of organism-specific expert curation able via a variety of mechanisms. Long-term plans for of data and are responsible for quality control and submis- the Alliance project include a focus on coverage of sion of data to the Alliance research platform. additional model organisms including those without dedicated curation communities, and the inclusion CREATING THE INTEGRATED MODEL ORGANISM of new data types with a particular focus on pro- PORTAL viding data and tools for the non-model-organism The Alliance web portal (www.alliancegenome.org)pro- researcher that support enhanced discovery about vides a single point of access to multiple types of genetic and human health and disease. Here we review current genomic data from diverse model organisms used to study progress and present immediate plans for this new the genetic and genomic basis of human biology, health bioinformatics resource. and disease. The portal is designed with a number of key principles in mind: (a) entity report pages as information ‘hubs’, connecting together a variety of linked data types INTRODUCTION AND HISTORY into single views; (b) a powerful top-level search function, Model Organism Databases (MODs) and the associated with results faceted by key data types and (c) standard- Gene Ontology Consortium (GOC), have been highly suc- ized user-interface components with a common look-and- cessful and are crucial to modern biomedical and biological feel across all organisms. The portal also provides access to research (1–7). Increasingly, researchers want information large datasets through Application Programming Interfaces about more than one model organism, especially in the con- (APIs) and data downloads. With each quarterly release of text of human biomedical research, and they would benefit the portal, data from the primary Alliance knowledge cen- greatly from being able to explore data across many model ters (as Alliance MODs and GOC, the curation centers, organisms in a consistent way. To facilitate such cross- are known) are refreshed, and new data types and displays organism studies, the Alliance is harmonizing representa- are added. Version 2.2, released 7/2/2019, includes new ex- tion of concepts and unifying user interface design for the tensive representation of gene expression data, inclusion presentation of data types in common across the GOC and of molecular interaction data, and improved links between *To whom correspondence should be addressed. Tel: +1 207 288 6248; Email: [email protected] †The Alliance of Genome Resources Consortium authors are listed in the appendix. C The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 2 Nucleic Acids Research, 2019 orthology, expression and disease data. Release notes can Function - Go annotations be reviewed here (https://www.alliancegenome.org/release- Using the controlled vocabulary of the GO, the functions of notes). genes have been carefully annotated over many years by in- ternational teams of curators, translating the findings from GENE REPORT PAGES more than 150 000 published articles into simple, structured Species-specific gene report pages provide detailed informa- and reusable pieces of biological knowledge. A given gene tion about a gene and its function (Figure 1). They begin can therefore have tens to hundreds of functions annotated Downloaded from https://academic.oup.com/nar/advance-article-abstract/doi/10.1093/nar/gkz813/5573549 by guest on 30 September 2019 with a summary of the gene and include subsections for with specific GO terms. The GOC provides a graphical tool, genomic information, functional (GO) annotations, orthol- the Ribbon, to summarize those functions and to enable ogy relationships, phenotypic and disease associations, ex- users to drill down to further details as well as to compare pression patterns, alleles and molecular interactions. Many the functions of multiple genes easily (Figure 3A). First in- details represented include hyperlinks to additional infor- troduced by the MGD database (10), the GO Ribbon has mation. Below are descriptions of how elements of this page three sections, reflecting the three branches of the GOon- are generated and refreshed. tology, namely Biological Process, Molecular Function and Cellular Component. The high-level terms selected to sum- Automated gene descriptions marize the functions of genes are carefully designed sub- sets, also called ‘GO slim’, and darker colors indicate more With so much information available for many genes, intelli- data (http://geneontology.org/docs/go-subset-guide/). This gent summation of the data is needed to facilitate discover- paradigm is further detailed in Disease section below. ability and interspecies comparison of gene function. One One of the challenges of the Alliance consortium was way of doing this is to provide a short standardized human to design a GO slim that would represent the specific ex- readable description for every gene, yet manually writing perimental findings in all the model organisms yet be gen- them from an ever-growing literature corpus is a time- and eral enough to enable useful comparisons between these labor-intensive task. To automate this process, we devel- species, and with human genes. The matrix-like represen- oped an algorithm that summarizes gene function based on tation of the GO ribbon is a useful way to represent curated data. Data include gene associations to Gene Ontol- other ontology-based of annotations and thus is also be- ogy (GO) terms, Disease Ontology (DO) terms, human or- ing used to power the Alliance disease and expression rib- tholog data and gene expression data. The problem of gen- bons (Figure 3B, C). The REACT component is freely erating a text summary from multiple ontology terms anno- available as an NPM package (https://www.npmjs.com/ tated to a gene was formulated as an optimization problem. package/@geneontology/ribbon) and a more generic rib- The solution to this problem resulted in a short, template- bon that can easily be integrated in any website (indepen- based paragraph of text containing an optimal combina- dent of the framework used) is in progress. Community tion of ontology terms based on their information content, feedback and requests are also welcome and can be sub- limited to a predefined maximum text length. The terms in mitted at the public GitHub repository (https://github.com/ the final description can be either the original annotated geneontology/ribbon). term/s or a parent term which groups several of the an- notated terms. An example is the automated description Orthology provided for mouse gene Bmp4 which has 291 GO anno- tations alone (https://www.alliancegenome.org/gene/MGI: The representation of orthology is essential to comparative 88180)(Figure2). This gene descriptions project has re- genomic studies and to interpretation and inferential con- sulted in over 100 000 modular and standardized gene de- clusions of model organism studies. One of the key initial scriptions for purposes of portal display, and can be down- goals of the Alliance is to generate, display and provide