An Overview of Synergistic Data Tools for Biological Scrutiny
Review

DOI: 10.1002/ijch.201200094

Tsviya Olender,*[a] Marilyn Safran,[a] Ron Edgar,[b] Gil Stelzer,[a] Noam Nativ,[a] Naomi Rosen,[a] Ronit Shtrichman,[b] Yaron Mazor,[b] Michael D. West,[b] Ifat Keydar,[a] Noa Rappaport,[a] Frida Belinky,[a] David Warshawsky,[b] and Doron Lancet[a]

[a] T. Olender, M. Safran, G. Stelzer, N. Nativ, N. Rosen, I. Keydar, N. Rappaport, F. Belinky, D. Lancet
Department of Molecular Genetics
Weizmann Institute of Science
Rehovot 76100 (Israel)
e-mail: [email protected]

[b] R. Edgar, R. Shtrichman, Y. Mazor, M. D. West, D. Warshawsky
LifeMap Sciences
HaNehoshet 10, Tel Aviv 69710 (Israel)

Isr. J. Chem. 2013, 53, 185–198. © 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

Abstract: A network of biological databases is reviewed, supplying a framework for studies of human genes and the association of their genomic variations with human phenotypes. The network is composed of GeneCards, the human gene compendium, which provides comprehensive information on all known and predicted human genes, along with its suite members GeneDecks and GeneLoc. Two databases are shown that address genes and variations focusing on olfactory reception (HORDE) and transduction (GOSdb). In the realm of disease scrutiny, we portray MalaCards, a novel comprehensive database of human diseases and their annotations. Also shown is GeneKid, a tool aimed at generating novel kidney disease biomarkers using systems biology, as well as Xome, a database for whole-exome next-generation DNA sequences for human diseases in the Israeli population. Finally, we show LifeMap Discovery, a database of embryonic development, stem cell research and regenerative medicine, which links to both GeneCards and MalaCards.

Keywords: bioinformatics · databases · gene sequencing · genetic disease · olfaction

Dr. Tsviya Olender has a PhD in physical chemistry from the Weizmann Institute. In 1999, she joined the group of Prof. Lancet as a postdoc and continued on as a research associate. Her research focuses on the genetics of human olfaction and the genetics of rare diseases using computational tools. Together with Prof. Lancet, she has played a key role in discovering olfactory receptor genes from humans and other mammals, and developed the human OR data explorer (HORDE) database. She is also involved in developing tools for deciphering disease-causing mutations, including the Xome database described here.

Marilyn Safran is the head of development for the multi-disciplinary GeneCards/MalaCards team in the Lancet lab at the Weizmann Institute of Science in Israel. Prior to joining in 1997, she did software development and management at Ubique Inc., the CS department at Weizmann, Bell Laboratories, MIT's Lincoln Laboratory, and the Memorial Sloan Kettering Cancer Center, in the fields of bioinformatics, web applications, compilers, databases, and medicine. Marilyn received a BA in Math with Honors from Queens College of CUNY in 1974, and an MS in Computer Science from Boston University in 1978.

Prof. Lancet, head of the Weizmann Institute's Crown Genome Center, has a PhD from Weizmann and postdoctoral training from Harvard and Yale. He pioneered research on the biochemistry, genetics and evolution of olfaction, and investigates human rare disease genes. Lancet and his team developed GeneCards, a world-renowned web compendium of human genes, and more recently initiated a companion database MalaCards, a comprehensive web tool for human diseases. Lancet was awarded the Takasago and Wright Awards in the USA and the Landau Prize in Israel. He is a member of EMBO and (until 2012) of the HUGO Council.

1 Introduction

A main focus of our research is the use of genomics and bioinformatics tools to obtain novel biological insights. These approaches involve the application of computational strategies to biological information management and analysis. We develop data compendia that assist in obtaining a better understanding of how genes and their variations underlie different human phenotypes. We focus on several parallel threads, including human olfaction, Mendelian and complex diseases, embryonic development and stem cell research. In the spirit of systems biology, we help generate genome-wide views to facilitate gene-based inquiry. One of our long-time fortes is a gene-centric view of the human genome universe, as embodied in GeneCards. More recently, we generated in parallel the human disease compendium MalaCards. Both are richly annotated from numerous web resources and provide focused information to external data structures (Figure 1). A special liaison is between the Weizmann data structure and the LifeMap Discovery database, which provides a unique angle on embryonic development and stem cell research. Such a network of databases helps our own teams as well as numerous external users worldwide carry out genome and bioinformatics research, creating a live digital "ecosystem" that stores and expands knowledge. This paper presents an overview of each of these databases, and examples of interconnections and synergistic benefits.

2 GeneCards

The individual scientist, seeking research knowledge about a gene of interest, can be overwhelmed by the deluge of data from worldwide genome projects. The laborious task of sifting through hundreds of thousands of records can be reduced by the use of integrated, flexible and user-friendly databases. For over 15 years, we have developed and expanded GeneCards (www.genecards.org), a comprehensive, authoritative compendium of annotative information about human genes, accessed yearly by more than two million unique users. Its gene-centric content is automatically mined and integrated from over 100 electronic sources, including HGNC, NCBI, Ensembl, UniProt, UCSC, REACTOME, and BioGPS, resulting in web-based cards with sections encompassing a variety of topics (e.g., genomic location, gene function, transcription, disorder, literature, and pathways) for each of more than 122,000 human gene entries.[1] GeneCards also features comprehensive searches of all mined data, allowing one to generate data relations that would otherwise be concealed. Also shown are links to gene-related research reagents such as antibodies, recombinant proteins, DNA clones and inhibitory RNAs, as well as human embryonic stem and progenitor cell lines. The GeneCards suite member GeneALaCart provides batch query support, whereby users submit a gene list (e.g., from a microarray experiment) along with desired GeneCards annotation fields and receive tabulated output. Recent research focuses include revamped gene expression visualization, including RNA-seq data from the Illumina BodyMap project, and an expanded 76-tissue repertoire from BioGPS[2] and CGAP Serial Analysis of Gene Expression (SAGE)[3] data.

2.1 ncRNA Genes

Non-coding RNA (ncRNA) genes in the human genome have been increasingly studied in the past few years, and new ncRNA classes have been discovered.[4] However, until recently, no comprehensive non-redundant database for all ncRNA human genes was available. We have recently completed a major unification effort of ncRNA gene entries within GeneCards, integrated from 15 different data sources by mining four secondary sources[5] and several additional primary sources. An integration algorithm based on mapping to genomic coordinates employed, among others, another GeneCards suite member, GeneLoc[6] (Section 2.3). This was used to cluster overlapping entries at unified locations, thereby merging parallel versions of the same gene, as well as precursors with their mature transcripts.[1a] A total of ~64,000 new human ncRNA genes were added to GeneCards, resulting in a total of almost 80,000 entries belonging to 14 RNA classes, including 22,000 piwi-interacting RNAs (piRNAs) and 17,000 long non-coding RNAs (LncRNAs). GeneCards V3.09 now contains ~122,500 gene entries, with a greater than fivefold enhancement of the ncRNA count, compared to the pre-unification version 3.07 (Figure 2A). Among the annotations provided is a quality score, reflecting the degree of confidence that the entry is indeed a bona fide gene, based on functional annotations as well as expression. Current work in progress includes a probabilistic categorization, indicating source-related ambiguities in gene classification (Figure 2B).

2.2 GeneDecks

Enrichment analysis of high-throughput data is a key method used to glean new biological insights from experimental results.[7] GeneDecks, a GeneCards suite member (www.genecards.org/GeneDecks), is an annotation-based enrichment analysis tool for human genes and gene sets. It constitutes a systems biology facilitator, leveraging GeneCards' unique wealth of combinatorial annotations. In its Set Distiller mode, GeneDecks exposes commonalities within a list of genes, shedding functional light upon the constituent individual genes (Figure 3A). Descriptors that are enriched in the query gene set are sorted first by their statistical significance and then by their shared gene count. GeneDecks' strength lies in the broad range of attributes available for use in its analyses, coming from eight categories, such as disorders, compounds, pathways, Gene Ontology (GO) terms and expression.
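The two-level ordering described for Set Distiller (significance first, shared gene count second) can be illustrated with a minimal sketch. The text does not state which statistical test GeneDecks applies, so the one-sided hypergeometric test used here is an assumption, as are the function names `hypergeom_tail` and `rank_descriptors`, the background size, and the toy annotation table, which is illustrative only and is not GeneCards data.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from a background of N genes,
    K of which carry the descriptor (one-sided hypergeometric test)."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def rank_descriptors(query_genes, annotations, background_size):
    """Return (descriptor, p_value, shared_count) tuples, sorted first by
    ascending p-value and then by descending shared gene count."""
    query = set(query_genes)
    rows = []
    for descriptor, genes in annotations.items():
        shared = query & set(genes)
        if not shared:
            continue  # descriptors sharing no gene with the query are skipped
        p = hypergeom_tail(len(shared), len(genes), len(query), background_size)
        rows.append((descriptor, p, len(shared)))
    rows.sort(key=lambda row: (row[1], -row[2]))
    return rows


# Hypothetical descriptor -> gene-set table (toy values, assumed for illustration).
toy_annotations = {
    "pathway:olfactory transduction": {"OR1A1", "OR2W1", "GNAL", "ADCY3", "CNGA2"},
    "disorder:anosmia": {"CNGA2", "GNAL", "SCN9A"},
    "GO:ion transport": {"CNGA2", "SCN9A", "KCNQ1", "CFTR"},
    "compound:unrelated": {"TP53", "BRCA1"},
}

ranked = rank_descriptors(["GNAL", "ADCY3", "CNGA2"], toy_annotations, 100)
for descriptor, p_value, shared_count in ranked:
    print(f"{descriptor}\tp={p_value:.3g}\tshared={shared_count}")
```

In this toy run the descriptor shared by all three query genes ranks first, and the descriptor with no overlap is omitted. A production tool would typically also correct the p-values for multiple testing, which this sketch leaves out.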
Figure 1. Data integration and information flow. One hundred varied external sources provide detailed information to the GeneCards integrated database of human genes, along with its suite members, GeneLoc exon-based genomic map and GeneDecks sets analysis. GeneCards has recently greatly enhanced its ncRNA gene coverage. GeneCards empowers specific databases as shown, covering tools for olfactory genes as well as disease-specific genes and biomarkers. Two novel medically focused databases, MalaCards and LifeMap Discovery, together with GeneCards, form a powerful trio of mutually enriching bioinformatics tools.

It produces