<<

Review

DOI: 10.1002/ijch.201200094 An Overview of Synergistic Data Tools for Biological Scrutiny

Tsviya Olender,*[a] Marilyn Safran,[a] Ron Edgar,[b] Gil Stelzer,[a] Noam Nativ,[a] Naomi Rosen,[a] Ronit Shtrichman,[b] Yaron Mazor,[b] Michael D. West,[b] Ifat Keydar,[a] Noa Rappaport,[a] Frida Belinky,[a] David Warshawsky,[b] and Doron Lancet[a]

Abstract: A network of biological databases is reviewed, the realm of disease scrutiny, we portray MalaCards, a novel supplying a framework for studies of human and the comprehensive database of human diseases and their anno- association of their genomic variations with human pheno- tations. Also shown is GeneKid, a tool aimed at generating types. The network is composed of GeneCards, the human novel kidney disease biomarkers using systems biology, as compendium, which provides comprehensive informa- well as Xome, a database for whole-exome next-generation tion on all known and predicted human genes, along with DNA sequences for human diseases in the Israeli popula- its suite members GeneDecks and GeneLoc. Two databases tion. Finally, we show LifeMap Discovery, a database of em- are shown that address genes and variations focusing on ol- bryonic development, stem cell research and regenerative factory reception (HORDE) and transduction (GOSdb). In medicine, which links to both GeneCards and MalaCards. Keywords: bioinformatics · databases · gene sequencing · genetic disease · olfaction

1 Introduction 2 GeneCards

A main focus of our research is the use of genomics and The individual scientist, seeking research knowledge bioinformatics tools to obtain novel biological insights. about a gene of interest, can be overwhelmed by the These approaches involve the application of computation- deluge of data from worldwide genome projects. The la- al strategies to biological information management and borious task of sifting through hundreds of thousands of analysis. We develop data compendia that assist in obtain- records can be reduced by the use of integrated flexible ing a better understanding of how genes and their varia- and user-friendly databases. For over 15 years, we have tions underlie different human phenotypes. We focus on developed and expanded GeneCards (www. several parallel threads, including human olfaction, Men- .org), a comprehensive, authoritative compendium of an- delian and complex diseases, embryonic development and notative information about human genes, accessed yearly stem cell research. In the spirit of systems biology, we by more than two million unique users. Its gene-centric help generate genome-wide views to facilitate gene-based content is automatically mined and integrated from over inquiry. One of our long-time fortes is a gene-centric view 100 electronic sources, including HGNC, NCBI, Ensembl, of the universe, as embodied in Gene- UniProt, UCSC, REACTOME, and BioGPS, resulting in Cards. More recently, we generated in parallel the human web-based cards with sections encompassing a variety of disease compendium MalaCards. Both are richly annotat- topics (e.g., genomic location, gene function, transcrip- ed from numerous web resources and provide focused in- tion, disorder, literature, and pathways) for each of more formation to external data structures (Figure 1). A special than 122,000 human gene entries.[1] GeneCards also fea- liaison is between the Weizmann data structure and the LifeMap Discovery database, which provides a unique [a] T. Olender, M. Safran, G. Stelzer, N. Nativ, N. Rosen, I. Keydar, angle on embryonic development and stem cell research. N. Rappaport, F. Belinky, D. Lancet Department of Molecular Genetics Such a network of databases helps our own teams as well Weizmann Institute of Science as numerous external users worldwide carry out genome Rehovot 76100 (Israel) and bioinformatics research, creating a live digital “eco- e-mail: [email protected] system” that stores and expands knowledge. This paper [b] R. Edgar, R. Shtrichman, Y. Mazor, M. D. West, D. Warshawsky presents an overview of each of these databases, and ex- LifeMap Sciences amples of interconnections and synergistic benefits. HaNehoshet 10, Tel Aviv 69710 (Israel)

Isr. J. Chem. 2013, 53, 185 – 198 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim 185 Review T. Olender et al. tures comprehensive searches of all mined data, allowing fields and receive tabulated output. Recent research fo- one to generate data relations that would otherwise be cuses include revamped gene expression visualization, in- concealed. Also shown are links to gene-related research cluding RNA-seq data from the Illumina BodyMap proj- reagents such as antibodies, recombinant proteins, DNA ect, and an expanded 76-tissue repertoire from BioGPS[2] clones and inhibitory RNAs, was well as human embryon- and CGAP Serial Analysis of Gene Expression (SAGE)[3] ic stem and progenitor cell lines. The GeneCards suite data. member GeneALaCart provides batch query support, whereby users submit a gene list (e.g., from a microarray 2.1 ncRNA Genes experiment) along with desired GeneCards annotation Non-coding RNA (ncRNA) genes in the human genome have been increasingly studied in the past few years, and Dr. Tsviya Olender has a PhD in physi- new ncRNA classes have been discovered.[4] However, cal chemistry from the Weizmann Insti- tute. In 1999, she joined the group of until recently, no comprehensive non-redundant database Prof. Lancet as a postdoc and contin- for all ncRNA human genes was available. We have re- ued on as a research associate. Her re- cently completed a major unification effort of ncRNA search focuses on the genetics of gene entries within GeneCards, integrated from 15 differ- human olfaction and the genetics of ent data sources by mining four secondary sources[5] and rare diseases using computational several additional primary sources. An integration algo- tools. Together with Prof. Lancet, she rithm based on mapping to genomic coordinates em- has played a key role in discovering ol- ployed, among others, another GeneCards suite member, factory receptor genes from humans GeneLoc[6] (Section 2.3). This was used to cluster overlap- and other mammals, and developed ping entries at unified locations, thereby merging parallel the human OR data explorer (HORDE) versions of the same gene, as well as precursors with their database. She is also involved in developing tools for deciphering [1a] disease-causing mutations, including the Xome database described mature transcripts. A total of ~64,000 new human here. ncRNA genes were added to GeneCards, resulting in a total of almost 80,000 entries belonging to 14 RNA Marilyn Safran is the head of develop- classes, including 22,000 piwi-interacting RNAs (piRNAs) ment for the multi-disciplinary Gene- and 17,000 long non-coding RNAs (LncRNAs). Gene- Cards/MalaCards team in the Lancet Cards V3.09 now contains ~122,500 gene entries, with lab at the Weizmann Institute of Sci- a greater than fivefold enhancement of the ncRNA ence in Israel. Prior to joining in 1997, count, compared to the pre-unification version 3.07 (Fig- she did software development and ure 2A). Among the annotations provided is a quality management at Ubique Inc., the CS score, reflecting the degree of confidence that the entry is department at Weizmann, Bell Labora- indeed a bona fide gene, based on functional annotations tories, MIT’s Lincoln Laboratory, and as well as expression. Current work in progress includes the Memorial Sloan Kettering Cancer Center, in the fields of bioinformatics, a probabilistic categorization, indicating source-related web applications, compilers, databas- ambiguities in gene classification (Figure 2B). es, and medicine. Marilyn received a BA in Math with Honors from Queens College of CUNY in 1974, 2.2 GeneDecks and an MS in Computer Science from Boston University in 1978. Enrichment analysis of high-throughput data is a key Prof. Lancet, head of the Weizmann In- method used to glean new biological insights from experi- stitute’s Crown Genome Center, has mental results.[7] GeneDecks, a GeneCards suite member a PhD from Weizmann and postdoctor- (www.genecards.org/GeneDecks), is an annotation-based al training from Harvard and Yale. He enrichment analysis tool for human genes and gene sets. pioneered research on the biochem- It constitutes a systems biology facilitator, leveraging istry, genetics and evolution of olfac- GeneCards unique wealth of combinatorial annotations. tion, and investigates human rare dis- ease genes. Lancet and his team devel- In its Set Distiller mode, GeneDecks exposes commonali- oped GeneCards, a world-renowned ties within a list of genes, shedding functional light upon web compendium of human genes, the constituent individual genes (Figure 3A). Descriptors and more recently initiated a compan- that are enriched in the query gene set are sorted first by ion database MalaCards, a comprehen- their statistical significance and then by their shared gene sive web tool for human diseases. count. GeneDecks strength lies in the broad range of at- Lancet was awarded the Takasago and Wright Awards in the USA tributes available for use in its analyses, coming from and the Landau Prize in Israel. He is a member of EMBO and (until eight categories, such as disorders, compounds, pathways, 2012) of the HUGO Council. (GO) terms and expression. It produces

186 www.ijc.wiley-vch.de 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim Isr. J. Chem. 2013, 53, 185 – 198 Synergistic Data Tools for Biological Scrutiny

Figure 1. Data integration and information flow. One hundred varied external sources provide detailed information to the GeneCards inte- grated database of human genes, along with its suite members, GeneLoc exon-based genomic map and GeneDecks sets analysis. Gene- Cards has recently greatly enhanced its ncRNA gene coverage. GeneCards empowers specific databases as shown, covering tools for olfac- tory genes as well as disease-specific genes and biomarkers. Two novel medically focused databases, MalaCards and LifeMap Discovery, to- gether with GeneCards, form a powerful trio of mutually enriching bioinformatics tools. Double-lined arrows depict data sharing, single- lined arrows depict links, and the number of links between each pair of sources is also indicated. superior results compared to parallel analysis systems, In its Partner Hunter mode, one can “GeneDecks” such as DAVID.[8] The new disease-centric database Mal- a given gene with respect to a selected combinatorial an- aCards (Section 4.1) strongly relies on GeneDecks for notation in order to obtain a set of similar genes, i.e., po- many of its annotations, by harnessing Set Distiller to tential functional paralogs. GeneDecks portrays function- find shared descriptors for the genes associated with a dis- al partners in descending order of similarity (Figure 3B), ease.

Isr. J. Chem. 2013, 53, 185 – 198 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim www.ijc.wiley-vch.de 187 Review T. Olender et al.

Figure 2. GeneCards ncRNA genes. A) A unification procedure that integrated numerous data sources within GeneCards resulted in an increase from ~15,000 RNA genes to almost 80,000, part of a gener- al gene count entry enhancement from ~68,000 entries to ~122,500. The uncategorized genes segment in GeneCards has been greatly reduced (from ~25% to ~3%), mainly feeding the ncRNA category. B) Scrutiny within the GeneCards compendium re- vealed prevalent annotation ambiguities among different sources, and among different transcripts for the same gene. This involves a trichotomy among the protein-coding, ncRNA and pseudogene categories. The diagram portrays a prototypic probabilistic catego- rization, quantifying where a gene lies within the spectrum of the Figure 3. GeneDecks analysis modes. A) Set Distiller: ten bron- three types, potentially allowing a more balanced view of a gene’s chiectasis-related genes (circles on the circumference) were ana- functional category. lyzed, resulting in several shared descriptors (inside central circle). Unexpected descriptors, which may give new insight into the dis- ease, appear in bold. B) Partner Hunter: the LEP gene was used as a query gene in an analysis aimed at finding potential functional often discovering relations that might be overlooked paralogous genes via shared attributes. Two of the best results [8b] based on sequence similarity alone. appear to the right, indicating why IL6 scored higher than LEPR.

2.3 GeneLoc The GeneLoc algorithm (genecards.weizmann.ac.il/gene coordinates of a given gene, taking into account every loc/) creates an integrated location map of genes and exon. Additionally, DNA segments classified by catego- markers within the entire human genome.[6] It eliminates ries (such as EST clusters) are presented, alongside the redundancies and assigns each gene a meaningful chro- genes, on a single megabase-scale map, with further infor- mosomal megabase location. This is used, in turn, to mation and links to relevant databases. GeneLocs impor- create a gene identifier, which serves as a GeneCards ID. tant advantage is in generating a simple tabulated list of GeneLoc currently uses gene sets from NCBI, Ensembl, genes and markers in a genomic interval of choice, or and newly added ncRNA sources. It compares these col- flanking a gene of interest, with ready links to GeneCards lections, deciding which entries should be consolidated and other databases such as Ensembl, Genethon and and which are to remain discrete. The resulting GeneLoc NCBI Gene (Figure 4). In this way, it complements “gene territory” reflects the range of the unified genomic graphic maps such as provided by the UCSC.

188 www.ijc.wiley-vch.de 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim Isr. J. Chem. 2013, 53, 185 – 198 Synergistic Data Tools for Biological Scrutiny

Figure 4. GeneLoc tabulation results for the TECPR2 genomic region. GeneLoc presents genes along the ordered by their co- ordinates, hyperlinked to their GeneCards source database(s), and with a description, when available. Gene identifiers from outside data- bases also have a link to a page that displays their exons, as reported by those databases.

3 Olfactory Databases of human OR genes, pseudogenes and segregating pseu- dogenes, as well as ORs from four other vertebrate spe- Olfaction, the sense of smell, is a molecularly complex cies (chimpanzee, dog, opossum and platypus). This data- sensory processing system, capable of producing accurate base is generated by an automated OR-specialized com- odor perception.[9] Olfactory receptors (ORs), which putational pipeline, which mines OR gene and pseudo- detect and recognize a multitude of odorants, constitute gene sequences out of complete genomes.[11] The pipeline the largest multigene family in the mammalian genome. has the capacity to annotate OR pseudogenes, which are They serve as an example of paralog and interindividual usually not identified by standard whole-genome annota- diversity, and reveal a very high count of pseudogenes, tion pipelines, making HORDE a unique resource for all with the unusual phenomenon that intact and inactive OR loci. Human location information produced by this genes may segregate in the population.[10] Downstream pipeline is supplied to HGNC and to GeneCards. Impor- from the ORs are auxiliary genes that mediate sensory tantly, HORDEs automatic pipeline generates gene sym- transduction. bols based on sequence-similarity classification for each gene.[12] For example, OR1A2 indicates family 1, subfami- ly A, member 2. This nomenclature system was subse- 3.1 HORDE quently applied to dog,[13] opossum,[14] and platypus.[15] Its HORDE, the human olfactory data explorer (genome potential as a tool for the study of mammalian ecological .weizmann.ac.il/horde/), presents a complete compendium adaptation was recently demonstrated.[16]

Isr. J. Chem. 2013, 53, 185 – 198 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim www.ijc.wiley-vch.de 189 Review T. Olender et al.

Figure 5. An example of a HORDE card. A) Partial view of the information provided for the olfactory receptor gene OR7C2. B) The haplo- type section for OR7C2. Only segregating positions are shown, with the frequency of each haplotype in the population (%Freq) and its pre- dicted functionality score (CORP[43]).

In an attempt to provide insights into the evolution, the ORs into clusters, expression data (obtained from structure and function of the complete human OR reper- ESTs and microarray data),[17] gene model,[18] annotation toire, HORDE integrates diverse bioinformatic analyses of putative functional amino acid residues,[19] identifica- and additional resources into the database, and provides tion of synthetic clusters,[14] and more (Figure 5A). This links to/from HGNC, GeneCards, ORDB and others. information is pertinent to several open questions such as Thus, information is given on the genomic organization of the control of gene expression in the olfactory system, in-

190 www.ijc.wiley-vch.de 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim Isr. J. Chem. 2013, 53, 185 – 198 Synergistic Data Tools for Biological Scrutiny cluding epithelial zone–specific expression as well as ex- tome sequencing of human olfactory epithelium and pression with locus and allele exclusion,[20] ectopic expres- mouse olfactory epithelium and bulb, aiming to identify sion in non-olfactory tissues,[17,21] such as sperm,[22] and olfactory sensory-enriched transcripts. The information is biased expression of certain OR genes and pseudogenes presented on a web card for each gene, with the main (e.g., reference [23]). symbol (e.g., FGFR1) linked prominently to GeneCards. Genetic variations in OR genes underlie odorant sensi- Employing a global scoring system based on the attri- tivity differences, specific anosmia (diminished sensitivity) butes of the eleven data sources assembled, we identified and specific hyperosmia (enhanced sensitivity).[24] To ad- 1680 candidate auxiliary olfactory genes. To assess our dress the universe of human interindividual variation, we differential expression data sources potential to detect recently integrated a comprehensive catalog of genomic olfactory functional genes, we examined which genes re- variations in human ORs, collected from eleven sour- ceive high scores. Of the 20 top accumulated scoring ces,[10b] including The 1000 Genomes Project. This encom- genes, we found eleven genes that were previously anno- passes single-nucleotide polymorphisms (SNPs) and copy- tated by other data sources as core olfactory genes, and number variation (CNV), as well as deleterious mutation- the rest are candidates for new scrutiny (Figure 6). For al events. Using this catalog, HORDE now presents all a shortlist of the 136 top-scoring genes, we identified ge- inferred protein variants (haplotypes) for all intact OR nomic variants (probably damaging single-nucleotide genes, potentially related to differences in odorant bind- polymorphisms, indels and copy-number deletions) ing (Figure 5B). The catalog provides a realistic and up- gleaned from public repositories. Our GOSdb database of to-date view of the personal OR repertoire, where about genes and their variants should assist in rationalizing the two-thirds of the loci segregate between intact and inacti- great interindividual variation in human overall olfactory vated alleles, and every individual genome contains differ- sensitivity. The database and its variations assist in an on- ent OR combinations. going whole-exome sequencing study of 66 Jewish fami- Future plans include the expansion of HORDE to ad- lies with CGA. In addition, GOSdb may aid in scrutiniz- ditional OR repertoires of terrestrial vertebrates. The ing undeciphered genetic diseases accompanied by olfac- availability of such a collection with the unified standard tory disorders, as exemplified by Kallmann syndrome for nomenclature system of HORDE across species will which, despite substantial progress, most of the genetic allow researchers to make facile cross-species transitions basis remains uncharted.[26] FGFR1 is an example of in the complex OR gene family. a GOSdb gene, implicated in anosmia via in vivo and mouse knockout studies, linked to GeneCards, and associ- ated in Malacards with Kallmann Syndrome. Insights into 3.2 GOSdb this disease were shared with the MalaCards project, and In parallel to the study of odorant sensitivity differences, it was featured as the sample malady on the MalaCards we also study general olfactory sensitivity (GOS), scantly V1.0.1 homepage. Ultimately, GOSdb may illuminate charted interindividual differences in overall (average) other sensory systems using olfaction as a model, and smell sensitivities.[24a] An extreme case of the GOS phe- may contribute to a broader understanding of neuroge- notype is isolated congenital general anosmia (CGA), netics. a non-syndromic inborn complete incapacity to perceive odorants. Our working hypothesis is that genetic varia- tions in auxiliary olfactory genes, including those media- 4 Disease Databases ting transduction and sensory neuronal structure and de- velopment, constitute the genetic basis for GOS and One of the greatest challenges of biomedical research is CGA. To better study such chemosensory phenotypes, we deciphering the underlying mechanisms of human diseas- performed a systematic exploration and created GOSdb, es, which requires accurate classification and annotation. an online resource (genome.weizmann.ac.il/GOSdb) de- Most human diseases arise due to complex interactions rived from eleven data sources, which integrates auxiliary between multiple genetic variants and environmental risk olfactory genes and their variations.[25] As a primary factors.[27] The study of diseases could thus shed light on source, GeneCards was searched for words like “anos- basic biological mechanisms. In parallel, diagnosis and mia”. The resulting gene set was fed into GeneALaCart treatment are facilitated by the huge amount of informa- to extract annotations, including aliases, articles, and ge- tion coming from genomics and proteomics research, al- nomic locations. In parallel, the literature was surveyed, lowing molecular-level support for medical decisions. seeking relevant functional in vitro studies, mouse gene The integration of massive existing amounts of infor- knockouts, and human disorders with olfactory pheno- mation under a single disease nomenclature is an enor- types. Also tackled were published transcriptome and mous challenge. At present, disease compilation, occur- proteome data for genes expressed in olfactory tissues ring in more than 60 existing data sources, is incomplete, and genes identified in olfactory-related linkage peaks. Fi- heterogeneous and often lacking systematic inquiry mech- nally, we performed in-house next-generation transcrip- anisms. Each data source focuses on different aspects of

Isr. J. Chem. 2013, 53, 185 – 198 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim www.ijc.wiley-vch.de 191 Review T. Olender et al.

Figure 6. Twenty top-scoring GOSdb genes. The genes are sorted by their accumulated differential expression data sources score. These include RNA-seq of human olfactory epithelium and mouse olfactory epithelium and bulb, as compared to eight control tissues. Of the 20 genes shown, eleven were previously annotated by other data sources as core olfactory genes. Another nine genes are novel in the olfac- tory context (red), interesting candidates for further scrutiny. disease annotation, and/or contains a partial specialized 4.1 MalaCards list. Promising attempts to settle the varied disease nomen- It is clear to us that the realm of disease databasing re- clature are presented via knowledge representation quires the same “one-stop shop solution” as provided by through standardized vocabularies, to ensure both effec- GeneCards for the gene universe. We have recently intro- tive information sharing and interoperability among infor- duced MalaCards (www.malacards.org),[44] an integrated mation systems.[28] There are several vocabularies, ranging database of human maladies and their annotations, mod- from class-specific ones such as the Infectious Disease eled on the GeneCards strategy, architecture and infor- Ontology (IDO, infectiousdiseaseontology.org/page/ mation affluence. MalaCards mines and merges 44 data Main_Page) to more broadly disposed ones such as the sources to generate a computerized web card for each of International Classification of Diseases (ICD),[29] the Uni- 16,919 human disease entries, with disease-specific priori- fied Medical Language System (UMLS),[30] the Systemat- tized annotations, as well as interdisease connections. It ized Nomenclature of Medicine – Clinical Terms leverages the GeneCards relational database, searches, (SNOMED-CT) (16770974), the Medical Subject Head- and GeneDecks set analyses (Figure 7A). ings (MeSH) (www.nlm.nih.gov/mesh/) and the Disease The MalaCards disease list is built from 15 ranked Ontology (DO).[31] Such data structures range from flat sources, using disease name unification heuristics. Four lists, such as Online Mendelian Inheritance in Man schemes populate MalaCards sections: (i) direct interrog- (OMIM),[32] to hierarchies, as exemplified by the Disease ation of disease resources, to establish integrated disease Ontology. However, significant inconsistencies prevail in names and synonyms, as well as additional annotations basic terms pertaining to diseases. Existing vocabularies such as summaries, drugs/therapeutics, clinical features, are only partially cross-connected to each other, and do genetic tests, and anatomical context; (ii) searches of not define disease concepts uniformly. Moreover, most GeneCards for related publications and for associated existing disease databases only partially associate with genes, with corresponding relevance scores; (iii) analyses any ontology, which greatly limits the effectiveness of for- of disease-associated gene sets using GeneDecks,[8b] to malization and definition unification. calculate statistically significant descriptors enriched in this set (e.g., in the “Anaplastic Ependymoma” Mala- Card, “tumorigenesis” is entered into the phenotypes sec- tion, while “tumor metastasis” is entered into the path-

192 www.ijc.wiley-vch.de 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim Isr. J. Chem. 2013, 53, 185 – 198 Synergistic Data Tools for Biological Scrutiny

Figure 7. MalaCards: a database of human maladies. A) MalaCards architectural pipeline, showing integration of heterogeneous disease databases and the leveraging of GeneCards, GeneDecks set distillation, and internal searches for creating MalaCards annotations. B) Mala- Cards disease network. Nodes are connected by co-appearance in MalaCards search results. Edges are colored by their cluster association and nodes are sized by degree. ways section. This process also assigns a relevance score ed diseases. The latter form the basis for the construction for every hit, and is employed to populate the related dis- of a disease network, based on shared MalaCards annota- eases, phenotypes, pathways, compounds and GO terms tions (Figure 7B). Such networks embody associations sections, also providing relevant GeneCards deep links.); based on etiology, clinical features and clinical conditions. (iv) searches in MalaCards itself, e.g., for additional relat- This broadly disposed network has a power-law degree

Isr. J. Chem. 2013, 53, 185 – 198 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim www.ijc.wiley-vch.de 193 Review T. Olender et al. distribution, as previously indicated for smaller disease networks, implying that some inherent properties are rep- resented within such networks. Our current work in Mala- Cards includes a more profound classification of diseases, and disease set analyses, hoping to make MalaCards an improved tool for biological as well as medical studies.

4.2 GeneKid A powerful way to integrate heterogeneous omics data is by using genes as their common denominator, using the genes annotation network. The expected increase in pa- tient chronic kidney disease (CKD) afflictions,[33] resulting in end-stage renal failure, has inspired SysKid (www.sy skid.eu/), an EU consortium involving 25 research groups from 16 countries. The aim of the consortium is to associ- Figure 8. Overview of the GeneKid workflow. Individual omics ate genes with CKD, so as to assist in the development of studies are uploaded into the GeneKid database and connected to new diagnosis and treatment methods. The crucial step in genes by the symbolizer mechanism. Thereupon, scores (fold- establishing a unified omics network is the connection of change etc.) resulting from mRNA, protein or other experiments each datum, such as RNA expression and metabolites, to can be interlinked through a single entity, namely a gene, and pri- a unique approved human gene symbol, a step termed oritized by an overall score. This process can be used to create “symbolization”. The robustness of GeneCards gene an- a CKD-relevant gene set, which may undergo further scrutiny, im- plicating new pathways, compounds, etc. This could ultimately notation is the basis for overcoming the diversity of gene reveal non-obvious chronic kidney disease biomarker candidates. identifiers supplied by consortium experimentalists. This challenge is significant for microarray probe set and SNP identifiers, and is even more accentuated by the need to associate metabolites to genes; to this end, compound- years, more than 100 genes have been identified as partic- gene associations were fortified using both The Human ipating in rare Mendelian disorders,[38] including our own Metabolome Database (HMDB)[34] and DrugBank,[35] discovery of deleterious mutations in five diseases, such cheminformatics sources that link drugs and compounds as the TECPR2 mutation that causes an autosomal reces- to their gene target. sive form of hereditary spastic paraplegia.[39] However, GeneKid is a resource for data storage and manage- the tremendous and increasing volume of sequencing ment, such as those generated by SysKid. Its database data generated by these technologies provides a great consists of 18 tables storing omics data as the main entity, bioinformatics challenge in terms of data processing, stor- along with study and sample information. Insertion of age, management and interpretation. data into GeneKid occurs after symbolization. Quantita- While typical whole-exome sequencing results in tive measurements detected for each experimental fea- ~25,000 variants per individual genome, identification of ture are stored in GeneKid, and a combinatorial/com- the causal variant involves the application of various fil- pound score is produced for each of them, thus prioritiz- tering strategies.[40] Commonly adopted procedures in- ing their importance in overall SysKid results (Figure 8). clude filtering of common variants, type of allele variants, The GeneKid user interface enables basic lookup serv- predictions of pathogenicity, and selection of an inheri- ices, allowing collaborating groups to access intermediate tance mode. A key component in these analyses is a com- results. When testing potential biomarkers, it is highly parison to an appropriate set of control variants, incorpo- beneficial to use GeneCards posted research reagents. rating exome-sequencing data of individuals from the As an example, siRNAs for seven specific genes of inter- same population and arising from the same sequencing est were extracted from GeneCards; these are already platform. This reduces the number of false positive var- being tested on candidate CKD genes. This knowledge iants and filters out population-specific variants. In this could be the basis for the development of proprietary di- realm, we developed Xome, a database for whole-exome agnostic tools.[36] sequencing data management. The database is currently populated with data from 105 Israeli individuals. These give rise to 474,781 variants. The access to Xome is cur- 4.3 Xome rently password protected, because the database contains Current advances in next-generation sequencing, such as detailed patient information, which is difficult to disguise whole-genome and whole-exome sequencing,[37] have dra- for this small population. matically increased the identification probability of causal variants for genetic disorders. Indeed, during the last two

194 www.ijc.wiley-vch.de 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim Isr. J. Chem. 2013, 53, 185 – 198 Synergistic Data Tools for Biological Scrutiny

Figure 9. An example card for an in vivo cell in the LifeMap Discovery website. A) Quick view of the cell anatomical location (kidney ! metanephric mesenchyme (compartment) ! cell) and summary, notes and development time. B) List of available data on this cell, includ- ing gene expression, high-throughput data, signals, related diseases and more. C) An interactive graphical view of the cell’s development tree. Cells lower down are descendants, and cells can have one or more ancestors. D) Insert of the Gene Expression section, with lists of genes curated from literature, in situ and microarrays.

Isr. J. Chem. 2013, 53, 185 – 198 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim www.ijc.wiley-vch.de 195 Review T. Olender et al.

5 LifeMap Discovery In addition to the in vivo developmental part, the data- base includes in vitro differentiation protocols, as well as Understanding how stem and progenitor cells differenti- characteristics and cell therapy applications of the various ate into mature functional cells during embryonic devel- types of human in vitro cultured cells. These include em- opment is of fundamental interest, and clearly one of the bryonic stem and progenitor cells, adult stem cells, in- most mystifying areas of multicellular life. Being able to duced pluripotent stem cells, and primary cells. The dif- generate functional cells, and grow tissues from these ferentiation protocols are mapped, where possible, to the stem and progenitor cells in a dish, is perhaps the greatest closest matching in vivo development cells, anatomical challenge for medicine in the coming decades.[41] The compartments and tissues. LifeMap Discovery database has been created to bridge The database is divided into the following main parts: these separate areas of research, and bring knowledge (i) in vivo development – the first complete assembly and from the in vivo into the in vitro realm, and from there to reconstruction of cell lineages developed in the mammali- clinical utility, facilitating scientific knowledge and future an body; (ii) stem cell differentiation – cultured cells, applications of regenerative medicine. The underlying their differentiation protocols and cell therapyrelated postulate of LifeMap Discovery is that understanding the applications; and (iii) regenerative medicine – develop- genes expressed in every developing cell, and the signal- ment of stem and progenitor cells into therapeutic prod- ing that drives its differentiation, will provide invaluable ucts. These different parts are connected and interlaced information for (i) identification and classification of dif- by computational and hand-curated methods. Most note- ferentiated stem and progenitor cells, and (ii) suggesting worthy, the in vivo entities are linked to their closest in mechanisms to derive protocols for differentiating these vitro entities whenever data is available. Matching is cells into the desired, more mature cell types.[42] mostly based on gene-expression analysis and also on The LifeMap Discovery database (discovery.lifemap other cell characteristics such as functional and morpho- sc.com) helps users trace the cellular differentiation that logical similarity. occurs during mammalian embryonic development. The The value provided by LifeMap Discovery and its pro- annotated cell development tree stems from the zygote, jected effect on stem cell research and therapeutics origi- branches to progenitor cells, and terminates at mature nate from the combined power of these data, which ena- cells (Figure 9). The LifeMap Discovery database is based bles or facilitates identifying, predicting and indicating on developmental paths to specific fates, such as blood, possible differentiation paths and future regenerative endothelium, motor neurons, bone or cartilage. Develop- medicine applications. LifeMap Discovery integrates with mental data are available at cellular and anatomical reso- GeneCards, where the rich information available on each lutions. In concert with cellular differentiation, complex gene is directly linked, and MalaCards, to which it pro- organs and tissues are formed; hence each developing vides the relationships of cells, tissues and organs to the cell is also a member of specific anatomical compart- disease potentially targeted by regenerative medicine and ments, and associated with tissues composing the develop- cell therapy applications. ing organs. LifeMap Discovery is based on systematic gathering and de novo assimilation of various scientific data de- scribing mostly mouse and human development. Due to the inherent difficulties in studying human development, mouse information is far more abundant, and therefore is included in the database. The mouse data serve both as a foundation for further studies and as a model for Acknowledgements human development. The resulting annotated cell devel- opment ontology tree is thus based on mouse develop- Research at the Weizmann Institute of Science is support- ment, but with cross-related human development infor- ed by grants from the Israel Science Foundation Heritage mation wherever available. The information is collected Fund, the SysKid EU FP7 project (No. 241544) and Life- and described for each cell in specific lineages. Examples Map Sciences Inc. California (USA). Research support are the endothelium, blood, muscle, bone, cartilage, heart comes also from the Crown Human Genome Center, the myocardium, kidney, neuronal cells and more. The anato- Weizmann-Sourasky collaborative grant and the Nella my content includes the organ development path supple- and Leon Benoziyo Center for Neurological Diseases at mented with relevant images, in situ hybridizations, and the Weizmann Institute of Science. We are grateful to high-throughput experimental gene-expression data. Cel- Michal Twik, Tsippi Iny Stein, Shahar Zimmerman, Iris lular information includes the cell development path, Bahir, Danit Oz-Levi, Anna Alkelai and Edna Ben- qualitative gene expression, signaling that affects the dif- Asher from the Weizmann Institute and Yaron Guan- ferentiation process, high-throughput gene-expression Golan from LifeMap Sciences. LifeMap Sciences Inc. is data, related diseases, and relevant references. a subsidiary of BioTime Inc., Alameda, CA (USA).

196 www.ijc.wiley-vch.de 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim Isr. J. Chem. 2013, 53, 185 – 198 Synergistic Data Tools for Biological Scrutiny

References [13] T. Olender, T. Fuchs, C. Linhart, R. Shamir, M. Adams, F. Kalush, M. Khen, D. Lancet, Genomics 2004, 83, 361–372. [1] a) F. Belinky, I. Bahir, G. Stelzer, S. Zimmerman, N. Rosen, [14] R. Aloni, T. Olender, D. Lancet, Genome Biol. 2006, 7, N. Nativ, I. Dalah, T. Iny Stein, N. Rappaport, T. Mituyama, R88. M. Safran, D. Lancet, Bioinformatics 2013, 29, 255–261; [15] W. C. Warren, L. W. Hillier, J. A. Marshall Graves, E. b) G. Stelzer, I. Dalah, T. I. Stein, Y. Satanower, N. Rosen, Birney, C. P. Ponting, F. Grtzner, K. Belov, W. Miller, L. N. Nativ, D. Oz-Levi, T. Olender, F. Belinky, I. Bahir, H. Clarke, A. T. Chinwalla, S. P. Yang, A. Heger, D. P. Locke, Krug, P. Perco, B. Mayer, E. Kolker, M. Safran, D. Lancet, P. Miethke, P. D. Waters, F. Veyrunes, L. Fulton, B. Fulton, Hum. Genomics 2011, 5, 709–717; c) M. Safran, I. Dalah, J. T. Graves, J. Wallis, X. S. Puente, C. Lpez-Otin, G. R. Or- Alexander, N. Rosen, T. Iny Stein, M. Shmoish, N. Nativ, I. dÇez, E. E. Eichler, L. Chen, Z. Cheng, J. E. Deakin, A. Bahir, T. Doniger, H. Krug, A. Sirota-Madi, T. Olender, Y. Alsop, K. Thompson, P. Kirby, A. T. Papenfuss, M. J. Wake- Golan, G. Stelzer, A. Harel, D. Lancet, Database 2010. field, T. Olender, D. Lancet, G. A. Huttley, A. F. Smit, A. DOI: 10.1093/database/baq020. Pask, P. Temple-Smith, M. A. Batzer, J. A. Walker, M. K. [2] C. Wu, C. Orozco, J. Boyer, M. Leglise, J. Goodale, S. Bata- Konkel, R. S. Harris, C. M. Whittington, E. S. Wong, N. J. lov, C. L. Hodge, J. Haase, J. Janes, J. W. Huss 3rd, A. I. Su, Gemmell, E. Buschiazzo, I. M. Vargas Jentzsch, A. Merkel, Genome Biol. 2009, 10, R130. J. Schmitz, A. Zemann, G. Churakov, J. O. Kriegs, J. Brosius, [3] V. E. Velculescu, L. Zhang, B. Vogelstein, K. W. Kinzler, E. P. Murchison, R. Sachidanandam, C. Smith, G. J. Science 1995, 270, 484–487. Hannon, E. Tsend-Ayush, D. McMillan, R. Attenborough, [4] J. S. Mattick, I. V. Makunin, Hum. Mol. Genet. 2006, 15 W. Rens, M. Ferguson-Smith, C. M. Lefvre, J. A. Sharp, (suppl 1), R17–29. K. R. Nicholas, D. A. Ray, M. Kube, R. Reinhardt, T. H. [5] a) P. Flicek, M. R. Amode, D. Barrell, K. Beal, S. Brent, D. Pringle, J. Taylor, R. C. Jones, B. Nixon, J.-L. Dacheux, H. Carvalho-Silva, P. Clapham, G. Coates, S. Fairley, S. Fitzger- Niwa, Y. Sekita, X. Huang, A. Stark, P. Kheradpour, M. ald, L. Gil, L. Gordon, M. Hendrix, T. Hourlier, N. Johnson, Kellis, P. Flicek, Y. Chen, C. Webber, R. Hardison, J. A. K. Kahari, D. Keefe, S. Keenan, R. Kinsella, M. Komor- Nelson, K. Hallsworth-Pepin, K. Delehaunty, C. Markovic, owska, G. Koscielny, E. Kulesha, P. Larsson, I. Longden, W. P. Minx, Y. Feng, C. Kremitzki, M. Mitreva, J. Glasscock, T. McLaren, M. Muffato, B. Overduin, M. Pignatelli, B. Pritch- Wylie, P. Wohldmann, P. Thiru, M. N. Nhan, C. S. Pohl, ard, H. S. Riat, G. R. Ritchie, M. Ruffier, M. Schuster, D. S. M. Smith, S. Hou, M. Nefedov, P. J. de Jong, M. B. Ren- Sobral, Y. A. Tang, K. Taylor, S. Trevanion, J. Vandrovcova, free, E. R. Mardis, R. K. Wilson, Nature 2008, 453, 175–183. S. White, M. Wilson, S. P. Wilder, B. L. Aken, E. Birney, F. [16] S. Hayden, M. Bekaert, T. A. Crider, S. Mariani, W. J. Cunningham, I. Dunham, R. Durbin, X. M. Fernandez- Murphy, E. C. Teeling, Genome Res. 2010, 20, 1–9. Suarez, J. Harrow, J. Herrero, T. J. Hubbard, A. Parker, G. [17] E. Feldmesser, T. Olender, M. Khen, I. Yanai, R. Ophir, D. Proctor, G. Spudich, J. Vogel, A. Yates, A. Zadissa, S. M. Lancet, BMC Genomics 2006, 7, 121. Searle, Nucleic Acids Res. 2012, 40, D84–D90; b) D. Ma- [18] D. Thierry-Mieg, J. Thierry-Mieg, Genome Biol. 2006, 7 glott, J. Ostell, K. D. Pruitt, T. Tatusova, Nucleic Acids Res. (suppl 1), S12. 2011, 39, D52–D57; c) T. Mituyama, K. Yamada, E. Hattori, [19] O. Man, Y. Gilad, D. Lancet, Protein Sci. 2004, 13, 240–254. H. Okida, Y. Ono, G. Terai, A. Yoshizawa, T. Komori, K. [20] a) M. Q. Nguyen, Z. Zhou, C. A. Marks, N. J. Ryba, L. Bel- Asai, Nucleic Acids Res. 2009, 37, D89–D92; d) R. L. Seal, luscio, Cell 2007, 131, 1009–1017; b) T. Imai, H. Sakano, Re- S. M. Gordon, M. J. Lush, M. W. Wright, E. A. Bruford, Nu- sults Probl. Cell Differ. 2009, 47, 57–75. cleic Acids Res. 2011, 39, D514–D519. [21] O. De la Cruz, R. Blekhman, X. Zhang, D. Nicolae, S. Fire- [6] N. Rosen, V. Chalifa-Caspi, O. Shmueli, A. Adato, M. Lapi- stein, Y. Gilad, Mol. Biol. Evol. 2009, 26, 491–494. dot, J. Stampnitzky, M. Safran, D. Lancet, Bioinformatics [22] N. Fukuda, K. Yomogida, M. Okabe, K. Touhara, J. Cell 2003, 19 (suppl 1), i222–i224. Sci. 2004, 117, 5835–5845. [7] D. Groth, H. Lehrach, S. Hennig, Nucleic Acids Res. 2004, [23] L. L. Xu, B. G. Stackhouse, K. Florence, W. Zhang, N. Shan- 32, W313–W317. mugam, I. A. Sesterhenn, Z. Zou, V. Srikantan, M. Augus- [8] a) G. Dennis Jr. , B. T. Sherman, D. A. Hosack, J. Yang, W. tus, V. Roschke, K. Carter, D. G. McLeod, J. W. Moul, D. Gao, H. C. Lane, R. A. Lempicki, Genome Biol. 2003, 4, Soppett, S. Srivastava, Cancer Res. 2000, 60, 6568–6572. P3; b) G. Stelzer, A. Inger, T. Olender, T. Iny-Stein, I. [24] a) I. Menashe, T. Abaffy, Y. Hasin, S. Goshen, V. Yahalom, Dalah, A. Harel, M. Safran, D. Lancet, OMICS 2009, 13, C. W. Luetje, D. Lancet, PLoS Biol. 2007, 5, e284; b) A. 477–487. Keller, H. Zhuang, Q. Chi, L. B. Vosshall, H. Matsunami, [9] A. Kato, K. Touhara, Cell. Mol. Life Sci. 2009, 66, 3743– Nature 2007, 449, 468–472; c) J. F. McRae, J. D. Mainland, 3753. S. R. Jaeger, K. A. Adipietro, H. Matsunami, R. D. New- [10] a) I. Menashe, O. Man, D. Lancet, Y. Gilad, Nat. Genet. comb, Chem. Senses 2012, 37, 585–593. 2003, 34, 143–144; b) T. Olender, S. M. Waszak, M. Viavant, [25] I. Keydar, E. Ben-Asher, E. Feldmesser, N. Nativ, A. Oshi- M. Khen, E. Ben-Asher, A. Reyes, N. Nativ, C. J. Wysocki, moto, D. Restrepo, H. Matsunami, M.-S. Chien, J. M. Pinto, D. Ge, D. Lancet, BMC Genomics 2012, 13, 414. Y. Gilad, T. Olender, D. Lancet, Hum. Mutat. 2013, 34, 32– [11] a) M. Safran, V. Chalifa-Caspi, O. Shmueli, T. Olender, M. 41. Lapidot, N. Rosen, M. Shmoish, Y. Peter, G. Glusman, E. [26] G. P. Sykiotis, N. Pitteloud, S. B. Seminara, U. B. Kaiser, Feldmesser, A. Adato, I. Peter, M. Khen, T. Atarot, Y. W. F. Crowley Jr. , Sci. Transl. Med. 2010, 2, 32rv2. Groner, D. Lancet, Nucleic Acids Res. 2003, 31, 142–146; [27] J. N. Hirschhorn, M. J. Daly, Nat. Rev. Genet. 2005, 6,95– b) T. Olender, E. Feldmesser, T. Atarot, M. Eisenstein, D. 108. Lancet, GMR, Genet. Mol. Res. 2004, 3, 545–553. [28] a) R. H. Scheuermann, W. Ceusters, B. Smith, Summit on [12] G. Glusman, A. Bahar, D. Sharon, Y. Pilpel, J. White, D. Translat Bioinforma 2009, 116–120; b) O. Bodenreider, A. Lancet, Mamm. Genome 2000, 11, 1016–1023. Burgun, in Proceedings of the First International Conference

Isr. J. Chem. 2013, 53, 185 – 198 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim www.ijc.wiley-vch.de 197 Review T. Olender et al.

on Biomedical Ontology (ICBO 2009), July 24–26, 2009, H. J. Vogel, I. Forsythe, Nucleic Acids Res. 2009, 37, D603– Buffalo, NY, 39–42. D610. [29] a) A. A. o. H. D. Systems, A. H. Association, A. o. H. Re- [35] C. Knox, V. Law, T. Jewison, P. Liu, S. Ly, A. Frolkis, A. cords, C. o. C. Classifications, W. H. O. C. f. C. o. D. f. N. Pon, K. Banco, C. Mak, V. Neveu, Y. Djoumbou, R. Eisner, America, The International Classification of Diseases, 9th A. C. Guo, D. S. Wishart, Nucleic Acids Res. 2011, 39, Revision, Clinical Modification, 1991, Ann Arbor: Comm D1035–D1041. on Profess & Hosp Act, 1991; b) W. H. Organization, Inter- [36] R. Fechete, A. Heinzel, P. Perco, K. Monks, J. Sollner, G. national Statistical Classification of Diseases and Related Stelzer, S. Eder, D. Lancet, R. Oberbauer, G. Mayer, B. Health Problems, Tenth Revision: Introduction; list of three- Mayer, Proteomics: Clin. Appl. 2011, 5, 354–366. character categories; tabular list of inclusions and four-char- [37] J. M. Rizzo, M. J. Buck, Cancer Prev. Res. 2012, 5, 887–900. acter subcategories; morphology of neoplams; special tabula- [38] B. Rabbani, N. Mahdieh, K. Hosomichi, H. Nakaoka, I. tion lists for mortality and morbidity; definitions; regulations, Inoue, J. Hum. Genet. 2012, 57, 621–632. World Health Organization, 1992. See: http://apps.who.int/ [39] D. Oz-Levi, B. Ben-Zeev, E. K. Ruzzo, Y. Hitomi, A. classifications/icd10/browse/2010/en Gelman, K. Pelak, Y. Anikster, H. Reznik-Wolf, I. Bar- [30] D. A. Lindberg, B. L. Humphreys, A. T. McCray, Methods Joseph, T. Olender, A. Alkelai, M. Weiss, E. Ben-Asher, D. Inf. Med. 1993, 32, 281–291. Ge, K. V. Shianna, Z. Elazar, D. B. Goldstein, E. Pras, D. [31] L. M. Schriml, C. Arze, S. Nadendla, Y. W. Chang, M. Ma- Lancet, Am. J. Hum. Genet. 2012, 91, 1065–1072. zaitis, V. Felix, G. Feng, W. A. Kibbe, Nucleic Acids Res. [40] a) N. O. Stitziel, A. Kiezun, S. Sunyaev, Genome Biol. 2011, 2012, 40, D940–D946. 12, 227; b) S. Coutant, C. Cabot, A. Lefebvre, M. Lonard, [32] a) A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, E. Prieur-Gaston, D. Campion, T. Lecroq, H. Dauchel, V. A. McKusick, Nucleic Acids Res. 2005, 33, D514–D517; BMC Bioinformatics 2012, 13 (suppl 14), S9. b) A. D. Baxevanis, Curr. Protoc. Hum. Genet. 2012, 73, [41] F. M. Watt, R. R. Driskell, Philos. Trans. R. Soc., B 2010, Unit 9.13. 365, 155–163. [33] S. Wild, G. Roglic, A. Green, R. Sicree, H. King, Diabetes [42] M. D. West, C. Mason, Regener. Med. 2007, 2, 329–333. Care 2004, 27, 1047–1053. [43] I. Menashe, R. Aloni, D. Lancet, BMC Bioinformatics 2006, [34] D. S. Wishart, C. Knox, A. C. Guo, R. Eisner, N. Young, B. 7, 393. Gautam, D. D. Hau, N. Psychogios, E. Dong, S. Bouatra, R. [44] N. Rappaport, N. Nativ, G. Stelzer, M. Twik, Y. Guan- Mandal, I. Sinelnikov, J. Xia, L. Jia, J. A. Cruz, E. Lim, Golan, T. I. Stein, I. Bahir, F. Belinky, C. P. Morrey, M. C. A. Sobsey, S. Shrivastava, P. Huang, P. Liu, L. Fang, J. Safran, D. Lancet, Database 2013, in press; DOI: 10.1093/ Peng, R. Fradette, D. Cheng, D. Tzur, M. Clements, A. database/bat018. Lewis, A. De Souza, A. Zuniga, M. Dawe, Y. Xiong, D. Received: December 10, 2012 Clive, R. Greiner, A. Nazyrova, R. Shaykhutdinov, L. Li, Accepted: February 22, 2013

198 www.ijc.wiley-vch.de 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim Isr. J. Chem. 2013, 53, 185 – 198