Human Genotype–Phenotype Databases: Aims, Challenges and Opportunities

REVIEWS COMPUTATIONAL TOOLS Human genotype–phenotype databases: aims, challenges and opportunities Anthony J. Brookes1,2 and Peter N. Robinson3–5 Abstract | Genotype–phenotype databases provide information about genetic variation, its consequences and its mechanisms of action for research and health care purposes. Existing databases vary greatly in type, areas of focus and modes of operation. Despite ever larger and more intricate datasets — made possible by advances in DNA sequencing, omics methods and phenotyping technologies — steady progress is being made towards integrating these databases rather than using them as separate entities. The consequential shift in focus from single-gene variants towards large gene panels, exomes, whole genomes and myriad observable characteristics creates new challenges and opportunities in database design, interpretation of variant pathogenicity and modes of data representation and use. The term ‘genotype–phenotype database’ covers a wide gene panels2, whole-exome sequencing (WES)3, clinically range of online and institutional implementations focused WES4 and even whole-genome sequencing5. of systems used for recording and making available However, the power of genomics technologies can be a datasets that include genetic data (for example, DNA double-edged sword: although high-throughput meth- sequences, variants and genotypes), phenotype data ods help find novel disease genes and pathogenic vari- 1Department of Genetics, (observable characteristics of an individual) and corre- ants, they also generate many misleading pathogenicity University of Leicester, lations between the two. In biomedicine, such databases assignments. For instance, a typical WES in a rare dis- Leicester LE1 7RH, UK. 2 are principally focused on human genetic data and the ease medical context will uncover 30,000–100,000 vari- Data to Knowledge for 6 Practice Facility, resulting normal or disease phenotypes. Genotype– ants relative to today’s reference genome , of which only Cardiovascular Research phenotype databases have one fundamental goal: to one or a few may be causative. Approximately 10,000 Centre, Glenfield Hospital, provide access to sufficient data and knowledge to enable of these variants will have putative molecular conse- Leicester LE1 9HN, UK. the functional and pathogenic significance of genetic quences by inserting or deleting genic sequences, or 3 Institute for Medical 1 Genetics and Human variants to be reliably established and documented . by causing missense or nonsense amino acid changes 7,8 Genetics, and the Berlin It is critical to distinguish disease-causing alleles or alterations of conserved splice site residues . After Brandenburg Center for from the abundance of neutral variants that co‑occur eliminating all those that are common in the overall, Regenerative Therapies, in normal and disease-affected individuals. Incorrectly healthy population or otherwise predicted to be non- Charité Universitätsmedizin assigning pathogenicity to variants can lead to inaccurate pathogenic, several hundred possible disease-causing Berlin, 13353 Berlin, Germany. genetic diagnoses or disease risk assessments in other candidates remain. Indeed, all ‘normal’ human genomes 4Max Planck Institute for individuals harbouring these variants. Achieving this are estimated to contain ~100 loss‑of‑function alleles, Molecular Genetics, 14195 distinction is challenging given that relatively few alleles with ~20 genes completely inactivated in each individ- Berlin, Germany. have sufficiently large effect sizes to unambiguously stand ual9. Thus, identifying true pathogenic variants among 5Institute for Bioinformatics, Department of Mathematics out from background noise (that is, false positives due to the many alleles that are putatively pathogenic remains and Computer Science, Free measurement errors, hidden biases or multiple testing). a major challenge, and one that can best be tackled by University of Berlin, 14195 With the advent of next-generation sequencing maximizing the reliability, usefulness and availability Berlin, Germany. (NGS), traditional single-gene analyses using Sanger of data and their annotations, which are recorded in Correspondence to A.J.B. sequencing are increasingly being superseded in genotype–phenotype databases. e‑mail: [email protected] doi:10.1038/nrg3932 both research and diagnostic settings as relationships In this Review, we consider the main objectives, Published online between an individual’s genetic make‑up (genotype) state of progress, challenges and initiatives that pres- 10 November 2015 and disease (phenotype) can now be explored using ently characterize genotype–phenotype databases. We 702 | DECEMBER 2015 | VOLUME 16 www.nature.com/reviews/genetics © 2015 Macmillan Publishers Limited. All rights reserved REVIEWS also identify important missing elements that need to number of related genes are grouped together, such as be developed to maximize the field’s potential. Given the resources listed in TABLE 1 for Fanconi anaemia (16 the vastness of the topic, we have limited the focus to genes), osteogenesis imperfecta (16 genes), amyotrophic practical and technical issues and opportunities rather lateral sclerosis (116 genes) or immunodeficiency (131 than ethical, legal and social matters, which cut across a genes). Other databases concentrate on genetic diseases broader swathe of biomedical practice. of particular importance in a given country or ethnic group11; for example, the collection of ETHNOS data- Overview of the current database landscape bases. The Leiden Open Variation Database12 and the Given the immense size, complexity and variability of Universal Mutation Database13 software platforms are human genomes, the large range of possible normal and widely used for LSDB creation, as they were designed Genotype disease phenotypes, and the myriad settings in which from the outset to be generic solutions for this domain. In biology, genotype refers to one might wish to capture, organize and utilize human The technical proficiency and data quality of small- the genetic makeup of an genotype–phenotype data, there is a need for many dif- scale, public databases varies. Beyond the inbuilt proper- organism with reference to either a single nucleotide, a ferent types of genotype–phenotype databases. These ties of the adopted database software they are created from, larger genetic locus or the databases are sometimes created using simple spread- these databases tend to be tailored to local preferences entire genome. In the current sheets or text file approaches, but the rapidly increas- and may use few standards in their design and operation, context, genotype refers to a ing size and complexity of relevant datasets is generally with little or no integration with other systems or options genetic sequence variant being assessed for potential causality being matched by progress in computing and database for bulk download. As such, despite contributing to of a disease, as well as its technologies. the global digital collection of data relating genotypes status as heterozygous, Categorization of genotype–phenotype databases can to phenotypes, such databases tend to each be visited by homozygous or hemizygous. be broadly based on many factors, such as the following: only a limited number of users and may not be secured distinctions between health care and research settings; or integrated long-term due to sustainability challenges. Phenotype In biology, phenotype refers to open versus controlled access or limited access systems; This piecemeal arrangement of mixed quality datasets the observable characteristics archival systems as opposed to those that handle real- obviously limits the scope for good data exploitation; of an organism, but in time content (including Big data projects); primary data however, even greater amounts of poorly exploited medicine, the word is usually through to aggregated forms of data; and the spectrum data reside in non-public genotype–phenotype data- used to describe clinically relevant abnormalities, between centralized (single-site data depositories) and bases that support research and diagnostic laboratories including signs, symptoms federated (multiple interconnected data sites) ways of in non-profit or commercial environments. and abnormal findings of bringing datasets together. Furthermore, databases can laboratory analyses, imaging be classified on the basis of the type of stored data and Large-scale databases. Numerous larger genotype– studies, physiological the source of this information (for example, disease- phenotype databases — primarily ‘central’ databases examinations, as well as behavioural anomalies. related mutations collected from the literature, primary — have been created; a selection of the most notable of through to fully processed and interpreted data from these is summarized in TABLE 1. Some larger genotype– Variants NGS-based investigations, and manually curated data phenotype databases seek to provide comprehensive Genetic variants describe any combined from various sources). Evidently, there is coverage of germline variations in single genes across the deviations from a normal or reference sequence. For substantial overlap and interplay between these differ- majority of known Mendelian diseases. Principal exam- example, a substitution of one ent forms of genotype–phenotype databases, and when ples include the commercial Human Gene Mutation nucleotide for another at a and where each option might best be deployed. In this Database, which

Human Genotype–Phenotype Databases: Aims, Challenges and Opportunities

A Decision Support System for Reporting High Throughput Sequencing of Cancers in Clinical Diagnostic Laboratories Kenneth D

Locus Reference Genomic Sequences: an Improved Basis for Describing Human DNA Variants

14 CLINICAL GENOMICS CODE SYSTEMS Information on Code Systems with Name, HL70396 Code, OID, Source Information Links, and Description

Haplosaurus Computes Protein Haplotypes for Use in Precision Drug Design

Diagnosis of Neurogenetic Disorders Contribution of Next Generation Sequencing and Deep Phenotyping

Establishing an Automated Graphical Genome Analysis Platform

Genes and Variants Underlying Human Congenital Lactic Acidosis—From Genetics to Personalized Treatment

Algorithms for the Description of Molecular Sequences Issue Date: 2016-12-21

The Online Abstract Book

Accurate Mutation Annotation and Functional Prediction Enhance the Applicability of -Omics Data in Precision Medicine Tenghui Chen

Pathos: a Decision Support System for Reporting High Throughput Sequencing of Cancers in Clinical Diagnostic Laboratories Kenneth D

Distribute! Or! Use! Other! Than! For! Personal! Purposes! Without! The! Permission! Of! Suzanne!Leal!Or!Michael!Nothnagel.!