Computational Evaluation of Exome Sequence Data Using Human and Model Organism Phenotypes Improves Diagnostic Efficiency

ORIGINAL RESEARCH ARTICLE Official journal of the American College of Medical Genetics and Genomics Open Computational evaluation of exome sequence data using human and model organism phenotypes improves diagnostic efficiency William P. Bone1, Nicole L. Washington, PhD2, Orion J. Buske3,4, David R. Adams, MD, PhD1,5, Joie Davis1, David Draper1, Elise D. Flynn1, Marta Girdea3,4, Rena Godfrey1, Gretchen Golas1, Catherine Groden1, Julius Jacobsen, PhD6, Sebastian Köhler, PhD7, Elizabeth M. J. Lee1, Amanda E. Links1, Thomas C. Markello, MD, PhD1, Christopher J. Mungall, PhD2, Michele Nehrebecky1, Peter N. Robinson, MD7, Murat Sincan, MD1, Ariane G. Soldatos, MD1, Cynthia J. Tifft, MD, PhD1,5, Camilo Toro, MD1, Heather Trang3,4, Elise Valkanas1, Nicole Vasilevsky, PhD8, Colleen Wahl1, Lynne A. Wolfe1, Cornelius F. Boerkoel, MD, PhD1, Michael Brudno, PhD3,4, Melissa A. Haendel, PhD8, William A. Gahl, MD, PhD1,5, and Damian Smedley, PhD6 Purpose: Medical diagnosis and molecular or biochemical confir- Results: Benchmarking showed Exomiser ranked the causal variant mation typically rely on the knowledge of the clinician. Although as the top hit in 97% of known disease–gene associations and ranked this is very difficult in extremely rare diseases, we hypothesized that the correct seeded variant in up to 87% when detectable disease–gene the recording of patient phenotypes in Human Phenotype Ontology associations were unavailable. Using UDP data, Exomiser ranked (HPO) terms and computationally ranking putative disease-associ- the causative variant(s) within the top 10 variants for 11 previously ated sequence variants improves diagnosis, particularly for patients diagnosed variants and achieved a diagnosis for 4 of 23 cases undiag- with atypical clinical profiles. nosed by clinical evaluation. Methods: Using simulated exomes and the National Institutes of Conclusion: Structured phenotyping of patients and computational Health Undiagnosed Diseases Program (UDP) patient cohort and analysis are effective adjuncts for diagnosing patients with genetic associated exome sequence, we tested our hypothesis using Exomiser. disorders. Exomiser ranks candidate variants based on patient phenotype simi- Genet Med advance online publication 12 November 2015 larity to (i) known disease–gene phenotypes, (ii) model organism phenotypes of candidate orthologs, and (iii) phenotypes of protein– Key Words: exome sequencing; model organisms; phenotype; protein association neighbors. semantic comparison; undiagnosed diseases INTRODUCTION Exome sequencing identifies thousands of exomic varia- The National Institutes of Health (NIH) established the tions relative to the human reference sequence, and this often Undiagnosed Diseases Program (UDP) to provide explanations precludes generation of a tractable number of diagnostic to individuals with elusive disorders and to improve understand- hypotheses. This problem remains after filtering for predicted ing of rare and common disorders. Of the individuals or families deleteriousness, conservation, and Mendelian modes of inheri- admitted, ~15% are diagnosed with a known condition based tance.4 Many variants may even be true loss-of-function muta- on clinical evaluation.1,2 For those remaining without a diagno- tions tolerated in the average human genome.5,6 Also, because sis, the NIH UDP screens for potential genetic etiologies. The the UDP often has DNA from only the proband or a few family diagnoses have included known disorders that were previously members, or sometimes a few unrelated patients for each dis- unrecognized, atypical presentations of known disorders, com- ease being studied, traditional linkage and cohort analyses are binations of several disorders, and completely novel diseases.2,3 not possible. 1Undiagnosed Diseases Program, Common Fund, Office of the Director, National Institutes of Health, Bethesda, Maryland, USA; 2Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA; 3Centre for Computational Medicine Hospital for Sick Children, Toronto, Ontario, Canada; 4Department of Computer Science, University of Toronto, Toronto, Ontario, Canada; 5Medical Genetics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA; 6Skarnes Faculty group, Wellcome Trust Sanger Institute, Hinxton, UK; 7Institute for Medical Genetics and Human Genetics, Charité-Universitätsmedizin Berlin, Berlin, Germany; 8Library; and Department of Medical Informatics and Epidemiology, Oregon Health & Science University, Portland, Oregon, USA. Correspondence: Damian Smedley ([email protected]) Submitted 1 March 2015; accepted 27 August 2015; advance online publication 12 November 2015. doi:10.1038/gim.2015.137 608 Volume 18 | Number 6 | June 2016 | GEnEticS in mEdicinE Evaluation of exome data using human and model organism phenotypes | BONE et al ORIGINAL RESEARCH ARTICLE Encoding patient phenotypes using an ontology such as Filtration and evaluation of exome variants Human Phenotype Ontology (HPO) assists computational After filtering the variants for rarity, segregation, and delete- ranking and identification of causal variants.7–10 However, riousness as described in the Supplementary Methods online because only a small portion of human coding genes (~35% (Supplementary Figure S1 online), the variants were submit- by our estimation) have been associated with disease, can- ted to Exomiser for ranking. Note the variants are filtered sepa- didate variations in genes with unknown function or pheno- rately under compound heterozygous, homozygous recessive, typic consequences are difficult to evaluate. To address this, the de novo dominant, and X-linked inheritance models, and the first version of Exomiser prioritized causative variants using output from these four models is combined before ranking by semantic comparisons of encoded mutant mouse phenotypes Exomiser. Following ranking of the variants by Exomiser, we against the HPO-encoded patient phenotypes as well as vari- reviewed the quality of alignment and genotype of the highest- ant frequency, pathogenicity, and inheritance models.10 Despite ranked variants as described in the Supplementary Methods improving variant identification, Exomiser was still biased to online. identify causal variants in genes for which their orthologs had a phenotype annotated. We therefore hypothesized that querying Ranking of variants using Exomiser human, mouse, and zebrafish mutant phenotypes together and Using the data sources described in the Supplementary the phenotypes associated with mutations of interacting pro- Methods online, Exomiser assesses variant candidacy by allele teins can improve Exomiser function. frequency, predicted pathogenicity, and inheritance model, as To test this hypothesis, we recorded structured phenotype well as the likelihood of a mutation in the gene causing the dis- descriptions utilizing HPO and performed Exomiser analysis ease based on the similarity of a patient’s phenotype to known of sequence variants from 32 UDP families (9 diagnosed and gene-associated phenotypes in human, mouse, or zebrafish. 23 undiagnosed). This showed that ontology-based phenotype Additionally, for candidate genes that have no terms that match analysis, cross-species integration, and protein–protein asso- the patient, the proximity of a candidate gene to another phe- ciation walking aid the prioritization of candidate variants and notypically similar gene via protein–protein associations is lead to more efficient diagnoses as well as to the generation of leveraged. plausible novel disease–gene hypotheses. Genes were ranked for candidacy by Exomiser’s variant and phenotypic relevance score. The optimal method of combining MATERIALS AND METHODS these scores was generated by logistic regression on a training Human subjects set of 10,000 Human Gene Mutation Database (HGMD) disease Patients or their guardians gave written informed consent to variants and 10,000 benign variants from the 1000 Genomes protocol 76-HG-0238, approved by the Institutional Review Project. The Weka data mining suite was used to run the logis- Board of the National Human Genome Research Institute. We tic classifier on files containing phenotype and variant scores analyzed 9 families diagnosed by the UDP clinicians and 23 for equal amounts of disease and benign variants. Ten-fold families whose affected members remained undiagnosed when cross-validation was used to train and test the model (91.5% we began this investigation. of variants were correctly classified in the cross-validation, κ = 0.8296) and average parameters from the cross-validation runs Patient phenotype documentation and curation were used to generate the final model17: As culled from medical records, the patient phenotype was 11 recorded using the PhenoTips (http://phenotips.org) tool using 1 HPO, curated by members of the Monarch Initiative according ExomiserScore = 1+ e−−( 13..96+×11 61 phenotypeScore+×11.61 varianntScore) to qualitative guidelines and quantitative graph-based metrics (http://monarchinitiative.org/page/services), and approved by the attending physician.12 Positive HPO terms were exported Variant score from PhenoTips and used for Exomiser analysis (Supplementary Exomiser’s variant score is a measure of how rare and patho- Table S1 online); negative and onset of disease terms are not cur- genic the variant is. A variant frequency score between 0 and 1 rently utilized by the algorithm. For families with phenotype was calculated as previously optimized and described10: data from multiple affected

Computational Evaluation of Exome Sequence Data Using Human and Model Organism Phenotypes Improves Diagnostic Efficiency

Supplementary Table 1: Adhesion Genes Data Set

Detailed Characterization of Human Induced Pluripotent Stem Cells Manufactured for Therapeutic Applications

Brain Region-Specific Gene Signatures Revealed by Distinct Astrocyte Subpopulations Unveil Links to Glioma and Neurodegenerative

Identification of Shared and Unique Gene Families Associated with Oral

Regulation of Muscle Stem Cell Function by the Transcription Factor Pax7

A Genomic Approach to Delineating the Occurrence of Scoliosis in Arthrogryposis Multiplex Congenita

SARS-Cov-2 Infected Pediatric Cerebral Cortical Neurons: Transcriptomic Analysis and Potential Role of Toll-Like Receptors in Pathogenesis

PARKINSON DISEASE LOCI in the MID-WESTERN AMISH by Mary

Fine Mapping of a Linkage Peak with Integration of Lipid Traits Identifies

Differentially Expressed Genes in Aneurysm Tissue Compared With

The Role of MEGF10 in Skeletal Muscle Myopathy

Genome-Wide DNA Methylation Analysis Identifies MEGF10 As a Novel Epigenetically Repressed Candidate Tumour Suppressor Gene in Neuroblastoma