Methods for Determining the Genetic Causes of Rare Diseases
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Apollo Methods for Determining the Genetic Causes of Rare Diseases Daniel Greene MRC Biostatistics Unit University of Cambridge This dissertation is submitted for the degree of Doctor of Philosophy Clare College January 2018 Methods for Determining the Genetic Causes of Rare Diseases Daniel Greene Thanks to the affordability of DNA sequencing, hundreds of thousands of individuals with rare disorders are undergoing whole-genome sequencing in an effort to reveal novel disease aetiologies, increase our understanding of biological processes and improve patient care. However, the power to discover the genetic causes of many unexplained rare diseases is hindered by a paucity of cases with a shared molecular aetiology. This thesis presents research into statistical and computational methods for determining the genetic causes of rare diseases. Methods described herein treat important aspects of the nature of rare diseases, including genetic and phenotypic heterogeneity, phenotypes involving multiple organ systems, Mendelian modes of inheritance and the incorporation of complex prior information such as model organism phenotypes and evolutionary conservation. The complex nature of rare disease phenotypes and the need to aggregate patient data across many centres has led to the adoption of the Human Phenotype Ontology (HPO) as a means of coding patient phenotypes. The HPO provides a standardised vocabulary and captures relationships between disease features. The use of such ontologically encoded data is widespread in bioinformatics, with ontologies defining relationships between concepts in hundreds of subfields. However, there has been a dearth of tools for manipulating and analysing ontological data. I developed a suite of software packages dubbed ‘ontologyX’ in order to meet these needs, simplify visualisation of such data, and enable them to be incor- porated into complex analysis methods. An important aspect of the analysis of ontological data is quantifying the semantic similarity between ontologically annotated entities, which is implemented in the ontologyX software. We employed this functionality in a phenotypic similarity regression framework, ‘SimReg’, which models the relationship between ontologi- cally encoded patient phenotypes of individuals and rare variation in a given genomic locus. It does so by evaluating support for a model under which the probability that a person carries rare alleles in a locus depends on the similarity between the person’s ontologically encoded phenotype and a latent characteristic phenotype which can be inferred from data. A proba- bility of association is computed by comparison of the two models, allowing prioritisation of candidate loci for involvement in disease with respect to a heterogeneous collection of disease phenotypes. SimReg includes a sophisticated treatment of HPO-coded phenotypic data but dichotomises the genetic data at a locus. Therefore, we developed an additional method, ‘BeviMed’, standing for Bayesian Evaluation of Variant Involvement in Mendelian Disease, which evaluates the evidence of association between allele configurations across rare variants within a genomic locus and a case/control label. It is capable of inferring the probability of association, and conditional on association, the probability of each mode of inheritance and probability of involvement of each variant. Inference is performed through a Bayesian comparison of multiple models: under a baseline model disease risk is independent of allele configuration at the given rare variant sites and under an alternate model disease risk depends on the configuration of alleles, a latent partition of variants into pathogenic and non-pathogenic groups and a mode of inheritance. The method can be used to analyse a dataset comprising thousands of individuals genotyped at hundreds of rare variant sites in a fraction of a second, making it much faster than competing methods and facilitating genome-wide application. The thesis concludes by describing an analysis pipeline and web application called ‘Gene-docs’ that utilises ontologyX, SimReg and BeviMed to perform inferences and makes the results and the underlying data available to collaborating biologists and clinicians. I would like to dedicate this thesis to my wife Danielle, whose love and support has been my motivation thoughout. Declaration I hereby declare that except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other university. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and Acknowledgements. This dissertation contains fewer than 60,000 words including appendices, bibliography, footnotes, tables and equations. Daniel Greene January 2018 Acknowledgements I would like to thank my supervisor, Dr. Ernest Turro, who has been consistently supportive, generous with his time and ideas, and an inspiration. Being his student has been a privilege and a joy, and without him none of this would have been possible. I would also like to thank my co-supervisors, Professor Sylvia Richardson and Professor Willem Ouwehand, for all of their support, direction and many fruitful discussions over the years. I am also grateful to my colleague William Astle, who has always made time for technical discussions and had many good suggestions. I would like to thank my wife, Danielle, and our children, Sean and Lyra, for their love and patience, for keeping my spirits up, and being there for me every day. Lastly, I would like to thank my parents, Brendan and Michele, for all their encouragement and support during the course of my PhD, and for opening all the doors which led me here. Table of contents List of figures xiii List of tables xv 1 Background1 1.1 A portrait of rare Mendelian diseases . .3 1.2 The impact of high-throughput sequencing on rare disease research . .7 1.3 The National Institute for Health Research BioResource–Rare Diseases project dataset . .8 1.4 The ThromboGenomics platform . 10 1.5 Overview of thesis . 11 1.6 Current methods and software for analysing rare disease data . 14 2 Methods for ontological data and applications 23 2.1 ontologyX . 24 2.2 Statistical significance of within-group ontological similarity . 29 2.3 Rare variant prioritisation for genetic diagnostics . 34 2.4 Similarity to model organism phenotypes . 36 2.5 Unsupervised clustering of ontological phenotypes . 38 3 Phenotype similarity regression 45 3.1 Introduction . 45 3.2 Model specification . 47 3.3 Inference . 54 3.4 Simulation study . 58 3.5 Application to real data . 62 3.6 Discussion . 69 xii Table of contents 4 Bayesian evaluation of variant involvement in Mendelian disease 71 4.1 Introduction . 72 4.2 Model specification . 73 4.3 Inference . 75 4.4 Simulation study . 80 4.5 Application to real data . 84 4.6 Discussion . 89 5 Gene-docs: a web application for browsing phenotypic and genetic data 93 6 Conclusions 99 References 105 Appendix A ontologyX comparisons and examples 117 Appendix B SimReg manual 123 Appendix C BeviMed manual 129 List of figures 1.1 Number of rare alleles observed per gene . .2 1.2 Pairs of organ systems affected by rare diseases . .6 1.3 HPO encoded phenotypes in the NBR–RD project . .9 1.4 Example HPO phenotype . 15 2.1 Visualisation of ontological set operations in ontologyIndex . 25 2.2 GO annotation of QPCTL and CRNN .................... 28 2.3 Distribution of comorbidities in BPD study participants . 32 2.4 Heatmap of phenotypic similarity between BPD cases grouped by pedigree and clinical diagnosis . 33 2.5 Prioritisation of variants by similarity to gene HPO profile . 35 2.6 Performance of HPO-based variant prioritisation . 36 2.7 Prioritisation of variants for family with bleeding, thrombocytopenia and bone pathologies . 39 2.8 Schematic representation of enrichment test strategy for phenotype clustering 41 3.1 Cartoon of SimReg model . 46 3.2 Priors for similarity transformations f and g ................. 51 3.3 Inferred f for various parameterisations of the similarity function . 52 3.4 SimReg simulation study . 60 3.5 Relationship between genetic heterogeneity and power . 61 3.6 SimReg specificity . 62 3.7 SimReg results for ACTN1 .......................... 65 3.8 SimReg results for DIAPH1 and RASGRP2 ................. 67 3.9 SimReg results for all genes . 68 4.1 BeviMed simulation study . 83 4.2 BeviMed inference applied to ANKRD26 ................... 87 4.3 BeviMed inference applied to RNU4ATAC .................. 88 xiv List of figures 5.1 Gene-docs main page . 95 5.2 BeviMed annotation of variants on gene page for GP1BB .......... 96 5.3 Gene-docs pedigree page . 97 B.1 Plotting marginal probabilities of term inclusion in f with SimReg package 125 List of tables 1.1 NBR–RD subprojects . 10 2.1 Execution time for retrieving descendants and ancestors of terms using different ontology software . 26 2.2 Performance comparison of semantic similarity packages in R . 29 2.3 Abbreviations used for HPO terms in Figure 2.7............... 40 3.1 SimReg performance evaluation . 63 3.2 Known disease-gene associations identified using SimReg . 69 4.1 Loci with BeviMed posterior probabilities of association with thrombocy- topenia at least 0.9 . 89 4.2 Performance comparison of rare variant association