EXPANDING THE UTILITY OF WHOLE GENOME SEQUENCING IN THE DIAGNOSIS OF RARE GENETIC DISORDERS

by

Phillip Andrew Richmond
B.A., The University of Colorado Boulder, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

October 2020

© Phillip Andrew Richmond, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Expanding the utility of whole genome sequencing in the diagnosis of rare genetic disorders

submitted by Phillip Andrew Richmond in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Bioinformatics

Examining Committee:

Wyeth W. Wasserman, Professor, Medical Genetics, UBC
Supervisor

Dr. Inanc Birol, Professor, Medical Genetics, UBC
Supervisory Committee Member

Dr. Anna Lehman, Assistant Professor, Medical Genetics, UBC
Supervisory Committee Member

Dr. William Gibson, Professor, Medical Genetics, UBC
University Examiner

Dr. Paul Pavlidis, Professor, Psychiatry, UBC
University Examiner

Additional Supervisory Committee Members:

Sara Mostafavi, Assistant Professor, Medical Genetics, UBC
Supervisory Committee Member

Abstract

The emergence of whole genome sequencing (WGS) has revolutionized the diagnosis of rare genetic disorders, advancing the capacity to identify the “causal” gene responsible for disease phenotypes. In a single assay, many classes of genomic variants can be detected, from small single nucleotide changes to large insertions, deletions, and duplications. While WGS has enabled a significant increase in the diagnostic rate compared to previous assays, at least 50% of cases remain unsolved. The lack of a diagnosis results from limitations in both variant calling and variant interpretation.
As the field of genomic medicine continues to advance, the emergence of novel bioinformatic approaches to variant calling and interpretation heralds promise for the future of undiagnosed cases. In the applied setting, innovation is driven by anecdotes of complex diagnoses, which in turn lead to the development of novel tools and approaches. This is a key theme within this thesis work, where in-depth analysis of a single undiagnosed case leads to an appreciation for a challenging class of variants (short tandem repeats), which in turn leads to the development of novel software for detecting these variants in WGS data. Following the anecdote and novel tool development came an appreciation for the role of simulation, both in enabling the development of bioinformatic innovations and in supporting their uptake into diagnostic analysis pipelines. This appreciation led to the development of a rare disease scenario simulator, which can simulate complex variants in multiple inheritance patterns to emulate challenging cases. Lastly, appreciating the limitations of the linear reference genome, I develop a framework for detecting the presence of user-specified sequences within unmapped read sets. This flexible framework can reproduce microarray-like coverage profiles and genotype SNPs to identify ancestry and sex, which can inform the choice of personalized reference genomes in emergent analysis pipelines. Together, the novel short tandem repeat discovery, bioinformatic innovation, and increased capacity to simulate rare disease cases expand the utility of whole genome sequencing in the diagnosis of rare genetic diseases.

Lay Summary

Mendelian rare genetic disorders occur when a gene is broken, often by mutations, in the genome of an individual. New DNA sequencing technology has brought on the genomic revolution, enabling clinicians and researchers to identify with precision the mutations which cause disease.
As this technology is adopted into practice, our understanding of the genome expands. In turn, this understanding brings novel insights into disease mechanisms, including the identification of many mutations which were previously unknown. The bioinformatics community (scientists who create software to analyze biological data) has been developing, adapting, and improving software and procedures which better utilize DNA sequencing data. Within the scope of this thesis, I focus on the development of software for the simulation and detection of rare disease mutations. I apply these approaches to undiagnosed cases and define a new rare genetic disease. Lastly, I explore an alternative analysis approach for utilizing DNA sequencing data.

Preface

The work presented within this thesis was performed at the BC Children’s Hospital Research Institute, in the Centre for Molecular Medicine and Therapeutics, as part of a PhD program in Bioinformatics within the Faculty of Science at the University of British Columbia. Much of the work in this thesis was done in collaboration with interdisciplinary research teams of scientists and clinicians. Each of the research chapters contains content from a co-first authored publication that is either published, accepted, or currently under review. The introduction and discussion sections of the thesis are not published elsewhere and were written solely by me. Details of my involvement in the research program for each chapter are provided below.

The work presented in Chapter 2 represents a co-first author work, which is under the first round of revision at Human Mutation, titled: GeneBreaker - Variant simulation to improve the diagnosis of rare Mendelian genetic diseases. I devised the work presented, wrote the manuscript, developed downstream benchmarking and test cases, and processed the data to show tool efficacy.
The core codebase for variant simulation was implemented by the other co-first author, Tamar Av-Shalom, an undergraduate research assistant in the lab. Tamar implemented the variant creation methods, MySQL tables, and online web interface under my supervision, and with my assistance throughout the design and debugging process. I was responsible for the creation of all the figures and tables within this thesis section. A version of this work can be found online at https://www.biorxiv.org/content/10.1101/2020.05.29.124495v1, and a submitted version of this work is currently under revision.

The work presented in Chapter 3 has been published in the New England Journal of Medicine as a co-first author work: “Glutaminase Deficiency Caused by Short Tandem Repeat Expansion in GLS” (van Kuilenburg, Tarailo-Graovac et al. 2019). My role in this collaborative work was to process WES and WGS data for a set of undiagnosed patients with similar and specific biochemical phenotypes. In doing so, I discovered missense variants in the GLS gene and manually identified a repeat expansion in the 5' UTR of the GLS gene. The identification of the repeat expansion was a result of collaborative interactions with Dr. Britt Drogemoller, who was attempting to validate a single nucleotide variant in the 5' UTR of GLS using Sanger sequencing. Complications with amplification through PCR led to a thorough manual investigation of the surrounding region in the genome sequence data. Using discordant read signals and extracting full read sequences, I was able to identify a possible short tandem repeat expansion. I then facilitated genotyping of this repeat with the new (at the time) computational method, ExpansionHunter, on a population of PCR-free WGS samples. The samples come from multiple sequencing consortia, and use of ExpansionHunter was guided by the developers of the tool, Dr. Michael Eberle and Dr. Egor Dolzhenko.
Processing of population samples was performed by analysts within the respective consortia. During the mechanistic investigation of the impact of the repeat expansion, I played a central role in interpreting molecular assays and defining the role of the repeat in a non-methylation-mediated form of gene repression. The wet lab work to confirm the repeat expansion and two missense variants was performed primarily by Dr. Britt Drogemoller and members of Dr. Karen Usdin’s group. Validation of the impact of missense variants was performed by members of Dr. Andre van Kuilenburg’s group. The wet lab work for identifying the mechanism of action for the repeat expansion was performed by groups led by Dr. Andre van Kuilenburg, Dr. Mahmoud Pouladi, and Dr. Karen Usdin. As I did not perform any of the wet lab work presented in this chapter, the details of the molecular assays can be found in the original publication, and the work presented in this chapter focuses on the bioinformatic analysis of WES/WGS data. I wrote the genome analysis portions of the manuscript, and co-wrote the manuscript with Dr. Antoine van Kuilenburg, Dr. Maja Tarailo-Graovac, Dr. Karen Usdin, and Dr. Clara van Karnebeek. I was also responsible for the creation of all the figures presented in this thesis section, although their final formatting comes from the NEJM editorial staff. An online version of this work can be found at https://www.nejm.org/doi/10.1056/NEJMoa1806627. Text and figures are copied from “Glutaminase Deficiency Caused by Short Tandem Repeat Expansion in GLS.” André B.P. van Kuilenburg, Maja Tarailo-Graovac, Phillip A. Richmond, Britt I. Drögemöller, et al. 380:1433-1441. Copyright © (2020) Massachusetts Medical Society. Reprinted with permission.

Patients enrolled in this study were consented under the Treatable Intellectual Disability Endeavour (TIDE) protocol, with REB approval number H12-00067, and sub-study