Whole-Exome Sequencing in Rare Diseases and Complex Traits
Whole-exome Sequencing in Rare Diseases and Complex Traits: Analysis and Interpretation

by

Xiaolin Zhu

University Program in Genetics and Genomics
Duke University

Approved:
David B. Goldstein, Supervisor
Sandeep S. Dave, Chair
Andrew S. Allen
Yong-Hui Jiang
Douglas A. Marchuk
Vandana Shashi

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the University Program in Genetics and Genomics in the Graduate School of Duke University

2017

Copyright by Xiaolin Zhu 2017

Abstract

Next-generation sequencing (NGS), including whole-exome sequencing (WES) and whole-genome sequencing (WGS), has dramatically empowered the human genetic analysis of disease. This is clearly demonstrated by the exponential increase in the number of newly identified disease-causing genes since the early applications of WES to study human diseases [1].
In particular, WES has been extremely successful in determining the causal mutations of sporadic rare disorders that are intractable to genetic mapping due to a lack of informative pedigrees, a large proportion of which were later shown to be caused by de novo mutations (DNMs). At the same time, WES and WGS studies are increasingly being performed at population scale to understand the genetic basis of complex diseases and traits. The discovery power of WES and WGS is expected to increase further as more patients and healthy individuals are sequenced with more powerful and less expensive sequencing technologies. Meanwhile, clinical WES and WGS are playing an increasingly important role in guiding physicians toward accurate diagnoses and treatments informed by genetic findings.

WES and WGS generate a massive amount of sequence data that includes both signal and noise, posing challenges that scientists and clinicians must overcome to fully realize the discovery and diagnostic power of NGS. Despite the many new disease-associated genes identified to date, the methodologies used to analyze sequence data vary across research groups, and there seems to be no standardized analysis method comparable to those used in genome-wide association studies. Indeed, sequence data can be analyzed in different ways, and the approach taken can depend on the hypothesis about the underlying genetic architecture of the phenotype, which is often difficult to know precisely. Herein we explored and evaluated two primary approaches to analyzing and interpreting WES data in different contexts.
The first is a trio interpretation framework based on variant/genotype filtering and bioinformatic signatures of genes and variants, which can be used to identify the highly penetrant causal mutations responsible for rare genetic conditions given known genotype-phenotype correlations and, at the same time, indicate the presence of novel disease-causing genes. The second is a formal statistical genetics framework that compares the burden of rare, high-impact variants gene by gene in case and control subjects, which can be used to identify novel disease-causing genes for both Mendelian and complex traits. We introduced each of these two approaches and provided examples of their implementation.

In Chapter 2, we presented our trio interpretation framework and sought to determine the genetic diagnoses of undiagnosed genetic conditions using WES. By further developing a trio interpretation framework reported in our pilot study [2] and applying it to 119 trios (affected children and their unaffected biological parents), we achieved a diagnostic rate of 24% based on the genotype-phenotype correlations known when the study was performed. Furthermore, we sought to extract bioinformatic signatures of causal variants and genes and showed that these signatures clearly pointed toward the pathogenic variants in novel disease-causing genes, some of which were later confirmed by larger, independent studies. Our study highlights the importance of regularly reanalyzing clinical exomes to achieve the best diagnostic rate of WES by leveraging the latest databases of population-scale sequence data (e.g., EVS and ExAC) and genotype-phenotype correlations (e.g., OMIM and ClinVar), as well as advancements in analytical methods. In documenting the individual cases resolved by trio WES, we also discussed technical details relevant to how each genetic diagnosis was achieved.
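The genotype-filtering step of a trio analysis can be illustrated with a minimal sketch. It is not the framework used in this dissertation: the `TrioVariant` record, its field names, the gene labels, and the 0.001 allele-frequency cutoff are all assumptions chosen for illustration. Genotypes are coded as the number of alternate alleles (0, 1, or 2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrioVariant:
    """One variant observed in a trio (hypothetical record layout)."""
    gene: str
    child_gt: int    # alternate-allele count in the affected child: 0, 1, or 2
    mother_gt: int   # alternate-allele count in the mother
    father_gt: int   # alternate-allele count in the father
    pop_af: float    # allele frequency in a population reference database


def classify(v, max_af=0.001):
    """Label a rare variant by a simple inheritance pattern, or return None."""
    if v.pop_af > max_af:
        return None  # too common to be a highly penetrant causal mutation
    if v.child_gt >= 1 and v.mother_gt == 0 and v.father_gt == 0:
        return "de_novo"
    if v.child_gt == 2 and v.mother_gt == 1 and v.father_gt == 1:
        return "homozygous_recessive"
    return None


def compound_het_genes(variants, max_af=0.001):
    """Genes in which the child carries one rare heterozygous variant
    inherited only from the mother and another inherited only from the father."""
    from_mother, from_father = set(), set()
    for v in variants:
        if v.pop_af > max_af or v.child_gt != 1:
            continue
        if v.mother_gt == 1 and v.father_gt == 0:
            from_mother.add(v.gene)
        elif v.father_gt == 1 and v.mother_gt == 0:
            from_father.add(v.gene)
    return from_mother & from_father


if __name__ == "__main__":
    trio = [
        TrioVariant("GENE_A", 1, 0, 0, 0.0),
        TrioVariant("GENE_B", 1, 1, 0, 0.0002),
        TrioVariant("GENE_B", 1, 0, 1, 0.0001),
    ]
    print([classify(v) for v in trio])
    print(compound_het_genes(trio))
```

In practice, candidates that pass such filters would still need to be weighed against known genotype-phenotype correlations and variant-level annotations before any diagnosis is considered.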
In Chapter 3, we presented our case-control gene-based collapsing framework, which focuses on ultra-rare variants predicted to have a high functional impact, and sought to identify disease-causing genes responsible for two neurodevelopmental disorders: epileptic encephalopathy (EE) and periventricular nodular heterotopia (PVNH), both rare disorders presumably caused by highly penetrant mutations and characterized by genetic heterogeneity. Notably, both conditions had been subjected to DNM analyses by us or other groups, which were very successful and implicated novel causal genes for EE. We showed that our genetic association analysis framework successfully identified EE genes originally implicated by DNM analyses, as well as a novel PVNH gene that would not have been identified by a DNM analysis. These results empirically highlight the power of case-control gene-based collapsing analysis in identifying disease genes in rare neurodevelopmental disorders, in particular when we have an educated guess about the genetic architecture of the phenotype.

In the third project, we extended our gene-based collapsing analysis to quantitative traits and sought to understand the genetic basis of a newly defined quantitative bone microstructure phenotype that was shown to be strongly associated with ethnicity in White and Asian women. We have a special interest in this presumably complex trait mainly for two reasons: it can be a potential “endophenotype” for important bone diseases including osteoporosis, and its strong association with ethnicity indicates genetic control. In this pilot study, we generated WES data on a small number of White and Asian women and applied our gene-based collapsing analysis framework to identify the genes associated with this quantitative trait.
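The case-control form of gene-based collapsing described above can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's implementation: the abstract does not name the statistical test, so a two-sided Fisher's exact test is assumed here, and the function names, gene labels, and sample IDs are hypothetical. Each subject is collapsed to a binary indicator per gene (carries at least one qualifying variant or not), and carrier counts are compared between cases and controls.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers among the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def collapse(qualifying, cases, controls):
    """Test each gene for a difference in carrier frequency.

    qualifying: dict mapping gene -> set of sample IDs that carry at least
                one qualifying variant in that gene (defined upstream).
    cases, controls: sets of sample IDs.
    Returns a dict mapping gene -> p-value.
    """
    results = {}
    for gene, carriers in qualifying.items():
        case_carriers = len(carriers & cases)
        control_carriers = len(carriers & controls)
        results[gene] = fisher_exact_two_sided(
            case_carriers, len(cases) - case_carriers,
            control_carriers, len(controls) - control_carriers,
        )
    return results


if __name__ == "__main__":
    cases = {"c1", "c2", "c3"}
    controls = {"k1", "k2", "k3"}
    qualifying = {"GENE_A": {"c1", "c2", "c3"}, "GENE_B": {"c1", "k1"}}
    for gene, p in collapse(qualifying, cases, controls).items():
        print(gene, round(p, 3))
```

Because every gene is tested, a real analysis would also need to correct for multiple testing across the exome. For a quantitative trait, the binary carrier indicator could instead be used as a predictor of the trait value, which is a different test from the one sketched here.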
A larger sample size would likely yield more definitive genetic discoveries, but our preliminary findings suggest that using WES to study the genetics of such “endophenotypes” and other novel clinically relevant phenotypes can identify causal variants that have a high functional impact and deliver novel biological insights.

In conclusion, we introduced a trio WES interpretation framework to identify the genetic diagnoses of unresolved rare genetic diseases and a gene-based collapsing analysis framework focusing on rare, functionally impactful variants to identify genes associated with dichotomous and quantitative traits, and evaluated these methods in multiple human genetic studies. The power of these methods will be boosted by their application to novel phenotypes and by the accruing WES and WGS data generated from both patients and healthy individuals.

Contents

Abstract
List of Tables
List of Figures
Acknowledgements
1. Introduction
1.1 Next-generation Sequencing
1.1.1 WES and WGS
1.1.2 Bioinformatic processing of WES and WGS data
1.2 Human Genome Variation
1.2.1 The human genome and its variation
1.2.2 HapMap
1.2.3 Population-scale sequence data
1.2.3.1 1000 Genomes