Copyright 2014 by Christopher K. Fuller

For Kai and Lena

Acknowledgements

This work would not have been possible without the support of my research advisor Hao Li. His broad base of interests gave me the freedom to search for something that so well matches my inclinations. I appreciate his insight, enthusiasm, skepticism, and overall encouragement of this effort.

I wish to thank my committee, Saunak Sen and Kathleen Giacomini, for their suggestions and support. I wish to thank postdoctoral scholar Xin He for catalyzing the research that became the focus of my work. In addition, I wish to thank postdoctoral scholar Jiashun Zheng for his numerous suggestions and support.

I wish to thank the thousands of anonymous individuals who make integrative genomics research possible by providing samples and the researchers dedicated to making these widely available. In addition, I wish to thank the MuTHER and DIAGRAM consortia for providing access to their full summary results.

I wish to thank my wife, Sharoni, for her support and for permitting me to encroach on her domain of expertise. You are still the real biologist in the family.

Finally, I wish to thank my Mom and Dad ... who took me to the library.

Abstract

Genome-wide association studies (GWAS) have linked various complex diseases to many dozens, sometimes hundreds, of individual genomic loci. Since these are generally of small effect and may lack both functional annotations and an obvious relation to other disease-associated regions, they are difficult to place in a functional context that advances our understanding of the disease. When considered only in isolation, disparate collections of trait associations provide little mechanistic insight. Thus, there is a pressing need for computational methods that use genome-wide molecular data to aggregate GWAS associations and extract functional insight from seemingly unrelated SNPs.
To this end, we recently introduced Sherlock, a Bayesian method that detects gene-disease associations through pattern matching between eQTL results and the full set of GWAS loci (He et al., 2013). Here we review the Bayesian formulation and present Empirical Sherlock, a robust, parameter-free approach to detecting associations between various molecular functions and GWAS traits. It uses an empirically derived null distribution to associate subsets of GWAS loci, grouped by their relevance to a particular molecular function, with the GWAS trait. The method is easily generalizable to almost any genome-wide functional characterization (e.g., DNA methylation or transcription factor binding), in addition to eQTL. By avoiding null distributions that assume a particular theoretical form for the input GWAS, Empirical Sherlock yields results that are resistant to false discovery due to inflation of the output p-values. The core method requires no tunable parameters or prior probabilities and, when used with eQTL, permits an adjustment for pleiotropic SNPs that control the expression of many genes. In addition, Empirical Sherlock calculates significance directly, avoiding the inherent limitations of permutation-based tests. As we demonstrate using moderately powered GWAS for Crohn's disease, type 2 diabetes, and schizophrenia, it detects gene associations that are either validated through better-powered GWAS or supported by independent literature.

Contents

1 Introduction
2 Bayesian Sherlock Method
    2.1 Model Derivation
    2.2 Model Behavior and Parameter Estimation
    2.3 Statistical Significance of Gene Bayes Factors
    2.4 Summary of Results: Crohn's Disease
    2.5 Summary of Results: Type 2 Diabetes
3 Empirical Sherlock Method
    3.1 Method Overview
    3.2 Identifying Tag SNPs for Independent Blocks
    3.3 Core Statistical Method
    3.4 Correcting for Pleiotropic and Sampling Effects
    3.5 Avoiding Arbitrary Thresholding of Functional Data
    3.6 Performance Verification
4 Empirical Sherlock Analyses Using Multi-tissue eQTL
    4.1 Crohn's Disease
    4.2 Type 2 Diabetes
    4.3 Schizophrenia
5 Discussion
A Null Distribution Under Uniform Assumption
B Supporting SNPs for Top Crohn's Results
C Supporting SNPs for Top T2D Results
D Supporting SNPs for Top Schizophrenia Results
Bibliography

Chapter 1

Introduction

The genetic underpinnings of complex disease have proven resistant to easy mapping using only the common polymorphisms directly implicated through genome-wide association studies (GWAS). Typically, such studies reveal large numbers of small-effect loci that, in aggregate, explain only a small fraction of complex disease heritability. This discrepancy has generated considerable debate concerning the specific mechanisms of polygenic trait inheritance and the overall utility of the GWAS paradigm [1, 2, 3]. By assembling ever-larger disease cohorts, the power to resolve small effects through association studies improves, but this does not necessarily translate into superior mechanistic insight. The functional consequence of a given locus is often non-obvious, especially for the many associations located in intergenic regions. Even when accurate annotations are available, the challenge of assembling many dozens or hundreds of associated loci into a coherent picture of disease etiology is daunting.
The ability to reliably map large numbers of associated loci onto a smaller number of functional elements constitutes a key component of a pipeline to convert disparate associations into meaningful disease insight. For these reasons, there is much interest in mining GWAS results using a diverse array of genome-wide indicators of functional relevance. These include assays for chromatin structure, DNA methylation, transcription factor binding activity, and gene expression regulation, as made publicly available through the NHGRI's Encyclopedia of DNA Elements (ENCODE) project and the NIH-funded Genotype-Tissue Expression (GTEx) program, among others [4, 5]. Taken together, they form a dictionary of genomic loci in terms of distinct but related molecular phenotypes that can be used to understand the key drivers of complex diseases and other traits. In typical use, however, these molecular definitions provide clues only one-by-one for the top GWAS SNPs (those near or above genome-wide significance).

Yet, to be especially useful, improved approaches for GWAS data mining must reflect both the need to gain functional insight and the need to aggregate many SNPs to explain a larger fraction of the inherited phenotypic variance. Although the degree to which missing heritability is captured by rare genetic variants, higher-order genetic interactions (epistasis), or small effects across many common SNPs is the subject of active debate, approaches that mine small effects from SNPs below genome-wide significance are highly attractive, since data is readily available from existing studies [2, 6, 7].

Several lines of evidence suggest that much additional information can be gleaned from common SNPs that fall below genome-wide statistical significance. First, the number of statistically significant small-effect loci associated with a given trait is limited primarily by the power of the studies involved, not necessarily by some intrinsic property of biological network architecture.
An analysis of the current NHGRI GWAS Catalog reveals an exponential distribution of reported effect sizes, with a few large-effect loci interspersed with large numbers of small effects [8]. Across all studies, the median odds ratio (OR) is well below 1.3 (Fig. 1.1). With current and planned studies routinely targeting many tens of thousands of individuals in disease cohorts, the number of small-effect loci that reach genome-wide significance will only increase.

Figure 1.1: Histogram of odds ratios from all entries in the GWAS catalog through 2012. Large effects are rare, while the number of small effects discovered relates largely to the statistical power of the GWAS studies undertaken.

Second, work in model systems indicates that an accumulation of small effects can explain most heritability for some traits. For example, association studies performed in a panel of yeast segregants may achieve much higher statistical power than typical human GWAS. In a study of 46 traits, the inclusion of small-effect loci enabled between 72% and 100% of the narrow-sense (additive) heritability to be explained [9]. Although epistasis appeared to contribute to phenotypic variation for many traits, additive effects constituted the dominant component of broad-sense heritability in most cases. For highly additive traits, it was possible to predict trait values using genotypes, but only when very small-effect loci were included.

Lastly, recent efforts have demonstrated that the phenotypic variance explained by all SNPs in human GWAS, when considered in aggregate, can be significantly higher than that of just the top SNPs and may approach the heritability estimates of twin and pedigree studies [3]. For example, the heritability of human height is approximately 0.8, while the phenotypic variance accounted for by 180 top