Recalibrating the Genome-Wide Association Null Hypothesis
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/597484; this version posted October 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1 1 Recalibrating the Genome-wide Association Null Hypothesis 2 Enables the Detection of Core Genes Underlying Complex 3 Traits 4 1,2 1,2 2-4 5 Wei Cheng , Sohini Ramachandran y, and Lorin Crawford y 6 1 Department of Ecology and Evolutionary Biology, Brown University, Providence, RI, 7 USA 8 2 Center for Computational Molecular Biology, Brown University, Providence, RI, USA 9 3 Department of Biostatistics, Brown University, Providence, RI, USA 10 4 Center for Statistical Sciences, Brown University, Providence, RI, USA 11 E-mail: [email protected]; lorin [email protected] y 12 Abstract 13 Traditional univariate genome-wide association studies generate false positives and negatives due to 14 difficulties distinguishing causal variants from \interactive" variants (i.e., variants correlated with causal 15 variants without directly influencing the trait). Recent efforts have been directed at identifying gene or 16 pathway associations, but these can be computationally costly and hampered by strict model assumptions. 17 Here, we present gene-", a new approach for identifying statistical associations between sets of variants 18 and quantitative traits. Our key innovation is a recalibration of the genome-wide null hypothesis to treat 19 small-yet-nonzero associations emitted by interactive variants as non-causal. gene-" efficiently identifies 20 core genes under a variety of simulated genetic architectures, achieving greater than a 90% true positive 21 rate at 1% false positive rate for polygenic traits. Lastly, we apply gene-" to summary statistics derived 22 from six quantitative traits using European-ancestry individuals in the UK Biobank, and identify genes 23 that are enriched in biological relevant pathways. 24 Introduction 25 Over the last decade, there has been considerable debate surrounding whether genome-wide single- 26 nucleotide polymorphism (SNP) genotype data can offer insight into the genetic architecture of com- 27 plex traits [1{5]. In the traditional genome-wide association (GWA) framework, individual SNPs are 28 tested independently for association with a trait of interest. While this approach has drawbacks [2,3,6], 29 more recent approaches that combine SNPs within a region have gained power to detect biologically 30 relevant genes and pathways enriched for correlations with complex traits [7{14]. Reconciling these two 31 observations is crucial for biomedical genomics. 32 In the traditional GWA model, each SNP is assumed to either (i) directly influence (or perfectly 33 tag a variant that directly influences) the trait of interest; or (ii) have no affect on the trait at all (see 34 Fig. 1a). Throughout this manuscript, for simplicity, we refer to SNPs under the former as \causal" and 35 those under latter as \non-causal". These classifications are based on ordinary least squares (OLS) effect 36 size estimates βj for each j-th SNP in a regression framework, where the null hypothesis assumes no 37 association for the true effects of non-causal SNPs (H0 : βj = 0). The traditional GWA model is agnostic 38 to trait architecture,b and is underpowered with a high false-positive rate for \polygenic" traits or traits 39 which are generated by many mutations of small effect [5, 15{17]. 40 Suppose that in truth each SNP in a GWA dataset instead belongs to one of three categories: (i) 41 causal or truly associated; (ii) statistically associated with the trait but not directly influencing it, which 42 we refer to as \interactive"; and (iii) non-causal (Fig. 1b) [18]. Equivalently, these three categories can be bioRxiv preprint doi: https://doi.org/10.1101/597484; this version posted October 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 2 43 defined using the relationship between the true SNP effects β and the observed effect size estimates β:(i) 44 for causal SNPs, the true effects β = 0 and the observed effect size estimates β = 0; (ii) for \interactive" 6 6 45 SNPs, β = 0 or β 0, and β = 0; and (iii) for non-causal SNPs, β = 0 and β = 0. More specifically,b ≈ 6 46 causal SNPs lie in core genes that directly influence the trait of interest. Interactiveb SNPs then follow 47 one of two definitions. Theyb are either non-causal SNPs that are statisticallyb associated with the trait 48 due to some varying degree of linkage disequilibrium (LD) with causal SNPs (β = 0); or they are SNPs 49 that are not directly causal, but are located within a gene that covaries with a core gene (β 0) [19]. ≈ 50 In either setting, interactive SNPs will emit intermediate statistical noise (in some cases, even appearing 51 indistinguishable from causal SNPs), thereby confounding traditional GWA tests (Fig. 1b). Hereafter, 52 we refer to this noise as \epsilon-genic effects” (denoted in shorthand as \"-genic effects”). There is a 53 need for a computational framework that has the ability to identify mutations associated with a wide 54 range of traits, regardless of whether heritability is sparsely or uniformly distributed across the genome. 55 Here, we develop a new and scalable quantitative approach for testing aggregated sets of SNP-level 56 GWA summary statistics. In practice, our approach can be applied to any user-specified set of genomic 57 regions, such as regulatory elements, intergenic regions, or gene sets. In this study, for simplicity, we refer 58 to our method as a gene-level test (i.e., an annotated collection of SNPs within the boundary of a gene). 59 The key contribution for our approaches is that gene-level association tests should treat interactive SNPs 60 with "-genic effects as non-causal. Conceptually, this requires assessing whether a given SNP explains 2 61 more than an \epsilon" proportion of narrow-sense heritability, h . In this generalized model, we assume 62 that true SNP-level effects for complex traits are jointly distributed as 2 63 β (0; Ω); Ω = σ ; Ω = σ ρ(x ; x )σ ; (1) ∼ N jj j jl j j l l 1=2 1=2 2 2 64 where Ω = H ΣH is a J J variance-covariance matrix, and H = diag(σ1; : : : ; σJ ) is a J J diagonal 2 × 2 × 65 matrix with elements σj denoting the proportion of narrow-sense heritability h contributed by the j-th 2 2 66 SNP (with the constraint σj = h ). Under our framework, we recalibrate the GWA null hypothesis 67 to assume approximately no association for interactive and non-causal SNPs, H0 : βj 0. This can be P 2 2 ≈ 2 68 formally restated using the variance-component-based null hypothesis, H : σ σ , where σ represents 0 j ≤ " " 69 the maximum proportion-of-variance-explained (PVE) by an interactive or non-causal SNP (Fig. 1b). 2 2 70 Non-core genes are then defined as genes that only contain SNPs with "-genic effects (i.e. 0 σ σ ≤ j ≤ " 71 for every j-th SNP within that region). Core genes, on the other hand, are genes that contain at least 2 2 72 one causal SNP (i.e. σj > σ" for at least one SNP j within that region). By modeling "-genic effects 2 73 (i.e., different values of σ" for interactive SNPs), our approach flexibly constructs an appropriate null 74 hypothesis for a wide range of traits with genetic architectures that land anywhere on the polygenic 75 spectrum (see Methods). 76 We refer to our gene-level association framework as \gene-"" (pronounced \genie"). gene-" lowers 77 false positive rates and increases power for identifying gene-level associations from GWA studies via 78 two key conceptual insights. First, gene-" regularizes observed (and inflated) GWA summary statistics 79 so that SNP-level effect size estimates are positively correlated with the assumed generative model of 80 complex traits. Second, it examines the distribution of regularized effect sizes to determine the "-genic 2 81 threshold σ" of interactive and non-causal SNPs. This makes for an improved hypothesis testing strategy 82 for identifying core genes associated with complex traits. With detailed simulations, we assess the power 83 of gene-" to identify core genes under a variety of genetic architectures, and compare its performance 84 against multiple competing approaches [7,10,12,14,20]. We also apply gene-" to the SNP-level summary 85 statistics of six quantitative traits assayed in individuals of European ancestry from the UK Biobank [21]. bioRxiv preprint doi: https://doi.org/10.1101/597484; this version posted October 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 3 86 Results 87 Overview of gene-" 88 The gene-" framework requires two inputs: GWA SNP-level effect size estimates, and an empirical linkage 89 disequilibrium (LD, or variance-covariance) matrix. The LD matrix can be estimated directly from 90 genotype data, or from an ancestry-matched set of samples if genotype data are not available to the 2 91 user. We use these inputs to both estimate gene-level contributions to narrow-sense heritability h , and 92 perform gene-level association tests.