Linkage Disequilibrium and Heterozygosity Modulate the Genetic Architecture of Human Complex Phenotypes

bioRxiv preprint doi: https://doi.org/10.1101/705285; this version posted August 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Linkage Disequilibrium and Heterozygosity Modulate the Genetic Architecture of Human Complex Phenotypes Dominic Hollanda,b,∗, Oleksandr Freid, Rahul Desikanc, Chun-Chieh Fana,e,f, Alexey A. Shadrind, Olav B. Smelanda,d,g, Ole A. Andreassend,g, Anders M. Dalea,b,f,h aCenter for Multimodal Imaging and Genetics, University of California at San Diego, La Jolla, CA 92037, USA, bDepartment of Neurosciences, University of California, San Diego, La Jolla, CA 92093, USA, cDepartment of Radiology, University of California, San Francisco, San Francisco, CA 94158, USA, dNORMENT, KG Jebsen Centre for Psychosis Research, Institute of Clinical Medicine, University of Oslo 0424 Oslo, Norway, eDepartment of Cognitive Sciences, University of California at San Diego, La Jolla, CA 92093, USA, fDepartment of Radiology, University of California, San Diego, La Jolla, CA 92093, USA, gDivision of Mental Health and Addiction, Oslo University Hospital, 0407 Oslo, Norway, hDepartment of Psychiatry, University of California, San Diego, La Jolla, CA 92093, USA, Abstract We propose an extended Gaussian mixture model for the distribution of causal effects of common single nucleotide polymorphisms (SNPs) for human complex phenotypes, taking into account linkage disequilibrium (LD) and heterozygosity (H), while also allowing for independent components for small and large effects. Using a precise methodology showing how genome-wide association studies (GWAS) summary statistics (z-scores) arise through LD with underlying causal SNPs, we applied the model to multiple GWAS. Our findings indicated that causal effects are distributed with dependence on a SNP’s total LD and H, whereby SNPs with lower total LD are more likely to be causal, and causal SNPs with lower H tend to have larger effects, consistent with models of the influence of negative pressure from natural selection. The degree of dependence, however, varies markedly across phenotypes. Keywords: GWAS, Polygenicity, Discoverability, Heritability, Causal SNPs, Effect size, Linkage disequilibrium, Minor allele frequency, Natural selection INTRODUCTION rare events [11]. Natural selection partly determines how the prevalence of a variant develops over time in a pop- There is currently great interest in the distribution of causal ulation. Selection pressure can be assessed by measuring effects among trait-associated single nucleotide polymor- the extent of long-range linkage disequilibrium (LD), given phisms (SNPs), and recent analyses of genome-wide as- heterozygosity (H≡MAF×(1-MAF)) and local recombina- sociation studies (GWAS) have begun uncovering deeper tion rates, but genetic forces such as genetic drift compli- layers of complexity in the genetic architecture of complex cate this. The signature of positive selection is the rapid traits [24, 16, 17, 65, 66]. This research is facilitated by rise in the frequency of an allele; in general, this trans- using new analytic approaches to interrogate structural lates in the short run to alleles becoming more common features in the genome and their relationship to pheno- while their LD with neighboring SNPs have not signifi- typic expression. Some of these analyses take into account cantly decayed through recombination; in the long run, the fact that different classes of SNPs have different char- much of the signature will become lost over many genera- acteristics and play a multitude of roles [47, 48, 15]. Along tions as long-range LD truncates [49, 42], but there is also a with different causal roles for SNPs, which in itself would countervailing effect whereby LD with other selected vari- suggest differences in distributions of effect-sizes for differ- ants builds up over time [37, 2]. SNPs associated with a ent categories of causal SNPs, the effects of minor allele trait that is not fitness-related can increase in frequency if frequency (MAF) of the causal SNPs and their total cor- also pleiotropically associated with a fitness-related trait. relation with neighboring SNPs are providing new insights SNPs can also become common through genetic hitchhik- into the action of selection on the genetic architecture of ing, i.e., through LD with a selected SNP. complex traits [17, 62, 66]. Negative selection acts predominantly to keep variants There are more ways of being wrong than being right: deleterious to fitness at low frequency (or ultimately re- any given mutation is likely to be neutral or deleterious to move them), a process that is facilitated by higher recom- fitness [14]. Indeed, mutations advantageous to fitness are bination rates; however, interference of mutations and genetic drift, for example, can lead to weakly deleterious ∗Corresponding author: variants becoming common or even fixed [21, 32]. Since email: [email protected] weakly deleterious variants will take a while to be removed, Phone: 858-822-1776 many extant recent variants are likely to be deleterious. In Fax: 858-534-1078 August 8, 2020 bioRxiv preprint doi: https://doi.org/10.1101/705285; this version posted August 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. addition, the larger the effect of a deleterious variant the that indicate the strength of association between causal more efficient negative selection will be, suggesting that variants and a phenotype). Due to extensive and complex the lower the MAF the larger the effect size – and this is patterns of LD among SNPs, many non-causal SNPs will expected under evolutionary models [40, 13] and consistent exhibit a strong association with phenotypes, resulting in with empirical findings [38]. So when negative selection is a far more complicated distribution for the summary z- operating, we should expect to find more and more effects scores. The basic model for the distribution of the causal with larger and larger effect size at lower LD score (or total β’s is a mixture of non-null and null normal distributions, LD) and lower heterozygosity. the latter (denoted N (0, 0)) being just a delta function (or Several recent reports focused on individual aspects of point mass at zero): genetic architecture and their potential roles in natural se- β ∼ π N (0, σ2 ) + (1 − π )N (0, 0) (1) lection, or excluded important quantities like polygenicity 1 β 1 (the fraction of all SNPs – usually given in terms of a ref- where π1 is the polygenicity, i.e., the proportion of SNPs erence panel – that are causal for a particular trait), but that are causal (equivalently, the prior probability that 2 it is important to treat disparate (and overlapping) archi- any particular SNP is causal), and σβ is the discoverabil- tectural features simultaneously to avoid over-interpreting ity, i.e., the variance of the non-null normal distribution the role of any individual feature and more precisely pre- (the expectation of the square of the effect size), which dict quantities of interest like heritability. was taken to be a constant across all causal SNPs. The Here we present a unifying approach, using Bayesian distribution of z-scores arising from this shows strong het- analysis on GWAS summary statistics and maximum like- erozygosity and TLD dependence (Eqs. 21 and 29 in [25]). lihood estimation of model parameters, to explore the role Correctly implementing this relatively simple model (Eq. of total LD (TLD) on the distribution of causal SNPs, and 1) and applying it to GWAS summary statistics involved the impact of MAF on causal effect size, while simultane- a good deal of complexity, but resulted in remarkably ac- ously incorporating multiple effect-size distributions (TLD curate estimation of the overall distribution of z-scores, as used here and previously [25, 16] is similar to LD score [4] demonstrated by the fit between the actual distribution but takes into account more neighboring SNPs; it is de- of z-scores for multiple phenotypes and the model predic- fined in the Methods section). Treating each of these fac- tions, visualized on quantile-quantile (QQ) plots. As noted tors independently is likely to provide a misleading assess- above, the more accurate the fit the more correct are the ment. Thus, we posit a model for the prior distribution of values of the physically meaningful parameters and de- effect sizes, and building on previous work [24], fit it to the rived quantities like heritability; additionally, these values GWAS summary statistics for a wide range of phenotypes. help distinguish the genetic architectures of different phe- From the estimated set of model parameters for a pheno- notypes, and enable prediction, for example of sample size type, we generate simulated genotypes and calculate the required to detect the preponderance of causal SNPs. corresponding z-scores. With a wide range of model pa- Given that GWAS, due at least to lack of power, con- rameter values across real phenotypes, the specificity of the sistently failed to discover the SNPs that explain the bulk individual parameters for a given phenotype make them of heritability, the objective of the model was to character- narrowly defining of the distribution of summary statistics ize the likely large number of causal SNPs of weak signal for that phenotype; since each parameter is physically mo- that evaded detection – and likely would explain missing tivated, they allow for direct interpretation, and the more or hidden heritability. Thus, the model was not intended accurate the model fit the more accurate are estimates of to accurately predict the tails of the distribution of z- physical quantities based on the parameter values.

Linkage Disequilibrium and Heterozygosity Modulate the Genetic Architecture of Human Complex Phenotypes

Linkage Disequilibrium in Human Ribosomal Genes: Implications for Multigene Family Evolution

The Effects of Natural Selection on Linkage Disequilibrium and Relative Fitness in Experimental Populations of Drosophila Melanogaster

Fisher's Fundamental Theorem of Natural Selection Steven A

The Role of Linkage Disequilibrium in the Evolution of Premating Isolation

Linkage Disequilibrium and Its Expectation in Human Populations

Prediction of Identity by Descent Probabilities from Marker-Haplotypes Theo Meuwissen, Mike Goddard

Evolution at Multiple Loci

Design and Analysis of Genetic Association Studies

Linkage Disequilibrium Fine Mapping of Quantitative Trait Loci

Unbiased Estimation of Linkage Disequilibrium from Unphased Data

Genetic Diversity and Fitness in Small Populations of Partially Asexual, Self

Using the Variability of Linkage Disequilibrium Between Subpopulations to Infer Sweeps and Epistatic Selection in a Diverse Panel of Chickens