medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 1
Title: The missing link between genetic association and regulatory function
Noah Connally1,2,3, Sumaiya Nazeen1,2,4, Daniel Lee1,2,3, Huwenbo Shi3,5, John
Stamatoyannopoulos6,7, Sung Chun8, Chris Cotsapas3,9,10*, Christopher Cassa1,3*,
Shamil Sunyaev1,2,3*
1 Brigham and Women’s Hospital, Division of Genetics, Harvard Medical School,
Boston, MA, USA
2 Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
3 Program in Medical and Population Genetics, Broad Institute of MIT and Harvard,
Cambridge, MA, USA
4 Brigham and Women’s Hospital, Department of Neurology, Harvard Medical School,
Boston, MA, USA
5 Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA,
USA
6 Altius Institute for Biomedical Sciences, Seattle, WA, USA
7 Department of Medicine, University of Washington School of Medicine, Seattle, WA,
USA
8 Division of Pulmonary Medicine, Boston Children’s Hospital, Boston, MA, USA
9 Department of Neurology, Yale Medical School, New Haven, CT, USA
10 Department of Genetics, Yale Medical School, New Haven, CT, USA
* Co-corresponding authors (C. Cotsapas: [email protected]; C. Cassa:
[email protected]; S. Sunyaev: [email protected])
NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice. medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 2
The genetic basis of most complex traits is highly polygenic and dominated by
non-coding alleles, and it is widely assumed that such alleles exert small
regulatory effects on the expression of cis-linked genes. However, despite
availability of expansive gene expression and epigenomic data sets, few
variant-to-gene links have emerged. We identified 139 genes in which
protein-coding variants cause severe or familial forms of nine human traits. We
then computed the association between common complex forms of the same
traits and non-coding variation, revealing that most such traits are also
associated with non-coding variation in the vicinity of the same genes. However,
we found colocalization evidence—the same variant influencing both the
physiological trait and gene expression—for only 7% of genes, and
transcriptome-wide association evidence with correct direction of effect for only
4% of genes, despite an abundance of eQTLs in most loci. Fine mapping variants
to regulatory elements and assigning these to genes by linear distance similarly
failed to implicate most genes in complex traits. These results contradict the
hypothesis that most complex trait-associated variants coincide with currently
ascertained expression quantitative trait loci. The field must confront this deficit,
and pursue the “missing regulation.”
Modern complex trait genetics has uncovered surprises at every turn, including the
paucity of associations between traits and coding variants of large effect, and the
“mystery of missing heritability,” where no combination of common and rare variants can
explain a large fraction of trait heritability1. Further work has revealed unexpectedly high medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 3
polygenicity for most human traits and very small effect sizes for individual variants.
Bulk enrichment analyses have demonstrated that a large fraction of heritability resides
in regions with gene regulatory potential, predominantly tissue-specific accessible
chromatin and enhancer elements, suggesting that trait-associated variants influence
gene regulation2–4. Furthermore, genes in trait-associated loci are more likely to have
genetic effects on their expression levels (expression QTLs, or eQTLs), and the variants
with the strongest trait associations are more likely also to be associated with transcript
abundance of at least one proximal gene5. Combined, these observations have led to
the inference that most trait-associated variants are eQTLs, exerting their effect on
phenotype by altering transcript abundance, rather than protein sequence. The
mechanism may involve a knock-on effect on gene regulation, with the variant altering
transcript abundances for genes elsewhere in the genome (a trans-eQTL), but the
consensus view is that this must be mediated by the variant influencing a gene in the
region (a cis-eQTL)6. As most eQTL studies profile cell populations or tissues from
healthy donors at homeostatic equilibrium, the further assumption has been tacitly made
that these trait-associated variants affect genes in cis under resting conditions.
Equivalent QTL analyses of exon usage data have revealed a more modest overlap
with trait-associated alleles, suggesting that a fraction of trait-associated variants
influence splicing, and hence the relative abundance of different transcript isoforms,
rather than overall expression levels. Thus, a model has emerged where most
trait-associated variants influence proximal gene regulation. medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 4
Several observations have challenged this basic model. One challenge comes from the
difference between spatial distributions of eQTLs, which are dramatically enriched in
close proximity of genes, and GWAS peaks, which are usually distal7. Another comes
from colocalization analyses, attempting to map shared genetic associations between
human traits and gene expression. If the model is correct, most trait associations should
also be eQTLs; trait and expression phenotype should thus share an association in that
locus (rather than two association peaks overlapping). However, only 5-40% of trait
associations co-localize with eQTLs in relevant tissues or cell types6,8–10, and only 15%
of genes colocalize with any of 74 different complex traits11. Finally, expression levels
mediated a minority of complex trait heritability12. This has led to the suggestion that
most trait-associated alleles influence gene regulation in a context-specific
manner13—either altering expression during development or in response to specific
physiological stimuli—or that they act indirectly in trans to affect the regulation of a small
number of genes involved in trait biology (the omnigenic model14,15). Without a set of
true positive cases, in which the gene driving trait variation is known, it remains difficult
to assess either the basic model or the proposed variations.
One source of true positives is to identify genes that are both in loci associated with a
complex trait and are also known to harbor coding mutations causing severe or early
onset forms of related traits (e.g. related Mendelian disorders). The strong expectation
is that a variant of small effect influences the gene identified in the severe form of the
trait. This expectation is supported by several lines of evidence. Comorbidity between
Mendelian and complex traits has been used to identify common variants associated medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 5
with the complex traits16. A handful of genes have been conclusively identified in both
Mendelian and complex forms of the same trait, including APOE, which is involved in
cholesterol metabolism17,18, and SNCA, which contributes to Parkinson’s disease risk.
Early genome-wide association studies (GWAS) found associations near genes
identified through familial studies of severe disease19,20, and more recent analyses have
found that GWAS associations are enriched in regions near causative genes for
cognate Mendelian traits in both blood traits8 and a diverse collection of 62 traits21.
To test the model that trait-associated variants influence baseline gene expression,
therefore, we assembled a list of such “putatively causative” genes. We selected nine
polygenic common traits with available large-scale GWAS data, each of which also has
an extreme form in which coding mutations of large effect size affect one or more genes
with well-characterized biology (Table 1). Our selection included four common diseases:
type II diabetes22, where early onset familial forms are caused by rare coding mutations
(insulin-independent MODY; neonatal diabetes; maternally inherited diabetes and
deafness; familial partial lipodystrophy); ulcerative colitis and Crohn disease23,24, which
have Mendelian pediatric forms characterized by severity of presentation; and breast
cancer25, where coding mutations in the germline (e.g. BRCA1) or somatic tissue (e.g.
PIK3CA) are sufficient for disease. We also chose five quantitative traits: low and high
density lipoprotein levels (LDL and HDL); systolic and diastolic blood pressure; and
height. We selected 139 genes harboring large-effect-size coding variants for one of the
nine phenotypes (Table 1). These genes were identified in familial studies, and, for
breast cancer, using the MutPanning method26. medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 6
We first examined whether these genes are more likely than chance to be in close
proximity harboring variants associated with the polygenic form of each trait. In
agreement with existing literature21, we observe a highly significant enrichment.
However, in well-powered GWAS, even relatively rare large-effect coding alleles
(mutations in BRCA1 which cause breast cancer, for instance) may be detectable as an
association to common variants. To account for this possibility, we computed association
statistics in each GWAS locus conditional on coding variants. We applied a direct
conditional test to datasets with available individual-level genotype data; for those
studies without available genotype data, we computed conditional associations from
summary statistics using COJO27. After controlling for coding variation, we still detected
a highly significant enrichment of our genes under GWAS peaks. Of our 139 genes, 89
(64%) fell within 1 Mb of a GWAS locus for the cognate complex trait. After
fine-mapping the GWAS associations in each locus using the SuSiE algorithm28, we
found that 23/139 (17%) putative causal genes are closer to the GWAS fine-mapped
SNPs (posterior inclusion probability > 0.7) than any other gene in the locus, as
measured from the transcription start site. Given their known causal roles in the severe
forms of each phenotype, we thus suggest that the 89 genes near GWAS signals are
likely to be the targets of trait-associated non-coding variants. For example, we see a
significant GWAS association between breast cancer risk and variants in the estrogen
receptor (ESR1) locus even after controlling for coding variation; the baseline
expression model would thus predict that non-coding risk alleles alter ESR1 expression
to drive breast cancer risk. medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 7
We next looked for evidence that the trait-associated variants were also altering the
expression of our 89 genes in relevant tissues. If these variants act through changes in
gene expression, phenotypic associations should be driven by the same variants as
eQTLs in relevant tissue types. We therefore looked for co-localization between our
GWAS signals and eQTLs in relevant tissues (Supplementary table 1) drawn from the
GTEx Project, using three well-documented methods: coloc10, JLIM9, and eCAVIAR29.
We found support for the colocalization of trait and eQTL association for only four
(coloc), six (JLIM), and three (eCAVIAR) of our 89 putatively causative genes, even
before correcting for multiple-hypothesis testing, which is not obviously better than
random chance. We note that our estimates of the number of putatively causative genes
with colocalization of eQTL and GWAS signal is conceptually distinct from and not
directly comparable to the existing estimates of the fraction of GWAS associations
colocalizing with eQTLs. This distinction matters because it illuminates the role of
eQTLs in known trait biology rather than examining the locus for the presence of a
colocalizing eQTL which may or may not be relevant to the complex trait.
A different way to identify potential causative genes under GWAS peaks using gene
expression is the transcriptome-wide association study design (TWAS)30–32. This
approach measures local genetic correlation between a complex trait and gene
expression. Though not designed to avoid correlation signals caused by LD33, the
approach has higher power than colocalization methods in cases of allelic heterogeneity
or poorly typed causative variants30. We used the FUSION implementation of TWAS, medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 8
which accounts for the possibility of multiple cis-eQTLs linked to the trait-associated
variant by jointly calling sets of genes predicted to include the causative gene, to
interrogate our 89 loci32.
FUSION included our putatively causative genes in the set of genes identified as likely
relevant to the GWAS peak in 42/89 (47%) loci. Genes were often identified as hits in
multiple tissues, but with an inconsistent direction of effect—that is, increased gene
expression correlated with an increase in the quantitative trait or disease risk in some
tissues, but a decrease in others. This may indicate that different tissues have relevant
genes that are different, but still called within the same joint set. Because of this
possibility, and the known biological role of many of our genes, we restricted our results
to tissues with established relevance to our traits. Only 9/89 (10%) genes were
identified by FUSION when we restricted the analysis to relevant tissues, and of these,
only five had a direction of effect on the complex trait consistent with what is known
from hypomorphic and amorphic Mendelian mutations. This fact, combined with the
inconsistent direction of effect across tissues, may indicate that even when putatively
causative genes fall within a set of genes jointly called by TWAS, their baseline
expression may not be mediating the association.
Our results so far are consistent with trait-associated variants altering the regulation of
causative genes in ways that are not well-represented by steady-state gene expression
measurements. We thus tried to find fine-mapped GWAS variants that appear in
regulatory sites within +/- 1 Mb windows around the transcription start sites (TSS) of our medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 9
putatively causative genes. We found that 73 fine-mapped variants with a high posterior
probability of association (PIP > 0.7) to a trait fall within a narrow peak of H3K27ac,
H3K4me1, or H3K4me3 chromatin modification features. Despite our 1 Mb window, all
identified features are located within a 100 kb window around the transcription starts
sites of 27/89 (30%) putatively causative genes (two of these genes, ATG16L1 and
CARD9, are putatively causative for both CD and UC). Extending our search to include
not only fine-mapped variants within chromatin modification features, but also those
within 500 bp of features, identifies only two additional putatively causative genes.
Restricting our analysis to chromatin features in relevant tissues, 46 fine-mapped
variants fall within chromatin features, corresponding to 24 putatively causative genes.
Combining activity and proximity signals, we evaluated an “activity-by-distance”
measure, a simplified version of the “activity-by-contact” method34. Activity-by-distance
uses linear distance along the genome instead of the chromatin contact frequency
between feature and TSS. Among the fine-mapped variants that fall inside chromatin
modification features, 17 variants appear in the feature with the highest
activity-by-distance score in the locus, corresponding to 11 genes.
Next, we relaxed the requirement of proximity to a specific feature and selected all
enhancer regions annotated by the ChromHMM35 method in any measured cell or tissue
type. Overall, within +/- 1 Mb windows of our putatively causative genes 120/335
fine-mapped variants fall in an enhancer region (i.e. enhancer, bivalent enhancer,
genetic enhancer) highlighted by ChromHMM’s core 15-state model. These enhancers medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 10
correspond to 43 putatively causative genes. Restricting our analysis to relevant
tissues, 51/335 fine-mapped variants fall in enhancers, corresponding to 26 putatively
causative genes.
In sum, we observe that fine-mapped variants appear near sites of regulatory
activity—suggested by the presence of activating chromatin marks—for a sizable
minority of our loci. However, 54/89 (61%) putatively causative genes, no fine-mapped
variants are associated with regulatory regions according to either chromatin marks or
ChromHMM. Furthermore, because we connect regulatory features to genes based
solely on proximity, it is possible that our finding of 35 genes represents an
over-estimate.
Overall, our results do not support the assertion that most common non-coding variants
associated with human traits alter baseline gene expression in trait-relevant tissues.
Several explanations may account for this: incorrect assumptions, lack of statistical
power, biological context, and alternative regulatory mechanisms. We discuss each
below.
Incorrect assumptions: it is possible that our putatively causative genes may simply not
be causative in complex trait forms. This would invalidate our underlying premise that
they should be targets of trait-associated variants in the common, complex forms of
phenotypes. This implies that in the vast majority of cases, a common variant
associated with the polygenic form of a trait near a gene known to cause a severe form medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 11
actually targets a different gene. For instance, the risk alleles driving the breast cancer
GWAS signal near BRCA2, do not alter BRCA2 expression in breast tissue, but instead
influence another gene. This would also explain why 42 putatively causal genes do not
fall near a GWAS peak. The implication is that the underlying biological causes of an
extreme phenotypic presentation are different from the causes of the polygenic form
across all nine of the traits we have studied. This, to our minds, stretches credulity given
the highly significant enrichment of our genes near significant GWAS loci for cognate
phenotypes. We suggest it is more likely that our putatively causative genes are
relevant but influenced in some other way by polygenic risk alleles. More parsimonious
explanations for the 42 genes are that currently available GWAS are incompletely
powered, and thus have not detected association with alleles in those loci; or that strong
purifying selection acting on noncoding regions of these genes is preventing noncoding
variants from reaching population frequencies detectable by GWAS.
Lack of statistical power: it is possible that complex trait GWAS are insufficiently
powered to allow accurate fine-mapping and hence accurate colocalization; that eQTL
studies do not detect all eQTLs; that epigenetic studies do not identify all elements; or
that colocalization and regulatory element mapping methods lack power to detect
overlaps. However, we have ascertained GWAS associations at genome-wide
significance, and fine-map the majority of these signals using a Bayesian approach; and
the GTEx Consortium eQTL studies have reached saturation for eGene discovery6. medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 12
The upper bound on the power of colocalization methods, under near-ideal
circumstances, is 66% at P < 0.01 (Barbeira et al. 2020). Under more typical conditions,
the portion of GWAS peaks which colocalize with an eQTL is 25% or higher9,10,29. As not
all GWAS peaks will share a causative SNP with a cis-eQTL, these estimates represent
a lower bound on power, with empirical power likely to be much higher. Given our
assumption that putatively causative genes are mediating association signals, we would
expect that 25% of these associations would colocalize, and that in each case, the gene
they colocalize with is our putatively causative gene. We would thus expect at least
22/89 (25%) of putatively causative genes near a polygenic trait association signal to
have a colocalizing eQTL in relevant tissue. Here, we report all associations without
correcting for multiple testing, so we would expect substantially more colocalizations.
We thus cannot attribute the absence of such events to lack of power. This conclusion is
supported directly by our analyses: coloc explicitly tests the hypothesis that GWAS and
eQTL signals are distinct, and finds strong statistical support for this hypothesis in three
times as many loci as it finds evidence for colocalization. This suggests that, in many
cases, genetically induced changes to baseline expression of putatively causative
genes do not translate into downstream phenotypic effects. At the same time, most
GWAS peaks over these genes are not eQTLs in available tissues.
The power of TWAS is comparable to colocalization methods in cases of a single typed
causative SNP. Its relative power increases in cases of poorly-typed SNPs, allelic
heterogeneity, or apparent heterogeneity (when multiple SNPs tag a single untyped medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 13
causative SNP)30. Thus, the paucity of TWAS signals in the correct tissue and with the
correct direction of effect cannot be explained by low power.
Biological context: causative eQTLs may only manifest in certain developmental
windows, under specific conditions, or in a crucial cell subpopulation. We used data
from the GTEx project, which profiled bulk post-mortem adult tissue samples. If
causative eQTLs are only present in early development, or under specific exposures or
conditions not applicable to the GTEx donors, they would not be captured in these
contexts, even though cis-eQTLs have been detected for essentially every gene in the
genome in the GTEx data6.
Single-cell RNA sequencing (scRNA-seq) studies have identified some eQTLs present
in only a subset of the cell types captured in bulk-tissue analysis, but these appear to be
limited—van der Wjist et al. found that 60% of cell type-specific eQTLs replicate in
bulk-tissue analysis, and their use of scRNA-seq found only 13% more eQTLs than
bulk-tissue analysis36. It has also been posited that cell type-specific eQTLs may be
enriched in disease association37. Additionally, genes causal for disease tend to have
more enhancers, which may lead to more complex spatiotemporal expression38.
Nonetheless, using this tendency to explain the many putatively causative genes whose
expression was not linked to GWAS requires us to believe most genes both have
cis-eQTLs that do not show up in bulk-tissue analysis, and lack those cis-eQTLs which
do show up in bulk-tissue analysis. Additionally, nearly all genes identified through medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 14
proximity to a fine-mapped variant chromatin mark peak were identified in relevant
tissues, suggesting that our selection of tissue is correct.
A new cell-type TWAS method, which leverages large sample sizes for human bulk
tissues and high-resolution mouse scRNA-seq data to infer cell-type-specific gene
expression for each GTEx sample with respect to each Tabula Muris cell type under an
empirical Bayes framework and produce gene expression prediction models at cell-type
resolution, found no additional disease-associated gene in type II diabetes, and only
one, targeting FGFR2, in breast cancer (albeit not in breast mammary tissue; Huwenbo
Shi and Alkes Price, unpublished correspondence). This argues against context-specific
eQTLs being the most prevalent effect of trait-associated variants.
It is possible for eQTLs to change or disappear over the course of development39.
Because colocalization and TWAS methods rely on eQTL-mapping, such dynamic
eQTLs present a potential blind spot. Chromatin marks provide an orthogonal source of
information generally. Furthermore, because chromatin marks within a
tissue—especially H3K4me3—can remain stable across developmental time40, they
provide specific value in addressing this blind spot.
Alternative regulatory mechanisms: finally, it is conceivable that most non-coding
trait-associated variants act not on expression levels, but on other aspects of gene
regulation. For example, splicing QTLs (sQTLs) are enriched in GWAS peaks to the
same extent as eQTLs41,42. However, only 29% of our trait-associated variants that are medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 15
highly likely to be causal (fine-mapping posterior probability > 0.7) fall in introns, despite
introns composing 45% of the genome43. Thus sQTLs do not immediately appear as a
viable hypothesis to explain the majority of trait-associated variation.
We thus have to explain the observation that putatively causative genes are often near
GWAS signals driven by non-coding variants, and that these genes are influenced by
baseline eQTLs in relevant tissues, but that trait-associated variants are not driving
those eQTLs. This result questions the basic assumption that trait variants act by
perturbing baseline gene expression, so that eQTLs in GWAS peaks are necessarily
relevant to the mapped trait. That these genes are more likely than chance to be near
such non-coding trait-associated variants suggests that both the structure and
regulation of these genes is relevant to complex traits. However, our results
demonstrate that the mechanism by which our genes influence complex traits is
generally not their baseline expression.
Regardless of the root cause, our results have consequences for efforts to uncover the
biology underlying human traits by linking variants to molecular function through
baseline expression measurements. These variant-to-function methods are currently the
most common computational strategies for identifying the biological significance and
therapeutic potential of non-coding genetic associations. Though they have successfully
identified many genes of biological consequence and clinical promise, most causative
genes likely go undiscovered. Given the difficulties many tissues present in obtaining
expression data across diverse developmental and environmental contexts, the medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 16
limitations of examining baseline expression may present a difficult obstacle to
overcome.
There are limited mechanistic models to explain the function of non-coding variants
besides their action as cis-eQTLs. Besides sQTLs, another possibility is trans-eQTLs
that are not mediated by a cis effect on a gene, such as variants affecting CTCF binding
sites37, but this fails to explain the enrichment in GWAS signal near putatively causative
genes. Though it is likely that power and context play a role in the lack of overlap we
observe, for the reasons above it seems improbable that they explain it entirely.
Cumulatively, our analysis shows that whilst gold standard genes are often the closest
to a genetic association, more sophisticated analyses incorporating functional genomic
data fail to identify them as relevant to the trait in meaningful numbers. There are
currently no prominent models to fill this gap, but we must remember that complex trait
genetics has overturned our assumptions time and time again.
Figures medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 17
Figure 1. Putatively causative genes identified by each method.
The leftmost column displays the entire set of putatively causative genes, along with the
subset near a linkage peak, and its subset of genes closest to the peak. For JLIM,
Coloc, and eCAVIAR, the portion of genes that were the only gene to colocalize in their
locus is noted. The numbers for these methods represent nominal significance
thresholds. For TWAS results, the subsets of genes which are in an appropriate tissue
and in an appropriate tissue in the right direction are indicated. medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 18
A) medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 19
B) medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 20
C)
Figure 2. Genes identified as associated with a complex trait by each method.
A) Positive results for each of the three colocalization methods. B) Positive results for
each of the two chromatin methods. C) Positive results for all methods, collapsing A) to
“colocalization” and B) to “chromatin.”
Phenotype Genes
LDL APOB
APOC2
APOE medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 21
LDLR
LPL
PCSK9
HDL ABCA1
APOA1
CETP
LIPC
LIPG
PLTP
SCARB1
Height ANTXR1
ATR
BLM
CDC6
CDT1
CENPJ
COL1A1
COL1A2
COMP
CREBBP
DNA2
DTDST medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 22
EP300
EVC
EVC2
FAM157B
FBN1
FGFR3
FKBP10
GHR
KRAS
NBN
NIPBL
ORC1
ORC4
ORC6L
PCNT
PLOD2
PTPN11
RAD21
RAF1
RECQL4
RIT1
RNU4ATAC
ROR2 medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 23
SLC26A2
SMAD4
SMC3
SOS1
SRCAP
WRN
Blood pressure (systolic and diastolic) KCNJ1
SLC12A1
SLC12A3
WNK1
WNK4
Crohn disease ATG16L1
CARD9
IL10
IL10RA
IL10RB
IL23R
IRGM
NOD2
PRDM1
PTPN22
RNF186 medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 24
Ulcerative colitis ATG16L1
CARD9
IL23R
IRGM
PRDM1
PTPN22
RNF186
Type II diabetes ABCC8
BLK
CEL
EIF2AK3
GATA4
GATA6
GCK
GLIS3
HNF1A
HNF1B
HNF4A
IER3IP1
INS
KCNJ11
KLF11 medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 25
LMNA
NEUROD1
NEUROG3
PAX4
PDX1
PPARG
PTFA1
RFX6
SLC19A2
SLC2A2
WFS1
ZFP57
Breast cancer AKT1
ARID1A
ATM
BRCA1
BRCA2
CBFB
CDH1
CDKN1B
CHEK2
CTCF medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 26
ERBB2
ESR1
FGFR2
FOXA1
GATA3
GPS2
HS6ST1
KMT2C
KRAS
LRRC37A3
MAP2K4
MAP3K1
NCOR1
NF1
NUP93
PALB2
PIK3CA
PTEN
RB1
RUNX1
SF3B1
STK11
TBX3 medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 27
TP53
ZFP36L1
Table 1. Putatively causative genes
Supplementary methods
Identifying coding variants
Because many variants can fall within coding sequences in rare splice variants, coding
SNPs were selected based on the pext (proportion of expression across transcripts)
data44. Two filters were used. First, genes were considered only if their expression in a
trait relevant tissue was at least 50% of their maximum expression across tissues.
Second, variants were considered only if they fell within the coding sequence of at least
25% of splice isoforms in that tissue.
GWAS
For height, LDL cholesterol, and HDL cholesterol, GWAS were performed using
unrelated individuals of European ancestry from UKBB. The GWAS was run in Plink
2.045, using age, sex, BMI (for LDL and HDL only), 10 principal components, and coding
SNPs as covariates.
Conditional analysis
Analysis of breast cancer, Crohn disease, ulcerative colitis, and type II diabetes used
publically available summary statistics. The summary statistics were corrected for medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 28
coding SNPs using an LD reference panel of TOPMed subjects of European ancestry46.
These subjects were identified with FastPCA47,48 and extracted using bcftools49.
Colocalization
JLIM9 was running using GWAS summary statistics and GTEx v7 genotypes and
phenotypes. Coloc10 was run using GWAS and GTEx v7 summary statistics. eCAVIAR29
was run using GWAS and GTEx v7 summary statistics, and a reference dataset of LD
from UKBB50 (Weissbrod et al. 2021).
1. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461,
747–753 (2009).
2. Maurano, M. T. et al. Systematic Localization of Common Disease-Associated Variation in
Regulatory DNA. Science 337, 1190–1195 (2012).
3. Trynka, G. et al. Chromatin marks identify critical cell types for fine mapping complex trait
variants. Nat. Genet. 45, 124–130 (2013).
4. Gusev, A. et al. Partitioning Heritability of Regulatory and Cell-Type-Specific Variants across
11 Common Diseases. Am. J. Hum. Genet. 95, 535–552 (2014).
5. Nicolae, D. L. et al. Trait-Associated SNPs Are More Likely to Be eQTLs: Annotation to
Enhance Discovery from GWAS. PLOS Genet. 6, e1000888 (2010).
6. Consortium, T. Gte. The GTEx Consortium atlas of genetic regulatory effects across human
tissues. Science 369, 1318–1330 (2020).
7. Stranger, B. E. et al. Population genomics of human gene expression. Nat. Genet. 39,
1217–1224 (2007). medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 29
8. Vuckovic, D. et al. The Polygenic and Monogenic Basis of Blood Traits and Diseases. Cell
182, 1214-1231.e11 (2020).
9. Chun, S. et al. Limited statistical evidence for shared genetic effects of eQTLs and
autoimmune-disease-associated loci in three major immune-cell types. Nat. Genet. 49,
600–605 (2017).
10. Giambartolomei, C. et al. Bayesian Test for Colocalisation between Pairs of Genetic
Association Studies Using Summary Statistics. PLoS Genet. 10, e1004383 (2014).
11. Barbeira, A. N. et al. Exploiting the GTEx resources to decipher the mechanisms at
GWAS loci. bioRxiv 814350 (2020) doi:10.1101/814350.
12. Yao, D. W., O’Connor, L. J., Price, A. L. & Gusev, A. Quantifying genetic effects on
disease mediated by assayed gene expression levels. Nat. Genet. 52, 626–633 (2020).
13. Alasoo, K. et al. Shared genetic effects on chromatin and gene expression indicate a
role for enhancer priming in immune response. Nat. Genet. 50, 424–431 (2018).
14. Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from
polygenic to omnigenic. Cell 169, 1177–1186 (2017).
15. Liu, X., Li, Y. I. & Pritchard, J. K. Trans Effects on Gene Expression Can Drive
Omnigenic Inheritance. Cell 177, 1022-1034.e6 (2019).
16. Blair, D. R. et al. A Nondegenerate Code of Deleterious Variants in Mendelian Loci
Contributes to Complex Disease Risk. Cell 155, 70–80 (2013).
17. Schneider, W. J. et al. Familial dysbetalipoproteinemia. Abnormal binding of mutant
apoprotein E to low density lipoprotein receptors of human fibroblasts and membranes from
liver and adrenal of rats, rabbits, and cows. J. Clin. Invest. 68, 1075–1085 (1981).
18. Boerwinkle, E. & Utermann, G. Simultaneous effects of the apolipoprotein E
polymorphism on apolipoprotein E, apolipoprotein B, and cholesterol metabolism. Am. J.
Hum. Genet. 42, 104–112 (1988).
19. Voight, B. F. et al. Twelve type 2 diabetes susceptibility loci identified through large-scale medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 30
association analysis. Nat. Genet. 42, 579–589 (2010).
20. Chan, Y. et al. Genome-wide Analysis of Body Proportion Classifies Height-Associated
Variants by Mechanism of Action and Implicates Genes Important for Skeletal Development.
Am. J. Hum. Genet. 96, 695–708 (2015).
21. Freund, M. K. et al. Phenotype-Specific Enrichment of Mendelian Disorder Genes near
GWAS Regions across 62 Complex Traits. Am. J. Hum. Genet. 103, 535–552 (2018).
22. Mahajan, A. et al. Fine-mapping type 2 diabetes loci to single-variant resolution using
high-density imputation and islet-specific epigenome maps. Nat. Genet. 50, 1505–1513
(2018).
23. Liu, J. Z. et al. Association analyses identify 38 susceptibility loci for inflammatory bowel
disease and highlight shared genetic risk across populations. Nat. Genet. 47, 979–986
(2015).
24. Goyette, P. et al. High-density mapping of the MHC identifies a shared role for
HLA-DRB1*01:03 in inflammatory bowel diseases and heterozygous advantage in ulcerative
colitis. Nat. Genet. 47, 172–179 (2015).
25. Zhang, H. et al. Genome-wide association study identifies 32 novel breast cancer
susceptibility loci from overall and subtype-specific analyses. Nat. Genet. 52, 572–581
(2020).
26. Dietlein, F. et al. Identification of cancer driver genes based on nucleotide context. Nat.
Genet. 52, 208–218 (2020).
27. Genetic Investigation of ANthropometric Traits (GIANT) Consortium et al. Conditional
and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants
influencing complex traits. Nat. Genet. 44, 369–375 (2012).
28. Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable
selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Ser. B Stat.
Methodol. 82, 1273–1300 (2020). medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 31
29. Hormozdiari, F. et al. Colocalization of GWAS and eQTL Signals Detects Target Genes.
Am. J. Hum. Genet. 99, 1245–1260 (2016).
30. Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association
studies. Nat. Genet. 48, 245–252 (2016).
31. Gamazon, E. R. et al. A gene-based association method for mapping traits using
reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).
32. Mancuso, N. et al. Integrating Gene Expression with Summary Association Statistics to
Identify Genes Associated with 30 Complex Traits. Am. J. Hum. Genet. 100, 473–487 (2017).
33. Wainberg, M. et al. Opportunities and challenges for transcriptome-wide association
studies. Nat. Genet. 51, 592–599 (2019).
34. Fulco, C. P. et al. Activity-by-contact model of enhancer–promoter regulation from
thousands of CRISPR perturbations. Nat. Genet. 51, 1664–1669 (2019).
35. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and
characterization. Nat. Methods 9, 215–216 (2012).
36. van der Wijst, M. G. P. et al. Single-cell RNA sequencing identifies celltype-specific
cis-eQTLs and co-expression QTLs. Nat. Genet. 50, 493–497 (2018).
37. Umans, B. D., Battle, A. & Gilad, Y. Where Are the Disease-Associated eQTLs? Trends
Genet. (2020) doi:10.1016/j.tig.2020.08.009.
38. Wang, X. & Goldstein, D. B. Enhancer Domains Predict Gene Pathogenicity and Inform
Gene Discovery in Complex Disease. Am. J. Hum. Genet. 106, 215–233 (2020).
39. Strober, B. J. et al. Dynamic genetic regulation of gene expression during cellular
differentiation. Science 364, 1287–1290 (2019).
40. Gorkin, D. U. et al. An atlas of dynamic chromatin landscapes in mouse fetal
development. Nature 583, 744–751 (2020).
41. Walker, R. L. et al. Genetic Control of Expression and Splicing in Developing Human
Brain Informs Disease Mechanisms. Cell 179, 750-771.e22 (2019). medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 32
42. Li, Y. I. et al. RNA splicing is a primary link between genetic variation and disease.
Science 352, 600–604 (2016).
43. Francis, W. R. & Wörheide, G. Similar Ratios of Introns to Intergenic Sequence across
Animal Genomes. Genome Biol. Evol. 9, 1582–1598 (2017).
44. Cummings, B. B. et al. Transcript expression-aware annotation improves rare variant
interpretation. Nature 581, 452–458 (2020).
45. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer
datasets. GigaScience 4, (2015).
46. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed
Program. Nature 590, 290–299 (2021).
47. Galinsky, K. J. et al. Fast Principal-Component Analysis Reveals Convergent Evolution
of ADH1B in Europe and East Asia. Am. J. Hum. Genet. 98, 456–472 (2016).
48. Galinsky, K. J., Loh, P.-R., Mallick, S., Patterson, N. J. & Price, A. L. Population Structure
of UK Biobank and Ancient Eurasians Reveals Adaptation at Genes Influencing Blood
Pressure. Am. J. Hum. Genet. 99, 1130–1139 (2016).
49. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, (2021).
50. Weissbrod, O. et al. Functionally informed fine-mapping and polygenic localization of
complex trait heritability. Nat. Genet. 52, 1355–1363 (2020).
Acknowledgements
We thank Alkes Price, Alex Bloemendal, Benjamin Neale, Bogdan Pasanuic, Sasha
(Alexander Gusev), and Matt Warman for their helpful discussions. This research was
supported by NIH grants HG010372, R35GM127131, R01HG010372, and
R01MH101244. N.J.C was supported by NIH training grant T32GM74897. UK Biobank medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 33
was accessed under projects 14048 and 10438. TOPMed data were used under dbGaP
project 28674.