The Missing Link Between Genetic Association and Regulatory Function

medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 1 Title: The missing link between genetic association and regulatory function Noah Connally1,2,3, Sumaiya Nazeen1,2,4, Daniel Lee1,2,3, Huwenbo Shi3,5, John Stamatoyannopoulos6,7, Sung Chun8, Chris Cotsapas3,9,10*, Christopher Cassa1,3*, Shamil Sunyaev1,2,3* 1 Brigham and Women’s Hospital, Division of Genetics, Harvard Medical School, Boston, MA, USA 2 Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA 3 Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA 4 Brigham and Women’s Hospital, Department of Neurology, Harvard Medical School, Boston, MA, USA 5 Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA 6 Altius Institute for Biomedical Sciences, Seattle, WA, USA 7 Department of Medicine, University of Washington School of Medicine, Seattle, WA, USA 8 Division of Pulmonary Medicine, Boston Children’s Hospital, Boston, MA, USA 9 Department of Neurology, Yale Medical School, New Haven, CT, USA 10 Department of Genetics, Yale Medical School, New Haven, CT, USA * Co-corresponding authors (C. Cotsapas: [email protected]; C. Cassa: [email protected]; S. Sunyaev: [email protected]) NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice. medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 2 The genetic basis of most complex traits is highly polygenic and dominated by non-coding alleles, and it is widely assumed that such alleles exert small regulatory effects on the expression of cis-linked genes. However, despite availability of expansive gene expression and epigenomic data sets, few variant-to-gene links have emerged. We identified 139 genes in which protein-coding variants cause severe or familial forms of nine human traits. We then computed the association between common complex forms of the same traits and non-coding variation, revealing that most such traits are also associated with non-coding variation in the vicinity of the same genes. However, we found colocalization evidence—the same variant influencing both the physiological trait and gene expression—for only 7% of genes, and transcriptome-wide association evidence with correct direction of effect for only 4% of genes, despite an abundance of eQTLs in most loci. Fine mapping variants to regulatory elements and assigning these to genes by linear distance similarly failed to implicate most genes in complex traits. These results contradict the hypothesis that most complex trait-associated variants coincide with currently ascertained expression quantitative trait loci. The field must confront this deficit, and pursue the “missing regulation.” Modern complex trait genetics has uncovered surprises at every turn, including the paucity of associations between traits and coding variants of large effect, and the “mystery of missing heritability,” where no combination of common and rare variants can explain a large fraction of trait heritability1. Further work has revealed unexpectedly high medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 3 polygenicity for most human traits and very small effect sizes for individual variants. Bulk enrichment analyses have demonstrated that a large fraction of heritability resides in regions with gene regulatory potential, predominantly tissue-specific accessible chromatin and enhancer elements, suggesting that trait-associated variants influence gene regulation2–4. Furthermore, genes in trait-associated loci are more likely to have genetic effects on their expression levels (expression QTLs, or eQTLs), and the variants with the strongest trait associations are more likely also to be associated with transcript abundance of at least one proximal gene5. Combined, these observations have led to the inference that most trait-associated variants are eQTLs, exerting their effect on phenotype by altering transcript abundance, rather than protein sequence. The mechanism may involve a knock-on effect on gene regulation, with the variant altering transcript abundances for genes elsewhere in the genome (a trans-eQTL), but the consensus view is that this must be mediated by the variant influencing a gene in the region (a cis-eQTL)6. As most eQTL studies profile cell populations or tissues from healthy donors at homeostatic equilibrium, the further assumption has been tacitly made that these trait-associated variants affect genes in cis under resting conditions. Equivalent QTL analyses of exon usage data have revealed a more modest overlap with trait-associated alleles, suggesting that a fraction of trait-associated variants influence splicing, and hence the relative abundance of different transcript isoforms, rather than overall expression levels. Thus, a model has emerged where most trait-associated variants influence proximal gene regulation. medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 4 Several observations have challenged this basic model. One challenge comes from the difference between spatial distributions of eQTLs, which are dramatically enriched in close proximity of genes, and GWAS peaks, which are usually distal7. Another comes from colocalization analyses, attempting to map shared genetic associations between human traits and gene expression. If the model is correct, most trait associations should also be eQTLs; trait and expression phenotype should thus share an association in that locus (rather than two association peaks overlapping). However, only 5-40% of trait associations co-localize with eQTLs in relevant tissues or cell types6,8–10, and only 15% of genes colocalize with any of 74 different complex traits11. Finally, expression levels mediated a minority of complex trait heritability12. This has led to the suggestion that most trait-associated alleles influence gene regulation in a context-specific manner13—either altering expression during development or in response to specific physiological stimuli—or that they act indirectly in trans to affect the regulation of a small number of genes involved in trait biology (the omnigenic model14,15). Without a set of true positive cases, in which the gene driving trait variation is known, it remains difficult to assess either the basic model or the proposed variations. One source of true positives is to identify genes that are both in loci associated with a complex trait and are also known to harbor coding mutations causing severe or early onset forms of related traits (e.g. related Mendelian disorders). The strong expectation is that a variant of small effect influences the gene identified in the severe form of the trait. This expectation is supported by several lines of evidence. Comorbidity between Mendelian and complex traits has been used to identify common variants associated medRxiv preprint doi: https://doi.org/10.1101/2021.06.08.21258515; this version posted June 11, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license . 5 with the complex traits16. A handful of genes have been conclusively identified in both Mendelian and complex forms of the same trait, including APOE, which is involved in cholesterol metabolism17,18, and SNCA, which contributes to Parkinson’s disease risk. Early genome-wide association studies (GWAS) found associations near genes identified through familial studies of severe disease19,20, and more recent analyses have found that GWAS associations are enriched in regions near causative genes for cognate Mendelian traits in both blood traits8 and a diverse collection of 62 traits21. To test the model that trait-associated variants influence baseline gene expression, therefore, we assembled a list of such “putatively causative” genes. We selected nine polygenic common traits with available large-scale GWAS data, each of which also has an extreme form in which coding mutations of large effect size affect one or more genes with well-characterized biology (Table 1). Our selection included four common diseases: type II diabetes22, where early onset familial forms are caused by rare coding mutations (insulin-independent MODY; neonatal diabetes; maternally inherited diabetes and deafness; familial partial lipodystrophy); ulcerative colitis and Crohn disease23,24, which have Mendelian pediatric forms characterized by severity of presentation; and breast cancer25, where coding mutations in the germline (e.g. BRCA1) or somatic tissue (e.g. PIK3CA) are sufficient for disease. We also chose five quantitative traits: low and high density lipoprotein levels (LDL and HDL); systolic and diastolic blood pressure; and height. We selected 139 genes harboring large-effect-size coding variants for one of the nine phenotypes (Table 1). These genes were identified in familial studies, and, for breast cancer, using the MutPanning method26.

The Missing Link Between Genetic Association and Regulatory Function

The Title of the Dissertation

A Computational Approach for Defining a Signature of Β-Cell Golgi Stress in Diabetes Mellitus

Insulin/Glucose-Responsive Cells Derived from Induced Pluripotent Stem Cells: Disease Modeling and Treatment of Diabetes

Functional Characterization of Rfx6 and Sel1l in Pancreatic Development

Table S1. Identified Proteins with Exclusive Expression in Cerebellum of Rats of Control, 10Mg F/L and 50Mg F/L Groups

Chromatin and Epigenetics Cross-Journal Focus Chromatin and Epigenetics

Screen to Identify the Novel Pancreatic Gene Synaptotagmin 13 (Syt13)

Supplementary Material Computational Prediction of SARS

Table SII. Significantly Differentially Expressed Mrnas of GSE23558 Data Series with the Criteria of Adjusted P<0.05 And

Identification of Novel, Clinically Correlated Autoantigens in the Monogenic Autoimmune Syndrome

RDH10 Is Necessary for Initial Vagal Neural Crest Cell Contribution to Embryonic Gastrointestinal Tract Development

Heparanase Overexpression Induces Glucagon Resistance and Protects