<<

ARTICLE

DOI: 10.1038/s41467-018-03371-0 OPEN Integrative analysis of omics summary data reveals putative mechanisms underlying complex traits

Yang Wu1, Jian Zeng 1, Futao Zhang1, Zhihong Zhu1, Ting Qi1, Zhili Zheng1,2, Luke R. Lloyd-Jones1, Riccardo E. Marioni3,4, Nicholas G. Martin5, Grant W. Montgomery 1, Ian J. Deary4, Naomi R. Wray 1,6, Peter M. Visscher 1,6, Allan F. McRae1 & Jian Yang 1,6

The identification of and regulatory elements underlying the associations discovered

1234567890():,; by GWAS is essential to understanding the aetiology of complex traits (including diseases). Here, we demonstrate an analytical paradigm of prioritizing genes and regulatory elements at GWAS loci for follow-up functional studies. We perform an integrative analysis that uses summary-level SNP data from multi-omics studies to detect DNA methylation (DNAm) sites associated with expression and through shared genetic effects (i.e., pleio- tropy). We identify pleiotropic associations between 7858 DNAm sites and 2733 genes. These DNAm sites are enriched in enhancers and promoters, and >40% of them are mapped to distal genes. Further pleiotropic association analyses, which link both the methylome and transcriptome to 12 complex traits, identify 149 DNAm sites and 66 genes, indicating a plausible mechanism whereby the effect of a genetic variant on phenotype is mediated by genetic regulation of through DNAm.

1 Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD 4072, Australia. 2 The Eye Hospital, School of Ophthalmology & Optometry, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China. 3 Medical Section, Centre for Genomics and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh EH4 2XU, UK. 4 Centre for Cognitive Ageing and Cognitive Epidemiology, Department of Psychology, University of Edinburgh, 7 George Square, Edinburgh EH8 9JZ, UK. 5 Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, QLD 4029, Australia. 6 Queensland Brain Institute, The University of Queensland, Brisbane, QLD 4072, Australia. Yang Wu, Jian Zeng and Futao Zhang contributed equally to this work. Correspondence and requests for materials should be addressed to J.Y. (email: [email protected])

NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications 1 ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0

enome-wide association studies (GWAS) have identified further annotate the DNAm sites that are associated with gene thousands of genetic variants associated with complex expression. G 1,2 traits (including diseases) . Given the polygenic nature In this study, we report an integrative analysis of summary of most complex traits3, more variants are expected to be dis- statistics from GWAS, eQTL and mQTL studies of the largest covered in the foreseeable near future because of the rapid sample sizes to date. We mapped 7858 DNAm sites to 2733 increase in sample size (due to large cohorts such as UK Biobank4 putative target genes in cis-regions and then linked them to 14 and consortia efforts). However, the mechanisms underlying the complex traits. associations remain largely unaddressed5. This is because the mapping resolution of GWAS is limited by the complicated Results linkage disequilibrium (LD) structure of the genome (i.e., the top Method overview. Our integrative analysis combines summary- associated variant at a locus is often not the causal variant)6,7 and level multi-omics data to prioritize gene targets and their reg- by the sampling variation in statistical tests due to finite sample ulatory elements. It consists of three steps, each of which relies on sizes. Furthermore, genetic variants can affect phenotype through the SMR & HEIDI method to test for pleiotropic association9 distal regulation of (i.e., the nearest gene to the – (Fig. 1 and Supplementary Note 1). First, we map the methylome GWAS top signal is often not the causal gene)7 9. High- to the transcriptome in cis-regions by testing the associations of throughput methods such as massively parallel reporter assay10, DNAm with their neighbouring genes (within 2 Mb of each self-transcribing active regulatory region sequencing (STARR- DNAm probe) using the top associated mQTL as the instru- seq)11, and CRISPR-Cas9 epigenome screens12 have been devel- mental variable (Methods). Next, we prioritize the trait-associated oped to identify the causal variants and functional elements genes by testing the associations of transcripts with the phenotype regulating gene expression. However, it remains challenging to using the top associated eQTL. Last, we prioritize the trait- pinpoint the gene(s) responsible for the association signal at a associated DNAm sites by testing associations of DNAm sites GWAS locus8,13,14. Thus, analytical approaches that somehow with the phenotype using the top associated mQTL. If the asso- mimic a functional study in silico are needed to prioritize plau- ciation signals are significant in all three steps, then we predict sible functional genes and/or regulatory elements at GWAS loci with strong confidence that the identified DNAm sites and target for further studies. genes are functionally relevant to the trait through the genetic Given that the complex trait-associated variants are pre- regulation of gene expression at the DNAm sites. The SMR & dominantly found in noncoding regions7,15, it is reasonable to HEIDI method assumes consistent LD between two samples. This assume that these variants affect phenotype through genetic assumption is generally satisfied by using data from samples of regulation of transcriptional output. To test whether the effect of the same ancestry (Supplementary Fig. 1). a genetic variant on a phenotype is mediated by transcription, we have developed powerful and flexible approaches, SMR (sum- mary-data-based Mendelian randomization) and HEIDI (het- Analytical mapping of methylome to transcriptome. To identify erogeneity in dependent instruments) tests9, and have target genes for the DNAm sites, we applied the SMR & HEIDI implemented them in an efficient software tool for genome-wide approach to test for pleiotropic associations between DNAm and analysis (URLs). The SMR & HEIDI approach uses summary- gene expression (denoted as M2T analysis) in large data sets from level data from GWAS and expression peripheral blood (Methods). The summary statistics of SNPs on (eQTL) studies to test if a transcript and phenotype are associated gene expression were from an eQTL analysis of 38624 probes in because of a shared causal variant (i.e., pleiotropy). Compared the CAGE (Consortium for the Architecture of Gene Expression) with most other methods for an integrative analysis of GWAS 28 data (n = 2765). The summary statistics of SNPs on DNAm – and eQTL data16 18, the SMR & HEIDI approach features the were from a meta-analysis of mQTL data26 from two independent ability to distinguish a pleiotropic model (i.e., gene expression data sets: the Brisbane Systems Genetics Study (BSGS, n = 614)29 and phenotype are associated owing to a single shared genetic and the Lothian Birth Cohorts (LBC, n = 1366)30. After quality variant) from a linkage model (i.e., there are two or more distant controls (Methods), we retained 9538 gene expression probes genetic variants in LD affecting gene expression and phenotype from the CAGE eQTL analysis and 73973 DNAm probes from independently)9. Moreover, like other summary-data-based the meta-mQTL analysis. In any of the specific analyses below, we methods16,18, the SMR & HEIDI approach allows the use of only included SNPs available and with consistent alleles in the GWAS and eQTL data from two independent studies; thus, the data sets used. statistical power can be boosted by using data from studies with By testing each DNAm probe for associations with genes very large sample sizes. within a 2 Mb distance in either direction, we detected 21938 The analytical framework that integrates GWAS and eQTL DNAm probes that showed significant SMR associations with data can be applied to incorporate other source of omics infor- 4804 gene expression probes (corresponding to 3991 unique 19 −8 mation, e.g., DNA methylation (DNAm) at CpG sites , which is genes) at PSMR < 2.26 × 10 correcting for ~2.21 million tests. an important epigenetic mechanism for gene regulation20. We further used the HEIDI approach to test against the null Methylome-wide association studies (MWAS) have identified a hypothesis that the association detected by the SMR test is due to 21–23 number of DNAm sites associated with complex traits . pleiotropy, rejecting those with PHEIDI < 0.01 (Supplementary However, it is not clear whether these associations are causal or Note 2). Finally, 10,588 associations between 7858 DNAm and driven by confounding factors. More importantly, it is not 3239 gene expression probes (corresponding to 2733 genes) were obvious which genes are regulated by DNAm because the reg- not rejected by the HEIDI test. On average, 3.0 DNAm sites were ulatory elements can be distant from the target genes24. Data associated with each gene, with a median value of 2.0 and a from recent eQTL and methylation quantitative trait locus standard deviation of 3.1. For the DNAm sites associated with the (mQTL) studies25,26 provide an opportunity to incorporate same gene, they tended to be located proximally (average mQTL data into the SMR analysis to map DNAm to distance = 35.4 kb) and enriched in enhancers (see the enrich- transcripts through a shared genetic factor and to further detect ment analysis below), and their effects on the expression level of pleiotropic associations of DNAm and transcripts with pheno- the gene tended to be in the same direction (87.2% pairs in the type. The chromatin activity data provided by the Roadmap same direction). More than a half of the DNAm located in Epigenomics Mapping Consortium (REMC)27 allows us to promoters (61.0%) or enhancers (62.2%) were negatively

2 NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0 ARTICLE

a b 12 GWAS GWAS

Trait ) 9

GWAS 6 P ( 10

log 3 −

0 eQTL Transcription 62 eQTL

) 41 eQTL P ( 10 21 −log mQTL Methylation 0 311 mQTL

) 207 mQTL P (

10 104 −log SNP 0

30.83 31.23 31.63 32.03 32.43 32.83 (Mb)

Fig. 1 Schematic of our integrative analysis of multi-omics data. a A hypothetical model of a mediation mechanism tested in our analysis: an SNP exerts an effect on the trait by altering the DNAm level, which regulates the expression levels of a functional gene. b An example by simulation under our hypothetical model (Supplementary Note 1), in which the observed SNP-association signals are consistent at the shared GWAS locus across methylation, transcript and trait phenotype (see Fig. 7b for a real data example) associated with the expression levels of the target genes, powerful than T2M. However, there were 248 associations consistent with the hypothesis that most transcription factors identified by T2M, for which the instrument SNP explained are activators and the binding affinity of transcription factors on significantly larger proportion of variance in gene expression than promoters and enhancers are affected by DNAm. DNAm (Supplementary Data 1 and Supplementary Note 3), The SMR associations not rejected by the HEIDI test are implying a potential mechanism that the genetic effect on DNAm consistent with a pleiotropic model whereby both DNAm and is mediated through gene expression. The map between transcript are affected by a shared causal variant. We therefore transcriptome and methylome identified in this study have been hypothesized that the DNAm sites are located in regulatory integrated in an online database (URLs), which is useful for elements that are functionally relevant to the associated genes. To interpreting the discoveries from MWAS31 and understanding test this hypothesis, we conducted an enrichment analysis the mechanism of genetic regulation of gene expression. The (Methods) of the 7858 transcript-associated DNAm probes in pleiotropic associations between DNAm and gene expression are 14 main functional annotation categories in blood samples from driven by shared causal variants, and are therefore robust to REMC (Methods)27. We found a significant enrichment of these environmental exposures or disease, unless the environmental DNAm probes in the (fold-change = 1.39, P = 3.21 × exposures or diseases are dominated by a specific , which − − 10 88) and enhancer (fold-change = 2.66, P = 2.20 × 10 71) is very unlikely for complex traits (Supplementary Fig. 3). regions and a significant underrepresentation in heterochromatin Genes in the closest physical proximity with the DNAm sites − (fold-change = 0.35, P = 7.53 × 10 6), repressed (fold-change = are often used as the target genes in MWAS. Given the − 0.68, P = 2.79 × 10 15) and quiescent regions (fold-change = DNAm–gene associations identified from the analysis above, we − 0.48, P = 1.61 × 10 157), and transcription starting sites (fold- can estimate the proportion of DNAm sites mapped to the − change = 0.59, P = 5.86 × 10 17) in comparison with the probes nearest genes. We found that only 36.3% of the DNAm sites were π sampled at random with the variance of DNAm levels at each mapped to the nearest genes ( nearest) and 70.1% were mapped to π probe matched (Fig. 2). These results demonstrate the use of distal (i.e., not the nearest) genes ( distal), with 6.4% mapped to eQTL and mQTL summary data to genetically link a gene to its both. There were a number of DNAm sites mapped to genes regulatory elements. beyond 500 kb distance (Supplementary Fig. 4). For each of the We also conducted a SMR & HEIDI analysis considering gene distal DNAm–gene associations, we calculated the correlation in expression as the exposure and DNAm as the outcome (denoted expression level between the nearest and distal genes. The mean as T2M analysis). The associations identified by M2T and T2M correlation was only slightly above zero (mean r = 0.024, s.e.m. = showed a substantial overlap (4182 associations in common) with 0.0020) and not significantly different from that of the same 6406 and 3598 unique associations identified by M2T and T2M number of random gene pairs sampled from cis-regions with respectively (Supplementary Fig. 2), suggesting that M2T is more distances matched (mean r = 0.023, s.e.m. = 0.0033), not

NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications 3 ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0

a 0.4 b 3.0 Sig. DNAm All DNAm

2.5

0.3

2.0

0.2 1.5 Fold enrichment Proportion of probes 1.0

0.1

0.5

0.0 0.0

Tx Het Tx Het TssA Prom TxWk TxEn EnhA TssA Prom TxWk TxEn EnhA EnhWDNase PromP Quies EnhWDNase PromP Quies PromBivReprPC PromBivReprPC ZNF/Rpts ZNF/Rpts

Fig. 2 Enrichment analysis of transcript-associated DNAm probes identified by the SMR & HEIDI test for 14 main functional annotation categories. a Distribution of the transcript-associated DNAm probes ('Sig. DNAm', blue) across the 14 functional categories in comparison to that of all DNAm probes in the data ('All DNAm', green). b Fold enrichment: a comparison of the associated probes with the same probes sampled repeatedly at random with the variance of each probe matched. Error bar represents the standard error of an estimate obtained from 500 random samples. The 14 functional categories are: TssA active transcription start site, Prom upstream/downstream TSS promoter, Tx actively transcribed state, TxWk weak transcription, TxEn transcribed and regulatory Prom/Enh, EnhA active enhancer, EnhW weak enhancer, DNase primary DNase, ZNF/Rpts state associated with zinc-finger protein genes, Het constitutive heterochromatin, PromP Poised promoter, PromBiv bivalent regulatory states, ReprPC repressed Polycomb states, Quies a quiescent state

−6 supporting the hypothesis that the distal associations are probes) at a genome-wide significance level (PSMR < 6.97 × 10 , mediated through the nearest genes. Since not all the gene i.e., 0.05/7177 tagged genes) for 13 traits (Supplementary Data 3). expression probes were included in the SMR analysis (we only The HEIDI test rejected 152 of the associations detected by the −8 π π included probes with cis-eQTL P <5×10 ), nearest ( distal) SMR test (222 remaining) at a threshold 0.01 (Supplementary could possibly be underestimated (overestimated) because some Data 3). Approximately, 65.6% of the 222 significant genes were of the cis-eQTLs were not detected due to the lack of power. We not the nearest genes to the GWAS top associated SNPs, con- then excluded 3368 DNAm probes for which the nearest genes sistent with the result from a previous study9. We further π were not included in the SMR analysis. The estimate of nearest developed a multi-SNP-based SMR method (SMR-multi) π fi fi increased to 63.6% and distal decreased to 47.6% (11.2% mapped (Methods) and identi ed 564 genes at a genome-wide signi cance to both). However, this stringent criterion would probably lead to level, of which 235 were not rejected by the HEIDI test (Sup- π π an overestimation of nearest (underestimation of distal) because plementary Data 4). The multi-SNP-based SMR test appeared to some of the nearest genes do not have cis-eQTL even if the be more powerful than the single-SNP-based test (564 vs. 374). sample size is infinite. Nevertheless, our results indicate that at However, the number of SMR associations not rejected by the least 47.6% of the DNAm are involved in relatively distal HEIDI test for the former was only mildly larger than that for the regulation of the target genes, providing an important caveat that latter (235 vs. 222), suggesting SMR-multi picked up a large DNAm, such as those discovered in MWAS, do not always target proportion of associations due to linkage. We therefore focused the nearest genes. This result is supported by the observation on the results from SMR and relegated those from SMR-multi in from recent functional studies8,32 that the causal variants are Supplementary Data 4. In the analysis with DNAm data, we used largely located in regulatory sequences (e.g., enhancer regions) cis-mQTL summary data of 73973 DNAm probes from the meta- that are distal from the causal genes that they act on. analysis described above and detected 1903 DNAm probes (Supplementary Data 5) associated with 14 complex traits at a fi −7 Pinpointing functionally relevant DNAm and target genes.We genome-wide signi cance level (PSMR < 6.81 × 10 , i.e., 0.05/ applied the SMR & HEIDI method to perform pleiotropic asso- 73448 where 73448 was the number of DNAm probes used in the ciations of both the transcriptome and the methylome with 14 analysis after matching the SNPs and alleles among the mQTL, complex traits (including diseases). The summary statistics of GWAS and LD reference data), of which 893 DNAm probes were SNP associations for the were from the latest GWAS not rejected by the HEIDI test at PHEIDI > 0.01. – meta-analyses for complex traits32,34,35 and diseases36 42 Combining the results from pleiotropic associations between (Methods and Supplementary Data 2). Using the cis-eQTL DNAm, transcripts and complex traits can potentially pinpoint summary data of 9538 gene expression probes from the CAGE functionally relevant genes and regulatory elements at GWAS eQTL analysis described above, we performed the SMR test for loci, as demonstrated by the consistent association signals at the associations between gene expression probes and each of the 14 shared genomic regions across multiple omics layers (Fig. 3). traits, and identified 374 genes (tagged by 446 gene expression When combining the results, we used a stringent criterion that

4 NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0 ARTICLE

22

21 50

25 1 0 25

0 25 20 50 50 0 75 100 25 125

150 50 0 19 175 25 200

225 75 0 0 18 50 25 25 50 0 75 75 2 17 50 100 25 125 150 0 75 175 200 16 50

225

25 0

1000 25 75 50 15 50 75 3 25 100 0 125 100 150 75 175

14 50 0 25 25 0 50 100 75 75 13

100 4 50 125 25 150 0 125 175

100 0 25 12 75 50 50 75 25 0 100 5 125 125 150 100

75 175 11 0 50 25 25 50 0 75 125 100 100 125 75 6 150

50 10 0 25 0 25 50

75 125 100 100 125 75 150

50 25

50 0 75 0 25 7

9 125 100 8

Fig. 3 Pinpointing functional regions for a complex trait with consistent SMR association signals across multi-omics layers. Shown are −log10 (P-values) from the SMR tests for schizophrenia against the physical positions of DNAm or gene expression probes. The blue lines (outer ring) represent −log10 (P- values) from the SMR tests for associations between DNA methylation and trait, the green lines (inner ring) represent those for the associations between transcripts and trait and the yellow lines (middle ring) represent the significant associations between methylations and transcripts. The orange circles represent the significance thresholds of the SMR tests. The black lines are the significant SMR associations consistent across all three layers, and the red lines highlight the significant and consistent SMR associations that are not rejected by the HEIDI test. Due to the large scale of the plot, DNAm and genes that are distally associated (e.g. >500 kb and <2 Mb) appear to be closely located in the figure

43 both the DNAm site and transcript of each pair were associated correlation (rg) between SCZ and EY . It should be noted that with the trait at a genome-wide significance level with none of the the gene–trait association analysis was not dependent on the associations rejected by the HEIDI test (see Supplementary DNAm–trait association analysis. There were 222 genes that Data 6 for details of the results from the SMR & HEIDI test). Of showed pleiotropic associations with the traits, only 66 of which the 10588 DNAm–gene associations identified from the analysis showed pleiotropic associations with DNAm. above, 225 pairs (consisting of 149 DNAm and 66 genes) were Similar to the enrichment analysis of all transcript-associated significantly associated with 12 traits. Taking schizophrenia DNAm sites shown above (Fig. 2), a subset of the 149 transcript- (SCZ) as an example, we found 10 genomic regions (tagged by and trait-associated DNAm sites were also significantly enriched 19 genes and 52 DNAm sites) with consistent significant in the promoter and enhancer regions (Supplementary Fig. 6). Of association signals from DNAm sites and transcripts (Fig. 3). the 66 identified genes, 22 genes are associated with diseases These results are in line with a model that the effects of genetic (Supplementary Data 7), i.e., Alzheimer’s disease (AD), rheuma- variants in these regions on SCZ susceptibility are mediated by toid arthritis (RA), SCZ, Crohn’s disease (CD), ulcerative colitis the regulation of gene expression through DNAm (Fig. 1a). (UC) and coronary artery disease (CAD). To evaluate the Among the 10 identified regions for SCZ, one on chromosome 12 potential value of the 22 putative disease susceptibility genes in (12q24.31) shows a pleiotropic effect on educational years (EY; drug discovery, we obtained all the drug target genes from a Supplementary Fig. 5), consistent with a positive genetic recent study42 based on two major drug databases, Drugbank44

NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications 5 ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0 and Therapeutic Targets Database45, which include drugs used as the candidates for drug repositioning, whereas those with approved in clinical trials or experimental drugs. We found that estimated effects increasing the disease risk might need to be five genes identified in our study (MS4A2, FADS1, FADS2, LIPA, monitored for side effects in clinical trials or practice. SULT1A1) overlapped with the drug targets (Supplementary Data 8). For the drugs listed in Supplementary Data 8, those with estimated effects decreasing the disease risk might potentially be Plausible mediation mechanism for susceptibility genes. The discovery of putative functional genes and their regulatory

a ) ) FADS1 FADS2

ILMN_1670134 ( cg21709803 cg06781209 cg14911132 cg01400685 cg25324164 ILMN_2075065 ( 9 rs968567 ) pMSMR = 6.81e−07 7

4

GWAS or SMR pESMR = 6.97e−06 P (

10 2 –log 0

) 50 ILMN_2075065 (FADS2) eQTL 33 P (

10 17

−log 0

) 224 cg06781209 (TMEM258,FADS1,FADS2)

mQTL 149 P (

10 75 0 −log

ESC TssA iPSC Prom ES-deriv Tx

Blood & T-cell TxWk

HSC & B-cell TxEn

Mesenchymal EnhA Epithelial EnhW

Brain Quies

Muscle DNase Heart ZNF/Rpts

Digestive Het

Other PromP PromBiv ENCODE ReprPC

MYRF FADS2 TMEM258 BEST1 MYRF FADS3 FADS1 BEST1 DKFZP434K028 MIR6746 MIR1908 FTH1 FEN1 RAB3IL1 FADS2 MIR611 RAB3IL1 FADS2

61.48 61.55 61.62 61.69 61.76 61.83 Chromosome 11 Mb b Unmethylated Transcription

Transcription factors RNA polymerase

Promoter

Transcription factors Methylated Transcription suppressed

RNA polymerase

Promoter

6 NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0 ARTICLE elements in our analysis provides opportunities to infer the (Supplementary Data 6), suggesting that the transcription factors mechanism of genetic regulation at a GWAS locus. A notable that bind to these regions are repressors (Fig. 5b). example is the DNAm that reside in the promoter region within One further example is the regulatory mechanism of a gene FADS2 for RA (Fig. 4a), where the SNP-association signals are (i.e., LIPA) associated with CAD. A previous study has shown significant and consistent across data from GWAS, eQTL and that an exonic variant rs1051338 in LIPA decreases lysosomal mQTL studies, implying a plausible biological pathway. We acid lipase levels and activity in lysosomes49. Interestingly, we detected two gene expression probes (ILMN_2075065 and found that the top associated DNAm is located in an enhancer ILMN_1670134) tagging FADS1 and FADS2 that are significantly region and that the top mQTL is only 110 bp away from associated with RA. The SNP-association signals for gene rs1051338 (Supplementary Fig. 7). expression coincide with those for the five nearby DNAm probes We have shown, as a proof-of-principle, the examples above (cg06781209, cg21709803, cg01400685, cg25324164 and where the functional genes and the likely mechanisms are known cg14911132) that are mapped to the two genes, suggesting that or strongly supported by previous studies. However, these the two genes are co-expressed (estimated correlation of 0.56 in examples are rare. We have identified 66 genes and 149 DNAm the GTEx blood samples) and potentially regulated by these five that show pleiotropic associations with 12 traits. These results DNAm sites. The co-regulation hypothesis is supported by the demonstrate the power of re-analysing and integrating different evidence that both FADS1 and FADS2 are involved in the levels of data from published studies of large sample size to make metabolism of omega-6 and omega-3 fatty acids46 (Supplemen- novel discoveries and to shed light on putative causal genes and tary Data 7). Indeed, the five DNAm sites that are significantly regulatory elements that can be prioritized in functional studies. associated with FADS1 and FADS2 are in the promoter and nearby enhancer regions of the two genes according to the Pleiotropic effects of DNAm, transcripts and traits. Previous chromatin state annotations from the REMC28 reference samples. studies have used GWAS data to characterize genetic overlap Furthermore, rs968567 is the top associated SNP in both the between human complex traits and common diseases43,50. In our eQTL analysis of the two genes and the mQTL analysis of the five integrative analysis, we detected several DNAm sites and/or DNAm (Supplementary Data 6). This variant is only 567 bp away transcripts that show pleiotropic effects on multiple traits (Sup- from one of the five DNAm sites and has been identified in a plementary Data 9), especially traits that are known to be recent study47 as the binding site of SREBF2. genetically correlated. Shown in Supplementary Fig. 8 is an The negative estimates of methylation effects on gene expression example in which the ARL6IP4 locus shows pleiotropic effects on from SMR (b = −0.812, Supplementary Data 6) indicate a SMR EY and SCZ (estimated r from genome-wide SNPs51 of 0.10, P repressing role of DNAm on target gene expression. With all the − g = 8.47 × 10 3), where the SNP-association signals from the evidence above, we hypothesize a mechanism in which the genetic mQTL and eQTL studies are highly consistent with those from variant (rs968567) at the promoter of FADS2 gene alters the GWAS for EY and SCZ. We have examples where the effect sizes DNAm, which disrupts the binding of transcription factor of a transcript on two traits are in the same direction. For (SREBF2), down-regulating the expression of the FADS2 gene and example, height shares genes (SLC22A4 and SLC22A5) with CD, therefore decreasing the risk of RA47 (Fig. 4b). and the effects of transcripts on the two traits are in the same The ATG16L1 gene for CD is another interesting example of direction, which is accordant with the positive genetic correlation inferring a plausible mechanism of genetic mediation from the (r = 0.06, P = 0.12) between two traits although the estimate of SMR analysis of omics data. As shown in Fig. 5a, there are five g r is very low and not significant. We also observed pleiotropic DNAm sites that are significantly associated with ATG16L1 and g effects in the opposite direction for two negatively correlated CD, among which two are in the promoter region and three are in traits, e.g., height and low-density lipoprotein (LDL) (r = −0.09, an enhancer region within the open reading frame (ORF) of − g P = 8.02 × 10 3), at three genes (PLEC, GRINA and PARP10) ATG16L1. This enhancer is highly tissue-specific and is only (Supplementary Fig. 9). Among all 14 traits, height shares the present in the blood, thymus, digestive system and a few obscure largest number of genes with other traits, which is likely due to samples in REMC; most of these tissues are relevant to CD. An the relatively larger GWAS sample size and the polygenic archi- intronic variant rs2241880 in ATG16L1, which has been tecture. Not surprisingly, the pairs of traits that share more than confirmed to be the causal variant for CD and has been validated three pleiotropic genes are those that have been shown to be at the protein level48, is only 56 bp away from one of the genetically correlated43, e.g., height and LDL (r = −0.09, P = associated DNAm sites in the tissue-specific enhancer region − g − 8.02 × 10 3), CD and UC (r = 0.54, P = 1.69 × 10 13), and EY (Supplementary Data 6). All the results point strongly towards a g − and height (r = 0.13, P = 3.82 × 10 6). Nevertheless, this con- mechanism of genetic regulation of gene expression initiated by g cordance depends on the sample sizes of the GWAS data. DNAm in the enhancer and mediated through enhancer–promoter interactions at the ATG16L1 locus for CD (Fig. 5b). Furthermore, the identified DNAm in the enhancer and Robustness and tissue specificity. Having identified many promoter regions all have positive effects on gene expression pleiotropic associations between DNAm, gene expression, and complex traits, we then sought to investigate how robust the

Fig. 4 Prioritizing genes and regulatory elements at the FADS1/FADS2 locus for rheumatoid arthritis (RA) with a plausible regulation mechanism. a Results 42 of SNP and SMR associations across mQTL, eQTL and GWAS. The top plot shows −log10(P-values) of SNPs from the GWAS meta-analysis for RA . The red diamonds and blue circles represent −log10(P-values) from SMR tests for associations of gene expression and DNAm probes with RA, respectively. The solid diamonds and circles are the probes not rejected by the HEIDI test. The yellow star indicates the previously reported causal variant rs968567. The second plot shows −log10(P-values) of the SNP association for gene expression probe ILMN_2075065 (tagging FADS2) from the CAGE eQTL study. The third plot shows −log10(P-values) of the SNP associations for DNAm probe cg06781209 from the mQTL study. The bottom plot shows 14 chromatin state annotations (indicated by colours) of 127 samples from REMC for different primary cells and tissue types (rows). b A hypothetical regulation mechanism. When the DNAm site in the promoter is unmethylated, the transcription factor SREBF2 (activator) binds to the promoter and enhances the transcription of the FADS2 gene. When the DNAm site is methylated (by the effect of the genetic variant rs968567 at the promoter), the binding of SREBF2 is disrupted and therefore the transcription of FADS2 is suppressed

NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications 7 ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0 results are across tissues and data sets. We used the same data- Genotype-Tissue Expression (GTEx)53 and the Brain eQTL filtering criteria (Methods) for a replication analysis using eQTL Almanac (Braineac)54. We performed the SMR analysis using summary data from whole-blood samples in Westra et al.26 and eQTL summary data of 5967 gene expression probes after quality for a tissue-specific analysis using eQTL summary data from controls from the Westra et al. study26, which has a larger sample brain samples in the Common Mind Consortium (CMC)52, size (n = 5311) than CAGE. We used CAGE for discovery

a ) ) ) ATG16L1 ATG16L1 SAG

cg19193136 cg12834730 cg25479499rs2241880cg07618928 cg23050873 ILMN_2365881ILMN_1725707 ( ILMN_1758619 ( ( 39 ) 29

20 GWAS or SMR P ( 10 10 pMSMR = 6.81e−07 −log 0 pESMR = 6.97e−06

) 20

eQTL ILMN_1725707 (ATG16L1) 13 P (

10 7 0

) −log 59 cg07618928 (ATG16L1) mQTL 39 P (

10 20 0 −log

ESC TssA iPSC Prom ES-deriv Tx

Blood & T-cell TxWk

HSC & B-cell TxEn Mesenchymal EnhA Epithelial EnhW

Brain Quies

Muscle DNase Heart ZNF/Rpts

Digestive Het

Other PromP PromBiv ENCODE ReprPC

ATG16L1 DGKD SCARNA5 DGKD SCARNA6

SAG

b 234.05 234.13 234.21 234.30 234.38 234.46 Mb

Unmethylated Enhancer

Repressor Transcription suppressed Transcription factors RNA polymerase

Promoter

Methylated Enhancer

Transcription Repressor

Transcription factors RNA polymerase

Promoter

8 NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0 ARTICLE because of its denser SNP coverage (SNPs in Westra were across DNAm, transcript and complex traits in our analysis. It is imputed to HapMap2 with a maximum of ~2.5 M, and only SNPs also consistent with our observation that the variance explained −5 fi with PeQTL <1×10 are available). For the signi cant by the top associated mQTL decreases dramatically from that for DNAm–transcript associations identified using the CAGE data, DNAm to gene expression and to phenotype (Fig. 7). the SMR estimates of the effect sizes of DNAm on transcripts In this study, we identified 7858 DNAm probes associated with were highly consistent with those estimated in the Westra data 2733 genes, of which 149 DNAm loci and the corresponding 66 (Supplementary Fig. 10) with an estimated correlation of 0.98. A putative target genes are associated with 12 complex traits and total of 24360 (63%) significant DNAm–transcript associations diseases, through pleiotropy. As expected, the DNAm that show detected using the CAGE data were significant at P < 2.90 × 10 pleiotropic associations with genes are enriched in the enhancer − SMR 7 in the analysis with the Westra eQTL data. Although there is a and promoter regions. Notably, 48–70% of the DNAm sites do sample overlap between CAGE and Westra, this analysis not show pleiotropic association with their closest genes, indi- demonstrates the robustness of the results across data sets gen- cating that a large proportion of DNAm sites are in distal func- erated from different studies. tional regions (e.g., enhancer) and regulate the target genes In addition, we used data from the blood samples as the probably through interactions of looping chromatins56. Further- discovery set because of the large sample sizes, which maximizes more, DNAm in enhancer and promoter regions is presumed to the power of discovery. To test whether eQTL and mQTL data be involved in the transcription factor binding process57, which from the blood sample are good representatives of those from the implicates the likely role of the causal variants as binding sites. most relevant tissue, we performed the SMR analysis for SCZ and Given the direction of DNAm effect on gene expression, we can EY using three eQTL data sets from the brain (i.e., CMC, GTEx further infer the role of the transcription factor as an activator or and Braineac). In total, there are nine genes (56%) replicated for repressor. In this study, we found that a substantial number of SCZ and seven genes (77%) replicated for EY in at least one of the DNAm sites (46%) have positive effects on their target genes, three brain eQTL data sets (Supplementary Data 10). These implying that the transcription factors that bind to the unme- replication rates are surprisingly high given the relatively small thylated DNA at these sites are likely to be repressors. We sample sizes of the brain eQTL studies (n = up to 467). Genes replicated several putative causal genes for complex traits repor- SNX19, NT5C2 and MAPK3 for SCZ and the genes ERCC8 and ted in previous studies (e.g., ATG16L1 for CD and LIPA for CAD) C18orf8 for EY (Supplementary Fig. 11) were replicated in at least 48,49. However, the causal genes for body mass index (BMI) (i.e., two data sets. MAPK3 was highlighted in a recent study55 as a IRX3 and IRX5) were not detected in our study, likely because the functional gene whose expression is up-regulated by a variant by effects of the FTO SNP on IRX3 and IRX5 are specific in primary disrupting chromatin activity, which is associated with a decrease preadipocytes (a minority group of adipose cells)8 whereas the in SCZ risk. Consistent with our previous result9, we again eQTL data used in our study were from peripheral blood. In detected the association of SNX19 with SCZ in the GTEx data. summary, based on the methylome-transcriptome map, the Figure 6 shows that the eQTL association signals for SNX19 from incorporation of DNAm–trait and transcript–trait association the three data sets of two different tissues are consistent with the information contributes to an understanding of the mediation GWAS association signals for SCZ, despite that the gene mechanism of genetic variants on complex traits and diseases. expression levels were measured by different technologies (gene Under the analytical paradigm introduced in this study, it is expression microarray for Braineac and RNA sequencing for straightforward to incorporate other types of molecular traits GTEx). We further identified five DNAm probes that were from specific cells. For instance, RNA splicing can be an associated with both SNX19 and SCZ at promoter and enhancer important source in the differentiation of gene expression. One regions close to SNX19 (Fig. 6). Together with that SNX19 is the recent study showed that splicing QTLs (sQTLs) contribute only annotated gene underlying this GWAS locus, the results remarkably to complex trait variation, roughly as much as that suggest a likely mechanism that a genetic variant (of course, may from eQTLs58. An integrative analysis that combines sQTL data not be the GWAS top SNP) alters the methylation level of a DNA is expected to shed light on the underlying mechanism of dif- element that regulates the expression level of SNX19, resulting in ferential expression. Other types of epigenetic data, such as a small difference in susceptibility to SCZ. chromatin phenotypes59, can be further used to detect epigenetic interactions and decipher gene regulations in three dimensions. Although a tissue-specific mechanism has been proposed to Discussion explain the differences in gene expression levels across different We introduced an integrative analysis based on the SMR & primary cells and tissues, we showed that a large proportion of HEIDI method to map DNAm sites to putative target genes and the detected genes for SCZ and EY using peripheral blood sam- further map both to a complex trait. We tested a model at each ples is replicated in brain tissues. Using the data in a single tissue GWAS locus to evaluate whether an SNP exerts an effect on the with a large sample size allows us to gain power in the discovery trait by altering the DNAm level, which regulates the expression of novel trait-associated genes and regulatory elements. On the level of a functional gene (Fig. 1a). This hypothetical model is other hand, the power of the SMR test largely depends on the supported by many examples of consistent SMR associations

Fig. 5 Prioritizing genes and regulatory elements at the ATG16L1 locus for Crohn’s disease (CD) with a plausible regulation mechanism. a Results of SNP 40 and SMR associations across mQTL, eQTL and GWAS. The top plot shows −log10(P-values) of SNP from the GWAS meta-analysis for CD . The red diamonds and blue circles represent –log10(P-values) from the SMR tests for associations of gene expression and DNAm probes with CD, respectively. The solid diamonds and circles represent the probes not rejected by the HEIDI test. The yellow star indicates the previously reported causal variant rs2241880.

The second plot shows −log10(P-values) of the SNP associations for gene expression probe ILMN_1725707 (tagging ATG16L1). The third plot shows −log10(P-values) of the SNP associations for DNAm probe cg07618928. The bottom plot shows 14 chromatin state annotations (indicated by colours) of 127 samples from REMC for different primary cells and tissue types (rows). b A hypothetical regulation mechanism. When the DNAm site in the enhancer is unmethylated, repressors can bind to the enhancer, decrease the activity of the promoter, and thus suppress the transcription of the ATG16L1 gene. When the DNAm site is methylated (by the effect of the genetic variant rs2241880 in the enhancer), the binding of repressors is disrupted, which prevents the transcription of ATG16L1 from suppression

NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications 9 ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0

)

(SNX19 ) ) SNX19 ( (SNX19

) 13 cg08069931 CAGE−ILMN_1788211BRAINEAC−3398482GTEx_Cortex−ENSG00000120451.6cg08912652 cg12179176 cg22079354 cg14779329

10 pMSMR = 6.81e−07 6 GWAS or SMR P ( pESMR = 6.97e−06

10 3

−log 0 75 CAGE−ILMN_1788211 (SNX19) 50 25 ) 0 eQTL

P 12 ( BRAINEAC−3398482 (SNX19) 10 8 4 −log 0 20 GTEx_Cortex−ENSG00000120451.6 (SNX19) 13 7 ) 0

mQTL 30 P

( cg08069931 (SNX19)

10 20 10 −log 0

ESC TssA iPSC Prom ES-deriv Tx

Blood & T-cell TxWk

HSC & B-cell TxEn Mesenchymal EnhA Epithelial EnhW

Brain Quies

Muscle DNase Heart ZNF/Rpts

Digestive Het

Other PromP PromBiv ENCODE ReprPC

LOC100507431

LOC103611081

SNX19

130.55 130.64 130.73 130.82 130.92 131.01 Chromosome 11 Mb

Fig. 6 Replication analysis in the SNX19 locus for schizophrenia (SCZ). The top plot shows −log10(P-values) of the SNPs from the GWAS meta-analysis for 36 SCZ . The red diamonds and blue circles represent −log10(P-values) from the SMR tests for associations of gene expression and DNAm probes with SCZ, respectively. The solid diamonds and circles represent the probes not rejected by the HEIDI test. Plot #2, #3 and #4 (red crosses) show −log10(P-values) of the eQTL for gene SNX19 from three independent data sets. The plot #5 (blue dots) shows −log10(P-values) of the mQTL for DNAm probe cg08069931. The bottom plot shows 14 chromatin state annotations (indicated by colours) of 127 samples from REMC for different primary cells and tissue types (rows)

10 NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0 ARTICLE

a b 0.8 0.04 GWAS (Height) 0.03 0.02 0.01 0.6 0

8.5 eQTL (ILMN_1735979) 5.7 0.4 2.8 0 Variance explained

0.2 Variance explained (%) 23 mQTL (cg14132834) 16 8 0.0 0 41.73 41.81 41.89 41.97 42.05 42.13 Trait Chromosome 19 (Mb) Methylation Expression

Fig. 7 The attenuation of effect sizes of genetic variants on methylation and gene expression towards complex traits. a Distribution of the variance explained in methylation, gene expression and trait phenotypes by the top mQTLs for the 149 DNAm that are significantly associated with both 66 transcripts and 12 traits. Of 225 DNAm–transcript pairs, there are 160 pairs for which the mQTL effect is higher than the eQTL effect (both in SD units). b An example of a genomic locus for height, where the SNP-association signals are consistent across mQTL, eQTL and GWAS, indicating a single shared underlying causal variant, but the variance explained decreases dramatically across these studies sample size of GWAS because of the small effect sizes of SNPs on Methods complex traits (Fig. 7). For example, no gene expression probe SMR and HEIDI test for pleiotropic association. The SMR test9 was developed to and only eight methylation probes were significantly associated test the association of an exposure (e.g., transcript) with an outcome (e.g., trait) using a genetic variant as the instrumental variable to remove non-genetic con- with T2D in the SMR test, likely a consequence of the limited founding. Let x be an exposure variable, y be an outcome variable, and z be an sample size of GWAS for T2D. In addition, only a subset of instrumental variable. What these variables refer to varies in different steps of our probes were used in the SMR analysis after the filtering criterion analysis. In the step of testing for a DNAm–gene association, x, y and z refer to that at least one SNP is significantly associated with the transcript DNAm, transcript and the top-associated mQTL, respectively. In the steps of −8 testing for a DNAm/transcript–trait association, x, y and z refer to DNAm/tran- (PeQTL <5×10 ). A proportion of transcripts may be further script, trait and the top-associated mQTL/eQTL, respectively. The Mendelian ^ missed due to the elimination of expression probes with incon- Randomization (MR) estimate of the effect of exposure on outcome (b ) is the ^ xy sistent gene mapping when combining eQTL data across multiple ratio of the estimated effect of instrument on exposure (bzx) and that on outcome ( ^ 9,60,61 studies (e.g., CAGE), resulting in a potential loss of power. Given bzy) : that future eQTL and mQTL studies will have larger sample sizes, ^ ¼ ^ =^ ; we expect to detect more genes and DNAm sites that show bxy bzy bzx pleiotropic associations with complex traits and diseases through ^ ^ our integrative analysis. where bzx and bzy are available from mQTL, eQTL or GWAS summary data. One There is a limitation of this study. That is, we performed the of the basic assumptions for MR is that the instrument should be strongly asso- – – ciated with exposure. We therefore only select the top associated eQTL/mQTL at P SMR & HEIDI analysis to detect DNAm gene, DNAm trait and −8 ^ – <5×10 as an instrument for an SMR analysis. The standard error (SE) of bxy gene trait associations separately, and focused on the association ^ ^ 9,61 can be computed from the SEs of bzx and bzy using the Delta method^ 2 . The signals that were consistent across the three analyses at a locus. fi ^ bxy  χ2 signi cance of bxy can therefore be assessed by the Wald test, i.e., SE 1. This strategy, however, is not optimal and potentially loses power A significant association detected by the SMR test above can result from either a because of thresholding the results by P-values in multiple steps. pleiotropic model (i.e., the exposure and the outcome are associated owing to a Further development of the method that integrates GWAS data single shared genetic variant) or a linkage model (i.e., there are two or more genetic variants in LD affecting the exposure and outcome independently). To distinguish and all the omics data in a single test is a priority in the future pleiotropy from linkage, the HEIDI (heterogeneity in dependent instruments) test (Supplementary Note 4). Despite this limitation, our study pro- was developed to test against the null hypothesis that there is a single causal variant vides a statistically elegant analytical paradigm that integrates underlying the association (pleiotropy)9. In brief, we use multiple SNPs (e.g., the genomic, transcriptomic and epigenomic information to under- top 20 associated mQTLs/eQTLs after pruning SNPs for either too strong or too weak LD9)inacis region to detect whether the association patterns across the stand the regulatory mechanism of polygenic effects for complex region are homogeneous or not (a homogenous pattern indicates a single shared ^ traits. The methylome-transcriptome pleiotropic map constructed causal variant). That is, we assess the difference between b estimated at the top ^ ^ xy associated instrument b ð Þ and b estimated at a less significant instrument in this study will be helpful for researchers to query the putative ^ xy 0 xy target gene(s) of a given DNAm site, and the database (URLs) can bxyðÞ i : be expanded in the future with more data. The analyses we ^ ¼ ^ À ^ : perform here can be applied to any other source of omics data di bxyðÞ i bxyðÞ0 and any other phenotype or disease, even in a different species, using the tools that we have made available (URLs). The putative no b  ðÞ; b ¼ ^ ; ÁÁÁ; ^ functional genes and regulatory elements identified in this study For multiple SNPs, d MVN d V , where d d1 dm and V is the provide important leads for designing studies in the future to covariance matrix because most SNPs are likely in LD9. In this study, we estimated LD from the Health and Retirement Study (HRS)62 with SNP data imputed to the understand the mechanism of genetic regulation of genes affect- 1000 Genomes Project (1KGP)63. Under the null hypothesis (pleiotropic model), d ing common complex diseases. = 0. We use an approximate multivariate approach to test whether d is significantly ^ deviated from 0 (equivalent to a test for evidence of heterogeneity in bxy estimated at multiple instruments that are likely in LD)9,64. We reject the SMR associations that show significant heterogeneity as detected by the HEIDI test.

NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications 11 ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0

Multi-SNP-based SMR test. We can extend the SMR method by including under the null of no enrichment, we repeated the analysis with the same number of multiple SNPs at a cis-mQTL/eQTL locus in the SMR test. We call this method as DNAm probes randomly sampled from all probes, matching their variance in SMR-multi and have implemented in the SMR software (URLs). First, we select all DNAm levels with each of the transcript-associated DNAm probes. This procedure − the SNPs with P<5×10 8 in the cis region (e.g. within 500 kb of the probe) and was performed over 500 times, and the fold enrichment was calculated by a remove SNPs in very high LD with the top associated SNP (e.g. LD r2 > 0.9). We comparison of the observed value with the mean from 500 null replicates. To assess fi then estimate bxy(i) at each of the SNPs and combine the bxy estimates of all the the signi cance of the enrichment test, we estimated the standard error of the fold SNPs in a single test using an approximate set-based test developed previously64 enrichment from 100 null replicates. = accounting for LD among SNPs. In brief, let z {zi} be a vector of z-statistics for all the instruments at a locus, where z  b ðÞ=SE. Under the null hypothesis, z i xy i Tissue specificity analysis from brain samples. To detect the tissue specificity follows a multivariate normal distribution, i.e. z ~ MVN(0, R), where R is the LD for brain-related traits or disease, we performed the SMR analysis using eQTL correlation matrixP for the SNPs at a locus. For the significance test, we use the test- ¼ 2 summary data from the Common Mind Consortium (CMC) and Genotype-Tissue statistic T i zi , which does not have an explicit cumulative density function Expression (GTEx). Gene expression levels of these two data sets were quantified but can be approximated by the Saddlepoint method64,65. by RNA sequencing, and the annotation was from GENCODE Version 19 (ref.72). The GTEx summary data are available at dbGaP (URLs), and the sample size of the Data used for the integrative analysis and quality controls. The eQTL brain tissue is up to 125. It consists of up to 24762 transcripts in 10 brain regions summary-level statistics were from the CAGE29 data, which consists of a total of and up to ~6.5 million 1KGP-imputed variants. We carried out the SMR analysis seven distinct cohorts with gene expression levels measured in peripheral blood. for each brain region and counted the result as a successful replication if a sig- fi This unified dataset comprises 2765 individuals (predominantly Europeans), 38624 ni cant SMR result from the analysis with the CAGE data was detected in any of normalized gene expression probes and ~8 million SNPs. The eQTL effects were in the 10 brain regions after correcting for multiple tests. We further validated the standard deviation (SD) units of transcription levels. For replication, we used eQTL results using the CMC data, which were generated from 467 brain samples. Gene summary-level data from the Westra et al.26 study, which is a meta-analysis of 5311 expression levels were measured across multiple regions of the whole brain. The blood samples based on SNP data imputed to HapMap2 (ref.66). Gene expression eQTL summary data were from the analysis of 16423 transcripts and ~2 million in these two data sets was mainly measured based on Illumina HumanHT-12 v3.0 1KPG-imputed variants. We also performed the SMR analysis with Braineac eQTL chip, and the annotations were from Illumina based on hg18 (see URLs). We used summary data, which are available in the public domain (see URLs) and comprise the function illuminaHumanv4fullReannotation in Bioconductor67 to update the 26493 gene expression probes in ten brain regions and ~6 million genetic variants annotation of the gene expression probes based on hg19. Next, we selected the on up to 134 individuals. probes with good tagging of gene expression (i.e., a probe annotation quality score of at least 'good'68) and at least one cis-eQTL passing the significant threshold URLs. For M2Tdb, see http://cnsgenomics.com/shiny/M2Tdb/. For SMR, see −8 (PeQTL <5×10 ). The eQTL effect sizes were not available in Westra data, but http://cnsgenomics.com/software/smr. For GTEx, see http://www.gtexportal.org/ they were estimated from z-statistics using the method described in Zhu et al.9. home/. For CMC, see https://www.synapse.org/CMC. For Braineac, see http:// Probes at the major histocompatibility complex (MHC) region were excluded www.braineac.org/. For the Annotation file for the Illumina HumanHT-12 v3.0 because of the complexity of this region. We also removed probes with SNPs in the Gene Expression BeadChip, see https://support.illumina.com/downloads/ hybridization sequences. Finally, we retained 9538 and 5967 gene expression humanht-12_ v3_product_files.html. For the probes from CAGE and Westra, respectively, for analysis. Annotation file for the Illumina HumanMethylation450 BeadChip, see. https:// The mQTL data were from the Brisbane Systems Genetics Study30 (n = 614) www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL16304. and Lothian Birth Cohorts of 1921 and 193631 (n = 1366). All the individuals are of European descent. The methylation states of all the samples were measured based Data availability. The summary-level eQTL data from the CAGE and mQTL data on Illumina HumanMethylation450 chips, consisting of 485512 DNAm probes. from the meta-analyses of LBC and BSGS are available at http://cnsgenomics.com/ For these probes, an enhanced annotation from Price et al.69 (see URLs) was used software/smr/#Download. All the other data sets used in this study are from the to annotate the closest genes of DNA methylation probes. The summary-level statistics of the two independent cohorts were obtained from a recent mQTL public domain. The software tools are available at the URLs above. study27, where the mQTL effects were in SD units of DNAm levels. There were ~8 million genetic variants after quality control and 55000 mQTL were cross- Received: 17 August 2017 Accepted: 7 February 2018 replicated in the two data sets at a very stringent significance level. We performed a meta-analysis of the two cohorts and identified 94338 methylation probes with at −8 least a cis-mQTL at PmQTL <5×10 . We excluded methylation probes in MHC regions and probes with SNPs in the hybridization sequences, resulting in 73973 probes retained for analysis. We included in the analysis 14 complex traits (including disease). They are height, BMI, waist–hip ratio adjusted by BMI (WHRadjBMI), high-density References lipoprotein, LDL, thyroglobulin, EY, RA, SCZ, CAD, type 2 diabetes (T2D), CD, 1. Visscher, P. M., Brown, M. A., McCarthy, M. I. & Yang, J. Five years of GWAS UC and AD. The GWAS summary data were from the latest GWAS meta-analyses discovery. Am. J. Hum. Genet. 90,7–24 (2012). (predominantly in Europeans) at the time when the analyses were performed, 2. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait where the sample sizes are up to 339224 (Supplementary Data 2). The number of associations. Nucleic Acids Res. 42, D1001–D1006 (2014). SNPs varies from 2.5 to 9.4 million across traits. The SNP effects on quantitative 3. Yang, J. et al. Ubiquitous polygenicity of human complex traits: genome-wide traits were in SD units, whereas those on disease traits (e.g., case-control design) analysis of 49 traits in Koreans. PLOS Genet. 9, e1003355 (2013). were expressed as log odds-ratios. For those traits (i.e., RA and SCZ) for which the 4. Sudlow, C. et al. UK Biobank: an open access resource for identifying the SNP allele frequencies are not available, we estimated the allele frequencies using causes of a wide range of complex diseases of middle and old Age. PLOS Med. the 1KGP-imputed HRS data. We further excluded variants with a minor allele 12, e1001779 (2015). frequency <0.01. 5. Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017). 6. Wu, Y., Zheng, Z., Visscher, P. M. & Yang, J. Quantifying the mapping – Chromatin state annotation. The epigenomic annotations as shown in Figs 4 6 precision of genome-wide association studies using whole-genome sequencing are from the Roadmap Epigenomics Mapping Consortium (REMC), which is data. Genome Biol. 18, 86 (2017). publicly available for download at http://compbio.mit.edu/roadmap/. We use these 7. Farh, K. K. et al. Genetic and epigenetic fine mapping of causal autoimmune annotations to indicate the regulatory elements and cell and tissue types where the disease variants. Nature 518, 337–343 (2015). functional DNAm sites and causal variants might act. The chromatin state data of 8. Claussnitzer, M. et al. FTO obesity variant circuitry and adipocyte browning in 127 epigenomes were profiled by the Roadmap Epigenomics Project28 and humans. N. Engl. J. Med. 373, 895–907 (2015). ENCODE Project70 in a number of primary cells and tissue types (URLs). The 25 9. Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies chromatin states were predicted by ChromHMM71 based on the imputed data of 48 – 12 histone-modification marks. The 14 main functional categories such as those predicts complex trait gene targets. Nat. Genet. , 481 487 (2016). 10. Tewhey, R. et al. Direct identification of hundreds of expression-modulating shown in Fig. 4 and the enrichment analysis below were derived from the 25 – chromatin states by combining functionally relevant annotations to a single variants using a multiplexed reporter assay. Cell 165, 1519 1529 (2016). functional category (Supplementary Table 1). 11. Arnold, C. D. et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077 (2013). 12. Korkmaz, G. et al. Functional genetic screens for enhancer elements in the Enrichment test of functional categories. We used the chromatin state anno- using CRISPR-Cas9. Nat. Biotech. 34, 192–198 (2016). tation of 23 blood cell types from 127 epigenomes described above to test whether 13. Yang, J. et al. FTO genotype is associated with phenotypic variability of body the transcript-associated DNAm are enriched in any functional region(s). We mass index. Nature 490, 267–272 (2012). quantified the proportion of overlap between the transcript-associated DNAm probes and 14 main functional categories (see Fig. 2). To calibrate the distribution

12 NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0 ARTICLE

14. Smemo, S. et al. Obesity-associated variants within FTO form long-range 46. Simopoulos, A. P. Genetic variants in the metabolism of omega-6 and omega- functional connections with IRX3. Nature 507, 371–375 (2014). 3 fatty acids: their role in the determination of nutritional requirements and 15. Edwards, SL., Beesley, J., French, JD. & Dunning, AM. Beyond GWASs: chronic disease risk. Exp. Biol. Med. 235, 785–795 (2010). illuminating the dark road from association to function. Am. J. Hum. Genet. 47. Zhernakova, D. V. et al. Identification of context-dependent expression 93, 779–797 (2013). quantitative trait loci in whole blood. Nat. Genet. 49, 139–145 (2017). 16. Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of 48. Murthy, A. et al. A Crohn’s disease variant in Atg16l1 enhances its genetic association studies using summary statistics. PLOS Genet. 10, degradation by caspase 3. Nature 506, 456–462 (2014). e1004383 (2014). 49. Morris, G. E. et al. Coronary artery disease–associated LIPA coding variant 17. Gamazon, E. R. et al. A gene-based association method for mapping traits rs1051338 reduces lysosomal acid lipase levels and activity in lysosomes. using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015). Arterioscler. Thromb. Vasc. Biol. 37, 1050 (2017). 18. Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide 50. Sivakumaran, S. et al. Abundant pleiotropy in human complex diseases and association studies. Nat. Genet. 48, 245–252 (2016). traits. Am. J. Hum. Genet. 89, 607–618 (2011). 19. Hannon, E., Weedon, M., Bray, N., O’Donovan, M. & Mill, J. Pleiotropic 51. Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases effects of trait-associated on DNA methylation: utility for and traits. Nat. Genet. 47, 1236–1241 (2015). refining GWAS loci. Am. J. Hum. Genet. 100, 954–959 (2017). 52. Fromer, M. et al. Gene expression elucidates functional impact of polygenic 20. Schubeler, D. Function and information content of DNA methylation. Nature risk for schizophrenia. Nat. Neurosci. 19, 1442–1453 (2016). 517, 321–326 (2015). 53. The GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot 21. Wahl, S. et al. Epigenome-wide association study of body mass index, and the analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015). adverse outcomes of adiposity. Nature 541,81–86 (2017). 54. Ramasamy, A. et al. Genetic variability in the regulation of gene expression in 22. Liu, Y. et al. Epigenome-wide association data implicate DNA methylation as ten regions of the human brain. Nat. Neurosci. 17, 1418–1428 (2014). an intermediary of genetic risk in rheumatoid arthritis. Nat. Biotech. 31, 55. Gusev, A. et al. Transcriptome-wide association study of schizophrenia and 142–147 (2013). chromatin activity yields mechanistic disease insights. Preprint at https://doi. 23. Chambers, J. C. et al. Epigenome-wide association of DNA methylation org/10.1101/067355 (2016). markers in peripheral blood from Indian Asians and Europeans with incident 56. Whalen, S., Truty, R. M. & Pollard, K. S. Enhancer-promoter interactions are type 2 diabetes: a nested case-control study. Lancet Diabetes Endocrinol. 3, encoded by complex genomic signatures on looping chromatin. Nat. Genet. 526–534 (2015). 48, 488–496 (2016). 24. Shlyueva, D., Stampfel, G. & Stark, A. Transcriptional enhancers: from 57. Zhu, H., Wang, G. & Qian, J. Transcription factors as readers and effectors of properties to genome-wide predictions. Nat. Rev. Genet. 15, 272–286 (2014). DNA methylation. Nat. Rev. Genet. 17, 551–565 (2016). 25. Westra, H.-J. et al. Systematic identification of trans eQTLs as putative drivers 58. Li, Y. I. et al. RNA splicing is a primary link between genetic variation and of known disease associations. Nat. Genet. 45, 1238–1243 (2013). disease. Science 352, 600–604 (2016). 26. McRae, A. F. et al. Identification of 55,000 replicated DNA methylation 59. Grubert, F. et al. Genetic control of chromatin states in humans involves local QTL and their role in disease. Preprint at https://doi.org/10.1101/ and distal chromosomal interactions. Cell 162, 1051–1065 (2015). 166710 (2017). 60. Lawlor, D. A., Harbord, R. M., Sterne, J. A., Timpson, N. & Davey Smith, G. 27. Roadmap Epigenomics Consortium. Integrative analysis of 111 reference Mendelian randomization: using genes as instruments for making causal human epigenomes. Nature 518, 317–330 (2015). inferences in epidemiology. Stat. Med. 27, 1133–1163 (2008). 28. Lloyd-Jones, L. R. et al. The of gene expression in 61. Burgess, S., Small, D. S. & Thompson, S. G. A review of instrumental variable peripheral blood. Am. J. Hum. Genet. 100, 371 (2017). estimators for Mendelian randomization. Stat. Methods Med. Res. doi: 29. Powell, J. E. et al. The Brisbane Systems Genetics Study: genetical genomics 10.1177/0962280215597579 (2015). meets complex trait genetics. PLoS ONE 7, e35430 (2012). 62. Sonnega, A. et al. Cohort profile: the Health and Retirement Study (HRS). Int. 30. Chen, B. H. et al. DNA methylation-based measures of biological age: meta- J. Epidemiol. 43, 576–585 (2014). analysis predicting time to death. Aging (Albany NY) 8, 1844–1865 (2016). 63. Abecasis, G. R., Altshuler, D., Auton, A., Brooks, L. D. & Durbin, R. M. A map 31. Rakyan, V. K., Down, T. A., Balding, D. J. & Beck, S. Epigenome-wide of human genome variation from population-scale sequencing. Nature 467 association studies for common human diseases. Nat. Rev. Genet. 12, 529–541 1061–1073 (2010). (2011). 64. Bakshi, A. et al. Fast set-based association analysis using summary data from 32. Musunuru, K. et al. From noncoding variant to phenotype via SORT1 at the GWAS identifies novel gene loci for human complex traits. Sci. Rep. 6, 32894 1p13 cholesterol locus. Nature 466, 714–719 (2010). (2016). 33. Shungin, D. et al. New genetic loci link adipose and insulin biology to body fat 65. Kuonen, D. Miscellanea. Saddlepoint approximations for distributions of distribution. Nature 518, 187–196 (2015). quadratic forms in normal variables. Biometrika 86, 929–935 (1999). 34. Wood, A. R. et al. Defining the role of common variation in the genomic and 66. Gibbs, R. A., Belmont, J. W., Boudreau, A., Leal, S. M. & Hardenbol, P. A biological architecture of adult human height. Nat. Genet. 46, 1173–1186 haplotype map of the human genome. Nature 437, 1299–1320 (2005). (2014). 67. Gentleman, R. C. et al. Bioconductor: open software development for 35. Okbay, A. et al. Genome-wide association study identifies 74 loci associated computational biology and bioinformatics. Genome Biol. 5, R80 (2004). with educational attainment. Nature 533, 539–542 (2016). 68. Barbosa-Morais, N. L. et al. A re-annotation pipeline for Illumina BeadArrays: 36. Schizophrenia Working Group of the Psychiatric Genomics Consortium. improving the interpretation of gene expression data. Nucleic Acids Res. 38, Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, e17 (2010). 421–427 (2014). 69. Price, M. E. et al. Additional annotation enhances potential for biologically- 37. Lambert, J. C. et al. Meta-analysis of 74,046 individuals identifies 11 new relevant analysis of the Illumina Infinium HumanMethylation450 BeadChip susceptibility loci for Alzheimer’s disease. Nat. Genet. 45, 1452–1458 (2013). array. Epigenet. Chromatin 6, 4 (2013). 38. Global Lipids Genetics Consortium et al. Discovery and refinement of loci 70. Encode Project Consortium. An integrated encyclopedia of DNA elements in associated with lipid levels. Nat. Genet. 45, 1274–1283 (2013). the human genome. Nature 489,57–74 (2012). 39. Nikpay, M. et al. A comprehensive 1,000 Genomes-based genome-wide 71. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and association meta-analysis of coronary artery disease. Nat. Genet. 47, characterization. Nat. Methods 9, 215–216 (2012). 1121–1130 (2015). 72. Harrow, J. et al. GENCODE: the reference human genome annotation for The 40. Liu, J. Z. et al. Association analyses identify 38 susceptibility loci for ENCODE Project. Genome Res. 22, 1760–1774 (2012). inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet. 47, 979–986 (2015). 41. Morris, A. P. et al. Large-scale association analysis provides insights into the Acknowledgements genetic architecture and pathophysiology of type 2 diabetes. Nat. Genet. 44, This research was supported by the Australian Research Council (DP160101343, 981 (2012). DP160102400 and DP160101056), the Australian National Health and Medical Research 42. Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and Council (1107258, 1078037, 1113400, 1078901 and 1083656), the US National Institutes drug discovery. Nature 506, 376–381 (2014). of Health (R01 MH100141 and R21 ES025052), the Sylvia & Charles Viertel Charitable 43. Pickrell, J. K. et al. Detection and interpretation of shared genetic influences Foundation, and the UK’s Medical Research Council (MR/K026992/1). This study makes on 42 human traits. Nat. Genet. 48, 709–717 (2016). use of data from dbGaP (accessions: phs000428.v1.p1 and phs000424.v1.p1). A full list of 44. Knox, C. et al. DrugBank 3.0: a comprehensive resource for ‘Omics’ research acknowledgements to these data sets can be found in the Supplementary Note 5. on drugs. Nucleic Acids Res. 39, D1035–D1041 (2011). 45. Zhu, F. et al. Therapeutic target database update 2012: a resource for facilitating target-oriented drug discovery. Nucleic Acids Res. 40, Author contributions D1128–D1136 (2012). J.Y. conceived and designed the study. Y.W. performed statistical analyses under the assistance and guidance from J.Z., F.Z., Z.Z, T.Q., Z.L.Z, L.R.L., R.E.M., N.G.M., G.W.M.,

NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications 13 ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0

I.J.D., N.R.W., P.M.V., A.F.M. and J.Y.. J.Y., J.Z. and Y.W. wrote the manuscript with the Open Access This article is licensed under a Creative Commons participation of all authors. Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give Additional information appropriate credit to the original author(s) and the source, provide a link to the Creative Supplementary Information accompanies this paper at https://doi.org/10.1038/s41467- Commons license, and indicate if changes were made. The images or other third party ’ 018-03371-0. material in this article are included in the article s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the Competing interests: The authors declare no competing interests. article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from Reprints and permission information is available online at http://npg.nature.com/ the copyright holder. To view a copy of this license, visit http://creativecommons.org/ reprintsandpermissions/ licenses/by/4.0/.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. © The Author(s) 2018

14 NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications