UQ722401 OA.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
ARTICLE DOI: 10.1038/s41467-018-03371-0 OPEN Integrative analysis of omics summary data reveals putative mechanisms underlying complex traits Yang Wu1, Jian Zeng 1, Futao Zhang1, Zhihong Zhu1, Ting Qi1, Zhili Zheng1,2, Luke R. Lloyd-Jones1, Riccardo E. Marioni3,4, Nicholas G. Martin5, Grant W. Montgomery 1, Ian J. Deary4, Naomi R. Wray 1,6, Peter M. Visscher 1,6, Allan F. McRae1 & Jian Yang 1,6 The identification of genes and regulatory elements underlying the associations discovered 1234567890():,; by GWAS is essential to understanding the aetiology of complex traits (including diseases). Here, we demonstrate an analytical paradigm of prioritizing genes and regulatory elements at GWAS loci for follow-up functional studies. We perform an integrative analysis that uses summary-level SNP data from multi-omics studies to detect DNA methylation (DNAm) sites associated with gene expression and phenotype through shared genetic effects (i.e., pleio- tropy). We identify pleiotropic associations between 7858 DNAm sites and 2733 genes. These DNAm sites are enriched in enhancers and promoters, and >40% of them are mapped to distal genes. Further pleiotropic association analyses, which link both the methylome and transcriptome to 12 complex traits, identify 149 DNAm sites and 66 genes, indicating a plausible mechanism whereby the effect of a genetic variant on phenotype is mediated by genetic regulation of transcription through DNAm. 1 Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD 4072, Australia. 2 The Eye Hospital, School of Ophthalmology & Optometry, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China. 3 Medical Genetics Section, Centre for Genomics and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh EH4 2XU, UK. 4 Centre for Cognitive Ageing and Cognitive Epidemiology, Department of Psychology, University of Edinburgh, 7 George Square, Edinburgh EH8 9JZ, UK. 5 Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, QLD 4029, Australia. 6 Queensland Brain Institute, The University of Queensland, Brisbane, QLD 4072, Australia. Yang Wu, Jian Zeng and Futao Zhang contributed equally to this work. Correspondence and requests for materials should be addressed to J.Y. (email: [email protected]) NATURE COMMUNICATIONS | (2018) 9:918 | DOI: 10.1038/s41467-018-03371-0 | www.nature.com/naturecommunications 1 ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-03371-0 enome-wide association studies (GWAS) have identified further annotate the DNAm sites that are associated with gene thousands of genetic variants associated with complex expression. G 1,2 traits (including diseases) . Given the polygenic nature In this study, we report an integrative analysis of summary of most complex traits3, more variants are expected to be dis- statistics from GWAS, eQTL and mQTL studies of the largest covered in the foreseeable near future because of the rapid sample sizes to date. We mapped 7858 DNAm sites to 2733 increase in sample size (due to large cohorts such as UK Biobank4 putative target genes in cis-regions and then linked them to 14 and consortia efforts). However, the mechanisms underlying the complex traits. associations remain largely unaddressed5. This is because the mapping resolution of GWAS is limited by the complicated Results linkage disequilibrium (LD) structure of the genome (i.e., the top Method overview. Our integrative analysis combines summary- associated variant at a locus is often not the causal variant)6,7 and level multi-omics data to prioritize gene targets and their reg- by the sampling variation in statistical tests due to finite sample ulatory elements. It consists of three steps, each of which relies on sizes. Furthermore, genetic variants can affect phenotype through the SMR & HEIDI method to test for pleiotropic association9 distal regulation of gene expression (i.e., the nearest gene to the – (Fig. 1 and Supplementary Note 1). First, we map the methylome GWAS top signal is often not the causal gene)7 9. High- to the transcriptome in cis-regions by testing the associations of throughput methods such as massively parallel reporter assay10, DNAm with their neighbouring genes (within 2 Mb of each self-transcribing active regulatory region sequencing (STARR- DNAm probe) using the top associated mQTL as the instru- seq)11, and CRISPR-Cas9 epigenome screens12 have been devel- mental variable (Methods). Next, we prioritize the trait-associated oped to identify the causal variants and functional elements genes by testing the associations of transcripts with the phenotype regulating gene expression. However, it remains challenging to using the top associated eQTL. Last, we prioritize the trait- pinpoint the gene(s) responsible for the association signal at a associated DNAm sites by testing associations of DNAm sites GWAS locus8,13,14. Thus, analytical approaches that somehow with the phenotype using the top associated mQTL. If the asso- mimic a functional study in silico are needed to prioritize plau- ciation signals are significant in all three steps, then we predict sible functional genes and/or regulatory elements at GWAS loci with strong confidence that the identified DNAm sites and target for further studies. genes are functionally relevant to the trait through the genetic Given that the complex trait-associated variants are pre- regulation of gene expression at the DNAm sites. The SMR & dominantly found in noncoding regions7,15, it is reasonable to HEIDI method assumes consistent LD between two samples. This assume that these variants affect phenotype through genetic assumption is generally satisfied by using data from samples of regulation of transcriptional output. To test whether the effect of the same ancestry (Supplementary Fig. 1). a genetic variant on a phenotype is mediated by transcription, we have developed powerful and flexible approaches, SMR (sum- mary-data-based Mendelian randomization) and HEIDI (het- Analytical mapping of methylome to transcriptome. To identify erogeneity in dependent instruments) tests9, and have target genes for the DNAm sites, we applied the SMR & HEIDI implemented them in an efficient software tool for genome-wide approach to test for pleiotropic associations between DNAm and analysis (URLs). The SMR & HEIDI approach uses summary- gene expression (denoted as M2T analysis) in large data sets from level data from GWAS and expression quantitative trait locus peripheral blood (Methods). The summary statistics of SNPs on (eQTL) studies to test if a transcript and phenotype are associated gene expression were from an eQTL analysis of 38624 probes in because of a shared causal variant (i.e., pleiotropy). Compared the CAGE (Consortium for the Architecture of Gene Expression) with most other methods for an integrative analysis of GWAS 28 data (n = 2765). The summary statistics of SNPs on DNAm – and eQTL data16 18, the SMR & HEIDI approach features the were from a meta-analysis of mQTL data26 from two independent ability to distinguish a pleiotropic model (i.e., gene expression data sets: the Brisbane Systems Genetics Study (BSGS, n = 614)29 and phenotype are associated owing to a single shared genetic and the Lothian Birth Cohorts (LBC, n = 1366)30. After quality variant) from a linkage model (i.e., there are two or more distant controls (Methods), we retained 9538 gene expression probes genetic variants in LD affecting gene expression and phenotype from the CAGE eQTL analysis and 73973 DNAm probes from independently)9. Moreover, like other summary-data-based the meta-mQTL analysis. In any of the specific analyses below, we methods16,18, the SMR & HEIDI approach allows the use of only included SNPs available and with consistent alleles in the GWAS and eQTL data from two independent studies; thus, the data sets used. statistical power can be boosted by using data from studies with By testing each DNAm probe for associations with genes very large sample sizes. within a 2 Mb distance in either direction, we detected 21938 The analytical framework that integrates GWAS and eQTL DNAm probes that showed significant SMR associations with data can be applied to incorporate other source of omics infor- 4804 gene expression probes (corresponding to 3991 unique 19 −8 mation, e.g., DNA methylation (DNAm) at CpG sites , which is genes) at PSMR < 2.26 × 10 correcting for ~2.21 million tests. an important epigenetic mechanism for gene regulation20. We further used the HEIDI approach to test against the null Methylome-wide association studies (MWAS) have identified a hypothesis that the association detected by the SMR test is due to 21–23 number of DNAm sites associated with complex traits . pleiotropy, rejecting those with PHEIDI < 0.01 (Supplementary However, it is not clear whether these associations are causal or Note 2). Finally, 10,588 associations between 7858 DNAm and driven by confounding factors. More importantly, it is not 3239 gene expression probes (corresponding to 2733 genes) were obvious which genes are regulated by DNAm because the reg- not rejected by the HEIDI test. On average, 3.0 DNAm sites were ulatory elements can be distant from the target genes24. Data associated with each gene, with a median value of 2.0 and a from recent eQTL and methylation quantitative trait locus standard deviation of 3.1. For the DNAm sites associated with the (mQTL) studies25,26 provide an opportunity to incorporate same gene, they tended to be located proximally (average mQTL data into