A Statistical Framework for Cross-Tissue Transcriptome-Wide Association Analysis
Total Page:16
File Type:pdf, Size:1020Kb
UCLA UCLA Previously Published Works Title A statistical framework for cross-tissue transcriptome-wide association analysis. Permalink https://escholarship.org/uc/item/45f4g6k0 Journal Nature genetics, 51(3) ISSN 1061-4036 Authors Hu, Yiming Li, Mo Lu, Qiongshi et al. Publication Date 2019-03-01 DOI 10.1038/s41588-019-0345-7 Peer reviewed eScholarship.org Powered by the California Digital Library University of California TECHNICAL REPOrt https://doi.org/10.1038/s41588-019-0345-7 A statistical framework for cross-tissue transcriptome-wide association analysis Yiming Hu 1,17, Mo Li1,17, Qiongshi Lu 2,17, Haoyi Weng3, Jiawei Wang4, Seyedeh M. Zekavat 5,6,7, Zhaolong Yu4, Boyang Li 1, Jianlei Gu8, Sydney Muchnik9, Yu Shi1, Brian W. Kunkle 10, Shubhabrata Mukherjee 11, Pradeep Natarajan 6,7,12,13, Adam Naj14,15, Amanda Kuzma15, Yi Zhao15, Paul K. Crane11, Alzheimer’s Disease Genetics Consortium16, Hui Lu8 and Hongyu Zhao 1,4,8,9* Transcriptome-wide association analysis is a powerful approach to studying the genetic architecture of complex traits. A key component of this approach is to build a model to impute gene expression levels from genotypes by using samples with matched genotypes and gene expression data in a given tissue. However, it is challenging to develop robust and accurate imputation models with a limited sample size for any single tissue. Here, we first introduce a multi-task learning method to jointly impute gene expression in 44 human tissues. Compared with single-tissue methods, our approach achieved an average of 39% improve- ment in imputation accuracy and generated effective imputation models for an average of 120% more genes. We describe a summary-statistic-based testing framework that combines multiple single-tissue associations into a powerful metric to quantify the overall gene–trait association. We applied our method, called UTMOST (unified test for molecular signatures), to multiple genome-wide-association results and demonstrate its advantages over single-tissue strategies. enome-wide association studies (GWAS) have successfully between imputed gene expression and the trait of interest14,15. These identified numerous single-nucleotide polymorphisms (SNPs) methods have gained popularity in the past two years, owing to Gassociated with complex human traits and diseases. Despite their capability to effectively use signals from multiple eQTL with these successes, important problems remain in statistical power and moderate effects and to reduce the effect of reverse causality in biological interpretation of GWAS results2. In particular, the com- expression–trait association analysis. The applications of these plex architecture of linkage disequilibrium and context-dependent methods have led to novel insights into the genetic basis of many regulatory machinery in the genome hinder the ability to accurately diseases and traits16–18. identify disease genes from GWAS, thereby raising challenges in Despite these successes, existing methods have several limita- downstream functional validation and therapeutics develop- tions. First, owing to the tissue-dependent nature of transcription ment. Recently, large-scale consortia, such as the Genotype-Tissue regulation, existing methods train separate imputation models for Expression (GTEx) project3,4, have generated matched genotype different tissues. This practice ignores the similarity in transcrip- and expression data for various human tissues. These rich datasets tion regulation across tissues, thereby limiting the effective sample have provided great insights into the mechanisms of cross-tissue sizes for tissues that are difficult to acquire. Second, a hypothesis- transcriptional regulation and accelerated discoveries for expres- free search across genes and tissues increases the burden of mul- sion quantitative trait loci (eQTL)4–7. In addition, integrating eQTL tiple testing and thus reduces statistical power. Pinpointing a subset information in genetic association analysis has become an effective of tissues according to prior knowledge may resolve this issue way to bridge SNPs, genes, and complex traits. Many methods have to some extent. However, for many complex traits, biologically been developed to co-localize eQTL with loci identified in GWAS relevant tissues are unknown. Further, reports have shown that to identify candidate risk genes for complex traits8–13. Two recent eQTL with large effects tend to regulate gene expression in mul- studies have addressed this issue through an innovative approach tiple tissues4. Genetic correlation analysis has also suggested sub- that is sometimes referred to as transcriptome-wide association stantial sharing of local expression regulation across tissues19. analysis. First, on the basis of an externally trained imputation This would inevitably result in statistically significant associations model, gene expression is imputed by using genotype informa- in tissues irrelevant to the trait of interest, a phenomenon that has tion in GWAS samples. Next, gene-level association is assessed been extensively discussed recently20. Jointly analyzing data from 1Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA. 2Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA. 3Division of Biostatistics, The Jockey Club School of Public Health and Primary Care, The Chinese University of Hong Kong, Shatin, NT, Hong Kong. 4Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA. 5Yale School of Medicine, New Haven, CT, USA. 6Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA. 7Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA. 8SJTU–Yale Joint Center for Biostatistics, Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiaotong University, Shanghai, China. 9Department of Genetics, Yale School of Medicine, New Haven, CT, USA. 10John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA. 11Department of Medicine, University of Washington, Seattle, WA, USA. 12Department of Medicine, Harvard Medical School, Boston, MA, USA. 13Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA. 14Center for Clinical Epidemiology and Biostatistic, and the Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA. 15Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA. 16A list of members and affiliations appears in the Supplementary Note. 17These authors contributed equally: Yiming Hu, Mo Li, Qiongshi Lu. *e-mail: [email protected] 568 NATURE GENETICS | VOL 51 | MARCH 2019 | 568–576 | www.nature.com/naturegenetics NATURE GENETICS TECHNICAL REPOrt information and matched expression data from 44 tissues in GTEx. Next, we test associations between the trait of interest and imputed expression in each tissue. Finally, a cross-tissue test is performed for each gene to summarize single-tissue association statistics into a powerful metric that quantifies the overall gene–trait association. Genotype matrix Here, we briefly introduce the UTMOST framework. All the statisti- (GTEx) ferent tissues cal details are discussed in the Methods. Expression in 450 individuals 450 individuals dif Reference panel We formulate cross-tissue expression imputation as a penalized multivariate regression problem: 44 tissues 15 K genes 11 million SNPs YXNP××=+NMBMP××εNP where N, M, and P denote the sample size in the training data, the cis-SNPs number of SNPs in the imputation model, and the total number of tissues, respectively. Because only a subset of tissues was collected across tissues for each gene fect sizes of SNPs from each individual, expression data in matrix Y were incomplete, Ef ain imputation model and sample sizes for different tissues were unbalanced. We estimate Tr Expression matrix (with unmeasured entries) B by minimizing the squared loss function with a lasso penalty on the columns (within-tissue effects) and a group-lasso penalty on the rows (cross-tissue effects) (Methods). GWAS summary statistics Linkage disequilibrium s estimated from 1,000 Genomes samples P P 1 2 1 B̂ =∥argmin YXii−∥B i +∥λ1 B i∥ fect sizes ∑∑⋅ 2 ⋅ 1 B i=1 2Ni i=1 Ni M Single-tissue across tissues +∥λ2 ∑ Bj⋅∥ association analysi 2 Estimated ef j=1 15 K genes where Yi, Xi, and Ni denote the observed expressions, genotypes, s and sample size of the ith tissue, respectively. Parameters λ1 and λ2 are tuned through cross-validation. Our cross-tissue imputation Single-tissue model does not assume eQTL to have the same effect direction associations 29 Cross-tissue across tissues. Instead, UTMOST uses a group-lasso penalty term covariance to encourage the presence of cross-tissue eQTL and improve the Cross-tissue estimation of their effects. association analysi 44 tissues 15 K genes In the second stage, we test the associations between the trait of interest and imputed gene expression in each tissue. We denote ̂ Gene-trait associations imputed gene expression in the ith tissue as EXii= B⋅i and test asso- ciations via a univariate regression model: Fig. 1 | UTMOST workflow. Gray and brown boxes denote input data and computed outcomes, respectively. TE=+αγiiii+ δ where αi is the intercept, γi is the effect size of gene expression, and multiple genetically