Predicting Tissue-Specific Gene Expression from Whole Blood Transcriptome

bioRxiv preprint doi: https://doi.org/10.1101/2020.05.10.086942; this version posted May 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Mahashweta Basu1#, Kun Wang2#, Eytan Ruppin2, Sridhar Hannenhalli2*

1 Institute for Genome Sciences, University of Maryland, Baltimore, MD
2 Cancer Data Science Lab, National Cancer Institute, NIH, Bethesda, MD
# These authors contributed equally
* Corresponding author: Sridhar Hannenhalli, Cancer Data Science Lab, National Cancer Institute, NIH, Bethesda, [email protected]

Key words: GTEx, whole blood transcriptome, gene expression, imputation, SNP, systems biology

Abstract

Complex diseases are systemic, largely mediated via transcriptional dysregulation in multiple tissues. Thus, knowledge of the tissue-specific transcriptome in an individual can provide important information about that individual’s health. Unfortunately, with a few exceptions such as blood, skin, and muscle, an individual’s tissue-specific transcriptome is not accessible through non-invasive means. However, due to shared genetics and regulatory programs between tissues, the transcriptome in blood may be predictive of those in other tissues, at least to some extent. Here, based on GTEx data, we address this question in a rigorous, systematic manner for the first time. We find that an individual’s whole blood gene expression and splicing profile can significantly predict tissue-specific expression levels (beyond demographic variables) for many genes.
On average, across 32 tissues, the expression of about 60% of the genes is significantly predictable from blood expression, with a maximum of 81% of the genes for the musculoskeletal tissue. Remarkably, the tissue-specific expression inferred from the blood transcriptome is almost as good as the actual measured tissue expression in predicting disease state for six different complex disorders, including hypertension and type 2 diabetes, substantially surpassing predictors built directly from the blood transcriptome. The code for our pipeline for tissue-specific gene expression prediction, TEEBoT, is provided, enabling others to study its potential translational value in other indications.

Introduction

Most common complex (non-Mendelian) diseases involve dysfunction in multiple tissues and organs. For instance, hypertension, which is characterized by elevated arterial pressure, involves metabolic changes in the heart, blood vessels, brain, kidney, etc. (Dai et al., 2018). Furthermore, tissue and organ dysfunction is primarily driven by, and reflected in, transcriptional changes in various tissues and organs (Cookson, Liang, Abecasis, Moffatt, & Lathrop, 2009). Indeed, transcriptional variance mediates, in large part, causal links between genotype and complex traits (Nica et al., 2010). Thus, knowledge of the tissue-specific gene expression profile can lead to a better understanding of disease etiology, enabling patient subtyping and assessment of drug efficacy (Basu et al., 2017; Grant, Pickard, Briskin, & Gutierrez-Ramos, 2002). However, except for easily accessible tissues such as blood, muscle, and skin, and in some cases biopsies, organ- and tissue-specific gene expression profiles cannot be readily obtained, presenting a challenge for transcription-based investigation of complex diseases. This limitation naturally gives rise to two important research questions: (1) to what extent can we predict an individual’s tissue-specific gene expression based solely on his/her whole blood gene expression, and (2) can the predicted tissue-specific expression reflect disease states better than what can be gleaned directly from the whole blood gene expression?

Cross-individual variance in tissue-specific gene expression can be explained, to some extent, by genotypic variance, forming the basis of expression Quantitative Trait Loci (eQTL). A widely used tool, PrediXcan, predicts the tissue-specific expression of a gene based on the gene’s eQTL SNPs (or eSNPs) (Gamazon et al., 2015). However, a priori, such a tool has limited scope, as it can only predict expression of the minority of genes (about 10% of genes on average across tissues) that have significant tissue-specific eSNPs (Gamazon et al., 2015). Furthermore, previous prediction models based on blood expression (e.g., J. Wang et al., 2016; Halloran et al., 2015) do not use the expression of other genes to predict the expression of a given gene, missing the opportunity to exploit potentially shared regulatory programs between tissues. In contrast, here we use an individual’s whole blood gene expression, including the whole blood splicing profile, to predict the expression of each gene in a particular tissue of the individual.

We proceed by building machine learning predictors based on GTEx data (Aguet et al., 2017) for 32 primary tissues having at least 65 samples each.
We find that an individual’s whole blood gene expression and splicing profile significantly inform tissue-specific expression levels (above and beyond various demographic variables) for substantial fractions of genes, with a mean of 59% of genes across 32 tissues and up to 81% for Muscle-Skeletal tissue, based on a likelihood ratio test with an FDR threshold of 5%. The splicing profile contributes further beyond the gene expression profile, and for the subset of genes having eSNPs, genotype makes further significant contributions. We find that the genes with highly predictable expression are not biased toward housekeeping genes and proportionally represent tissue-specific genes. Moreover, such genes tend to have a greater number of protein interaction partners, which may suggest a contribution by shared gene networks toward expression predictability. Finally, in many cases the predicted tissue-specific expression can be used as a surrogate for actual expression in predicting disease state for several complex diseases, far better than using the whole blood expression. Overall, our work establishes the utility and limits of the whole blood transcriptome in estimating gene expression in other tissues, laying a basis for future translationally motivated applications. We provide our software pipeline for predicting tissue-specific gene expression, named TEEBoT (Tissue Expression Estimation using Blood Transcriptome), in a user-friendly and publicly available form.

Results

Overview of TEEBoT: a pipeline for tissue-specific gene expression prediction

Fig. 1A illustrates the overall motivation and the TEEBoT pipeline.
Our goal is to assess the predictability of tissue-specific gene expression (TSGE) using all available and easily accessible information about the individual, which includes the whole blood transcriptome (WBT), his/her genotype, and basic demographic information on age, gender and race. Normalized expression data were obtained from GTEx v6 (“The Genotype-Tissue Expression (GTEx) project.,” 2013) for 32 primary tissues (60% of all tissues) for which both the tissue gene expression and the WBT were available for at least 65 individuals (Supplementary Fig. S1).

For each tissue, and for each gene, we fit three nested regression models to estimate TSGE. The prime model (M2; Methods), whose results are described in the main text, is based on whole blood gene expression (WBGE), whole blood splicing (WBSp) information and three demographic ‘confounding’ factors (CF): age, race, and sex. To reduce the dimensionality of the modeling task, instead of using the expression (respectively, splicing) levels of all genes in blood, we estimated the principal components (PCs) of WBGE (respectively, WBSp) across all individuals and used as features the sample-specific scores of the top 10 PCs (top 20 PCs for WBSp), explaining 99% of variance. The whole-genome tissue-specific splicing profiles were obtained from K. Wang et al. (2018). To assess the value of using splicing information, we additionally built and tested our base model, the ‘WBGE+CF’ model (M1), which uses only the WBGE PCs and CF variables.
Finally, to estimate the contribution of SNPs, we fit a third ‘WBGE+WBSp+SNP+CF’ model (M3); although this model is the most inclusive, it covers only a small fraction of the genes (about 10%), namely those having at least one eSNP, as reported previously by the GTEx consortium (“The Genotype-Tissue Expression (GTEx) project.,” 2013). For those genes, we used the top five PCs of the genotype profile of eSNPs, detected in a cross-validation (CV) manner to avoid overfitting. Below we present the results obtained using our prime model M2. While results based on models M1 and M3 are mentioned briefly in context as appropriate, their details are provided as supplementary results for brevity and focus.
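
The nested-model setup described above can be illustrated with a minimal sketch on synthetic data. This is not the TEEBoT code: the function names (pc_scores, rss, chi2_sf_even, lrt_pvalue, benjamini_hochberg), the simulated matrices, and the toy encoding of the demographic factors are all hypothetical. The sketch further assumes plain least-squares fits, a likelihood ratio test of M2 against a demographics-only baseline, and Benjamini-Hochberg as the FDR procedure; the Methods may specify different choices.

```python
import math
import numpy as np


def pc_scores(X, k):
    """Per-sample scores on the top-k principal components of X (samples x features)."""
    Xc = X - X.mean(axis=0)  # center each feature
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]


def rss(design, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), design])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def chi2_sf_even(x, df):
    """Exact chi-square upper-tail probability; closed form valid for even df only."""
    assert df > 0 and df % 2 == 0
    half, term, total = x / 2.0, 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def lrt_pvalue(reduced, full, y):
    """Likelihood ratio test for nested Gaussian linear models (reduced inside full)."""
    stat = len(y) * math.log(rss(reduced, y) / rss(full, y))
    df = full.shape[1] - reduced.shape[1]  # number of added features
    return chi2_sf_even(max(stat, 0.0), df)


def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of p-values rejected at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        keep[order[: np.max(np.nonzero(passed)[0]) + 1]] = True
    return keep


# Synthetic stand-ins for the blood data (assumption: random noise, not GTEx).
rng = np.random.default_rng(0)
n = 200
blood_expr = rng.normal(size=(n, 500))    # individuals x genes
blood_splice = rng.normal(size=(n, 300))  # individuals x splicing events
cf = np.column_stack([
    rng.integers(20, 80, n),  # age
    rng.integers(0, 2, n),    # sex
    rng.integers(0, 2, n),    # race, reduced to a toy binary indicator
]).astype(float)

wbge = pc_scores(blood_expr, 10)    # top 10 expression PCs
wbsp = pc_scores(blood_splice, 20)  # top 20 splicing PCs
m1 = np.column_stack([wbge, cf])        # M1: WBGE + CF
m2 = np.column_stack([wbge, wbsp, cf])  # M2: WBGE + WBSp + CF

# Two simulated tissue genes: one driven by a blood PC, one pure noise.
gene_signal = 0.5 * wbge[:, 0] + rng.normal(size=n)
gene_noise = rng.normal(size=n)

pvals = [lrt_pvalue(cf, m2, g) for g in (gene_signal, gene_noise)]
significant = benjamini_hochberg(pvals, alpha=0.05)
p_splicing = lrt_pvalue(m1, m2, gene_signal)  # added value of splicing PCs
print(pvals, significant, p_splicing)
```

The closed-form chi-square tail is used only to keep the sketch dependent on NumPy alone; it works here because both comparisons add an even number of features (30 and 20). For arbitrary degrees of freedom, scipy.stats.chi2.sf is the usual replacement.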
