bioRxiv preprint doi: https://doi.org/10.1101/2020.01.10.894238; this version posted January 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Decoding differential expression

Shinya Tasaki1*, Chris Gaiteri1, Sara Mostafavi2, Yanling Wang1

1 Rush Alzheimer’s Disease Center, Rush University Medical Center, Chicago IL, USA

2 University of British Columbia, Vancouver, British Columbia, Canada

* corresponding author

Abstract

Identifying the molecular mechanisms that control differential expression (DE) is a major goal of basic and disease biology. Combining the strengths of systems biology and deep learning in a model called DEcode, we are able to predict DE more accurately than traditional sequence-based methods, which do not utilize systems biology data. To determine the biological origins of this accuracy, we identify the most predictive regulators and types of regulatory interactions in DEcode, contrasting their roles across many human tissues. Diverse systems biology, ontological and disease-related assessments all point to the predominant influence of post-transcriptional RNA-binding factors on DE. Through the combinatorial gene regulation that is captured in DEcode, it is even possible to predict relatively subtle person-to-person variation in gene expression. We demonstrate the broad applicability of these clinically relevant predictions by predicting drivers of aging throughout the human lifespan, gene coexpression relationships on a genome-wide scale, and frequent DE in diverse conditions. Researchers can freely access DEcode to utilize genomic big data in identifying influential molecular mechanisms for any human expression data at www.differentialexpression.org.

Introduction

While all human cells share DNA sequences, gene regulation differs among cell types and developmental stages, and in response to environmental cues and stimuli. Accordingly, when gene expression is not properly regulated, cellular homeostasis can be perturbed, often affecting cell function and leading to disease1. These distinctions between cell states are observed as differential expression (DE) of gene transcripts. DE has been cataloged for tens of thousands of gene expression datasets, in the context of distinctions between species, organs, and conditions. Despite the important and pervasive nature of DE, it has been challenging to shift from these observations towards a coherent understanding of the underlying generative processes that would essentially decode DE, a transition which is essential for progress in basic and disease biology. We address this gap by exploiting novel computational and systems biology approaches to develop a predictive model of DE based on genome-wide regulatory interaction data. Utilizing diverse genomic datasets, we identify a complex, yet strikingly consistent set of principles that control DE. This model of differential expression, called DEcode, can be applied to the majority of current and future gene expression data, to accelerate basic and disease biology, by identifying the origins of DE in each experiment.

Diverse molecular interactions have been shown to generate DE, and jointly regulate gene expression at the transcriptional and post-transcriptional levels. Major classes of gene regulatory interactions have been cataloged at the genomic scale, including transcription factor (TF)-promoter interactions2, protein-RNA interactions3, RNA-RNA interactions4, chromatin interactions5, and epigenetic modifications on DNAs6, histones, and RNAs7. Statistical models of gene expression can help fulfill the purpose of these resources in describing the origins of gene regulation and DE1. However, such raw data resources have outpaced model development, likely due to the challenge of uniting diverse molecular data into a single accurate model.

Predicting DE on the basis of gene regulatory interactions is one initial approach to understanding its origins. Among many possible statistical approaches to predicting DE, deep learning (DL) blends diverse data sources in a way that approximates the convergence of regulatory interactions. Indeed, DL has been applied to genomic research8, 9 including RNA splicing10, genomic variant functions11, and RNA/DNA binding12. However, accurate prediction is only one component of understanding DE; additional genomic and systems biology analyses are helpful in understanding how predictions are fueled by existing molecular concepts, mechanisms, and classes.

To decode the basis of DE in terms of molecular regulatory interactions, we first learn to predict it with a high degree of accuracy, using a DL model we call “DEcode”. This model combines several types of gene regulatory interactions and allows us to prioritize the main systems and molecules that influence DE on a tissue-specific basis. We further establish likely molecular mechanisms for this gene regulation and validate the influence of the predicted strongest regulators. In parallel, we predict the origin of person-to-person DE, which is the major component of experimental and clinical studies. These particularly challenging predictions are validated on a genome-wide scale, as we identify key drivers of coexpression, and also drivers for phenotype-associated differential expression. These tests and applications indicate that DEcode can combine multiple recent data sources to extract regulators for arbitrary human DE signatures.

Results

Promoter and RNA features predict differential expression across human tissues

The overarching goal of this study is to accurately predict gene expression as a function of molecular interactions. These results should be tissue-specific, but also highlight major regulatory principles across tissues, and ideally have sufficient accuracy to predict the relatively small expression changes observed between individual humans. To accomplish this, we utilized deep convolutional neural networks in a system called DEcode that can predict inter-tissue variations and inter-person variations in gene expression levels from promoter and mRNA features (Figure 1). The promoter features included the genomic locations of binding sites of 762 TFs, and the mRNA features encompassed the locations of binding sites of 171 RNA-binding proteins (RNABPs) and 213 miRNAs in each mRNA (Table S1). DEcode takes the promoter features and the mRNA features for each gene as inputs and outputs its expression levels under various conditions. We note that the prediction is based on only the presence or absence of known binding sites, and other information such as gene expression levels of TFs, RNABPs, and miRNAs is not utilized. First, we applied the DEcode framework to tissue-specific human transcriptomes of 27,428 genes and 79,647 transcripts measured in the GTEX consortium13 to predict log-fold changes across 53 tissues against the median log-TPM (transcripts per million) of all tissues, as well as the median log-TPM of all tissues with a multi-task learning architecture. To ensure rigorous model testing, we excluded all genes or transcripts coded on chromosome 1 from the training data and used them as the testing data for evaluating the performance of DEcode models. This procedure prevents information leaking from intra-chromosomal interactions and potential overlaps of regulatory regions (details of model construction in Figure S1). The predicted median TPM levels showed high consistency with the actual observations at both the gene level (Spearman's rho = 0.81) and the transcript level (Spearman's rho = 0.62) (Figure 2A). Moreover, the model predicted the differential transcript usage within the same gene (Spearman's rho = 0.44) (Figure 2A). The DEcode models also predicted the differential expression profiles across 53 tissues at the gene (mean Spearman's rho = 0.34), transcript (mean Spearman's rho = 0.32), and transcript-usage levels (mean Spearman's rho = 0.16) (Figure 2B). The predicted gene expression for the testing genes was indeed tissue-specific, as it showed less correspondence with the expression profiles from alternate tissues (Figure 2C).
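The prediction targets and the chromosome-based hold-out described above can be illustrated with a minimal sketch. The log base (2), the pseudocount (1), the gene identifiers, and the TPM values are all assumptions made for illustration; the text does not specify them, and this is not the authors' pipeline.

```python
import math
from statistics import median

def build_targets(tpm_by_tissue, pseudocount=1.0):
    """Return (median log-TPM, per-tissue log-fold change vs. that median).

    Assumption: log2 with a pseudocount of 1; the source does not state either.
    """
    log_tpm = {t: math.log2(v + pseudocount) for t, v in tpm_by_tissue.items()}
    med = median(log_tpm.values())
    lfc = {t: x - med for t, x in log_tpm.items()}
    return med, lfc

def split_by_chromosome(genes, held_out="chr1"):
    """Keep every gene on the held-out chromosome for testing only."""
    train = [g for g in genes if g["chrom"] != held_out]
    test = [g for g in genes if g["chrom"] == held_out]
    return train, test

# Hypothetical toy genes; names and values are made up.
genes = [
    {"id": "geneA", "chrom": "chr1", "tpm": {"liver": 7.0, "lung": 1.0, "brain": 3.0}},
    {"id": "geneB", "chrom": "chr2", "tpm": {"liver": 0.0, "lung": 15.0, "brain": 3.0}},
]

train, test = split_by_chromosome(genes)
med, lfc = build_targets(genes[0]["tpm"])
```

In a multi-task setting, `med` and each entry of `lfc` would be separate outputs predicted from the same inputs for a gene.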

To provide context for the statistical performance of DEcode, we contrast it with a high-performing method called ExPecto11, which was designed to predict GTEX gene expression from epigenetic states estimated from promoter sequences via DL. We built 10 models for each method using the same genes for training, validation, and testing to predict gene expression in the 53 tissues. In this comparison, DEcode showed an average 7.2% improvement in root mean square error over ExPecto (Figure 2D), which translates into an average correlation coefficient with actual gene expression of 0.42, a 50% increase over the 0.28 from ExPecto (Figure S2).
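The two comparison figures quoted above are relative changes in different metrics. The sketch below shows the arithmetic; the correlation values come from the text, whereas the RMSE values are hypothetical numbers chosen only to reproduce a 7.2% reduction.

```python
def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    n = len(observed)
    return (sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n) ** 0.5

def relative_change(baseline, new):
    """Fractional change of `new` relative to `baseline`."""
    return (new - baseline) / baseline

# Correlations quoted in the text: 0.28 (ExPecto) and 0.42 (DEcode).
corr_gain = relative_change(0.28, 0.42)

# Hypothetical RMSE values (not from the paper); a lower RMSE is better,
# so the improvement is the negative of the relative change.
rmse_improvement = -relative_change(1.000, 0.928)
```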

Beyond the predictive performance of DEcode, we utilize the model to help define the biological processes regulating DE. Many studies have demonstrated that TF-promoter interactions are critical determinants of the transcriptional activity of promoters and thereby define gene expression levels2. However, it is unclear to what extent RNA features, which we define as each RNA’s binding sites for proteins and miRNAs, contribute to gene expression levels compared to TF-promoter interactions. To answer this question, we re-trained the deep learning model, randomizing either RNA features, promoter features, or both. We found that the model trained with RNA features alone explained the actual TPM values better than the model trained with promoter features (Figure 2E). An example of how RNA features may distinguish between transcripts to a greater extent than promoter features can be seen in the structure of the gene ACADM (Figure 2F), which showed substantial differences between the promoter-based model and the RNA-based model. For instance, the promoter-based model could not distinguish 8 out of 11 transcripts of the ACADM gene that shared the same promoter region (p1 in Figure 2F). However, the actual expression levels for the 8 transcripts varied depending on the mRNA structures and therefore were more accurately captured by the RNA-based model (Figure 2F). The importance of RNA features was nonetheless tissue-dependent (Figure 2G), as gene expression in the aorta and coronary arteries was mainly defined by TF-promoter interactions, whereas RNA-binding features were the major predictors for thyroid-specific or skeletal muscle-specific expression.
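One way to build the randomized inputs for such an ablation is to permute one feature block across genes, so that each gene keeps its own expression target but receives another gene's features for that block. The sketch below assumes this permutation scheme; the text does not describe the exact randomization procedure, and the gene names and binary feature vectors are hypothetical.

```python
import random

def randomize_block(samples, key, seed=0):
    """Return copies of `samples` with the `key` feature block permuted across genes.

    The input list and its dictionaries are left unmodified.
    """
    rng = random.Random(seed)
    shuffled = [s[key] for s in samples]
    rng.shuffle(shuffled)
    return [{**s, key: block} for s, block in zip(samples, shuffled)]

# Hypothetical binding-site indicator vectors for four genes.
samples = [
    {"gene": "g1", "promoter": [1, 0, 0], "rna": [0, 1]},
    {"gene": "g2", "promoter": [0, 1, 0], "rna": [1, 1]},
    {"gene": "g3", "promoter": [0, 0, 1], "rna": [1, 0]},
    {"gene": "g4", "promoter": [1, 1, 0], "rna": [0, 0]},
]

rna_only = randomize_block(samples, "promoter")        # only RNA features stay informative
promoter_only = randomize_block(samples, "rna")        # only promoter features stay informative
neither = randomize_block(rna_only, "rna", seed=1)     # both blocks randomized
```

Re-training on each variant and comparing test performance then indicates how much each feature block contributes.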

Regulatory factors for differential expression across human tissues

To quantify the importance of the biological interactions weighted in the DEcode models, we calculated DeepLIFT scores, which measure the additive contribution of each binding site to each prediction14, 15, and then averaged the DeepLIFT scores for each interactor across genes (Table S2). Because DeepLIFT scores for the gene-based model and the transcript-based model were well correlated (Spearman's rho = 0.52, P < 2.2e-16) (Figure S3), we focused on DeepLIFT scores for the gene-based model in the following analyses. For the prediction of median TPM levels, the enrichment of the binding sites of RNABPs peaked among the top 12% of influential predictors, which was significantly greater than the influence of TFs and miRNAs (P < 0.00001) (Figure 3A). Indeed, out of the top 30 key predictors, 19 were RNABP binding sites and 11 were TF binding sites. The direction of DeepLIFT scores indicates either a positive or a negative effect of having the binding site on the abundance of RNA (Figure 3B). For instance, the binding sites of ATXN2, DDX3X, and FUS had high positive DeepLIFT scores for the prediction of RNA abundance, indicating that the genes that bear binding sites for these RNABPs tended to be more highly expressed (Figure 3B). We also calculated DeepLIFT scores for tissue-specific expression to examine critical predictors for each of the 53 tissues (Figure 3C). The DeepLIFT scores across tissues recapitulated the contribution of binding sites of known master regulators in each tissue, such as REST for brain tissues16, SPI1 and RUNX1 for immune-related tissues17, TP63 and KLF4 for skin18, HNF4A for liver19, and PPARG for adipose-related tissues20, which suggested that the differences in predictive contributions of binding sites of a given regulator reflect the differential activities of regulators across tissues. We hypothesized that the differential activities of a regulator could be in part explained by the relative abundance of the regulator across tissues. Based on this hypothesis, we contrasted DeepLIFT scores for the binding sites of each regulatory factor with its expression levels across tissues. We indeed found that 99 RNABPs and 410 TFs showed significant correlations between DeepLIFT scores of their binding sites and their expression levels (FDR < 5%) (Figure S4). These relationships were not based on differences in expression profiles between brain and non-brain tissues, as the relationships remained the same without brain tissues (Figure S5). The sign of the correlation possibly reflects whether the binding of a regulator to RNA increased or decreased the abundance of the RNA. For instance, the model suggested that PPARG and PTBP1 are positive regulators of gene expression, as DeepLIFT scores of PPARG or PTBP1 binding sites were higher in the tissues expressing PPARG or PTBP1 at higher levels (Figure 3D). Indeed, PPARG is a transcriptional activator20 and PTBP1 is a stabilizer of RNAs21. Conversely, the expression levels of REST, a transcriptional repressor16, or METTL14, an RNA methyltransferase destabilizing RNAs22, showed inverse correlations with their DeepLIFT scores, as expected (Figure 3D). These results indicated that DEcode reflects biological mechanisms for controlling RNA abundance.
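The comparison of a regulator's DeepLIFT scores with its own expression across tissues amounts to a rank correlation whose sign is then interpreted. The sketch below implements Spearman's rho with the standard library; the regulator labels and all numbers are hypothetical toy values, and the significance testing and FDR control used in the study are omitted.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-tissue values (one entry per tissue) for two regulators.
activator_like = spearman([0.1, 0.4, 0.2, 0.9], [1.0, 3.0, 2.0, 8.0])
repressor_like = spearman([0.9, 0.1, 0.5], [1.0, 5.0, 2.0])
```

Under the interpretation given in the text, a positive rho is what would be expected for an activator or stabilizer, and a negative rho for a repressor or destabilizer.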

Critical predictors of the transcriptome are enriched for disease genes

Next, we characterized the roles of the critical regulators of the human transcriptome, as suggested by the DEcode models (Figure 3A). We hypothesized that if these are truly impactful transcriptome regulators, then defects in such regulators would have significant impacts on cellular phenotypes and thereby lead to disease. To examine this hypothesis, we first obtained genes whose loss-of-function (LoF) mutations are depleted through the process of natural selection, from the Exome Aggregation Consortium (ExAC)23. Since these genes are intolerant to LoF mutations, they are considered to play important roles in individual fitness. Out of all TFs and RNABPs used in DEcode, 853 genes were examined in the ExAC study and 601 genes were reported as being intolerant of homozygous or heterozygous LoF mutations, with probability greater than 99%. We found that these LoF-mutation-intolerant regulators had greater DeepLIFT score magnitudes for the prediction of the absolute gene expression (Figure 3E and Table S3). In particular, these associations are based on genes that are intolerant to both heterozygous and homozygous LoF mutations (Figure S6). This suggested that having LoF mutations in only a single allele of the predicted critical regulators would cause a deleterious consequence on survival or reproduction in humans. Next, to examine whether the predicted critical regulators of the transcriptome indeed cause diseases, we obtained disease-causing genes registered in the Online Mendelian Inheritance in Man (OMIM)24. We confirmed that mutations in the regulators with high DeepLIFT scores tended to cause genetic disorders (Figure 3E). Interestingly, their roles in fitness are likely preserved across species, as dysfunctions of the predicted critical regulators also led to pre-weaning lethality in mice (Figure 3E). Lastly, we asked whether the loss-of-function of the predicted critical regulators of the transcriptome could also impair cellular viability, by overlapping them with loss-of-function screens for a range of cellular models, from the Cancer Dependency Map project (DepMap)31. We found that the key genes for cellular viability tended to have higher DeepLIFT scores in the DEcode model (Figure 3E). These results were robust, as they were also supported by the DeepLIFT scores for the transcript-level model (Figure S7). Together, the results indicated that the critical predictors of the transcriptome indeed play critical roles in maintaining vital cellular and body functions. Thus the DEcode model can identify disease-causing genes, and this capability points toward the broader validity of predicted key regulators.

DEcode predicts differential expression across individuals

Next, we asked whether the same input of promoter and RNA features could also predict relative expression differences across individuals within the same tissue. We hypothesized that each individual has different activation levels of regulatory factors, and thus those differences lead to person-specific differential expression of their targets. To test this hypothesis, we extended the DEcode framework to model differential expression across individuals for 14 representative tissues with a sample size greater than 100 in GTEX. This was challenging, as the average variance in gene expression within tissues was less than 25% of that between tissues (Figure S8).

To generate person-specific predictions, we utilized transfer learning, wherein the parameters in the convolutional layers of the across-tissue DEcode model were fixed and only the parameters in the fully-connected layers were tuned (Figure S9). The person-specific models successfully predicted fold changes across individuals with a mean Spearman’s correlation of ~0.28 (Figure 4A). The performance was further increased to 0.34 when we filtered out the models that worked poorly for the validation data (Figure S10). Note that the model selection was performed based on validation data alone, and all the follow-up performance evaluations and analyses were conducted using testing data to prevent information leaks that could inflate model performance (Figure S9). The models were indeed person-specific, as they did not predict gene expression profiles of unrelated individuals (Figure 4A). To examine whether the model captured the person-specific expression shared across tissues25, we compared expression between tissues within the same individuals and between different individuals. The predicted expression showed better concordance between tissues from the same individuals, as is the case with actual expression data, which indicated that the model captured the person-specific regulatory mechanisms, even though we did not use any direct information that could identify individuals (Figure 4B).

Next, to gauge the contribution of RNA and promoter features to the person-specific expression profiles, we re-trained models with randomized RNA features, promoter features, or both. The RNA-feature-based model performed on average 85% as well as the model trained with all features. This corresponded to an average 173% performance gain compared to the promoter-feature-based model, which suggested that post-transcriptional controls are the major determinants of the differential expression across individuals (Figure 4C). The model also allowed us to investigate the person-specific activities of regulators by calculating DeepLIFT scores (Figure 4D). At least 100 of the 933 regulators in each tissue showed a good correlation between their DeepLIFT scores and expression levels across individuals (Figure S11). The signs of these correlations were consistent between tissues, and consistent with those of the cross-tissue model (Figure S12). This suggested that differential expression between individuals and between tissues can be modeled by the universal relationships between regulators and their targets.

To examine whether specific genes contributed to the per-person accuracy of the predicted gene expression, we also assessed its accuracy on a per-gene basis. The predicted expression of a majority of the testing genes (78% on average) showed significant positive correlations with the actual gene expression (FDR < 5%). To assess whether this predictive performance exceeded that of a state-of-the-art method, we compared DEcode with PrediXcan26, which predicts person-specific gene expression from genetic variations in cis-regulatory regions of genes. We built PrediXcan models for each of the testing genes based on the same GTEX gene expression data used for the DEcode models and whole-genome sequence data of the corresponding individuals (see Methods). The PrediXcan model predicted gene expression levels of only about 11% of the testing genes at FDR less than 5%, which was far less than that of DEcode (Figure 4E). This suggested that the differential activity of transcriptional and post-transcriptional regulators has a larger effect on gene expression than genetic variations in cis-regulatory regions.
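The "fraction of testing genes predicted at FDR < 5%" can be sketched as follows, assuming the Benjamini-Hochberg procedure for FDR control (the text does not name the procedure). The p-values and correlations are hypothetical toy values, one pair per gene.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for steps_from_end, idx in enumerate(reversed(order)):
        rank = m - steps_from_end  # 1-based rank of this p-value
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def fraction_well_predicted(pvalues, correlations, alpha=0.05):
    """Fraction of genes with a positive correlation and adjusted p < alpha."""
    q = benjamini_hochberg(pvalues)
    hits = sum(1 for qi, r in zip(q, correlations) if qi < alpha and r > 0)
    return hits / len(pvalues)

# Hypothetical per-gene results (predicted vs. actual expression across people).
pvals = [0.01, 0.04, 0.03, 0.5]
rhos = [0.30, 0.20, 0.25, 0.01]
q_values = benjamini_hochberg(pvals)
fraction = fraction_well_predicted(pvals, rhos)
```

Applying the same function to the per-gene results of two methods gives directly comparable fractions, as in the DEcode and PrediXcan comparison.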

The genes that DEcode could predict well were similar across tissues (Figure S13). This suggested that the predictability of gene expression is defined by gene characteristics rather than by the target tissue. We therefore explored gene characteristics that were associated with the per-gene accuracy of the predicted expression. We found that the models showed higher performance for the genes that are registered in multiple gene annotation databases than for those found only in the GENCODE database (Figure S14). The GENCODE-specific genes are novel or putative and thus their annotations are not well established. Since both actual gene expression and binding features in RNA and promoter regions are likely to be less accurate for such novel or putative genes, it is reasonable that the performance of the model for those genes was lower than for well-established genes. Beyond the annotation reliability, we found that the number of known binding features for each gene had a larger effect on the predictability (Figure S15). This suggested that the more information on RNA and promoter interactions is available, the more accurate the prediction becomes. Interestingly, the number of binding features in RNAs was a stronger determinant of the predictive accuracy than that in promoter regions (Figure 4F and Figure S15). RNA-protein interactions are largely missing, as global RNA-binding profiles are available for only about 10% of known RNABPs30. Thus, the incompleteness of RNA features is likely to be an origin of lower accuracy for a portion of genes.

DEcode predicts trait-related transcriptomic changes

Next, we asked whether the person-specific expression profiles predicted by the DEcode models also retained trait-associated differential expression changes. For this, we conducted differential expression analysis against the donor’s age and sex using the predicted gene expression data. Notably, test statistics of the predicted data showed significant positive correlations with those of the actual data in all tissues for both traits (Figure 5A). In particular, age- and sex-specific expression changes were well preserved in the predicted data in lung (Spearman's rho = 0.59, P < 2.2e-16) and hippocampus (Spearman's rho = 0.47, P < 2.2e-16), respectively. The predicted associations were the closest to those of the corresponding tissues in 9 and 11 out of 14 tissues for age and sex, respectively (Figure 5B). This indicated that the predicted gene expression changes against age and sex are tissue-specific in most cases, rather than effects shared across tissues. We also explored the regulators for the age- and sex-related gene expression changes by associating each regulator's DeepLIFT scores with age and sex. We found that many regulators, for instance 717 in the tibial artery and 904 in the breast mammary tissue, showed age- and sex-dependent changes at FDR 5%, respectively (Figure 5C and Table S4), which showed the capability of DEcode to associate transcriptional regulators with phenotypes. Although there were more TFs associated with phenotypes than RNABPs and miRNAs, the overall collective impacts of RNA features on the generative process of DE for age and sex were greater than those of promoter features in most tissues (Figure S16).

DEcode predicts gene co-expression relationships

Co-expression analysis is a frequent component of transcriptome studies, as gene-to-gene co-expression relationships are regarded as functional units of the transcriptional system27. Therefore, we examined whether the DEcode models could detect known gene co-expression relationships. These tests were both a potential validation of the person-specific DEcode predictions, and a means to explore the biological basis of co-expression. We found that the gene co-expression relationships in the predicted gene expression profiles separated gene pairs with positive and negative correlation in the actual gene expression data in each tissue (Figure 6A). Furthermore, the predicted gene expression profiles also detected inter-tissue co-expression relationships (Figure 6B). The accuracy of these results motivated us to investigate key factors driving co-expression, via the DEcode predictions. RNA features alone could explain co-expression relationships better than promoter features in most tissues (Figure 6C), which again suggested the significant contribution of RNA features to person-specific transcriptomes.

To further assess the capability of DEcode to decipher the mechanisms leading to a specific co-expression relationship, we focused on the co-expression of LAPTM5 and CD53, which were robustly co-expressed both in the simulated expression data and the actual data in all tissues except whole-blood. Using the trained model, we simulated the consequences of disruptions of promoter and mRNA features. The co-expression relationship was weakened when the features near the transcriptional start site (TSS) and 1,000 bp downstream of the TSS in LAPTM5, or near the TSS and 500 bp upstream of the TSS in CD53, were removed (Figure 6D). These observed effects were reasonable because many TFs bind to these regions (Figure 6D). We further examined the specific regulators for the co-expression relationships by simulating knockout (KO) effects of regulators. The in-silico KO experiments revealed that immune-related TFs such as SPI1 and TBX21 potentiated the co-expression relationships consistently across multiple tissues (Figure 6E). To validate whether these regulators indeed induced the co-expression relationships, we conducted a mediation analysis, an orthogonal computational method to infer the effect of regulators on downstream targets. The mediation analysis evaluated the hypothesis that, if LAPTM5 and CD53 are co-expressed due to the predicted regulators, normalizing expression levels of the two genes by the expression levels of the regulators would decrease the co-expression relationships. Specifically, it quantified the covariance between LAPTM5 and CD53 explained by the expression levels of the predicted regulators using the actual expression data. The set of the 10 regulators together mediated up to 94% of covariance, which was significantly greater than the same number of randomly picked regulators (Figure 6F). This example showed the utility of the DEcode framework to identify the drivers of co-expression.
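The idea behind the mediation analysis can be sketched with a single regulator: regress each gene on the regulator, then ask how much of the original gene-gene covariance disappears in the residuals. This is a simplified illustration under stated assumptions, not the authors' procedure; the study used a set of 10 regulators, which would require multiple regression, and all expression values below are made up.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance (n - 1 denominator)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def residualize(y, x):
    """Residuals of y after simple linear regression on x."""
    slope = covariance(x, y) / covariance(x, x)
    intercept = mean(y) - slope * mean(x)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

def fraction_mediated(gene_a, gene_b, regulator):
    """Share of cov(gene_a, gene_b) removed by adjusting both genes for the regulator."""
    total = covariance(gene_a, gene_b)
    remaining = covariance(residualize(gene_a, regulator),
                           residualize(gene_b, regulator))
    return 1 - remaining / total

# Hypothetical expression across five individuals: both genes track the regulator
# plus a small independent component.
regulator = [1.0, 2.0, 3.0, 4.0, 5.0]
noise_a = [0.1, -0.2, 0.2, -0.2, 0.1]
noise_b = [0.2, -0.1, -0.2, -0.1, 0.2]
gene_a = [2 * r + e for r, e in zip(regulator, noise_a)]
gene_b = [3 * r + e for r, e in zip(regulator, noise_b)]

mediated = fraction_mediated(gene_a, gene_b, regulator)
```

A value close to 1 means the regulator accounts for nearly all of the co-expression; repeating the calculation with randomly chosen regulators gives a null distribution for comparison.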

DEcode reveals molecular regulations for frequently DE genes in meta-transcriptomes

254 A recent meta-analysis of over 600 human transcriptome data sets revealed that some genes are more likely to be detected as

255 DE genes than others in diverse case-control studies28. From this observation, Crow et al. formulated the “DE prior”, a

256 global ranking of each gene’s generic likelihood of being DE. Genes with a high DE prior rank were significantly more

257 enriched with DE genes from a variety of conditions than other functional gene sets, such as those contained

258 in or canonical pathways28. However, the regulatory origin behind the ranking of these highly responsive

259 genes has yet to be uncovered. Therefore, we used DEcode to examine whether the DE prior rank could be generated by

260 gene regulatory interactions, and to identify critical regulatory relationships for frequently DE genes. The ability of

261 DEcode to predict global DE prior ranks was highly significant (P < 2.2e-16) and practically relevant (Spearman's rho =

262 0.53) (Figure 7A). Furthermore, DEcode was able to identify genes with high (90th percentile and greater) DE prior

263 probability (AUCROC = 0.81, 95% confidence interval = 0.78 - 0.84) (Figure 7B). Re-training the model with

264 randomized inputs indicated that TF-promoter interactions were the major factors explaining the DE prior rank (Figure

265 7B). To further characterize TFs that contributed to the prediction, we defined TFs with a DeepLIFT score greater than the

266 90th percentile as critical TFs (Table S5) and performed pathway analysis on them. We found that critical TFs were

267 enriched for cancer or inflammatory-related KEGG pathways (FDR<5%) such as pathways in cancer (Fold = 3.1, P =

268 4.2e-5), JAK-STAT signaling pathway (Fold = 6.8, P = 4.8e-5), chemokine signaling pathway (Fold = 7.3, P = 1.4e-4),

269 and acute myeloid leukemia (Fold = 4.5, P = 3.6e-4) (Table S6). This result is consistent with the disease context of the

270 data sets underlying the DE prior, of which 62% are cancer-related and 23% are inflammation-related. Supported by the ability to predict DE

271 prior ranks, and by the consistency of these results, this application illustrates how DEcode goes beyond DE gene

272 lists to uncover key drivers of DE.
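
The fold enrichments and P values reported for the KEGG pathways come from a hypergeometric test (see Materials and Methods). As a minimal sketch of how such a fold enrichment and upper-tail P value could be computed, consider the following; the function name and the counts in the usage line are hypothetical and are not taken from the study.

```python
from math import comb

def hypergeom_enrichment(n_background, n_pathway, n_selected, n_overlap):
    """Return (fold enrichment, P(X >= n_overlap)) under a hypergeometric null.

    n_background: background genes (e.g. all TFs considered)
    n_pathway:    background genes annotated to the pathway
    n_selected:   genes in the tested list (e.g. critical TFs)
    n_overlap:    selected genes that are also in the pathway
    """
    expected = n_selected * n_pathway / n_background
    fold = n_overlap / expected
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    # Sum the probabilities of observing n_overlap or more pathway genes.
    p_value = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    ) / total
    return fold, p_value

# Hypothetical counts for illustration only.
fold, p = hypergeom_enrichment(n_background=100, n_pathway=10, n_selected=10, n_overlap=3)
```

In practice the resulting P values would then be adjusted for multiple testing across pathways, as described in the Methods.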

273 In summary, DEcode identifies major principles of gene regulation in arbitrary gene expression data. It is applicable to

274 tracing the origins of complex gene expression patterns such as co-regulation, and also to arbitrary gene expression

275 signatures. This capacity is strongly supported on a comparative basis against alternative methods, and on an absolute basis

276 across diverse applications, including predictions of transcript usage, person-specific gene expression, and

277 frequently DE genes across multiple external disease-related gene sets.

278 Discussion

279 We introduced the DEcode framework, which integrates a wealth of genomic data into a unified computational model of

280 transcriptome regulation to predict multiple transcriptional effects, including absolute expression differences across

281 genes and transcripts, and tissue- and person-specific transcriptomes. Systems biology analysis of these results provided

282 biological insights into the regulatory mechanisms of the transcriptome. For instance, it suggested that absolute

283 expression levels are mainly under post-transcriptional control, whereas tissue-specific expression is shaped by both

284 transcriptional and post-transcriptional control. This implied that TFs act as a switch that initiates tissue-specific

285 transcriptional programs, but once a gene is transcribed at a certain level, its abundance in the cells will be primarily

286 regulated by RNABPs. The post-transcriptional regulators were also critical for explaining individual differences in

287 transcriptomes and thus may fine-tune the transcriptome in response to environmental and genetic factors.

288 Transcriptome analysis often identifies differentially expressed genes and then assesses the enrichment of functional

289 gene sets such as TF targets one by one. The person-specific DEcode model offers several comparative advantages. First,

290 DEcode can take into account the effects of multiple regulators simultaneously as opposed to one at a time. Second,

291 DEcode can estimate person-specific regulator activities that can be used to identify regulators associated with a

292 phenotype of interest. Third, DEcode can simulate the consequence of KO perturbations for each gene. This step can

293 reduce the number of candidate key drivers of gene expression changes by an order of magnitude or more, and facilitate

294 the design of follow-up experiments. Therefore, DEcode can extract more actionable information from transcriptome

295 data, which will benefit a variety of transcriptome studies.

296 Looking toward even more expansive applications, the DEcode framework has the flexibility to incorporate other types

297 of genomic information such as DNA methylation, histone marks, and RNA modifications, and also can be extended to

298 other organisms. Thus, the DEcode framework provides a direct bridge between accumulating genomic big data and

299 individual transcriptome studies, allowing researchers to predict molecules that control DE associated with any condition

300 or disease.

301 Materials and Methods

302 Transcriptome data processing

303 To prepare gene expression data used for the model training, we downloaded the median gene TPM from 53 human

304 tissues from the v7 release of the GTEX portal (https://gtexportal.org). We kept 27,428 genes expressed at greater than two

305 TPM in at least one tissue and log2-transformed TPM with the addition of 0.25 to avoid a negative infinity. Then, we

306 calculated the median log2-TPM across 53 tissues and log2-fold-changes relative to the median of all tissues. The

307 processed gene-level expression data comprised 27,428 genes with 54 columns including relative fold-changes for 53

308 tissues and the median log2-TPM across 53 tissues. To compile transcript-level data, we downloaded the individual-

309 level transcript TPM from the GTEX portal and computed the median transcript TPM by tissue. We processed the

310 transcript data in the same way we did for the gene-level data. The resulting transcript-level data included 79,647

311 transcripts that corresponded to 23,813 genes. For building person-specific DEcode models, we obtained the gene-level

312 TPM for each individual in 14 tissues from the GTEX portal. We filtered out lowly expressed genes in each tissue and

313 kept genes expressed at greater than one TPM in at least 50% of samples. Then, we log2-transformed TPM with the

314 addition of 0.25 and quantile-normalized the log2-TPM. Finally, we removed the effects of technical covariates

315 including rRNA rate, intronic rate, and RNA integrity number (RIN) via linear regression for each gene followed by quantile

316 normalization.
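
As an illustration of the gene-level processing described above, the following minimal Python sketch applies the expression filter, the log2 transformation with a pseudocount of 0.25, and the conversion to fold-changes relative to the across-tissue median. The function name and the three-tissue example values are hypothetical; the study used 53 tissues.

```python
from math import log2
from statistics import median

PSEUDOCOUNT = 0.25  # added before log2 to avoid negative infinity at TPM = 0

def process_gene(tpm_by_tissue, min_tpm=2.0):
    """Return (median log2-TPM, per-tissue log2 fold-changes), or None if filtered out."""
    # Keep the gene only if it exceeds min_tpm in at least one tissue.
    if max(tpm_by_tissue) <= min_tpm:
        return None
    log_tpm = [log2(t + PSEUDOCOUNT) for t in tpm_by_tissue]
    med = median(log_tpm)
    # Fold-changes are expressed relative to the median across tissues.
    fold_changes = [v - med for v in log_tpm]
    return med, fold_changes

# Hypothetical gene measured in three tissues.
result = process_gene([0.0, 3.75, 15.75])
```

Each retained gene therefore contributes one median value plus one relative fold-change per tissue, which matches the 54-column layout described in the text.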

317 Promoter and RNA binding features

318 To generate RNA and DNA feature matrices, we downloaded genomic locations of binding sites of 171 RNABPs from

319 POSTAR2 (ref. 29) as of Oct 2018, 218 miRNAs from TargetScan Release 7.2 (ref. 30), and 826 TFs from GTRD (ref. 31) as of Oct 2018.

320 Then, we mapped the binding sites of RNABPs, miRNAs, and TFs to promoters and RNA-coding regions defined in

321 the GTF file provided by the GTEX portal. A promoter region of each gene was defined as the region from 2,000 bp

322 upstream of the transcriptional start site (TSS) to 1,000 bp downstream of the TSS. We only used interactors that bind to

323 promoters or RNA-coding regions of at least 30 genes or transcripts as the predictors in each model. To reduce the size

324 of the input, the RNA-coding region and the promoter region of each gene were binned into 100 bp intervals and the

325 number of bases bound by each RNABP, miRNA, or TF was counted in each interval. This step generated the RNA and

326 DNA feature matrices for each gene described in Figure 1.
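
The binning step can be sketched as follows. This is a minimal illustration, assuming binding sites are given as half-open intervals in coordinates relative to the start of the region; the function name and the example sites are hypothetical, and the treatment of overlapping sites (counted separately here) is an assumption that the text does not specify.

```python
def bin_binding_sites(region_length, sites, bin_size=100):
    """Count bases covered by binding sites in each fixed-width bin.

    region_length: length of the promoter or RNA-coding region in bp
    sites: list of (start, end) half-open intervals, 0-based, relative to the region
    """
    n_bins = -(-region_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for start, end in sites:
        # Clip each site to the region boundaries.
        start, end = max(start, 0), min(end, region_length)
        pos = start
        while pos < end:
            b = pos // bin_size
            bin_end = min((b + 1) * bin_size, end)
            counts[b] += bin_end - pos  # bases of this site falling in bin b
            pos = bin_end
    return counts

# A 3,000 bp promoter (2,000 bp upstream to 1,000 bp downstream of the TSS)
# with two hypothetical binding sites for one regulator.
row = bin_binding_sites(3000, [(90, 120), (250, 260)])
```

Stacking one such row per regulator gives a regulator-by-bin matrix for the region, which is the general shape of input the text describes.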

327 Training tissue-specific models

328 For training the gene-level model of tissue-specific expression, we reserved all 2,705 genes coded on chromosome 1 as

329 the testing data and the rest of the genes were randomly split into training data (22,251 genes) and validation data (2,472

330 genes). In the case of the transcript model, we used all 7,631 transcripts coded on chromosome 1 as the testing data and

331 the rest of the transcripts were randomly split into training data (64,978 transcripts) and validation data (7,038

332 transcripts). The binding matrices were normalized by the maximum values for each binding protein and miRNA. The

333 relative fold-changes for 53 tissues were scaled together to set the standard deviation as one and the median log2-TPM

334 was separately scaled to set the standard deviation as one. These steps were conducted for the training data first and

335 then the same scaling factors were used for the validation and the testing data to avoid information leakage from those

336 data. We constructed and trained DL models using Keras (version 2.1.3)32 with a TensorFlow (version 1.4.1)33

337 backend. Hyper-parameters were optimized using hyperopt (version 0.2)34 based on the mean squared error against the

338 validation data. The detailed structure of the model is shown in Figure S17. Training used mini-

339 batches of 128 training examples with a learning rate of 0.001 for the Adam optimizer35. The maximum number of training

340 epochs was set to 100, with early stopping after 10 epochs without improvement in validation loss. This training cycle was repeated 10 times and

341 the best model for the validation data was selected as the final model (Figure S1). All models were trained using

342 TITAN X Pascal graphics processing units (Nvidia).
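
The leakage-avoiding scaling described above, in which factors are derived from the training genes and then reused for the validation and testing genes, can be sketched as follows. This is a simplified illustration, not the authors' code: the function names and toy values are hypothetical, features are reduced to one value per regulator, and the use of the population standard deviation is an assumption.

```python
from statistics import pstdev

def fit_scalers(train_features, train_targets):
    """Derive scaling factors from training genes only."""
    n_features = len(train_features[0])
    # Per-regulator maximum binding value (fall back to 1.0 to avoid division by zero).
    feature_max = [max(row[j] for row in train_features) or 1.0 for j in range(n_features)]
    # Scale targets so that the training standard deviation becomes one.
    target_sd = pstdev(train_targets) or 1.0
    return feature_max, target_sd

def apply_scalers(features, targets, feature_max, target_sd):
    """Apply training-derived factors to any split (training, validation, or testing)."""
    scaled_x = [[v / m for v, m in zip(row, feature_max)] for row in features]
    scaled_y = [t / target_sd for t in targets]
    return scaled_x, scaled_y

# Hypothetical toy data: two regulators, three training genes, one test gene.
train_x, train_y = [[2, 0], [4, 5], [1, 10]], [1.0, 3.0, 5.0]
test_x, test_y = [[8, 5]], [2.0]
f_max, sd = fit_scalers(train_x, train_y)
scaled_test_x, scaled_test_y = apply_scalers(test_x, test_y, f_max, sd)
```

Because the factors come from the training split alone, a test value can exceed one after scaling, as the first test feature does here.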

343 Comparison of DEcode with ExPecto

344 To perform a fair comparison between DEcode and ExPecto11, we used 18,550 genes that were commonly included in

345 both studies and trained models with the same set of genes for training and evaluation. Since the ExPecto model was

346 originally built using genes on chromosome 8 as the testing data, we followed the same procedure and reserved all

347 714 genes coded on chromosome 8 as the testing data, and the rest of the genes were randomly split into training data

348 (16,052 genes) and validation data (1,784 genes). The epigenetic states estimated by ExPecto were downloaded from

349 the ExPecto repository (https://github.com/FunctionLab/ExPecto) as of Nov 2019. Given the epigenetic states, we built

350 a prediction model for tissue-specific gene expression for each tissue via XGBoost based on the training script

351 downloaded from the ExPecto repository. We modified the original script so that the early stopping of the model

352 optimization was decided based on the performance on the validation data instead of the testing data. This modification

353 prevented the overfitting of the model to the testing data. We used the same hyper-parameters for XGBoost as in the

354 script. Both the hyper-parameters and the model parameters of the DEcode model were also optimized with the same set of training,

355 validation, and testing genes.

356 DeepLIFT score calculation

357 To evaluate the importance of input features to the prediction, we calculated DeepLIFT (Deep SHAP) scores14 using

358 the DeepExplainer implementation (version 0.27.0)15. The DeepLIFT method estimates the contribution of each input

359 compared to a reference input in a trained DL model. To compute the contribution of the presence of a binding site, we

360 used a reference that has no binding sites in either promoters or RNAs, with the median length of all genes

361 in the testing data. DeepLIFT scores follow a summation-to-delta property where the summation of input contributions

362 (DeepLIFT scores) is equal to the difference in the predicted value compared to the prediction from the reference input.

363 We calculated DeepLIFT scores for each gene in testing data for each of 54 outputs, then summed up the scores over

364 promoter or RNA regions for each feature, and finally averaged them over genes.
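
The summation-to-delta property can be checked by hand in the simplest case. For a purely linear model, the contribution of each input relative to the reference reduces to the weight times the difference from the reference, and these contributions sum to the change in the prediction. The sketch below uses a hypothetical three-feature linear model, not the trained DEcode network, whose nonlinear layers require the full DeepLIFT rules.

```python
def predict(weights, bias, x):
    """Prediction of a linear model."""
    return bias + sum(w * v for w, v in zip(weights, x))

def contributions(weights, x, reference):
    """Per-input contributions relative to a reference, valid for a linear model."""
    return [w * (v - r) for w, v, r in zip(weights, x, reference)]

weights, bias = [0.5, -1.0, 2.0], 0.25   # hypothetical model parameters
x = [4.0, 1.0, 0.0]                      # hypothetical binding features of one gene
reference = [0.0, 0.0, 0.0]              # reference input with no binding sites

scores = contributions(weights, x, reference)
delta = predict(weights, bias, x) - predict(weights, bias, reference)
# Summation-to-delta: sum(scores) equals delta.
```

Using an all-zero reference, as in the text, means each score can be read as the effect of the presence of the corresponding binding feature.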

365 Disease genes

366 The probability that a gene is intolerant for a loss-of-function mutation was downloaded from the release 1.0 of the

367 ExAC portal (http://exac.broadinstitute.org). Disease genes were obtained from the OMIM portal as of June 2019

368 (https://www.omim.org/). We excluded provisional gene-to-phenotype associations and genes associated with non-

369 disease phenotypes, multifactorial disorders, or infection. We obtained mouse-lethal genes from Gene Discovery

370 Informatics Toolkit (v1.0.0)36 that provided pre-processed gene lists from the murine knock-out experiments registered

371 in Mouse Genome Informatics (MGI)37 and the International Mouse Phenotyping Consortium (IMPC)38. The results of

372 CRISPR screening for the genes essential for proliferation or viability conducted in the DepMap project39 were

373 downloaded from the Enrichr portal40, 41 as of June 2019 (https://amp.pharm.mssm.edu/Enrichr). The Enrichr portal provided

374 two CRISPR screening results conducted independently at the Broad Institute and the Sanger Institute. To reduce false

375 positives in the CRISPR screening, we used essential genes that were identified in both of the two independent

376 screenings.

377 Training person-specific models

378 To train person-specific models, we utilized the same model structure as the tissue-model, except that the number of

379 model outputs was modified to match the sample size of the tissue. We re-used the parameters of convolutional layers

380 in the tissue-model and only parameters in the fully-connected layers were tuned (Figure S9). We used the same gene

381 splits and the same procedure of normalization and scaling as the tissue-model for training and evaluating models. We

382 evaluated the model prediction for each individual separately based on validation data and filtered out the individual

383 models that performed below the 50th percentile of all individual models for some analyses (Figure S9).

384 Training PrediXcan models

385 To build a prediction model for gene expression from genotype data, we trained PrediXcan26 models with GTEX gene

386 expression and genotype data. A quality-controlled VCF file of GTEx genotype data called from whole-genome sequencing was

387 downloaded from dbGaP for 635 individuals. We filtered out variants with a missing rate greater than 1% or a minor

388 allele frequency less than 1% and kept 9,219,660 variants for PrediXcan. We followed the model building procedure

389 employed in PredictDB (http://predictdb.org/), a repository of PrediXcan models, as of Nov 2019. Briefly, we

390 randomly split the samples into 5 folds. Then for each fold, we removed the fold from the data and used the remaining

391 data to train an elastic-net model using 10-fold cross-validation to tune the lambda parameter. With the trained model,

392 we predicted gene expression values for the hold out samples. We applied the PrediXcan method to predict the same

393 gene expression data used for the person-specific DEcode models. We built a PrediXcan model for each gene using

394 variants located within 1 Mbp upstream and downstream of its TSS. Missing genotype values were replaced

395 with the average dosage of the non-missing samples.
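
The variant filtering and mean-dosage imputation described above can be sketched for a single variant as follows. This is a minimal illustration under stated assumptions: the function name and example dosages are hypothetical, dosages are assumed to count the alternate allele on a 0 to 2 scale, and a variant is assumed to be dropped if it fails either threshold.

```python
def filter_and_impute(dosages, max_missing=0.01, min_maf=0.01):
    """Return mean-imputed dosages for one variant, or None if the variant is filtered out.

    dosages: per-individual alternate-allele dosages in [0, 2]; None marks a missing call.
    """
    observed = [d for d in dosages if d is not None]
    if not observed:
        return None
    missing_rate = 1 - len(observed) / len(dosages)
    mean_dosage = sum(observed) / len(observed)
    allele_freq = mean_dosage / 2
    maf = min(allele_freq, 1 - allele_freq)  # minor allele frequency
    if missing_rate > max_missing or maf < min_maf:
        return None
    # Replace missing calls with the average dosage of the non-missing samples.
    return [mean_dosage if d is None else d for d in dosages]

# Hypothetical variant; a looser missingness threshold is used so the toy example passes.
imputed = filter_and_impute([0, 1, None, 2, 1], max_missing=0.25)
```

With the 1% thresholds from the text, the same toy variant would be removed because one of its five calls is missing.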

396 Differential expression analysis for age and sex

397 Limma42 was used to identify genes associated with age, using sex as a covariate. The log2-TPM values of genes in

398 the testing data were used. We also tested the associations between DeepLIFT scores for predictors and age via limma

399 to identify regulators of DE with respect to age and sex. The Benjamini–Hochberg procedure was used to control the false

400 discovery rate at 5%.
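
For reference, the Benjamini-Hochberg adjustment can be sketched in a few lines of Python. The study itself used limma in R; the function below is a generic illustration of the procedure with hypothetical p-values, not the code used in the analysis.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four genes, tested at a false discovery rate of 5%.
p = [0.01, 0.04, 0.03, 0.5]
q = benjamini_hochberg(p)
significant = [qi <= 0.05 for qi in q]
```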

401 In silico binding-site disruption experiment

402 To simulate the consequence of the removal of binding sites on the expression of LAPTM5 and CD53, we generated

403 10,000 synthetic inputs for each of LAPTM5 and CD53 in which all binding sites in randomly selected intervals of its promoter and

404 RNA were removed. From each of these synthetic inputs, we computed predicted expression values and

405 correlated them with those of the other gene, whose binding sites were left undisrupted. Then, we used multiple linear

406 regression to associate the location of disrupted regions with the correlation values between LAPTM5 and CD53 to

407 estimate the effects of the disruption in each region on the co-expression relationship.

408 In silico knockout experiment

409 To simulate the effect of regulator knockout (KO) on the expression of LAPTM5 and CD53, we generated 10,000

410 synthetic inputs for each of LAPTM5 and CD53 in which each protein or miRNA bound to its promoter or RNA was

411 randomly removed from its feature matrices. From each of these synthetic inputs, we computed predicted expression

412 values and correlated them with those of the other gene, whose feature matrices were left intact. Then, we used

413 multiple linear regression to associate KOs of regulators with the correlation values between LAPTM5 and CD53 to

414 estimate the effects of the KO of each regulator on the co-expression relationship. We applied the Bonferroni correction

415 to control for multiple testing, and regulators with a corrected p-value less than 0.05 in all tissues were chosen as the

416 key drivers of the co-expression.
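
The logic of the in silico knockout experiment can be illustrated with a toy stand-in for the trained model. In the sketch below, every name and number is hypothetical: `toy_predict` replaces the DEcode network with a simple additive model, regulators are knocked out independently with probability 0.5, and, for brevity, the multiple linear regression used in the study is replaced by a plain difference in mean correlation between samples that retain a regulator and samples that lack it.

```python
import random

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0

def toy_predict(kept, weights):
    """Stand-in for the trained model: per-tissue expression from retained regulators."""
    n_tissues = len(next(iter(weights.values())))
    return [sum(weights[r][t] for r in kept) for t in range(n_tissues)]

def knockout_effects(weights, partner_profile, n_samples=2000, seed=0):
    rng = random.Random(seed)
    regulators = sorted(weights)
    samples = []
    for _ in range(n_samples):
        kept = [r for r in regulators if rng.random() < 0.5]  # random knockouts
        corr = pearson(toy_predict(kept, weights), partner_profile)
        samples.append((set(kept), corr))
    effects = {}
    for r in regulators:
        with_r = [c for kept, c in samples if r in kept]
        without_r = [c for kept, c in samples if r not in kept]
        # Positive value: knocking out r lowers the co-expression on average.
        effects[r] = sum(with_r) / len(with_r) - sum(without_r) / len(without_r)
    return effects

# Hypothetical regulators acting on one gene across four tissues.
weights = {"REG_A": [1, 2, 3, 4], "REG_B": [1, -1, 1, -1], "REG_C": [0.2, 0.1, 0.2, 0.1]}
partner = [1, 2, 3, 4]  # unperturbed predicted profile of the partner gene
effects = knockout_effects(weights, partner)
```

In this toy setting REG_A, whose tissue profile matches the partner gene, receives the largest positive effect, which is the kind of signal used in the text to nominate drivers of co-expression.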

417 Conditional independence test

418 To validate the effect of the predicted drivers on co-expression, we conducted a conditional independence test. We

419 regressed the actual log2-TPM values of LAPTM5 and CD53 on the actual log2-TPM values of the predicted drivers

420 and computed R2 (variance explained) between the residuals of the two genes. The R2 based on the actual gene expression

421 and that based on the residuals were compared to quantify the covariance explained by the predicted drivers. To evaluate

422 the significance of this effect, we repeated this process 1,000 times with an equal number of randomly picked genes

423 that have a binding site in LAPTM5 or CD53 as regressors.
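
A minimal sketch of this comparison is given below for the simplest case of a single driver. The function names and expression values are hypothetical, the study regressed on the full set of predicted drivers rather than one, and the summary statistic used here (one minus the ratio of residual R2 to raw R2) is an assumed way of expressing the comparison, since the exact formula is not given in the text.

```python
def residualize(y, x):
    """Residuals of y after simple linear regression on a single regressor x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

def r_squared(a, b):
    """Squared Pearson correlation between two profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov * cov / (va * vb)

def fraction_mediated(gene_a, gene_b, driver):
    """Share of the gene_a/gene_b R2 that disappears after regressing out the driver."""
    r2_raw = r_squared(gene_a, gene_b)
    r2_resid = r_squared(residualize(gene_a, driver), residualize(gene_b, driver))
    return 1 - r2_resid / r2_raw

# Hypothetical log2-TPM values across six samples.
driver = [1, 2, 3, 4, 5, 6]
gene_a = [2.1, 3.9, 6.1, 7.9, 10.1, 11.9]
gene_b = [1.6, 3.1, 4.4, 5.9, 7.6, 9.0]
frac = fraction_mediated(gene_a, gene_b, driver)
```

The significance assessment in the text would repeat the same calculation with randomly chosen regressors and compare the resulting distribution with the value obtained for the predicted drivers.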

424 DEcode model for DE prior rank

425 DE prior rank was downloaded from https://github.com/maggiecrow/DEprior. In the DE prior rank, each gene has a

426 probability-like value where zero is the minimum and one is the maximum. To convert this value to a non-bounded

427 scale, we applied the logit transformation to the DE prior value. We assigned a value of 10 to a gene that had an infinite

428 value after the logit transformation. We used the same gene splits as the GTEX-tissue-model, which resulted in 13,433

429 genes for training, 1,504 genes for validation, and 1,674 genes for testing. We trained the DEcode model for DE prior

430 rank using the same procedure as with the GTEX-person-specific models. To evaluate the contribution of promoter and

431 RNA features to the prediction, the model was also trained with randomized input features. Receiver operating

432 characteristic (ROC) curve analysis was performed using pROC R package43 with a default setting. We performed

433 pathway analysis of the TFs with a DeepLIFT score greater than the 90th percentile using KEGG pathways44. KEGG

434 pathway gene sets were downloaded from MSigDB v6.145. The enrichment significance was based on results of the

435 hypergeometric test, with 757 unique TF genes as a background, against KEGG pathways comprising at least 5

436 background genes. FDR was controlled at 5%. We manually curated the 159 disease-related data sets used in the

437 construction of the DE prior ranking, to determine the number of data sets related to cancer or inflammatory disease.
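
The logit transformation of the DE prior value, including the cap applied to infinite values, can be sketched as follows. The function name is hypothetical, and the handling of a value of exactly zero (capped symmetrically at -10) is an assumption, since the text only describes the replacement value of 10.

```python
from math import log

def logit_de_prior(p, cap=10.0):
    """Logit-transform a probability-like DE prior value, capping infinite results."""
    if p >= 1.0:
        return cap       # logit(1) is +infinity; the text assigns 10 in this case
    if p <= 0.0:
        return -cap      # ASSUMPTION: symmetric cap; the text does not state how 0 is handled
    return log(p / (1.0 - p))

# Hypothetical DE prior values.
transformed = [logit_de_prior(p) for p in (0.5, 0.9, 1.0)]
```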

438 Code and model availability

439 DEcode software and pre-trained models for tissue- and person-specific transcriptomes are available at

440 www.differentialexpression.org.

441 Acknowledgments

442 We thank Dr. Lei Yu for managing access to GTEX data. The study was supported by NIH grants P30AG010161,

443 R01AG061798, and R01AG057911. The Genotype-Tissue Expression (GTEx) Project was supported by the Common

444 Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and

445 NINDS. The data used for the analyses described in this manuscript were obtained from the GTEx Portal on 10/01/2018.

446 Author Contributions

447 ST contributed to the conception and design of the study. ST performed the computational analysis. ST, CG, SM, and

448 YW interpreted the results. ST wrote the first draft of the manuscript. All authors contributed to manuscript revision, read

449 and approved the submitted version.

450 References

451 1. Lee, T. I. & Young, R. A. Transcriptional regulation and its misregulation in disease. Cell 152, 1237-51 (2013).

452 2. Lambert, S. A. et al. The Human Transcription Factors. Cell 172, 650-665 (2018).

453 3. Glisovic, T., Bachorik, J. L., Yong, J. & Dreyfuss, G. RNA-binding proteins and post-transcriptional gene

454 regulation. FEBS letters 582, 1977-86 (2008).

455 4. Bartel, D. P. MicroRNAs: target recognition and regulatory functions. Cell 136, 215-33 (2009).

456 5. Schoenfelder, S. & Fraser, P. Long-range enhancer-promoter contacts in gene expression control. Nature reviews.

457 Genetics 20, 437-455 (2019).

458 6. Smith, Z. D. & Meissner, A. DNA methylation: roles in mammalian development. Nature reviews. Genetics 14, 204-

459 20 (2013).

460 7. Roundtree, I. A., Evans, M. E., Pan, T. & He, C. Dynamic RNA Modifications in Gene Expression Regulation. Cell

461 169, 1187-1200 (2017).

462 8. Avsec, Ž et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics.

463 Nature biotechnology 37, 592-600 (2019).

464 9. Libbrecht, M. W. & Noble, W. S. Machine learning applications in genetics and genomics. Nature reviews. Genetics

465 16, 321-32 (2015).

466 10. Jaganathan, K. et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell 176, 535-548.e24

467 (2019).

468 11. Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk.

469 Nature genetics 50, 1171-1179 (2018).

470 12. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-

471 binding proteins by deep learning. Nature biotechnology 33, 831-8 (2015).

472 13. Melé, M. et al. Human genomics. The human transcriptome across tissues and individuals. Science (New York,

473 N.Y.) 348, 660-5 (2015).

474 14. Shrikumar, A., Greenside, P. & Kundaje, A. Learning Important Features Through Propagating Activation

475 Differences. (2017).

476 15. Lundberg, S. M. & Lee, S. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30, 4765-4774 (2017).

477 16. Chong, J. A. et al. REST: a mammalian silencer protein that restricts sodium channel gene expression to neurons.

478 Cell 80, 949-57 (1995).

479 17. Imperato, M. R., Cauchy, P., Obier, N. & Bonifer, C. The RUNX1-PU.1 axis in the control of hematopoiesis.

480 International journal of hematology 101, 319-29 (2015).

481 18. Soares, E. & Zhou, H. Master regulatory role of p63 in epidermal development and disease. Cellular and molecular

482 life sciences : CMLS 75, 1179-1190 (2018).

483 19. Watt, A. J., Garrison, W. D. & Duncan, S. A. HNF4: a central regulator of hepatocyte differentiation and function.

484 Hepatology (Baltimore, Md.) 37, 1249-53 (2003).

485 20. Lefterova, M. I., Haakonsson, A. K., Lazar, M. A. & Mandrup, S. PPARγ and the global map of adipogenesis and

486 beyond. Trends in endocrinology and metabolism: TEM 25, 293-302 (2014).

487 21. Ge, Z., Quek, B. L., Beemon, K. L. & Hogg, J. R. Polypyrimidine tract binding protein 1 protects mRNAs from

488 recognition by the nonsense-mediated mRNA decay pathway. eLife 5 (2016).

489 22. Wang, Y. et al. N6-methyladenosine modification destabilizes developmental regulators in embryonic stem cells.

490 Nature Cell Biology 16, 191-198 (2014).

491 23. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285-91 (2016).

492 24. Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins

493 University (Baltimore, MD), {June 11th, 2019}. World Wide Web URL: https://omim.org/.

494 25. Ardlie, K. G. et al. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans.

495 Science 348, 648-660 (2015).

496 26. Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data.

497 Nature genetics 47, 1091-8 (2015).

498 27. Gaiteri, C., Ding, Y., French, B., Tseng, G. C. & Sibille, E. Beyond modules and hubs: the potential of gene

499 coexpression networks for investigating molecular mechanisms of complex brain disorders. Genes, brain, and behavior

500 13, 13-24 (2014).

501 28. Crow, M., Lim, N., Ballouz, S., Pavlidis, P. & Gillis, J. Predictability of human differential gene expression. Proc.

502 Natl. Acad. Sci. U. S. A. 116, 6491-6500 (2019).

503 29. Zhu, Y. et al. POSTAR2: deciphering the post-transcriptional regulatory logics. Nucleic acids research 47, D203-

504 D211 (2019).

505 30. Agarwal, V., Bell, G. W., Nam, J. & Bartel, D. P. Predicting effective microRNA target sites in mammalian

506 mRNAs. eLife 4 (2015).

507 31. Yevshin, I., Sharipov, R., Valeev, T., Kel, A. & Kolpakov, F. GTRD: a database of transcription factor binding

508 sites identified by ChIP-seq experiments. Nucleic acids research 45, D61-D67 (2017).

509 32. Chollet, F. keras. GitHub repository https://github.com (2015).

510 33. Abadi, M. et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. (2016).

511 34. Bergstra, J., Komer, B., Eliasmith, C., Yamins, D. & Cox, D. D. Hyperopt: a Python library for model selection and

512 hyperparameter optimization. Computational Science & Discovery 8, 014008 (2015).

513 35. Kingma, D. P. & Ba, L. J. Adam: A Method for Stochastic Optimization. arXiv.org (2015).

514 36. Dawes, R., Lek, M. & Cooper, S. T. Gene discovery informatics toolkit defines candidate genes for unexplained

515 infertility and prenatal or infantile mortality. NPJ genomic medicine 4, 8-11 (2019).

516 37. Smith, C. L., Blake, J. A., Kadin, J. A., Richardson, J. E. & Bult, C. J. Mouse Genome Database (MGD)-2018:

517 knowledgebase for the laboratory mouse. Nucleic Acids Res. 46, D836-D842 (2018).

518 38. Koscielny, G. et al. The International Mouse Phenotyping Consortium Web Portal, a unified point of access for

519 knockout mice and related phenotyping data. Nucleic Acids Res. 42, 802 (2014).

520 39. Tsherniak, A. et al. Defining a Cancer Dependency Map. Cell 170, 564-576.e16 (2017).

521 40. Kuleshov, M. V. et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic

522 acids research 44, 90 (2016).

523 41. Chen, E. Y. et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC

524 Bioinformatics 14, 128 (2013).

525 42. Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies.

526 Nucleic acids research 43, e47 (2015).

527 43. Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC

528 Bioinformatics 12, 77 (2011).

529 44. Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27-30 (2000).

530 45. Liberzon, A. et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell systems 1, 417-

531 425 (2015).

532 Figure legends

533

534 Figure 1. Overview of building and evaluating the DEcode transcriptome prediction model.

535 [Figure 2 image not reproduced. Recoverable labels: panels A-G. Panels A and B are split into Gene, Transcript, and Transcript usage columns; the panel A scatter plots (predicted versus actual expression) are annotated R = 0.82, R = 0.62, and R = 0.46, each with p < 2.2e-16. Panel B shows Spearman's correlation between predicted and actual expression per tissue, panel C a tissue-by-tissue correlation heatmap, and panel D the performance gain relative to ExPecto (%). Panels E-G compare models trained on promoter & RNA features, RNA features only, promoter features only, and random features, including predictions for the transcripts of the ACADM gene (F) and performance relative to the full model (G).]

536 Figure 2. Performance of the tissue-level models. (A) Prediction performances on the median absolute expression levels

537 across tissues. The predicted log2-TPM values for 2,705 genes or 7,631 transcripts coded on chromosome 1 were

538 compared with the actual median log2-TPMs across 53 tissues using Spearman's rank correlation. The transcript usage

539 within each gene was computed by subtracting the mean log2-TPM from log2-TPM of transcripts in each gene for 1,485

540 genes that had multiple transcripts. (B) Prediction performances on the tissue-specific expression profiles. The predicted

541 fold changes relative to the median of all tissues for 2,705 genes or 7,631 transcripts coded on chromosome 1 were

542 compared with the actual fold changes in each tissue using Spearman's rank correlation. The differences in the transcript

543 usage within each gene across tissues were computed for 1,485 genes that had multiple transcripts. The color of each bar

544 indicates the tissue group based on the similarity of gene expression profiles. (C) Heatmap showing pairwise

545 correlations between the predicted and the actual tissue-specific expression profiles of 53 tissues for the testing genes.

546 (D) Performance comparison of DEcode with ExPecto. The root-mean-square errors (RMSE) of the DEcode models for the

547 expression levels of 714 genes coded on chromosome 8 were compared with those of ExPecto. Each method was executed

548 10 times. The median RMSE of the 10 runs is displayed as a bar plot, and the error bars represent the median absolute

549 deviation. (E) The predictive performances of the models trained with different sets of features. (F) The comparison of

550 the expression levels for ACADM transcripts predicted by the models trained with different feature sets. (G) The

551 predictive performances on the tissue-specific gene expression profiles of the testing data relative to the model trained

552 with a full set of features.
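
The transcript-usage measure and the rank-correlation comparison described in panels (A) and (B) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the transcript identifiers, and the toy log2-TPM values are hypothetical, and the Spearman implementation is a plain-Python stand-in that assumes at least two distinct values in each input.

```python
from statistics import mean


def transcript_usage(log2_tpm_by_transcript):
    """Subtract the gene-level mean log2-TPM from each transcript's log2-TPM."""
    gene_mean = mean(log2_tpm_by_transcript.values())
    return {tx: value - gene_mean for tx, value in log2_tpm_by_transcript.items()}


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical log2-TPM values for three transcripts of a single gene.
actual = {"tx_a": 5.0, "tx_b": 3.0, "tx_c": 1.0}
predicted = {"tx_a": 4.0, "tx_b": 4.5, "tx_c": 2.0}

ids = sorted(actual)
usage_actual = transcript_usage(actual)        # tx_a: 2.0, tx_b: 0.0, tx_c: -2.0
usage_predicted = transcript_usage(predicted)  # tx_a: 0.5, tx_b: 1.0, tx_c: -1.5
rho = spearman([usage_predicted[t] for t in ids], [usage_actual[t] for t in ids])
print(round(rho, 3))  # 0.5
```

Centering on the gene mean removes the overall expression level of the gene, so the correlation reflects only how expression is distributed among the transcripts of that gene.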

553 [Figure 3 image not reproduced. Recoverable labels: panels A-E. (A) Enrichment score against regulators ranked by the absolute DeepLIFT score, for the TF, RNABP, and miRNA classes. (B) Variable importance of the top predictive features, annotated by type (TF or RNABP). (C) Heatmap of DeepLIFT scores by tissue and regulator, annotated by type (TF or RNABP). (D) Log TPM against DeepLIFT score for example regulators. (E) |DeepLIFT score| compared between LoF-tolerant and LoF-intolerant genes, non-OMIM and OMIM genes, non-lethal and lethal genes, and non-essential and essential genes, under the headings Human phenotype, Mouse phenotype, and Cell Essentiality; the annotated Wilcoxon p-values are 8e-11, 3.4e-05, 2.8e-05, and 0.00036.]

554 Figure 3. Identification and characterization of key predictors in the tissue-level models. (A) The enrichment of a

555 regulator class in the key predictors for the median absolute expression levels. We ranked the regulators by the DeepLIFT

556 scores and evaluated the enrichment of each regulator class. We used the pre-ranked gene set enrichment analysis (GSEA)

557 algorithm with 10,000 permutations to compute enrichment scores and statistical significance. (B) Top 30 key predictors

558 for the median absolute expression levels. (C) Key predictive regulators for the tissue-specific transcriptomes. We

559 selected the top 5 key predictors for each tissue and displayed their DeepLIFT scores as a heatmap. The Ward

560 linkage method with the Euclidean distance was used to cluster tissues and predictors. (D) Example relationships between

561 the predictive importance for a regulator and its expression levels across tissues. (E) The overlap between the key

562 regulators for the median absolute expression levels and external functional gene sets.
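
The pre-ranked enrichment test in panel (A) can be sketched as a running-sum statistic over the ranked regulator list with a permutation-based p-value. The sketch below is a simplified, unweighted variant written for illustration; it is not the GSEA software or the authors' pipeline, and the function names, regulator identifiers, and class membership are hypothetical.

```python
import random


def enrichment_score(ranked_ids, members):
    """Walk down the ranked list, stepping up at class members and down otherwise.

    Returns the running-sum value with the largest absolute deviation from zero.
    """
    n_hit = sum(1 for r in ranked_ids if r in members)
    n_miss = len(ranked_ids) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("the ranked list must contain both members and non-members")
    up, down = 1.0 / n_hit, 1.0 / n_miss
    running, best = 0.0, 0.0
    for r in ranked_ids:
        running += up if r in members else -down
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_ids, members, n_perm=1000, seed=0):
    """Two-sided empirical p-value obtained by shuffling the ranking."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_ids, members)
    shuffled = list(ranked_ids)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, members)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical regulators, already sorted by decreasing |DeepLIFT score|.
ranked = ["reg1", "reg2", "reg3", "reg4", "reg5", "reg6"]
regulator_class = {"reg1", "reg2"}  # hypothetical members of one regulator class

score, p_value = permutation_p_value(ranked, regulator_class, n_perm=1000, seed=0)
print(score)  # 1.0, because both class members sit at the top of the ranking
```

A positive score indicates that the class is concentrated among the highest-ranked regulators, and a negative score that it is concentrated near the bottom of the ranking.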

563 [Figure 4 image not reproduced. Recoverable labels: panels A-F. Axis and legend text includes Spearman's correlation of predicted versus actual expression per tissue (legend: Different person, Same person); average correlation for the comparisons "Between tissues, between individuals" and "Between tissues, within individuals" (legend: Observed, Predicted); the feature sets Promoter & RNA features, RNA features, Promoter features, and Random; the panel D title "Per-person key predictive regulators for the hippocampus transcriptome" with a DeepLIFT score color scale and TF/RNABP type annotation; number of significant genes for DEcode and PrediXcan; and, per tissue, number of genes (upper histogram) and number of binding sites (lower line plot) against per-gene prediction accuracy (Spearman's correlation).]

564 Figure 4. Performance of the person-specific models. (A) The predictive performances of the person-specific models for

565 the actual data from the same individuals and unrelated random individuals. (B) The person-specific models predicted

566 person-specific expression shared across tissues. (C) The performances of the models trained with a distinct feature set.

567 (D) Per-person key predictive regulators for the hippocampus transcriptome. We selected the top 5 key predictors of the

568 hippocampus transcriptome for each individual and displayed their DeepLIFT scores as a heatmap. The Ward

569 linkage method with the Euclidean distance was used to cluster individuals and predictors. (E) Comparison of per-gene

570 predictive accuracy between DEcode and PrediXcan. The number of genes that showed a positive Pearson’s correlation

571 between predicted and actual gene expression levels at FDR 5% was calculated for each method. Only the testing genes

572 on chromosome 1 were used for this comparison. (F) Per-gene prediction accuracy is associated with the number of

573 features present in RNAs and promoters. The histogram represents Pearson’s correlations between the predicted and the

574 actual expression for each gene. The line plot shows the average number of RNA and promoter features of genes in each

575 bin of the histogram. Spearman’s correlation between the number of features and per-gene correlations is displayed in

576 the line plot. The error bars indicate standard errors.
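
The gene count in panel (E) depends on controlling the false discovery rate across per-gene correlation tests. The caption does not name the FDR procedure, so the sketch below assumes the Benjamini-Hochberg step-up adjustment; the function names and the toy correlations and p-values are hypothetical and are not results from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant_positive(correlations, p_values, fdr=0.05):
    """Count genes with a positive correlation whose adjusted p-value is <= fdr."""
    q_values = benjamini_hochberg(p_values)
    return sum(1 for r, q in zip(correlations, q_values) if r > 0 and q <= fdr)


# Hypothetical per-gene Pearson correlations and their p-values.
correlations = [0.5, 0.3, -0.2, 0.1]
p_values = [0.01, 0.04, 0.03, 0.20]

print(count_significant_positive(correlations, p_values))  # 1
```

In this toy input only the first gene passes: its adjusted p-value is 0.04, whereas the second gene's raw p-value of 0.04 adjusts to about 0.053 and falls just outside the 5% threshold.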

577 [Figure 5 image not reproduced. Recoverable labels: panels A-C for the phenotypes AGE and SEX. Per-tissue scatter plots relate the association of predicted expression with phenotype (t-statistic) to the association of actual expression with phenotype (t-statistic), each annotated with R and p values; tissue-by-tissue correlation heatmaps are labeled Actual expression and Predicted expression; and a further panel shows the number of features associated with phenotype by regulator class (miRNA, RNABP, TF).]

Figure 5. Application of the person-specific models to analyze phenotype-related gene signatures. (A) Scatter plots comparing the associations of genes with age and sex computed from the predicted expression against those computed from the actual expression. Each plot displays Spearman’s correlation between the t-statistics obtained from the predicted and the actual gene expression. (B) Pairwise Spearman’s correlations between the predicted and the actual associations of genes with age and sex across all tissues. The numbers in the diagonal elements of the heatmap indicate the rank of the similarity between the predictions and the actual observations in the corresponding tissue. (C) The regulators whose DeepLIFT scores were associated with age and sex. The Benjamini–Hochberg procedure was used to control the false discovery rate at 5% for each phenotype.
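The false discovery rate control mentioned in panel C can be illustrated with a minimal sketch of the Benjamini–Hochberg step-up procedure. The function name `benjamini_hochberg` and the toy p-values are assumptions made for illustration; they are not taken from the DEcode code base or its results.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at the given FDR."""
    m = len(pvalues)
    if m == 0:
        return []
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank k such that p_(k) <= (k / m) * fdr.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            largest_k = rank
    # Reject every hypothesis whose rank is at or below that k (step-up rule).
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Toy p-values, one per hypothetical regulator (illustrative only).
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
significant = benjamini_hochberg(toy_pvalues, fdr=0.05)
```

In this sketch the procedure would be run once per phenotype, so that age and sex each get their own 5% threshold, as the legend describes.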

Figure 6

[Figure 6 image. (A) Density of Spearman’s correlation in the predicted expression for gene pairs with negative and positive observed correlation, one panel per tissue. (B) Average inter-tissue correlation for negatively and positively co-expressed gene pairs, for all models and for models with better performance. (C) Per-tissue numbers of gene pairs for which the full features were required, promoter features worked better than RNA features, or RNA features worked better than promoter features. (D) Effect of binding site removal on co-expression (strengthen or weaken) and number of binding sites by position (100 bp per interval) in the promoter and RNA of LAPTM5 and CD53. (E) Effect of KO on co-expression by tissue for AR, CELF2, HNF1B, RBM27, RELA, RXRA, SMAD5, SPI1, STAT1, and TBX21. (F) Percent of covariance explained by regulator expression per tissue (* P-value < 0.05, ** P-value < 0.005), and scatter plots of LAPTM5 against CD53 for Brain − Cortex and Thyroid with and without normalization of regulator levels.]

Figure 6. Application of the person-specific models to investigate gene co-expression. (A) Co-expression relationships in the predicted gene expression. We defined the ground-truth co-expression relationships as gene pairs with an absolute Spearman’s correlation greater than 0.7 in the actual expression of the testing data. The density of Spearman’s correlation between the co-expressed gene pairs in the predicted gene expression data was estimated using the density function in R with the Gaussian kernel. (B) Inter-tissue gene co-expression relationships in the predicted gene expression. We defined the ground-truth co-expression relationships as gene pairs with an absolute Spearman’s correlation greater than 0.5 in the actual expression of the testing data. We computed the average Spearman’s correlation of the inter-tissue co-expressed gene pairs in each pair of tissues, using the predicted expression from all models and from the models whose performance on validation data was above the 50th percentile in each tissue. (C) The major feature types that contributed to gene co-expression. We defined successfully predicted gene pairs as those with an absolute Spearman’s correlation greater than 0.3 and a correlation sign matching the ground truth. The successfully predicted gene pairs of the model trained with the full set of features were split into three groups based on the performance of the models trained with only RNA features or only promoter features. (D) The effect of binding site removal on co-expression between LAPTM5 and CD53. We simulated gene expression profiles with random removals of the binding sites in each gene 10,000 times and computed the correlation between LAPTM5 and CD53 for each simulation. Multiple regression was used to estimate the effect of binding site removal on the co-expression in each tissue. (E) The key regulators for the co-expression between LAPTM5 and CD53. We simulated gene expression profiles with random KOs of regulators in each gene 10,000 times and computed the correlation between LAPTM5 and CD53 for each simulation. We used multiple regression to estimate the effect of each KO on the co-expression in each tissue and identified the consensus regulators across tissues. (F) Percent of the co-expression relationship explained by the expression levels of the key regulators. The white diamond and the error bar in each bar indicate the average and the 95th percentile, respectively, of the percent of variance explained by randomly picked regulators. The scatter plots show the effect of the key regulators on the co-expression.
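The pair-level evaluation described for panels A and C can be sketched as follows. This is a minimal illustration, assuming that the 0.7 cutoff is applied to the actual expression and the 0.3 cutoff to the predicted expression; the function names, the gene labels (`GENE_A`, `GENE_B`, `GENE_C`), and the toy values are hypothetical and do not come from the DEcode code base or data.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's correlation, computed as the Pearson correlation of the ranks.

    Constant vectors (zero rank variance) are not handled in this sketch.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


def evaluate_coexpression(actual, predicted, truth_cutoff=0.7, pred_cutoff=0.3):
    """Map each ground-truth gene pair to True if it counts as successfully predicted."""
    results = {}
    for g1, g2 in combinations(sorted(actual), 2):
        r_actual = spearman(actual[g1], actual[g2])
        if abs(r_actual) <= truth_cutoff:
            continue  # not a ground-truth co-expressed pair
        r_pred = spearman(predicted[g1], predicted[g2])
        same_sign = (r_actual > 0) == (r_pred > 0)
        results[(g1, g2)] = abs(r_pred) > pred_cutoff and same_sign
    return results


# Toy expression values across five hypothetical individuals.
actual = {
    "GENE_A": [1, 2, 3, 4, 5],
    "GENE_B": [2, 4, 5, 7, 9],
    "GENE_C": [5, 4, 3, 2, 1],
}
predicted = {
    "GENE_A": [0.1, 0.2, 0.3, 0.4, 0.5],
    "GENE_B": [0.2, 0.1, 0.4, 0.3, 0.5],
    "GENE_C": [0.3, 0.1, 0.2, 0.5, 0.4],
}
outcome = evaluate_coexpression(actual, predicted)
```

In the toy data only the first pair is recovered: the predicted correlations involving `GENE_C` exceed 0.3 but have the wrong sign, which is why the sign check is kept separate from the magnitude cutoff.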

Figure 7

[Figure 7 image. (A) DE Prior Rank: scatter plot of predicted against actual values, colored by count (R = 0.53, p < 2.2e−16). (B) Prediction of genes with DE Prior Rank > 0.9: ROC curves (sensitivity against specificity) for models whose input was promoter & RNA features, RNA features, promoter features, or random features.]

Figure 7. DEcode predicts a gene’s prior probability of differential expression. (A) Scatter plot showing the relation between the predicted and the actual DE prior rank. The predicted logit of the DE prior rank was converted to a probability and compared with the actual DE prior rank using Spearman’s correlation. (B) Performance of the models trained with distinct feature sets. The ROC curves show the performance of each model in predicting genes with a DE prior rank greater than 0.9.
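The two evaluation steps in this legend, converting a predicted logit to a probability and scoring the prediction of genes with a DE prior rank above 0.9, can be sketched as below. The helper names and toy values are assumptions for illustration only, and the area under the ROC curve is computed here with the pairwise (Mann–Whitney) formulation rather than with whatever routine the authors used.

```python
from math import exp


def logit_to_probability(logit):
    """Inverse of the logit (logistic function), written to avoid overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + exp(-logit))
    z = exp(logit)
    return z / (1.0 + z)


def roc_auc(scores, labels):
    """Area under the ROC curve: the chance that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy values for six hypothetical genes (illustrative only).
predicted_logits = [2.0, 0.5, -1.0, 1.5, -2.0, 0.0]
actual_de_prior_rank = [0.95, 0.40, 0.92, 0.97, 0.10, 0.55]

predicted_probability = [logit_to_probability(v) for v in predicted_logits]
# Positive class: genes with an actual DE prior rank greater than 0.9.
is_frequently_de = [rank > 0.9 for rank in actual_de_prior_rank]
auc = roc_auc(predicted_probability, is_frequently_de)
```

Because the logistic function is monotonic, the AUC would be the same if computed on the raw logits; the conversion matters only when the predictions are compared with the DE prior rank on the same 0 to 1 scale.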
