bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Decoding transcriptional regulation via a expression predictor

Yuzhou Wang*1,2, Yu Zhang*1, Jiazhen Gong1, Jianqiang Bao2, Shisong Ma1,3

1. Hefei National Laboratory for Physical Sciences at the Microscale, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Innovative Academy of Seed Design, Chinese Academy of Sciences, Hefei, China 2. The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China 3. School of Data Science, University of Science and Technology of China, Hefei, China

* These authors contribute equally.

Correspondence: Shisong Ma ([email protected])

ABSTRACT

Transcription factors (TF) regulate cellular activities via controlling , but a predictive model describing how TFs quantitatively modulate human transcriptomes was lacking. We constructed a universal human gene expression predictor and utilized it to decode transcriptional regulation. Using 1613 TFs’ expression, the predictor reconstituted highly accurate transcriptomes for samples derived from a wide range of tissues and conditions. The predictor’s broad applicability indicated it had recapitulated the quantitative relationships between TFs and target ubiquitous across tissues. Significant interacting TF-target gene pairs were then extracted from the predictor and enabled downstream inference of TF regulators for diverse pathways involved in development, immunity, metabolism, and stress response. Thus, we present a novel approach to study human transcriptional regulation following the “understanding by modeling” principle.

INTRODUCTION The human tissues and cells are extremely complex entities, whose functional and physiological status are largely determined by transcriptomes. Understanding predictively and quantitatively how gene expression are regulated across diverse tissues and conditions is crucial to

1 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

decode the genetic blueprint of human. As transcription is mainly modulated by transcription factors (TF), gene regulatory networks (GRN) that connect TFs and their target genes are important tools to decipher transcriptional regulation (1-3). To construct GRNs, TFs binding specificities to target genes have been determined systematically via charactering TF binding motifs and profiling -wide TF-DNA interactions (4, 5). GRNs were also deduced from transcriptome changes after perturbing TFs’ expression, using methods like Perturb-Seq (6). GRNs were further reverse- engineered from transcriptomes via computational algorithms like probabilistic graphical model, neural network, decision tress, LASSO regression, and network component analysis (7-11). However, to our knowledge no universal model has been built that can quantitatively predict human transcriptomes based on TF activities. Previously we developed a systematic approach named EXPLICIT (Expression Prediction via Linear Combination of Transcription Factors) to construct a gene expression predictor for the model plant Arabidopsis thaliana (Geng et al., submitted). Here, we applied the same approach to build a universal human gene expression predictor. The predictor successfully reconstituted highly accurate transcriptomes for samples across a wide range of human tissues and cell types. It further enabled downstream inference of transcriptional regulators for human genes and pathways involved in development, immunity, metabolism, and stress response.

RESULTS AND DISCUSSION

A human gene expression predictor

Using the EXPLICIT approach (Geng et al., submitted), a human gene expression predictor was built to predict gene expression levels based on TFs’ expression. Large-scaled RNA-Seq-based human gene expression data were obtained from the ARCHS4 (v7) database (12), which contained transcriptomes for more than 167000 samples. After removing single-cell RNA-Seq (see below) and low-quality samples, 59097 samples were kept for the predictor model training. The predictor, constructed via ordinary linear least squares regression, shall use 1613 TF genes to predict the expression of 23346 non-TF genes (Figure 1A). Similar to the Arabidopsis predictor model, to obtain an accurate human predictor model, the number of samples within the training dataset was critical. Increasing the number of training samples effectively increased the model’s performance on predicting novel samples, as judging from the Pearson’s correlation coefficient (r) between the predicted and actual expression values of the independent test samples (Figure 1B). The over-fitting or optimism within the model also decreased with increasing training samples, as indicated by the decreased difference between the training and testing r. Thus, all 59097 human RNA-Seq samples were used to train the predictor model, which reduced over-fitting to a minimal level.

2 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

The resulted human gene expression predictor was tested for predicting performance on novel samples via leave-one-out cross-validation (LOOCV). The training samples of the predictor originally came from 4000+ NCBI gene expression omnibus series (GSE), with each series being an independent study. In each LOOCV run, a single GSE series was selected and its samples were withheld, while the model was retrained with samples from all other series and then tested on the selected hold-out GSE series. The hold-out samples were deemed as novel independent samples for the retrained model. Shown in Figure 1C are the LOOCV test results for GSE115736 that had measured the transcriptomes of 12 different human hematopoietic cell types (13). Using 1613 TFs’ expression values, the predictor accurately predicted other 23346 non-TF genes’ expression in two samples of neutrophils (GSM3188532) and natural killer cells (GSM3188533), with r of 0.970 and 0.982 between the predicted and actual expression, respectively. The average r for all samples with the series was 0.980. LOOCV test was conducted for all 1545 GSE series that have 10+ samples. These series covered a broad range of experiment conditions used for human bulk RNA-Seq analysis, including sampling on different tissues or cell types. Despite such diversities, the predictor still achieved high accuracy, with an average r of 0.976±0.026 for all tested GSE series (Figure 1D).

The predictor was further tested on the GTEx (v8) dataset that have measured 17382 transcriptomes from diverse human tissues (14). The ARCHS4 training dataset did not contain samples from GTEx, and the two datasets were independent from each other. The predictor accurately predicted the GTEx transcriptomes derived from different tissues. Most tissues had average r ranging between 0.95 and 0.97, i.e 0.961 for , 0.964 for liver, and 0.969 for . Even the two tissues with the lowest accuracy, pituitary and adrenal gland, had average r of 0.931 and 0.946, respectively (Figure 1E). But the predictor had relatively low performance on single-cell RNA-Seq samples (data not shown), and these samples were removed before model training. Thus, the predictor model can be universally applied to most human bulk RNA-Seq samples from various tissues or cell types, indicating it had recapitulated the quantitative relationships between TFs and target genes ubiquitous across tissues. We envisioned such a model can be used to infer TF regulators for human genes and pathways.

Human gene co-expression modules

To infer TF regulators for human pathways, we first identified gene co-expression modules containing genes that might function in the same pathways. A human gene co-expression network based on the graphical Gaussian model (GGM) was built using the ARCHS4 dataset via a previously published procedure with minor modification (15). After calculating the partial

3 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

correlation coefficients (pcor) between all genes, 217686 co-expressed gene pairs with pcor >= 0.045 were selected to construct a gene co-expression network, which accounted for 0.07% of all possible gene pairs (Figure S1 and Table S1). The network, containing 24867 genes, was then assembled into 1416 gene co-expression modules (Figure 2A and Table S2). Via guilt-by- association, the genes within the same module co-expressed together and might have functions in the same pathways. (GO) analysis indicted 429 modules have enriched GO terms (P <= 1E-03) (Table S3).

Further inspection pinpointed modules related to various development, immunity, metabolism, or stress response pathways. For example, a number of modules are involved in the development or functions of specific organs or tissues. Module #18 works in angiogenesis as it is enriched with 20 vasculature development genes (P = 2.78E-16), including KDR and TEK (16), #116 functions in insulin secretion (P = 3.64E-13) with biased expression in pancreas, while #37 is involved in visual perception (P = 1.27E-37) and photoreceptor cell development (Figure 2B-D, and S2). Modules related to hematopoietic and immunity cells were also discovered, including #38 for erythrocyte differentiation, #81 for activation, #13 for activation, and #57 for defense response to virus via type I interferon signaling pathway (Figure 2E-H). Also discovered were modules functioning in metabolism, such as #113 for lipid metabolism with biased expression in adipose tissues and #487 for thyroid hormone metabolism (Figure 2J, K, and S2). Stress response modules were also revealed, such as #74 involved in DNA damage response and mediated apoptotic signaling pathway (Figure 2L). Thus, the network analysis identified gene modules for a broad range of pathways. As genes within the same module have similar co-expression patterns, they might be regulated by the same transcription factors. Identifying such TFs will be highly valuable since these modules play critical roles in diverse developmental and physiological processes related to illnesses like obesity, diabetes, tumors, virus infection, and cardiovascular, immunity, or eye diseases.

TF regulators for genes and pathways

The human gene expression predictor was then used to infer TF regulators for genes and pathways, following the EXPLICIT approach (Geng et al., submitted). Individual regression coefficients within the predictor’s coefficient matrix B were tested for significance, and 2286331 coefficients with p-values <= 1E-12 (approximately equal to a Bonferroni-adjusted P of 5E-05) were deemed as significant and their corresponding TF-gene pairs extracted (Table S4). These significant interacting TF-gene pairs connected TFs to their target genes, and the TFs connected with a target gene were deemed as that gene’s predictor TFs. In average each gene has 98 predictor

4 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

TFs, which could also be considered as the gene’s potential transcriptional regulators. For example, the gene GNAT1, encoding the rod-specific transducin  subunit functioning in phototransduction (17), had 101 identified predictor TFs (Figure 3A). Among the top predictor TFs were three known regulators of GNAT1 expression, NRL, NR2E3, and CRX, while the rest might also contain additional regulators as many of them regulate retinal development (18).

TF regulators were also inferred for gene co-expression modules at the pathways level. For instance, GNAT1 is contained within Module #37 that function in visual perception. The module has 36 non-TF genes in total. Predictor TFs for the module were identified as the common predictor TFs shared by these genes. Specifically, if a TF was a common predictor TF for at least 6 as well as 20% of the non-TF genes within a module and its target genes were over-represented within the same module (P <= 1E-06), the TF was considered as a predictor TF for that module. CRX is a predictor TF for 32 genes within Module #37, while it has only 712 target genes among all 23346 non-TF genes, thus its target genes are enriched 29 folds within the module (P = 8.3E-45). Therefore, CRX was considered as a predictor TF for Module #37. In total 23 predictor TFs were identified for the module (Figure 3B and Table S5). Surprisingly, 16 of them have been shown before to be regulators of retinal development or functions (Table S6, same as below for other modules), such as CRX, NRL, NR2E3, and OTX2 (19). Similarly, 24 predictor TFs were identified for Module #18 involved in vasculature development, among which are 15 known regulators of angiogenesis, like SOX17 and SOX18 (Figure 3C) (20). And for the insulin secretion and pancreas related Module #116, 31 predictor TFs were identified, 19 of which are known regulators of pancreas development or functions, i.e. PDX1, PAX6, MAFA and NEUROG3 (Figure 3D) (21). Thus, among the identified predictor TFs, more than 60% are known regulators, while the rest might be potential novel regulators.

Predictor TFs were also recovered for the modules involved in erythrocyte differentiation (#38), B-cell activation (#81), T-cell activation (#13), and defense response to virus (#57) (Figure 3E-H). Module #38 has 22 predictor TFs, 14 of which are known regulators of erythrocyte differentiation or functions, like KLF1, GATA1 and TAL1 (22). Module #81 has 21 predictor TFs, 12 of which have known functions in B-cells, i.e. PAX5, EBF1 and BCL11A (23). Module #13 has 30 predictor TFs, and 18 of them are known regulators of T cell development or functions, such as TCF7, GATA3, and BCL11B (24). Similarly, 15 out of 24 predictor TFs for Module 57 are also known regulators of defense response to virus, such as STAT1, STAT2, IRF7 and IRF9 (25). Again, more than 55% of the identified predictor TFs are known regulators of the pathways involved.

5 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Also identified were predictor TFs for the modules participating in lipid metabolism (#113), thyroid hormone metabolism (#487), and DNA damage response (#74) (Figure 3I-K). Module #113 has 33 predictors, 17 of them are known regulators of lipid metabolism or , including PPARG and CEBPA that modulate adipocyte differentiation (26). Similarly, Module #487 has 4 predictor TFs, 3 of which (PAX8, FOXE1, and NKX2-1) are among the four curial TFs regulating thyroid development and their are associated with congenital (27). Noted that the other crucial TF HHEX also had 6 target genes within the module, but its enrichment p-value (5.98E-05) was slightly larger than the selected cutoff. In contrast, Module #74 has 30 predictor TFs, but only 8 of them are known regulators of DNA damage response, including the master regulator TP53 (28). Interestingly, the module’s predictor TFs are over-represented with 15 genes, but only ZSCAN4 has known functions in DNA damage response (29), while the rest genes’ relevance remains to be investigated, such as the top predictor TFs ZNF79, ZNF337, and ZNF561.

Including the ones mentioned above, predictor TFs were identified for 757 co-expression modules, with each module having 8.6 predictor TFs in average (Table S5). Besides co-expression modules, predictor TFs were also identified for annotated KEGG pathways (Table S7) (30). For example, the dilated cardiomyopathy (DCM) pathway (hsa05414) contains 96 human genes related to DCM, 94 of which are among the non-TF genes used for our model training. Treating these 94 genes as a group, 40 predictor TFs were identified, among which 27 are known regulators of heart or muscle development, such as NKX2-5, GATA4, HAND1, and TBX5 (Figure 3L) (31). Similarly, predictor TFs were identified for the hematopoietic cell lineage (hsa04640) and PPAR signaling (hsa04979) pathways, with >= 65% of them being known regulators (Figure 3M, N). These two pathways play critical physiological roles by regulating blood cells regeneration and modulating lipid and glucose metabolism and overall energy homeostasis, respectively (32, 33).

Verification of predictor TFs by independent gene expression dataset

The above analysis identified a number of predictor TFs as novel potential regulators for different pathways whose functions remained to be investigated. However, a question remains that if finding these predictor TFs depended on a specific training dataset, such as ARCHS4. To address that, an independent human gene expression predictor was also built using the GTEx (v8) dataset. As GTEx had much less samples than ARCHS4, far less significant TF-target gene pairs were recovered. To make the two models more comparable, significant TF-target gene pairs were extracted from the coefficients with p-value <= 1E-09 (approximately equal to a Bonferroni- adjusted P of 0.05) from the GTEx-based predictor. In average, only 13.3 predictor-TFs were

6 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

identified for each target gene. Nevertheless, using the GTEx-based predictor, similar predictor TFs were still identified, although their numbers were much less (Table S8). For example, 7 predictor TFs were identified for Module #74 of DNA damage response, including 3 known regulators (TP53, CDIP1, and E2F7) as well as 4 uncharacterized TFs (ZNF79, ZNF561, RFX7, and ZNF337) (Figure 4A). Interestingly, these 4 uncharacterized TFs, including 3 zinc finger protein genes, were also identified via the ARCHS4-based predictor. Thus, these predictor TFs were identified twice by two independent datasets. Similar uncharacterized predictor TFs verified by two independent datasets were also recovered for other modules involved in lipid metabolism, erythrocyte differentiation, T cell activation, response to virus, and vasculature development (Figure 4B-E). These predictor TFs can be treated as prioritized candidate genes for future potential regulator research. Additionally, it should be noted that the GTEx-based predictor has much less training samples and statistics power than the ARCHS4-based predictor, which might limit its ability to verify more uncharacterized predictor TFs.

Thus, by using large-scaled transcriptome dataset, we were able to construct a universal human gene expression predictor. There were reports before on predicting transcriptomes using a selected number (100 – 300) of informative genes, which served as the basis to develop cheaper alternatives rather than RNA-Seq to measure gene expression profiles (34, 35). Our analysis had different goals. By using almost all TFs as regressors to construct the predictor model, our predictor allowed downstream inference of TF regulators in addition to predicting transcriptomes. The analysis followed the “understanding by modeling” principle, which is similar to the methodology of “understanding by building” in Synthetic Biology (36). By building a universal gene expression predictor model, it became possible to reconstitute highly accurate transcriptomes using only TF genes’ expression for samples derived from a wide range of tissues or cell lines. The broad applicability and high accuracy of our predictor model indicated there exist a set of ubiquitous and tissue-independent quantitative relationships between TFs and target genes, which had been recapitulated and abstracted by the predictor. These quantitative relationships can be translated into regulatory interactions in many cases and revealed transcriptional regulators for genes and pathways, as demonstrated by the aforementioned examples. Thanks to its tissue-independence, the predictor enabled the identification of TF regulators for completely different tissues or cell types, i.e. heart, pancreas, thyroid, adipose, and different hematopoietic cell lines. Besides gene co- expression modules, the predictor also inferred regulators for annotated KEGG pathways and possibly for any pre-defined gene sets, including those related to human diseases. The predictor is expected to facilitate mechanistic studies on human transcriptional regulation.

7 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

MATERIALS AND METHODS

Gene expression data and gene expression predictor

A large-scaled processed RNA-Seq-based human transcriptome dataset was downloaded from the ARCHS4 (v7) database (https://amp.pharm.mssm.edu/archs4/data.html) (12). The downloaded file (human_transcript_v7.h5) contained read counts for 178136 transcripts in 167726 RNA-Seq samples, which originated from 6000+ NCBI GSE series. The GSE series containing single-cell RNA-Seq samples were manually removed. The transcript counts were then consolidated according to gene identities to produce a gene counts table, from which cpm (counts per million) gene expression values were calculated. The samples with less than 1000000 mapped reads or with reads mapping rates < 60% or with less than 5000 genes having expression values cpm >= 1 were removed. Further filtered out were the GSE series with > 50% samples removed in previous step. There were a limited number of transcriptomes generated using tissue samples from the GTEx project, which were also removed from the ARCHS4 training dataset. Finally, 59097 samples from 4214 GSE series were remained. Among these remaining samples, the low-expressed genes that have cpm values >= 1 in less than 100 samples were removed. It was found out that 1998 genes have duplicated expression values across samples with other genes, and they were also removed. In order to use the GTEx (v8) dataset, which was processed via a different pipeline (14), to verify the predictor, only the genes common to both datasets were kept. As a result, 24959 genes were remained. The cpm gene expression matrix for these 24959 genes in 59097 selected samples

were then log-transformed via log2(cpm+1) and kept for further analysis.

Two gene expression matrices, X for 1613 TF-genes and Y for 23346 non-TF genes, were then extracted according to a human TF gene list obtained from AnimalTFDB 3.0 (37). They were used to build a human gene expression predictor via the EXPLICIT approach described before (Geng et al, submitted). Briefly, a linear regression model, Y = X B + , was trained via the least squares method. The significances of individual coefficients within the predictor model were determined via hypothesis testing method for multiple linear regression (38). Significant interacting TF-target gene pairs were extracted corresponding to the coefficients with p-value less or equal to selected cutoff and used to identify predictor TFs for non-TF genes as well as gene modules.

To test the model’s performance, two actual expression matrices, for TFs and for non-TFs genes, were extracted from the test samples’ actual transcriptomes, and a predicted

expression matrix was estimated as = . Note that for LOOCV test, the predictor model

was retained with all other samples without the test samples. and were then used to compute

8 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Pearson’s correlation coefficients between actual and predicted transcriptomes for the individual samples they contained. A MATLAB script developed in-house was used to conduct the analysis.

As another independent dataset, processed and publically available RNA-Seq-based human transcriptome data were also obtained from the GTEx (v8) project (https://gtexportal.org) (14). The downloaded file (GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.gz) contained read counts for 56200 genes in 17382 samples spanning 54 human tissue-sites. The gene counts expression table was used to calculate the cpm gene expression values, which were subjected

to log-transformation via log2(cpm+1). Two expression matrices, containing the same TF and non- TF genes in the same order as the ARCHS4 training dataset respectively, were then extracted and used for predictor performance testing. These two expression matrices were further used to train an independent GTEx-based human gene expression predictor, so as to compare with the ARCHS4- based predictor to investigate if they recover similar predictor TFs for gene co-expression modules.

Gene co-expression modules and predictor TFs identification

Using the ARCHS4 gene expression matrix containing 24959 genes in 59097 samples, a human GGM gene co-expression network was constructed via a previously published procedure with minor modification (15). Briefly, pcors between all gene pairs were calculated via a random sampling procedure that consisted of 20000 iterations. In each iteration, 2000 genes were randomly selected and used for partial correlation coefficient (pcor) calculation. Previously, a shrinkage approach implemented in the GeneNet package in R was employed to estimate the covariance matrix between the selected genes and use it for pcor calculation (39). The shrinkage approach was developed specifically to address the “small n, large p” problem common to biological transcriptome datasets, which usually had far less samples number (n) than genes number (p). In the current study, because the samples number (59097) is much larger than the selected genes number (2000), the sample covariance estimator were directly used to estimate the covariance matrix for the selected genes. The pcors between the selected genes were then calculated, which was related to inverse of the covariance matrix (39). After 20000 iterations, each gene pairs were randomly sampled together in 128 iterations with 128 pcors calculated, and the pcor with the lowest absolute value was then chosen as that gene pair’s final pcor. The gene pairs whose pcors were >= 0.045 were then chosen to construct a GGM gene co-expression network for human. A MATLAB script was developed in-house to conduct the analysis. The network was then clustered into 1416 gene modules that contain at least 6 genes via the MCL algorithm. The network and its modules were visualized in Cytoscape (V3.4.0) (40). Human GO annotations were obtained from Ensembl

9 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

BioMart (http://www.ensembl.org/biomart/martview) and used for GO enrichment analysis via the hypergeometric distribution.

Predictor TFs were then identified for the modules, following EXPLICIT approach used for the Arabidopsis study (Geng et al., submitted). Briefly, predictor TFs were first identified for individual genes from the significant interacting TF-target gene pairs out of the gene expression predictor model. For any gene module, if a TF was a common predictor TF for at least 6 as well as 20% of the non-TF genes within the module and its target genes were over-represented within the same module (P <= 1E-06) as determined via the hypergeometric distribution, the TF was considered as a predictor TF for that module. A perl script developed in-house was used to conduct the analysis. Chord diagrams were drawn using the ‘circlize’ package in R to illustrate the interactions between predictor TFs and target genes within a gene module (41). Besides gene co- expression modules, the gene sets for annotated pathways were also obtained from the KEGG database and used for predictor TFs inference (30).

ACKNOWLEDGEMENT

This work was supported by grants from the National Natural Science Foundation of China (31770268), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA24010302), the Fundamental Research Funds for the Central Universities (WK2070000091), and University of Science and Technology of China (Start-up fund to S.M.). The numerical calculations in this manuscript were conducted on the supercomputing systems in USTC Supercomputing Center and USTC School of Life Sciences Bioinformatics Center.

REFERENCES

1. D. Thompson, A. Regev, S. Roy, Comparative analysis of gene regulatory networks: from network reconstruction to evolution. Annu Rev Cell Dev Biol 31, 399-428 (2015). 2. S. A. Lambert et al., The Human Transcription Factors. Cell 172, 650-665 (2018). 3. P. J. Mitchell, R. Tjian, Transcriptional regulation in mammalian cells by sequence-specific DNA binding . Science 245, 371-378 (1989). 4. M. B. Gerstein et al., Architecture of the human regulatory network derived from ENCODE data. Nature 489, 91-100 (2012). 5. M. T. Weirauch et al., Determination and Inference of Eukaryotic Sequence Specificity. Cell 158, 1431-1443 (2014). 6. A. Dixit et al., Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens. Cell 167, 1853-1866 e1817 (2016).

10 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

7. N. Friedman, Inferring cellular networks using probabilistic graphical models. Science 303, 799-805 (2004). 8. J. VohradskÝ, Neural network model of gene expression. The FASEB Journal 15, 846-854 (2001). 9. E. Segal et al., Module networks: identifying regulatory modules and their condition- specific regulators from gene expression data. Nat Genet 34, 166-176 (2003). 10. N. Omranian, J. M. O. Eloundou-Mbebi, B. Mueller-Roeber, Z. Nikoloski, Gene regulatory network inference using fused LASSO on multiple data sets. Scientific reports 6, 14 (2016). 11. J. C. Liao et al., Network component analysis: Reconstruction of regulatory signals in biological systems. Proceedings of the National Academy of Sciences 100, 15522 (2003). 12. A. Lachmann et al., Massive mining of publicly available RNA-seq data from human and mouse. Nature communications 9, 1366 (2018). 13. J. Choi et al., Haemopedia RNA-seq: a database of gene expression during in mice and . Nucleic Acids Research 47, D780-D785 (2018). 14. G. T. Consortium et al., Genetic effects on gene expression across human tissues. Nature 550, 204-213 (2017). 15. S. Ma, Q. Gong, H. J. Bohnert, An Arabidopsis gene network based on the graphical Gaussian model. Genome Res 17, 1614-1625 (2007). 16. P. Carmeliet, R. K. Jain, Molecular mechanisms and clinical applications of angiogenesis. Nature 473, 298-307 (2011). 17. P. D. Calvert et al., Phototransduction in transgenic mice after targeted deletion of the rod transducin alpha -subunit. Proc Natl Acad Sci U S A 97, 13913-13918 (2000). 18. H. Cheng et al., Photoreceptor-specific nuclear NR2E3 functions as a transcriptional activator in rod photoreceptors. Hum Mol Genet 13, 1563-1575 (2004). 19. J. B. Miesfeld, N. L. Brown, Eye organogenesis: A hierarchical view of ocular development. Curr Top Dev Biol 132, 351-393 (2019). 20. T. Matsui et al., Redundant roles of Sox17 and Sox18 in postnatal angiogenesis in mice. J Cell Sci 119, 3513-3526 (2006). 21. R. E. Jennings, A. A. Berry, J. P. Strutt, D. T. Gerrard, N. A. Hanley, Human pancreas development. Development 142, 3126-3137 (2015). 22. S. M. Hattangadi, P. Wong, L. B. Zhang, J. Flygare, H. F. Lodish, From stem cell to red cell: regulation of erythropoiesis at multiple levels by multiple proteins, RNAs, and modifications. Blood 118, 6258-6268 (2011). 23. H. Singh, K. L. Medina, J. M. R. Pongubala, Contingent gene regulatory networks and B cell fate specification. Proc. Natl. Acad. Sci. U. S. A. 102, 4949-4953 (2005). 24. M. A. Yui, E. V. Rothenberg, Developmental gene networks: a triathlon on the course to T cell identity. Nat Rev Immunol 14, 529-545 (2014). 25. L. B. Ivashkiv, L. T. Donlin, Regulation of type I interferon responses. Nat Rev Immunol 14, 36-49 (2014). 26. E. D. Rosen, C. J. Walkey, P. Puigserver, B. M. Spiegelman, Transcriptional regulation of adipogenesis. Genes Dev. 14, 1293-1307 (2000). 27. M. Nilsson, H. Fagman, Development of the thyroid gland. Development 144, 2123-2140 (2017). 28. M. B. Kastan, O. Onyekwere, D. Sidransky, B. Vogelstein, R. W. Craig, PARTICIPATION OF P53 PROTEIN IN THE CELLULAR-RESPONSE TO DNA DAMAGE. Cancer Res. 51, 6304-6311 (1991).

11 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

29. J. Jiang et al., Zscan4 promotes genomic stability during reprogramming and dramatically improves the quality of iPS cells as demonstrated by tetraploid complementation. Cell Res 23, 92-106 (2013). 30. M. Kanehisa, M. Furumichi, M. Tanabe, Y. Sato, K. Morishima, KEGG: new perspectives on , pathways, diseases and drugs. Nucleic Acids Res 45, D353-d361 (2017). 31. B. G. Bruneau, Signaling and transcriptional networks in heart development and regeneration. Cold Spring Harbor perspectives in biology 5, a008292 (2013). 32. J. Berger, D. E. Moller, The mechanisms of action of PPARs. Annu. Rev. Med. 53, 409-435 (2002). 33. S. H. Orkin, L. I. Zon, Hematopoiesis: An evolving paradigm for stem cell biology. Cell 132, 631-644 (2008). 34. S. Biswas et al., Tradict enables accurate prediction of eukaryotic transcriptional states from 100 marker genes. Nature communications 8, 15309 (2017). 35. Y. Donner, T. Feng, C. Benoist, D. Koller, Imputing gene expression from selectively reduced probe sets. Nat Methods 9, 1120-1125 (2012). 36. C. J. Bashor, A. A. Horwitz, S. G. Peisajovich, W. A. Lim, Rewiring cells: synthetic biology as a tool to interrogate the organizational principles of living systems. Annu Rev Biophys 39, 515-537 (2010). 37. H. Hu et al., AnimalTFDB 3.0: a comprehensive resource for annotation and prediction of animal transcription factors. Nucleic Acids Res 47, D33-d38 (2019). 38. D. C. Montgomery, E. A. Peck, G. G. Vining, Introduction to linear regression analysis. Wiley series in probability and statistics (Wiley, Hoboken, NJ, ed. 5th, 2012), pp. xvi, 645 p. 39. J. Schafer, K. Strimmer, A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 4, Article32 (2005). 40. P. Shannon et al., Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13, (2003). 41. Z. Gu, L. Gu, R. Eils, M. Schlesner, B. Brors, circlize Implements and enhances circular visualization in R. Bioinformatics 30, 2811-2812 (2014).

FIGURES

Figure 1. A human gene expression predictor and its predicting performance. (A) The analysis workflow. (B) Increasing training sample numbers improved the predictor’s performance. Predictor models trained with different numbers of training samples were tested for their predicting performance. Shown are the average Pearson’s correlation coefficients (r) between the predicted and actual transcriptomes for the training (black) and independent test samples (red, n=3000). (C) Leave-one-out cross-validation (LOOCV) test result for GSE115736 (13). The scatter plots show the log-transformed gene expression values for the predicted and actual transcriptomes of two samples. (D) LOOVC test results for 1545 GSE series contain 10+ samples. LOOCV test was conducted for each series, and the average r between predicted and actual transcriptomes for all its samples were calculated. The average r of every series were then tallied together to generate the

12 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

histogram to show their distributions. (E) The predictor’s performance on predicting the GTEx transcriptomes. Boxplots and violin plots for the distribution of r between predicted and actual transcriptomes were shown for samples according to tissue types. 17382 samples in total, with between 9 to 2642 samples for each tissues.

Figure 2. Human gene co-expression modules identified through co-expression network analysis. (A) The 20 largest module within the human GGM gene co-expression network. Dots represent genes and are color-coded by their module identities. Edges represent co-expression between genes. Due to space limit only the 20 largest modules are shown. (B – K) Subnetworks for selected gene co-expression modules. The module id number, its represented enrich GO term (and the enrichment p-value) are labeled for each module. Highlighted in red text are the genes possessing the represented GO term.

Figure 3. Predictor TFs for genes and pathways identified from the predictor. (A) Interactions between GNAT1 and its predictor TFs are shown as a Chord diagram. Colors indicate the p-values for the coefficients corresponding to the TF-target gene pair, and the links’ widths are proportional to the coefficients’ magnitude. Only top 30 predictor TFs are shown due to space limit. (B-H) Chord diagrams connecting predictor TFs to their target genes within the selected gene co- expression modules. Links are colored by predictor TF genes, their left ends’ colors indicates p- values of the corresponding coefficients, while their widths are proportional to the coefficients’ magnitude. Enrichment of the predictor TFs target genes within the module are also indicated by colors. TFs highlighted in red text are known TF regulators. Only 15-20 target genes were shown due to space limit. (L-N) Chord diagrams connecting predictor TFs to their target genes within KEGG pathways.

Figure 4. Predictor TFs for selected modules verified by an independent predictor trained with the GTEx v8 dataset. Predictor TFs for modules were identified via the GTEx-based gene expression predictor, as shown in the Chord diagrams (A-E). Legends are the same as Figure 3, except that TFs in blue text are uncharacterized TFs identified by both predictors, TFs in red text are known TF regulators identified by both predictors, and TFs in dark-red text are known TFs identified by the GTEx-based predictor only.

13 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

SUPPLEMENTAL FIGURES AND TABLES

Figure S1. A histogram showing the distribution of all pcors between 24959 genes. Only those gene pairs with pcors >= 0.045 were used for GGM gene co-expression network construction.

Figure S2. A heatmap showing the tissue specific expression patterns form the genes in Module #113 and #116. Shown are the median gene expression values for each tissue were calculated from the GTEx (v8) transcriptomes.

Table S1. 217686 significant co-expressed gene pairs used to construct the GGM gene co- expression network.

Table S2. Genes and their module identities within the co-expression network.

Table S3. GO enrichment analysis results for the identified gene co-expression modules.

Table S4. Significant interacting TF-target gene pairs identified form the human gene expression predictor.

Table S5. Predictor TFs identified for the gene co-expression modules.

Table S6. Known regulators among the predictor TFs identified for the selected gene co-expression modules.

Table S7. Predictor TFs identified for KEGG pathways.

Table S8. Predictor TFs identified for gene co-expression modules based on the GTEx-based predictor.

14 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Figure 1

Neutrohpil A C (GSM3188532) E Model Adipose Tissue     

Y = X B + ε Adrenal Gland        r = 0.9701 X: 1613 TF genes Bladder Y: 23346 non-TF genes B: coefficient matrix Blood      

  ε: random errors d expression Blood Vessel

Brain       

Breast  Predicte Training Cervix Uteri 59097 RNA-seq transcriptomes 0 5 10 15  from ARCHS4 as training data Actual expression Colon Esophagus    (GSM3188533) Fallopian Tube

Prediction Heart         r = 0.9819

1613 23346   Kidney TF genes non-TF genes Liver 

Lung  d expression d expression Inference Muscle  

TF regulators   Coefficient Nerve for genes and    matrix B Predicte Ovary pathways 0 5 10 15 Pancreas   Actual expression Pituitary         

B D Prostate                                     Salivary Gland       Skin    Small Intestine        r r 

  Train Stomach  Test Testis    Thyroid    ber ber of GSE series Uterus   0.6 0.7 0.8 0.9 1.0 0 100 200 300  

Num Vagina 0 15000 30000 45000 60000 0.70 0.80 0.90 1.00 0.93 0.95 0.97 TrainingTraining sample number number r r bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Figure 2

ROBO3 B ESM1 A SCARF1 C D TOGARAM2 RD3 CCDC15 PECAM1 TEK GRM4 GABRR3 APLN MYCT1 PDX1 SST CLIP4 ADGRL4 GRM6 C7orf33 ANGPT2 ECSCR PAX4 CDH22 NPVF C2orf71 RP1 CD34 UCN3 THBD ROBO4 CABP2 RCVRN CD93 MAFA CDH5 TIE1 LDB2 GCK GJA5 IMPG2 CPLX4 TMEM244 PPY GJA4 PLVAP MYZAP INS LRFN2 AIPL1 PDC EXOC3L1 KDR CRX GUCA1C CLEC14A GCG CNN2P8 CGNL1 RS1 BCL6B EMCN GPR142 KCNK16 RAX2 PDE6H DLL4 CXorf36 LRIT1 GPR4 SPARCL1 ADCYAP1 SEMA3G LACTBL1 SAMD7 PCDH12 RHOJ C1orf127 IMPG1 FAM124B ARHGEF15 ESAM SYNGR4 G6PC2 IAPP TULP1 RBP3 KCNV2 GUCA1B EYS EXOC3L2 VWF SLC30A8 ARR3 KCNK17 NRL GNAT1 USHBP1 CLDN5 RAMP2 FLT4 FFAR1 ANKRD34C ZNF488 CACNA1F SOX18 GUCA1A HSPA12B ANO2 PDE6C CCM2L KCNA5 NXNL1 ANKRD33 SLC38A3 TCF15 KANK3 #18 vasculature development #116 insulin secretion #37 visual perception (2.78E-16) (3.64E-13) (1.27E-37)

GPR174 HERC5 ZC3HAV1 F IFIH1 E G CD27 H TFR2 TRPV5 FCRL4 TRAC IFI35 PMAIP1 FCRL3 THEMIS CD3G LAT PVRIG OASL ANK1 LLCFC1 FCRL2 RIPOR3 KEL DDX58 PPM1K OAS1 IFIT2 YPEL4 RUBCNL FCRL5 SH2D1A TOR3A CD46P1 PKLR SPTB SIT1 GZMM USP41 USP18 IFI6 SPIB CLEC17A TRAT1 CR1L BANK1 CD96 AC133555.3 RHAG IFIT1 GATA1 GPR171 CD3D EML2 HERC6 IFIT3 PNOC TNFRSF13C DDX60 SPTA1 RHCE FCER2 CXCR5 CD247 BCL11B STAT2 IFI44 KLF1 GYPA CD6 OAS3 CD79B ITK EFCAB14P1 GFI1B AC104561.2 PAX5 CD28 CD2 ISG15 AHSP POU2AF1 DACT3 DHX58 LIPA EPB42 TSPO2 FAM129C FCRL1 LCK MAPK9 UBASH3A CD5 ADCY9 ALAS2 ICOS MRPL30P1 SAMD9 SMIM1 HMBS BLK MS4A1 IL2RB MX2 EIF2AK2 HEMGN ACSL6 FAM172BP RPL21P129 OAS2 SLC4A1 CD79A CAMK4 CD19 CD3E ZAP70 SEPT1 HELZ2 HBQ1 HBM CA1 SLFN14 TCL1A COL19A1 Z86062.1 RPL37AP1 IFI44L IFIT5 IFI27 TMPRSS9 RUNDC3A CTLA4 CD40LG BTNL10 VPREB3 STAP1 TIGIT SPOCK2 DDX60L CDK6 SELENBP1 IFIT1B TCL1B HMGB1P40 ZC3H12D NLRC3 KLC3 BPGM TESPA1 PALLD SAMD9L CD72 CD22 EBF1 TCF7 LEF1 CARMIL2 #38 erythrocyte differentiation #81 B cell activation #13 T cell activation #57 response to virus (1.80E-07) (2.82E-06) (1.17E-27) (2.75E-30)

CPM BLOC1S2 BTG2 PPM1D I LPL J H POLH KLB XPC SLC19A3 PFKFB1 SESN1 IYD TP53INP1 ACACB RRM2B UCP1 MDM2 ADIPOQ FAS LEP SNTG2 CEP128 TRIAP1 TUSC5 TIGAR CYFIP2 PHLDA3 THRSP FDXR TPO TSHR PTCHD4 TG CDKN1A C14orf180 MRAP GPD1 BBC3 EDA2R OTOS DDB2 GDF15 SLC7A10 CIDEC SLC26A7 TNFRSF10B BAX PCK1 GPAM RPS27L CIDEA ZMAT3 PIDD1 DGAT2 ZNF804B LIPE RSPH6A AEN PLIN1 KIF7 ADIG CCNG1 KCNIP2 SPATA18 PEX11A ASCC3 RPL36AP43

#113 lipid metabolic process #487 thyroid hormone metabolic #74 cellular response to DNA (1.21E-08) process (2.84E-05) damage stimulus (6.16E-16) bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Figure 3

F 1 B F B KLF16 I I 6 E1 G A NS G 2 N O 1 2 2 V N R O L F P 2 R X 2 X A 2 1 F H R R H M A L X T T A X 2 A 1  O X - log (p-value) N U L FA P L K B L 1 - log (p-value) 6 10 E 3 T X 1 X O 2 A H 10 1 T E R D 1 T S X H T N X A M B 1 I N N X K L 1 O S ZN V 190 4 1 O S N H A H B 190 M T T N F N A A L 7 E 4 1 1 1 X C 6 G F G 53 2 1 U 7 T O V S 1 53 T P R S R IT 1 U X R 3 Z E 2 2 L P N N H 27 RB 1 F7 AC 2 A V T 27 0 B TX 85 SX NA 5G PI F3 1 G H N 1 16 E6 Z HL E PD 16 AR SRR F B B NPV N 10 D7 EUR SAM 10 OD4 1C NEURO GUCA 6 D4 RCVRN 6 VSX2 IMPG1 Target GNAT1 TFs NRL GUCA1A genes CPLX4 P TOH7 DC A K CNV A 2 E3 R R2 A R3 B N IP RR L S RD 1 E C 3 A G C X C A N R 1 A B A C R P B R 1 L P R F A U 3 X T 2 1 2 C P R L X U 2 L N Color corresponds to p-value for TF-target gene pair T X

R R A

2 N R

E

3 Color corresponds to p-value for target genes enrichment #37 visual perception

2

3 F

D 1 D 1 3 3 N M  O E N 1 2 2 A 5 A O B A 2 R 2 K X 3 S A L 6 D F G R C B R R A X X S F FX X T O 3 I F C 3 F O X 1 A 5 N Y H R M M U P 4 I 2 X O K F K 1 N B A X T A S X G 4 N Z O H IC 1 F E 1 E D M W  1 N A 8 L F 1 S 1 O 2 A B O Z L C H D L E 2 C L B N F M P T T 1 E P M R F Y IC I O N 1 D T C O K YT 1 L 1 F N R G 1 U B F M H 6 E 0 R P E 1 1 N 3 N 1 L rf C B N Z IT H 5 o U 1 6 F A M X 5 1 F IX IF P E X 3 1 IN T K P O 6 C 9 F S 3 N T M Y X E M C C 1 G P D G K F S E 2 C H L 2 1 H T R M 5 A G S1 A 4 X1 S JA A JA A IN Z H 2 G N M F1 R O C YT SP 14 FL EM 1L T FN I1 1 G M SL TIE IS GC XI1 CA1 AP HE PLV 2 S YL G6PC HMB T1 YBX3 4A1 RFX6 SLC30A8 SLC CLEC14A ZNF366 EPB ECSCR IAPP LTF 42 HB PCDH12 NKX63 PP M 7 Y TAL1 SOX CD RUN H5 S DC3A ST TA2 HE KDR L GA MG L1 PJ FF B N TA R RB AR PG AM U 1 X6 R M A P2 CN O HC DG CD 3 S S E F R 4 G H B PT U AM L4 S G P 22 T A1 S 1 PA C R A N 7 H 24 N C K 14 L L X1 V B B 6 A 1 2 A 10 O W P X N o 1 S S S E F 1 PA rf A E 2 S K 1 T A L G R 2 A C C M D 7 G E D 1 R 3 S N G 3 X M 4 L B R 4 R 1 C 6 P E  A 1 6 1 P F 8 X D A 1 B 1 D 4

K F C X X O X

N L 1 L O R A

K

6 S U P

B E #18 vasculature development #116N insulin secretion #38 erythrocyte differentiation

B D

H

M S S

L 1 T 8 B A A T RT A H Z 3 3 H C L T A I 3 B K 5 2 8 R

E 1 M R 4 8 1 B Z D 3 T B F 8  N 6 C1 L T E D F

2 F F X F O T I F F 3 Y 4 X X 9 6 B 1 R Z X 8 D A A 2 2 NF R 4 R B 1 R T R 6 I R 5 H BT B X X F I F B 1 2 I 6 I Z M 0 A A F L K 3 S Z E 1 2L 3 F M G 2 H R X 2 2 T S I S B N E L B N A 1 F 2 3 G S 1 4 Z F NF T N L S E IK 2 Z A T N 7 T 4 Z Y A F V Z A A L T N T B F 9 E 3 A 3 7 T 1 E C D K S H 9 L 1 K S 2 C 1 C F IT E P 1 D 9 1 3 1 IT C 12 2 D 0 F E M TO C 0 I 2 A 1 X D2 3 F5 F L C FIT CR ZN S BA I F C F1 ICO TF S1 NO 01 D 2 OA E P D3 BF 4A1 C 1 ASL 1 MS RAT O PBX T X58 EB3 4 D247 DD VPR C PLS LYL1 B CR1 IFIH1 FCRL5 SP1 IL2R 40 SG15 79B LCK I BCL11A CD T CD3G IFI44 NFRSF13C F1 IKZ CD96 MX POU2AF 2 SP140 1 SH2D CD7 1A 2 IFI6 2 ZAP STAT TC MES 70 O B L1A EO CD5 AS2 EF2 F G M CE PR OA B R2 3 SE 17 D S3 LK XP C PT 1 D G1 C FO D 1 H X6 A LE TH 6 E 0 R T C C E 2 S R F CL 1 D M 1 A C C 1 7 3 U 2 I P M 5 F R B A 68 B 8 S R D 0 S C L F A A 9 6 R 2 N S P I L 8 TA L Z I1 H R F 3 F 3 F N P A 7 Z 1 G 7 Z S F N 9 P C S I F F 1 5 T B K 8 Z 1

P 1 R X 1 Z 3 I N 0

I F 1 F 1 A B F

L E P 3 X

C L 1

B #81 B cell activation #13 T cell activation #57 response to virus

Z

N

Z 1 Z F F 1 1 N B 6 Z N N O  3 3 Z F 3 2 N M 1 1 R 2 Z F 8 N S X 5 6 D D F D N 5 5 I P Z 1 X O F C N F 3 5 I 1 5 A S S 5 Z 7 R 7 2 H K F R X 5 X 1 4 2 N 4 D C O X R N 9 S 3 9 2 P P 4 4 L 4 N Z 0 A O I D Z F1 C 6 N J F 7 E A X 3 B 1 N F D D 5 M S N C F 2 7 H E 5 I 1 K Z F 6 F I T R X 2 N M S 6 6 A N T Z I C 6 K A 1 Z L S A 2 F 1 5 Z N R 1 C F 4 2 N 4 S P A R U Q M D 1 T O A E R K H IP R 1 X L 3 D G ID D 3 S F1 A C T 3A F AT TA 5 DE M 7L T5 CI YB Z S2 TW A 1 X3 P IS IN G R 2B T2 PL RH RM L3 R 2 LHX EP IYD MDM 6 L Z IP2 NF19 DB2 KCN TB 5 D TFE3 X22 CIDEA CDKN1A MRAP SLC26A7 HES2 EBF1 C14orf180 PTCHD4 BBC3 EBF2 THRSP RFX7 KLB SESN1 TPO T X2 LIP RIAP YB E F3 TI 1 G AT GAR PD O BA S 1 TO P X 1 LC S OL XF G 7A R 19 P H O PA 10 S F2 H RH PF M PH ZN S LD D K TS 6A P A3 G FB H T AT S A 1 Z R 3 N A 2 L T2 N 5 A F 18 X P C A F P C E R B A C 1 D 8 T N S D C K 9A IG 04 C F1 A 1 3 B 1 N 0 7 C 6 G B 1 B 5 1 1 M P 1 F AX8 NF L G E Z Z N S N R I S 7 Z R A 1 O F F A P 7 P L X P E F 1 3 7 B P 6 P X E2 3 9 E I

F O

X C F

ZN

ML #113 lipid metabolism #487 thyroid hormone metabolism #74 response to DNA damage stimulus

H Y 1 E A A B X T E 3 1 P T I 3 S T N X V 3 D C S A 3 B

B A P O 3 2 1 D 2 3 X 5 C AS R 2 A E 2 X R H H 4 N L I X L L A 2 5 B R R C P C 1 F 2 N F D T E I 5 S 8 0 L Y 1 M H N R4 R A 1 N Z G X M F R P B 1 V F 4 N E F X X 1 K N 1 1 4 N T A ZNF85 O O F E X 6 7 GA R Z C A BX H L 5 I P P N O S 4 F B F C 1 3 N T 6 F S X F X 1 T R E X D 3 C FE 3 1 E 2 I 2 Z 1 O F M O A 5 N F A 2 1 E F 2 P F 2 F L O H L S S A 5 3 K F P I 1 T N I F V 1 T 1 A G F Z E T T L 66 L I 3 A 3 M E L 3 P N C T A A F S R O A S F G 0B P 4 A R 2 A T E BH BX LH B K1 5 E2 D8 PC H7 3 C Y 1E 2 HA M 2 CD C OA ND NT UX2 AP 1 TN CD2 PC3 YB CD5 1 E M IK FABP 2F8 YH6 ZF1 CD3D M D1A R YL3 C HMGCS2 XRG M ZBTB16 MS4A1 ACTC1 P TSHZ2 CD3E LIN1 CSDC2 IL 1I3 PLN 2RA NR CYP8B S ITG 1 MYL2 EOME AM AQP AD CD1B 7 CY5 C PL BX20 SGCA D8 IN4 T T I A C NNC B L7R YP 2C RY 1 11 C A 4A EF T R2 CL D7 PO 11 M NN B CR C A TT I3 2 YP 5 N 1 FA 4 SLC G B A A 4 A P4 22 G 8A R H D O 1 1 I Y 1 TC R P M 3 N O 8 F7 Q F N Z 3 Z M 5 X N 3 L M B C L P  2 G F X 2 Y E 3 P Y F F 6 1 K F I F A X F 4 B P 2 O 8 K S T E X I 6 F 1 3 E R L M L U T N X 6 X

R G  P R O N P O Y I F 2 C P 3 F E X

5 V K Dilated cardioN myopathy (KEGG) Hematopoietic cell lineage (KEGG) PPAR signaling pathway (KEGG) bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029470; this version posted April 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Figure 4

- log10 (p-value) T

F 190 3 F Z D 2 E A 7 B T B O A A F P P 6 C 5 Y 33 E X L B C 1 N C E X T B F 1 D I B 53 Z E 1 A 1 D B H F X 1 N 3 T 6 F C I L 3 I 7 P O Z Z F S X B P F G X N E 1 2 R O G 27 F I R E 1 4 1 S B 7 1 IT 16 S 1 M F R 2R IX D X I M F A 1 P I1 G X D G Q P 7 E O B 10 DIP SP R A AH DX E A F 6 M LIP B NF YP LX AC IX G IPL AC N BAX EMG IN1 H IAP1 PL Z TR PTB NF561 S DDB2 CIDEC GATA1 TSPO2 ZMAT3 LEP EPB42 C MDM2 IDEA SELE 6 NBP1 PO GP SOX R LH AM HCE TN SL BTN FRS ARG C19 L10 F10 PP P A3 A RPS B CK LA 27 1 S2 S L T PA HR H TA S BQ 53 B 1 T P R 1 P B 8 US H T R C D C Y A R 3 G 5 P G C M K A E A C 2 L T2 L S N B B 4 C G C 1 3 1 M A F P E L B S K

9 E P 7 1 F C

N

Z #74 response to DNA damage stimulus #113 lipid metabolism #38 erythrocyte differentiation

Z

T

N

T E TC F

C A 6 6 S 5 3 L H D 4 F X P 8 F4 1 X TA T 4 0 Z E T 1 B 3 4 H L A N A 0 S 4 A A Y 5 7

X I T T 1 L A 3 P T 8 R 2 F 6 C D E OX2 E 1 4 1 T F F S S 5 4 S F P F 3 I N P H E 2 F 1 S D 1 H I M AT N K Z 1 A Z 0 B E Z F 0 E T 4 1 D T S L 3 Z 1 R D N 5 N G C E F 1 R D 1 3 X G 5A A M S D 1 IS 2 CA P1 C 1 E 4 IT P 4 0 D2 IF EL O C K3 OB IT3 R K IF L CT1 LC S D9 MY TAT AM SO G D96 2 S 58 X7 SCR FI1 C DDX EC A 2AK2 BASH3 EIF CCM2L U EPAS1 I44L SHBP1 ITK IF U PLSC 7 R1 CDH5 TCF C OAS3 D6 OM MEC CLEC Z IFI4 14A AP70 SP110 4 TIE1 C OA MES ARM S1 KA EO IL2 NK3 CD HER R 5 C6 RG AM T IFI E P2 RA H1 VW T T1 IF F H I6 P 1B G EM 9 L 1 P I F O VA L C R S IR A D P C D 1 S B L B 3 7 O 2 6 L4 G 1 A L S C 3 L B P S X O 1 S X O I I 1 K F O 1 PA L 3 Z 7 6 7 F X

8 F F 6 R 1 F 3 3 R I 8 N P F

Z 12 N

Z #13 T cell activation #57 response to virus #18 vasculature development