
Deep learning for inferring gene relationships from single-cell expression data

Ye Yuan (a) and Ziv Bar-Joseph (a,b,1)

(a) Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213; and (b) Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

Edited by Nancy R. Zhang, University of Pennsylvania, Philadelphia, PA, and accepted by Editorial Board Member Peter J. Bickel November 12, 2019 (received for review July 7, 2019)

Several methods were developed to mine gene–gene relationships from expression data. Examples include correlation and mutual information methods for coexpression analysis, clustering and undirected graphical models for functional assignments, and directed graphical models for pathway reconstruction. Using an image-like encoding for expression data, followed by deep neural network analysis, we present a framework that can successfully address all of these diverse tasks. We show that our method, convolutional neural network for coexpression (CNNC), improves upon prior methods in tasks ranging from predicting transcription factor targets to identifying disease-related genes to causality inference. CNNC's encoding provides insights about some of the decisions it makes and their biological basis. CNNC is flexible and can easily be extended to integrate additional types of data, leading to further improvements in its performance.

gene interactions | deep learning | causality inference

Significance

Accurate inference of gene interactions and causality is required for pathway reconstruction, which remains a major goal for many studies. Here, we take advantage of 2 recent technological developments, single-cell RNA sequencing and deep learning, to propose an encoding scheme for gene expression data. We use this encoding in a supervised framework to perform several different types of analysis using minimal assumptions. Our method, convolutional neural network for coexpression (CNNC), first transforms expression data lacking locality to an image-like object on which convolutional neural networks (CNNs) work very well. We then utilize CNNs for learning relationships between genes, causality inferences, functional assignments, and disease gene predictions. For all of these tasks, CNNC significantly outperforms all prior task-specific methods.

Author contributions: Y.Y. and Z.B.-J. designed research; Y.Y. and Z.B.-J. performed research; Y.Y. analyzed data; and Y.Y. and Z.B.-J. wrote the paper.
The authors declare no competing interest.
This article is a PNAS Direct Submission. N.R.Z. is a guest editor invited by the Editorial Board.
Published under the PNAS license.
Data deposition: The software in this paper has been deposited in GitHub, https://github.com/xiaoyeye/CNNC.
1 To whom correspondence may be addressed. Email: [email protected].
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1911536116/-/DCSupplemental.

Several computational methods have been developed to infer relationships between genes based on gene expression data. These range from methods for inferring coexpression relationships between pairs of genes (1) to methods for inferring a biological or disease process for a gene based on other genes [either using clustering or guilt by association (2)] to causality inferences (3, 4) and pathway reconstruction methods (5). To date, each of these tasks was handled by a different computational framework. For example, gene coexpression analysis is usually performed using Pearson correlation (PC) or mutual information (MI) (6). Functional assignment of genes is often performed using clustering (7) or undirected graphical models including Markov random fields (8), while pathway reconstruction is often based on directed probabilistic graphical models (4). These methods also serve as an initial step in some of the most widely used tools for the analysis of genomics data, including network inference and reconstruction approaches (3, 9, 10), methods for classification based on gene expression (11), and many more.

While successful and widely used, these methods also suffer from serious drawbacks. First, most of these methods are unsupervised. Given the large number of genes that are profiled, and the often relatively small (at least in comparison) number of samples, several genes that are determined to be coexpressed or cofunctional may only reflect chance or noise in the data (12). In addition, most of the widely used methods are symmetric, which means that each pair has only one relationship value. While this is advantageous for some applications (e.g., clustering), it may be problematic for methods that aim at inferring causality (e.g., network reconstruction tasks).

To address these issues, we developed a method, convolutional neural network for coexpression (CNNC), which provides a supervised way (that can be tailored to the condition/question of interest) to perform gene relationship inference. CNNC utilizes a representation of the input data specifically suitable for deep learning. It represents each pair of genes as an image (histogram) and uses convolutional neural networks (CNNs) to infer relationships between the different expression levels encoded in the image. The network is trained with positive and negative examples for the specific domain of interest (e.g., known targets of a transcription factor [TF], known pathways for a specific biological process, known disease genes, etc.), and the output can be either binary or multinomial.

We applied CNNC using a large cohort of single-cell (SC) expression data and tested it on several inference tasks. We show that CNNC outperforms prior methods for inferring interactions (including TF–gene and protein–protein interactions), causality inference, and functional assignments (including biological processes and diseases).

Results

We developed CNNC, a general computational framework for supervised gene relationship inference (Fig. 1). CNNC is based on a CNN, which is used to analyze summarized co-occurrence histograms from pairs of genes in single-cell RNA-seq (scRNA-seq) data. Given a relatively small labeled set of positive pairs (with either negative or random pairs serving as negatives), CNNC learns to discriminate between interacting pairs, causal pairs, negative pairs, or any other gene relationship types that can be defined.

Learning a CNNC Model. CNNC can be trained with any expression dataset, although as with other neural network applications, the more data, the better its performance. Given expression data, we

www.pnas.org/cgi/doi/10.1073/pnas.1911536116 | PNAS Latest Articles | 1 of 8

first generate a normalized empirical probability distribution function (NEPDF) for each gene pair (genes a and b) (Fig. 1). For this, we calculate a normalized 2-dimensional (2D) histogram of fixed size (32 × 32). The specific dimension for the input is a hyperparameter that can be learned for each dataset (SI Appendix, Fig. S3A and Supplementary Notes). In the histogram, columns represent gene a expression levels and rows represent gene b, such that entries in the matrix represent the (normalized) co-occurrences of these values. If different data types are combined (e.g., bulk and SC), they can be either used separately or concatenated to form a combined NEPDF with dimension 32 × 64 (SI Appendix, Fig. S3B). Next, the distribution matrix is used as input to a CNN, which is trained using an N-dimension (ND) output label vector, where N depends on the specific task. For example, for coexpression or interaction prediction N is set to 1 (interacting or not), while for causality inference it is set to 3, where label 0 indicates that genes a and b are not interacting and label 1 (2) indicates that gene a (b) regulates gene b (a). In general, our CNN model consists of one 32 × 32 input layer, 10 intermediate layers including 6 convolutional layers, 3 maxpooling layers, 1 flatten layer, and a final ND "softmax" layer or 1 scalar "sigmoid" layer (Methods and SI Appendix, Fig. S1).

For the analysis presented in this paper, we used processed scRNA-seq and bulk RNA-seq data from different studies (13). All raw data were uniformly processed and assigned to a predetermined set of more than 20,000 mouse genes for each task (Methods). In addition to gene expression data, CNNC can integrate other data types including DNase-seq (14), position weight matrices (PWMs) (15), etc. For this, we concatenated the additional information as a vector to the intermediate output of the gene expression data and continued with the standard CNN architecture. See Methods and SI Appendix, Fig. S1, for the different architectures and details of CNNC, and SI Appendix, Table S1, for information on training and run time.

Fig. 1. CNNC input, output, and architecture. CNNC aims to infer gene–gene relationships using single-cell expression data. For each gene pair, scRNA-seq expression levels are transformed into 32 × 32 normalized empirical probability function (NEPDF) matrices. The NEPDF serves as an input to a convolutional neural network (CNN). The intermediate layer of the CNN can be further concatenated with input vectors representing DNase-seq and PWM data. The output layer can either have a single, 3, or more values, depending on the application. For example, for causality inference the output layer contains 3 probability nodes, where p0 represents the probability that genes a and b are not interacting, p1 encodes the case that gene a regulates gene b, and p2 is the probability that gene b regulates gene a.

Using CNNC to Predict TF–Gene Interactions. We first tested the CNNC framework on the task of predicting pairwise interactions from gene expression data (16). Chromatin immunoprecipitation (ChIP)-seq has been widely used as a gold standard for studying cell-type–specific protein–DNA interactions (17). We thus evaluated CNNC's performance using cell-type–specific scRNA-seq datasets (for mouse embryonic stem cells [mESCs], bone marrow-derived macrophages, and dendritic cells; Methods) and ChIP-seq data from the Gene Transcription Regulation Database (GTRD) (18). We extracted data from GTRD for 38 TFs for which ChIP-seq experiments were performed in mESCs, 13 TFs studied in macrophages, and 16 TFs for dendritic cells. To determine targets for each TF using the ChIP-seq data, we followed prior work (19, 20) and defined a promotor region as 10 kb upstream to 1 kb downstream from the transcription start site (TSS) for each gene. If a TF a has at least one detected peak signal in or overlapping the promotor region of gene b, we say that TF a regulates gene b.

For this prediction task, we compared CNNC with several popular methods for gene–gene coexpression analysis: PC and MI, which are the 2 most popular coexpression analysis methods; Genie3 (9), which was the best performer in the Dialogue for Reverse Engineering Assessments and Methods (DREAM4) in silico network construction challenge (21); count statistics (CS) (22), which relies on local information based on gene expression ranks in large heterogeneous samples; conditional-density resampled estimate of mutual information (DREMI) (23); and a fully connected deep neural network (DNN), which also uses our NEPDF as input. Since most of the prior methods used for comparison are symmetric, we focused here on the 2-label setting (interacting or not). We applied 3-fold cross-validation to all datasets. Each fold contains several TFs, although the test is performed separately for each TF (Methods).

Fig. 2 presents the results of these comparisons. As can be seen, CNNC and DNN outperform all prior methods for all cell types. We observe significant improvement over all prior methods (Fig. 2 A–G). To evaluate the performance of CNNC, we chose both the area under the receiver operating characteristic curve and the area under the precision recall curve (AUROC/AUPRC) as our evaluation scores. The AUPRC achieved by CNNC is around 20% higher than PC and MI on some datasets (see SI Appendix, Fig. S2, for details). Importantly, as can be seen in Fig. 2 A–D, the difference is even more pronounced for the top-ranked predictions. For CNNC, we see almost no false negatives (less than 20%) for the top 5% ranked pairs. Such top predictions are often the most important, since the ability to validate predicted interactions is usually limited to the top few predictions.

Data Integration Further Improves TF Target Gene Prediction. The above analysis was only based on using expression values. However, as noted above, gene relationship inference is often used as a component in more extensive procedures that often integrate different types of genomics data. To test how the use of the NN-based method can aid such procedures, we extended CNNC so that it can utilize sequence and DNase hypersensitivity information. For sequence, we used PWMs from Jaspar (24). DNase-seq data for mESCs were obtained from the mouse ENCODE project (25). We used a simple strategy for processing the PWM and DNase data, which resulted in an additional 2D vector as input for each pair, which we embedded to create a 128D vector (Methods). We next concatenated this vector with the NEPDF's 128D vector in the flatten layer to form a 256D vector, as shown in Fig. 1 and SI Appendix, Fig. S1A.

Results, presented in Fig. 2H, show that these additional data sources indeed improve the ability to predict TF–gene interactions. As before, a combined framework utilizing CNNC outperforms methods that used CS.

CNNC Can Predict Pathway Regulator–Target Gene Pairs. While TFs usually directly impact the expression of their targets, several methods have also utilized RNA-seq data to infer pathways that combine protein–protein and protein–DNA interactions (26). To test whether CNNC can serve as a component in pathway inference

[Figure 2 appears here; panel titles in A–D give median (mean) AUPRC: CNNC 0.6694 (0.7025), DNN 0.6826 (0.6873), count statistics 0.6615 (0.6752), mutual information 0.6640 (0.6726).]
Fig. 2. GTRD TF-target prediction. (A–D) Precision recall curve (PRC) of CNNC, deep neural network (DNN), count statistics (CS), and mutual information (MI), which are the top 4 performing methods for TF-target prediction. Training and testing for these plots were done using bone marrow-derived macrophage scRNA-seq and ChIP-seq data. Median (mean) AUPRC is shown on the top of each panel. Here, each gray line represents one TF, the red line represents the median curve, and the light green part represents the region between the 25th and 75th quantiles. (E and F) AUPRC and AUROC for all 7 methods we compared. We tested all methods on the macrophage and mESC tissue-specific datasets. For mESC, the comparison P values with the best 2 other methods in terms of AUPRC (AUROC) were based on the Wilcoxon signed-rank test. For macrophage, the comparison P values are based on AUPRC and AUROC because the number of cells is not enough for the Wilcoxon test: CS, 2.85 × 10^−2; DNN, 3.55 × 10^−2. See SI Appendix, Fig. S2, for similar analysis using 3 additional datasets. (G) Overall AUPRC and AUROC of the methods in all 5 train and test experiments (3 tissue-specific datasets and 2 with the larger 1.3M mouse brain scRNA-seq from 10×), with comparison P values using the Wilcoxon signed-rank test between CNNC and the best 2 other methods in terms of AUPRC (AUROC): CS, 4.90 × 10^−8 (2.55 × 10^−12); DNN, 9.42 × 10^−3 (3.15 × 10^−3). (H) Comparison of TF-target predictions with additional data using mESC expression and TFs. Columns 1 to 4 show the AUPRCs of PC, MI, CS, and CNNC using scRNA-seq data, respectively. The fifth and sixth columns show performance when only using PWM or DNase. The last 3 columns show performance of the integration of expression, sequence (PWM), and DNase data. The comparison P value between the CS and CNNC integration methods is (AUPRC): 8.37 × 10^−4.
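The NEPDF encoding used throughout these comparisons (log-transform, uniform 32 × 32 binning, then pseudocounts and a second log-transform to tame the dropout-dominated zero–zero cell) can be sketched in a few lines of NumPy. This is a minimal reading of the recipe, not the released implementation; the pseudocount of 1 and the max-normalization are our assumptions:

```python
import numpy as np

def nepdf(expr_a, expr_b, bins=32):
    """Build a normalized empirical probability distribution function
    (NEPDF) histogram for one gene pair: log-transform, uniform binning
    into a bins x bins grid, then pseudocount + log to keep the
    dropout-heavy (0, 0) cell from dominating the matrix."""
    # log-transform raw expression (pseudocount of 1 to handle zeros)
    a = np.log10(np.asarray(expr_a, dtype=float) + 1.0)
    b = np.log10(np.asarray(expr_b, dtype=float) + 1.0)
    # 2D co-occurrence histogram over each gene's expression range
    h, _, _ = np.histogram2d(a, b, bins=bins)
    # second log-transform after a pseudocount, as described in Methods
    h = np.log10(h + 1.0)
    # scale to [0, 1] so every pair is on a comparable scale (our choice)
    return h / h.max() if h.max() > 0 else h

rng = np.random.default_rng(0)
x = rng.poisson(5, size=5000)   # toy scRNA-seq counts for gene a
y = rng.poisson(x)              # gene b loosely driven by gene a
m = nepdf(x, y)
print(m.shape)                  # (32, 32)
```

For real data, `expr_a` and `expr_b` would each hold one gene's expression across all cells of the training compendium.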

methods, we selected 2 representative pathway databases, Kyoto Encyclopedia of Genes and Genomes (KEGG) (27) and Reactome (28), as gold standard and used these, together with a large scRNA-seq dataset of 43,261 cells that was collected from over 500 different studies representing a wide range of cell types, conditions, etc. (13), and bulk RNA-seq data from ENCODE (25), to train and test our framework. Since we are interested in causal relationships, we only used directed edges (from regulator to target gene) with activation or inhibition edge types and filtered out cyclic gene pairs where genes regulate each other mutually (to allow for a unique label for each pair). As for the negative data, here we limited the negative set to a random set of pairs where both genes appear in pathways in the database but do not interact. Given the large number of genes, we performed a 3-fold cross-validation, keeping the sets of regulator genes in the training and test sets separated (Methods and SI Appendix, Supplementary Notes).

Results are presented in Fig. 3. As can be seen, CNNC performs very well on the KEGG pathways, reaching a median (mean) AUROC of 0.9949 (0.8822) compared to less than 0.9309 (0.7324) for the methods we compared, which here also included Bayesian directed networks (BDNs) (4) that learn a global directed interaction graph (Fig. 3F). CNNC also performs well on Reactome pathways (SI Appendix, Fig. S4). We also used the KEGG data to test the specific architecture CNNC utilizes and observed that the architecture used improves upon 2 alternative deep NN architectures, a deep fully connected NN (DNN) and a CNN without pooling layers (Fig. 3H and SI Appendix, Fig. S5).

Using CNNC for Causality Prediction. So far, we focused on general interaction predictions. However, as discussed above, CNNC can also be used to infer directionality by changing the output of the NN. We next used CNNC to infer causal edges for the KEGG and Reactome datasets. Specifically, we used CNNC to predict whether for 2 genes, a and b, the interaction is from a to b or vice versa. For the pathway databases, we only analyzed directed edges and so had the ground truth for that data as well. As can be seen in Fig. 4, for the KEGG dataset, CNNC is very successful, achieving a median AUROC of 0.9949 (Fig. 4 A and B). For Reactome (SI Appendix, Fig. S4), we see that the most confident predictions are correct, but beyond the top predictions performance levels off. We compared the performance of CNNC to another method developed for learning causal relationships from gene expression data, BDN (4), which learns a global directed interaction graph. Results presented in SI Appendix, Fig. S6, show that CNNC greatly outperforms BDN on this causality prediction task. We have also tested several other applications of CNNC, including its use for determining the impact of the interaction (activation or repression) and for determining its ability to identify directed vs. undirected (complex-based) interactions. In both cases, CNNC performs well, as we show in SI Appendix, Fig. S7.

To try to understand the basis for the decisions reached by CNNC, we plotted the 2D and 3D figures of 3 NEPDF inputs (Fig. 4 C–H), which were correctly predicted as different labels (0 in Fig. 4 C and D, 1 in Fig. 4 E and F, and 2 in Fig. 4 G and H). As can be seen, random pairs look more uninformative and symmetric, while in both the label 1 and label 2 figures the 2 genes display partial correlations and there are places where both are up or down concurrently; the main difference between the histograms in Fig. 4 E–H are cases where one gene is up and the other is not. In Fig. 4 E and F, gene 2 is up while gene 1 is not, indicating that the causal relationship is likely 1→2. The opposite holds for Fig. 4 G and H, and so the method infers that 2→1 for

[Figure 3 appears here; panel titles give median (mean) AUROC: CNNC 0.9949 (0.8822), Pearson correlation 0.7104 (0.5990), count statistics 0.7698 (0.7324), GENIE3 0.9303 (0.7228), mutual information 0.7192 (0.6393), Bayesian network 0.7239 (0.6750), DREMI 0.5333 (0.5206).]
Fig. 3. Predicting undirected pathway edges. (A) Overall ROCs for CNNC performance on KEGG pathway gene interaction prediction using a large compendium of scRNA-seq data and bulk data. Here, each gray line represents one regulator with outgoing edges. Median (mean) AUROC is shown on the top of each panel. (B–G) Overall ROCs for Pearson correlation (PC), mutual information (MI), count statistics (CS), Bayesian directed network (BDN), DREMI, and Genie3 when tested on the KEGG pathway gene interaction prediction task. (H) Comparison of the 7 methods on the gene interaction prediction task. The comparison P values are (AUPRC [AUROC]): DREMI, 0.0 (0.0); PC, 5.71 × 10^−184 (3.48 × 10^−214); MI, 1.79 × 10^−191 (2.25 × 10^−193); BDN, 8.31 × 10^−231 (2.84 × 10^−226); CS, 7.91 × 10^−160 (1.45 × 10^−189); GENIE3, 8.04 × 10^−66 (7.31 × 10^−110); DNN, 7.61 × 10^−7 (3.90 × 10^−7). Boxplot shown with median, first quartile, third quartile, maximum, and minimum.
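The AUROC and AUPRC scores reported in these comparisons can be computed directly from a ranked list of pair scores. A minimal NumPy version (one cutoff per pair, no handling of tied scores) might look like:

```python
import numpy as np

def auroc_auprc(scores, labels):
    """Rank pairs by score, sweep one cutoff per pair (labels are 0/1),
    and integrate the resulting ROC and precision-recall curves."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels, dtype=float)[order]
    tp = np.cumsum(y)                 # true positives among the top-k pairs
    fp = np.cumsum(1.0 - y)           # false positives among the top-k pairs
    tpr = np.concatenate(([0.0], tp / tp[-1]))   # recall, with a (0, 0) start
    fpr = np.concatenate(([0.0], fp / fp[-1]))
    precision = tp / (tp + fp)
    # trapezoidal area under ROC; step-wise area under the PR curve
    auroc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    auprc = float(np.sum(np.diff(tpr) * precision))
    return auroc, auprc

print(auroc_auprc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfect ranking -> (1.0, 1.0)
```

Library implementations (e.g., scikit-learn's `roc_auc_score` and `average_precision_score`) handle ties and edge cases and would be the practical choice; this sketch only makes the definitions concrete.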

that input. While relationships between expression values, including the ones mentioned above, can be manually prescribed for an algorithm, we also noted that the encoding used for CNNC allows it to look at more complicated relationships between genes. In Fig. 4 I–P, we plot the mean, variance, and coefficient of variance (CV) for gene 2 as a function of the expression of gene 1 for both prediction directions (1→2, Top and 2→1, Bottom). As can be seen, the variance and CV trends are consistent within category and diverging between categories, indicating that CNNC can make use of second-order or even higher-order distribution properties. Similar phenomena have been anecdotally observed in specific cases, for example for micro-RNA regulation (29), but the ability of CNNC to learn such relationships on its own strongly suggests that it can generalize much better than prior methods for inferring such causal interactions.

We also performed a number of experiments to test the robustness of CNNC to dropouts, the size of the input data, the impact of unbalanced labels, and different cross-validation strategies. Results, presented in SI Appendix, Fig. S8, indicate that CNNC is robust and can be successfully applied even to unbalanced datasets, which are common in biology.

Using CNNC for Functional Assignments. We next explored the use of CNNC for assigning function or disease relevance to genes. For this, we applied CNNC to predict genes associated with 3 essential cell functions: cell cycle, circadian rhythm, and immune system. For each of these functions, we obtained known genes from gene set enrichment analysis (GSEA) (30) and trained CNNC using all expression data on 2/3 of these genes, holding out the other 1/3 as a test set. In this setting, the network is trained to predict 1 for a pair of genes that are both cell cycle genes and 0 for all other pairs (Methods and SI Appendix, Supplementary Notes). When testing on the held-out set, CNNC achieved an AUROC of 0.79 (Fig. 5A), significantly outperforming both guilt by association (GBA) and DNN. Importantly, the top 10% predicted genes were all true positives (SI Appendix, Fig. S9). CNNC also performs best for the circadian rhythm and immune system functions (Fig. 5 B and C).

Given its success on a well-studied functional set, we next asked whether CNNC can be used to predict novel disease genes. We focused on 2 lung diseases, asthma and chronic obstructive pulmonary disease (COPD), and on head and neck cancer (HNC). We obtained 147, 44, and 72 genes for asthma, COPD, and HNC, respectively, from "Malacards" (31). We next trained CNNC with all known genes for each disease and used it to predict additional genes for each disease. We evaluated the predicted sets both manually and by statistical analysis using gene ontology (GO) and compared these to prior methods for GBA (32) analysis, as can be seen in Fig. 5 D–F. For all of the 3 diseases, CNNC outperforms GBA, and it obtained many more significant GO terms when compared to GBA (Fig. 5 I and J). Manual inspection of the top 10 genes for asthma indicated that 7 of them are supported based on recent studies (Dataset S1), including "Lck," which was recently determined to be a potential drug target for asthma therapy (33).

Discussion and Conclusion

Several methods for inferring gene–gene relationships from expression data have been developed over the last 2 decades. While these methods perform well in some cases, they suffer from a number of drawbacks that often led to false positives or to missing key relationships (false negatives). The former can be attributed to the unsupervised nature of most methods (including methods for coexpression and clustering), making it hard to "train" them on a labeled dataset. The latter often resulted from the assumptions used by specific methods (e.g., distribution assumptions for DBNs) that do not always hold.

To address these issues, we presented CNNC, a general framework for gene relationship inference, which is based on CNNs. The key idea here is to convert the input data into a co-occurrence histogram. Such a representation enables us to fully utilize both the information contained in SC data and the ability of CNNs to exploit spatial information. On the one hand, SC data provide information about the actual, cell-based relationships, while relationships in bulk studies only provide information on averages and so do not accurately reflect real interactions and


Fig. 4. Directed (causal) edge prediction. (A) Overall ROCs for performance of CNNC on KEGG pathway directed edge prediction using a large compendium of scRNA-seq and bulk data. Median (mean) AUROC is shown on the Top. (B) The AUROC histogram for A. (C–H) A typical NEPDF sample from a KEGG interaction that is correctly predicted as label 0, 1, and 2, in the form of 2D and 3D plots. (I–L) Variance (var), mean, and coefficient of variance (CV) of gene 1 as the expression of gene 2 increases for top correctly predicted pairs with label 1. (M–P) Same for top predictions for label 2. (Q–T) Average and variance of CV for the top prediction groups correctly predicted as label 1 (Q and S) and label 2 (R and T).
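The directional signal behind these predictions is visible in the encoding itself: swapping the gene order exactly transposes the NEPDF, so any asymmetry in the histogram is information a classifier can exploit, whereas a symmetric score such as PC assigns both orientations the same value. A small synthetic check (toy data and the "dropout" construction are our own, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.poisson(5.0, size=20000).astype(float)
# toy "a regulates b": b copies a in ~70% of cells and drops out otherwise
b = a * (rng.random(20000) < 0.7)

h_ab, _, _ = np.histogram2d(np.log10(a + 1), np.log10(b + 1), bins=32)
h_ba, _, _ = np.histogram2d(np.log10(b + 1), np.log10(a + 1), bins=32)

# swapping the gene order exactly transposes the histogram ...
assert np.array_equal(h_ba, h_ab.T)
# ... and this pair's histogram is visibly asymmetric (mass where b is
# off while a is on), which is exactly what a symmetric score discards
print(float(np.abs(h_ab - h_ab.T).sum()) > 0.0)  # True
```

This mirrors the Fig. 4 E–H observation in the text: cells where one gene is up while the other is not sit off the diagonal on one side only, and the side tells the direction.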

causality. Furthermore, the large number of cells in recent SC datasets enables us to accurately estimate the joint distribution for gene pairs. Here, we used tens of thousands of expression profiles from a relatively small number of experiments (a few hundreds), whereas bulk datasets contained much fewer profiles (the bulk data we use, which is from one of the largest experiments, has only ∼300 profiles). In addition, unlike most prior methods, CNNC is supervised, which allows the CNN to zoom in on subtle differences between positive and negative pairs. Supervision also helps fine-tune the scoring function based on the different applications. For example, different features may be important for analyzing TF–gene interactions when compared to inferring genes in the same pathway. Finally, the fact that the network can utilize the large volumes of scRNA-seq data without requiring explicit assumptions about the distribution of the input allows it to better overcome noise and other errors, reducing false negatives.

Analysis of several different interaction prediction and functional assignment tasks indicates that CNNC can improve upon prior, unsupervised methods. It can also be naturally extended to integrate complementary data including epigenetic and sequence information. Comparisons to more advanced methods for biological network reconstruction further highlight the advantages of CNNC. In addition, CNNC can be used as a preprocessing step, or as a component, in more advanced network reconstruction methods. Finally, CNNC is easy to use either with general data or with condition-specific data. For the former, users can download the data and implementation from the supporting website, provide a list of labels (positive and negative pairs for their system of interest), and retrieve the scores for all possible gene pairs. These in turn can be used for any downstream application including network analysis, functional gene assignment, etc.

While a number of prior NN methods were developed, by us and others, to analyze single-cell expression vectors (11, 34–38), these methods are very different from CNNC. First, their goal is usually to compare data across cells rather than to analyze gene relationships within cells as CNNC does. Second, unlike CNNC, these prior methods rely on a vector (or matrix for multiple cells) representation of expression data, which does not utilize the spatial


Fig. 5. Functional assignment and pathway reconstruction using CNNC. CNNC can be used as a component in downstream analysis algorithms including functional assignments and disease gene prediction. (A–C) Performance of CNNC on the cell cycle, immune system, and circadian rhythm gene prediction task. (D–F) Performance of CNNC on the asthma, COPD, and HNC disease gene prediction task. (G) Predicted expression pattern of a cell cycle∼cell cycle gene pair. (H) Predicted expression pattern of a cell cycle∼non-cell cycle gene pair. (I and J) The most significant GO terms of top 5% predicted asthma (D) and COPD (E) disease genes by CNNC and GBA, respectively.
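The functional-assignment training setup described above (a pair is labeled 1 only if both genes carry the annotation, and the 2/3–1/3 split is done over genes rather than pairs, so held-out genes never leak into training) can be sketched as follows. The gene IDs and the choice to anchor every pair on an annotated gene are illustrative assumptions, not the released pipeline:

```python
import itertools
import random

def gene_holdout_pairs(function_genes, all_genes, train_frac=2 / 3, seed=0):
    """Split the annotated genes 2/3-1/3 at the *gene* level, then emit
    (gene_a, gene_b, label) tuples: label 1 iff both genes are annotated.
    Genes held out for testing never appear in any training pair."""
    rng = random.Random(seed)
    pos = sorted(function_genes)
    rng.shuffle(pos)
    cut = int(len(pos) * train_frac)
    train_pos, test_pos = set(pos[:cut]), set(pos[cut:])

    def labeled_pairs(pos_subset, banned):
        # drop the other split's annotated genes, then keep pairs that
        # touch at least one annotated gene from this split
        keep = [g for g in all_genes if g not in banned]
        return [(x, y, int(x in pos_subset and y in pos_subset))
                for x, y in itertools.combinations(keep, 2)
                if x in pos_subset or y in pos_subset]

    return (labeled_pairs(train_pos, banned=test_pos),
            labeled_pairs(test_pos, banned=train_pos))

genes = ["cdc20", "plk1", "aurka", "gapdh", "actb", "tuba1a"]  # hypothetical IDs
train, test = gene_holdout_pairs({"cdc20", "plk1", "aurka"}, genes)
train_genes = {g for x, y, _ in train for g in (x, y)}
print(sorted({"cdc20", "plk1", "aurka"} - train_genes))  # the one held-out gene
```

Splitting at the gene level rather than the pair level is what makes the AUROC on the held-out set an honest estimate: otherwise the same gene could appear in both training and test pairs.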

analysis advantages of deep NN. CNNC uses such an idea by converting coexpression relationships to image histograms prior to their analysis. While this was applied here to gene expression data, such an approach may also be appropriate for other types of data, for example, financial data.

Since CNNC is supervised, it would indeed not generalize to cases where no labels are available, unlike some of the methods we compare to. On the other hand, when labels are available, which is common to several cases with genomics data (including all of the tasks we presented), CNNC is a much better choice than unsupervised methods.

CNNC is implemented in Python, and both data and an open-source version of the software are available from the supporting website (https://github.com/xiaoyeye/CNNC).

Methods

Dataset Sources and Preprocess Pipelines. We used genomics data of different types from several studies. The mouse scRNA-seq dataset collected by Alavi et al. (13) consists of 43,261 uniformly processed expression profiles from over 500 different scRNA-seq studies. For each profile, expression values are available for the same set of 20,463 genes. Among the cells, 4,126 are dendritic cells, and 6,283 are bone marrow-derived macrophage cells. Additionally, mESC data, which contain 2,717 cells, were downloaded from Gene Expression Omnibus with accession number GSE65525 (39). The 1.3 million mouse SC dataset was downloaded from ref. 40. The mouse bulk RNA-seq dataset was downloaded from the Mouse ENCODE project (25). That dataset includes 249 samples, and we only utilized genes that are present in the scRNA-seq dataset, leading to the same number of genes for both datasets. mESC DNase data were also downloaded from the Mouse ENCODE project (25) (ENCFF096WRW.bed). Mouse TF motif information is from the TRANSFAC database (41). PWM values were calculated with the Python package "Biopython" (42).

For the DNase and PWM analysis, we followed prior papers and defined the TSS region as 10 kb upstream to 1 kb downstream of the TSS for each gene (19, 20). For each TF and gene pair, using the Biopython package, we calculated the score between the TF motif sequence and both the "+/−" strand sequences at all possible positions along the TSS region of the gene, and then selected the maximum as the final PWM score. The maximum DNase peak signal in the TSS region was used as the scalar DNase value for each gene.

Labeled Data. mESC ChIP-seq peak region data were downloaded from the GTRD database, and we used peaks with threshold P value < 10^−400 for mESC cells and 10^−200 for macrophage cells and dendritic cells. If a TF a has at least one ChIP-seq peak signal in or partially in the TSS region of gene b, as defined above, we say that a regulates b. See SI Appendix, Table S3, for details. KEGG and Reactome pathway data were downloaded with the R package "graphite" (43). KEGG contains 290 pathways, and Reactome contains 1,581 pathways. For both, we only selected directed edges with either activation or inhibition edge types and filtered out cyclic gene pairs where genes regulate each other mutually (to allow for a unique label for each pair). In total, we have 3,057 proteins with outgoing directed edges in KEGG, and the total number of directed edges is 33,127. For Reactome, the corresponding numbers are 2,519 and 33,641.

Constructing the Input Histogram. Image size is very important, since a small image size will lead to data loss (very few expression levels), whereas large sizes can miss important relationships due to noise. Thus, the best image size depends on the particular task and the amount of data available. The best way to determine the optimal size is to treat it as another network hyperparameter and perform cross-validation with different sizes to select the optimal one. As we show in SI Appendix, Fig. S3, applying such a method to the KEGG prediction tasks identifies 32 × 32 as the optimal input size for the scRNA-seq data we used, and so this is the dimension used throughout this paper. For any gene pair (a, b), we first log transformed their expression and then uniformly divided the expression range of each gene into 32 bins. Next, we created the 32 × 32 histogram by assigning each sample to an entry in the matrix and counting the number of samples for each entry. Due to the very low expression levels, and even more so to dropouts in scRNA data, the value in the zero–zero position is always very large and often dominates the entire matrix. To overcome this, we added pseudocounts to all entries and applied another log transformation to each entry to get the final matrix. We combined bulk and scRNA-seq NEPDFs by concatenating them as a 32 × 64 matrix to achieve better performance. See SI Appendix, Fig. S3, for additional ways to integrate the different data types.

CNN for RPKM Data. We followed VGGnet (44) to build our CNN model (SI Appendix, Fig. S1). The CNN consists of stacked layers of x 3 × 3 convolutional filters (Eq. 1) (x is a power of 2, ranging from 32 to 64 to 128) and interleaved layers of 2 × 2 maxpooling (Eq. 2). We used the constructed input data as input to the CNN. Each convolution layer computes the following function:

Convolution(X)^k_{i,j} = \sum_{m=0}^{2} \sum_{n=0}^{2} W^k_{m,n} X_{i+m,j+n}, [1]

where X is the input from the previous layer, (i, j) is the output position, k is the convolutional filter index, and W^k is the filter matrix of size 3 × 3. In other words, each convolutional layer computes a weighted average of the prior layer values, where the weights are determined by training. The maxpooling layer computes the following function:

maxpooling(X)^k_{i,j} = max{X^k_{2i,2j}, X^k_{2i+1,2j}, X^k_{2i,2j+1}, X^k_{2i+1,2j+1}}, [2]

where X is the input, (i, j) is the output position, and k is the convolutional filter index. In other words, the layer selects one of the values of the previous layer to move forward.

Overall Structure of CNNC. The overall structure of the CNN is presented in SI Appendix, Fig. S1. The input layer of the CNN is either 32 × 32 (scRNA-seq) or 32 × 64 (scRNA-seq and bulk RNA-seq), as discussed above. In addition, the CNN contains 10 intermediate layers and a single 1- or 3-dimensional output layer. The 10 layers include both convolutional and maxpooling layers, and the exact dimensions of each layer are shown in SI Appendix, Fig. S1. Following ref. 45, we used the rectified linear activation function (ReLU, Eq. 3) across the whole network, except the final classification layers, where the "sigmoid" function (Eq. 4) was used for binary classification and the "softmax" function (Eq. 5) for multicategory classification. These functions are defined below:

ReLU(x) = x if x ≥ 0; 0 if x < 0, [3]

Sigmoid_θ(x) = 1 / (1 + e^{−θ^T x}), [4]

Softmax_θ(x) = [e^{θ_1^T x}, e^{θ_2^T x}, …, e^{θ_k^T x}] / \sum_{j=1}^{k} e^{θ_j^T x}. [5]

Training and Testing Strategy. We evaluated the CNN using 3-fold cross-validation across all tasks. In these, training and test datasets are strictly separated to avoid information leakage. See SI Appendix, Supplementary Notes, for details. For TF-binding prediction, we used binary classification, where a NEPDF matrix with label 1 was generated for each target b of TF a, and a matrix with label 0 was generated for a randomly selected nontarget gene r to balance the positive and negative datasets. Similar to prior work, targets are defined based on the presence of a ChIP-seq peak in their promoters (46). For the KEGG and Reactome pathway prediction tasks, we used 3 labels (to enable causality analysis): for each gene pair (a, b), we generated (a, b)'s and (b, a)'s NEPDF matrices with labels of 1 and 2, respectively. NEPDF matrices with a label of 0 were generated from random (r, s) gene pairs among the KEGG or Reactome gene sets. For each candidate gene pair, CNNC computes a probability vector [p0, p1, p2], where p0 represents the probability that genes a and b are not interacting, p1 encodes the case that gene a regulates gene b, and p2 is the probability that gene b regulates gene a. After training, we used p1(a, b) + p2(a, b) as the probability that a interacts with b and p2(a, b) − p2(b, a) as the pseudoprobability that b regulates a. For interaction prediction, only the probability vectors of (a, b) and (r, s) were used in the evaluations, while for causality prediction, only the probability vectors of (a, b) and (b, a) were used. To avoid overfitting, an early stopping strategy that monitors the validation loss is used. To evaluate the detailed performance for every TF and regulator in the KEGG and Reactome tasks, we calculated the AUROC and AUPRC for each of them and combined all values as the final result (SI Appendix, Supplementary Notes). We also performed 4-category tasks using this data separation and cross-validation strategy (see SI Appendix, Fig. S7, for details).

Integrating Expression, Sequence, and DNase Data. To integrate DNase and PWM data with the processed RNA-seq data, we first computed the maximum value for a PWM scan and DNase accessibility for each promoter region. We next generated a 2-value vector from these data for each pair and embedded it into a 128D vector using one fully connected layer containing 128 nodes. Next, these are concatenated with the processed expression data to form a 256D vector, which serves as input to a fully connected 128-node neural network layer with a binary classifier. See SI Appendix, Fig. S1, for details.

Functional Gene Assignment. To assign genes to a function (biological process or disease), we train a CNNC model for each function with known genes g. Similar to all other tasks, the input to each model is a pair of genes where the first is g and the second is either a positive (known) or negative (randomly selected from the whole unknown gene set) gene.

Known Genes for Functional Assignment Testing. We downloaded 855 (332, 103, 182, 59, 128) human cell cycle (immune system, rhythm, asthma, COPD, HNC) genes from GSEA ["Malacards" (31), a human disease website, https://www.malacards.org/]. We obtained mouse orthologs for all genes, resulting in 682, 278, 98, 147, 47, and 71 genes for them, respectively. For training, we used all genes for the diseases and a randomly selected set of unknown genes. See SI Appendix, Supplementary Notes, for details.

Data Availability. All data, scripts, and instructions required to run CNNC in Python can be found on our support website, https://github.com/xiaoyeye/CNNC. All other public data can be obtained by following the pipelines in Dataset Sources and Preprocess Pipelines and Labeled Data in Methods.

ACKNOWLEDGMENTS. This work was partially supported by NIH Grants 1R01GM122096 and OT2OD026682 (to Z.B.-J.) and a James S. McDonnell Foundation Scholars Award in Studying Complex Systems (to Z.B.-J.).

6 of 8 | www.pnas.org/cgi/doi/10.1073/pnas.1911536116 | Yuan and Bar-Joseph
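Eqs. 1 and 2 can be checked directly with a small numpy sketch: a single 3 × 3 filter applied as a weighted sum over every valid position (Eq. 1), followed by 2 × 2 maxpooling that keeps the largest value in each block (Eq. 2). The toy input and averaging filter below are illustrative, not values from the trained model:

```python
import numpy as np

def convolution(X, W):
    """Eq. 1: valid 3x3 convolution for one filter W over a 2D input X
    (a weighted sum of the prior layer at each output position)."""
    h, w = X.shape[0] - 2, X.shape[1] - 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = sum(W[m, n] * X[i + m, j + n]
                            for m in range(3) for n in range(3))
    return out

def maxpooling(X):
    """Eq. 2: 2x2 maxpooling -- keep the largest value of each block."""
    h, w = X.shape[0] // 2, X.shape[1] // 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = X[2 * i:2 * i + 2, 2 * j:2 * j + 2].max()
    return out

X = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "NEPDF"
W = np.full((3, 3), 1 / 9)                    # averaging filter
C = convolution(X, W)  # 4x4 feature map
P = maxpooling(C)      # 2x2 map after pooling
```

A VGG-style network, as in the paper, simply stacks such convolution layers (with learned filters and ReLU activations) and pooling layers before the final classification layer.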

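The scoring rules in Training and Testing Strategy — summing p1 and p2 for interaction, and taking the difference of the two directional p2 outputs as a pseudoprobability of direction — amount to a few lines. The probability vectors below are made-up stand-ins for real CNNC softmax outputs:

```python
import numpy as np

def interaction_score(p):
    """p = [p0, p1, p2] for a gene pair fed as (a, b);
    p1 + p2 is the probability that a and b interact at all."""
    return p[1] + p[2]

def causality_score(p_ab, p_ba):
    """Pseudoprobability of direction from the probability vectors of
    the two orderings, per the p2(a, b) - p2(b, a) rule."""
    return p_ab[2] - p_ba[2]

# Made-up outputs for the same pair fed as (a, b) and as (b, a).
p_ab = np.array([0.10, 0.20, 0.70])
p_ba = np.array([0.10, 0.65, 0.25])
print(interaction_score(p_ab))      # close to 0.9
print(causality_score(p_ab, p_ba))  # positive: direction b -> a favored
```

Feeding both orderings through the same trained model and differencing their p2 entries is what lets a 3-class network produce a signed causality call.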
1. E. Kuzmin et al., Systematic analysis of complex genetic interactions. Science 360, eaao1729 (2018).
2. T. Itzel et al., Translating bioinformatics in oncology: Guilt-by-profiling analysis and identification of KIF18B and CDCA3 as novel driver genes in carcinogenesis. Bioinformatics 31, 216–224 (2015).
3. S. M. Hill et al., Inferring causal molecular networks: Empirical assessment through a community-based effort. Nat. Methods 13, 310–318 (2016).
4. M. H. Maathuis, D. Colombo, M. Kalisch, P. Buhlmann, Predicting causal effects in large-scale systems from observational data. Nat. Methods 7, 247–248 (2010).
5. D. Marbach et al., Wisdom of crowds for robust gene network inference. Nat. Methods 9, 796–804 (2012).
6. L. Song, P. Langfelder, S. Horvath, Comparison of co-expression measures: Mutual information, correlation, and model based indices. BMC Bioinf. 13, 328 (2012).
7. P. Langfelder, S. Horvath, WGCNA: An R package for weighted correlation network analysis. BMC Bioinf. 9, 559 (2008).
8. Z. Wei, H. Li, A Markov random field model for network-based analysis of genomic data. Bioinformatics 23, 1537–1544 (2007).
9. V. A. Huynh-Thu, A. Irrthum, L. Wehenkel, P. Geurts, Inferring regulatory networks from expression data using tree-based methods. PLoS One 5, e12776 (2010).
10. T. E. Chan, M. P. H. Stumpf, A. C. Babtie, Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst. 5, 251–267.e3 (2017).
11. C. Lin, S. Jain, H. Kim, Z. Bar-Joseph, Using neural networks for reducing the dimensions of single-cell RNA-seq data. Nucleic Acids Res. 45, e156 (2017).
12. S. Freytag, J. Gagnon-Bartsch, T. P. Speed, M. Bahlo, Systematic noise degrades gene co-expression signals but can be corrected. BMC Bioinf. 16, 309 (2015).
13. A. Alavi, M. Ruffalo, A. Parvangada, Z. Huang, Z. Bar-Joseph, A web server for comparative analysis of single-cell RNA-seq data. Nat. Commun. 9, 4768 (2018).
14. L. Song, G. E. Crawford, DNase-seq: A high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb. Protoc. 2010, pdb.prot5384 (2010).
15. S. Sinha, On counting position weight matrix matches in a sequence, with application to discriminative motif finding. Bioinformatics 22, e454–e463 (2006).
16. M. Crow, J. Gillis, Co-expression in single-cell analysis: Saving grace or original sin? Trends Genet. 34, 823–831 (2018).
17. D. S. Johnson, A. Mortazavi, R. M. Myers, B. Wold, Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502 (2007).
18. I. Yevshin, R. Sharipov, T. Valeev, A. Kel, F. Kolpakov, GTRD: A database of transcription factor binding sites identified by ChIP-seq experiments. Nucleic Acids Res. 45, D61–D67 (2017).
19. M. H. Schulz et al., Reconstructing dynamic microRNA-regulated interaction networks. Proc. Natl. Acad. Sci. U.S.A. 110, 15686–15691 (2013).
20. M. H. Schulz et al., DREM 2.0: Improved reconstruction of dynamic regulatory networks from time-series expression data. BMC Syst. Biol. 6, 104 (2012).
21. A. Greenfield, A. Madar, H. Ostrer, R. Bonneau, DREAM4: Combining genetic and dynamic information to identify biological networks and dynamical models. PLoS One 5, e13397 (2010).
22. Y. X. Wang, M. S. Waterman, H. Huang, Gene coexpression measures in large heterogeneous samples using count statistics. Proc. Natl. Acad. Sci. U.S.A. 111, 16371–16376 (2014).
23. S. Krishnaswamy et al., Systems biology. Conditional density-based analysis of T cell signaling in single-cell data. Science 346, 1250689 (2014).
24. A. Khan et al., JASPAR 2018: Update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D260–D266 (2018).
25. F. Yue et al., A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).
26. A. Gitter, M. Carmi, N. Barkai, Z. Bar-Joseph, Linking the signaling cascades and dynamic regulatory networks controlling stress responses. Genome Res. 23, 365–376 (2013).
27. M. Kanehisa, M. Furumichi, M. Tanabe, Y. Sato, K. Morishima, KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2017).
28. A. Fabregat et al., The reactome pathway knowledgebase. Nucleic Acids Res. 46, D649–D655 (2018).
29. J. M. Schmiedel et al., Gene expression. MicroRNA control of protein expression noise. Science 348, 128–132 (2015).
30. A. Subramanian et al., Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550 (2005).
31. N. Rappaport et al., MalaCards: An amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res. 45, D877–D887 (2017).
32. S. Oliver, Guilt-by-association goes global. Nature 403, 601–603 (2000).
33. S. Zhang, R. Yang, Y. Zheng, The effect of siRNA-mediated lymphocyte-specific protein tyrosine kinase (Lck) inhibition on pulmonary inflammation in a mouse model of asthma. Int. J. Clin. Exp. Med. 8, 15146–15154 (2015).
34. M. Amodio et al., Exploring single-cell data with deep multitasking neural networks. Nat. Methods, 10.1038/s41592-019-0576-7 (2019).
35. J. Ding, A. Condon, S. P. Shah, Interpretable dimensionality reduction of single cell data with deep generative models. Nat. Commun. 9, 2002 (2018).
36. R. Lopez, J. Regier, M. B. Cole, M. I. Jordan, N. Yosef, Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
37. G. Eraslan, L. M. Simon, M. Mircea, N. S. Mueller, F. J. Theis, Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
38. E. Arvaniti, M. Claassen, Sensitive detection of rare disease-associated cell subsets via representation learning. Nat. Commun. 8, 14825 (2017).
39. A. M. Klein et al., Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
40. 10x Genomics, 1.3 Million Brain Cells from E18 Mice (2018). https://support.10xgenomics.com/single-cell-gene-expression/datasets. Accessed 8 May 2019.
41. E. Wingender, P. Dietze, H. Karas, R. Knuppel, TRANSFAC: A database on transcription factors and their DNA binding sites. Nucleic Acids Res. 24, 238–241 (1996).
42. P. J. Cock et al., Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
43. G. Sales, E. Calura, D. Cavalieri, C. Romualdi, Graphite—a bioconductor package to convert pathway topology to gene network. BMC Bioinf. 13, 20 (2012).
44. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (10 April 2015).
45. X. Glorot, A. Bordes, Y. Bengio, "Deep sparse rectifier neural networks" in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR, G. Gordon, D. Dunson, M. Dudík, Eds. (2011), vol. 15, pp. 315–323.
46. J. Ernst, H. L. Plasterer, I. Simon, Z. Bar-Joseph, Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Res. 20, 526–536 (2010).
