
Deep learning for inferring gene relationships from single-cell expression data

Ye Yuan (a) and Ziv Bar-Joseph (a,b,1)

(a) Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213; and (b) Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

Edited by Nancy R. Zhang, University of Pennsylvania, Philadelphia, PA, and accepted by Editorial Board Member Peter J. Bickel November 12, 2019 (received for review July 7, 2019)

Several methods were developed to mine gene–gene relationships from expression data. Examples include correlation and mutual information methods for coexpression analysis, clustering and undirected graphical models for functional assignments, and directed graphical models for pathway reconstruction. Using an image-like encoding for expression data, followed by deep neural network analysis, we present a framework that can successfully address all of these diverse tasks. We show that our method, convolutional neural network for coexpression (CNNC), improves upon prior methods in tasks ranging from predicting transcription factor targets to identifying disease-related genes to causality inference. CNNC's encoding provides insights about some of the decisions it makes and their biological basis. CNNC is flexible and can easily be extended to integrate additional types of data, leading to further improvements in its performance.

gene interactions | deep learning | causality inference

Significance

Accurate inference of gene interactions and causality is required for pathway reconstruction, which remains a major goal for many studies. Here, we take advantage of 2 recent technological developments, single-cell RNA sequencing and deep learning, to propose an encoding scheme for gene expression data. We use this encoding in a supervised framework to perform several different types of analysis using minimal assumptions. Our method, convolutional neural network for coexpression (CNNC), first transforms expression data lacking locality to an image-like object on which convolutional neural networks (CNNs) work very well. We then utilize CNNs for learning relationships between genes, causality inferences, functional assignments, and disease gene predictions. For all of these tasks, CNNC significantly outperforms all prior task-specific methods.

Author contributions: Y.Y. and Z.B.-J. designed research; Y.Y. and Z.B.-J. performed research; Y.Y. analyzed data; and Y.Y. and Z.B.-J. wrote the paper.
The authors declare no competing interest.
This article is a PNAS Direct Submission. N.R.Z. is a guest editor invited by the Editorial Board.
Published under the PNAS license.
Data deposition: The software in this paper has been deposited in GitHub, https://github.com/xiaoyeye/CNNC.
1 To whom correspondence may be addressed. Email: [email protected].
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1911536116/-/DCSupplemental.

Several computational methods have been developed to infer relationships between genes based on gene expression data. These range from methods for inferring coexpression relationships between pairs of genes (1) to methods for inferring a biological or disease process for a gene based on other genes [either using clustering or guilt by association (2)] to causality inferences (3, 4) and pathway reconstruction methods (5). To date, each of these tasks was handled by a different computational framework. For example, gene coexpression analysis is usually performed using Pearson correlation (PC) or mutual information (MI) (6). Functional assignment of genes is often performed using clustering (7) or undirected graphical models including Markov random fields (8), while pathway reconstruction is often based on directed probabilistic graphical models (4). These methods also serve as an initial step in some of the most widely used tools for the analysis of genomics data, including network inference and reconstruction approaches (3, 9, 10), methods for classification based on gene expression (11), and many more.

While successful and widely used, these methods also suffer from serious drawbacks. First, most of these methods are unsupervised. Given the large number of genes that are profiled, and the often relatively small (at least in comparison) number of samples, several genes that are determined to be coexpressed or cofunctional may only reflect chance or noise in the data (12). In addition, most of the widely used methods are symmetric, which means that each pair has only one relationship value. While this is advantageous for some applications (e.g., clustering), it may be problematic for methods that aim at inferring causality (e.g., network reconstruction tasks).

To address these issues, we developed a method, convolutional neural network for coexpression (CNNC), which provides a supervised way (that can be tailored to the condition/question of interest) to perform gene relationship inference. CNNC utilizes a representation of the input data specifically suitable for deep learning. It represents each pair of genes as an image (histogram) and uses convolutional neural networks (CNNs) to infer relationships between the different expression levels encoded in the image. The network is trained with positive and negative examples for the specific domain of interest (e.g., known targets of a transcription factor [TF], known pathways for a specific biological process, known disease genes, etc.), and the output can be either binary or multinomial.

We applied CNNC using a large cohort of single-cell (SC) expression data and tested it on several inference tasks. We show that CNNC outperforms prior methods for inferring interactions (including TF–gene and protein–protein interactions), causality inference, and functional assignments (including biological processes and diseases).

Results

We developed CNNC, a general computational framework for supervised gene relationship inference (Fig. 1). CNNC is based on a CNN, which is used to analyze summarized co-occurrence histograms from pairs of genes in single-cell RNA-seq (scRNA-seq) data. Given a relatively small labeled set of positive pairs (with either negative or random pairs serving as negatives), CNNC learns to discriminate between interacting pairs, causal pairs, negative pairs, or any other gene relationship types that can be defined.

Learning a CNNC Model. CNNC can be trained with any expression dataset, although as with other neural network applications, the more data, the better its performance. Given expression data, we

www.pnas.org/cgi/doi/10.1073/pnas.1911536116 | PNAS Latest Articles | 1 of 8

first generate a normalized empirical probability distribution function (NEPDF) for each gene pair (genes a and b) (Fig. 1). For this, we calculate a normalized 2-dimensional (2D) histogram of fixed size (32 × 32). The specific dimension for the input is a hyperparameter that can be learned for each dataset (SI Appendix, Fig. S3A and Supplementary Notes). In the histogram, columns represent gene a expression levels and rows represent gene b, such that entries in the matrix represent the (normalized) co-occurrences of these values. If different data types are combined (e.g., bulk and SC), they can be either used separately or concatenated to form a combined NEPDF with dimension 32 × 64 (SI Appendix, Fig. S3B). Next, the distribution matrix is used as input to a CNN, which is trained using an N-dimension (ND) output label vector, where N depends on the specific task. For example, for coexpression or interaction prediction N is set to 1 (interacting or not), while for causality inference it is set to 3, where label 0 indicates that genes a and b are not interacting and label 1 (2) indicates that gene a (b) regulates gene b (a). In general, our CNN model consists of one 32 × 32 input layer, 10 intermediate layers including 6 convolutional layers, 3 maxpooling layers, 1 flatten layer, and a final ND "softmax" layer or 1 scalar "sigmoid" layer (Methods and SI Appendix, Fig. S1).

For the analysis presented in this paper, we used processed scRNA-seq and bulk RNA-seq data from different studies (13). All raw data were uniformly processed and assigned to a predetermined set of more than 20,000 mouse genes for each task (Methods). In addition to gene expression data, CNNC can integrate other data types including DNase-seq (14), position weight matrices (PWMs) (15), etc. For this, we concatenated the additional information as a vector to the intermediate output of the gene expression data and continued with the standard CNN architecture. See Methods and SI Appendix, Fig. S1, for the different architectures and details of CNNC, and SI Appendix, Table S1, for information on training and run time.

Fig. 1. CNNC input, output, and architecture. CNNC aims to infer gene–gene relationships using single-cell expression data. For each gene pair, scRNA-seq expression levels are transformed into 32 × 32 normalized empirical probability function (NEPDF) matrices. The NEPDF serves as an input to a convolutional neural network (CNN). The intermediate layer of the CNN can be further concatenated with input vectors representing DNase-seq and PWM data. The output layer can either have a single, 3, or more values, depending on the application. For example, for causality inference the output layer contains 3 probability nodes, where p0 represents the probability that genes a and b are not interacting, p1 encodes the case that gene a regulates gene b, and p2 is the probability that gene b regulates gene a.

Using CNNC to Predict TF–Gene Interactions. We first tested the CNNC framework on the task of predicting pairwise interactions from gene expression data (16). Chromatin immunoprecipitation (ChIP)-seq has been widely used as a gold standard for studying cell-type–specific protein–DNA interactions (17). We thus evaluated CNNC's performance using cell-type–specific scRNA-seq datasets (for mouse embryonic stem cells [mESCs], bone marrow-derived macrophages, and dendritic cells; Methods) and ChIP-seq data from the Gene Transcription Regulation Database (GTRD) (18). We extracted data from GTRD for 38 TFs for which ChIP-seq experiments were performed in mESCs, 13 TFs studied in macrophages, and 16 TFs for dendritic cells. To determine targets for each TF using the ChIP-seq data, we followed prior work (19, 20) and defined a promotor region as 10 kb upstream to 1 kb downstream from the transcription start site (TSS) for each gene. If a TF a has at least one detected peak signal in or overlapping the promotor region of gene b, we say that TF a regulates gene b.

For this prediction task, we compared CNNC with several popular methods for gene–gene coexpression analysis: PC and MI, which are the 2 most popular coexpression analysis methods; Genie3 (9), which was the best performer in the Dialogue for Reverse Engineering Assessments and Methods (DREAM4) in silico network construction challenge (21); count statistics (CS) (22), which relies on local information based on gene expression ranks in large heterogeneous samples; conditional-density resampled estimate of mutual information (DREMI) (23); and a fully connected deep neural network (DNN), which also uses our NEPDF as input. Since most of the prior methods used for comparison are symmetric, we focused here on the 2-label setting (interacting or not). We applied 3-fold cross-validation to all datasets. Each fold contains several TFs, although the test is performed separately for each TF (Methods).

Fig. 2 presents the results of these comparisons. As can be seen, CNNC and DNN outperform all prior methods for all cell types. We observe significant improvement over all prior methods (Fig. 2 A–G). To evaluate the performance of CNNC, we chose both the area under the receiver operating characteristic curve and the area under the precision recall curve (AUROC/AUPRC) as our evaluation scores. The AUPRC achieved by CNNC is around 20% higher than PC and MI on some datasets (see SI Appendix, Fig. S2, for details). Importantly, as can be seen in Fig. 2 A–D, the difference is even more pronounced for the top-ranked predictions. For CNNC, we see almost no false negatives (less than 20%) for the top 5% ranked pairs. Such top predictions are often the most important, since the ability to validate predicted interactions is usually limited to the top few predictions.

Data Integration Further Improves TF Target Gene Prediction. The above analysis was only based on using expression values. However, as noted above, gene relationship inference is often used as a component in more extensive procedures that often integrate different types of genomics data. To test how the use of the NN-based method can aid such procedures, we extended CNNC so that it can utilize sequence and DNase hypersensitivity information. For sequence, we used PWMs from Jaspar (24). DNase-seq data for mESCs were obtained from the mouse ENCODE project (25). We used a simple strategy for processing the PWM and DNase data, which resulted in an additional 2D vector as input for each pair, which we embedded to create a 128D vector (Methods). We next concatenated this vector with the NEPDF's 128D vector in the flatten layer to form a 256D vector, as shown in Fig. 1 and SI Appendix, Fig. S1A.

Results, presented in Fig. 2H, show that these additional data sources indeed improve the ability to predict TF–gene interactions. As before, a combined framework utilizing CNNC outperforms methods that used CS.

CNNC Can Predict Pathway Regulator–Target Gene Pairs. While TFs usually directly impact the expression of their targets, several methods have also utilized RNA-seq data to infer pathways that combine protein–protein and protein–DNA interactions (26). To test whether CNNC can serve as a component in pathway inference

[Figure 2 appears here; panel titles in A–D give median (mean) AUPRC: CNNC 0.6694 (0.7025), DNN 0.6826 (0.6873), count statistics 0.6615 (0.6752), mutual information 0.6640 (0.6726).]
Fig. 2. GTRD TF-target prediction. (A–D) Precision recall curve (PRC) of CNNC, deep neural network (DNN), count statistics (CS), and mutual information (MI), which are the top 4 performing methods for TF-target prediction. Training and testing for these plots were done using bone marrow-derived macrophage scRNA-seq and ChIP-seq data. Median (mean) AUPRC is shown on the top of each panel. Here, each gray line represents one TF, the red line represents the median curve, and the light green part represents the region between the 25th and 75th quantiles. (E and F) AUPRC and AUROC for all 7 methods we compared. We tested all methods on the macrophage and mESC tissue-specific datasets. For mESC, the comparison P values with the best 2 other methods in terms of AUPRC (AUROC) were based on the Wilcoxon signed-rank test. For macrophage, the comparison P values are based on AUPRC and AUROC because the number of cells is not enough for the Wilcoxon test: CS, 2.85 × 10^−2; DNN, 3.55 × 10^−2. See SI Appendix, Fig. S2, for similar analysis using 3 additional datasets. (G) Overall AUPRC and AUROC of the methods in all 5 train and test experiments (3 tissue-specific datasets and 2 with the larger 1.3M mouse brain scRNA-seq from 10×), with comparison P values using the Wilcoxon signed-rank test between CNNC and the best 2 other methods in terms of AUPRC (AUROC): CS, 4.90 × 10^−8 (2.55 × 10^−12); DNN, 9.42 × 10^−3 (3.15 × 10^−3). (H) Comparison of TF-target predictions with additional data using mESC expression and TFs. Columns 1 to 4 show the AUPRCs of PC, MI, CS, and CNNC using scRNA-seq data, respectively. The fifth and sixth columns show performance when only using PWM or DNase. The last 3 columns show performance of the integration of expression, sequence (PWM), and DNase data. The comparison P value between the CS and CNNC integration methods is (AUPRC): 8.37 × 10^−4.
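The NEPDF encoding used throughout these comparisons (log-transform, uniform 32 × 32 binning, then pseudocounts and a second log-transform to tame the dropout-dominated zero–zero cell) can be sketched in a few lines of NumPy. This is a minimal reading of the recipe, not the released implementation; the pseudocount of 1 and the max-normalization are our assumptions:

```python
import numpy as np

def nepdf(expr_a, expr_b, bins=32):
    """Build a normalized empirical probability distribution function
    (NEPDF) histogram for one gene pair: log-transform, uniform binning
    into a bins x bins grid, then pseudocount + log to keep the
    dropout-heavy (0, 0) cell from dominating the matrix."""
    # log-transform raw expression (pseudocount of 1 to handle zeros)
    a = np.log10(np.asarray(expr_a, dtype=float) + 1.0)
    b = np.log10(np.asarray(expr_b, dtype=float) + 1.0)
    # 2D co-occurrence histogram over each gene's expression range
    h, _, _ = np.histogram2d(a, b, bins=bins)
    # second log-transform after a pseudocount, as described in Methods
    h = np.log10(h + 1.0)
    # scale to [0, 1] so every pair is on a comparable scale (our choice)
    return h / h.max() if h.max() > 0 else h

rng = np.random.default_rng(0)
x = rng.poisson(5, size=5000)   # toy scRNA-seq counts for gene a
y = rng.poisson(x)              # gene b loosely driven by gene a
m = nepdf(x, y)
print(m.shape)                  # (32, 32)
```

For real data, `expr_a` and `expr_b` would each hold one gene's expression across all cells of the training compendium.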

methods, we selected 2 representative pathway databases, Kyoto Encyclopedia of Genes and Genomes (KEGG) (27) and Reactome (28), as gold standard and used these, together with a large scRNA-seq dataset of 43,261 cells that was collected from over 500 different studies representing a wide range of cell types, conditions, etc. (13), and bulk RNA-seq data from ENCODE (25), to train and test our framework. Since we are interested in causal relationships, we only used directed edges (from regulator to target gene) with activation or inhibition edge types and filtered out cyclic gene pairs where genes regulate each other mutually (to allow for a unique label for each pair). As for the negative data, here we limited the negative set to a random set of pairs where both genes appear in pathways in the database but do not interact. Given the large number of genes, we performed a 3-fold cross-validation, keeping the sets of regulator genes in the training and test sets separated (Methods and SI Appendix, Supplementary Notes).

Results are presented in Fig. 3. As can be seen, CNNC performs very well on the KEGG pathways, reaching a median (mean) AUROC of 0.9949 (0.8822) compared to less than 0.9309 (0.7324) for the methods we compared, which here also included Bayesian directed networks (BDNs) (4) that learn a global directed interaction graph (Fig. 3F). CNNC also performs well on Reactome pathways (SI Appendix, Fig. S4). We also used the KEGG data to test the specific architecture CNNC utilizes and observed that the architecture used improves upon 2 alternative deep NN architectures, a deep fully connected NN (DNN) and a CNN without pooling layers (Fig. 3H and SI Appendix, Fig. S5).

Using CNNC for Causality Prediction. So far, we focused on general interaction predictions. However, as discussed above, CNNC can also be used to infer directionality by changing the output of the NN. We next used CNNC to infer causal edges for the KEGG and Reactome datasets. Specifically, we used CNNC to predict whether for 2 genes, a and b, the interaction is from a to b or vice versa. For the pathway databases, we only analyzed directed edges and so had the ground truth for that data as well. As can be seen in Fig. 4, for the KEGG dataset, CNNC is very successful, achieving a median AUROC of 0.9949 (Fig. 4 A and B). For Reactome (SI Appendix, Fig. S4), we see that the most confident predictions are correct, but beyond the top predictions performance levels off. We compared the performance of CNNC to another method developed for learning causal relationships from gene expression data, BDN (4), which learns a global directed interaction graph. Results presented in SI Appendix, Fig. S6, show that CNNC greatly outperforms BDN on this causality prediction task. We have also tested several other applications of CNNC, including its use for determining the impact of the interaction (activation or repression) and for determining its ability to identify directed vs. undirected (complex-based) interactions. In both cases, CNNC performs well, as we show in SI Appendix, Fig. S7.

To try to understand the basis for the decisions reached by CNNC, we plotted the 2D and 3D figures of 3 NEPDF inputs (Fig. 4 C–H), which were correctly predicted as different labels (0 in Fig. 4 C and D, 1 in Fig. 4 E and F, and 2 in Fig. 4 G and H). As can be seen, random pairs look more uninformative and symmetric, while in both the label 1 and label 2 figures the 2 genes display partial correlations and there are places where both are up or down concurrently; the main difference between the histograms in Fig. 4 E–H are cases where one gene is up and the other is not. In Fig. 4 E and F, gene 2 is up while gene 1 is not, indicating that the causal relationship is likely 1→2. The opposite holds for Fig. 4 G and H, and so the method infers that 2→1 for

[Figure 3 appears here; panel titles give median (mean) AUROC: CNNC 0.9949 (0.8822), Pearson correlation 0.7104 (0.5990), count statistics 0.7698 (0.7324), GENIE3 0.9303 (0.7228), mutual information 0.7192 (0.6393), Bayesian network 0.7239 (0.6750), DREMI 0.5333 (0.5206).]
Fig. 3. Predicting undirected pathway edges. (A) Overall ROCs for CNNC performance on KEGG pathway gene interaction prediction using a large compendium of scRNA-seq data and bulk data. Here, each gray line represents one regulator with outgoing edges. Median (mean) AUROC is shown on the top of each panel. (B–G) Overall ROCs for Pearson correlation (PC), mutual information (MI), count statistics (CS), Bayesian directed network (BDN), DREMI, and Genie3 when tested on the KEGG pathway gene interaction prediction task. (H) Comparison of the 7 methods on the gene interaction prediction task. The comparison P values are (AUPRC [AUROC]): DREMI, 0.0 (0.0); PC, 5.71 × 10^−184 (3.48 × 10^−214); MI, 1.79 × 10^−191 (2.25 × 10^−193); BDN, 8.31 × 10^−231 (2.84 × 10^−226); CS, 7.91 × 10^−160 (1.45 × 10^−189); GENIE3, 8.04 × 10^−66 (7.31 × 10^−110); DNN, 7.61 × 10^−7 (3.90 × 10^−7). Boxplot shown with median, first quartile, third quartile, maximum, and minimum.
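The AUROC and AUPRC scores reported in these comparisons can be computed directly from a ranked list of pair scores. A minimal NumPy version (one cutoff per pair, no handling of tied scores) might look like:

```python
import numpy as np

def auroc_auprc(scores, labels):
    """Rank pairs by score, sweep one cutoff per pair (labels are 0/1),
    and integrate the resulting ROC and precision-recall curves."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels, dtype=float)[order]
    tp = np.cumsum(y)                 # true positives among the top-k pairs
    fp = np.cumsum(1.0 - y)           # false positives among the top-k pairs
    tpr = np.concatenate(([0.0], tp / tp[-1]))   # recall, with a (0, 0) start
    fpr = np.concatenate(([0.0], fp / fp[-1]))
    precision = tp / (tp + fp)
    # trapezoidal area under ROC; step-wise area under the PR curve
    auroc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    auprc = float(np.sum(np.diff(tpr) * precision))
    return auroc, auprc

print(auroc_auprc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfect ranking -> (1.0, 1.0)
```

Library implementations (e.g., scikit-learn's `roc_auc_score` and `average_precision_score`) handle ties and edge cases and would be the practical choice; this sketch only makes the definitions concrete.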

that input. While relationships between expression values, including the ones mentioned above, can be manually prescribed for an algorithm, we also noted that the encoding used for CNNC allows it to look at more complicated relationships between genes. In Fig. 4 I–P, we plot the mean, variance, and coefficient of variance (CV) for gene 2 as a function of the expression of gene 1 for both prediction directions (1→2, Top and 2→1, Bottom). As can be seen, the variance and CV trends are consistent within category and diverging between categories, indicating that CNNC can make use of second-order or even higher-order distribution properties. Similar phenomena have been anecdotally observed in specific cases, for example for micro-RNA regulation (29), but the ability of CNNC to learn such relationships on its own strongly suggests that it can generalize much better than prior methods for inferring such causal interactions.

We also performed a number of experiments to test the robustness of CNNC to dropouts, the size of the input data, the impact of unbalanced labels, and different cross-validation strategies. Results, presented in SI Appendix, Fig. S8, indicate that CNNC is robust and can be successfully applied even to unbalanced datasets, which are common in biology.

Using CNNC for Functional Assignments. We next explored the use of CNNC for assigning function or disease relevance to genes. For this, we applied CNNC to predict genes associated with 3 essential cell functions: cell cycle, circadian rhythm, and immune system. For each of these functions, we obtained known genes from gene set enrichment analysis (GSEA) (30) and trained CNNC using all expression data on 2/3 of these genes, holding out the other 1/3 as a test set. In this setting, the network is trained to predict 1 for a pair of genes that are both cell cycle genes and 0 for all other pairs (Methods and SI Appendix, Supplementary Notes). When testing on the held-out set, CNNC achieved an AUROC of 0.79 (Fig. 5A), significantly outperforming both guilt by association (GBA) and DNN. Importantly, the top 10% predicted genes were all true positives (SI Appendix, Fig. S9). CNNC also performs best for the circadian rhythm and immune system functions (Fig. 5 B and C).

Given its success on a well-studied functional set, we next asked whether CNNC can be used to predict novel disease genes. We focused on 2 lung diseases, asthma and chronic obstructive pulmonary disease (COPD), and on head and neck cancer (HNC). We obtained 147, 44, and 72 genes for asthma, COPD, and HNC, respectively, from "Malacards" (31). We next trained CNNC with all known genes for each disease and used it to predict additional genes for each disease. We evaluated the predicted sets both manually and by statistical analysis using gene ontology (GO) and compared these to prior methods for GBA (32) analysis, as can be seen in Fig. 5 D–F. For all of the 3 diseases, CNNC outperforms GBA, and it obtained many more significant GO terms when compared to GBA (Fig. 5 I and J). Manual inspection of the top 10 genes for asthma indicated that 7 of them are supported based on recent studies (Dataset S1), including "Lck," which was recently determined to be a potential drug target for asthma therapy (33).

Discussion and Conclusion

Several methods for inferring gene–gene relationships from expression data have been developed over the last 2 decades. While these methods perform well in some cases, they suffer from a number of drawbacks that often led to false positives or to missing key relationships (false negatives). The former can be attributed to the unsupervised nature of most methods (including methods for coexpression and clustering), making it hard to "train" them on a labeled dataset. The latter often resulted from the assumptions used by specific methods (e.g., distribution assumptions for DBNs) that do not always hold.

To address these issues, we presented CNNC, a general framework for gene relationship inference, which is based on CNNs. The key idea here is to convert the input data into a co-occurrence histogram. Such a representation enables us to fully utilize both the information contained in SC data and the ability of CNNs to exploit spatial information. On the one hand, SC data provide information about the actual, cell-based relationships, while relationships in bulk studies only provide information on averages and so do not accurately reflect real interactions and


Fig. 4. Directed (causal) edge prediction. (A) Overall ROCs for performance of CNNC on KEGG pathway directed edge prediction using a large compendium of scRNA-seq and bulk data. Median (mean) AUROC is shown on the Top. (B) The AUROC histogram for A. (C–H) A typical NEPDF sample from a KEGG interaction that is correctly predicted as label 0, 1, and 2, in the form of 2D and 3D plots. (I–L) Variance (var), mean, and coefficient of variance (CV) of gene 1 as the expression of gene 2 increases for top correctly predicted pairs with label 1. (M–P) Same for top predictions for label 2. (Q–T) Average and variance of CV for the top prediction groups correctly predicted as label 1 (Q and S) and label 2 (R and T).
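The directional signal behind these predictions is visible in the encoding itself: swapping the gene order exactly transposes the NEPDF, so any asymmetry in the histogram is information a classifier can exploit, whereas a symmetric score such as PC assigns both orientations the same value. A small synthetic check (toy data and the "dropout" construction are our own, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.poisson(5.0, size=20000).astype(float)
# toy "a regulates b": b copies a in ~70% of cells and drops out otherwise
b = a * (rng.random(20000) < 0.7)

h_ab, _, _ = np.histogram2d(np.log10(a + 1), np.log10(b + 1), bins=32)
h_ba, _, _ = np.histogram2d(np.log10(b + 1), np.log10(a + 1), bins=32)

# swapping the gene order exactly transposes the histogram ...
assert np.array_equal(h_ba, h_ab.T)
# ... and this pair's histogram is visibly asymmetric (mass where b is
# off while a is on), which is exactly what a symmetric score discards
print(float(np.abs(h_ab - h_ab.T).sum()) > 0.0)  # True
```

This mirrors the Fig. 4 E–H observation in the text: cells where one gene is up while the other is not sit off the diagonal on one side only, and the side tells the direction.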

causality. Furthermore, the large number of cells in recent SC datasets enables us to accurately estimate the joint distribution for gene pairs. Here, we used tens of thousands of expression profiles from a relatively small number of experiments (a few hundreds), whereas bulk datasets contained much fewer profiles (the bulk data we use, which is from one of the largest experiments, has only ∼300 profiles). In addition, unlike most prior methods, CNNC is supervised, which allows the CNN to zoom in on subtle differences between positive and negative pairs. Supervision also helps fine-tune the scoring function based on the different applications. For example, different features may be important for analyzing TF–gene interactions when compared to inferring genes in the same pathway. Finally, the fact that the network can utilize the large volumes of scRNA-seq data without requiring explicit assumptions about the distribution of the input allows it to better overcome noise and other errors, reducing false negatives.

Analysis of several different interaction prediction and functional assignment tasks indicates that CNNC can improve upon prior, unsupervised methods. It can also be naturally extended to integrate complementary data including epigenetic and sequence information. Comparisons to more advanced methods for biological network reconstruction further highlight the advantages of CNNC. In addition, CNNC can be used as a preprocessing step, or as a component, in more advanced network reconstruction methods. Finally, CNNC is easy to use either with general data or with condition-specific data. For the former, users can download the data and implementation from the supporting website, provide a list of labels (positive and negative pairs for their system of interest), and retrieve the scores for all possible gene pairs. These in turn can be used for any downstream application including network analysis, functional gene assignment, etc.

While a number of prior NN methods were developed, by us and others, to analyze single-cell expression vectors (11, 34–38), these methods are very different from CNNC. First, their goal is usually to compare data across cells rather than to analyze gene relationships within cells as CNNC does. Second, unlike CNNC, these prior methods rely on a vector (or matrix for multiple cells) representation of expression data, which does not utilize the spatial


Fig. 5. Functional assignment and pathway reconstruction using CNNC. CNNC can be used as a component in downstream analysis algorithms including functional assignments and disease gene prediction. (A–C) Performance of CNNC on the cell cycle, immune system, and circadian rhythm gene prediction task. (D–F) Performance of CNNC on the asthma, COPD, and HNC disease gene prediction task. (G) Predicted expression pattern of a cell cycle∼cell cycle gene pair. (H) Predicted expression pattern of a cell cycle∼non-cell cycle gene pair. (I and J) The most significant GO terms of top 5% predicted asthma (D) and COPD (E) disease genes by CNNC and GBA, respectively.
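The functional-assignment training setup described above (a pair is labeled 1 only if both genes carry the annotation, and the 2/3–1/3 split is done over genes rather than pairs, so held-out genes never leak into training) can be sketched as follows. The gene IDs and the choice to anchor every pair on an annotated gene are illustrative assumptions, not the released pipeline:

```python
import itertools
import random

def gene_holdout_pairs(function_genes, all_genes, train_frac=2 / 3, seed=0):
    """Split the annotated genes 2/3-1/3 at the *gene* level, then emit
    (gene_a, gene_b, label) tuples: label 1 iff both genes are annotated.
    Genes held out for testing never appear in any training pair."""
    rng = random.Random(seed)
    pos = sorted(function_genes)
    rng.shuffle(pos)
    cut = int(len(pos) * train_frac)
    train_pos, test_pos = set(pos[:cut]), set(pos[cut:])

    def labeled_pairs(pos_subset, banned):
        # drop the other split's annotated genes, then keep pairs that
        # touch at least one annotated gene from this split
        keep = [g for g in all_genes if g not in banned]
        return [(x, y, int(x in pos_subset and y in pos_subset))
                for x, y in itertools.combinations(keep, 2)
                if x in pos_subset or y in pos_subset]

    return (labeled_pairs(train_pos, banned=test_pos),
            labeled_pairs(test_pos, banned=train_pos))

genes = ["cdc20", "plk1", "aurka", "gapdh", "actb", "tuba1a"]  # hypothetical IDs
train, test = gene_holdout_pairs({"cdc20", "plk1", "aurka"}, genes)
train_genes = {g for x, y, _ in train for g in (x, y)}
print(sorted({"cdc20", "plk1", "aurka"} - train_genes))  # the one held-out gene
```

Splitting at the gene level rather than the pair level is what makes the AUROC on the held-out set an honest estimate: otherwise the same gene could appear in both training and test pairs.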

analysis advantages of deep NN. CNNC uses such an idea by converting coexpression relationships to image histograms prior to their analysis. While this was applied here to gene expression data, such an approach may also be appropriate for other types of data, for example, financial data.

Since CNNC is supervised, it would indeed not generalize to cases where no labels are available, unlike some of the methods we compare to. On the other hand, when labels are available, which is common to several cases with genomics data (including all of the tasks we presented), CNNC is a much better choice than unsupervised methods.

CNNC is implemented in Python, and both data and an open-source version of the software are available from the supporting website (https://github.com/xiaoyeye/CNNC).

Methods

Dataset Sources and Preprocess Pipelines. We used genomics data of different types from several studies. The mouse scRNA-seq dataset collected by Alavi et al. (13) consists of 43,261 uniformly processed expression profiles from over 500 different scRNA-seq studies. For each profile, expression values are available for the same set of 20,463 genes. Among the cells, 4,126 are dendritic cells, and 6,283 are bone marrow-derived macrophage cells. Additionally, mESC data, which contain 2,717 cells, were downloaded from Gene Expression Omnibus with accession number GSE65525 (39). The 1.3 million mouse SC dataset was downloaded from ref. 40. The mouse bulk RNA-seq dataset was downloaded from the Mouse ENCODE project (25). That dataset includes 249 samples, and we only utilized genes that are present in the scRNA-seq dataset, leading to the same number of genes for both datasets. mESC DNase data were also downloaded from the Mouse ENCODE project (25) (ENCFF096WRW.bed). Mouse TF motif information is from the TRANSFAC database (41). PWM values were calculated with the Python package "Biopython" (42).

For the DNase and PWM analysis, we followed prior papers and defined the TSS region as 10 kb upstream to 1 kb downstream of the TSS for each gene (19, 20). For each TF and gene pair, using the Biopython package, we calculated the score between the TF motif sequence and both the "+/−" strand sequences at all possible positions along the TSS region of the gene, and then selected the maximum as the final PWM score. The maximum DNase peak signal in the TSS region was used as the scalar DNase value for each gene.

Labeled Data. mESC ChIP-seq peak region data were downloaded from the GTRD database, and we used peaks with threshold P value < 10^−400 for mESC cells and 10^−200 for macrophage cells and dendritic cells. If a TF a has at least one ChIP-seq peak signal in or partially in the TSS region of gene b, as defined above, we say that a regulates b. See SI Appendix, Table S3, for details. KEGG and Reactome pathway data were downloaded with the R package "graphite" (43). KEGG contains 290 pathways, and Reactome contains 1,581 pathways. For both, we only selected directed edges with either activation or inhibition edge types and filtered out cyclic gene pairs where genes regulate each other mutually (to allow for a unique label for each pair). In total, we have 3,057 proteins with outgoing directed edges in KEGG, and the total number of directed edges is 33,127. For Reactome, the corresponding numbers are 2,519 and 33,641.

Constructing the Input Histogram. Image size is very important, since a small image size will lead to data loss (very few expression levels), whereas large sizes can miss important relationships due to noise. Thus, the best image size depends on the particular task and the amount of data available. The best way to determine the optimal size is to treat it as another network hyperparameter and perform cross-validation with different sizes to select the optimal one. As we show in SI Appendix, Fig. S3, applying such a method to the KEGG prediction tasks identifies 32 × 32 as the optimal input size for the scRNA-seq data we used, and so this is the dimension used throughout this paper. For any gene pair (a, b), we first log transformed their expression and then uniformly divided the expression range of each gene into 32 bins. Next, we created the 32 × 32 histogram by assigning each sample to an entry in the matrix and counting the number of samples for each entry. Due to the very low expression levels, and even more so to dropouts in scRNA data, the value in the zero–zero position is always very large and often dominates the entire matrix. To overcome this, we added pseudocounts to all entries and applied another log transformation to each entry to get the final matrix. We combined bulk and scRNA-seq NEPDFs by concatenating them as a 32 × 64 matrix to achieve better performance. See SI Appendix, Fig. S3, for additional ways to integrate the different data types.

CNN for RPKM Data. We followed VGGnet (44) to build our CNN model (SI Appendix, Fig. S1). The CNN consists of stacked layers of x 3 × 3 convolutional filters (Eq. 1) (x is a power of 2, ranging from 32 to 64 to 128) and interleaved layers of 2 × 2 maxpooling (Eq. 2). We used the constructed input data as input to the CNN. Each convolution layer computes the following function:

Convolution(X)^k_{i,j} = \sum_{m=0}^{2} \sum_{n=0}^{2} W^k_{m,n} X_{i+m,j+n}, [1]

where X is the input from the previous layer, (i, j) is the output position, k is the convolutional filter index, and W^k is the filter matrix of size 3 × 3. In other words, each convolutional layer computes a weighted average of the prior layer values, where the weights are determined by training. The maxpooling layer computes the following function:

maxpooling(X)^k_{i,j} = max{X^k_{2i,2j}, X^k_{2i+1,2j}, X^k_{2i,2j+1}, X^k_{2i+1,2j+1}}, [2]

where X is the input, (i, j) is the output position, and k is the convolutional filter index. In other words, the layer selects one of the values of the previous layer to move forward.

Overall Structure of CNNC. The overall structure of the CNN is presented in SI Appendix, Fig. S1. The input layer of the CNN is either 32 × 32 (scRNA-seq) or 32 × 64 (scRNA-seq and bulk RNA-seq), as discussed above. In addition, the CNN contains 10 intermediate layers and a single 1- or 3-dimensional output layer. The 10 layers include both convolutional and maxpooling layers, and the exact dimensions of each layer are shown in SI Appendix, Fig. S1. Following ref. 45, we used the rectified linear activation function (ReLU, Eq. 3) across the whole network, except the final classification layers, where the "sigmoid" function (Eq. 4) was used for binary classification and the "softmax" function (Eq. 5) for multicategory classification. These functions are defined below:

ReLU(x) = x if x ≥ 0; 0 if x < 0, [3]

Sigmoid_θ(x) = 1 / (1 + e^{−θ^T x}), [4]

Softmax_θ(x) = [e^{θ_1^T x}, e^{θ_2^T x}, …, e^{θ_k^T x}] / \sum_{j=1}^{k} e^{θ_j^T x}. [5]

Training and Testing Strategy. We evaluated the CNN using 3-fold cross-validation across all tasks. In these, training and test datasets are strictly separated to avoid information leakage. See SI Appendix, Supplementary Notes, for details. For TF-binding prediction, we used binary classification, where a NEPDF matrix with label 1 was generated for each target b of TF a, and a matrix with label 0 was generated for a randomly selected nontarget gene r to balance the positive and negative datasets. Similar to prior work, targets are defined based on the presence of a ChIP-seq peak in their promoters (46). For the KEGG and Reactome pathway prediction tasks, we used 3 labels (to enable causality analysis): for each gene pair (a, b), we generated (a, b)'s and (b, a)'s NEPDF matrices with labels of 1 and 2, respectively. NEPDF matrices with a label of 0 were generated from random (r, s) gene pairs among the KEGG or Reactome gene sets. For each candidate gene pair, CNNC computes a probability vector [p0, p1, p2], where p0 represents the probability that genes a and b are not interacting, p1 encodes the case that gene a regulates gene b, and p2 is the probability that gene b regulates gene a. After training, we used p1(a, b) + p2(a, b) as the probability that a interacts with b and p2(a, b) − p2(b, a) as the pseudoprobability that b regulates a. For interaction prediction, only the probability vectors of (a, b) and (r, s) were used in the evaluations, while for causality prediction, only the probability vectors of (a, b) and (b, a) were used. To avoid overfitting, an early stopping strategy that monitors the validation loss is used. To evaluate the detailed performance for every TF and regulator in the KEGG and Reactome tasks, we calculated the AUROC and AUPRC for each of them and combined all values as the final result (SI Appendix, Supplementary Notes). We also performed 4-category tasks using this data separation and cross-validation strategy (see SI Appendix, Fig. S7, for details).

Integrating Expression, Sequence, and DNase Data. To integrate DNase and PWM data with the processed RNA-seq data, we first computed the maximum value for a PWM scan and DNase accessibility for each promoter region. We next generated a 2-value vector from these data for each pair and embedded it into a 128D vector using one fully connected layer containing 128 nodes. Next, these are concatenated with the processed expression data to form a 256D vector, which serves as input to a fully connected 128-node neural network layer with a binary classifier. See SI Appendix, Fig. S1, for details.

Functional Gene Assignment. To assign genes to a function (biological process or disease), we train a CNNC model for each function with known genes g. Similar to all other tasks, the input to each model is a pair of genes where the first is g and the second is either a positive (known) or negative (randomly selected from the whole unknown gene set) gene.

Known Genes for Functional Assignment Testing. We downloaded 855 (332, 103, 182, 59, 128) human cell cycle (immune system, rhythm, asthma, COPD, HNC) genes from GSEA ["Malacards" (31), a human disease website, https://www.malacards.org/]. We obtained mouse orthologs for all genes, resulting in 682, 278, 98, 147, 47, and 71 genes for them, respectively. For training, we used all genes for the diseases and a randomly selected set of unknown genes. See SI Appendix, Supplementary Notes, for details.

Data Availability. All data, scripts, and instructions required to run CNNC in Python can be found on our support website, https://github.com/xiaoyeye/CNNC. All other public data can be obtained by following the pipelines in Dataset Sources and Preprocess Pipelines and Labeled Data in Methods.

ACKNOWLEDGMENTS. This work was partially supported by NIH Grants 1R01GM122096 and OT2OD026682 (to Z.B.-J.) and a James S. McDonnell Foundation Scholars Award in Studying Complex Systems (to Z.B.-J.).

6 of 8 | www.pnas.org/cgi/doi/10.1073/pnas.1911536116 | Yuan and Bar-Joseph
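Eqs. 1 and 2 can be checked directly with a small numpy sketch: a single 3 × 3 filter applied as a weighted sum over every valid position (Eq. 1), followed by 2 × 2 maxpooling that keeps the largest value in each block (Eq. 2). The toy input and averaging filter below are illustrative, not values from the trained model:

```python
import numpy as np

def convolution(X, W):
    """Eq. 1: valid 3x3 convolution for one filter W over a 2D input X
    (a weighted sum of the prior layer at each output position)."""
    h, w = X.shape[0] - 2, X.shape[1] - 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = sum(W[m, n] * X[i + m, j + n]
                            for m in range(3) for n in range(3))
    return out

def maxpooling(X):
    """Eq. 2: 2x2 maxpooling -- keep the largest value of each block."""
    h, w = X.shape[0] // 2, X.shape[1] // 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = X[2 * i:2 * i + 2, 2 * j:2 * j + 2].max()
    return out

X = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "NEPDF"
W = np.full((3, 3), 1 / 9)                    # averaging filter
C = convolution(X, W)  # 4x4 feature map
P = maxpooling(C)      # 2x2 map after pooling
```

A VGG-style network, as in the paper, simply stacks such convolution layers (with learned filters and ReLU activations) and pooling layers before the final classification layer.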

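The scoring rules in Training and Testing Strategy — summing p1 and p2 for interaction, and taking the difference of the two directional p2 outputs as a pseudoprobability of direction — amount to a few lines. The probability vectors below are made-up stand-ins for real CNNC softmax outputs:

```python
import numpy as np

def interaction_score(p):
    """p = [p0, p1, p2] for a gene pair fed as (a, b);
    p1 + p2 is the probability that a and b interact at all."""
    return p[1] + p[2]

def causality_score(p_ab, p_ba):
    """Pseudoprobability of direction from the probability vectors of
    the two orderings, per the p2(a, b) - p2(b, a) rule."""
    return p_ab[2] - p_ba[2]

# Made-up outputs for the same pair fed as (a, b) and as (b, a).
p_ab = np.array([0.10, 0.20, 0.70])
p_ba = np.array([0.10, 0.65, 0.25])
print(interaction_score(p_ab))      # close to 0.9
print(causality_score(p_ab, p_ba))  # positive: direction b -> a favored
```

Feeding both orderings through the same trained model and differencing their p2 entries is what lets a 3-class network produce a signed causality call.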
1. E. Kuzmin et al., Systematic analysis of complex genetic interactions. Science 360, eaao1729 (2018).
2. T. Itzel et al., Translating bioinformatics in oncology: Guilt-by-profiling analysis and identification of KIF18B and CDCA3 as novel driver genes in carcinogenesis. Bioinformatics 31, 216–224 (2015).
3. S. M. Hill et al., Inferring causal molecular networks: Empirical assessment through a community-based effort. Nat. Methods 13, 310–318 (2016).
4. M. H. Maathuis, D. Colombo, M. Kalisch, P. Buhlmann, Predicting causal effects in large-scale systems from observational data. Nat. Methods 7, 247–248 (2010).
5. D. Marbach et al., Wisdom of crowds for robust gene network inference. Nat. Methods 9, 796–804 (2012).
6. L. Song, P. Langfelder, S. Horvath, Comparison of co-expression measures: Mutual information, correlation, and model based indices. BMC Bioinf. 13, 328 (2012).
7. P. Langfelder, S. Horvath, WGCNA: An R package for weighted correlation network analysis. BMC Bioinf. 9, 559 (2008).
8. Z. Wei, H. Li, A Markov random field model for network-based analysis of genomic data. Bioinformatics 23, 1537–1544 (2007).
9. V. A. Huynh-Thu, A. Irrthum, L. Wehenkel, P. Geurts, Inferring regulatory networks from expression data using tree-based methods. PLoS One 5, e12776 (2010).
10. T. E. Chan, M. P. H. Stumpf, A. C. Babtie, Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst. 5, 251–267.e3 (2017).
11. C. Lin, S. Jain, H. Kim, Z. Bar-Joseph, Using neural networks for reducing the dimensions of single-cell RNA-seq data. Nucleic Acids Res. 45, e156 (2017).
12. S. Freytag, J. Gagnon-Bartsch, T. P. Speed, M. Bahlo, Systematic noise degrades gene co-expression signals but can be corrected. BMC Bioinf. 16, 309 (2015).
13. A. Alavi, M. Ruffalo, A. Parvangada, Z. Huang, Z. Bar-Joseph, A web server for comparative analysis of single-cell RNA-seq data. Nat. Commun. 9, 4768 (2018).
14. L. Song, G. E. Crawford, DNase-seq: A high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb. Protoc. 2010, pdb.prot5384 (2010).
15. S. Sinha, On counting position weight matrix matches in a sequence, with application to discriminative motif finding. Bioinformatics 22, e454–e463 (2006).
16. M. Crow, J. Gillis, Co-expression in single-cell analysis: Saving grace or original sin? Trends Genet. 34, 823–831 (2018).
17. D. S. Johnson, A. Mortazavi, R. M. Myers, B. Wold, Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502 (2007).
18. I. Yevshin, R. Sharipov, T. Valeev, A. Kel, F. Kolpakov, GTRD: A database of transcription factor binding sites identified by ChIP-seq experiments. Nucleic Acids Res. 45, D61–D67 (2017).
19. M. H. Schulz et al., Reconstructing dynamic microRNA-regulated interaction networks. Proc. Natl. Acad. Sci. U.S.A. 110, 15686–15691 (2013).
20. M. H. Schulz et al., DREM 2.0: Improved reconstruction of dynamic regulatory networks from time-series expression data. BMC Syst. Biol. 6, 104 (2012).
21. A. Greenfield, A. Madar, H. Ostrer, R. Bonneau, DREAM4: Combining genetic and dynamic information to identify biological networks and dynamical models. PLoS One 5, e13397 (2010).
22. Y. X. Wang, M. S. Waterman, H. Huang, Gene coexpression measures in large heterogeneous samples using count statistics. Proc. Natl. Acad. Sci. U.S.A. 111, 16371–16376 (2014).
23. S. Krishnaswamy et al., Systems biology. Conditional density-based analysis of T cell signaling in single-cell data. Science 346, 1250689 (2014).
24. A. Khan et al., JASPAR 2018: Update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D260–D266 (2018).
25. F. Yue et al., A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).
26. A. Gitter, M. Carmi, N. Barkai, Z. Bar-Joseph, Linking the signaling cascades and dynamic regulatory networks controlling stress responses. Genome Res. 23, 365–376 (2013).
27. M. Kanehisa, M. Furumichi, M. Tanabe, Y. Sato, K. Morishima, KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2017).
28. A. Fabregat et al., The reactome pathway knowledgebase. Nucleic Acids Res. 46, D649–D655 (2018).
29. J. M. Schmiedel et al., Gene expression. MicroRNA control of protein expression noise. Science 348, 128–132 (2015).
30. A. Subramanian et al., Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550 (2005).
31. N. Rappaport et al., MalaCards: An amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res. 45, D877–D887 (2017).
32. S. Oliver, Guilt-by-association goes global. Nature 403, 601–603 (2000).
33. S. Zhang, R. Yang, Y. Zheng, The effect of siRNA-mediated lymphocyte-specific protein tyrosine kinase (Lck) inhibition on pulmonary inflammation in a mouse model of asthma. Int. J. Clin. Exp. Med. 8, 15146–15154 (2015).
34. M. Amodio et al., Exploring single-cell data with deep multitasking neural networks. Nat. Methods, 10.1038/s41592-019-0576-7 (2019).
35. J. Ding, A. Condon, S. P. Shah, Interpretable dimensionality reduction of single cell data with deep generative models. Nat. Commun. 9, 2002 (2018).
36. R. Lopez, J. Regier, M. B. Cole, M. I. Jordan, N. Yosef, Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
37. G. Eraslan, L. M. Simon, M. Mircea, N. S. Mueller, F. J. Theis, Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
38. E. Arvaniti, M. Claassen, Sensitive detection of rare disease-associated cell subsets via representation learning. Nat. Commun. 8, 14825 (2017).
39. A. M. Klein et al., Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
40. 10x Genomics, 1.3 Million Brain Cells from E18 Mice (2018). https://support.10xgenomics.com/single-cell-gene-expression/datasets. Accessed 8 May 2019.
41. E. Wingender, P. Dietze, H. Karas, R. Knuppel, TRANSFAC: A database on transcription factors and their DNA binding sites. Nucleic Acids Res. 24, 238–241 (1996).
42. P. J. Cock et al., Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
43. G. Sales, E. Calura, D. Cavalieri, C. Romualdi, Graphite—a bioconductor package to convert pathway topology to gene network. BMC Bioinf. 13, 20 (2012).
44. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (10 April 2015).
45. X. Glorot, A. Bordes, Y. Bengio, "Deep sparse rectifier neural networks" in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR, G. Gordon, D. Dunson, M. Dudík, Eds. (2011), vol. 15, pp. 315–323.
46. J. Ernst, H. L. Plasterer, I. Simon, Z. Bar-Joseph, Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Res. 20, 526–536 (2010).
