Network Inference in Systems Biology: Recent Developments, Challenges, and Applications
Total Page:16
File Type:pdf, Size:1020Kb
Network Inference in Systems Biology: Recent Developments, Challenges, and Applications Michael M. Saint-Antoine1 and Abhyudai Singh2 Abstract One of the most interesting, difficult, and potentially useful topics in compu- tational biology is the inference of gene regulatory networks (GRNs) from ex- pression data. Although researchers have been working on this topic for more than a decade and much progress has been made, it remains an unsolved prob- lem and even the most sophisticated inference algorithms are far from perfect. In this paper, we review the latest developments in network inference, includ- ing state-of-the-art algorithms like PIDC, Phixer, and more. We also discuss unsolved computational challenges, including the optimal combination of algo- rithms, integration of multiple data sources, and pseudo-temporal ordering of static expression data. Lastly, we discuss some exciting applications of net- work inference in cancer research, and provide a list of useful software tools for researchers hoping to conduct their own network inference analyses. 1. Introduction Networks of gene interactions, in which genes activate and repress the tran- scription of other genes, are responsible for much of the complexity of cellular life [1,2], and their malfunction can be disastrous for an organism [3]. So, un- arXiv:1911.04046v1 [q-bio.MN] 11 Nov 2019 derstanding these networks has long been a goal of systems biology. Discovering gene interactions experimentally can be very difficult, and given the approxi- 1Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware 19716, USA 2Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19716, USA Preprint submitted to Current Opinion in Biotechnology November 12, 2019 mately 20,000 genes in the human genome, it is not feasible to do an experiment for every possible pair to check for an interaction. However, with traditional DNA microarray [4], or next-generation sequencing technologies, most notably (single-cell or bulk) RNA-sequencing (RNA-Seq) [5,6], one can get a quantita- tive peek into the transcriptomic profile of an individual cell, or a population of cells. Unfortunately, these measurement techniques involve killing the cell, so each cell can provide only one timepoint of data. A significant disadvantage of this static data is that it will not allow us to detect self-edges (in which a gene regulates itself). In some cases, it may be possible to construct pseudo-temporal data from static single-cell measurements. For example, a researcher may administer a stimulus to a set of cells, and then perform an RNA-Seq experiment on some cells after one hour, then on some cells after two hours, and so on, in order to collect data on the post-stimulus gene expression dynamics. In other cases, it may be possible to infer the temporal ordering of a set of gene expression measurements using computational methods (more on that in the Computational Challenges section). In this paper, the input gene expression data will be assumed to be static single cell data (meaning that each measured cell provides us with only one timepoint of data), unless otherwise noted. 2. Problem Formulation For convenience, it can be useful to abstract the problem of network inference into a graph theory framework. Figure 1 shows the workflow of a hypotheti- cal network inference scheme where, for example, single-cell RNA-Seq is used to measure the expression of a large number of genes in many individual cells. Given the large number of genes measured in these experiments it maybe pru- dent to select a smaller a subset of genes with high biological variance for further analysis. Such genes are typically identified by first plotting the Coefficient of Variation or CV (standard deviation over mean) in expression levels across cells with respect to the mean expression levels for all genes, and then selecting genes 2 Measure gene expression in single cells Biological Variation Variation of Find genes with biological variance Coefficient Mean Expression Level Technical Variation Infer gene regulatory networks Figure 1: A typical workflow to identity genes with high biological variance for network map- ping studies. RNA-Seq typically quantifies the expression of thousands of genes across individ- ual cells or environmental/genetic conditions. The Coefficient of Variation or CV (Standard deviation over mean) in the expression level across cells or conditions is plotted as a function of the mean expression level for all genes. Fitting a trend line through this data provides the expected CV of a gene at given mean level, and this trend line is then used to select genes that have significantly higher CVs (representing high biological variance) while filtering out genes with low CVs that could result from technical noise. 3 with significantly higher CVs than what is expected for a gene at the same mean level. Consider N genes that are identified for network analysis and let their expression levels be represented by random variables fX1;X2; :::; XN g. Each variable corresponds to a node in the GRN, and each edge Xi ! Xj represents a regulatory relationship between Xi and Xj. We can think of the true biological network as a real, unknown set of interactions, and our goal is to approximate it by constructing a set of weighted interactions, where each weight corresponds to our confidence that an edge exists in the true network. A network prediction can be either directed, meaning that a prediction is made about the direction of causality for each interaction, or undirected, mean- ing that no such prediction is made. Directed networks are preferable, as they convey more information, but are also more difficult to construct. A network can also be signed, meaning that activation edges are labeled with a \+" and repression edges are labeled with a \-", or unsigned, meaning the edges are not labeled as positive or negative. Signed networks are important for drawing bio- logical insights. However, unsigned networks are often used for theoretical work on network inference, since the difficult problem is determining which edges represent true interactions. Once an interaction is known, determining its sign is very easy and can be found with a simple covariance. Inferred networks have been shown to contain several motifs (smaller modules that occur more often in the network than expected by random chance), and these motifs are illustrated in Figure 2. The output of a network inference algorithm is a set of weighted edge pre- dictions, where each edge-weight corresponds to the confidence that a real in- teraction exists between two genes. So, the accuracy of these algorithms can be evaluated by running them on a \gold standard" dataset for which the true network structure is already known. The algorithms' ability to recover the true interactions can then be scored using receiver operating characteristic (ROC) curves or precision-recall (PR) curves. Some researchers have suggested that PR curves are the superior metric [7], although both metrics are commonly used. Fortunately for network inference researchers, several gold standard datasets 4 1 2 3 4 5 6 7 8 Figure 2: Illustration of some common GRN motifs. 1. Modularity (clustering of nodes), 2. Self-edge, 3. Fan-out, 4. Fan-in, 5. Feed-back loop, 6. Feed-forward loop, 7. Cascade, 8. Cascade false positive error (a feed-forward loop is incorrectly predicted). In the \Algorithms" section, we will discuss the strengths and weaknesses of various algorithm classes when it comes to detecting different network motifs. 5 are publicly available in the form of DREAM Challenges [8]. These are com- putational biology competitions for which researchers are invited to design new algorithms to make predictions from data, and many involve predicting GRN structures from expression data. Once each competition is over, the datasets and true network structures are released to the public, and are commonly used for algorithm evaluation [9,10,11]. 3. Algorithms Several broad classes of algorithms are used for network inference. An excel- lent introductory overview of these can be found in Huynh-Thu and Sanguinetti 2019 [12]. A thorough evaluation of the different types of algorithms can be found in Marbach et al. 2012 [9], in which many different algorithms were benchmarked on the DREAM5 gold standard networks, with varying results. To put it simply, there is no overall \best" class of algorithms. Rather, each has its own advantages and disadvantages, so choosing the best one depends on the context. Here, we will give a brief description of each class, but will mainly focus on recent developments and unsolved challenges. Figure 3 provides a summary of some of the properties of different algorithm types. 3.1. Correlation One of the most basic methods of network inference is to simply compute the correlation for each pair of genes. This method is very simplistic, but is also fast and scalable for large datasets. It can be especially useful in cases where researchers want a general idea of which genes are related, without caring much about causal direction or distinguishing between direct and indirect regulation. For this reason, the resulting network prediction may sometimes be referred to as a \gene co-expression network" rather than a GRN. In general, correlation- based algorithms tend to perform relatively well on detecting feed-forward loops, fan-ins, and fan-outs, but also have an increased false positive rate for cascades [9] (please see Figure 2 for an explanation of these motifs). 6 Algorithm Class Temporal Data Directionality Advantages Disadvantages Examples Required? Correlation No Undirected • Fast, scalable • Possibly WGCNA [13] • Detection of over-simplistic PGCNA [14] feed-forward loops, • False positives for fan-ins, and fan-outs cascades Regression No Directed • Good overall accuracy • Bad detection of TIGRESS [15], feed-forward loops, GENIE3 [16], fan-ins, and fan-outs bLARS [17] Bayesian - No Directed • Performance on small • Performance on large [19,20] Simple networks networks.