Network Inference in Systems Biology: Recent Developments, Challenges, and Applications

Network Inference in Systems Biology: Recent Developments, Challenges, and Applications Michael M. Saint-Antoine1 and Abhyudai Singh2 Abstract One of the most interesting, difficult, and potentially useful topics in computational biology is the inference of gene regulatory networks (GRNs) from expression data. Although researchers have been working on this topic for more than a decade and much progress has been made, it remains an unsolved problem and even the most sophisticated inference algorithms are far from perfect. In this paper, we review the latest developments in network inference, including state-of-the-art algorithms like PIDC, Phixer, and more. We also discuss unsolved computational challenges, including the optimal combination of algorithms, integration of multiple data sources, and pseudo-temporal ordering of static expression data. Lastly, we discuss some exciting applications of network inference in cancer research, and provide a list of useful software tools for researchers hoping to conduct their own network inference analyses. 1. Introduction Networks of gene interactions, in which genes activate and repress the tran- scription of other genes, are responsible for much of the complexity of cellular life [1,2], and their malfunction can be disastrous for an organism [3]. So, un- arXiv:1911.04046v1 [q-bio.MN] 11 Nov 2019 derstanding these networks has long been a goal of systems biology. Discovering gene interactions experimentally can be very difficult, and given the approxi- 1Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware 19716, USA 2Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19716, USA Preprint submitted to Current Opinion in Biotechnology November 12, 2019 mately 20,000 genes in the human genome, it is not feasible to do an experiment for every possible pair to check for an interaction. However, with traditional DNA microarray [4], or next-generation sequencing technologies, most notably (single-cell or bulk) RNA-sequencing (RNA-Seq) [5,6], one can get a quantita- tive peek into the transcriptomic profile of an individual cell, or a population of cells. Unfortunately, these measurement techniques involve killing the cell, so each cell can provide only one timepoint of data. A significant disadvantage of this static data is that it will not allow us to detect self-edges (in which a gene regulates itself). In some cases, it may be possible to construct pseudo-temporal data from static single-cell measurements. For example, a researcher may administer a stimulus to a set of cells, and then perform an RNA-Seq experiment on some cells after one hour, then on some cells after two hours, and so on, in order to collect data on the post-stimulus gene expression dynamics. In other cases, it may be possible to infer the temporal ordering of a set of gene expression measurements using computational methods (more on that in the Computational Challenges section). In this paper, the input gene expression data will be assumed to be static single cell data (meaning that each measured cell provides us with only one timepoint of data), unless otherwise noted. 2. Problem Formulation For convenience, it can be useful to abstract the problem of network inference into a graph theory framework. Figure 1 shows the workflow of a hypotheti- cal network inference scheme where, for example, single-cell RNA-Seq is used to measure the expression of a large number of genes in many individual cells. Given the large number of genes measured in these experiments it maybe pru- dent to select a smaller a subset of genes with high biological variance for further analysis. Such genes are typically identified by first plotting the Coefficient of Variation or CV (standard deviation over mean) in expression levels across cells with respect to the mean expression levels for all genes, and then selecting genes 2 Measure gene expression in single cells Biological Variation Variation of Find genes with biological variance Coefficient Mean Expression Level Technical Variation Infer gene regulatory networks Figure 1: A typical workflow to identity genes with high biological variance for network map- ping studies. RNA-Seq typically quantifies the expression of thousands of genes across individual cells or environmental/genetic conditions. The Coefficient of Variation or CV (Standard deviation over mean) in the expression level across cells or conditions is plotted as a function of the mean expression level for all genes. Fitting a trend line through this data provides the expected CV of a gene at given mean level, and this trend line is then used to select genes that have significantly higher CVs (representing high biological variance) while filtering out genes with low CVs that could result from technical noise. 3 with significantly higher CVs than what is expected for a gene at the same mean level. Consider N genes that are identified for network analysis and let their expression levels be represented by random variables fX1;X2; :::; XN g. Each variable corresponds to a node in the GRN, and each edge Xi ! Xj represents a regulatory relationship between Xi and Xj. We can think of the true biological network as a real, unknown set of interactions, and our goal is to approximate it by constructing a set of weighted interactions, where each weight corresponds to our confidence that an edge exists in the true network. A network prediction can be either directed, meaning that a prediction is made about the direction of causality for each interaction, or undirected, meaning that no such prediction is made. Directed networks are preferable, as they convey more information, but are also more difficult to construct. A network can also be signed, meaning that activation edges are labeled with a \+" and repression edges are labeled with a \-", or unsigned, meaning the edges are not labeled as positive or negative. Signed networks are important for drawing biological insights. However, unsigned networks are often used for theoretical work on network inference, since the difficult problem is determining which edges represent true interactions. Once an interaction is known, determining its sign is very easy and can be found with a simple covariance. Inferred networks have been shown to contain several motifs (smaller modules that occur more often in the network than expected by random chance), and these motifs are illustrated in Figure 2. The output of a network inference algorithm is a set of weighted edge predictions, where each edge-weight corresponds to the confidence that a real interaction exists between two genes. So, the accuracy of these algorithms can be evaluated by running them on a \gold standard" dataset for which the true network structure is already known. The algorithms' ability to recover the true interactions can then be scored using receiver operating characteristic (ROC) curves or precision-recall (PR) curves. Some researchers have suggested that PR curves are the superior metric [7], although both metrics are commonly used. Fortunately for network inference researchers, several gold standard datasets 4 1 2 3 4 5 6 7 8 Figure 2: Illustration of some common GRN motifs. 1. Modularity (clustering of nodes), 2. Self-edge, 3. Fan-out, 4. Fan-in, 5. Feed-back loop, 6. Feed-forward loop, 7. Cascade, 8. Cascade false positive error (a feed-forward loop is incorrectly predicted). In the \Algorithms" section, we will discuss the strengths and weaknesses of various algorithm classes when it comes to detecting different network motifs. 5 are publicly available in the form of DREAM Challenges [8]. These are computational biology competitions for which researchers are invited to design new algorithms to make predictions from data, and many involve predicting GRN structures from expression data. Once each competition is over, the datasets and true network structures are released to the public, and are commonly used for algorithm evaluation [9,10,11]. 3. Algorithms Several broad classes of algorithms are used for network inference. An excel- lent introductory overview of these can be found in Huynh-Thu and Sanguinetti 2019 [12]. A thorough evaluation of the different types of algorithms can be found in Marbach et al. 2012 [9], in which many different algorithms were benchmarked on the DREAM5 gold standard networks, with varying results. To put it simply, there is no overall \best" class of algorithms. Rather, each has its own advantages and disadvantages, so choosing the best one depends on the context. Here, we will give a brief description of each class, but will mainly focus on recent developments and unsolved challenges. Figure 3 provides a summary of some of the properties of different algorithm types. 3.1. Correlation One of the most basic methods of network inference is to simply compute the correlation for each pair of genes. This method is very simplistic, but is also fast and scalable for large datasets. It can be especially useful in cases where researchers want a general idea of which genes are related, without caring much about causal direction or distinguishing between direct and indirect regulation. For this reason, the resulting network prediction may sometimes be referred to as a \gene co-expression network" rather than a GRN. In general, correlation- based algorithms tend to perform relatively well on detecting feed-forward loops, fan-ins, and fan-outs, but also have an increased false positive rate for cascades [9] (please see Figure 2 for an explanation of these motifs). 6 Algorithm Class Temporal Data Directionality Advantages Disadvantages Examples Required? Correlation No Undirected • Fast, scalable • Possibly WGCNA [13] • Detection of over-simplistic PGCNA [14] feed-forward loops, • False positives for fan-ins, and fan-outs cascades Regression No Directed • Good overall accuracy • Bad detection of TIGRESS [15], feed-forward loops, GENIE3 [16], fan-ins, and fan-outs bLARS [17] Bayesian - No Directed • Performance on small • Performance on large [19,20] Simple networks networks.

Network Inference in Systems Biology: Recent Developments, Challenges, and Applications

Dynamical Gene Regulatory Networks Are Tuned by Transcriptional Autoregulation with Microrna Feedback Thomas G

Systematic Comparison of Sea Urchin and Sea Star Developmental Gene Regulatory Networks Explains How Novelty Is Incorporated in Early Development

An Automatic Design Framework of Swarm Pattern Formation Based on Multi-Objective Genetic Programming

Temporal Flexibility of Gene Regulatory Network Underlies a Novel Wing Pattern in Flies

Gene Regulatory Networks

A Computational Model Inspired by Gene Regulatory Networks

Gene Regulatory Network Reconstruction Using Bayesian

'Non-Genetic' Inheritance: Insights from Molecular-Evolutionary Crosstalk

Algorithmic Information Theory Applications in Bright Field Microscopy and Epithelial Pattern Formation

Computational Representation of Developmental Genetic Regulatory Networks

The Sonic Hedgehog Signaling System As a Bistable Genetic Switch

Analysis on Gene Modular Network Reveals Morphogen-Directed Development Robustness in Drosophila