Network Inference in : Recent Developments, Challenges, and Applications

Michael M. Saint-Antoine1 and Abhyudai Singh2

Abstract

One of the most interesting, difficult, and potentially useful topics in compu- tational biology is the inference of regulatory networks (GRNs) from ex- pression data. Although researchers have been working on this topic for more than a decade and much progress has been made, it remains an unsolved prob- lem and even the most sophisticated inference algorithms are far from perfect. In this paper, we review the latest developments in network inference, includ- ing state-of-the-art algorithms like PIDC, Phixer, and more. We also discuss unsolved computational challenges, including the optimal combination of algo- rithms, integration of multiple data sources, and pseudo-temporal ordering of static expression data. Lastly, we discuss some exciting applications of net- work inference in research, and provide a list of useful software tools for researchers hoping to conduct their own network inference analyses.

1. Introduction

Networks of gene interactions, in which activate and repress the tran- scription of other genes, are responsible for much of the complexity of cellular [1,2], and their malfunction can be disastrous for an organism [3]. So, un- arXiv:1911.04046v1 [q-bio.MN] 11 Nov 2019 derstanding these networks has long been a goal of systems biology. Discovering gene interactions experimentally can be very difficult, and given the approxi-

1Center for and Computational Biology, University of Delaware, Newark, Delaware 19716, USA 2Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19716, USA

Preprint submitted to Current Opinion in Biotechnology November 12, 2019 mately 20,000 genes in the human genome, it is not feasible to do an experiment for every possible pair to check for an interaction. However, with traditional DNA microarray [4], or next-generation sequencing technologies, most notably (single-cell or bulk) RNA-sequencing (RNA-Seq) [5,6], one can get a quantita- tive peek into the transcriptomic profile of an individual cell, or a population of cells. Unfortunately, these measurement techniques involve killing the cell, so each cell can provide only one timepoint of data. A significant disadvantage of this static data is that it will not allow us to detect self-edges (in which a gene regulates itself). In some cases, it may be possible to construct pseudo-temporal data from static single-cell measurements. For example, a researcher may administer a stimulus to a set of cells, and then perform an RNA-Seq experiment on some cells after one hour, then on some cells after two hours, and so on, in order to collect data on the post-stimulus dynamics. In other cases, it may be possible to infer the temporal ordering of a set of gene expression measurements using computational methods (more on that in the Computational Challenges section). In this paper, the input gene expression data will be assumed to be static single cell data (meaning that each measured cell provides us with only one timepoint of data), unless otherwise noted.

2. Problem Formulation

For convenience, it can be useful to abstract the problem of network inference into a graph theory framework. Figure 1 shows the workflow of a hypotheti- cal network inference scheme where, for example, single-cell RNA-Seq is used to measure the expression of a large number of genes in many individual cells. Given the large number of genes measured in these experiments it maybe pru- dent to select a smaller a subset of genes with high biological variance for further analysis. Such genes are typically identified by first plotting the Coefficient of Variation or CV (standard deviation over mean) in expression levels across cells with respect to the mean expression levels for all genes, and then selecting genes

2 Measure gene expression in single cells

Biological Variation Variation

of Find genes with biological variance Coefficient Mean Expression Level Technical Variation

Infer gene regulatory networks

Figure 1: A typical workflow to identity genes with high biological variance for network map- ping studies. RNA-Seq typically quantifies the expression of thousands of genes across individ- ual cells or environmental/genetic conditions. The Coefficient of Variation or CV (Standard deviation over mean) in the expression level across cells or conditions is plotted as a function of the mean expression level for all genes. Fitting a trend line through this data provides the expected CV of a gene at given mean level, and this trend line is then used to select genes that have significantly higher CVs (representing high biological variance) while filtering out genes with low CVs that could result from technical noise.

3 with significantly higher CVs than what is expected for a gene at the same mean level. Consider N genes that are identified for network analysis and let their expression levels be represented by random variables {X1,X2, ..., XN }. Each variable corresponds to a node in the GRN, and each edge Xi → Xj represents a regulatory relationship between Xi and Xj. We can think of the true as a real, unknown set of interactions, and our goal is to approximate it by constructing a set of weighted interactions, where each weight corresponds to our confidence that an edge exists in the true network. A network prediction can be either directed, meaning that a prediction is made about the direction of causality for each interaction, or undirected, mean- ing that no such prediction is made. Directed networks are preferable, as they convey more information, but are also more difficult to construct. A network can also be signed, meaning that activation edges are labeled with a “+” and repression edges are labeled with a “-”, or unsigned, meaning the edges are not labeled as positive or negative. Signed networks are important for drawing bio- logical insights. However, unsigned networks are often used for theoretical work on network inference, since the difficult problem is determining which edges represent true interactions. Once an interaction is known, determining its sign is very easy and can be found with a simple covariance. Inferred networks have been shown to contain several motifs (smaller modules that occur more often in the network than expected by random chance), and these motifs are illustrated in Figure 2. The output of a network inference algorithm is a set of weighted edge pre- dictions, where each edge-weight corresponds to the confidence that a real in- teraction exists between two genes. So, the accuracy of these algorithms can be evaluated by running them on a “gold standard” dataset for which the true network structure is already known. The algorithms’ ability to recover the true interactions can then be scored using receiver operating characteristic (ROC) curves or precision-recall (PR) curves. Some researchers have suggested that PR curves are the superior metric [7], although both metrics are commonly used. Fortunately for network inference researchers, several gold standard datasets

4 1 2

3 4

5 6

7 8

Figure 2: Illustration of some common GRN motifs. 1. (clustering of nodes), 2. Self-edge, 3. Fan-out, 4. Fan-in, 5. Feed-back loop, 6. Feed-forward loop, 7. Cascade, 8. Cascade false positive error (a feed-forward loop is incorrectly predicted). In the “Algorithms” section, we will discuss the strengths and weaknesses of various algorithm classes when it comes to detecting different network motifs.

5 are publicly available in the form of DREAM Challenges [8]. These are com- putational biology competitions for which researchers are invited to design new algorithms to make predictions from data, and many involve predicting GRN structures from expression data. Once each competition is over, the datasets and true network structures are released to the public, and are commonly used for algorithm evaluation [9,10,11].

3. Algorithms

Several broad classes of algorithms are used for network inference. An excel- lent introductory overview of these can be found in Huynh-Thu and Sanguinetti 2019 [12]. A thorough evaluation of the different types of algorithms can be found in Marbach et al. 2012 [9], in which many different algorithms were benchmarked on the DREAM5 gold standard networks, with varying results. To put it simply, there is no overall “best” class of algorithms. Rather, each has its own advantages and disadvantages, so choosing the best one depends on the context. Here, we will give a brief description of each class, but will mainly focus on recent developments and unsolved challenges. Figure 3 provides a summary of some of the properties of different algorithm types.

3.1. Correlation

One of the most basic methods of network inference is to simply compute the correlation for each pair of genes. This method is very simplistic, but is also fast and scalable for large datasets. It can be especially useful in cases where researchers want a general idea of which genes are related, without caring much about causal direction or distinguishing between direct and indirect regulation. For this reason, the resulting network prediction may sometimes be referred to as a “gene co-expression network” rather than a GRN. In general, correlation- based algorithms tend to perform relatively well on detecting feed-forward loops, fan-ins, and fan-outs, but also have an increased false positive rate for cascades [9] (please see Figure 2 for an explanation of these motifs).

6 Algorithm Class Temporal Data Directionality Advantages Disadvantages Examples Required?

Correlation No Undirected • Fast, scalable • Possibly WGCNA [13] • Detection of over-simplistic PGCNA [14] feed-forward loops, • False positives for fan-ins, and fan-outs cascades Regression No Directed • Good overall accuracy • Bad detection of TIGRESS [15], feed-forward loops, GENIE3 [16], fan-ins, and fan-outs bLARS [17]

Bayesian - No Directed • Performance on small • Performance on large [19,20] Simple networks networks. • Inability to detect cycles

Bayesian - Yes Directed • Performance on small • Performance on large [21] Dynamic networks networks. • Detection of cycles and self-edges Information No Undirected (at • Detection of • False positives for ARACNE [25], Theory least in feed-forward loops, cascades CLR [26], simplest fan-ins, and fan-outs MRNET [27], form) • Similar to correlation PIDC [28] methods, with better accuracy Phixer No Directed • Parsimonious output • Possible loss of [31] due to pruning step. overal accuracy due to pruning step (this can be removed if the user chooses)

Figure 3: Summary of the properties, advantages, and disadvantages of different types of algorithms.

7 Though simplistic, correlation networks can yield powerful insights when the proper analytical tools are applied. For example, Weighted Gene Co- expression Network Analysis (WGCNA) [13], based on correlation, was an early and widely-used method in gene network analysis. Furthermore, improvements to correlation-based network inference are still being made. Care et al. 2019 [14] describes how to deal with the problem of over-connectivity in correlation networks, effectively reducing the number of edges to get to the point where the network is sparse enough for a cluster analysis to be performed.

3.2. Regression

Another common method of network inference is regression analysis. In its simplest form, this method involves solving the following linear regression equation to predict Xj from Xi:

Xj = β0 + β1Xi +  (1)

Here, β0 is the intercept, β1 is the slope, and  is the random error term.

So, since β1 represents the relationship between Xj and Xi, we can assign it as the weight of the edge Xi → Xj. This is only the simplest version of regression analysis, and many regression-based algorithms use more advanced methods. Two notable, highly-regarded examples are TIGRESS [15] and GENIE3 [16]. Regression methods are more computationally expensive than correlation, but also provide the advantage of predicting causal direction. In general, regression- based methods perform well compared to other methods overall, but perform poorly on the specific network motifs of feed-forward loops, fan-ins, and fan-outs [9]. It is also important to note that regression methods that resample data (by bootstrapping, for example) typically outperform regression methods that do not [9]. One limitation of regression-based methods, at least in their simplest form, is that they typically assume linear relationships between genes, and may fail to detect non-linear regulatory interactions. However, some progress has been made on this matter: bLARS [17], a modification of the Least Angle Re-

8 gression method [18], is a regression-based algorithm designed to detect both linear and pre-defined non-linear interactions.

3.3. Bayesian Methods

Bayesian methods have long been used in gene network inference [19,20]. In these methods, an interaction between genes is represented as a conditional probability. For two genes with expression levels Xi and Xj, the conditional probability P (Xj|Xi) corresponds to the edge Xi → Xj, and Xi is said to be the parent of Xj. The graphical representation of a set of these conditional probabilities is called a Bayesian network. Given a gene expression dataset, a maximum likelihood estimation can be applied to determine which Bayesian network structure has the highest posterior probability. In other words, the goal of the maximum likelihood estimation is to determine which network structure is the most likely to have produced the observed data. An advantage of Bayesian methods is the ease with which prior knowledge of interactions can be integrated [12]. Unfortunately, Bayesian methods are typically very computationally expensive. In general, they tend to perform poorly on large datasets compared to other methods, and may be better suited to small networks for which their heuristic searching method can more easily converge on the true optimal network structure [9]. Attempting to improve their scalability to large datasets is a current computational challenge. Another significant disadvantage of Bayesian methods, at least in their sim- plest form, is their inability to detect cycles. In other words, the conditional probabilities only flow one way, so they will be unable to detect something like a loop, which is a common motif of GRNs. However, a subtype of Bayesian methods has been developed in an attempt to correct for this prob- lem: Dynamic Bayesian Networks (DBNs). While a node in simple Bayesian methods corresponds to a gene, a node in DBNs corresponds to a gene at a specific timepoint. While this solves the problem of detecting cycles (and allows us to detect self-edges), it also presents some new challenges. First of all, either temporal or pseudo-temporal data is required. Obtaining temporal data may

9 not always be feasible, and pseudo-temporal ordering is a complex computa- tional challenge in itself (this will be discussed in the next section). Also, DBNs are more computationally expensive than the already expensive simple Bayesian methods. However, despite these challenges, DBNs are still fairly widely used [21,22,23].

3.4. Information Theory

Information theory, first developed by Claude Shannon in 1948, provides a theoretical framework for quantifying information and studying its properties. One fundamental measure of information theory is entropy, which gives a nu- merical measure of a random variable’s “uncertainty”. For a discrete random variable X, the entropy is defined as:

X H(X) = − p(x)log(p(x)) (2) x∈X where p(x) is the probability distribution of the random variable. If the ex- pression data is continuous, then it must first be discretized through a binning process before being used in this formula, and development of an optimal binning strategy is itself an active area of research [29,30]. For two random variables Xi and Xj, one can quantify their “” as the amount by which the entropy of their joint distribution is reduced compared to their combined individual entropies:

X X p(xi, xj) I(Xi,Xj) = p(xi, xj)log (3) p(xi)p(xj) xi∈Xi xj ∈Xj where p(xi, xj) is the joint probability distribution of Xi and Xj. In network inference, mutual information can serve as a measure of depen- dence between genes. Since it is a symmetric measure, I(Xi,Xj) is assigned as the weights of both edges Xi → Xj and Xj → Xi (this can also be represented as the undirected edge Xi ↔ Xj). Despite this disadvantage of not predicting causal direction, mutual information has the advantages of being able to de- tect non-linear interactions (unlike simple correlation measures) and of being

10 scalable to whole-genome networks. In general, information theory-based algo- rithms tend to perform better than correlation methods, but experience similar biases: increased detection of feed-forward loops, fan-ins, and fan-outs, but also an increased rate of false positives for cascades [9]. One of the most famous network inference methods based on information theory is ARACNE [25], introduced by Margolin et al. in 2006. In this method, the mutual information is estimated for each pair of genes, and then assigned as their edge weight. Then, all edges with a weight below a certain threshold of statistical significance are eliminated. Finally, and perhaps most importantly, the remaining edges are pruned according to the Data Processing Inequality.

For each triplet of nodes Xi, Xj, and Xk, the following inequality is checked:

I(Xi,Xk) ≤ min{I(Xi,Xj),I(Xj,Xk)} (4)

If this statement is true, then the edge Xi ↔ Xk is eliminated. The goal of this pruning step is to yield a sparse network with minimal redundancy and maximal explanatory power. This is only a brief overview of ARACNE, and for a more in-depth explanation we refer readers to the original paper [25]. Other methods based on information theory and similar to ARACNE include Butte and Kohane’s method [24], CLR [26], and MRNET [27]. Several newer algorithms also make use of concepts from information theory. An exciting new method is Partial Information Decomposition and Context (PIDC) [28]. PIDC computes informational relationships in a triplet-wise, rather than pair-wise, manner, so as to determine the proportion of the mutual information between two genes that cannot be explained in terms of any other third gene, thereby eliminating indirect and redundant relationships in the predicted network. For a detailed explanation of how this is calculated, please see the original paper [28].

3.5. Phixer Another new algorithm that is different from but inspired by information theory is Phixer, introduced in Singh et al. 2018 [31]. Phixer has the advantage

11 of producing an output graph that is both directed (unlike most information theory methods) and can contain cycles (unlike most simple Bayesian methods).

In Phixer, one computes the so-called φ-mixing coefficient for the edge Xi → Xj as

φ(Xj|Xi) = max |P r{Xj ∈ S|Xi ∈ T } − P r{Xj ∈ S}| (5) S⊆A,T ⊆B which in essence quantifies the maximum distance between conditional proba- bility of Xj given Xi, and the unconditional probability of Xj. Here A and B are the finite sets in which the random variables Xj and Xi take values, respec- tively. S and T represent the different subsets of A and B, and φ(Xj|Xi) is the maximum distance between the conditional and the unconditional probability across subsets. The φ-mixing coefficient comes with some useful properties. It is bounded in the interval [0, 1], and φ (Xi|Xk) = 0 if and only if Xi and Xk are independent.

Moreover, unlike correlation or mutual information it is asymmetric φ (Xi|Xk) 6=

φ (Xk|Xi), and hence can discriminate the direction of the influence. The inference starts by first computing the φ-mixing coefficient for each di- rected edge between two nodes, and then the network is pruned to eliminate redundant edges. For every possible triplet of nodes Xi, Xj, and Xk, the fol- lowing inequality is checked:

φ(Xk|Xi) ≤ min{φ(Xj|Xi), φ(Xk|Xj)} (6)

If this statement is true, then the edge Xi → Xk is eliminated, and hence the statistical relationship between Xi and Xk can be fully explained by the edges

Xi → Xj and Xj → Xk. This pruning method is similar to the Data Processing Inequality (Equation 4), used in ARACNE [25]. Importantly, the goal of the pruning step is to produce the most parsimonious network consistent with the data, which is not necessarily the same as producing the most accurate network prediction [32]. So, depending on their priorities, researchers may choose to include or omit the pruning step.

12 3.6. Miscellaneous

In addition to correlation, regression, Bayesian networks, information the- ory, and Phixer, many more interesting and creative types of network inference algorithms exist. Some examples are Gaussian graphical models [33,34], ODE- based methods [35,36], Boolean methods [37,38], and deep learning with neural networks [39]. Unfortunately, we do not have time to summarize every single method here, so we refer the reader to the original papers.

4. Computational Challenges

Although the field of network inference has progressed a great deal since its inception, there are several problems that are unsolved and require more research. In this section, we will discuss some of these problems.

4.1. Combining Algorithms

It has long been thought that combining an assortment of different algo- rithms could prove beneficial in network inference. Hill et al. 2016 [40] confirmed this empirically with a computational experiment in which they aggregated re- sults from randomly-chosen inference algorithms, and evaluated the results for accuracy using the area under the ROC curve measure. The results showed a general trend in which the more algorithms were included in the aggregation, the more accurate the results were. Another similar computational experiment was performed in which the results from the top performing algorithms were aggregated (rather than the algorithms being randomly chosen). In this case, the aggregated results generally achieved higher accuracy than even the best individual algorithms. (Please note that Hill et al. 2016 is about the inference of networks from phosphoprotein data, which is computationally very similar to gene network inference.) These results confirm the findings of an earlier anal- ysis found in Marbach et al. 2012 [9]. However, while there is ample evidence that combining algorithms can be beneficial, more research is needed to find the optimal combination strategy.

13 4.2. Multiple Data Sources

Up to this point in the paper, we have been under the assumption that we have only gene expression data to work with when inferring the structure of a GRN. However, in some cases, we may have access to more information. So, an interesting current problem is how best to combine information from multiple data sources in order to generate a prediction. Some recent papers on the integration of multiple data sources are Yuan et al. 2018 [41], Liang et al. 2019 [42], Lam et al. 2016 [43], and Aibar et al. 2017 [44]. Yuan et al. 2019 combines gene expression data with DNA methylation and copy number variation data. Liang et al. 2019 combines gene expression data with genome-wide binding data, gene ontologies, pathway data, and ChIP-Seq data. Lam et al. 2016 is an especially creative application of this concept. Here, analysis of gene expression data from the species in question is combined with analysis of gene expression data from homologous species. Aibar et al. 2017 introduces SCENIC, a method that combines the raw results of a GENIE3 [16] analysis with factor binding motif information from RcisTarget (also introduced in [44]) to select out a subset of high-confidence interactions, and then uses AUCell (also introduced in [44]) to classify the cells into transcriptional states. In some cases, it may be necessary to study a change in a GRN, for which something is already known about the GRN’s prior structure. This could be in the context of cellular differentiation, or in a pathological context, as with cancer. In these cases, the challenge is to incorporate the prior structural knowl- edge into the inference of the new structure. Some algorithms have been de- veloped specifically for this purpose [33,45]. Also of note is the database Tran- scriptional Regulatory Relationships Unravelled by Sentence-based Text-mining (TRRUST) [46], which contains known gene regulatory interactions in humans and mice. This resource can be used for constructing prior networks from which to infer the reprogrammed differential networks [33].

14 4.3. Pseudo-Temporal Ordering

As discussed before, most of the expression data we rely on for network inference is static data collected by methods such as RNA-Seq. This is dis- advantageous, in part, because it doesn’t allow us to detect self-edges. In an ideal situation, we would have dynamic data for each gene. While this is not currently feasible, some methods have been developed which attempt to create “pseudo-” data from static data [47,48,49,50]. An excellent review of pseudo-temporal ordering methods is Cannoodt et al. 2016 [51]. Sanchez-Castillo et al. 2017 [52] is a great example of how pseudo-temporal ordering can be can be useful in network inference. The researchers attempt to infer the GRN structure of a set of 48 genes, using a sample of 442 expression profiles from mouse embryos. The expression profiles are static, from single-cell qPCR, but the algorithm the researchers are using requires time series data. So, they employ the MOLO algorithm [53] to construct a pseudo-time series. The researchers then infer the GRN and draw biological insights about cell differentiation in mice from its structure. Moreover, a computational experiment is conducted to show how differences in the temporal order can affect the results of the network inference. In addition to the analysis of the mouse embryo data, another inference analysis is performed on pseudo-temporally ordered RNA-Seq data from zebrafish. However, while pseudo-temporal ordering has been useful in some cases, it has also faced criticism. Moris et al. 2016 [54] questions one of the underlying assumptions of many of these algorithms, that cell fate transitions are smooth and continuous. More research is needed on the subject to improve upon these algorithms and address these criticisms. It should also be noted that most of the work on pseudo-temporal ordering has focused on cellular differentiation, and the application of these methods to cells that switch between transient expression states is a more difficult problem that requires further study.

15 5. Applications in Cancer Research

While network inference can be an invaluable tool for the investigation of many different diseases, the most obvious applications are in cancer research. Some important roles of GRNs are to regulate cell cycle timing, proliferation, and apoptosis. Cancer arises from a loss of regulation of these processes, and understanding the structure of the underlying networks can yield powerful in- sights into discovery of relevant genes [55], clinical outcome prediction [56], drug target identification [57,58,37], elucidation of sex-linked differences [59], inves- tigation of transcriptional reprogramming [33], and more. For example, Figure 4 shows as inferred network of genes using Phixer from [60] that is critical in driving drug resistance in BRAF V600E-mutated melanoma. Also, many of the papers previously cited include a testing section in which the algorithm is applied to a cancer dataset to test its efficacy [14,21,41,42,33,34,37,31,44]. There have been so many exciting new applications of network inference that we cannot provide an in-depth explanation of all of them here. So, we will focus on one specific paper to serve as a case-study: Moore et al. 2019 [61]. This is an excellent example of how network inference can be used to understand cancer at the cell systems level. The researchers apply the BC3Net [62] inference algo- rithm to the gene expression profiles of 333 prostate cancer patients, obtained from The Cancer Genome Atlas (TCGA) [63]. The resulting network is then analyzed, using Gene Pair Enrichment Analysis (GPEA) [64] and the Cancer Gene Census [65], with a focus on gene interactions that the researchers feel could be exploited for clinical benefits, including targeted therapy. This paper is valuable not only because of its interesting insights on prostate cancer, but also because the methodology is described in such a clear way that it could serve as a step-by-step guide for researchers hoping to conduct their own analysis of data from another disease. Similar analyses have been performed on data from lymphoma [66], colon cancer [67], and breast cancer [68], but there are still hun- dreds of different cancer types and subtypes that have yet to be analyzed, many of which are available on TCGA and just waiting for an eager computational

16 Figure 4: Gene regulatory network inferred in [60] in the context of drug resistance in melanoma. More specifically, RNA Fluorescent In Situ Hybridization (FISH) was used to count the number of mRNAs of 19 different genes in single melanoma cells. The joint distri- bution of mRNA levels measured across thousands of cells was then used to infer the network using Phixer. This network was shown to play a key role in driving drug-sensitive cells into a transient drug-tolerant state even in the absence of the drug. Such drug-tolerant cells survive exposure to the drug, and drug-induced reprogramming of this network allows these to become stably resistant to the drug.

17 biologist to take up the challenge.

6. Tools

Here are some useful, freely available tools for network inference researchers:

• DREAM Challenges [8] - network inference competitions, for which gold standard datasets are made publicly available. These are often used for benchmarking. Available online at: http://dreamchallenges.org

• The Cancer Genome Atlas (TCGA) [63] - a collection of genomic, epige- nomic, transcriptomic, and proteomic cancer data, including expression profiles that can be used for gene network inference. Available online at: https://portal.gdc.cancer.gov

• TRRUST [46] - a database of experimentally-validated gene interactions in humans and mice. Available online at: https://www.grnpedia.org/trrust

• Cancer Gene Census [65] - a catalogue of genes known to be related to cancer. Available online at: https://cancer.sanger.ac.uk/census

• The Cancer Network Galaxy (TCNG) - an online database of cancer gene networks, predicted from public expression data. Not yet officially pub- lished, but a beta release is available online at: http://tcng.hgc.jp

• GeNeCK [69] - an online tool for gene network inference and visualiza- tion, for which the user can choose from 8 different inference algorithms. Available online at: http://lce.biohpc.swmed.edu/geneck

7. Conclusion

In this paper, we have given a brief overview of the basics of network in- ference, different types of algorithms, some unsolved computational challenges, and exciting new applications in cancer research. While we hope this review has been helpful, it is not completely comprehensive, and we encourage the reader

18 to further explore this topic by reading the papers cited here, and by testing out the algorithms and data sources for themselves.

References

References

[1] Ptashne M: The chemistry of regulation of genes and other things. Journal of Biological Chemistry 2014, 289(9):54175435.

[2] Karlebach G and Shamir R: Modelling and analysis of gene regulatory networks. Nature Reviews Molecular Cell Biology 2008, 9(10):77080.

[3] Kreeger PK, Lauffenburger DA: Cancer systems biology: A network mod- eling perspective. Carcinogenesis 2010, 31(1):2-8.

[4] Brown PO, Botstein D: Exploring the new world of the genome with DNA microarrays. Nature genetics 1999, 21:3337.

[5] Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for tran- scriptomics. Nature reviews genetics 2009, 10(1):5763.

[6] Kulkarni A, Anderson AG, Merullo DP, Konopka G: Beyond bulk: a review of single cell transcriptomics methodologies and applications. Curr Opin Biotechnol. 2019, 58:129-136.

[7] Schrynemackers M, Kuffner R, Geurts P: On protocols and measures for the validation of supervised methods for the inference of biological networks. Frontiers in Genetics 2013, 4:262.

[8] Stolovitzky G, Monroe D, Califano A: Dialogue on reverse-engineering as- sessment and methods: the DREAM of high-throughput pathway inference. Ann N Y Acad Sci. 2007, 1115:1-22.

[9] Marbach D, Costello JC, Kuffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, The DREAM5 Consortium, Kellis M, Collins JJ, Stolovitzky

19 G: Wisdom of crowds for robust gene network inference. Nature Methods 2012, 9(8).

[10] Stolovitzky G, Prill RJ, Califano A: Lessons from the DREAM2 Challenges. Ann N Y Acad Sci. 2009, 1158:159-95.

[11] Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, Clarke ND, Altan-Bonnet G, Stolovitzky G: Towards a Rigorous As- sessment of Systems Biology Models: The DREAM3 Challenges. PLoS One 2010, 5(2): e9202.

[12] Huynh-Thu VA, Sanguinetti G: Gene regulatory network inference: an introductory survey. Methods in 2019, DOI:10.1007/978- 1-4939-8882-2 1.

[13] Zhang B, Horvath S: A general framework for weighted gene co- expression network analysis. Statistical applications in genetics and molecular biology 2005, 4(1)

[14] Care MA, Westhead DR, Tooze RM: Parsimonious Gene Correlation Net- work Analysis (PGCNA): a tool to define modular gene co-expression for refined molecular stratification in cancer. Systems Biology and Applications 2019, 5:13.

[15] Haury AC, Mordelet F, Vera-Licona P, Vert JP: TIGRESS: Trustful Infer- ence of Gene REgulation using Stability Selection. BMC Systems Biology 2012, 6:145.

[16] Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P: Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. PLoS One 2010, 5(9). DOI: 10.1371/journal.pone.0012776.

[17] Singh N, Vidyasagar M: bLARS: An Algorithm to Infer Gene Regulatory Networks. IEEE/ACM Transactions of Computational Biology and Bioin- formatics 2016, 13(2):301-14. DOI: 10.1109/TCBB.2015.2450740.

20 [18] Efron B, Hastie T, Johnstone I, Tibshirani R: Least Angle Regression. Annals of Statistics 2004, 32(2):407499.

[19] Friedman N, Linial M, Nachman I, and Peer D: Using Bayesian networks to analyze expression data. Journal of Computational Biology 2000, 7(3- 4):60120.

[20] Yu J, Smith VA, WangPP, Hartemink AJ, Jarvis ED: Advances to Bayesian network inference for generating causal networks from observational biolog- ical data. Bioinformatics 2004, 20(18):3594603.

[21] Adabor ES, Acquaah-Mensah GK: Restricted-derestricted dynamic Bayesian Network inference of transcriptional regulatory relationships among genes in cancer. Computational Biology and Chemistry 2019, 79:155-164.

[22] Kim SY, Imoto S, Miyano S: Inferring gene networks from time series mi- croarray data using dynamic Bayesian networks. Briefings in Bioinformatics 2003,4(3):22835.

[23] Hill SM, Lu Y, Molina J, Heiser LM, Spellman PT, Speed TP, Gray JW, Mills GB, Mukherjee S: Bayesian inference of signaling in a cancer cell line. Bioinformatics 2012, 28(21):2804-10.

[24] Butte AJ and Kohane IS: Mutual information relevance networks: func- tional genomic clustering using pairwise entropy measurements. Pacific Symposium on Biocomputing 2000, 5:418-29.

[25] Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioin- formatics 2006, 7(1):S7. Available online at: https://github.com/califano- lab/ARACNe-AP/

[26] Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS: Large-scale mapping and validation of

21 transcriptional regulation from a compendium of expression profiles. PLoS Biology 2007, 5(1):e8.

[27] Meyer PE, Kontos K, Lafitte F, Bontempi G: Information-theoretic infer- ence of large transcriptional regulatory networks. EURASIP Journal on Bioinformatics and Systems Biology 2007, (1):79,879.

[28] Chan TE, Stumpf MPH, Babtie AC: Gene Regulatory Network Inference from Single-Cell Data Using Multivariate Information Measures. Cell Systems 2017, 5(3):251-267. Available online at: https://github.com/Tchanders/NetworkInference.jl

[29] Kraskov A, Stogbauer H, Grassberger P: Estimating mutual information. Phys. Rev. E 2004, 69: 066138.

[30] Steuer R, Kurths J, Daub CO, Weise J, Selbig J: The mutual information: detecting and evaluating dependencies between variables. Bioinformatics 2002, 18 (Suppl 2 ):S231S240.

[31] Singh N, ME Ahsen, N Challapalli, HS Kim, MA White, M Vidyasagar: Inferring Genome-Wide Interaction Networks Using the Phi-Mixing Coeffi- cient, and Applications to Lung and Breast Cancer. IEEE Transactions on Molecular, Biological, and Multi-Scale Communications 2018, 4(3): 123- 139.

[32] Saint-Antoine MM and Singh A: Evaluating Pruning Methods in Gene Network Inference. 2019 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB).

[33] Xu T, Ou-Yang L, Hu X, Zhang XF: Identifying Gene Network Rewiring by Integrating Gene Expression and Gene Network Data. IEEE/ACM Trans- actions of Computational Biology and Bioinformatics 2018, 15(6): 2079- 2085.

22 [34] Zhao H, Duan ZH: Cancer Genetic Network Inference Using Gaussian Graphical Models. Bioinform Biol Insights. 2019, doi: 10.1177/1177932219839402.

[35] Bansal M, Della Gatta G, di Bernardo D: Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 2006, 22(7):81522.

[36] Huynh-Thu VA and Sanguinetti G: Combining tree-based and dynamical systems for the inference of gene regulatory networks. Bioinformatics 2015, pages 18.

[37] Biane C, Delaplace F, Melliti T: Abductive Network Action Inference for Targeted Therapy Discovery. Electronic Notes in Theoretical Computer Science 2018, 335:3-25.

[38] Barman S and Kwon YK: A inference from time-series gene expression data using a genetic algorithm. Bioinformatics 2018, 34(17): i927i933.

[39] Kishan KC, Li R, Cui F, Yu Q, Haake AR: GNE: a deep learning framework for gene network inference by aggregating biological information. BMC Sys- tems Biology 2019, 13:38.

[40] Hill SM, Heiser LM, Cokelaer T, Unger M, Nesser, NK, Carlin DE, Zhang Y, Sokolov A, Paull EO, Wong CK, et al: Inferring causal molecular net- works: empirical assessment through a community-based effort. Nat. Meth- ods 2016, 13:310318.

[41] Yuan L, Guo LH, Yuan CA, Zhang YH, Han K, Nandi A, Honig B, Huang DS: Integration of Multi-omics Data for Gene Regula- tory Network Inference and Application to Breast Cancer. IEEE/ACM Transactions of Computational Biology and Bioinformatics 2018, DOI: 10.1109/TCBB.2018.2866836.

23 [42] Liang X, Young WC, Hung LH, Raftery AE, Yeung KY: Integra- tion of Multiple Data Sources for Gene Network Inference Using Ge- netic Perturbation Data. Journal of Computational Biology 2019, DOI: 10.1089/cmb.2019.0036.

[43] Lam KY, Westrick ZM, Mller CL, Christiaen L, Bonneau R: Fused Regres- sion for Multi-source Gene Regulatory Network Inference. PLoS Computa- tional Biology 2016, 12(12):e1005157.

[44] Aibar S, Gonzlez-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine JC, Geurts P, Aerts J, van den Oord J, Atak ZK, Wouters J, Aerts S: SCENIC: single-cell regulatory network inference and clustering. Nature Methods 2017, 14(11):1083-1086.

[45] Wang Y, Cho DY, Lee H, Fear J, Oliver B, Przytycka TM: Reprogramming of regulatory network using expression uncovers sex-specific gene regulation in . Nat Commun 2018, 9(1):4061.

[46] Han H, Shim H, Shin D, Shim JE, Ko Y, Shin J, Kim H, Cho A, Kim E, Lee T, Kim H, et al: Trrust: A reference database of human transcrip- tional regulatory interactions, Sci. Rep. 2015, 5:11432. Available online at: https://www.grnpedia.org/trrust/

[47] Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL: The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 2014, 32: 381.

[48] Haghverdi L, Buttner M, Wolf FA, Buettner F, Theis FJ: Diffusion pseudo- time robustly reconstructs lineage branching. Nat. Methods 2016, 13:845.

[49] Reid JE and Wernisc L: Pseudotime estimation: deconfounding single cell time series. Bioinformatics 2016, 32:29732980.

[50] Bendall SC, Davis KL, Amir AD, Tadmor MD, Simonds EF, Chen TJ, Shenfeld DK, Nolan GP, Pe’er D: Single-cell trajectory detection uncovers

24 progression and regulatory coordination in human B cell development. Cell 2014,157(3):714-25.

[51] Cannoodt R, Saelens W, Saeys Y: Computational methods for trajec- tory inference from single-cell transcriptomics. Eur J Immunol. 2016, 46(11):2496-2506.

[52] Sanchez-Castillo M, Blanco D, Tienda-Luna IM, Carrion MC, Huang Y: A Bayesian framework for the inference of gene regulatory networks from time and pseudo-time series data. Bioinformatics 2018, 34(6):964-970.

[53] Sakai R, Winand R, Verbeiren T, Moere AV, Aerts J: dendsort: modular leaf ordering methods for dendrogram representations in R. F1000 Res., 2014. 3:177.

[54] Moris N, Pina C, Martinez Arias A: Transition states and cell fate decisions in epigenetic landscapes. Nat. Rev. Genet. 2016, 17:693.

[55] Gladitz J, Klink B, Seifert M: Network-based analysis of oligoden- drogliomas predicts novel cancer gene candidates within the region of the 1p/19q co-deletion. Acta Neuropathol Commun 2018, 6(1):49.

[56] Allahyar A, Ubels J, de Ridder J: A data-driven interactome of synergistic genes improves network-based cancer outcome prediction. PLoS Computa- tional Biology 2019, 15(2):e1006657.

[57] Madhamshettiwar PB, Maetschke SR, Davis MJ, Reverter A, Ragan MA: Gene regulatory network inference: evaluation and application to ovarian cancer allows the prioritization of drug targets. Genome Medicine 2012, 4(5):41.

[58] Vundavilli H, Datta A, Sima C, Hua J, Lopes R, Bittner ML: Bayesian Inference Identifies Combination Therapeutic Targets in Breast Cancer. IEEE Trans Biomed Eng 2019, DOI: 10.1109/TBME.2019.2894980.

25 [59] Lopes-Ramos CM, Kuijjer ML, Ogino S, Fuchs CS, DeMeo DL, Glass K, Quackenbush J: Gene Regulatory Network Analysis Identifies Sex- Linked Differences in Colon Cancer Drug Metabolism. Cancer Res 2018, 78(19):5538-5547.

[60] Shaffer SM, Dunagin MC, Torborg SR, Torre EA, Emert B, Krepler C, Beqiri M, Sproesser K, Brafford PA, Xiao M et al: Rare cell variability and drug-induced reprogramming as a mode of cancer drug resistance. Nature 2017, 546: 431435.

[61] Moore D, Simoes RM, Dehmer M, Emmert-Streib F: Prostate Cancer Gene Regulatory Network Inferred from RNA-Seq Data. Curr Genomics 2019, 20(1):38-48.

[62] de Matos Simoes R, Emmert-Streib F: Bagging statistical network inference from large-scale gene expression data. PLoS One 2012, 7(3), e33624.

[63] Weinstein JN, Collisson EA, Mills GB, Shaw KM, Ozenberger BA, Ellrott K, Shmulevich I, Sander C, Stuart JM, and Cancer Genome Atlas Research Network: The Cancer Genome Atlas Pan-Cancer Analysis Project. Nat Genet. 2013, 45(10): 11131120.

[64] de Matos Simoes R, Dehmer M, Emmert-Streib F: Interfacing cellular net- works of S. cerevisiae and E. coli: Connecting dynamic and genetic infor- mation. BMC. Genomics. 2013, 14(1):324.

[65] Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat. Rev. Cancer. 2004, 4(3):177-183.

[66] de Matos Simoes R, Dehmer M, Emmert-Streib F: B-cell lymphoma gene regulatory networks: Biological consistency among inference methods. Front. Genet. 2013, 4, 281.

26 [67] Emmert-Streib F, de Matos Simoes R, Glazko G, McDade S, Haibe-Kains B, Holzinger A, Dehmer M, Campbell F: Functional and genetic analysis of the colon cancer network. BMC. Bioinf. 2014, 15(Suppl 6), S6.

[68] Emmert-Streib F, de Matos Simoes R, Mullan P, Haibe-Kains B, Dehmer M: The gene regulatory network for breast cancer: Integrated regulatory landscape of cancer hallmarks. Front. Genet. 2014, 5:15.

[69] Zhang M, Li Q, Yu D, Yao B, Guo W, Xie Y, Xiao G: GeNeCK: a web server for gene network construction and visualization. BMC Bioinformatics 2019,20(1):12.

27