A Method for Inferring Gene Regulatory Networks from Time
www.nature.com/scientificreports
OPEN
BGRMI: A method for inferring gene regulatory networks from time-course gene expression data and its application in breast cancer research
Received: 08 September 2016
Accepted: 24 October 2016
Published: 23 November 2016
Luis F. Iglesias-Martinez1, Walter Kolch1,2,3 & Tapesh Santra1
Reconstructing gene regulatory networks (GRNs) from gene expression data is a challenging problem. Existing GRN reconstruction algorithms can be broadly divided into model-free and model-based methods. Typically, model-free methods have high accuracy but are computationally intensive, whereas model-based methods are fast but less accurate. We propose Bayesian Gene Regulation Model Inference (BGRMI), a model-based method for inferring GRNs from time-course gene expression data. BGRMI uses a Bayesian framework to calculate the probability of different models of GRNs and a heuristic search strategy to scan the model space efficiently. Using benchmark datasets, we show that BGRMI has higher or comparable accuracy at a fraction of the computational cost of competing algorithms. Additionally, it can incorporate prior knowledge of potential gene regulation mechanisms and TF hetero-dimerization processes in the GRN reconstruction process. We incorporated existing ChIP-seq data and known protein interactions between TFs in BGRMI as sources of prior knowledge to reconstruct transcription regulatory networks of proliferating and differentiating breast cancer (BC) cells from time-course gene expression data. The reconstructed networks revealed key driver genes of proliferation and differentiation in BC cells. Some of these genes were not previously studied in the context of BC, but may have clinical relevance in BC treatment.
Cellular functions depend on the precise regulation of thousands of genes, which are activated or silenced by transcription factors (TFs)1.
The networks representing the interactions between TFs and their target genes are typically known as GRNs and can be reconstructed from temporal measurements of gene expression2–5. GRN reconstruction methods can be classified into two main categories: model-based and model-free methods. Model-based methods aim to capture the regulatory interactions by fitting mathematical models of gene regulation to observed gene expression data2,4,5. Model-free approaches, on the other hand, use information-theoretic criteria to infer the structure of the network3,6. Although the performance of GRN reconstruction methods depends on several aspects, such as the data type and the network properties of the GRN7, model-based methods generally tend to be faster but have lower predictive performance than model-free methods6. However, model-free methods are often not scalable enough to reconstruct genome-wide GRNs5,6 in reasonable time.
Typically, model-based methods formulate the expression of a gene as a function of its regulators, evaluate competing models containing different sets of regulators and choose those which closely predict the target gene expression4,5,8. Although the vast majority of model-based methods assume that the expressions of a gene and its regulators are linearly dependent5,9–12, these methods use different model search algorithms, e.g. Least Absolute Shrinkage and Selection Operator (LASSO), Dantzig Selector, elastic net, Markov Chain Monte Carlo (MCMC) and heuristic search5,12–19.
Some of these methods, such as LASSO and elastic net, choose a single best model; others, such as MCMC or heuristic search based Bayesian Model Averaging (BMA) methods5,12,13,15,18, select multiple models that provide close fits to the data and use these to estimate an average model along with its confidence interval. It is also possible to incorporate different types of existing data in BMA5,12,18 to increase the accuracy of the reconstructed GRN.
1Systems Biology Ireland, University College Dublin, Belfield, Dublin 4, Republic of Ireland. 2Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin 4, Ireland. 3School of Medicine and Medical Science, University College Dublin, Belfield, Dublin 4, Ireland. Correspondence and requests for materials should be addressed to T.S. (email: [email protected])
SCIENTIFIC REPORTS | 6:37140 | DOI: 10.1038/srep37140
We developed BGRMI, a model-based method that relies on the principles of BMA for inferring GRNs from time-course gene expression data. BGRMI uses discretized ordinary differential equation (DODE) based mathematical models to formulate the interactions between each gene and its regulators. It formulates the rate of change in a gene's expression as a function of the expressions of its regulators and takes basal expression and self-regulation into account, and therefore provides a more realistic model of gene regulation than many existing methods. These models are then used in a Bayesian framework to evaluate how likely a set of TFs is to regulate a certain gene. We developed a greedy heuristic search algorithm to explore different combinations of TFs and find the most likely TF combinations for each gene. The proposed algorithm is faster and more scalable than many existing methods. The average of some of the most likely models was then used to represent the regulatory model of the gene.
We compared the accuracy of BGRMI against other methods using in-silico and in-vivo benchmark datasets. BGRMI consistently out-performed most of the other competing methods in our benchmarking study. We then showed how additional data sources, e.g. ChIP-seq and the protein-protein interaction (PPI) between TFs can be incorporated as prior knowledge in the core BGRMI formulation. Finally, we applied BGRMI to study the transcriptional mechanisms that lead to proliferation and differentiation in BC cells by combining ChIP-seq, PPI and time course gene expression profiles. Our study uncovered previously unknown transcriptional mechanisms that drive phenotypic changes in BC cells.
Method
A brief overview of the BGRMI algorithm is as follows. We first developed a mathematical model of TF-mediated gene regulations. The model can predict the temporal changes in the expressions of target genes using the temporal expression patterns of TFs as input. This model is then used to evaluate different combinations of TFs to find those that closely predict target gene expressions. This is done by iteratively exploring different TF combinations, calculating the posterior probability of each of these combinations to predict target gene expression, and selecting those with high probabilities. The average of selected gene regulation models for each gene is used as its regulation model. Below we describe each step of our algorithm in detail.
The mathematical model of gene regulation. We used a discretized form of ODEs to formulate the dependence of a target gene on its regulators as shown below.
$$\Delta \mathrm{mRNA}_i(t) = \frac{\mathrm{mRNA}_i(t) - \mathrm{mRNA}_i(t-\Delta t)}{\Delta t} = \alpha_i + \boldsymbol{\beta}_i^{T} \begin{bmatrix} \mathrm{mRNA}_i(t-\Delta t) \\ \mathrm{TF}_i(t-\Delta t) \end{bmatrix} + \varepsilon_i(t) \tag{1}$$
Here, mRNAi(t) is the expression of gene i at time t, αi is the basal gene expression rate, βi is a vector that contains the coefficients of self-regulation (by means of degradation, auto-activation/inhibition) and of regulation by a set of TFs (TFi), and TFi(t − Δt) are the expressions of the TFs that regulate gene i at time (t − Δt). εi(t) is the model fitting error caused by the measurement noise in the expression data. Since measurement noise is random, εi(t) is a random variable and typically has a Gaussian distribution with zero mean and variance σ², i.e. εi(t) ~ N(0, σ²)12,18. The error variance (σ²) depends on many factors, such as biological variability and measurement noise, and is typically unknown.
The posterior probability of a gene regulation model. We used Bayesian statistics to calculate this probability. There are two main components of a Bayesian formulation: (a) the prior probability (p(Mk)), which represents how well a model (Mk) is supported by prior knowledge, and (b) the likelihood function (p(mRNAi|Mk)), which evaluates how well a model explains the experimental data. By Bayes' rule20, the posterior probability (p(Mk|mRNAi)) is proportional to the product of these two quantities and represents how well a model (Mk) is supported by prior knowledge and experimental data combined.
In the absence of prior knowledge, we assumed that sparse regulatory models (i.e. those involving fewer TFs) are a priori more likely than dense models (i.e. those involving a large number of TFs). This assumption was formulated by assigning the following prior distribution over regulatory models (Mk)5: $p(M_k) = L^{-2.66}$, where L is the number of regulators in the model. However, there is a wealth of publicly available information about the GRNs of several organisms, such as yeast, E. coli and humans.
This information can be used to formulate more informative priors for reconstructing GRNs of these organisms. We shall discuss the formulation of priors for human GRNs using ChIP-seq data12 in a later section, where we describe the implementation of BGRMI on human transcriptomic data.
The likelihood of a gene regulation model (Mk) is the probability that the observed expression pattern (mRNAi) of gene i can be predicted by the model (Mk), and has the following form5,12:
$$P\left(\mathrm{mRNA}_i \mid \mathrm{TF}_i, \alpha_i, \boldsymbol{\beta}_i, \sigma^2\right) = \prod_{t=\Delta t}^{T} N\left(\Delta \mathrm{mRNA}_i(t) \;\middle|\; \alpha_i + \boldsymbol{\beta}_i^{T} \begin{bmatrix} \mathrm{mRNA}_i(t-\Delta t) \\ \mathrm{TF}_i(t-\Delta t) \end{bmatrix}, \; \sigma^2\right) \tag{2}$$
Here N(x | μ, σ²) denotes a Gaussian density with mean μ and variance σ², and ΔmRNAi(t) is the discretized rate of change on the left-hand side of Eq. 1. The likelihood function in Eq. 2 depends on the model parameters (αi, βi, σ²), whose values are typically unknown. Therefore, we evaluated the average of the likelihood (Eq. 2) over all possible values of the model parameters. This average likelihood is also called the marginal likelihood (P(mRNAi|Mk)). To calculate the marginal likelihood analytically, we assigned conjugate prior distributions to each of the unknown parameters. These distributions represent our prior knowledge of how likely a parameter is to have a certain value. Following Fernandez et al.13, we assigned an uninformative Jeffreys prior distribution13,21 to the basal expression rates αi, p(αi) = 1, which implies that αi is equally likely to have any real value.
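As an illustration only, and not the authors' implementation, the quantities introduced so far can be sketched in a few lines of Python. The sketch builds the discretized rates of change and regressor vectors of Eq. 1 from a time series, evaluates the logarithm of the Gaussian likelihood of Eq. 2 at fixed parameter values (it does not compute the marginal likelihood, which requires the parameter priors), and evaluates the logarithm of the sparsity prior. The function names (`build_regression_data`, `log_likelihood`, `log_sparsity_prior`), the toy time series and the restriction of the prior to models with at least one regulator are assumptions made for this example.

```python
import math


def build_regression_data(target, tfs, dt):
    """Build the left-hand side and regressors of Eq. 1 (hypothetical helper).

    target: expression of gene i at t = 0, dt, 2*dt, ...
    tfs:    one expression series per candidate TF, same length as target.
    Returns (y, X) where y[k] is the discretized rate of change and
    X[k] = [mRNA_i(t - dt), TF_1(t - dt), ...] for each t = dt, ..., T.
    """
    y, X = [], []
    for k in range(1, len(target)):
        y.append((target[k] - target[k - 1]) / dt)
        X.append([target[k - 1]] + [tf[k - 1] for tf in tfs])
    return y, X


def log_likelihood(y, X, alpha, beta, sigma2):
    """Log of Eq. 2 at fixed (alpha, beta, sigma2): a sum of Gaussian log-densities."""
    total = 0.0
    for y_t, x_t in zip(y, X):
        mean = alpha + sum(b * x for b, x in zip(beta, x_t))
        total += -0.5 * math.log(2.0 * math.pi * sigma2)
        total -= (y_t - mean) ** 2 / (2.0 * sigma2)
    return total


def log_sparsity_prior(num_regulators):
    """Log of the unnormalized prior p(M_k) = L**(-2.66).

    Assumption of this sketch: the prior is only evaluated for L >= 1,
    because L**(-2.66) is undefined at L = 0.
    """
    if num_regulators < 1:
        raise ValueError("this sketch requires at least one regulator")
    return -2.66 * math.log(num_regulators)


if __name__ == "__main__":
    # Toy, noise-free data generated from Eq. 1 with one TF (made-up values).
    tf = [1.0, 2.0, 1.5, 0.5, 1.0]
    alpha, beta, dt = 0.5, [-0.2, 0.3], 1.0  # beta = [self-regulation, TF effect]
    target = [1.0]
    for k in range(len(tf) - 1):
        rate = alpha + beta[0] * target[k] + beta[1] * tf[k]
        target.append(target[k] + dt * rate)

    y, X = build_regression_data(target, [tf], dt)
    print("log-likelihood, generating parameters:", log_likelihood(y, X, alpha, beta, 0.1))
    print("log-likelihood, all-zero parameters:  ", log_likelihood(y, X, 0.0, [0.0, 0.0], 0.1))
    print("log prior for L = 1 and L = 3:", log_sparsity_prior(1), log_sparsity_prior(3))
```

In this toy setting the parameters that generated the data give a higher log-likelihood than the all-zero parameters, and the log prior decreases as the number of regulators grows, which is the sparsity preference described above.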