Combining heterogeneous sources of data for the reverse-engineering of regulatory networks

A thesis submitted for the degree of Doctor of Philosophy

Emma Steele

School of Information Systems, Computing and Mathematics
Brunel University

January 2010

Abstract

Gene Regulatory Networks (GRNs) represent how genes interact in various cellular processes by describing how the expression level, or activity, of genes can affect the expression of other genes. Reverse-engineering GRN models can help biologists understand and gain insight into genetic conditions and diseases. Recently, the increasingly widespread use of DNA microarrays, a high-throughput technology that allows the expression of thousands of genes to be measured simultaneously in biological experiments, has led to many datasets of gene expression measurements becoming publicly available and a subsequent explosion of research in the reverse-engineering of GRN models. However, microarray technology has a number of limitations as a data source for the modelling of GRNs, due to concerns over its reliability and the reproducibility of experimental results.

The underlying theme of the research presented in this thesis is the incorporation of multiple sources and different types of data into techniques for reverse-engineering or learning GRNs from data. By drawing on many data sources, the resulting network models should be more robust, accurate and reliable than models that have been learnt using a single data source. This is achieved by focusing on two main strands of research. First, the thesis presents some of the earliest work in the incorporation of prior knowledge that has been generated from a large body of scientific papers, for Bayesian network based GRN models. Second, novel methods for the use of multiple microarray datasets to produce Bayesian network based GRN models are introduced. Empirical evaluations are used to show that the incorporation of literature-based prior knowledge and combining multiple microarray datasets can provide an improvement, when compared to the use of a single microarray dataset, for the reverse-engineering of Bayesian network based GRN models.

Acknowledgements

Firstly I would like to thank my supervisor Allan Tucker. Without his support, encouragement, guidance and understanding throughout my PhD, this thesis would have never been completed, and certainly not within the magic three years! I would also like to thank my second supervisor Xiaohui Liu for all of his invaluable advice and wise words.

I would like to acknowledge my colleagues (past and present) in the Centre for Intelligent Data Analysis at Brunel for interesting discussions that have contributed towards my research, in particular Michael Hirsch, Veronica Vinciotti and Stephen Swift.

I would also like to thank members of the Biosemantics Association in the Netherlands for all their help and support during my PhD. They have been the perfect partners in collaboration! In particular Peter-Bram ’t Hoen, Matthew Hestand (Leiden University) and Martijn Schuemie (Erasmus University Rotterdam), who have spent a lot of their time assisting me with this research.

To my office buddies, past and present, for keeping the IDA lab fun and a pleasure to work in: Kyri, Karl, Dan, Ana, Natalie, Jose, Jian, Yahya, Christine, Nota and others.

A big thank you to my family for their encouragement throughout my PhD, but also for not asking how the thesis was going! Last but definitely not least, thank you to my husband Andy for all his love and support, and especially for reading this thesis and allowing me to monopolise his brand-new computer without too much complaint!

And a last minute addition to these acknowledgements: my daughter Isla, who arrived after this thesis was submitted, but before the viva. Thank you for allowing me the time to finish this PhD!

Publications

The following publications have resulted from the research presented in this thesis:

1. Peeling, E., Tucker, A. & ’t Hoen, P. (2007). Discovery of local regulatory structure from microarray gene expression data using Bayesian networks. In Proceedings of the annual workshop on Intelligent Data Analysis in bioMedicine And Pharmacology (IDAMAP).

2. Peeling, E. & Tucker, A. (2007). Consensus gene regulatory networks: combining multiple microarray gene expression datasets. In AIP Conference Proceedings vol. 940, The 3rd International Symposium on Computational Life Sciences (COMPLIFE 2007).

3. Steele, E. & Tucker, A. (2008). Consensus and meta-analysis regulatory networks for combining multiple microarray gene expression datasets. Journal of Biomedical Informatics, 41, 914-926.

4. Steele, E., Tucker, A., ’t Hoen, P. & Schuemie, M.J. (2009). Literature-based priors for gene regulatory networks. Bioinformatics, 25, 1768-1774.

5. Steele, E. & Tucker, A. (2009). Selecting and weighting data for building Consensus Gene Regulatory Networks. In LNCS 5772, Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII, 190-201.

Please note that Peeling is the author’s maiden name. Item 1 results from preliminary work for this thesis. Items 2 and 3 result from the research described in Chapter 5. Item 4 results from the research presented in Chapter 4. Finally, item 5 is based on Chapter 6.

Contents

1 Introduction
  1.1 Reverse-engineering gene regulatory network models
  1.2 Thesis aims
  1.3 Thesis contributions
  1.4 Thesis outline

2 Gene Regulatory Networks
  2.1 Regulation of gene expression
  2.2 Microarray technology
    2.2.1 The basics: data collection
    2.2.2 Data processing
      2.2.2.1 Image processing
      2.2.2.2 Expression ratios
      2.2.2.3 Normalisation
    2.2.3 Analysis of microarray expression data
      2.2.3.1 The gene expression matrix
      2.2.3.2 Basic analysis techniques
      2.2.3.3 Inferring regulatory relationships
    2.2.4 Using multiple microarray gene expression datasets
  2.3 Other types of data
    2.3.1 Protein-protein interaction data
    2.3.2 Transcription factor binding site data
    2.3.3 Literature-based knowledge
    2.3.4 Use of prior knowledge in inferring regulatory relationships
  2.4 Summary


3 Modelling GRNs using Bayesian networks
  3.1 Bayesian networks
    3.1.1 Conditional independence
    3.1.2 Inference
    3.1.3 Explaining away
    3.1.4 Parameter learning
    3.1.5 Equivalence classes
    3.1.6 Causality
  3.2 Bayesian networks to model gene regulation
    3.2.1 Simple regulatory structures
    3.2.2 Modelling using microarray expression data
    3.2.3 More complex regulatory interactions
  3.3 Learning Bayesian network structures
    3.3.1 Constraint-based learning
    3.3.2 Score-based learning
    3.3.3 Comparison of both approaches
    3.3.4 Confidence levels for network edges
  3.4 Evaluation of model performance
    3.4.1 Network structure comparison to current knowledge
    3.4.2 Prediction of gene expression values
  3.5 Recent work in using BNs to model GRNs

4 Prior knowledge
  4.1 Introduction
  4.2 Related work
  4.3 Informative Bayesian network priors
    4.3.1 Calculating the prior probability of a network structure
    4.3.2 Weighting the prior
  4.4 Literature-based Bayesian network priors
    4.4.1 Literature-based gene concept profiling
    4.4.2 Edge prior probabilities from literature-based knowledge
  4.5 Evaluating the use of literature-based priors
    4.5.1 Evaluation procedure


    4.5.2 Application: yeast
      4.5.2.1 Comparison to documented interactions
      4.5.2.2 Expression value prediction
    4.5.3 Application: E. coli
      4.5.3.1 Comparison to documented interactions
      4.5.3.2 Expression value prediction
    4.5.4 Application: human
  4.6 Discussion

5 Combining multiple microarray gene expression datasets
  5.1 Introduction
  5.2 Pre-learning aggregation
  5.3 Post-learning aggregation
    5.3.1 Consensus Bayesian networks
    5.3.2 Bayesian network meta-analysis
    5.3.3 Related work
  5.4 Comparison of pre- and post-learning aggregation methods
    5.4.1 Comparison procedure
    5.4.2 Application: synthetic network
    5.4.3 Application: E. coli SOS response network
    5.4.4 Application: yeast heat stress network
  5.5 Discussion

6 Incorporating network reliability
  6.1 Introduction
  6.2 Consensus Bootstrapped Bayesian Networks
    6.2.1 Algorithm
    6.2.2 Incorporating weights
    6.2.3 Comparison to the original Consensus algorithm
  6.3 Measuring network reliability
    6.3.1 A prediction-based reliability measure for networks
    6.3.2 Prediction-based network weighting
    6.3.3 Prediction-based network selection
    6.3.4 Related work


  6.4 Network selection or network weighting: a comparison
    6.4.1 Datasets and experiment design
    6.4.2 Application: synthetic networks
      6.4.2.1 Performance comparison of approaches
      6.4.2.2 Selection of the consensus threshold
      6.4.2.3 Method comparison for the network selection approach
    6.4.3 Application: yeast heat-stress network
  6.5 Discussion

7 Inducing heuristics for the Consensus approach
  7.1 Introduction
  7.2 Motivation
  7.3 Rule learning
    7.3.1 Inputs and Outputs
    7.3.2 Algorithm basics
    7.3.3 Relational rule learning
    7.3.4 ACE
  7.4 Experimental results
    7.4.1 Rule learning inputs
    7.4.2 Rule learning outputs
    7.4.3 Discussion
  7.5 Testing the heuristics on real data examples
  7.6 Summary

8 Conclusions
  8.1 Thesis contributions
    8.1.1 Incorporation of prior knowledge
    8.1.2 Combining multiple microarray datasets
    8.1.3 Incorporating dataset selection or weighting based on reliability
  8.2 Limitations
  8.3 Further work
    8.3.1 Extension of modelling techniques


    8.3.2 Optimising the weighting influence of prior knowledge
    8.3.3 Applicability of Consensus Bayesian networks to diverse microarray studies
    8.3.4 Combining prior knowledge and multiple microarray datasets
    8.3.5 New expression data technology

A Additional tables and results
  A.1 Chapter 6 additional tables and results
  A.2 Chapter 7 additional tables and results
    A.2.1 Training examples for learning
    A.2.2 Learnt heuristic rules
      A.2.2.1 Rules for 3 network inputs
      A.2.2.2 Rules for 4 network inputs
      A.2.2.3 Rules for 5 or 6 network inputs
    A.2.3 Test case results
      A.2.3.1 Yeast: heat-stress network
      A.2.3.2 Yeast: cell-cycle network
      A.2.3.3 E. coli: SOS response network

Glossary

References

List of Figures

1.1 A simple gene regulatory network model

2.1 The process of gene expression
2.2 Regulation of gene expression: a TF binds to a promoter region of DNA, activating the expression of nearby genes
2.3 A gene regulatory network. Part (a) shows the network with TFs (denoted by a p following the gene name), influencing target genes. Part (b) shows the network with interacting genes only
2.4 Complementary binding of DNA strands. The base components of DNA strands are referred to as C, A, G and T. Binding occurs between the C and G parts and the A and T parts
2.5 The process of a DNA microarray experiment. Adapted from Khan et al. (2002)
2.6 Distribution of log ratio values for each array (a) pre- and (b) post-scale normalisation. In this case, scale normalisation adjusts the interquartile ranges of intensity log ratios so they are equal by array
2.7 Ratio-Intensity plot
2.8 Expression profiles for a single gene, for a set of healthy samples and a set of diseased samples (each point on the x-axis refers to a different array — arrays for healthy samples are in blue and arrays for diseased samples are in red; the arrays are temporally related)
2.9 Regulatory relationships shown in expression profiles
2.10 A simple Bayesian network over 4 variables. (a) shows the DAG component whilst (b) shows the conditional probability distribution attached to the node Gene4


3.1 DAG component of a simple Bayesian network over 4 random discrete-valued gene variables. This example network is adapted from Murphy (2001b)
3.2 Conditional independence in Bayesian networks: this figure shows the DAG component of a Bayesian network. The nodes W, X and Y are conditionally independent given the value of Z
3.3 PDAG and an alternative DAG representing the equivalence class and conditional independencies amongst nodes of the BN shown in Fig 3.1
3.4 (a) shows a standard gene regulatory network, involving a TF parent node to target genes Xi. (b) is a modified version of the same network, but with an additional parent node representing a second regulating gene
3.5 A more complex DAG structure representing a regulatory structure between a subset of yeast genes
3.6 A simple DBN
3.7 Cyclic behaviour represented in a DBN. This figure is adapted from Kim et al. (2003)
3.8 A bootstrapped network is actually a matrix of confidence level estimates for each possible edge in the network. (a) shows the network graph representation, (b) shows the matrix representation (with larger variable sets, this is an easier way to view the network), and (c) shows the PDAG extracted from the bootstrapped network, when the confidences are thresholded at 0.45
3.9 An example of a ROC curve representing the evaluation of a bootstrapped network

4.1 (a) shows the edge prior probabilities for a set of nodes whilst (b) shows an example network structure S1
4.2 For the fitted normal distribution of correlation values for yeast genes (parameters μ̂ = 0.002 and σ̂ = 0.02), the p-value for each correlation value, based on the fitted normal distribution, is plotted


4.3 Comparison of Area Under the ROC Curve (AUC) values for each yeast network generated with a different prior weight from 0 (no prior) to 1 (fully weighted prior)
4.4 Comparison of expression value prediction between the yeast networks learnt with prior weight 0 and 0.6 for the TF genes
4.5 Yeast networks, learnt with prior weights 0 (expression data only) and 0.8 respectively. Edges are shown if they have confidence greater than 0.2. Solid edges represent regulatory relationships confirmed in the Yeastract database. Gray edges indicate edges that are not documented
4.6 Comparison of AUC values for each E. coli network generated with a different prior weight from 0 (no prior) to 1 (fully weighted prior)
4.7 Comparison of expression value prediction between the E. coli networks learnt with prior weight 0 and 1 for the TF genes
4.8 E. coli networks, learnt with prior weights 0 (expression data only) and 0.8 respectively. Edges are shown if they have confidence greater than 0.2. Solid edges represent regulatory relationships confirmed in the RegulonDB database. Gray edges indicate edges that are not documented
4.9 Each network shows the links to and from the gene CCNH, a central node of the cell-cycle network. Nodes that are outlined in black (not gray) are confirmed targets of CCNH. As the prior weight increases, we can see that the number of confirmed targets also increases
4.10 Human network for prior weights from 0 (expression data only) and 0.4. Each edge has a confidence greater than or equal to 0.2 (edges with lower confidence are not included). Blue edges have a high prior probability (> 0.7) in the literature correlation matrix. Red edges have a confidence greater than 0.2 in the expression data only network. Black edges have support in both the literature correlation matrix and the expression data only network


5.1 Synthetic regulatory networks. (a) True network. (b) Synthetic consensus network. Edges are shaded or marked according to robustness (i.e. their consensus threshold c) — bold edges obtain a high consensus threshold (c ≥ 0.75). Bold and dashed edges have 0.50 ≤ c < 0.75, whereas the dashed (only) edges have c ≤ 0.25
5.2 AUC performance of learnt synthetic networks
5.3 E. coli SOS response transcriptional module: true network structure
5.4 AUC performance of learnt E. coli networks
5.5 E. coli consensus network generated from 2 datasets (0.1 bootstrap threshold) — 1.0 consensus threshold (all edges appear in both datasets)
5.6 Yeast heat-stress regulatory network: true network structure
5.7 AUC performance of learnt yeast networks
5.8 Yeast consensus network generated from all datasets (0.1 bootstrap threshold) — 0.8 consensus threshold

6.1 Consensus Bootstrapped Bayesian networks. On the top line are the three input bootstrapped networks, displayed as matrices of edge input confidences between node pairs. The bottom line shows the Consensus Bootstrapped BNs, also displayed as matrices of edge consensus confidences, for three consensus thresholds (from 1/n to 1 at steps of 1/n where n = 3). On the right, the set C of ordered input confidences from the input networks for the edge from node 1 to 3 is shown, and the table associates the consensus confidence for each consensus threshold


6.2 Weighted Consensus networks: deciding the consensus confidence for a single edge. The input confidences from five input networks are ordered descendingly, paired with the network weights that range from 0.1 to 0.4. The consensus confidences associated with a range of consensus thresholds from 10% to 100% (0.1 to 1.0) are indicated. For example, the consensus confidence 0.24 is associated with the lowest consensus threshold of 10%, whilst the consensus confidence 0.1 is associated with the highest consensus threshold of 100%
6.3 Correlation between network median predictive accuracy and AUC for a collection of synthetic datasets. The synthetic datasets were generated based on the same true network, using differential equations and with different levels of added noise (see Section 6.4.1 for more details)
6.4 Adding Gaussian noise to a clean dataset
6.5 Synthetic results: across all network sets, a comparison of the maximum AUC and the length of consensus threshold interval for which this is achieved
6.6 Synthetic networks: approaches performance comparison for a selection of individual network sets. The x-axis indicates the consensus threshold (between 0-100%) whilst the y-axis indicates the network's AUC value
6.7 Prediction-based selection. (a) shows boxplots of the distribution of x, the highest median prediction accuracy of the networks that have not been selected by n, the optimum number of selected networks, whilst (b) shows boxplots of the distribution of the lowest median prediction accuracy of the networks that have been selected by n, the optimum number of selected networks
6.8 Yeast heat stress network: comparison of approaches by AUC


7.1 Patterns in input network predictive accuracies for synthetic cases 1-6. The y-axis represents the network's median predictive accuracy. Black lines indicate the predictive accuracies for input networks. The red line shows a predictive accuracy of 0.33
7.2 Learning heuristics for network selection: example output rule
7.3 General to specific search for rules
7.4 Learning heuristics for network selection: example output rule, represented using FOPL

A.1 Test case: yeast heat-stress network AUC plot
A.2 Test case: yeast cell-cycle network AUC plot
A.3 Test case: E. coli SOS-response network AUC plot

List of Tables

3.1 Conditional probability tables for each node in the DAG shown in Figure 3.1. Note that G1=Gene1, G2=Gene2, G3=Gene3 and G4=Gene4
3.2 Example CPT for target gene TYE7 (from Figure 3.5), that is controlled by two TFs, RPN4 and HSF1
3.3 Comparison between the learnt network and the true network in terms of true and false positives and negatives

4.1 Yeast expression value prediction results. This table compares the expression data only (prior weight 0) network with each posterior network (prior weights 0.2-1), through the average TF prediction accuracies and the significance of the CMH-test
4.2 E. coli expression value prediction results. This table compares the expression data only (prior weight 0) network with each posterior network (prior weights 0.2-1), through the average TF prediction accuracies and the significance of the CMH-test

5.1 Summary of synthetic datasets
5.2 Summary of E. coli datasets
5.3 Summary of yeast datasets

6.1 Synthetic results: statistical significance performance comparison between approaches. Using a paired t-test, this table shows the p-value for whether Approach 1 significantly outperforms Approach 2
6.2 Yeast heat stress datasets: prediction-based reliability measures


7.1 Learning heuristics for network selection: inputs to the rule learning algorithm
7.2 Learning heuristics for network selection: FOPL inputs to the rule learning algorithm
7.3 Learning heuristics: background knowledge features
7.4 Learning heuristics: test case outcomes

A.1 Collections of input networks for synthetic datasets (generated by differential equations). This table provides details of the Gaussian variance added to each of the datasets for each set of input networks. The 'collective reliability' of a network set is measured by the median of the median predictive accuracies for the networks generated from each dataset. Note that in general, the Gaussian noise applied across the datasets increases as the collective reliability level of the network set decreases
A.2 Synthetic networks: approaches performance comparison summary
A.3 Method comparison for prediction-based selection
A.4 Learning heuristics: training examples and class labels
A.5 Summary of yeast cell-cycle datasets

Chapter 1

Introduction

Over the past couple of decades the face of biological research has changed from time-consuming, relatively small-scale experiments (the 'white coat' era) to automated experiments that generate reams and reams of data. Biology-related datasets are now more diverse, originating from different sources (for example, experiments and textual sources) and are massively larger due to an increase in the speed of data capture (through automated experimental procedures and increased paper publication rates) and the size of data storage. Bioinformatics — which focuses on the analysis of biological data using computers — has grown out of the need to effectively analyse such data. Its emphasis is on developing robust and efficient data analysis techniques and algorithms to extract more knowledge from these large and complex sets of experimental data.

A current 'hot topic' in Bioinformatics research is the modelling of Gene Regulatory Networks (GRNs). GRNs describe how genes interact in various cellular processes. A GRN model indicates which genes affect the activity of other genes, for example gene A influences gene B and then in turn gene B may influence gene C. Constructing network models that represent how genes interact can provide insight into genetic diseases. For example, a number of human genetic disorders (e.g. muscular dystrophy) result from the absence or malfunction of certain genes, which can cause a disruption in the usual patterns of gene interactions in some cells. Therefore, building gene network models can help scientists understand and gain insight into genetic conditions, for example by highlighting particular genes of interest for further investigation.

The underlying theme of the research presented in this thesis is the incorporation of multiple sources and different types of data into techniques for reverse-engineering, or learning, GRN models. By drawing on many data sources, the resulting network models should be more robust, accurate and reliable than models that have been learnt using a single data source.

In the remainder of this introductory chapter, the motivations, aims and contributions of this thesis are presented fully. An overview of GRNs and the limitations of data sources and current modelling techniques is presented in Section 1.1. Following this, Section 1.2 sets out the thesis aims and research questions. The thesis contributions are listed in Section 1.3. Finally, the structure of the remainder of the thesis is outlined in Section 1.4.

1.1 Reverse-engineering gene regulatory network models

The ‘central dogma of molecular biology’ relates to the process of gene expression, which is the key mechanism underlying GRNs. Genetic information for cellular organisms is stored in their genome, which contains segments of DNA that encode genes. Gene expression is a process whereby genes are 'activated' and translated to proteins, in order to perform processes and functions in a cell. Since proteins are involved in every cellular process from metabolism to muscle development, it is gene expression that allows all cellular processes to occur. Although the same genomic DNA is present (with some exceptions) in every cell in an organism, genes are expressed differently in different cell types, or under different environmental conditions and for different purposes. For example, a yeast cell in a sugar solution will turn on genes to make proteins that process the sugar to alcohol (Vohradsky, 2001), whilst in animals, genes that encode muscle proteins are expressed only in muscle cells and not in the cells of the brain (Twyman, 2003). It is these differences that make gene expression an important topic of research to biologists.


Figure 1.1: A simple gene regulatory network model (Gene1 → Gene2, Gene1 → Gene3, Gene2 → Gene4, Gene3 → Gene4)

A gene becomes expressed when it is activated by a special type of protein called a transcription factor binding to a segment of nearby DNA. Since proteins themselves are created by the gene expression process, this means that the expression of one gene can affect, or regulate, the expression of other genes. A GRN can be formed when we consider how genes interact together in this way. A simple example of a GRN model is shown in Fig 1.1. Here, the expression of Gene1 influences the expression of Gene2 and Gene3 by the production of transcription factor proteins that activate their expression. In turn, the expression of Gene2 and Gene3 influences the expression of Gene4 in the same way.

DNA microarrays are an experimental technique that allows the expression of thousands of genes to be measured simultaneously. They were first used to measure global gene expression, across the yeast genome, in 1997 (DeRisi et al., 1997). Since then, microarray technology has become increasingly popular and is used in almost every biological research group, which has led to many publicly available datasets of gene expression measurements for a range of organisms across various experimental conditions. Since expression measurements across a set of genes can provide an indication of regulatory relationships between genes, the increasing availability of microarray data has led to an explosion of research in the reverse-engineering of GRN models based on microarray-generated data.

Many data analysis techniques have been proposed to gain insight into gene regulatory relationships, based on microarray gene expression data. At the most basic level, clustering techniques allow groups of co-regulated genes to be discovered, and this is sometimes used as a basis for learning a GRN model, since genes that exhibit similar or correlated expression patterns are likely to be regulated by a common set of genes. However, basic analysis techniques such as clustering are not able to reveal the more complex structure of the gene regulation process. GRNs can contain nonlinear relationships and combinatorial regulatory control (where a group of genes may interact together to regulate another set of genes). Thus a group of more complex analysis techniques for reverse-engineering GRN models have been applied to and/or developed for the task. This thesis focuses on one technique in particular, Bayesian networks (Pearl, 1991). Bayesian networks have become a popular and successful method for the reverse-engineering of GRNs from expression data (Friedman et al., 2000; Pe'er et al., 2006; Segal et al., 2003a) since they are able to represent the network qualitatively (with a network graph) and quantitatively (probability distributions quantify the strength of influences and dependencies between nodes/variables in the network graph), and thus are relatively easy to interpret by non-technical people.

Whilst the microarray provides the most available genome-wide data source on gene expression, it has a number of limitations as a data source for the modelling of GRNs. Microarray data is subject to both natural biological variations across samples (biological noise) and experimental noise, which may be introduced throughout the stages of the experiment (e.g. sample preparation) (Causton et al., 2003). In addition, a key issue with microarray datasets is often referred to as the curse of dimensionality (Somorjai et al., 2003). A single microarray dataset usually contains a large number of genes (commonly thousands) but the number of samples is much lower, which can make it very difficult to extract reliable regulatory interactions from a single dataset.

There are also concerns over the reproducibility of results across microarray platforms or laboratories (MAQC consortium, 2006; Tan et al., 2003), which has led to questions over the reliability of microarray gene expression data. Microarray expression datasets often come from different microarray platforms (which measure gene expression using different methods). This can mean that the data can contain different biases and it can be difficult, or sometimes impossible, to compare the datasets since measurement units may vary. Additionally, studies come from different laboratories, meaning that data is collected with different measurement biases under different atmospheric conditions.

1.2 Thesis aims

The previous section introduced GRNs and motivated the use of data analysis techniques for reverse-engineering GRN models. It also discussed the reliability issues that surround microarray gene expression data, which is the most available data source on gene expression levels and so is frequently used to build GRN models. The objective of the research presented in this thesis is to build GRN models that have enhanced performance, based on a richer and/or broader collection of data than a single microarray dataset. To achieve this, this thesis presents two main strands of research, which are described next.

First, it is proposed that the drawbacks of using only microarray data to reconstruct GRNs can be alleviated by incorporating other complementary data sources into the modelling process. This thesis investigates whether text-based knowledge from the body of scientific literature, when integrated into the reverse-engineering process as prior knowledge for Bayesian network models, can improve the resulting GRN model over the use of microarray data only. Previous research in the use of prior knowledge from text-based sources in GRN modelling has focused on using online biological databases. These databases rely on the addition of manual annotations, and keeping reliable and up-to-date information within them is challenging as the publication rate of scientific papers continues to increase rapidly. The research presented in this thesis aims to improve on this with the use of advanced text-mining techniques to harness prior knowledge directly from the body of scientific literature and integrate this into the model learning process.

The second strand of research aims to take advantage of multiple publicly available microarray gene expression datasets that have been generated in similar biological studies. This thesis addresses the question of whether the use of multiple microarray datasets to reverse-engineer a GRN model can produce an improvement over the use of a single dataset. In addition, a comparison is made of two different approaches for utilising multiple datasets with Bayesian network models: pre- and post-learning aggregation. When learning from multiple datasets, there is a choice of when to aggregate knowledge within the datasets. In pre-learning aggregation, data is combined prior to learning, and a model is learnt from the combined dataset. In post-learning aggregation, individual models are learnt from each dataset, and these are combined after learning. Whilst pre-learning aggregation is simpler, post-learning aggregation approaches have the advantage that they are more suitable for combining microarray expression datasets generated by different platforms or in different laboratories, since they do not necessarily require normalisation of the datasets, which can be complicated on cross-platform microarray datasets.

Further to this, this thesis also addresses whether taking into account the dataset quality when learning from multiple data sources, through dataset selection or weighting, can improve the final model. Different methods are considered for assessing the quality of datasets and networks. An empirical evaluation investigates whether weighting the influence of each dataset can improve the final GRN model, and compares weighting datasets against simply excluding the least reliable datasets.

1.3 Thesis contributions

The key contributions of this thesis are outlined below:

• The incorporation of prior knowledge from the whole body of literature for Bayesian network based GRN models.
This thesis presents some of the first research in the incorporation of prior knowledge that has been generated from a large body of scientific papers, for Bayesian network based GRN models. The use of advanced text-mining techniques means information contained in a huge number of documents can be represented in a simple format. This thesis presents a method for including this information as a prior probability distribution over candidate Bayesian network GRN models. An empirical evaluation shows that the use of literature-based prior knowledge can improve both the number of true regulatory interactions present and the predictive performance of the learnt network model, in comparison to a network that has been learnt solely from expression data.

• Two novel post-learning aggregation approaches for generating Bayesian network based GRN models from multiple microarray gene expression datasets.
Post-learning aggregation is a new approach for using multiple microarray datasets to produce a Bayesian network based GRN model. Each of the methods is based on aggregating high-level features of Bayesian network models that have been generated from different microarray gene expression datasets. Bayesian network meta-analysis is based on combining statistical confidences attached to network features, whilst Consensus Bayesian networks identify consistent network features that exist across all datasets. An empirical evaluation demonstrates that both methods can produce GRN models that improve on models learnt from a single dataset or by the use of a pre-learning aggregation approach.

• An investigation on the effect of dataset reliability on the resulting GRN models.
This thesis presents further development of the Consensus Bayesian networks approach, to incorporate weighting of the input microarray datasets based on their reliability or quality. Empirical evaluation results demonstrate that discarding unreliable datasets provides a more consistent improvement in GRN model performance than using weighting. Additionally, the thesis presents heuristics, induced using machine learning, for selecting which datasets to use based on their reliability measures.

1.4 Thesis outline

The remainder of the thesis is structured as follows:

• Chapter 2 forms the first part of the literature review in this thesis. The biological background (gene expression and regulation, GRNs and microarray technology) is presented, followed by a discussion on the reverse-engineering of GRN models. Further data types (additional to microarray gene expression data) are also introduced.

• Chapter 3 focuses on Bayesian networks and how they can be used to model GRNs. The second part of the literature review is included here in a thorough discussion of the state of the art in using Bayesian networks to reverse-engineer GRNs.

• Chapter 4 presents the first key contribution — the use of literature-based prior knowledge to improve Bayesian network based GRN models. This work has been submitted as a journal publication to Bioinformatics and is currently in the second cycle of the review process.

• Chapter 5 presents the second key contribution — the use of multiple microarray gene expression datasets to produce Bayesian network based GRN models, using two novel post-learning aggregation approaches. Some of the work in this chapter has been published in (Peeling & Tucker, 2007; Steele & Tucker, 2008).

• Chapter 6 presents the third key contribution — an investigation on the effect of dataset reliability on the resulting GRN models. In this chapter, the post-learning aggregation method of Consensus Bayesian networks is developed further to allow weighting of each dataset, and an empirical evaluation compares dataset weighting with dataset selection. Part of this work has been submitted as a conference publication for IDA 2009.

• Chapter 7 extends the work in the previous chapter by presenting the use of machine learning to induce heuristics for selecting which datasets to use in Consensus Bayesian networks, based on their reliability measures.

• Finally, Chapter 8 presents the thesis conclusions, sets out the contributions and discusses potential avenues for further research.

Chapter 2

Gene Regulatory Networks

Gene Regulatory Networks (GRNs) describe how the expression of a gene — its activity level — can regulate the expression of other genes. Building network models that represent how genes interact in this way is a topic of particular current interest in Bioinformatics. This chapter forms the first part of the literature review in this thesis. GRNs are introduced together with the types of data that can be used to reverse-engineer GRN models. Section 2.1 explains gene expression and regulation, which are the underlying processes that form GRNs. DNA microarrays are the main tool currently available for measuring gene expression levels. These are introduced in Section 2.2, together with a discussion of microarray data analysis and its drawbacks. In Section 2.3 we discuss other types of data that can provide information on gene expression and regulation. Finally, in Section 2.4 the main points of the chapter are summarised, with particular focus on the motivation for the contributions of the thesis.

2.1 Regulation of gene expression

Genetic information for cellular organisms is stored in their genome, which contains segments of DNA that encode genes (Causton et al., 2003). Gene expression (see Fig. 2.1) is the process by which genes are transcribed, where the gene is copied into what is known as messenger RNA (mRNA), which is essentially a duplicate of the information carried by the gene on the DNA. After transcription, the mRNA is translated to form proteins, which are the main building blocks and functional molecules of a living cell. Proteins are involved in every cellular process from metabolism to muscle development. Therefore it is gene expression that allows all cellular processes to occur. The same genomic DNA is present (with some exceptions) in every cell in an organism. However, all cells are not the same, since genes are expressed differently in different cell types, as described earlier in Section 1.1. It is these differences that make gene expression an important topic of research to biologists.

Figure 2.1: The process of gene expression (Gene → [transcription] → mRNA → [translation] → Protein)

Regulatory interactions between genes concern the regulation of gene expression. Genes are encoded in DNA strands and each gene is surrounded by DNA sequences that control its expression. The expression of a gene is initiated when transcription factors (TFs) bind to these DNA segments (also known as promoter sites) and activate nearby genes, causing them to be expressed. TFs also control expression by repressing (deactivating) genes in the same way. Figure 2.2 shows a simplified version of the process of gene regulation. TFs are so-called as they are factors that regulate the process of transcription (the first step in gene expression). They themselves are usually proteins, which are produced during gene expression processes. This means that we can form a hierarchy of regulatory interactions, where genes and proteins can be linked (beginning with TFs that are present in the egg at the beginning of development) (Twyman, 2003). At the most basic level, regulatory interactions occur where TFs activate (turn on) or repress (turn off) the expression of certain genes, or a subset of genes. More complex interactions involve feedback loops: genes can regulate themselves when a TF controls the expression of the same gene from which it was produced.

Figure 2.2: Regulation of gene expression: a TF binds to a promoter region of DNA, activating the expression of nearby genes

A Gene Regulatory Network (GRN) can be formed when we consider how genes interact together. Since TFs themselves are gene products (proteins), we can consider a gene to interact with other genes through the gene expression and regulation process. For example, gene Y may be activated by a protein that is produced when gene X is expressed. This can be represented in a network format by gene X influencing gene Y, X → Y. TFs can control the expression of many genes (their targets), and in turn each gene may be regulated by multiple TFs, sometimes acting in combination or under different conditions. Thus, we can talk about regulatory networks where each gene may interact with both regulators (TFs) and targets. Figure 2.3 shows a GRN (for a subset of yeast genes) in two formats. Figure 2.3a shows the network with TFs influencing target genes. The TFs, which are proteins produced by an expressed gene, are denoted by a rectangle node and the gene name¹ followed by a p. Figure 2.3b shows the network with interacting genes only. Here, the TF nodes are removed and merged with the gene nodes. For example, since the TF HSF1p regulates gene RPN4, there is an interaction between the genes HSF1 and RPN4, since a product of HSF1 (when it is expressed) regulates RPN4. There are also a number of feedback loops, where the protein produced by an expressed gene can also regulate that gene. For this reason, throughout this thesis regulator genes (those that activate or repress expression) are often referred to as TFs themselves, although technically the TF is the protein produced when the gene is expressed.

¹ A gene name is often three characters followed by a number.

Figure 2.3: A gene regulatory network. Part (a) shows the network with TFs (denoted by a p following the gene name), influencing target genes. Part (b) shows the network with interacting genes only.

Understanding the regulation of gene expression is important to biologists for a number of reasons. For example, a number of human diseases are genetic disorders (e.g. muscular dystrophy) that can result from the absence or malfunction of TFs, which disrupts the regulation of gene expression in some cells (Twyman, 2003). Wet lab experiments to investigate gene regulation interactions can be expensive and time-consuming. Because of this, building models to understand and gain insight into gene regulation is an increasingly popular topic of research, since such models can assist in highlighting TFs and genes of interest for further investigation. These models are usually based on measurements of gene expression (see section 2.2) and other available types of data (see Section 2.3).
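To make the network representation concrete, the following minimal Python sketch (not from the thesis) stores a GRN as an adjacency dictionary mapping each regulator gene to its target genes. Only the HSF1 → RPN4 edge is taken from the text above; the remaining edges are hypothetical placeholders, included purely to illustrate the structure.

```python
# A minimal sketch (not from the thesis): a GRN as a directed graph,
# stored as an adjacency dictionary of regulator -> target genes.
# Only HSF1 -> RPN4 is documented in the text above; the other edges
# are hypothetical, added to illustrate targets and feedback loops.
grn = {
    "HSF1": ["RPN4"],          # documented: HSF1p regulates RPN4
    "RPN4": ["RPN4"],          # hypothetical self-regulating feedback loop
    "SKN7": ["ROX1", "SFL1"],  # hypothetical edges
}

def targets_of(gene: str) -> list[str]:
    """Genes whose expression is influenced by the given regulator."""
    return grn.get(gene, [])

print(targets_of("HSF1"))  # -> ['RPN4']
```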

2.2 Microarray technology

DNA microarrays are an experimental technique that allows the expression of thousands of genes to be measured simultaneously. They were first used to measure global gene expression, across the yeast genome, in 1997 (DeRisi et al., 1997). Since then, microarray technology has become increasingly popular and now provides a key source of experimental data for modelling gene regulatory interactions from expression levels. In this section we introduce the microarray and discuss a number of issues with the accuracy and reliability of microarray expression data.

2.2.1 The basics: data collection

A DNA microarray is a glass or polymer slide to which DNA molecules are attached. Such a slide may also be referred to as a DNA array, a DNA chip or a gene chip. The attached molecules are usually referred to as spots or features. A single microarray slide may contain tens of thousands of spots. Microarray slides can be produced quickly and efficiently in an automated fashion, for example by the use of inkjet printing.

A DNA microarray can measure a gene's expression level at a particular time by measuring the abundance of mRNA molecules in a cell in a sample taken at that time. Recall from Section 2.1 that when a gene is expressed, it is first transcribed into mRNA, known as a transcript. To understand how a DNA microarray measures the abundance of mRNA, it is first useful to understand that a gene's transcript (the mRNA produced during expression) is a complementary copy of the DNA segment that encodes that gene. Complementary strands of DNA or mRNA tend to hybridise or bind together to form a single, double-stranded molecule (see Fig. 2.4). This means that a gene and its transcript are likely to bind. Note that DNA/mRNA strands that are not fully complementary may also bind — in general, the greater the complementarity, the stronger the binding.

Figure 2.4: Complementary binding of DNA strands (e.g. GATTC binds to CTAAG). The base components of DNA strands are referred to as C, A, G and T. Binding occurs between the C and G parts and the A and T parts.
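As a concrete illustration of the pairing rule in Figure 2.4, here is a minimal Python sketch (not from the thesis) that computes the complementary strand base by base:

```python
# C pairs with G and A pairs with T, as described above.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the complementary strand, base by base."""
    return "".join(COMPLEMENT[base] for base in strand)

print(complement("GATTC"))  # -> CTAAG, as in Figure 2.4
```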

In gene expression studies, each spot on the microarray usually encodes a single gene (in practice, however, it can be difficult to identify every gene unambiguously due to similarities within gene families). There are different types of microarrays that measure gene expression levels in different ways (often these different types are referred to as different platforms). The most popular and commonly used type is the two-channel hybridised array, which compares the gene expression levels in cells in two different samples that are collected under different conditions. Figure 2.5 shows the experimental process for two-channel arrays in diagrammatic form.

Usually, one of the samples will be a control sample of cells and the other will measure the gene expression levels in the same cells or cell type under a certain condition. mRNA is extracted from both samples and is labelled differently according to which sample it is taken from. Typically, this is done by applying red dye to one sample and green dye to the other. Then both samples are washed over the slide. Since mRNA binds to its complementary DNA, mRNA molecules in the sample will hybridise to spots on the array that represent their complementary sections of DNA. This means that for genes that are expressed, mRNA will bind to those spots on the array. Note that binding may also take place between mRNA and genes that are not fully complementary.

To measure the abundance of mRNA that has hybridised to each spot on the array, an image of the array is produced using a laser to detect the amount of fluorescence emitted by the dye-labelled mRNA at each spot. For a two-channel array, the image would be scanned twice to measure the amount of fluorescence for each dye label. For example, suppose that samples 1 and 2 are labelled with red and green dyes respectively. Then if mRNA from sample 1 is present, the spot will fluoresce red, and if mRNA from sample 2 is present it will fluoresce green. If mRNA from both samples is present, the spot will appear yellow. If neither is present then the spot will not fluoresce and will appear black. In this way, relative expression levels for each gene on the array (represented by a spot) can be estimated using the array image. The further image and data processing is covered later in Section 2.2.2.

Other microarray platforms also use the same principle of complementary binding of DNA/mRNA. However, the estimation of expression levels may differ. For example, in single-channel arrays (e.g. Affymetrix GeneChip), an absolute expression value, rather than a relative value, is recorded for each spot. In this case, only one sample is collected for each array.

Figure 2.5: The process of a DNA microarray experiment (reference and query samples → mRNA extraction → labelling with different dyes → hybridisation to array → scanned array image). Adapted from Khan et al. (2002)

For these types of experiments, multiple experiments and arrays are necessary to compare expression levels in different samples under different conditions.

2.2.2 Data processing

Section 2.2.1 explained the first major part of a microarray-based experiment: data collection. In this step, biological samples are prepared, and this is followed by the extraction and labelling of mRNA from the samples. Then, the extracted and labelled mRNA is washed over the array and hybridisation takes place. The final step in the data collection is the scanning of the array by a laser to excite fluorescence from the dye labels, producing an image of the microarray. In this section we continue by describing the second part of a microarray-based experiment — the data processing to produce expression level measurements for the genes represented on the array, which is a task with a number of parts. In the first step (see Section 2.2.2.1) the microarray image is processed to produce values that represent the fluorescence intensities of each spot. Then, an expression measurement must be inferred from the measured spot intensity (see Section 2.2.2.2). Finally (see Section 2.2.2.3), it is necessary to normalise these expression measurements within the array, and if it is a set of array experiments, across the set of arrays as well.

2.2.2.1 Image processing

The image processing step extracts fluorescence intensities for each spot on the array, based on a digital image of the scanned array. Most commercially available microarray scanners provide software to do this automatically. However, it is important to understand how the image processing is carried out and its impact on the resulting data (Causton et al., 2003).

The main task in microarray image processing (Yang et al., 2001) is to identify the spots on the array. This can be difficult due to artefacts or contaminants on the slide (e.g. scratches, dye or dust). However, since the spots are printed onto the array slide in a regular arrangement, a process known as gridding assists in locating each spot. Usually the user is required to specify approximate locations of subgrids on the array, which are used to place initial grids over the array image. The placement of the grids is then improved to best represent the spots, often using a centre-of-mass calculation for each spot. Grid placement is very important, since if it is aligned incorrectly, expression levels may be linked to the wrong genes. To avoid this, many arrays contain 'landing light' features. These are spots on the array that will always fluoresce strongly, so they are easy to identify and subsequently assist with grid placement. Once the grid is in place, the spot area and the background are determined. Typically, this is done by using a fixed region centred on the centre-of-mass, or by actually identifying the spot boundary. The latter method can provide a better estimation of the fluorescence intensity, but is computationally more difficult.
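As an illustration of the centre-of-mass calculation mentioned above, a minimal Python/NumPy sketch (not from the thesis) for a small patch of pixel intensities around a candidate spot:

```python
import numpy as np

def spot_centre_of_mass(patch: np.ndarray) -> tuple[float, float]:
    """Intensity-weighted centroid (row, col) of a 2-D pixel patch."""
    rows, cols = np.indices(patch.shape)
    total = patch.sum()
    return ((rows * patch).sum() / total,
            (cols * patch).sum() / total)

# A bright 'spot' slightly below-right of the patch centre:
patch = np.zeros((5, 5))
patch[3, 3] = 10.0
print(spot_centre_of_mass(patch))  # -> (3.0, 3.0)
```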

2.2.2.2 Expression ratios

Once spot locations are determined, expression levels need to be inferred based on the spot fluorescence intensities (Quackenbush, 2001). For each spot a set of summary statistics can be reported, including the mean, median and mode of the pixel intensity distribution as well as the total intensity of the spot. Usually, one of these statistics is used to represent the spot — a popular choice is the background-subtracted median (which is the median of the spot intensity with the median of the background intensity subtracted) or total intensity of the spot. Recall that a two-channel array will have two samples applied to the slide, labelled with different dyes. In this case the ratio of the intensities for each dye label is calculated. Ratios are useful because they allow us to measure the relative change in a gene between two conditions. Typically in two-channel array experiments, one sample will consist of cells from a reference or control condition and the other sample will contain cells under the condition of interest (the query sample). If the intensity from the reference sample label for the jth gene is Rj, and the intensity from the query sample is Qj, then the expression ratio Mj is:

Mj = Qj / Rj

A ratio is a useful way to measure changes in expression, as those genes that do not have a change in their expression between the two conditions will have a ratio of 1. However, ratios can be problematic for the following reason. If a gene has a two-fold increase in expression in the query sample compared to the reference sample, the expression ratio will be 2. However, if a gene has a two-fold decrease in expression in the query sample compared to the reference sample, the expression ratio will be 0.5. The scale for expression increases is different to that for expression decreases. The most common way of dealing with this is to apply a logarithmic transformation (at base 2) to the ratio, as it creates an even scale for the ratios: log2(2) = 1, log2(1) = 0 and log2(0.5) = −1. The expression measurements transformed in this way are often referred to as intensity log ratios or expression levels.
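A minimal Python sketch (not from the thesis) of the log2 transformation, showing how fold increases and decreases become symmetric around 0:

```python
import math

def intensity_log_ratio(query: float, reference: float) -> float:
    """Expression level: log2 of the query/reference intensity ratio."""
    return math.log2(query / reference)

print(intensity_log_ratio(2.0, 1.0))  # two-fold increase -> +1.0
print(intensity_log_ratio(0.5, 1.0))  # two-fold decrease -> -1.0
print(intensity_log_ratio(1.0, 1.0))  # no change         ->  0.0
```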

2.2.2.3 Normalisation

Normalisation is the transformation of expression levels (intensity log ratios) from microarray data to adjust for systematic variations (arising from variation in the technology rather than true biological variations) so that measurements from two different samples can be directly compared. There are a number of reasons why data from two arrays may not be comparable. For example, there may be differences in the way samples are labelled, the total amount of mRNA extracted, or changes in the photomultiplier tube settings of the scanner (Smyth & Speed, 2003). Consider Fig. 2.6a, which shows unnormalised log ratio data box-plotted for a number of arrays. We see how the interquartile range of values can differ across the arrays. The aim of normalisation is to adjust this so that the log ratios are comparable across multiple arrays. The most common and simple method for normalisation is to apply a normalisation factor L to the log ratio data as follows:

M'_j = M_j − L

where M_j is the log ratio of the jth gene, and M'_j is the normalised log ratio. This is a simple scaling procedure, sometimes referred to as scale normalisation, where the data range is adjusted by a constant factor across all spots. There are many methods for calculating the normalisation factor L. Often, it can be the mean or median of the log ratios across all genes, or a subset of genes, on the array. Where a subset of genes is used, these are typically ‘housekeeping’ genes


Figure 2.6: Distribution of log ratio values for each array (a) pre- and (b) post-scale normalisation. In this case, scale normalisation adjusts the interquartile ranges of intensity log ratios so they are equal across arrays.


— genes which are believed not to change in expression level. The basis of using such genes to control the normalisation factor is that their expression level will remain constant in all samples. Figure 2.6b shows the boxplot distribution of scale-normalised log ratio values (using a median normalisation factor) for the same arrays that are shown in Fig 2.6a. We can see that the median and quartile ranges are now comparable across the arrays.

Some other, more complex, methods for normalisation are based on the raw intensities (i.e. prior to the calculation of log ratios). For example, linear regression analysis is based on plotting the intensities for each sample in a two-channel experiment in a scatterplot. In theory, in data without systematic variations the plot of each spot should cluster around a straight line whose gradient is 1. Therefore, to normalise, a best-fit line is calculated for the scatterplot using regression techniques, and this is adjusted to fit around a line with a gradient of 1. Lowess normalisation (where Lowess stands for locally weighted linear regression) (Cleveland, 1979) is a technique applied to adjust for dye bias, which occurs when the relationship between spot intensity and gene expression is not the same for all dyes — for example, for a given concentration of mRNA, the intensity of red and green dyes differs (Wickham, 2004). This type of bias can be seen when constructing a ratio-intensity (R-I) plot, which plots the log product (base 10) of the intensities against the log ratio (base 2) of the intensities Qj and Rj for the query and reference samples (see Fig. 2.7). In unbiased data, the log ratios should be centred around 0. In Fig. 2.7 this does not occur and the plot shows that variation of the log ratio occurs as a function of the intensity (represented by the log product). In this particular example there is more variation as the intensity increases. Lowess normalisation involves applying a local weighted normalisation factor y(x_j). If we set x_j = log10(Qj × Rj) and y_j = log2(Qj/Rj), then the lowess factor y(x_j) captures the dependence of log2(Qj/Rj) on log10(Qj × Rj). We can use this function, point by point, to correct the log ratio values M_j so that (Quackenbush, 2002):

M'_j = M_j − y(x_j)

Similarly, Loess normalisation (Workman et al., 2002) is a nonlinear normalisation technique that can be used to adjust for a spatial bias of the two channels


Figure 2.7: Ratio-Intensity plot (log2(Qj/Rj) plotted against log10(Qj × Rj))

— where, for example, in some areas of the array slide one of the dyes may be more concentrated than the other.

In the research presented in this thesis, microarray samples from different datasets are used. These datasets are publicly available and found on online expression databases, and often have had some type of appropriate normalisation performed for the arrays within the dataset. Typically only the log ratios are available, and not the original intensities, which restricts the type of normalisation that can be performed. However, since the datasets are from different sources, further scale normalisation is usually essential in order to make the expression log ratios comparable across datasets. We use a scale normalisation technique — a simple adjustment of the log ratios from a series of arrays so that each array has the same median absolute deviation (Park & Wang, 2006). Each log ratio is transformed using the following formula:

M'_ij = (M_ij − median_i) / MAD_i

where M_ij is the log ratio of the jth gene in the ith array (since we deal with multiple arrays), and M'_ij is the normalised log ratio. median_i is the median of log ratios across all genes in the ith array and the median absolute deviation


MAD_i is defined as the median of absolute deviations from the median: MAD_i = median_j{|M_ij − median_i|}.
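To make this concrete, here is a minimal numpy sketch of this scale normalisation, assuming a matrix of log ratios whose rows are arrays and whose columns are genes (the values and layout are invented for illustration):

```python
import numpy as np

def scale_normalise(M):
    """Scale-normalise a matrix M of log ratios, where M[i, j] is the log
    ratio of gene j on array i: subtract each array's median and divide
    by its median absolute deviation (MAD)."""
    medians = np.median(M, axis=1, keepdims=True)                  # median_i
    mads = np.median(np.abs(M - medians), axis=1, keepdims=True)   # MAD_i
    return (M - medians) / mads

# Synthetic example: 3 arrays x 5 genes with different scales and offsets.
M = np.array([[0.1, 0.5, -0.3, 0.9, 0.2],
              [1.0, 2.5, -1.5, 3.0, 0.5],
              [0.2, 0.8, -0.6, 1.4, 0.3]])
M_norm = scale_normalise(M)
print(np.median(M_norm, axis=1))   # every array now has median 0
```

After the transformation each array has median 0 and MAD 1, so log ratios can be compared across arrays, and across datasets.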

2.2.3 Analysis of microarray expression data

After performing the microarray experiment to collect data, and processing the array slide to produce normalised expression levels for a set of genes, the next step is to use this expression data to discover interesting patterns and relationships amongst genes and between experimental conditions. This may include identifying groups of genes with similar patterns of expression or genes that have interesting expression patterns (e.g. they are active in certain samples but not in others). First, in Section 2.2.3.1 we describe the format of a microarray expression dataset: the gene expression matrix. Then, in Section 2.2.3.2 we provide an overview of some basic techniques in expression data analysis. Following this, in Section 2.2.3.3, we introduce a particular type of analysis — inferring gene regulatory relationships from expression data — which is the main focus of this thesis.

2.2.3.1 The gene expression matrix

Usually, to be able to perform useful analysis, we require expression levels for a set of genes over a set of different arrays, where each array represents a different experimental condition. This makes up the gene expression matrix, where each entry Mji is the intensity log ratio, which we refer to as the expression level, for gene j in the ith array. The columns of the matrix represent different arrays which we also refer to as samples. Typically, each sample represents a different experimental condition. The rows of the matrix represent each gene’s expression profile. This shows how the gene’s expression changes across the experimental conditions. In particular, if the samples are temporally related this shows how a gene’s expression changes over time under a particular environmental condition. Alternatively, some of the samples may come from healthy cells and some may come from diseased cells. In this case the samples are split into different classes (e.g. healthy and diseased) and we can view the difference between the gene expression profiles across the different classes. For example, Fig. 2.8 shows the


expression profiles for a single gene, for a set of healthy samples and a set of diseased samples. It can be seen that in diseased cells the expression of the gene increases significantly, indicating that it could be a gene of interest with respect to this particular disease.

Figure 2.8: Expression profiles for a single gene, for a set of healthy samples and a set of diseased samples (each point on the x-axis refers to a different array — arrays for healthy samples are in blue and arrays for diseased samples are in red; the arrays are temporally related).
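A gene expression matrix is straightforward to represent in code. The toy example below (gene names, sample labels and values are all invented) illustrates how rows correspond to expression profiles and columns to arrays:

```python
import numpy as np

# A toy gene expression matrix: rows are genes (expression profiles),
# columns are arrays/samples. Values are intensity log ratios.
genes = ["geneA", "geneB", "geneC"]
samples = ["healthy_1", "healthy_2", "diseased_1", "diseased_2"]
M = np.array([[ 0.05, -0.10,  0.70,  0.85],   # geneA: up in disease
              [ 0.40,  0.35,  0.38,  0.42],   # geneB: roughly constant
              [-0.02,  0.01, -0.60, -0.75]])  # geneC: down in disease

# The expression profile of geneA across all samples (a matrix row):
profile_A = M[genes.index("geneA"), :]
# All genes' expression in the first diseased sample (a matrix column):
diseased_1 = M[:, samples.index("diseased_1")]
print(profile_A, diseased_1)
```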

2.2.3.2 Basic analysis techniques

Some types of information that biologists can discover from a gene expression matrix using simple analysis techniques include the following:

• Differentially expressed genes These are genes which vary significantly in their expression level between two different states. It is usually these genes that are of interest in that particular experimental condition.

• Co-expressed genes These are genes that have similar or correlated expression patterns. Groups


of genes which have correlated expression profiles will often have similar biological functions and/or be regulated by a common set of transcription factors.

• Classification of genes or samples Where samples are labelled into classes, such as healthy and diseased, biologists may be particularly interested in which genes are expressed only in healthy samples and which are expressed only in diseased samples.

• Temporal analysis Where the arrays comprise a time-series of experiments for a particular environmental condition or biological process, biologists are interested in how a gene’s expression changes across the time-series.

Differentially expressed genes are those that vary significantly in expression between two states. In a two-channel microarray experiment, this is equivalent to finding genes whose (normalised) log ratio is significantly different from 0. The simplest method for finding these genes is to calculate the mean µ̂ and standard deviation σ̂ for the log ratio values of all genes in the array and calculate a confidence interval around µ̂. Genes with log ratio values outside this interval are then defined as differentially expressed (a minimal code sketch of this test, and of co-expression analysis, follows at the end of this discussion). The Analysis of Variance (ANOVA) is an alternative statistical technique that can be used for identifying genes that are differentially expressed between different classes (Causton et al., 2003). ANOVA tests for significant differences between means of two or more populations by comparing variances. The basis of the technique is that variances in measurements may be partitioned depending on the measurement source. In the case of a microarray experiment, we may consider partitioning the measurements by different arrays, the dyes used to label samples, or a different ‘class’ of sample, such as healthy or diseased cells.

Groups of genes that have correlated expression profiles are often referred to as ‘co-expressed’ genes. Co-expressed genes often have similar functions or a common set of regulator genes (Heyer et al., 1999). Even if genes are negatively correlated, this can indicate a true relationship, as the expression of one gene may be activated at the same time as another gene’s expression is repressed. Clustering

techniques are a popular method in the microarray data analysis community, used on gene expression matrices to discover co-expressed genes (Eisen et al., 1998). Clustering can also help to reduce the dimensionality of data — the expression profiles for thousands of genes can be reduced to a few groups, where all genes in the group have similar behaviour. It can also help to detect outlier genes that have unusual behaviour. Clustering is a well-established data mining technique and two classic methods that are often used for gene expression data are k-means and hierarchical clustering. In hierarchical clustering, objects (in this case, objects are genes) that are close together are iteratively grouped to form a hierarchy of objects. In order to define whether genes are close together, a distance measure, such as Euclidean distance, must be used on their expression profiles. In k-means, the algorithm finds k clusters of genes (where k is pre-defined). The objects are partitioned into k groups and the ‘centre’ of each group is calculated in Euclidean space. The clusters are iteratively improved by constructing new clusters based on the centres (MacQueen, 1967). Whilst there are newer clustering techniques, there is no overwhelming evidence that these are more appropriate or obtain a better performance than the established techniques (Causton et al., 2003).

Clustering is an unsupervised data-driven technique — so it groups genes based only on their gene expression profiles. In contrast, supervised data mining techniques use additional information to group genes and make decisions. Classification is a supervised method that is used to learn descriptions (of some form) for categories of object annotations. The learnt descriptions can then be used to make predictions on the annotation of new objects. For example, in microarray data analysis, classification can be used to identify expression patterns for healthy and diseased samples and then the learnt descriptions can later be used to assign such classes to unannotated samples based on gene expression data alone. The first classification analysis applied to gene expression data was by Golub et al. (1999). In this work, a class discovery procedure automatically discovered the distinction between acute myeloid leukaemia and acute lymphoblastic leukaemia based on expression profiles, and this was used to determine the class of new leukaemia cases. Like clustering, classification is a well-established field of data mining. Techniques range from the simple linear regression and k-nearest neighbour to neural networks and decision trees.
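As a concrete sketch of the simple differential-expression test referred to above, and of co-expression via correlated profiles, consider the following (all data is randomly generated for illustration, and the two-standard-deviation cutoff is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic log ratios for 1000 genes on a single two-channel array.
log_ratios = rng.normal(loc=0.0, scale=0.25, size=1000)
log_ratios[:10] += 2.0   # plant ten genes with a strong expression change

# Simple differential-expression test: flag genes whose log ratio falls
# outside a confidence interval of two standard deviations around the mean.
mu, sigma = log_ratios.mean(), log_ratios.std()
diff_expressed = np.where(np.abs(log_ratios - mu) > 2 * sigma)[0]
print(f"{len(diff_expressed)} genes flagged as differentially expressed")

# Co-expression: correlate expression profiles across several arrays.
profiles = rng.normal(size=(5, 20))                          # 5 genes x 20 arrays
profiles[1] = profiles[0] + rng.normal(scale=0.1, size=20)   # gene 1 tracks gene 0
corr = np.corrcoef(profiles)                                 # pairwise correlations
print(f"correlation of genes 0 and 1: {corr[0, 1]:.2f}")
```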


Time series experiments involve taking samples from cells at certain points of a biological process. The results can help biologists discover information about the order and time scale of expression events (Causton et al., 2003). A classic time-series expression dataset is that of Spellman et al. (1998), in which the yeast cell-cycle is studied. The cell-cycle is a process that occurs in all eukaryotic organisms (which include animals, plants and fungi) where cells grow and replicate. In this time-series analysis, the focus was on identifying trends and periodicity in the data in order to find genes whose expression profiles correlate with known events during the cell-cycle. An issue with time-series analysis for microarray expression data is that datasets often only contain a small number of samples, sometimes taken at long time intervals, so complex techniques such as Fourier analysis (which is used in signal processing) are inappropriate. Instead simpler techniques are used, where expression profiles are ‘timeshifted’ and correlations and patterns are sought between them (Zou & Conzen, 2005).
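Here is a minimal sketch of this timeshifting idea, using lagged Pearson correlation on a synthetic regulator/target pair; the profiles and the helper function are invented for illustration:

```python
import numpy as np

def best_timeshift(regulator, target, max_lag=5):
    """Correlate a target profile against time-shifted copies of a regulator
    profile and return the lag with the highest correlation."""
    best = (0, -1.0)
    for lag in range(max_lag + 1):
        if lag == 0:
            r = np.corrcoef(regulator, target)[0, 1]
        else:
            r = np.corrcoef(regulator[:-lag], target[lag:])[0, 1]
        if r > best[1]:
            best = (lag, r)
    return best

# Synthetic time-series: the target mirrors the regulator two points later.
t = np.linspace(0, 4 * np.pi, 50)
regulator = np.sin(t) + np.random.default_rng(1).normal(scale=0.05, size=50)
target = np.roll(regulator, 2)
target[:2] = 0.0   # clear the values wrapped around by np.roll

lag, r = best_timeshift(regulator, target)
print(f"best lag = {lag}, correlation = {r:.2f}")
```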

2.2.3.3 Inferring regulatory relationships

Analysing gene expression data is the main focus of this thesis. In particular, we wish to use gene expression data to infer regulatory relationships among sets of genes. Using data to learn a model of a gene regulatory network is also sometimes referred to as ‘reverse-engineering’. As discussed in Section 2.1, expression measurements across a set of genes can provide an indication of regulatory relationships between genes. An example of a simple regulatory relationship would be where a regulator gene (transcription factor) induces the expression of a target gene. Figure 2.9a shows how such a regulatory relationship may be seen in the expression profiles of the regulator and target genes. Note that this is synthetic time-series expression data, generated to mimic a regulatory relationship, and contains a relatively high number of samples compared to an average gene expression dataset. In this case, the regulator gene activates the expression of the target gene, and we can see the target gene expression profile mirrors that of the regulator gene, but is time-shifted by a few time points. In Fig. 2.9b, the expression profiles of a regulator and a target gene from yeast are shown, taken from real gene expression data (Gasch et al., 2000). In this case there is no consistent

temporal relationship between samples (some samples in the series form short time-series of a few points, but most are standalone experimental conditions). However, a correlation between the two expression profiles can be seen.

Basic microarray analysis techniques, as described in Section 2.2.3.2, can be used to gain some insight into the gene regulatory structure. As mentioned earlier, it is believed that co-expressed genes may share a common set of regulator genes. For example, clustering techniques allow groups of co-regulated genes to be discovered and this is sometimes used as a basis for gene network learning. Many techniques use clustering for discovering groups of co-regulated genes in the first step to building a gene regulatory network model (Bar-Joseph et al., 2003; Segal et al., 2003a). However, basic analysis techniques such as clustering alone are not able to reveal the more complex structure of the gene regulation process — that is, how genes interact and their inter-dependencies. Clustering only extracts groups of correlated genes, whereas GRNs can contain nonlinear relationships and combinatorial regulatory control (where a group of regulator genes may interact together to regulate a set of target genes). Thus a number of more complex analysis techniques for reverse-engineering gene regulatory network models have been applied to and/or developed for the task. An overview of the two techniques that are the most prevalent in the literature is provided next; they are Boolean networks and Bayesian networks (Schlitt & Brazma, 2007). Note that there are many modelling techniques for GRNs that have been proposed (e.g. differential equation modelling, Petri nets, stochastic modelling and logical formalisms, to name but a few). However, here we are specifically referring to the reverse-engineering of GRNs from microarray expression data.

Boolean networks are the simplest network model method. They were first proposed by Kauffman (1969). A Boolean network consists of a directed network graph where the nodes are Boolean variables (a Boolean variable has only two possible states — true and false) and a set of Boolean functions, each one associated with a different node. When a Boolean network is describing a gene regulatory network each node corresponds to two possible states of gene expression — either on or off (Filkov, 2006). The state of a target gene can be predicted by the other genes that influence it, through a Boolean function — each node xi is


Figure 2.9: Regulatory relationships shown in expression profiles. (a) Synthetic time-series: expression levels of regulator and target genes over time points; (b) real data (not time-series): expression log ratios of regulator and target genes across samples/experimental conditions.


(a) DAG over the four gene variables; (b) conditional probabilities for Gene4:

Gene2  Gene3  p(Gene4=off)  p(Gene4=on)
off    off    0.9           0.1
on     off    0.2           0.8
off    on     0.2           0.8
on     on     0.1           0.9

Figure 2.10: A simple Bayesian network over 4 variables. (a) shows the DAG component whilst (b) shows the conditional probability distribution attached to the node Gene4

assigned a Boolean function f_i(x_i1, ..., x_in), where each x_ik is a node influencing x_i. Since it takes Boolean inputs and outputs a Boolean value, a Boolean function can be thought of as a logical rule (a minimal code sketch of such update rules follows below).

A limitation of Boolean networks is the simplicity in the way that target genes are predicted using Boolean functions — given a particular gene, many Boolean functions may fit the observed expression data equally well. To address this, Shmulevich et al. (2002) introduced Probabilistic Boolean networks, an extension of Boolean networks where each gene node may have a family of Boolean functions.

Bayesian Networks have become a popular method for the reverse-engineering of GRNs (Friedman et al., 2000; Pe’er et al., 2006; Segal et al., 2003a) from expression data since they are able to represent the network qualitatively (with a network graph) and quantitatively (probability distributions quantify the strength of influences and dependencies between gene nodes in the network graph). A Bayesian Network consists of two components — a Directed Acyclic Graph (DAG) consisting of edges between nodes that represent variables in the domain, and a set of conditional probability distributions associated with each node (see Figure 2.10 for a simple example). The directed edges between nodes indicate the existence of influences and dependencies, the strength of which are quantified by the conditional probabilities. Like Boolean networks, Bayesian networks are relatively easy to interpret by non-technical people due to the transparent nature of the graphical network representation.
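To make the Boolean update rule concrete, here is a minimal sketch of a synchronous Boolean network; the genes and their Boolean functions are invented for illustration and do not correspond to any real network:

```python
# Each gene's next state is a Boolean function of its influencing genes.
def step(state):
    """Synchronously apply each gene's Boolean function to produce
    the network state at the next time point."""
    return {
        "gene1": state["gene1"],                    # external input, held constant
        "gene2": state["gene1"],                    # activated by gene1
        "gene3": not state["gene1"],                # repressed by gene1
        "gene4": state["gene2"] or state["gene3"],  # OR of its two regulators
    }

state = {"gene1": True, "gene2": False, "gene3": False, "gene4": False}
for t in range(4):
    print(t, state)
    state = step(state)
```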


Bayesian networks are able to model more complex behaviour than Boolean networks. However, as Boolean networks are simpler, they are generally more efficient to learn from data. In Bayesian networks, the conditional probability distributions are able to represent both simple and more complex types of dependencies and can utilise discretised (multiple states) or continuous modelling for gene nodes. They naturally deal with uncertainty (which is important due to both the stochastic nature of gene expression (McAdams & Arkin, 1997) and the noisiness of gene expression data) through the use of conditional probability distributions. Additionally, Bayesian networks are a well-studied method and there are many established techniques for learning them from data. Bayesian networks are our tool of choice to learn gene regulatory network models from gene expression data and they are covered in detail in Chapter 3.

2.2.4 Using multiple microarray gene expression datasets

This thesis focuses on combining multiple sources of data in order to produce GRN models that can potentially be more robust and with greater confidence attached than those derived from a single dataset. In recent years, there has been a rapid increase in the amount of publicly available microarray data. For example, there are online databases where laboratories can contribute microarray expression datasets for different organisms, including yeast (SGD-project, 2008) and E. coli (GenExpDB, 2008). Usually, these datasets are accompanied by a peer-reviewed journal publication giving details of the experiment objectives and primary data use.

The opportunity to utilise multiple sources of available expression data for learning GRN models is further motivated by a number of recognised issues relating to the quality of microarray expression datasets. Firstly, microarray data is subject to experimental noise. Whilst differences in expression level measurements can be caused by natural biological variations between samples, they may also result from systematic experimental noise, which can be introduced at all stages of the experimental process, for example, from the sample preparation to the hybridisation step to the scanning of the array. In order to measure and avoid the effects of noise, it is considered good practice to include replicate samples in

the experimental design (Causton et al., 2003). Replicates are of two types. Technical replicates are used to assess experimental noise, whilst biological replicates allow biological variability to be measured. Technical replicates repeat some steps of the experimental process on the same sample. A common method for technical replicates is called ‘dye-swapping’. This is where multiple extracts are collected, so that we have an extract from each sample, labelled with each dye, and hybridised to the microarray. Biological replicates are obtained by identical sample preparation from multiple biological specimens. Typically, analysis of replicate samples involves pooling or averaging the expression levels across the replicates (Lee et al., 2000; Wernisch, 2002).

The issue of experimental noise is not helped by the fact that due to the relative expense of microarray experiments, sample sizes are often relatively small, in comparison to the number of measured genes. A single microarray dataset usually contains a large number of genes (commonly thousands) but the number of samples is much lower (in the tens or possibly hundreds). This problem is often referred to as the curse of dimensionality (Bellman, 1961). For this particular application, it can make it difficult to reliably infer regulatory interactions from a single expression dataset.

Additionally, there are concerns over the reproducibility of results across different microarray platforms (MAQC consortium, 2006). A number of studies have been conducted to compare different platforms; some studies claim that significant differences exist across platforms (Kuo et al., 2002; Tan et al., 2003) whilst other research indicates that different platform technologies produce comparable datasets (Woo et al., 2004; Yuen et al., 2002). As discussed earlier, there are different platforms for microarray technology. In two-channel array platforms, two differentially labelled mRNA samples are hybridised together on an array. On one-channel platforms, such as the commercially available Affymetrix GeneChip, a single sample is hybridised to each array. One-channel microarrays give estimates of the absolute value of expression whereas two-channel technology can estimate only relative differences in expression between genes. Comparing datasets across different platforms is difficult for a number of reasons. Firstly, measurement units may vary. This is especially the case between relative and absolute measures of gene expression, and absolute measures must be transformed

to relative measures for comparison purposes. Additionally, studies will come from different laboratories, meaning that data is collected with different measurement biases under different atmospheric conditions, leading to different types and levels of experimental noise across datasets. Whilst normalisation techniques can be used, previous research has established that comparing between datasets using standard normalisation techniques is not straightforward due to the different platform and experimental biases involved (Jarvinen et al., 2004; Yauk et al., 2004). Whilst scale normalisation has the theoretical benefit of making arrays between datasets comparable, in practice, bias and artefacts may still remain in the data after normalisation.

The data quality issues described here, together with the increased availability of microarray gene expression data in the public domain, have motivated research in the analysis of many microarray datasets generated by multiple studies, rather than a single dataset, in order to make more robust conclusions from results. For example, meta-analysis methods have been proposed to combine analysis of microarray datasets (Hu et al., 2006). Meta-analysis refers to a set of statistical methods, originating in medical statistics, for combining the results of several studies that address a set of related research hypotheses (Sutton et al., 2000). It has mainly been used to combine measures for the statistical significance (e.g. p-values) of differentially expressed genes across a number of similar studies (Conlon et al., 2006; Rhodes et al., 2002), or to combine effect sizes, which measure the magnitude of the effect of a medical treatment (e.g. for cancer) (Choi et al., 2003). There has also been other research published on combining multiple microarray datasets for other tasks, such as the functional classification of genes (Ng et al., 2003). This research showed that combining multiple datasets can improve classification, even if the experimental focus or conditions vary over the datasets. Lee et al. (2004) have used multiple microarray datasets to produce co-expressed networks of genes, where gene pairs are linked if they are correlated across many datasets. They showed that co-expression across multiple datasets is correlated to functional relatedness. Discovering co-expression networks across multiple microarray datasets has since been studied further (Chen et al., 2008; Choi et al., 2005; Yan et al., 2007).


Combining multiple microarray studies specifically for inferring robust regulatory relationships is an area of research that is in its infancy. Wang et al. (2006) were the first to address the issue with regards to reverse-engineering GRN models. They use a framework where individual models for each dataset are combined into an overall, consistent solution. Their method is based on linear programming where GRNs are represented using non-linear differential equations. More recently, Redestig et al. (2007) and Shi et al. (2007) have presented different methods for finding pairwise gene regulatory relationships from multiple expression time-series. Note that searching for pairwise relationships is a simplification of discovering a network model, since many combinatorial relationships will not be uncovered.

The difficulties in comparing directly amongst datasets produced on different microarray platforms and in different laboratory conditions motivate a key premise of this thesis: that combining microarray datasets at the model-level rather than the dataset-level is more appropriate. Later in this thesis (see Chapter 5) we compare two frameworks for combining microarray datasets to infer GRNs: pre- and post-learning aggregation. Pre-learning approaches combine data at the dataset-level by using standard scale normalisation to transform each dataset and concatenating them to form a large set of observations. A model is then learnt from the concatenated dataset. In post-learning aggregation individual models are learnt from each dataset and the models are combined. This allows regulatory interactions to be inferred from each dataset, and then an aggregate model can be built based on the set of models. A post-learning aggregation framework can combine microarray datasets generated by different platforms, research groups and laboratories without requiring any special normalisation to be applied to the datasets. Post-learning aggregation is the approach taken by Wang et al. (2006) in previous research (see previous paragraph), but in this thesis Bayesian networks are used as the modelling technique, due to their natural suitability for and previous success in reverse-engineering GRNs. In Chapter 5 we show that our post-learning aggregation approach for Bayesian networks produces better results than pre-learning aggregation for reverse-engineering GRNs from multiple expression datasets.


2.3 Other types of data

Whilst the microarray provides the most available genome-wide data source on gene expression, Section 2.2 has highlighted its reliability and quality issues. Section 2.2.4 motivated the use of multiple microarray datasets to alleviate these issues. However, there are also other data sources that can provide further complementary knowledge on the regulation of gene expression. In this section we introduce these potential sources of data, discuss how they can be used for the reverse-engineering of GRN models and provide an overview of relevant work.

2.3.1 Protein-protein interaction data

Gene regulatory relationships and protein-protein interactions are closely linked. Recall that proteins are gene products that are created during the gene expression process, and that proteins initiate the process of gene expression, so they are of interest in discovering gene regulatory interactions. Physically interacting proteins are often co-expressed (Ge et al., 2001). This means that co-expressed genes discovered using microarray expression data may not be explained by a gene regulatory interaction, but instead by interacting proteins (Nariai et al., 2005). Therefore, for example, protein-protein interaction data can be useful in reducing spurious regulatory interactions inferred by expression data alone. However, it should be noted that co-expressed genes may also be physically interacting proteins: the two conditions are not mutually exclusive.

Just as microarray technology is a high-throughput method for measuring gene expression levels, there are high-throughput technologies available for detecting physical interactions between proteins. Two widely used experimental technologies for detecting protein-protein interactions are the yeast two-hybrid (Y2H) system (specifically for the yeast organism) and affinity purification followed by mass spectrometry (AP-MS). These systems can produce sets of interacting proteins, however like all high-throughput technologies, the resulting data is subject to noise. Different datasets of interacting proteins are often contradictory (Jansen et al., 2003). To deal with this uncertainty, research has been

carried out on estimating the reliability of a discovered protein-protein interaction, providing a confidence for a protein-protein pair based on multiple data sources (Bader et al., 2004; Saito et al., 2003).

2.3.2 Transcription factor binding site data

Transcription Factor Binding Site (TFBS) experimental data can also be very useful in predicting gene regulatory interactions. Recall that a gene becomes expressed when a transcription factor binds to a segment of DNA close to the gene. Thus, if we can identify the location of binding sites for a particular transcription factor, we can also identify potential target genes. Therefore, sometimes this type of data is referred to as location data. There are a number of different experimental techniques for discovering binding sites, some localised and some high-throughput technology (Elnitski et al., 2006). A popular high-throughput technology is named ChIP-chip. ChIP-chip combines a method known as chromatin immunoprecipitation (the ‘ChIP’ part) with microarray technology (the ‘chip’ part). For a specific protein, or group of proteins, ChIP-chip can be used to identify their binding sites. In general, the experimental results can be reported as a set of p-values, where each p-value reflects the confidence in a regulatory relationship existing between a protein and a target gene (Lee et al., 2002).

2.3.3 Literature-based knowledge

There is a wealth of knowledge on gene regulation locked in the scientific literature. When experimental studies are conducted, for example using microarrays or binding site experiments, the results are typically published in a peer-reviewed journal. Thus any confirmed regulatory relationships are likely to be documented in the body of literature in the field. There are a number of ways in which this information can be extracted from the literature.

There are a number of online databases that contain comprehensive knowledge on genes and their products (e.g. proteins). These databases can be considered the backbone of literature-based knowledge on genes — their information often feeds into databases that store more complex types of knowledge, such as relationships between genes. For example, GenBank is an annotated collection of all

publicly available DNA sequences, hosted by the National Center for Biotechnology Information (NCBI) (McEntyre & Ostell, 2002-2005). Essentially, this means it contains information about all genes in all organisms. Individual researchers can upload newly discovered sequences to GenBank and this means there can be multiple entries for the same entity, and some information may be contradictory. The NCBI also maintains RefSeq, which is a curated database of DNA sequences — there is only one entry for each molecule.

There are also online databases that build a more complex picture of biological systems. For example, KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa & Goto, 2000; Kanehisa et al., 2008) is a well-used database of biological systems. It includes information on genes and proteins, their functions, and known interactions (referred to as pathways). To show how entities relate and interact, genes and proteins are stored in a graph format. All information is drawn from other databases such as GenBank, or manually entered based on published materials. The Gene Ontology (GO) (The Gene Ontology Consortium, 2000) is another collection of databases that relies on individual researchers to contribute knowledge, in order to build descriptions of gene products. GO started with three organism databases (yeast, fly and mouse) but now includes many different organisms. A key concept of GO is the controlled use of vocabulary (ontologies), so that information in the database is consistent. Gene entities can be linked using different relationship types (e.g. regulates, is-a, part-of) so interactions between genes can also be identified.

However, these online databases rely on the manual addition of knowledge from researchers in the field, and as the publication rate increases and subsequently the volume of scientific papers becomes prohibitively large for such a practice, keeping up-to-date information within them becomes increasingly challenging. Additionally, the reliance on manual updates for such databases brings the risk of erroneous or contradictory information. This has led to the development of text mining techniques to automatically extract information about biological entities and their relationships directly from a collection of scientific papers.


Many text mining techniques have been developed specifically for use with biological literature, due to its complex nature (Krallinger & Valencia, 2005). At the most basic level, the PubMed and Medline databases store information relating to each paper published in the biological domain — including author, title and abstract. Many text mining tools use PubMed or Medline as their knowledge base.

The first step in biological text mining is to identify biological entities, such as genes and proteins. This is known as named-entity recognition (NER). Often there are common names for the same biological entity, so it is necessary to be able to identify each entity unambiguously. For example, genes have a number of different identification attributes, such as gene names, symbols and accession numbers, and additionally there are sometimes multiple gene names that actually refer to the same gene.

The next step is to identify relationships and interactions between biological entities (e.g. interactions between genes, and interactions between proteins) from the literature. A simple and efficient method for extracting relationships between entities is based on the co-occurrence of terms in a sentence or abstract (Jelier et al., 2005). This technique, whilst simple, is also effective. It has been shown that the co-occurrence of gene names in a paper abstract frequently reflects a true relationship between the two genes (Jenssen et al., 2001; Stapley & Benoit, 2000).

The Associative Concept Space (ACS) is a biological text mining tool that extends the simple co-occurrence technique (van der Eijk et al., 2004). The ACS makes use of a thesaurus in order to avoid ambiguities by mapping synonyms to biological entities — each entity (e.g. a gene) is described using a ‘concept profile’. The ACS is based on calculating a ‘distance’ between two concepts. A key point is that distances are calculated based not only on the co-occurrence of entities in the same document (a ‘one-step’ relation), but also on indirect, multi-step relations, where concepts are linked via a number of documents. The ACS improves on simple co-occurrence methods in a number of ways. It can reveal relations between genes based on their contexts, i.e. the other concepts with which they are mentioned, and due to this does not require the genes to be mentioned in the same article for a relationship to be discovered. This research has been extended

further by the development of literature-based gene concept profiling and association scores (Jelier et al., 2007; Schuemie et al., 2007a) that describe the overlap in the contexts in which the genes are mentioned. This approach can discover relationships between biological entities based on a huge number of documents and represent the relationships using an association score matrix, where the association score between an entity-pair reflects the strength of the relationship between them.
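As a minimal sketch of the co-occurrence idea described above, the following counts how often pairs of gene names appear in the same abstract; the abstracts and gene names are invented, and real tools would add named-entity recognition and synonym handling:

```python
from itertools import combinations
from collections import Counter

abstracts = [
    "geneA activates geneB during the stress response",
    "expression of geneB and geneC in yeast",
    "geneA regulates geneB via an unknown mechanism",
    "geneC is a housekeeping gene",
]
gene_names = {"geneA", "geneB", "geneC"}

# Count co-occurrences of gene-name pairs within each abstract; the counts
# serve as a crude association score between the genes.
pair_counts = Counter()
for abstract in abstracts:
    mentioned = sorted(g for g in gene_names if g in abstract)
    for pair in combinations(mentioned, 2):
        pair_counts[pair] += 1

for (g1, g2), n in pair_counts.most_common():
    print(f"{g1} - {g2}: co-occurs in {n} abstracts")
```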

2.3.4 Use of prior knowledge in inferring regulatory relationships

In this section, we have identified a number of data types that offer complementary knowledge to microarray expression data. Whilst expression data gives us information on expression levels, each of these other data types offers a different aspect of knowledge. Protein-protein interaction data provides information on which proteins interact. TFBS data provides a confidence on whether a gene pair have a regulatory relationship, based on binding sites. Literature mining can extract knowledge that has been published in scientific papers. The use of microarray expression data alone for inferring gene regulatory relationships has drawbacks due to its quality and reliability issues. Consequently, it has been recognised that integrating the use of these complementary data types into the inference of gene regulatory relationships can be beneficial. Typically, these complementary data sources are used as prior knowledge in some way. For example, the prior knowledge may be used to identify potential transcription factors, group genes into potential regulatory relationships, or place a probability distribution over candidate network models, prior to learning a model from microarray expression data. In the rest of this section we provide an overview of relevant work in combining prior knowledge with expression data.

The combination of protein-protein interaction data and microarray expression data has also been used to build protein-protein interaction networks and GRNs in parallel (Nariai et al., 2005). In this research, a single model with three components is learnt from the data: a Bayesian network representing a GRN, a


Markov network¹ representing a protein-protein interaction network, and a structural connection between the two networks. The aim of this structural connection is to clearly distinguish regulatory relationships from protein-protein interactions, since genes that are co-expressed and thought to be part of a regulatory relationship may actually be interacting proteins. Segal et al. (2003b) demonstrate a similar approach to build a probabilistic model based on both data types. A probabilistic model is constructed for each data type and then a unified model is created based on the individual models.

Most work that combines TFBS location data with microarray expression data is tested on the yeast organism, since ChIP-chip experiments have been carried out to identify the binding sites for over a hundred of its transcription factors (Lee et al., 2002). The most relevant work to this thesis on combining TFBS data with microarray data for inferring GRNs is by Bernard & Hartemink (2005), since it is based on the use of Bayesian networks. In this work a prior probability distribution over candidate Bayesian networks is constructed based on the yeast location data. They found that the use of a prior distribution increased the rate of true regulatory relationships discovered. However, they also found that using the location data alone had a good accuracy in comparison with the combination of data types. Similarly, Xu et al. (2004) also use location data to build a prior distribution for a Bayesian model to identify regulatory interactions in yeast. This research extends the Module Networks work by Segal et al. (2003a) that builds regulatory modules based on expression data only. In other work, Segal et al. (2003c) demonstrate a similar approach to the research in which they combine protein-protein interaction and expression data, but use DNA sequence information (relating to binding site DNA sequences) instead of protein-protein interaction data.

Gao et al. (2004) present MA-Networker, which uses expression and location data. The basis of MA-Networker is the use of a ‘coupling factor’, which is calculated for each transcription factor-target gene pairing based on both data types. Once again, this research is tested on the yeast organism. They find that 58% of the genes whose promoter region is bound by a transcription factor are true regulatory targets. Imoto et al. (2003) propose the use of TFBS data as

¹A Markov network is similar to a Bayesian network, but only has undirected edges between nodes.

prior knowledge to build a prior probability distribution over potential Bayesian models, but do not experimentally test their method using location data.

The incorporation of literature-based data for learning GRNs began with the use of online biological databases, such as KEGG. Imoto et al. (2003) introduce energy functions to incorporate prior knowledge sources into Bayesian GRN models and propose the incorporation of many types of different prior knowledge, including literature-based knowledge. Their experiments are conducted with prior knowledge extracted from regulatory interactions that are recorded in the Yeast Proteome Database (YPD). Later, Werhli & Husmeier (2007) extended the approach of Imoto et al. to multiple sources of prior knowledge and test their approach on combining protein-protein interactions and KEGG pathways with expression data.

More recent work has seen the integration of text mining techniques to extract knowledge, due to the limitations of online databases. Li et al. (2006) use co-occurrence text mining to build a prior regulatory network, and then microarray expression data is used to improve the network. Aerts et al. (2008) use text mining to identify DNA sequence information in scientific publications, as a means of identifying the location, organism and target gene information for regulatory relationships. Suwannaroj & Niranjan (2008) use clustering based on the combination of co-expression and text-mined literature gene relationships to build a network of co-regulated genes.

Later in this thesis we present some of the first research on the incorporation of prior knowledge from a large body of relevant literature (see Chapter 4) in combination with expression data for reverse-engineering GRNs using Bayesian networks. We decide to use literature-based data for a number of reasons. Publicly available and genome-wide experimental data, such as protein-protein interactions and TFBS location data, can be noisy and difficult to obtain, since experiments are relatively expensive and can be lengthy to run. For example, location data is often only available for a small set of transcription factors. This is why previous research tends to focus on the yeast organism, where a large set of data is readily available. However, by using literature-extracted knowledge, we can harness a huge amount of information from numerous sources for any organism for which there are published papers. We use literature-based gene concept

profiling (introduced earlier in Section 2.3.3) to generate a prior distribution over candidate Bayesian network models.

2.4 Summary

This chapter has formed the first part of the literature review, and provides motivation for the contributions of this thesis. In Section 2.1, the mechanisms of gene expression and regulation were introduced. The process of gene expression is often referred to as the ‘central dogma of molecular biology’, since it initiates all cellular processes in living organisms. In a process known as gene regulation, genes interact in order to activate and repress the expression of other genes. Understanding how genes operate in these networks — known as gene regulatory networks (GRNs) — is a major goal of computational biology. For example, understanding GRNs can help biologists gain insight into the causes of genetic conditions or cancer.

This thesis is focused on the reverse-engineering of GRNs, that is, inferring GRN models directly from data sources. In particular, in Section 2.2 we described DNA microarrays, a high-throughput technology for measuring the expression levels of thousands of genes simultaneously. Microarray data can help biologists discover regulatory relationships between genes through the analysis of expression level patterns. However, there are a number of recognised issues with microarray expression data. It is subject to both biological variations across samples and experimental noise, which may be introduced throughout the stages of the experiment (e.g. sample preparation or hybridisation). In addition, a key issue with microarray datasets is often referred to as the curse of dimensionality. A single microarray dataset usually contains a large number of genes (commonly thousands) but the number of samples is much lower, which can make it very difficult to extract reliable regulatory interactions from a single dataset. Microarray technology is now widely used, and with the subsequent, rapid increase of publicly available microarray data comes the opportunity to produce regulatory network models based on multiple datasets. However, normalisation processes that are intended to allow direct comparison of multiple datasets do not always remove the inherent noise and biases of the microarray platform or laboratory process

and environment from which the data is generated. This motivates the first contribution of this thesis: Consensus Bayesian networks and Bayesian networks meta-analysis, approaches for combining Bayesian network models of GRNs at the model-level, which have no requirement for normalising between datasets. This is first presented in Chapter 5 and extended further in Chapter 6.

The drawbacks of using only microarray data to reconstruct GRNs can also be alleviated by incorporating other complementary data sources into the modelling process. There are many other data sources that contribute to available knowledge on GRNs, such as TFBS location data, protein-protein interactions, and literature-based knowledge, which are all covered in Section 2.3. However, publicly available and genome-wide experimental data, such as protein-protein interactions and TFBS location data, can be noisy and difficult to find, since experiments are relatively expensive and can be lengthy to run. A better alternative is the use of text mining to extract knowledge from the literature, which can harness a huge amount of information from numerous sources for any organism for which there are published papers. This motivates the second contribution of this thesis, presented in Chapter 4: some of the first research on the incorporation of prior knowledge from a large body of literature in combination with expression data for reverse-engineering GRNs using Bayesian networks.

Next, in Chapter 3, we describe Bayesian networks, our tool of choice for the reverse-engineering of GRNs, and present the remaining part of our literature review, focusing on the use of Bayesian networks with microarray gene expression data.

Chapter 3

Modelling GRNs using Bayesian networks

Bayesian Networks have become a popular method for computational modelling of GRNs from microarray expression data since they are able to represent networks qualitatively (using graphs), and quantitatively (using probability distributions) and thus are relatively easy to interpret by non-technical people. This chapter introduces Bayesian networks for modelling GRNs. The first section (3.1) provides an overview of the basics of Bayesian networks. Following this, in Section 3.2 we explain how Bayesian networks can be used to model GRNs. Section 3.3 describes methods for learning Bayesian networks and attaching confidence levels to network features. In Section 3.4 we discuss methods for evaluating the performance of learnt Bayesian network models. Finally, a review of literature on modelling GRNs using Bayesian networks is provided in Section 3.5.

3.1 Bayesian networks

Bayesian Networks (BNs) (Mitchell, 1997; Murphy, 2001b; Pearl, 1991) are graph-based models of probability distributions that capture properties of conditional independence between variables. A BN consists of two components. The first is a Directed Acyclic Graph (DAG) consisting of directed arcs between nodes that represent random variables in the domain. If there is an arc (also referred to as a


[Figure 3.1 DAG: Gene1 → Gene2, Gene1 → Gene3, Gene2 → Gene4, Gene3 → Gene4]

Figure 3.1: DAG component of a simple Bayesian network over 4 random discrete-valued gene variables. This example network is adapted from Murphy (2001b).

p(G1=off)  p(G1=on)
0.50       0.50

G1   p(G2=off)  p(G2=on)
off  0.50       0.50
on   0.80       0.20

G1   p(G3=off)  p(G3=on)
off  0.75       0.25
on   0.25       0.75

G2   G3   p(G4=off)  p(G4=on)
off  off  0.90       0.10
on   off  0.20       0.80
off  on   0.20       0.80
on   on   0.10       0.90

Table 3.1: Conditional probability tables for each node in the DAG shown in Figure 3.1. Note that G1=Gene1, G2=Gene2, G3=Gene3 and G4=Gene4.


For example, consider the BN consisting of the DAG in Figure 3.1, together with the CPTs in Table 3.1 which represent the CPDs for each of the four random variables, which are discrete-valued (binary). In this example, which is adapted from the Sprinkler network in Murphy (2001b) to instead show gene regulatory interactions, the variable Gene4 is influenced by Gene2 and Gene3. From the CPT for Gene4, we can see that if either Gene2 or Gene3 is expressed (‘on’), then there is a strong probability that Gene4 is also on. The node Gene1 has a strong influence on Gene2 if Gene1 is on (Gene2 is then likely to be off), but otherwise Gene2 is equally probable to be on or off. The node Gene1 has no parent nodes, so its CPT specifies the prior probability that it is expressed (‘on’) or not (‘off’).

3.1.1 Conditional independence

As mentioned earlier in this chapter, BNs specify a set of conditional independencies amongst a set of variables. Thus, the concept of conditional independence between sets of variables is a key underlying principle of BNs. Suppose we have three random variables X, Y and Z. We say that X is conditionally independent of Y given Z if:

p(X|Y,Z) = p(X|Z)

Essentially, this means that once we know the value of Z then X and Y are independent. It can be shown that each node in a BN is conditionally independent of all its non-descendants given its parents (Pearl, 1991). For example, for the network shown in Figure 3.2 the nodes W , X and Y are conditionally independent given the value of their parent node Z. For the previous example network shown in Figure 3.1, Gene4 and Gene1 are independent when given the values of Gene2 and Gene3.

3.1.2 Inference

A BN specifies a Joint Probability Distribution (JPD) across the set of variables. Consider the network specified by the DAG in Figure 3.1. The joint probability


[Figure 3.2 DAG: Z → W, Z → X, Z → Y]

Figure 3.2: Conditional independence in Bayesian networks: this figure shows the DAG component of a Bayesian network. The nodes W, X and Y are conditionally independent given the value of Z.

of all nodes in this DAG can be specified using the chain rule of probability, as follows:

P(G1, G2, G3, G4) = P(G1) P(G2|G1) P(G3|G1, G2) P(G4|G1, G2, G3)

However, since G2 and G3 are conditionally independent given G1, and G4 is conditionally independent of G1 given G2 and G3, this can be rewritten as:

P(G1, G2, G3, G4) = P(G1) P(G2|G1) P(G3|G1) P(G4|G2, G3)

These probabilities are specified in the CPTs in Table 3.1. Thus, the conditional independence relationships allow us to represent the JPD more compactly. BNs are so-called because they use Bayes rule to perform inference (Murphy, 2001b). Bayes rule provides a method for calculating the posterior probability P(A|B), based on the values of P(A) — the prior probability, P(B|A) (the likelihood) and P(B). This is useful in cases where P(A|B) is the quantity of interest, but there is only observed data available to calculate P(B|A). Bayes rule is as follows:

P(A|B) = P(B|A) P(A) / P(B)    (3.1)

Using Bayes rule together with the JPD described by a BN, we can infer the probability distribution on the value of a target variable, given the observed values of other variables in the network. For the example network in Fig 3.1, if it is observed that Gene4 is expressed, G4 = on, and we want to find the probability that Gene3 is expressed, G3 = on, then by Bayes rule:

P(G3 = on|G4 = on) = P(G4 = on|G3 = on)P(G3 = on) / P(G4 = on)    (3.2)

= Σ_{g1,g2} P(G1 = g1, G2 = g2, G3 = on, G4 = on) / Σ_{g1,g2,g3} P(G1 = g1, G2 = g2, G3 = g3, G4 = on)    (3.3)

= 0.4138 / 0.6125    (3.4)

= 0.6755    (3.5)

Similarly, if we want to find the probability that Gene2 is expressed, G2 = on, given that Gene4 is expressed, G4 = on:

P(G2 = on|G4 = on) = P(G4 = on|G2 = on)P(G2 = on) / P(G4 = on)    (3.6)

= Σ_{g1,g3} P(G1 = g1, G3 = g3, G2 = on, G4 = on) / Σ_{g1,g2,g3} P(G1 = g1, G2 = g2, G3 = g3, G4 = on)    (3.7)

= 0.2938 / 0.6125    (3.8)

= 0.4796    (3.9)

Thus we can find which parent node of Gene4 has the stronger influence. In this case, P(G3 = on|G4 = on) > P(G2 = on|G4 = on), so it is more likely to be Gene3. Bayes rule can be used to infer the probability distribution for any variable, given the values of the remaining variables, as described above. However, if each node has 2 states, then the JPD has size O(2^n), where n is the number of nodes, and it grows still faster if nodes have more than 2 states. Therefore summing over the JPD takes exponential time. Thus, many alternative exact and approximate inference methods have been proposed to make the problem tractable. For example, a frequently used method for exact inference involves techniques known as variable elimination and the junction tree inference algorithm (Murphy, 2002).
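To make this concrete, the following minimal Python sketch performs inference by enumeration over the factorised JPD of the Figure 3.1 network. The CPT values are invented for illustration (Table 3.1 itself is not reproduced here), so the printed posterior will not match the 0.6755 computed above; with the genuine CPT values it would.

import itertools

# Illustrative CPTs for the four-gene network of Figure 3.1. The values are
# invented for demonstration; they are NOT the entries of Table 3.1.
p_g1 = {1: 0.5, 0: 0.5}                      # P(G1)
p_g2 = {1: {1: 0.9, 0: 0.1},                 # P(G2 | G1)
        0: {1: 0.5, 0: 0.5}}
p_g3 = {1: {1: 0.8, 0: 0.2},                 # P(G3 | G1)
        0: {1: 0.2, 0: 0.8}}
p_g4 = {(0, 0): {1: 0.05, 0: 0.95},          # P(G4 | G2, G3)
        (0, 1): {1: 0.90, 0: 0.10},
        (1, 0): {1: 0.90, 0: 0.10},
        (1, 1): {1: 0.99, 0: 0.01}}

def joint(g1, g2, g3, g4):
    # Factorised JPD: P(G1) P(G2|G1) P(G3|G1) P(G4|G2,G3)
    return p_g1[g1] * p_g2[g1][g2] * p_g3[g1][g3] * p_g4[(g2, g3)][g4]

# P(G3=on | G4=on): sum the JPD over the unobserved variables (Bayes rule).
numerator = sum(joint(g1, g2, 1, 1)
                for g1, g2 in itertools.product((0, 1), repeat=2))
denominator = sum(joint(g1, g2, g3, 1)
                  for g1, g2, g3 in itertools.product((0, 1), repeat=3))
print(numerator / denominator)

Summing over the full JPD in this way is exactly the exponential-time computation that variable elimination and the junction tree algorithm are designed to avoid.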


Variable elimination makes use of the conditional independence assumptions defined in the network to simplify calculations (as described earlier in this section). Variable elimination can also be used to construct an undirected graph, called the 'junction tree', that equivalently represents the JPD and on which inference is easier to perform. There are many methods for approximate inference (Murphy, 2001b), including those based on Monte-Carlo sampling. The research presented in later chapters of this thesis makes use of the Bayes Net Toolbox (Murphy, 2001a) to implement the junction-tree inference algorithm for the prediction of node values in BNs.

3.1.3 Explaining away

In cases where a node has multiple parents, there can be a situation where the influencing nodes compete to explain observed data. This is referred to as 'explaining away' (Murphy, 2001b). For example, consider the network in Figure 3.1, where Gene4 has two parent nodes, Gene3 and Gene2. In the case where Gene4 is observed, i.e. it has a value, Gene2 and Gene3 become conditionally dependent. Recall from the previous section that if Gene4 is expressed ('on'), then the probability that Gene2 is also on is 0.4796. Now, if we assume that Gene4 is expressed and also that Gene3 is expressed, we can calculate the probability of Gene2 being expressed as follows:

P(G2 = on|G4 = on, G3 = on) = P(G4 = on, G2 = on, G3 = on) / P(G4 = on, G3 = on)    (3.10)

= Σ_{g1} P(G1 = g1, G2 = on, G3 = on, G4 = on) / Σ_{g1,g2} P(G1 = g1, G2 = g2, G3 = on, G4 = on)    (3.11)

= 0.2991    (3.12)

This illustrates how the probability of Gene2 being ‘on’ (or true) decreases, when we know that Gene3 is also ‘on’.


3.1.4 Parameter learning

Given a network structure and a set of observations on all nodes, we can learn the parameters (that is, the parameters of the CPDs attached to each node) of the network using maximum likelihood parameter estimation. The aim is to find the parameters of the CPDs for each node that best fit the data. Formally, this can be expressed as follows (Needham et al., 2007). Given a BN that represents a probability distribution X and a set of observations (our dataset) D, then we wish to learn a set of parameters θ for X that maximise the likelihood that the data D comes from X. If D = x1, x2, ..., xN (a set of N training examples), and the likelihood is denoted L(θ) then this can be expressed as follows:

arg max_θ L(θ) = arg max_θ P(D|θ) = arg max_θ ∏_{i=1}^{N} P(x_i|θ)

The likelihood of the data given the model is the product of the probabilities of each example (assuming that the examples are independent). For discrete-valued variables, maximum likelihood estimation can be done simply by calculating the frequencies of the possible states in the data observations. However, it is often necessary to use a prior for the parameters; otherwise, combinations of variable states that are not contained in the data will receive a zero probability. The prior P(θ) is taken into account by maximising the posterior probability P(θ|D) through the use of Bayes rule:

P(θ|D) = P(θ)P(D|θ) / P(D)

Since P(D), the probability of the data (also known as the marginal likelihood), is a constant, P(θ|D) ∝ P(θ)P(D|θ). The research presented in later chapters of this thesis makes use of the Bayes Net Toolbox (Murphy, 2001a) to implement maximum-likelihood parameter estimation (with priors) to learn the CPTs of learnt network structures. There are also methods for parameter learning with only partially observed data (i.e. where some nodes have missing data), such as the well-known Expectation Maximisation (EM) algorithm. A description is omitted as in this research we do not deal with learning from missing data.
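As an illustration of this frequency-based estimation with a parameter prior, the sketch below fits a CPT by counting state combinations and adding a pseudocount. The function name fit_cpt and the dictionary-based data layout are assumptions made for the example; this is not the Bayes Net Toolbox interface used in the thesis.

import itertools
from collections import Counter

def fit_cpt(data, child, parents, states=(0, 1), alpha=1.0):
    # Maximum likelihood estimate of P(child | parents) from complete
    # discrete data, with pseudocount `alpha` acting as a simple prior so
    # that unseen state combinations do not receive zero probability.
    counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    cpt = {}
    for pa in itertools.product(states, repeat=len(parents)):
        total = sum(counts[(pa, s)] for s in states) + alpha * len(states)
        cpt[pa] = {s: (counts[(pa, s)] + alpha) / total for s in states}
    return cpt

# Each row is one sample: observed discrete states for every node.
data = [{"G1": 0, "G2": 1}, {"G1": 0, "G2": 1}, {"G1": 1, "G2": 0}]
print(fit_cpt(data, child="G2", parents=["G1"]))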


3.1.5 Equivalence classes

It is important to note that more than one DAG may represent the same set of conditional independencies. A set of such DAGs belongs to the same equivalence class. Pearl & Verma (1991) showed that two DAGs are equivalent if and only if they have the same skeleton and the same v-structures. The skeleton is the underlying graph with undirected edges, and a v-structure is an ordered triple of nodes X, Y and Z such that X → Y and Y ← Z, and X and Z are not adjacent (not directly connected). In other words, equivalent graphs agree on the same underlying undirected structure, but the direction of some arcs may vary. This means that the equivalence class of a set of conditional independencies can be represented using a partially directed acyclic graph (PDAG), where only some arcs are directed. For example, for the network shown in Figure 3.1, the PDAG representing the BN equivalence class is shown in Figure 3.3a. The v-structure between the Gene2, Gene3 and Gene4 nodes is preserved. However, the arcs between Gene1, Gene3 and Gene2 become undirected. This PDAG represents the same conditional independencies as the DAG of the original network (Figure 3.1), and an alternative DAG that is shown in Figure 3.3b. Although this DAG does not represent the same intuitive ordering of variables, the conditional independence relationships are identical. It is possible to derive the PDAG representing the equivalence class for any DAG using an algorithm derived by Chickering (1995).
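The Pearl & Verma criterion is simple to check mechanically. The sketch below represents a DAG as a set of directed (parent, child) edges and tests two DAGs for equivalence; it illustrates the criterion only, not Chickering's algorithm, which additionally constructs the PDAG. The second DAG in the usage example is one member of the equivalence class of the Figure 3.1 network, not necessarily the exact DAG drawn in Figure 3.3b.

def skeleton(dag):
    # Undirected edge set of a DAG given as a set of (parent, child) pairs.
    return {frozenset(e) for e in dag}

def v_structures(dag):
    # Triples x -> y <- z where x and z are not adjacent in the skeleton.
    skel = skeleton(dag)
    return {(min(x, z), y, max(x, z))
            for (x, y) in dag for (z, y2) in dag
            if y2 == y and x != z and frozenset((x, z)) not in skel}

def equivalent(dag_a, dag_b):
    return (skeleton(dag_a) == skeleton(dag_b)
            and v_structures(dag_a) == v_structures(dag_b))

# The DAG of Figure 3.1 and a DAG that reverses the Gene1-Gene2 arc share
# the v-structure Gene2 -> Gene4 <- Gene3, so they are equivalent.
dag_a = {("G1", "G2"), ("G1", "G3"), ("G2", "G4"), ("G3", "G4")}
dag_b = {("G2", "G1"), ("G1", "G3"), ("G2", "G4"), ("G3", "G4")}
print(equivalent(dag_a, dag_b))  # True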

3.1.6 Causality

An advantage of BNs that is often cited is that they can indicate causal knowledge amongst a set of variables (Mitchell, 1997). However, it is not as simple as considering the directionality of the arcs in the DAG: an arc A → B does not necessarily imply that A causes B. This follows from the discussion on equivalence classes: two DAGs may represent the same independence relationships amongst variables, despite the directionality of some arcs varying. Instead, a common interpretation of causality in BNs has been related to the conditional independence relationships amongst variables. The key idea is that at least three


[Figure: (a) the PDAG and (b) an alternative DAG, both over nodes Gene1, Gene2, Gene3 and Gene4]

Figure 3.3: PDAG and an alternative DAG representing the equivalence class and conditional independencies amongst nodes of the BN shown in Fig 3.1

variables need to be measured, and one of these variables acts as a control for the relationship among the other two variables (Murphy, 2001b). In terms of conditional independence, this is expressed by stating that a variable is conditionally independent of other variables in the network, given the value of its parent nodes. For example, in the BN in Figure 3.1, Gene4 is conditionally independent of other variables in the network when given the values of Gene2 and Gene3, implying that Gene2 and Gene3 cause Gene4. However, BNs and their relationship to causality is a complicated and frequently discussed topic. Pearl (2000) contains a thorough discussion on whether causality can be distinguished from correlation or conditional independence assumptions implied by a DAG.

3.2 Bayesian networks to model gene regulation

Bayesian Networks have become a popular method for computational modelling of GRNs from microarray expression data (Friedman et al., 2000; Hartemink et al., 2002; Pe’er et al., 2006). In this section we describe how BNs can be used to model a GRN. Recall from section 3.1 that a BN describes a set of conditional independence relationships among a set of variables. In order to represent a GRN using a BN, we apply the notion of conditional independence to gene regulation.


3.2.1 Simple regulatory structures

It makes sense that genes which are regulatory in nature (TFs) will render the genes that they control conditionally independent. In other words, if a TF controls a set of genes, these target genes become conditionally independent given the regulator gene (the TF). Therefore, this type of simple regulatory structure involving a TF and its target genes can be easily represented using a BN. Suppose we have a set of target genes Xi and a regulatory gene TF. Then a network representing this can be formed with the TF variable as a parent node to the Xi nodes, as shown in Figure 3.4a. In this case the Xi are conditionally independent given the TF node. The network structure also makes sense in terms of links between nodes, i.e. the TF directly influences the values of the target genes. For control by multiple regulators, additional parent nodes representing TFs can be added, as shown in Figure 3.4b. In this case the target genes are conditionally independent given the values of all controlling TFs.

[Figure: (a) a single TF node with arcs to target genes X1, X2, ..., Xn; (b) two TF nodes, TF1 and TF2, each with arcs to the same target genes]

Figure 3.4: (a) shows a standard gene regulatory network, involving a TF parent node to target genes Xi. (b) is a modified version of the same network, but with an additional parent node representing a second regulating gene.

Since regulator genes can also be regulated by other TFs, we can build up more complex network structures using the principles described above, including longer chains of gene regulation. For example, Fig 3.5 shows a DAG structure for a set of yeast genes.

3.2.2 Modelling using microarray expression data

Microarray gene expression data is the most common type of data used to model GRNs using BNs. In this case each node actually represents the expression value


[Figure: a DAG over the yeast genes HSF1, RPN4, SKN7, TYE7, YAP1, SFL1, REB1, ROX1 and SIP4]

Figure 3.5: A more complex DAG structure representing a regulatory structure between a subset of yeast genes

for that gene. Note that the exact representation of the expression value may vary, i.e. whether it is a log ratio or an absolute value, depending on the microarray platform and data preprocessing techniques. Recall from chapter 2 that gene expression values are a continuous data type. However, in the research presented in this thesis gene expression values are discretised into three states using an equal-frequency binning method. This means that when using inference to find the expression value of a gene, we are actually finding the discretised state of that gene expression value. Discretisation is commonly applied to real gene expression data when it is used to model GRNs. For example, research by high-profile researchers in this field has used discretisation to classify gene expression values into 2 or 3 states such as 'active', 'no change' or 'repressed' (Friedman et al., 2000; Hartemink et al., 2002). Discretisation is of particular benefit with real data that have a small number of samples and/or can be noisy. It is a simple method that still allows complex regulatory structures to be modelled whilst avoiding the need to deal with parameterised continuous distributions. A minimal sketch of equal-frequency binning is given below.
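This sketch assumes NumPy; tie-handling at the quantile boundaries may differ from the exact implementation used in the thesis experiments.

import numpy as np

def discretise_equal_frequency(values, n_bins=3):
    # Interior cut points at the 1/3 and 2/3 quantiles (for three bins), so
    # each discrete state receives roughly the same number of samples.
    values = np.asarray(values, dtype=float)
    cuts = np.quantile(values, [i / n_bins for i in range(1, n_bins)])
    return np.digitize(values, cuts)  # states 0 (low), 1 (medium), 2 (high)

expression = np.array([0.12, -1.4, 0.03, 2.1, -0.5, 0.9, -2.2, 1.3, 0.4])
print(discretise_equal_frequency(expression))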


This means that the CPDs attached to each node will be CPTs, since the gene expression nodes are discrete-valued.

RPN4   HSF1   p(TYE7=high)   p(TYE7=low)
low    low    0.50           0.50
high   low    0.40           0.60
low    high   0.25           0.75
high   high   0.10           0.90

Table 3.2: Example CPT for target gene TYE7 (from Figure 3.5), which is controlled by two TFs, RPN4 and HSF1

The CPTs show how regulators influence the target genes. TFs may interact in a number of different ways to effect regulation on target genes; for example, a pairing of TFs may work where one is an activator and the other is a repressor. An example CPT for the gene TYE7, which has two regulators (taken from the DAG shown in Figure 3.5), is shown in Table 3.2. According to this CPT, TYE7 is repressed (low) when its TFs HSF1 and RPN4 are activated (highly expressed). Furthermore, the TF HSF1 has a larger influence on the repression than RPN4, as we can see from the probabilities when RPN4 is low and HSF1 is high. Additionally, when a node represents a gene expression value, we can apply the ideas of inference, explaining away and causality (described in Section 3.1) to GRNs. For example, inference can be used to find the probability distribution of expression values for a particular gene, based on the observed values of other genes, or to predict the expression value of a gene based on its probability distribution. In particular, this method can be used to validate GRN structures, by predicting the expression values of genes in the network over new independent datasets. This is discussed in more detail in section 3.4.

3.2.3 More complex regulatory interactions

The BN structures described in Section 3.2.1 only describe static relationships, where a relationship is found between the observations of gene expression values taken in the same sample. They do not model temporal interactions. For example, there may be a time delay between the activation of a TF and the resulting increase/decrease in expression by a target gene. In this case a relationship is found between the observations of expression values taken in samples collected


[Figure: a chain of nodes X(t-1) → X(t) → X(t+1)]

Figure 3.6: A simple DBN

at different time points during the same experiment. Such phenomena can be modelled using Dynamic Bayesian Networks (DBNs). When data samples come from a time-series of measurements, an observation at time t may carry some information on adjacent times (both before and after). This can be represented in a BN structure by using nodes for the values of variables at different time points, e.g. t − 1 or t + 1. As in static BNs, arcs between nodes indicate how variables influence one another; in general influence flows between time-consecutive nodes, a simple instance of which is shown in Figure 3.6. Since they can drill down to individual time steps, DBNs can also model cyclic behaviour. This is particularly useful in GRNs, which often contain feedback loops where TFs regulate themselves. An example of this is shown in Figure 3.7. Part (a) shows a cyclic graph where a feedback loop exists between variables

X1, X2 and X5. This is not a valid DAG since it includes a cycle. However, it can be represented as a DAG if we include nodes for each Xi at different time steps (which represent different sample points). This is shown in part (b). Here, arcs are included between the same nodes at different time slices, and additional arcs show the influences between different nodes at different time steps. In order to apply DBNs to model temporal and cyclic behaviour in gene regulation, gene expression data is required that contains samples taken at different times during an experiment (preferably where the samples are also evenly spaced). Acquiring multiple sets of such data can be difficult. In the research presented in this thesis, we focus on static BNs to model GRNs for two reasons: first, to build a solid foundation for combining datasets to model simple regulatory structures


[Figure: (a) a cyclic graph over X1, ..., X5 containing a feedback loop among X1, X2 and X5; (b) its potential representation as a DBN, with nodes X1(t-1), ..., X5(t-1) and X1(t), ..., X5(t)]

Figure 3.7: Cyclic behaviour represented in a DBN. This figure is adapted from Kim et al. (2003)

and secondly due to the paucity of suitable temporal data. However, Section 8.3 outlines how the research can be extended to model more complex regulatory structures.

3.3 Learning Bayesian network structures

There are two main approaches to building BN structures based on a set of data observations. Constraint-based approaches compute conditional independencies between variables and use these as constraints to build a PDAG. The search-and-score approach maximises a score based on how well the network fits the data. In this section we describe both approaches in more detail, and explain why we use the search-and-score method to learn BNs that represent GRNs. Following this, in Section 3.3.4 we describe a method for calculating the confidence level of learnt network features (such as an interaction between a pair of genes).

3.3.1 Constraint-based learning

Constraint-based approaches work by establishing conditional independencies between variables and forming a PDAG based on this. If two variables X and Y can be found to be conditionally independent given another set of variables S, then there is no link between X and Y in the network structure.


Examples of constraint-based approaches are IC (Inferred Causation) (Pearl, 2000) and the PC algorithm (Spirtes et al., 2000). Both algorithms follow a similar process. First, an undirected graph structure (the skeleton) is found by searching for conditional independence relationships among sets of variables. For each pair of variables X and Y, a set S of variables is sought such that X and Y are conditionally independent given S. X and Y are only connected in the graph if no such set can be found (i.e. X and Y are not conditionally independent). The PC algorithm gains efficiency here, as it only considers a reduced set S of conditioning variables: those variables that are adjacent to X and Y. Once the undirected graph structure is found, four rules are used to direct arcs to form a PDAG (Pearl, 1991). Conditional independence between variables can be established using a variety of methods, but the most commonly used involves partial correlation. Partial correlation measures the degree of association between two random variables with the effect of a set of (control) variables removed. If the partial correlation between the variables X and Y, conditional on a set of variables S, is zero, then X and Y are conditionally independent. A sketch of this computation is given below.
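This is a minimal sketch of partial correlation via regression residuals, assuming NumPy. In a full PC implementation, a significance test (for example, one based on Fisher's z transform) would decide whether a given partial correlation is close enough to zero to declare conditional independence.

import numpy as np

def partial_correlation(x, y, S):
    # Correlate the residuals of x and y after linearly regressing out the
    # control variables in S (an n_samples x n_controls array).
    def residuals(v):
        if S.shape[1] == 0:
            return v - v.mean()
        A = np.column_stack([S, np.ones(len(v))])
        beta, *_ = np.linalg.lstsq(A, v, rcond=None)
        return v - A @ beta
    rx, ry = residuals(np.asarray(x, float)), residuals(np.asarray(y, float))
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

rng = np.random.default_rng(0)
z = rng.normal(size=500)                    # common regulator
x = z + rng.normal(size=500)                # two targets of z
y = z + rng.normal(size=500)
print(partial_correlation(x, y, np.empty((500, 0))))  # strongly correlated
print(partial_correlation(x, y, z.reshape(-1, 1)))    # near zero given z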

3.3.2 Score-based learning

The search-and-score approach has generally been more popular for learning BNs, in particular for inferring gene regulatory relationships. The approach performs a search through the space of possible networks and scores each structure. The aim is to identify the network with the maximum score. A variety of search strategies can be used, the simplest being a greedy hill-climb. In the research presented in this thesis, we use a simulated annealing approach in order to avoid local maxima. The search begins with an empty network. At each stage of the search, networks in the current neighbourhood are found by applying operators such as add arc, remove arc and reverse arc to the current network. We use the Bayesian Information Criterion (BIC) (Schwartz, 1978) for scoring candidate networks. The BIC function is a combination of the model log-likelihood and a penalty term that favours less complex models; as such it is similar to the minimum description length:


BIC = log P(θ) + log P(D|θ) − 0.5 k log(n)

where θ represents the model, D is the data, n is the number of observations (sample size) and k is the number of parameters. log P(θ) is the log prior probability of the network model θ and log P(D|θ) is the log-likelihood, while 0.5 k log(n) is a penalty term which helps to prevent overfitting by biasing the score towards simpler, less complex models. The BIC is part of a family of Information Criterion scoring functions that take a common formulation but with different penalty terms (Stoica, 2004). For example, the Akaike Information Criterion (AIC) (Akaike, 1974) has a penalty term of 2k (twice the number of parameters), whereas the BIC's penalty term, 0.5 k log(n), depends on the number of model parameters but also on the number of samples. Since the BIC's penalty term takes the number of samples into account, it is more appropriate for dealing with microarray datasets, which commonly contain only a small number of sample points.
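The following sketch computes the BIC for a discrete BN under a uniform structure prior (so the log P(θ) term vanishes) and maximum likelihood CPTs, assuming complete data; it is an illustration only, not the Bayes Net Toolbox implementation used in this thesis.

import math
from collections import Counter

def bic_score(data, structure, states=3):
    # `structure` maps each node to its parent list; `data` is a list of
    # dicts mapping node -> discrete state. Uses maximum likelihood CPTs,
    # so each count contributes count * log(count / parent_count) to the
    # log-likelihood.
    n = len(data)
    loglik, k = 0.0, 0
    for node, parents in structure.items():
        counts = Counter((tuple(r[p] for p in parents), r[node]) for r in data)
        pa_counts = Counter(tuple(r[p] for p in parents) for r in data)
        loglik += sum(c * math.log(c / pa_counts[pa])
                      for (pa, _), c in counts.items())
        k += (states - 1) * states ** len(parents)  # free parameters per node
    return loglik - 0.5 * k * math.log(n)

# Usage: score a two-node structure G1 -> G2 on discretised expression data.
data = [{"G1": 0, "G2": 0}, {"G1": 1, "G2": 2}, {"G1": 1, "G2": 2}]
print(bic_score(data, {"G1": [], "G2": ["G1"]}))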

3.3.3 Comparison of both approaches

Comparative evaluations of constraint-based and search-and-score approaches for learning BNs, specifically for the modelling of GRNs, have been previously carried out (Pournara, 2005; Werhli et al., 2006). In particular, Werhli et al. find that for certain experimental types (interventional microarray studies, where the environment or cells are deliberately interfered with for the purposes of the experiment), the score-based approach outperforms the constraint-based approach. However, for simple observational microarray studies, there is insufficient evidence that either approach performs best.

58 3.3 Learning Bayesian network structures of interactions among genes, since all edges can be directed, whilst the constraint- based approach produces PDAGs only. In addition, since the score-based ap- proach is not deterministic, confidence levels for each edge can be inferred (as described in the next subsection 3.3.4). In this thesis, the search-and-score ap- proach, as described in Section 3.3.2, is used to learn BNs that represent GRNs, in conjunction with the BIC scoring mechanism that is implemented in the Bayes Net Toolbox (Murphy, 2001a).

3.3.4 Confidence levels for network edges

When learning BNs using the search-and-score method, a different network may be learnt each time. This is the case when there is a random element to the search strategy, such as in simulated annealing. As we are interested in which genes have regulatory interactions, i.e. whether an edge appears between two genes, it is not enough just to learn a network with a high score, since two networks with similarly high scores may represent different sets of gene interactions. Friedman et al. (1999) devised a method for computing a level of confidence for features within a BN. For example, a feature could be the existence of an arc between two nodes in the network. We make use of this method to generate more robust network structures and obtain confidence levels on whether two nodes are connected. The method is based on a well-known statistical method, Efron's Bootstrap (Efron & Tibshirani, 1993). Given a dataset D containing N observations, we create a new dataset by re-sampling N times with replacement from D. A BN is learnt from the re-sampled dataset. This process is repeated m times, so that finally m BNs have been learnt. An estimate of the confidence level for each feature is computed as the proportion of networks that contain that feature. We define a feature as the existence of an edge between two nodes in the network. Thus, network structures learnt using this bootstrapping method are essentially confidence matrices, where the (i, j)th entry indicates the confidence level of the directed edge from node i to node j. We refer to such a matrix as the bootstrapped network. Figure 3.8 shows an example bootstrapped network in both a graph and matrix format. A sketch of the procedure is given below.
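This sketch assumes NumPy; the structure-learning routine is passed in as a function and is assumed rather than defined here. In this thesis it would be the simulated annealing search over the BIC score described in Section 3.3.2.

import numpy as np

def bootstrapped_network(data, learn_structure, m=100, seed=0):
    # `learn_structure` stands in for any structure-learning routine that
    # returns a 0/1 adjacency matrix over the genes; it is an assumption.
    rng = np.random.default_rng(seed)
    n_samples, n_genes = data.shape
    confidence = np.zeros((n_genes, n_genes))
    for _ in range(m):
        rows = rng.integers(0, n_samples, size=n_samples)  # with replacement
        confidence += learn_structure(data[rows])
    return confidence / m  # entry (i, j) = confidence in the edge i -> j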


[Figure: (a) a graph over Gene1, ..., Gene4 with each possible edge labelled by its confidence; (b) the matrix representation:

      G1    G2    G3    G4
G1    0     0.57  0.57  0.37
G2    0.56  0     0.31  0.34
G3    0.54  0.28  0     0.47
G4    0.39  0.33  0.48  0

(c) the PDAG extracted at confidence threshold 0.45]

Figure 3.8: A bootstrapped network is actually a matrix of confidence level estimates for each possible edge in the network. (a) shows the network graph representation. (b) shows the matrix representation; with larger variable sets, this is an easier way to view the network. (c) shows the PDAG extracted from the bootstrapped network, when the confidences are thresholded at 0.45


To create bootstrapped networks that take account of equivalence classes, BNs are learnt from each resampled dataset and then converted to PDAGs. Confidence estimates for each edge are then calculated on this set of PDAGs. It is important to note that the edge i → j may have a different confidence estimate to the edge i ← j. Where directed edges are present in a PDAG, they contribute only to the confidence estimate for the edge in that direction, whereas undirected edges contribute to the confidence estimates for an edge in both directions. A PDAG that represents the bootstrapped network at a certain confidence level can be formed by thresholding, as sketched below. If an edge has a confidence level above the threshold, it is included in the PDAG (and if edges are found in both directions, e.g. from node i → j and i ← j, then the edge is undirected). Thus, if directional dependencies have enough support in the bootstrapping process, they will be captured and represented in the final thresholded PDAG. An example of such a thresholded PDAG is shown in Figure 3.8. Note that this method of thresholding does present the possibility that the extracted graph may not be a valid PDAG; that is, the network structure could be cyclic. However, in our experiments this did not occur. If it were the case, the network can be made acyclic by undirecting an edge in the cycle. The edge to be undirected can be selected by finding which one has the least support for being directed (that is, the smallest difference between the confidences in each direction).
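This minimal sketch of the thresholding step assumes NumPy. Applied to the confidence matrix of Figure 3.8b, it recovers the three undirected edges of the thresholded PDAG at confidence level 0.45.

import numpy as np

def threshold_pdag(confidence, tau=0.45):
    # Keep edge i -> j if its confidence exceeds tau; if both directions
    # exceed tau, the edge becomes undirected.
    n = confidence.shape[0]
    directed, undirected = [], []
    for i in range(n):
        for j in range(i + 1, n):
            fwd, bwd = confidence[i, j] > tau, confidence[j, i] > tau
            if fwd and bwd:
                undirected.append((i, j))
            elif fwd:
                directed.append((i, j))
            elif bwd:
                directed.append((j, i))
    return directed, undirected

conf = np.array([[0,    0.57, 0.57, 0.37],
                 [0.56, 0,    0.31, 0.34],
                 [0.54, 0.28, 0,    0.47],
                 [0.39, 0.33, 0.48, 0   ]])
print(threshold_pdag(conf))  # ([], [(0, 1), (0, 2), (2, 3)])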

3.4 Evaluation of model performance

After learning a BN, we wish to evaluate the performance of the model. This section relates specifically to the evaluation of BNs learnt to represent GRNs. We present the two methods that are used to evaluate the research in this thesis. The first method compares the network structure to documented knowledge. In terms of GRNs, this means comparing the regulatory interactions represented by the BN with interactions documented in the literature (which have been confirmed, usually through biological experiments). The second method of evaluation does not require documented knowledge; instead it uses BN inference to predict node (gene expression) values based on unseen and independent observations.


                       Edge in learnt network   Edge in true network
True positive (TP)            yes                      yes
False positive (FP)           yes                      no
True negative (TN)            no                       no
False negative (FN)           no                       yes

Table 3.3: Comparison between the learnt network and the true network in terms of true and false positives and negatives

3.4.1 Network structure comparison to current knowledge

In the first evaluation method, gene interactions represented by the BN are compared to documented gene interactions found in the literature. A 'true' network is formed from documented interactions, where a directed edge X → Y between two genes X and Y exists if a confirmed regulatory interaction between that pair of genes can be found documented in the literature. The learnt BN structure can be compared to the true network in terms of true and false positives and negatives (see Table 3.3). A true positive (TP) is an edge that is present in both the learnt and true networks. A false positive (FP) is an edge that is present in the learnt network but not in the true network. A false negative (FN) is an edge that is in the true network but not in the learnt network, whilst a true negative (TN) is an edge that is in neither the true nor the learnt network. In terms of the directionality of edges in the learnt network, if the direction conflicts with that in the true network, then the edge is counted as a FP. If the learnt network contains an undirected edge that is directed in the true network, we count this as a TP. The true networks can be obtained from various sources according to the organism of interest. For example, E. coli regulatory interactions are documented in the online database RegulonDB (Salgado et al., 2006) and yeast interactions (both confirmed and potential) are listed in the YEASTRACT database (Teixeira et al., 2006). However, the online databases from which our 'true' networks are extracted are limited to interactions that have been confirmed by biological studies. For example, RegulonDB contains regulatory information for only about


[Figure: a ROC curve plotting TP rate (y-axis, 0 to 1) against FP rate (x-axis, 0 to 1)]

Figure 3.9: An example of a ROC curve representing the evaluation of a bootstrapped network

25% of the genes in the E. coli genome (Faith et al., 2007). Therefore the proportion of FP interactions recorded in our learnt networks is likely to be higher than in reality. In order to compare the numbers of TP and FP edges in different networks, we use Receiver Operator Characteristic (ROC) curves (Krzanowski & Hand, 2009). A ROC curve allows one to view graphically the performance of a classifier by plotting the TP rate (the proportion of true interactions that are identified) against the FP rate (the proportion of incorrectly identified interactions):

TP rate = TP / (TP + FN)        FP rate = FP / (FP + TN)

In a ROC space, a perfect network (i.e. identical to the true network) would have a TP rate of 1 and a FP rate of 0, which would sit at the top-left corner of the plot. For this particular application of modelling GRNs, whilst we do not want to 'miss' documented interactions (i.e. a high TP rate is desirable), a low FP rate can be more important, as FPs are significantly more costly to biologists (since they may lead to unnecessary and expensive wet lab experiments).


For our experiments we plot a ROC curve for a bootstrapped network, where each point corresponds to the TP and FP rates of the PDAG extracted from the bootstrapped network at a particular confidence threshold. The Area Under the ROC Curve (AUC) is a global measure of classifier performance and is often used in classification problems. The AUC is a value between 0 and 1. The AUC measures discrimination, that is, the ability of the model to correctly classify instances (in this case, an instance is whether an interaction between a pair of genes exists). The AUC also specifies the probability that when we draw one positive and one negative example at random, a higher value is assigned to the positive than to the negative example. This direct interpretation of the AUC originates from the use of the ROC in applications where instances can be assigned a value or score that can be used to rank them from most to least likely positive; for example, in medical studies where patients are classified into diseased and healthy and assigned a score based on the severity of their disease (Hanley & McNeil, 1982). In this thesis, the AUC is used to compare bootstrapped networks generated by different algorithms. In general, the closer the AUC is to 1 (and the further away from 0.5) the better the overall performance of the network. Figure 3.9 shows a ROC curve (the solid line) representing the evaluation of a bootstrapped network. Each point on the curve represents the TP and FP rate of a PDAG extracted from the bootstrapped network at a different confidence threshold. In this case the thresholds are between 0 and 1 at intervals of 0.1. The points closest to (1,1) represent lower thresholds (0, 0.1, 0.2, ...), where more edges appear in the PDAG. The points closest to (0,0) represent higher thresholds (1, 0.9, 0.8, ...), where fewer edges appear in the PDAG. Note that multiple points on the ROC curve are concentrated at (0,0), due to a lack of edges in the bootstrapped network with high confidences. The AUC in this case is 0.77. The dotted line indicates the TP=FP line, which has an AUC of 0.5. Any point on this line represents a PDAG that cannot discriminate between true and false edges. For this reason, a well-performing network should have a curve above the TP=FP line and an AUC greater than 0.5.
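This sketch, assuming NumPy, computes the TP and FP rates over a range of thresholds together with a trapezoidal estimate of the AUC. For brevity it ignores edge direction, whereas the evaluation described above counts a wrongly directed edge as a FP.

import numpy as np

def roc_points(confidence, true_adj, thresholds=np.arange(0.0, 1.01, 0.1)):
    # TP and FP rates of the networks obtained by thresholding a bootstrapped
    # confidence matrix, compared against a 'true' adjacency matrix.
    mask = ~np.eye(true_adj.shape[0], dtype=bool)   # ignore self-edges
    truth = true_adj[mask].astype(bool)
    tpr, fpr = [], []
    for t in thresholds:
        pred = confidence[mask] >= t
        tp, fp = (pred & truth).sum(), (pred & ~truth).sum()
        fn, tn = (~pred & truth).sum(), (~pred & ~truth).sum()
        tpr.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    order = np.argsort(fpr)  # integrate TP rate over FP rate, left to right
    return float(np.trapz(np.asarray(tpr)[order], np.asarray(fpr)[order]))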


3.4.2 Prediction of gene expression values

Another method of BN performance evaluation is the prediction of node values on an independent dataset. A BN that is better at predicting node values on unseen data can be said to be more robust, as it is less likely to be overfitting the training data. In the case of a BN that represents a GRN, we predict gene expression values on an independent microarray expression dataset. To predict gene expression values, we estimate the CPDs using the same expression dataset from which the structure was learnt, using maximum likelihood parameter estimation (assuming we have complete data). Using the parameterised structure, we can then predict the (discretised) expression value of each gene, based on the expression values of its influencing genes in the network, over samples from unseen independent datasets. The success of gene expression prediction is measured using prediction accuracy, which is the proportion of samples where the predicted discrete states are correct.
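A minimal sketch of the prediction and accuracy steps, reusing the CPT layout of the fit_cpt sketch in Section 3.1.4; a full implementation would instead use junction-tree inference over all observed nodes, as noted in Section 3.1.2.

def predict_state(cpt, parent_states):
    # Most probable discretised expression state of a gene given the
    # observed states of its parents.
    dist = cpt[tuple(parent_states)]
    return max(dist, key=dist.get)

def prediction_accuracy(predicted, observed):
    # Proportion of samples whose predicted discrete state matches the truth.
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)

print(prediction_accuracy([0, 2, 1, 1], [0, 2, 2, 1]))  # 0.75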

3.5 Recent work in using BNs to model GRNs

Research on using BNs to model gene regulation first began in the late 1990s. Prior to BNs, most analyses performed on gene expression data to infer regulatory relationships were clustering techniques used to extract groups of co-regulated genes. However, clustering can only extract groups of correlated genes and not the regulatory network structure. BNs are able to discover more complex, non-linear relationships and transparently represent the nature of interactions (for example, how regulators act in combination) through their conditional probability distributions. However, note that clustering is still used sometimes as a basis for gene network learning using BNs (Segal et al., 2003a). For example, Bar-Joseph et al. (2003) use a clustering basis for their computational framework to discover gene regulation modules. The remainder of this section documents the original papers and notable more recent work on using BNs to model GRNs. Related work on the incorporation of prior information and multiple gene expression datasets into GRN modelling, the main research areas covered by this thesis, is covered in later chapters 4, 5 and 6.


Friedman et al. (2000) published the first research on using BNs to learn gene interactions from yeast expression data. They use the fact that BNs capture properties of conditional independence amongst variables to model statistical dependencies amongst sets of genes, as described earlier in Section 3.2.1. Within the BN, the expression levels of genes are represented as nodes and directed arcs between nodes indicate interactions between genes. The Sparse Candidate algorithm is used to restrict the search space of possible networks for learning. This algorithm uses simple local statistics (e.g. correlation) to identify a relatively small number of candidate 'parent' genes. This is necessary because of the large number of genes usually included in a standard set of expression data. A search-and-score learning process is used, with the Bayesian score (which is similar to the BIC score, but does not include penalty terms). The algorithm is applied to a yeast gene expression dataset (Spellman et al., 1998), which is a classic test dataset in bioinformatics research. Learnt networks are biologically validated in two ways: order relations and Markov relations. Markov relations concern whether genes are directly connected in the network, and order relations are concerned with the direction of the influences between genes. For example, if gene X is an ancestor of Y, then this provides an indication that X has a causal influence on Y. Analysis of the learnt networks uncovered many interesting relationships, many of which make sense biologically (the learnt BNs were not formally validated using TP and FP rates, but when the research was published less was known about regulatory interactions). Since this work, BNs have been used frequently in learning GRNs. A summary of the most significant and relevant research is presented next. Hartemink et al. (2002) present a similar BN methodology for constructing gene networks, which was developed concurrently with the work by Friedman et al. Networks are structured similarly, with variables representing gene expression levels; however, they also include latent (hidden) variables representing gene protein levels, which are not included in the data and are therefore unobserved. They extend the BN framework by adding the ability to 'annotate' the network edges. For example, an edge X → Y can be annotated with '+' (positive: if X is high, Y is biased to be high) or '−' (negative: if X is high, Y

is biased to be low). These correspond to activator/repressor roles of regulator genes. Imoto et al. (2002) use BNs and nonparametric regression with Gaussian noise models in order to capture linear and nonlinear relationships between genes. This method uses continuous modelling for the gene variables as opposed to discretisation, which can require a large number of parameters to be learnt and use suboptimal thresholds in the discretisation process, potentially leading to a loss of information. However, discretisation can offset noise in the data. Imoto et al. (2002) observe that their constructed networks confirm the results of Friedman et al., though extra genes are found to mediate some of the relationships found by Friedman et al. Segal et al. (2003a) published work on Module Networks, a method based on BNs for learning regulatory modules from gene expression data. A regulatory module is defined as a set of genes whose expression levels are controlled by a small set of regulator genes. Modules are interacting, so the learnt modules provide a global view of the regulatory network, as opposed to smaller, local-scale networks. The modules also provide further detail as to the conditions under which regulation occurs, in the form of testable hypotheses ('regulator X regulates module Y under conditions W'), referred to as a regulatory or regulation program. The algorithm is fairly involved and is described in further detail in a technical paper (Segal et al., 2005). Candidate regulator genes are drawn from a list of putative transcription factors. Regulatory programs are represented using regression trees. These allow the if-then contexts (as described earlier) for regulation to be used. Genes are initially assigned to modules through a clustering procedure. Modules and regulatory programs are iteratively improved using an expectation-maximisation procedure and scored using the Bayesian score. At each iteration, the regulatory program for each module is refined and genes are reassigned to the module which best fits its regulatory program. More recently, Pe'er et al. (2006) presented an algorithm, MinReg, which scales to learning large BN-based GRN models efficiently from microarray data. This work uses biologically motivated restrictions on the network structure in order to reduce the search space during learning. For example, such restrictions include limiting the parent nodes to putative transcription factors. Additionally,

a gene is only considered to be a possible regulator gene if it achieves high scores consistently for many target genes in the learning process. The algorithm performed well in both a statistical and a biological validation process over synthetic and yeast microarray data. In particular, the algorithm was tested on microarray data for a mammalian organism (mouse), which is generally more complex than simpler organisms such as yeast, and discovered many key regulator genes. MinReg also outperformed the Module Networks algorithm when both were applied to the same set of yeast genes and microarray dataset. Friedman et al. (1998) and Murphy & Mian (1999) first postulated the use of DBNs to model gene expression data. Friedman et al. discuss extending the learning of BN structures to DBNs and present gene networks as an application, but do not actually use real microarray data. The technical report by Murphy & Mian was a review of DBNs and learning algorithms and did not actually contain any experiments using biological data. Ong et al. (2002) were the first to apply DBNs to real time-series microarray data, using latent (hidden and unobserved) nodes to represent regulatory modules, i.e. groups of genes that are controlled together by one regulator. In the following year, Kim et al. (2003) applied DBNs to the classic yeast data set (Spellman et al., 1998). They compared discrete and continuous models and appropriate learning algorithms for each type of model, though they make no conclusion over whether either method is superior. Zou & Conzen (2005) present a DBN approach which the authors assert has increased accuracy and reduced computational time compared with other DBN methods. This paper deals with two key issues: the lack of a systematic method for establishing a biologically relevant transcriptional time lag, and the excessive computational time needed for model selection (in a search-and-score learning approach). The results presented in the paper, where the approach is applied to yeast time series expression data (Chou et al., 1998), show the method is significantly faster and identifies more known gene relationships than a standard DBN. In order to reduce the number of potential regulator genes (and thus the model search space), they use the biological fact that most regulator genes exhibit an earlier or simultaneous change in expression when compared to their targets (Yu et al., 2003). They also estimate the time delay between the regulator and

target from the data, which also reduces the number of potential networks. In previous work the time delay has usually been assumed to be the sampling time unit, which may not always be constant in the data and may have no bearing at all on the transcriptional time lag. In summary, the use of BNs to reverse-engineer GRNs is a well established area of research. BNs have been successfully applied to many microarray gene expression datasets by many researchers and have been proven to provide a natural and transparent representation for a GRN. For these reasons, BNs are chosen as the tool of choice for modelling GRNs in this thesis. However, whilst BNs are a well-suited modelling technique, the data quality issues associated with microarray expression data (as described in chapter 2) have not yet been fully addressed. The remaining chapters in this thesis address the incorporation of additional data sources and the use of multiple microarray datasets for modelling GRNs using BNs. There has been previous research in integrating prior knowledge into BN modelling (which is discussed fully in chapter 4), but the work presented in this thesis focuses on integrating a particular data source, text-based knowledge from literature, on a scale which has not been addressed previously. There has been little research in general in using multiple microarray datasets to reverse-engineer GRNs, and all research of which we are aware has focused on the use of other modelling techniques. Chapters 5 and 6 present a novel approach for using multiple microarray datasets with BNs.

Chapter 4

Prior knowledge

4.1 Introduction

Whilst the microarray provides the most available genome-wide data source on gene expression, there are concerns over its reliability and the reproducibility of results across microarray platforms or laboratories (MAQC consortium, 2006; Tan et al., 2003). However, the drawbacks of using only microarray data to reconstruct GRNs can be alleviated by incorporating other complementary data sources as prior knowledge in the modelling process. There are many other data sources that contribute to available knowledge on GRNs, such as transcription factor binding site location data, protein-protein interactions, and literature-based knowledge. This chapter presents some of the first research on the incorporation of prior knowledge from the whole body of biological and medical-related literature into BN models of GRNs. In this work we use a comprehensive collection of prior knowledge (based on the body of related literature in the field) and show that this content helps to improve the modelling process. The ability to use such a large body of prior knowledge lies in the use of advanced biological text mining techniques, literature-based gene concept profiling and the Associative Concept Space (ACS) (Jelier et al., 2007; Schuemie et al., 2007a). Using these techniques, a measure of association between a pair of genes can be calculated based not only on the co-occurrence of entities in the same document (a ‘one-step’ relation), but

also on indirect, multi-step relations, where concepts are linked via a number of documents. It allows the generation of an association matrix for gene-pairs, where each entry represents how related the genes are based on a database of scientific literature. The research presented in this chapter has been conducted in collaboration with the Biosemantics Association (http://www.biosemantics.org), who have been responsible for the development of these text mining techniques. BNs provide a natural mechanism for incorporating prior knowledge through the use of a prior probability distribution on candidate network structures. Building on existing network edge decomposition techniques for building such a prior distribution, this chapter presents a methodology to translate literature-based gene association matrices into a prior probability distribution across network structures, which can then be integrated into the BN learning process. We evaluate its use by comparing BN models learnt with and without prior knowledge on three different gene sub-networks for yeast, E. coli and human organisms, and also investigate the effect of weighting the influence of the prior probability distribution. The experimental findings show that literature-based priors can improve both the number of true regulatory interactions present in the network and the accuracy of expression value prediction on genes, in comparison to a network learnt solely from microarray expression data. The remainder of the chapter is organised as follows. An overview of previous research in relation to this work is provided in Section 4.2. Section 4.3 describes how prior knowledge can be incorporated in learning BNs. Section 4.4 explains literature-based gene concept profiling, and presents a method for translating the literature-based knowledge to prior probabilities for BN network structures. Section 4.5 details the experimental results on three real sub-networks. Finally, Section 4.6 summarises and discusses the findings.

4.2 Related work

Previous research on the integration of literature-based knowledge into the reverse-engineering of GRNs with BNs has utilised online databases such as KEGG as sources of prior knowledge. In the most notable comparable work, Imoto et al. (2003) use energy functions to incorporate prior knowledge sources into Bayesian


GRN models and propose the incorporation of many types of different prior knowledge, including literature-based knowledge extracted from regulatory interactions that are recorded in the Yeast Proteome Database (YPD). Later, Werhli & Husmeier (2007) extended the approach of Imoto et al. to multiple sources of prior knowledge and applied their approach to combining protein-protein interactions and KEGG pathways with expression data. The use of advanced text mining techniques provides an advantage over this research, since databases such as YPD and KEGG rely on the addition of manual annotations, and as the volume of scientific publications becomes prohibitively large, keeping up-to-date information within them becomes increasingly challenging. In contrast, literature-based gene concept profiling can quickly harness the information contained in a huge number of documents into a simple, clear format. Informative network structure priors have previously been used to incorporate prior knowledge into BN based GRN models. For example, Bernard & Hartemink (2005) use the technique to incorporate transcription factor binding site location data. However, it has not been applied with the type of literature-based information used in this research, and weighting the influence of the prior knowledge has not been addressed. More recently, in research that was developed concurrently with the work that is presented in this chapter, Larsen et al. (2007) and Almasri et al. (2008) have used literature-based prior knowledge generated using a tool called PathwayAssist (Nikitin et al., 2003; Novichkova et al., 2003), a natural language processing tool that can identify potential gene interactions based on a collection of literature. In this tool, the natural language processing is based on parsing the meaning of sentences; entities must appear in the same sentence to be related, which is not a restriction of the concept profiling technology that is used in this chapter. The work in this chapter also extends beyond this by considering a weighting on the influence of the prior knowledge during learning. Additionally, we provide a more thorough evaluation of the learnt networks, through gene expression value prediction in addition to a comparison of documented interactions contained in the learnt networks.


4.3 Informative Bayesian network priors

As discussed in chapters 2 and 3, in this research BNs are used to model GRNs. BNs provide a natural mechanism for incorporating prior knowledge relating to the network structure through a prior probability distribution across candidate network structures. Recall the score-based search method for learning BNs (see Section 3.3). This approach performs a search through the space of possible networks, scoring each structure, to identify the network with the maximum score. We use the Bayesian Information Criterion (BIC) for scoring candidate networks. As described in section 3.3.2, the BIC function is calculated by:

BIC = log P(S) + log P(D|S) − 0.5 k log(n)

where log P(S) is the log prior probability of the network model S and log P(D|S) is the log-likelihood of the data given the model. Usually the prior probability of a network structure, P(S), is chosen to be uninformative, that is, a uniform prior where every structure is equally likely, and therefore it is usually not included in the score calculation. However, in this research we consider the use of an informative prior (i.e. one which is not uniformly distributed over the possible networks) based on knowledge contained in the scientific literature. This means that if a particular network structure is favoured in the literature, it will have a higher prior probability, and this will be considered in the scoring process. Section 4.3.1 explains how the prior probability for a network structure can be calculated using an edge decomposition method and Section 4.3.2 discusses how the influence of the prior can be varied by using weighting in the score.

4.3.1 Calculating the prior probability of a network structure

A prior probability distribution for candidate network structures assigns each possible structure a probability such that all probabilities sum to 1. However, enumerating all possible structures is infeasible in most cases. A more intuitive method to calculate the prior probability is by using an edge decomposition technique developed by Castelo & Siebes (2000). Most often, we will find that prior

knowledge can be easily represented by edge probabilities (i.e. probabilities indicating whether an edge exists between each possible pair of nodes). For example, this is appropriate for including the literature-based knowledge, since it can be represented using measures of association between two genes (see Section 4.4.1). Bernard & Hartemink (2005) also use this edge-wise decomposition method to include location binding data (which provides a confidence on whether two genes have a regulatory relationship) in BNs. An overview of the approach is provided next. Suppose that we wish to find the prior probability for a network structure S, and probabilities for the existence of each edge in S are provided by an expert or based on some prior knowledge. Now, if a and b are two nodes in a network and B is the expert prior knowledge, then

p(a → b|B) + p(a ← b|B) + p(a...b|B) = 1

where a → b, a ← b indicate directed edges and a...b indicates that an edge between a and b does not exist. In other words, the probabilities of the edge existing in either direction or not existing at all sum to 1. The next step is to use these edge probabilities to form a probability for the whole network. If we make the assumption that the prior information on the existence of each edge is independent of that on the others (in other words, the probability for the edge between a and b is not related to the edges between a and c or c and d, and so on), then we can multiply the probabilities together to obtain the probability of the whole network S, such that

P(S|B) = ∏_{xi↔xj ∈ S, i≠j} p(xi ↔ xj|B) · ∏_{xi...xj ∈ S, i≠j} p(xi...xj|B)    (4.1)

where xi and xj are nodes in S, xi ↔ xj represents an edge between xi and xj in either direction and xi...xj represents an edge that does not exist. In other words, the prior probability of the whole network S is formed by multiplying together the probability that each edge in S exists and the probability that each edge not present in S does not exist. We should note that in this case the independence assumption may be violated if the different sources of the prior information are not independent, which can lead to edge probabilities becoming dependent. For example, if one

edge probability is high, another related edge may have a higher probability. This could lead to the introduction of surplus false positive edges into the learnt networks. However, as noted by Castelo & Siebes (2000), the assumption of independence ensures that the calculation of the prior probability distribution is tractable. Then, taking logs:

log P(S|B) = Σ_{xi↔xj ∈ S, i≠j} log p(xi ↔ xj|B) + Σ_{xi...xj ∈ S, i≠j} log p(xi...xj|B)    (4.2)

Thus, by the edge decomposition method, the log prior probability of a network S (as required for the BIC score of a network) can be calculated by summing the log prior probabilities that each edge contained in S exists and that each edge not present in S does not exist.

For example, consider candidate networks of four nodes, x1, x2, x3 and x4, where the prior probabilities for the existence of each edge are shown in the matrix in Figure 4.1a. In this matrix, the (i, j)th entry indicates the probability that an edge xi → xj exists. The probability that an edge between two nodes does not exist can be inferred from this matrix, i.e. p(xi...xj) = 1 − p(xi → xj) − p(xj → xi).

Now consider the example network structure S1 shown in Figure 4.1b. The log prior probability for this structure can be found by applying equation 4.2 so that

\[
\log P(S_1) = \sum_{x_i \leftrightarrow x_j \in S_1,\; i \neq j} \log p(x_i \leftrightarrow x_j) + \sum_{x_i \cdots x_j \in S_1,\; i \neq j} \log p(x_i \cdots x_j)
\]

where:

\[
\sum_{x_i \leftrightarrow x_j \in S_1,\; i \neq j} \log p(x_i \leftrightarrow x_j) = \log p(x_1 \to x_2) + \log p(x_2 \to x_3) + \log p(x_2 \to x_4)
\]

and

\[
\sum_{x_i \cdots x_j \in S_1,\; i \neq j} \log p(x_i \cdots x_j) = \log p(x_1 \cdots x_3) + \log p(x_1 \cdots x_4) + \log p(x_3 \cdots x_4)
\]


(a) Edge prior probabilities:

         x1     x2     x3     x4
    x1   -      0.35   0.5    0.1
    x2   0.2    -      0.8    0.75
    x3   0.2    0.1    -      0.2
    x4   0.25   0.15   0.15   -

(b) Example network S_1: a diagram of the structure with edges x_1 → x_2, x_2 → x_3 and x_2 → x_4 (not reproducible in text)

Figure 4.1: (a) shows the edge prior probabilities for a set of nodes whilst (b) shows an example network structure S_1

so, this means that

\[
\log P(S_1) = \log 0.35 + \log 0.8 + \log 0.75 + \log 0.3 + \log 0.65 + \log 0.65 \tag{4.3}
\]
\[
= -3.6262 \tag{4.4}
\]

and removing logs, we have that

P(S_1) = 0.0266
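To make the edge-wise decomposition concrete, here is a minimal Python sketch (the function and variable names are illustrative, not taken from the thesis implementation) that reproduces the worked example, computing the log prior of S_1 from the matrix in Figure 4.1a using natural logarithms:

    import numpy as np

    def log_structure_prior(edge_probs, edges):
        """Log prior of a network structure via edge-wise decomposition
        (Equation 4.2): sum the log probabilities that each present edge
        exists and that each absent edge does not exist."""
        n = len(edge_probs)
        total = 0.0
        for i in range(n):
            for j in range(i + 1, n):          # each unordered pair once
                if (i, j) in edges:            # directed edge x_i -> x_j
                    total += np.log(edge_probs[i][j])
                elif (j, i) in edges:          # directed edge x_j -> x_i
                    total += np.log(edge_probs[j][i])
                else:                          # no edge between x_i and x_j
                    total += np.log(1 - edge_probs[i][j] - edge_probs[j][i])
        return total

    # The matrix from Figure 4.1a (entry [i][j] is p(x_i -> x_j)) and the
    # example structure S_1 with edges x1->x2, x2->x3 and x2->x4
    probs = [[0.0, 0.35, 0.5, 0.1],
             [0.2, 0.0, 0.8, 0.75],
             [0.2, 0.1, 0.0, 0.2],
             [0.25, 0.15, 0.15, 0.0]]
    s1 = {(0, 1), (1, 2), (1, 3)}
    lp = log_structure_prior(probs, s1)
    print(round(lp, 4), round(float(np.exp(lp)), 4))   # -3.6262 0.0266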

4.3.2 Weighting the prior

The influence of the prior can be varied by including a weight in the score calculation:

\[
\mathrm{BIC} = w \log P(S) + \log P(D \mid S) - 0.5\, k \log(n)
\]

where w ∈ [0, 1] is referred to as the prior weight. A weight of 0 corresponds to a uniform (uninformative) prior, whilst a weight of 1 includes the full log prior probability in the score calculation.
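Weighting enters the score as a simple multiplier on the log prior; a one-line sketch, assuming the log-likelihood log P(D|S) and the parameter count k are supplied by the BN learning procedure of Chapter 3 (the function name is hypothetical):

    import math

    def weighted_network_score(log_lik, log_prior, w, k, n):
        """Structure score with a weighted prior: w * log P(S) added to the
        usual BIC terms; w in [0, 1], k free parameters, n observations."""
        return w * log_prior + log_lik - 0.5 * k * math.log(n)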


4.4 Literature-based Bayesian network priors

Section 4.3 introduced informative structure priors, which provide the natural mechanism of BNs for incorporating prior knowledge into the network learning process. This involves including a log prior probability in the score for each network structure that is considered during the learning search. This prior probability can be calculated using an edge decomposition technique, which is particularly appropriate for the literature-based knowledge we use, since it is a gene-pair association score matrix. This section describes literature-based gene concept profiling in further detail and how it can be used to produce an association matrix (Section 4.4.1). Note that literature-based gene concept profiling is a text mining approach previously developed by the Biosemantics Association, collaborators in the research in this chapter. Following this, Section 4.4.2 presents a methodology for translating gene-pair association scores into network edge probabilities. This is a new technique developed by the author of this thesis in collaboration with the Biosemantics Association.

4.4.1 Literature-based gene concept profiling

Information in the literature about biomedical concepts such as genes can be summarised using a technique known as concept profiling (Jelier et al., 2007; Schuemie et al., 2007a). This technique uses a thesaurus containing biomedical concepts. Biomedical concepts may be single-word objects found in the biomedical literature, such as gene or organism names, or may be common multiple-word combinations. A thesaurus is required since one concept may have many synonyms (for example, a single gene often has multiple identifiers). For our experiments we use a combination of the UMLS Metathesaurus (McCray & Miller, 1998) and the Biosemantics Association's own gene thesaurus, which was created by combining information from several databases, including Gene and UniProt. The concept recognition software Peregrine (Schuemie et al., 2007b), which also disambiguates homonyms, can detect occurrences of thesaurus concepts in Medline articles published after 1980, resulting in a list of concepts per paper. From this, concept profiles are constructed. A concept profile itself contains concepts: it is a vector of concepts with weights, where the weight describes the strength of the association between the concept and the concept to which the profile belongs, i.e. the concept profile for concept i is w_i = (w_i1, w_i2, ..., w_iM) where M is the number of concepts in the thesaurus. The weights in a concept profile for concept i are derived from the set of documents associated with concept i, D_i, which is a subset of the total set of documents D. The weight w_ik is based on the uncertainty coefficient between the occurrence of concept i and the occurrence of the other concept k; it expresses the relative amount of information gained about whether concept i occurs in each document d ∈ D_i by knowing that concept k occurs in document d (Jelier et al., 2008). Since each gene is a concept in our thesaurus, we can construct concept profiles for all genes.

Since concept profiles are weight vectors, we can calculate a measure of association between two concept profiles; and since genes have concept profiles, we can calculate a measure of association between two genes. In this research Pearson's correlation coefficient is used to compute an association score for each pair of genes, which tells us how correlated, or similar, their concept profiles are. Therefore, two genes will have a high correlation if their profiles share the same set of concepts with high weights, and the same set of concepts with low weights. This then allows the generation of a correlation matrix between genes that is based on knowledge contained in the literature. Interestingly, we can even calculate the correlation between genes that have never been mentioned together in the literature, based on shared concepts in their respective concept profiles.

The resulting correlation matrix provides an indication of whether a set of genes are closely related. For example, if genes are mentioned in a similar literature context, then they will have a high correlation. However, we should note that this correlation score only provides information about relationships between genes in the broadest sense. For example, it does not indicate whether a certain regulatory relationship only holds under certain conditions. Resolving such ambiguities is beyond the capabilities of current natural language processing techniques. By using homonym disambiguation in the concept recognition process, the most reliable correlation scores currently possible are generated, but these scores should still be taken as indications of probabilities, not as proven biological facts.
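Because concept profiles are just weight vectors, the gene-gene association matrix reduces to row-wise Pearson correlations. A minimal sketch (the toy profiles and function name are invented for illustration):

    import numpy as np

    def gene_association_matrix(profiles):
        """Pairwise Pearson correlations between gene concept profiles.
        profiles is an (n_genes x M) array: row i is the weight vector
        w_i of gene i over the M thesaurus concepts."""
        return np.corrcoef(profiles)

    # Toy profiles for three hypothetical genes over five concepts
    profiles = np.array([[0.9, 0.1, 0.0, 0.4, 0.0],
                         [0.8, 0.2, 0.1, 0.5, 0.0],
                         [0.0, 0.7, 0.9, 0.0, 0.3]])
    print(gene_association_matrix(profiles).round(2))  # genes 0 and 1 similar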


4.4.2 Edge prior probabilities from literature-based knowledge

The correlation matrix produced by literature-based gene concept profiling can then be used as the basis for calculating the candidate network prior probabilities using the edge-wise decomposition technique. The correlation value provides a measure of whether two genes are related according to the literature. This section presents a method to translate this correlation value into a probability that an edge exists between that gene pair, based on the idea that a higher correlation translates to a higher probability. Note that with this type of literature-based prior knowledge there is only information available on whether an edge exists or not, and no information on the edge directionality (since the correlation matrix is symmetric). For this reason we calculate a probability that an edge exists between two nodes, without concern for the edge direction. This means that networks that differ only by the direction of edges (such as networks in the same equivalence class) are considered as the same network structure for the purposes of calculating the prior probability. In this way, the literature-based knowledge is used to provide a starting point on relationships between genes whilst the expression data is then used to infer the directionality of relationships. For example, if the literature-based prior provides evidence for a relationship between genes A and B and the expression data infers that gene A activates gene B, then a network containing an edge A → B would score higher during the learning process than a network with the edge B → A, on the basis of the information provided by the expression data.

To translate the correlation values to edge probabilities, we adopt an approach based on the (statistical) confidence level of the correlation values. This is because even correlation values that are small in absolute terms may represent a significant association, if they are relatively higher than the majority of values. The method generates a p-value that represents the confidence level for the correlation value of each gene pair. To do this, we first generate the distribution of correlation values for all possible gene pairs, and fit this to a normal distribution. In other words, the mean µ and standard deviation σ of a normal distribution are generated, based on the correlation values.



Figure 4.2: For the fitted normal distribution of correlation values for yeast genes (parameters µ̂ = 0.002 and σ̂ = 0.02), the p-value for each correlation value, based on the fitted normal distribution, is plotted

For the organisms considered in the empirical evaluation later, we find that a normal distribution fits well, with the mean very close to zero (e.g. for yeast µ = 0.002). Then, for each gene-pair correlation value of interest, its associated p-value can be calculated based on how far the correlation value deviates from the fitted normal distribution mean, so outlying correlations are more significant and are given a low p-value. The p-value is calculated using a two-tailed one-sample Z-test, which allows us to test whether a value differs significantly from the mean of a normal distribution. For example, for the distribution of yeast correlation values, the p-values computed for each correlation value are plotted in Figure 4.2. We can see that, in absolute terms, some fairly small correlation values have low p-values, since they differ significantly from the mean. Finally, we define the probability for an edge existing between the gene pair as P(edge exists) = 1 − p. This means that the probability the edge does not exist is P(edge does not exist) = p. Note that the method described above is equivalent to determining a two-tailed confidence interval around the mean containing the correlation value c. The probability P(edge exists) can be calculated as the probability of a value X falling within this interval. Since µ is close to zero, this approximates to P(|X| ≤ |c|).
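A compact sketch of this translation, assuming SciPy is available (the function name is illustrative, and the fitting here simply uses the sample mean and standard deviation of all pairwise values):

    import numpy as np
    from scipy import stats

    def edge_priors(corr):
        """Translate a symmetric gene-gene correlation matrix into edge
        prior probabilities: fit a normal distribution to all pairwise
        correlation values, compute a two-tailed Z-test p-value for each
        value, and set P(edge exists) = 1 - p, P(edge absent) = p."""
        values = corr[np.triu_indices_from(corr, k=1)]    # one per gene pair
        mu, sigma = values.mean(), values.std()           # fitted normal
        p = 2 * stats.norm.sf(np.abs(corr - mu) / sigma)  # two-tailed p-values
        return 1 - p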

4.5 Evaluating the use of literature-based priors

This section reports on the experiments performed to evaluate the use of the literature prior knowledge. To do this, a comparison is carried out between networks learnt from microarray expression data only and networks learnt from the same expression data with a literature-based prior. The influence of the prior is also varied using weighting. Initial experiments were carried out on gene sub-networks from two simpler organisms, yeast and the bacterium E. coli, followed by a network of genes in a higher eukaryote (human).

4.5.1 Evaluation procedure

To evaluate the performance of networks learnt with a literature prior compared to those learnt from expression data alone, we compare the regulatory relationships found in the networks: transcription factors (TFs) and the target genes regulated by these TFs, which are the child genes of TFs in the networks. Microarray expression datasets are used from three different organisms: yeast, E. coli and human. For each microarray dataset, the expression values were discretised into three states using an equal-frequency method (i.e. for each gene, one-third of values are categorised as 'low', one-third as 'normal' and one-third as 'high'). For each organism, a literature-based gene correlation matrix was constructed based on all abstracts contained in Medline. For each organism, hierarchical clustering was used to identify groups of related genes in the literature correlation matrix, and a subgroup of genes was formed by combining the clusters that contained TFs. Each subgroup contained 200-300 genes, which allowed the inference of a larger-scale network whilst maintaining the efficiency of learning. The rationale behind this selection procedure was to increase the likelihood that the selected genes are related in the literature, meaning that the prior knowledge can have a significant effect on the network learnt. If genes were selected for which there is little prior knowledge, it would be difficult to measure the effect of using prior knowledge.
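As an illustration, the equal-frequency discretisation described above might be implemented as follows (a sketch: per-gene tertile boundaries are one reasonable reading of the procedure):

    import numpy as np

    def discretise_three_state(expression):
        """Equal-frequency discretisation per gene: the lowest third of a
        gene's values become 'low', the middle third 'normal' and the top
        third 'high'. expression is a (genes x samples) array."""
        states = []
        for values in expression:
            lo, hi = np.quantile(values, [1 / 3, 2 / 3])
            states.append(np.where(values <= lo, 'low',
                                   np.where(values <= hi, 'normal', 'high')))
        return np.array(states)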

Selecting genes from clusters that include documented TFs also increases the occurrence of 'true' (documented) regulatory relationships within the subgroup of genes, since these will only occur with known TFs as parent genes. Due to the large number of genes in the selected subgroup, a restriction was imposed on possible network structures: parent nodes must be documented TFs. As well as increasing the efficiency of the network structure search during learning, this meant that each edge in the learnt network had the potential to be documented as a regulatory relationship (thus removing the possibility of unavoidable false positives that would occur when a parent node is not a documented TF). For each dataset, seven bootstrapped networks were learnt: the expression data network (with prior weight 0) and six posterior networks, learnt with the prior weight set at 0.2, 0.4, 0.5, 0.6, 0.8 and 1 respectively. As described in Section 3.3.4, generating a bootstrapped network is more robust than learning a single network structure. The following terminology is used throughout the remainder of the chapter: a network learnt from microarray expression data with a literature-based prior (i.e. prior weight > 0) is referred to as a posterior network; a network learnt from expression data only (i.e. a prior weight of 0) is referred to as the expression data network.

Previous research on learning GRNs has often evaluated learnt networks by comparing them to documented gene interactions that form a 'true network' (usually compiled from online databases of confirmed regulatory interactions) in terms of true and false positives. This type of evaluation is described in Section 3.4.1. However, note that this type of comparison should be treated with some caution for these experiments, since information distilled into a database of regulatory interactions essentially comes from the literature, which is also where our prior information comes from. Nevertheless, a comparison to documented interactions can still assist us in measuring the effect of using a prior to learn the network. Therefore, we do use a comparison of the AUC measures (which represent the degree of overlap between the learnt and 'true' networks) between the expression data and posterior networks.


Another method of network analysis is the prediction of gene expression values on an independent dataset (described in Section 3.4.2). If a gene node in a posterior network achieves higher prediction accuracy on an independent dataset than the same gene node in an expression data network, we can say that the prior does add value to the learnt network. Due to the restriction on the network structure during learning, where only documented TFs can be parent nodes, we compare the prediction accuracies of only the TF nodes between the expression data and posterior networks. In order to measure the statistical significance of the differences between TF prediction accuracies across networks with different prior weights, the bootstrap learning process is run several times for each dataset. Then, the Cochran-Mantel-Haenszel (CMH) test (McDonald, 2008) is applied, which can be used for repeated tests of independence. The objective of the test is to establish whether two variables are independent, conditional on a third variable that identifies the repeat tests. The null hypothesis is that the two variables are independent of each other within each repetition: having one value of one variable does not make a particular value of the second variable more likely, or in other words there is no significant difference between the two variables. In this case, we wish to establish whether the prediction performances of two different networks are independent, where each TF identifies the repeat tests. Therefore, in order to obtain a significant result, the null hypothesis should be rejected in favour of the alternative hypothesis, that there is a significant difference between the prediction performances of the TFs in each network.
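One way to run such a CMH test in Python is via the StratifiedTable class in statsmodels, with one 2 x 2 table per TF; the counts below are invented purely for illustration, and the thesis (which cites McDonald, 2008) may have used different software:

    import numpy as np
    from statsmodels.stats.contingency_tables import StratifiedTable

    # One 2x2 table per TF (the repeat tests): rows are the two networks
    # (expression-only, posterior), columns are prediction outcomes
    # (correct, incorrect)
    tables = [np.array([[40, 60], [52, 48]]),
              np.array([[35, 65], [47, 53]]),
              np.array([[44, 56], [50, 50]])]

    result = StratifiedTable(tables).test_null_odds()
    print(result.statistic, result.pvalue)  # small p: accuracies differ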

4.5.2 Application: yeast

For yeast, we based network learning on cell-cycle expression data (Spellman et al., 1998) for a group of 204 genes, selected from all genes in the dataset by the procedure described in Section 4.5.1. Twenty-two of the 204 genes are identified as TFs in the Yeastract database (Teixeira et al., 2006), which lists documented regulatory interactions in yeast.


4.5.2.1 Comparison to documented interactions

Figure 4.3 shows the AUC for the networks learnt with each prior weight. The expression data network (a prior weight of 0) obtains an AUC of 0.56. As the prior weight increases, so does the AUC, up to the prior weight of 0.8, where the AUC peaks at 0.65. Figure 4.5 compares the documented links (true positives) in the expression data only network with those in the network learnt with prior weight 0.8, which has the highest AUC. Although the networks are too large for their detail to be legible, it is apparent by eye that there are many more documented edges in the network learnt with prior weight 0.8. Taking into account only edges with confidence greater than 0.2, the TP rate for the expression data only network is 0.076, whilst for the network learnt with prior weight 0.8 the TP rate is 0.26. FP rates are similar in both networks (around 0.005). At a prior weight of 1 the AUC dips slightly to just under 0.65. (Note that a prior weight of 1 does not indicate that the whole network is based on literature prior knowledge; rather, it indicates that the prior knowledge is fully weighted in the score during learning.) This indicates that including the prior does add knowledge to the learnt network. However, a balance between the literature and expression data is required: the full prior weight of 1 does not obtain the optimal network.

4.5.2.2 Expression value prediction

Expression values were predicted for all TFs in each network (different prior weights) on an unseen cell-cycle expression dataset (Pramila et al., 2006). In general, TFs in the posterior networks obtain higher predictive accuracies than the same TFs in the expression data network (see Table 4.1). Using the CMH test across all TFs, the posterior networks attain significantly higher accuracies with p ≤ 0.002. TFs in the posterior network learnt with a prior weight of 0.6 exhibit the most significant difference to the expression data network, with p = 0.00001. The prediction accuracies for each TF in the networks learnt with prior weights of 0 and 0.6 respectively are shown in Figure 4.4. The expression value prediction and the comparison to documented interactions (Figure 4.3) show the same networks performing well: the network learnt with prior weight 0.6 also gains a high AUC value, although it is not the absolute maximum.



Figure 4.3: Comparison of Area Under the ROC Curve (AUC) values for each yeast network generated with a different prior weight, from 0 (no prior) to 1 (fully weighted prior).



Figure 4.4: Comparison of expression value prediction between the yeast networks learnt with prior weights 0 and 0.6, for the TF genes.

Posterior network    Average TF prediction accuracy        p-value from
(prior weight)       Expression-only       Posterior       CMH test
                     network               network
0.2                  0.409                 0.441           0.00280
0.4                  0.409                 0.463           0.00001
0.5                  0.409                 0.442           0.00210
0.6                  0.409                 0.459           0.00001
0.8                  0.409                 0.446           0.00004
1.0                  0.409                 0.456           0.00001

Table 4.1: Yeast expression value prediction results. This table compares the expression data only (prior weight 0) network with each posterior network (prior weights 0.2 to 1), through the average TF prediction accuracies and the significance of the CMH test.

[Network diagrams: (a) Prior weight 0; (b) Prior weight 0.8. Node labels omitted.]

Figure 4.5: Yeast networks, learnt with prior weights 0 (expression data only) and 0.8 respectively. Edges are shown if they have confidence greater than 0.2. Solid edges represent regulatory relationships confirmed in the Yeastract database. Gray edges indicate edges that are not documented.

4.5.3 Application: E. coli

For the second set of experiments we considered network learning on E. coli expression data (Sangurdekar et al., 2006). This dataset records transcriptional responses to more than 30 chemical and physiological perturbations. The selected group of 262 genes (again, selected from all genes in the dataset by the procedure described in Section 4.5.1) contains 17 TFs, according to the RegulonDB online database of E. coli regulatory interactions (Salgado et al., 2006).

4.5.3.1 Comparison to documented interactions

Figure 4.6 shows the AUC for the networks learnt with each prior weight. The pattern exhibited is very similar to the yeast results. The expression data network (a prior weight of 0) obtains an AUC of 0.67. As the prior weight increases, so does the AUC, up to the prior weight of 0.8, where the AUC peaks at 0.81. At a prior weight of 1 the AUC dips slightly to just under 0.8. This shows that the use of the prior adds many edges that represent confirmed interactions but are not represented in the expression data. This may be because the expression data focuses on a particular type of experiment: chemical and physiological perturbations. The prior knowledge can help in adding regulatory relationships that are not exhibited in the microarray experiments. Figure 4.8 compares the confirmed interactions in the expression data only network with those in the network learnt with prior weight 0.8, which has the highest AUC. As with the yeast networks, there are more documented edges in the network learnt with prior weight 0.8. Taking into account only edges with confidence greater than 0.2, the TP rate for the expression data only network is 0.35, whilst for the network learnt with prior weight 0.8 the TP rate is 0.55. FP rates are similar in both networks (around 0.01). As in the yeast results, the optimum prior weight is not 1 but 0.8, so a balance is required between the prior knowledge and expression data. It is worth noting that E. coli is a particularly well-studied organism, and this may contribute towards such improvements in network performance in terms of AUC: there is more literature available, and it is more reliable and accurate, than for a less-studied organism.



Figure 4.6: Comparison of AUC values for each E. coli network generated with a different prior weight from 0 (no prior) to 1 (fully weighted prior).

4.5.3.2 Expression value prediction

Expression values were predicted for all TFs in each network (different prior weights) on an unseen dataset (Faith et al., 2007), which contains a wide range of different experiments with over 250 samples. However, this dataset does not contain data for all genes in the subset, so for each TF predictions are made using only the bootstrap samples where there are no target genes with missing data. For six TFs there are no bootstrap samples without missing target gene data, so they are not included. Table 4.2 details the average TF prediction accuracies for each posterior network in comparison to the expression data only network, and the corresponding p-value calculated using the CMH test. There is an increase in TF predictive accuracies in the posterior networks, and this increase is significant (p ≤ 0.05) for four of the posterior networks. In particular, the posterior network generated with prior weight 1 (full prior weighting) shows the most significant difference (p = 0.00001). Figure 4.7 plots the prediction accuracies for each TF in the networks learnt with prior weights of 0 and 1.



Figure 4.7: Comparison of expression value prediction between the E. coli networks learnt with prior weights 0 and 1, for the TF genes.

Posterior network    Average TF prediction accuracy        p-value from
(prior weight)       Expression-only       Posterior       CMH test
                     network               network
0.2                  0.332                 0.349           0.03200
0.4                  0.332                 0.344           0.42500
0.5                  0.332                 0.355           0.31900
0.6                  0.332                 0.356           0.00600
0.8                  0.332                 0.358           0.03400
1.0                  0.332                 0.364           0.00001

Table 4.2: E. coli expression value prediction results. This table compares the expression data only (prior weight 0) network with each posterior network (prior weights 0.2 to 1), through the average TF prediction accuracies and the significance of the CMH test.

[Network diagrams: (a) Prior weight 0; (b) Prior weight 0.8. Node labels omitted.]

Figure 4.8: E. coli networks, learnt with prior weights 0 (expression data only) and 0.8 respectively. Edges are shown if they have confidence greater than 0.2. Solid edges represent regulatory relationships confirmed in the RegulonDB database. Gray edges indicate edges that are not documented.

4.5.4 Application: human

To demonstrate that the literature-based prior knowledge approach can be useful for the modelling of GRNs in higher eukaryotes, a third dataset is used, which concerns muscle differentiation studied in cultures of primary human muscle precursor cells. A time-series (7 time points between 0 and 14 days of differentiation) of expression profiles of in vitro muscle differentiation has been generated for six human individuals: three healthy and three patients with Duchenne muscular dystrophy (DMD), whose cells are known to display certain differentiation defects. More details can be found in Sterrenburg et al. (2006). The selected group of 258 genes consists of literature-based clusters, where each cluster contains at least one documented TF. For this biological system, there is no comprehensive database of regulatory relationships available, and no other suitable microarray expression datasets for evaluating prediction accuracy. Instead, experts in muscular dystrophy from the Biosemantics Association evaluated the biological interpretation of the learnt networks. In particular, this evaluation involved examining the differences between the posterior networks generated using different prior weights.

As might be expected, the effect of increasing the prior weight is the inclusion of more edges supported by the literature. Figure 4.10 shows the networks for prior weights 0 and 0.4 respectively. In these networks, blue edges have a high prior probability (> 0.7) in the literature correlation matrix (indicating a high level of support in the literature prior knowledge). Red edges have a confidence greater than 0.2 in the expression data only network (indicating a high level of support in the microarray expression data). Black edges have support in both the literature correlation matrix and the expression data only network. We can see that there are few black edges in the expression data only network (prior weight 0), indicating few edges with strong support in the literature. However, in the network learnt with prior weight 0.4, there is more of a mix of red and blue edges, indicating a mix of knowledge sources.

A further hypothesis is that spurious links, i.e. edges between genes that exhibit a pattern of co-regulation (e.g. correlated expression patterns) but have no regulatory relationship, will disappear from the network as the prior knowledge gains more weight in the learning process.

And this does indeed seem to be observed. In particular, in the networks with a prior weight of 0.4 or greater, we can identify subnetworks that are related to certain biological functions such as cell cycle control. The cell cycle network is relevant to the biological system under study since cell cycle arrest is one of the first steps in myoblast differentiation. In particular, compare the nodes of CCNH (for CCNH, also see Figure 4.9 for subnetwork detail at the prior weights 0 and 0.4) and NCOR2 across networks, which gain literature-supported edges in the networks with prior weight 0.4 and above. In the network with prior weight 0.4, CCNH, which belongs to the family of cyclins that are involved in cell cycle control, forms a central node with several daughter genes (CDKN1B, CDK4, CDK7, CDK9, CCND3, FRAP1, MDM2). Fewer of these edges are present in the expression data only network (prior weight 0). There is also literature support for the regulatory relationship between CCNH and MDM2 and CDKN1B (Datta, 2002; Mandalb et al., 1998).

NCOR2, a nuclear co-receptor that inhibits muscle differentiation, is another central node in the network with prior weight 0.4. Consistent with the literature (Bailey et al., 1999), NCOR2 demonstrates reduced expression during differentiation, in particular in DMD myotubes. In the network with prior weight 0.4, there are links visible between NCOR2, SKIP and the histone deacetylases (HDAC3, PCAF) that are known to work together in the acetylation of the important muscle transcription factors MYOD1 and MEF2 (Gregoire et al., 2007). As before, this presumably central role of NCOR2 and the histone deacetylation pathway is only evident upon incorporation of the literature prior.

Other possible gene relationships for which there is literature support are not present in the learnt networks. However, this is expected, since the networks are learnt from expression data from a highly specific biological system in which not all possible relationships will be present, and also since not all literature-derived relationships are of a regulatory nature.

The manual inspection of networks by experts led to the opinion that a literature prior weight of between 0.4 and 0.6 produced networks with the most relevant regulatory links. Higher prior weights appear to lead to the inclusion of too many edges due to literature associations that are not of a regulatory nature. Note this is lower than the optimum prior weights found for the yeast and E. coli networks, which were both between 0.6 and 0.8. This may be because there is less literature related to genes in the human organism, whereas yeast and E. coli are both well-studied organisms. Where there is less literature, it may be less reliable, so more weight needs to be assigned to the microarray expression data.


[Subnetwork diagrams of links to and from CCNH: (a) Prior weight 0; (b) Prior weight 0.4. Node labels omitted.]

Figure 4.9: Each network shows the links to and from the gene CCNH, a central node of the cell-cycle network. Nodes that are outlined in black (not gray) are confirmed targets of CCNH. As the prior weight increases, we can see that the number of confirmed targets also increases.

[Network diagrams: (a) Prior weight 0; (b) Prior weight 0.4. Node labels omitted.]

Figure 4.10: Human networks for prior weights 0 (expression data only) and 0.4. Each edge shown has a confidence greater than or equal to 0.2 (edges with lower confidence are not included). Blue edges have a high prior probability (> 0.7) in the literature correlation matrix. Red edges have a confidence greater than 0.2 in the expression data only network. Black edges have support in both the literature correlation matrix and the expression data only network.

4.6 Discussion

This chapter has focused on the use of prior knowledge to improve the performance of BNs learnt from microarray gene expression data. Whilst the microarray provides the most readily available genome-wide data source on gene expression, there are concerns over the reliability and comparability of different datasets. However, the use of prior knowledge sources, such as transcription factor binding site location data or literature-based information, can help to alleviate these concerns and potentially improve the modelling process.

This chapter has presented some of the first research on the incorporation of prior knowledge from a large body of relevant literature for BN learning of gene networks. The use of literature-based gene concept profiling means information contained in a huge number of documents (for example from a database of papers such as Medline) can be represented using a gene correlation matrix. We make use of an informative prior probability distribution over BN structures, the natural mechanism for incorporating prior knowledge into BN learning, together with a method for computing the probability of a network structure using edge-wise decomposition. Building on these existing techniques, this chapter has contributed a method for translating literature-based gene-pair correlations to network edge prior probabilities and investigated the effects of weighting the influence of the prior knowledge during learning.

Comparable related work on integrating prior knowledge into BN learning for GRNs has focused less on the biological content of the prior knowledge. Where literature-based knowledge sources were used, these were based on databases such as KEGG. The use of advanced text mining techniques provides a powerful advantage over this previous research, as it allows up-to-date information from a huge amount of literature to be used. In addition, in previous research where real datasets have been used for evaluation, this has typically been on a small scale, using fewer than 40 genes in total. In this work, we have performed evaluation on networks of over 200 genes, for three different organisms.

In the experiments presented in this chapter, posterior networks were learnt using different weights on the prior, ranging from 0.2 to 1, and a network from expression data only, which has an equivalent prior weight of 0. Two methods of network performance evaluation were used to compare the different networks.

In the first, true and false positive rates are used to compare the learnt networks to a 'true' network of documented regulatory interactions. This type of comparison should be treated with some caution, since documented regulatory interactions are themselves distilled from the literature, which is also where our prior information comes from. However, it is still helpful in measuring the effect of using a prior to learn the network. The prediction of expression values for TF gene nodes is used as an alternative evaluation method (on yeast and E. coli). If the posterior networks exhibit increased prediction accuracy on an independent dataset, then we can say that the prior does add value to the learnt network.

When compared to documented relationships, the posterior networks were closer to the 'true' network than the expression data network. In yeast and E. coli, the posterior network with a prior weight of 0.8 gave AUCs of 0.65 and 0.81, improvements over the expression data network AUCs of 0.56 and 0.67 respectively. This shows that the prior can add many edges that represent confirmed interactions but are not exhibited in the expression data. Microarray experiments are often focused on a particular subsystem of genes, so prior knowledge can assist in 'filling the gaps' for genes that are not the particular experimental focus of the expression dataset.

In general, the accuracy of expression value prediction for TF genes in the posterior networks was greater than for the same genes in the expression data network. As with the comparison to a true network, there is an optimal prior weight in terms of prediction accuracy, with the network learnt with prior weight 0.6 obtaining the largest improvement over the expression data network in yeast. This indicates that the posterior network structure is more robust for making predictions outside of the original dataset.

A lack of documented regulatory interactions for human meant that evaluation of the learnt networks was carried out with the help of experts in muscular dystrophy (the focus of this particular microarray dataset). The results provided evidence that incorporation of literature information does result in the removal of edges that represent spurious correlations, and also generates networks containing modules of functionally related genes.


However, careful weighting of the literature information appears to be needed. For yeast and E. coli an optimal AUC was achieved with a prior weight of 0.8. For the human network, a manual inspection by experts indicated that the optimal prior weight was around 0.4 to 0.6. It appears to be a careful balance: whilst the inclusion of the prior information helps to reduce spurious regulatory relationships, higher prior weights can lead to the inclusion of too many edges due to literature associations that are not of a regulatory nature (e.g. proteins in the same multi-protein complex). The lower optimal prior weight for the human network also seems to indicate that the amount of literature has a bearing on the most appropriate prior weight: yeast and E. coli are well-studied organisms with a lot of related literature. Where there is less literature, for example with the human organism, more weight needs to be assigned to the microarray expression data.

Chapter 5

Combining multiple microarray gene expression datasets

5.1 Introduction

With the wider use of microarray expression profiling technology and the subsequent rapid increase of publicly available microarray data comes the opportunity to produce GRN models based on multiple datasets. Drawing together a richer and/or broader collection of data has the potential to produce GRN models that are more robust, have greater confidence and place less reliance on a single dataset. In this chapter we compare two frameworks for combining microarray datasets to model GRNs: pre- and post-learning aggregation. When learning from multiple datasets, there is a choice of when to aggregate the knowledge within the datasets. In pre-learning aggregation, data is combined prior to learning, and a model is learnt from the combined dataset. In post-learning aggregation, individual models are learnt from each dataset, and these are combined after learning. The resulting combined model represents prominent features which occur in all, or a subset of, the individual dataset models.

Combining expression datasets directly, as in pre-learning aggregation, can be difficult as experiments are often conducted on different microarray platforms, and in different laboratories, leading to inherent biases in the data that are not always removed through pre-processing such as normalisation.

Additionally, comparison between different datasets can be difficult since measurement units may vary across platforms (see Section 2.2.4 for more details on bias and normalisation of microarray data). The key advantage of a post-learning aggregation framework is that it can combine microarray datasets generated by different platforms, research groups and laboratories without necessarily requiring any normalisation or transformation of the data.

This chapter contributes two novel approaches for post-learning aggregation, each based on aggregating high-level features of Bayesian network models that have been generated from different microarray expression datasets. Bayesian network meta-analysis is based on combining confidence levels attached to network edges, whilst Consensus Bayesian networks identify consistent network features across all datasets. Both approaches are applied to multiple datasets from synthetic and real (E. coli and yeast) networks, and it is demonstrated that both methods can improve on a network generated from a single microarray dataset, or on networks learnt using a pre-learning aggregation approach, where a combined dataset is formed by concatenating a collection of datasets and then applying standard scale normalisation.

The remainder of the chapter is organised as follows. In Sections 5.2 and 5.3 we describe pre- and post-learning aggregation methods in more detail. Section 5.4 details our experimental results on synthetic, E. coli and yeast sub-networks. Finally, Section 5.5 summarises and discusses the findings.

5.2 Pre-learning aggregation

Pre-learning aggregation is the simplest approach for using multiple datasets to reverse-engineer a BN model. When applying the pre-learning aggregation approach, datasets are concatenated, if necessary after pre-processing steps such as normalisation, and a BN model is learnt from the combined dataset. For the purposes of the comparison between pre- and post-learning aggregation in this chapter, scale normalisation (as described in Section 2.2.2.3) is used on each microarray dataset to be combined. Scale normalisation has the benefit of making the arrays within a dataset comparable, and it also means that arrays between datasets are comparable. Thus, we can use scale normalisation to combine multiple microarray datasets into one, allowing the generation of a single BN from multiple studies.
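As an illustration, the pre-learning aggregation pipeline might be sketched as follows; the median/MAD scaling below is one simple variant assumed here, and the exact scale normalisation used is the one defined in Section 2.2.2.3:

    import numpy as np

    def scale_normalise(data):
        """Centre each array (column) on its median and scale it to unit
        median absolute deviation, so arrays become comparable within and
        across datasets. data is a (genes x arrays) matrix."""
        med = np.median(data, axis=0)
        mad = np.median(np.abs(data - med), axis=0)
        return (data - med) / mad

    def pre_learning_aggregate(datasets):
        """Concatenate scale-normalised datasets along the array axis,
        giving one combined matrix from which a single BN can be learnt."""
        return np.hstack([scale_normalise(d) for d in datasets])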

5.3 Post-learning aggregation

In post-learning aggregation, individual models are learnt from each dataset, and these are combined after learning. As discussed in Chapter 3, in this research BNs are used to model GRNs from each microarray dataset. This section presents two novel approaches for post-learning aggregation with BNs. The first method is a Consensus approach that identifies the intersections (that is, the common edges) amongst the network structures generated from different datasets. Only consistent features and dependencies appear in the final Consensus network, reducing the occurrence of spurious relationships. The second technique is based on meta-analysis, an established field of research for combining the statistical outcomes of medical studies. We use an inverse-variance weighting meta-analysis method to combine the confidence levels that are attached to each network edge. The two methods are described next, followed by a review of related work on combining BN structures.

5.3.1 Consensus Bayesian networks

The Consensus approach (see Algorithm 1) is based on the identification of consistencies across a set of networks: edges that appear in all, or a certain proportion, of the networks in the set are included in the Consensus network structure. For each input dataset, a bootstrapping approach is used to learn individual PDAG structures with confidence estimates attached to each edge. We use thresholding (as described in Section 3.3.4) to obtain a final PDAG from each bootstrapped network. Whilst bootstrapped BNs and thresholding have been used previously to learn more robust GRN models (Friedman et al., 2000; Pe'er et al., 2001), we use the thresholded bootstrapped networks as inputs to the Consensus algorithm in order to find consistencies across networks that have been generated from multiple datasets.


Algorithm 1: Consensus Bayesian networks

Input:  Set of n individual networks (PDAG representation),
        consensus threshold (between 0 and 1)
Output: Consensus network

for each pair of nodes i, j do
    /* Build Consensus network structure */
    if an edge e_ij exists between nodes i and j in a proportion of the
    individual networks >= the consensus threshold then
        include edge e_ij in the Consensus network
    end
    /* Assignment of edge direction */
    if edge e_ij exists in the Consensus network then
        if there is no conflict in the input edge directions then
            Consensus edge e_ij keeps the same direction (whether directed
            or undirected)
        else if there is a majority direction (or undirection) then
            assign edge e_ij the majority direction
        else
            assign edge e_ij an 'unknown' direction
        end
    end
end


The Consensus algorithm is based on a simple idea: if an edge appears in multiple input networks, it is more likely to represent a true interaction. The algorithm proceeds as follows. Each pair of nodes is considered in turn and an edge between them is created in the Consensus network if such an edge exists in a proportion of the input PDAGs that exceeds the consensus threshold. Assigning the edge direction is a little more complex. If there is no conflict regarding that edge’s direction in the input networks then its direction/undirection remains the same in the Consensus network. However, if there is conflict, this introduces some uncertainty regarding the edge direction. If there is a majority in the input networks regarding edge direction, then the edge is assigned the majority direction in the Consensus network. Thus, directed edges with enough support will appear in the Consensus network. If there is no majority then the edge is left as ‘unknown direction’. Note that we make a distinction in the Consensus network between edges that are undirected and those that are ‘unknown’. An edge that is undirected can be reversed, as in equivalent graphs. However uncertainty exists over the direction of an ‘unknown’ edge, or whether it can be reversed. We flag up ‘unknown’ edges in the graphs by using edges that are directed both ways, whereas undirected edges have no arrowheads.
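A minimal Python sketch of Algorithm 1 follows; the PDAG encoding (unordered node pairs mapped to a direction tuple or the string 'undirected') is an assumption made for illustration:

    from collections import Counter

    def consensus_network(pdags, threshold):
        """Sketch of Consensus Bayesian networks. Each input PDAG maps a
        frozenset({i, j}) to (i, j) or (j, i) for a directed edge, or to
        'undirected'. An edge enters the consensus if it appears in at
        least `threshold` proportion of the inputs."""
        n = len(pdags)
        consensus = {}
        for pair in set().union(*(set(g) for g in pdags)):
            votes = [g[pair] for g in pdags if pair in g]
            if len(votes) / n >= threshold:
                counts = Counter(votes)
                top, top_count = counts.most_common(1)[0]
                if len(counts) == 1:
                    consensus[pair] = top        # no conflict in direction
                elif list(counts.values()).count(top_count) == 1:
                    consensus[pair] = top        # clear majority direction
                else:
                    consensus[pair] = 'unknown'  # tie: direction unknown
        return consensus

    g1 = {frozenset({'A', 'B'}): ('A', 'B')}
    g2 = {frozenset({'A', 'B'}): ('A', 'B'),
          frozenset({'B', 'C'}): 'undirected'}
    print(consensus_network([g1, g2], threshold=1.0))  # only A -> B survives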

5.3.2 Bayesian network meta-analysis

Meta-analysis refers to a set of statistical methods for combining the results of several studies that address a set of related research hypotheses. Meta-analysis originated in medical statistics (Sutton et al., 2000) but has recently been used to identify highly expressed genes across multiple microarray datasets (Conlon et al., 2006; Hu et al., 2006). In medical statistics, meta-analysis is used to combine outcome measures such as incidence rates (e.g. the rate at which new cases of a disease occur in a population) from multiple medical studies. We have developed an approach called Bayesian network meta-analysis¹ (see Algorithm 2) that uses the fixed-effects meta-analysis method to combine the confidence levels for each edge over a set of bootstrapped networks, producing a single network that has an aggregated confidence level attached to each edge.

¹ Bayesian network meta-analysis should not be confused with Bayesian meta-analysis, which involves using Bayesian models to perform the meta-analysis.

The fixed-effects model assumes no heterogeneity between study results. Whilst this is obviously a naïve assumption, we find that it produces better results for this application than the more complicated random-effects model that accounts for study heterogeneity.

Algorithm 2: Bayesian networks meta-analysis

Input:  Set of n individual networks with confidences attached to each
        edge (e.g. bootstrapped networks)
Output: Meta-analysis network with aggregated confidence levels attached
        to each edge

for each edge from node i to node j (denoted e_ij) do
    let T_ij(k) be the confidence level for edge e_ij in the kth network
    calculate the aggregated confidence level T̄ for edge e_ij using

    \[
    \log(\bar{T}) = \frac{\sum_{k=1}^{n} w_k \log(T_{ij}(k))}{\sum_{k=1}^{n} w_k}
    \]

    where w_k = d_ij(k), the number of networks learnt during bootstrapping
    from dataset k in which an edge i → j exists
end

The general fixed-effects model for meta-analysis is the inverse variance-weighted method (DerSimonian & Laird, 1986). Each study outcome measure is given a weight that is inversely proportional to its variance. For n independent studies, let T_i be the observed outcome measure with variance v_i and weight w_i. Then, an estimate of an aggregate outcome measure, given all studies, is calculated as follows:

\[
\bar{T} = \frac{\sum_{i=1}^{n} w_i T_i}{\sum_{i=1}^{n} w_i} \qquad \text{where} \qquad w_i = \frac{1}{v_i}
\]

In BN meta-analysis, we define the study outcome measure as the confidence level estimates that are attached to each network edge. Thus, the fixed-effects meta-analysis model is applied to every network edge to obtain its combined confidence level estimate.

We treat the confidence level as an incidence rate (i.e. the proportion of networks in which a particular network edge exists). For each input dataset, if the bootstrap approach is run m times, resulting in m networks, then the confidence level, or incidence rate, for a particular edge e_ij that runs from node i to node j is d_ij/m, where d_ij is the number of networks in which e_ij exists. Then, we define the outcome measure as the log incidence rate and its approximate variance (Sutton et al., 2000) as:

\[
\log(T_{ij}) = \log(d_{ij}/m), \qquad \operatorname{var}(\log(T_{ij})) = \frac{1}{d_{ij}}
\]

This means that the meta-analysis weight is defined as:

\[
w_{ij} = \frac{1}{v_{ij}} = d_{ij}
\]

This type of meta-analysis is essentially a weighted averaging technique where edges are weighted using their own confidence level. Thus, edges with high confidences are strongly weighted and more likely to have a high confidence level in the final Meta-analysis network. Similarly to Consensus Bayesian networks, bootstrapping is used to generate the input individual networks that have confidences attached to each edge. However, in contrast to the Consensus method, Bayesian network meta-analysis does not require thresholding of the input networks to obtain PDAGs, since it directly combines the confidences attached to each edge. However, the output meta-analysis network can be thresholded (using the same method described in Section 3.3.4 for bootstrapped networks) to obtain a PDAG, and this is what we do in order to evaluate our meta-analysis networks.
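A minimal sketch of this aggregation for a single edge, assuming each dataset contributed m = 50 bootstrap networks as in the experiments later in the chapter (the function name is illustrative):

    import numpy as np

    def aggregate_edge_confidence(confidences, m=50):
        """Fixed-effects aggregation of one edge's bootstrap confidence
        levels across datasets (Algorithm 2). confidences[k] = d_k / m is
        the proportion of the m bootstrap networks from dataset k that
        contain the edge; the weight is w_k = d_k."""
        T = np.asarray(confidences, dtype=float)
        d = T * m                      # edge counts d_ij(k), also the weights
        if d.sum() == 0:
            return 0.0                 # edge absent from every network
        mask = d > 0                   # zero-count datasets carry zero weight
        log_T_bar = np.sum(d[mask] * np.log(T[mask])) / d.sum()
        return float(np.exp(log_T_bar))

    # An edge seen in 80%, 60% and 10% of each dataset's bootstrap networks
    print(aggregate_edge_confidence([0.8, 0.6, 0.1]))  # about 0.62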

5.3.3 Related work

Previous research on combining BNs generally falls into two broad categories: qualitative and quantitative combination. Quantitative combination is based on aggregating the probability distributions in the networks (Pennock & Wellman, 1999). Qualitative combination has been referred to as topological fusion (Matzkevich & Abramson, 1992), which is based on combining the graphical structures of multiple networks using graph union (i.e. combining all edges in all networks).

Since graph union can introduce cycles into the network structure, arc reversal is also used. This is where an arc A → B is reversed and arcs are then added from the parent nodes of A to B, and from the parent nodes of B to A. This maintains the underlying relationships between variables, under the principle that it preserves the flow of information (Shachter, 1986). The final fused graph contains all arcs (some reversed) and nodes that are in the input DAGs. In comparison to topological fusion, Consensus Bayesian networks focus on graph intersection rather than union, and also account for network equivalence classes, which are not considered in topological fusion. At the consensus threshold 1/n (which corresponds to every edge from each of the n networks appearing in the combined structure), the Consensus approach is equivalent to graph union. However, the topological fusion network does not do as well as a 1/n Consensus network, as it is liable to the inclusion of misdirected edges. The key difference is that the Consensus method represents networks using equivalence classes, so if edges are reversible they are left undirected.

Specifically in microarray data analysis for learning GRNs, there is no research on combining BNs; however, Wang et al. (2006) use a post-learning aggregation framework, in which gene networks are represented using non-linear differential equations, to combine multiple microarray gene expression datasets.

5.4 Comparison of pre- and post-learning aggregation methods

In this section we report on the experiments performed to evaluate the use of the post-learning aggregation Consensus and Meta-analysis approaches on multiple microarray datasets, and compare them to the use of pre-learning aggregation (combining the datasets and using scale normalisation prior to the generation of a network from the concatenated dataset) and to the performance of the individual input networks, each generated from a single microarray dataset. Initial experiments were carried out on a set of four datasets for a synthetic network of 13 genes. We then progressed to two real applications: E. coli and yeast sub-networks.

5.4.1 Comparison procedure

For each application, every input dataset was scale-normalised and a network with confidences attached to each edge was learnt (using a bootstrapping approach with m = 50 iterations). The following aggregate networks were constructed:

• A single Meta-analysis network was constructed, where each edge has an attached confidence level.

• Multiple sets of Consensus networks were generated, where each set corresponds to a different bootstrap confidence threshold (0.1 to 0.9, at steps of 0.1) for the input networks generated from each dataset. This means that the input bootstrapped networks were all thresholded at the same value to form PDAGs, and these formed the input to the Consensus method. Each set of Consensus networks contains networks generated for each consensus threshold from 0 to 1, at steps of 1/n (where n is the number of datasets).

• A bootstrapped network was generated from a concatenated and scale-normalised dataset (referred to as the Normalisation Only network).

As described in Section 3.4.1, the learnt networks are evaluated by comparing them to documented gene interactions. These were obtained from various sources according to the application. Whilst the synthetic network was fully known, E. coli regulatory interactions are documented in the online database RegulonDB (Salgado et al., 2006) and yeast interactions (both confirmed and potential) are listed in the YEASTRACT database (Teixeira et al., 2006). For each network type a ROC curve is plotted, where each point corresponds to a confidence level or a consensus threshold. An AUC value for the network is calculated based on this ROC curve. For the Meta-analysis and Normalisation Only networks, each point of the ROC curve refers to the TP and FP rates of the PDAG extracted from the network at a different bootstrap confidence threshold (from 0 to 1, at steps of 0.1). For Consensus networks, each point of the ROC curve refers to the TP and FP rates of the Consensus network at a different consensus threshold (from 0 to 1, at steps of 1/n).


Dataset   Number of Observations
1         200
2         400
3         600
4         600

Table 5.1: Summary of synthetic datasets

This means there are multiple ROC curves for the Consensus approach, each one constructed for a set of input networks obtained from a different bootstrap threshold. Since the Meta-analysis approach directly combines bootstrap confidences, with no initial thresholding step as in the Consensus approach, it has only one ROC curve. In order to obtain statistical estimates of the significance of the results, we repeated this process 15 times for each dataset. Thus, mean TP and FP rates (in order to estimate a mean ROC curve) and AUC measurements were obtained for each method. A paired t-test was then used to compare the relative performances of the different approaches and to measure whether the differences between their mean AUCs are statistically significant.
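As a rough illustration of this evaluation procedure, the Python sketch below thresholds an edge-confidence matrix at a range of values, computes TP and FP rates against a documented adjacency matrix, and integrates the resulting ROC curve with the trapezoidal rule. It simplifies the representation to a directed adjacency matrix rather than PDAGs, and all names and the toy data are illustrative, not the thesis implementation.

import numpy as np

def roc_auc_from_confidences(conf, truth, thresholds=np.arange(0.0, 1.01, 0.1)):
    """Threshold the edge-confidence matrix `conf` at each value in
    `thresholds`, compare the resulting edges to the documented network
    `truth`, and return the area under the ROC curve."""
    mask = ~np.eye(truth.shape[0], dtype=bool)        # ignore self-loops
    pos = (truth & mask).sum()                        # documented edges
    neg = (~truth & mask).sum()                       # documented non-edges
    tprs, fprs = [], []
    for t in sorted(thresholds, reverse=True):        # strictest threshold first
        pred = conf >= t
        tprs.append((pred & truth & mask).sum() / pos)
        fprs.append((pred & ~truth & mask).sum() / neg)
    return np.trapz(tprs, fprs)                       # trapezoidal AUC

# Toy example on a random 13-gene network.
rng = np.random.default_rng(0)
truth = rng.random((13, 13)) < 0.15
conf = np.clip(truth * 0.5 + rng.random((13, 13)) * 0.5, 0.0, 1.0)
print(roc_auc_from_confidences(conf, truth))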

5.4.2 Application: synthetic network

The synthetic regulatory network consists of 13 genes, as shown in Figure 5.1a. Four time-series expression datasets were generated for the network using differential equations to mimic a transcriptional network. The change in the expression of each gene is determined by a function composed of three parts: activation by a single other gene, repression by a single other gene, and decay. Each of these parts has one parameter, which is (uniformly) randomly selected from a predefined range. Each generated dataset varies because the parameters for activation, inhibition and decay are chosen randomly for each gene, the predefined range of these parameters may vary, the perturbations vary, and other parameters of the simulation (such as the length of the time lag) may also vary. The datasets have varying numbers of samples, ranging from 200 to 600, as detailed in Table 5.1.
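The following is a minimal Python sketch of this kind of simulation. The parameter ranges, the step size, and the omission of time lags and perturbations are simplifying assumptions for illustration; they are not the settings used to produce the datasets in Table 5.1.

import numpy as np

def simulate_expression(n_genes=13, n_steps=200, seed=0):
    """Simulate one time-series dataset in which each gene's rate of change
    is the sum of activation by one gene, repression by one gene and decay,
    with parameters drawn uniformly from predefined ranges."""
    rng = np.random.default_rng(seed)
    activator = rng.integers(0, n_genes, n_genes)   # activating gene for each gene
    repressor = rng.integers(0, n_genes, n_genes)   # repressing gene for each gene
    act = rng.uniform(0.1, 1.0, n_genes)            # activation strengths
    rep = rng.uniform(0.1, 1.0, n_genes)            # repression strengths
    dec = rng.uniform(0.05, 0.3, n_genes)           # decay rates
    x = np.empty((n_steps, n_genes))
    x[0] = rng.uniform(0.2, 0.8, n_genes)           # random initial expression
    dt = 0.1
    for t in range(1, n_steps):
        prev = x[t - 1]
        dx = act * prev[activator] - rep * prev[repressor] - dec * prev
        x[t] = np.clip(prev + dt * dx, 0.0, None)   # expression stays non-negative
    return x

print(simulate_expression().shape)   # (200, 13)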


Figure 5.1: Synthetic regulatory networks. (a) True network. (b) Synthetic consensus network. Edges are shaded or marked according to robustness (i.e. their consensus threshold c) — bold edges obtain a high consensus threshold (c ≥ 0.75). Bold and dashed edges have 0.50 ≤ c < 0.75, whereas the dashed (only) edges have c ≤ 0.25.

109 5.4 Comparison of pre- and post-learning aggregation methods

0.9 All datasets Datasets 1,2 and 3 0.85

0.8

Data3 0.75

0.7 Data1 Mean AUC 0.65 Data2

0.6

0.55 Data4

0.5 Norm. Meta. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Consensus bootstrap threshold

Figure 5.2: AUC performance of learnt synthetic networks

Figure 5.2 compares the mean AUC of each aggregation approach against the others and against the mean AUC of each individual dataset network (shown as horizontal lines). We also compare the combination of all datasets against the combination of a subset of the datasets (where the subset is chosen based on the performance of the networks). We refer to the networks generated by datasets 1-4 as Data1, Data2, Data3 and Data4 respectively, whilst the datasets themselves are referred to as dataset 1, dataset 2, dataset 3 and dataset 4. Figure 5.2 shows that the Consensus approach performs best on the set of individual PDAGs extracted using a bootstrap threshold of 0.1. In this case the approach obtains a mean AUC of 0.76 (for a ROC curve obtained from the set of Consensus networks, for consensus thresholds from 0 to 1, at steps of 1/4 since there are n = 4 datasets). According to the paired t-test, this Consensus network set outperforms 3 of the 4 individual networks (Data1, Data2 and Data4), as well as the Normalisation Only and Meta-analysis networks, with statistical significance (p < 0.01). Meta-analysis, which obtains a mean AUC of 0.68, and Normalisation Only (obtaining a mean AUC of 0.70) only significantly outperform Data2 and Data4.

By selecting a consensus threshold we can obtain a single network structure from a set of Consensus networks. For example, a bootstrap threshold of 0.1 for the input networks, with a consensus threshold of 1.0 (where every edge in the Consensus network must appear in all input networks), provides the best TP and FP rates, which are 0.50 and 0.07 respectively (network not shown). In other words, it is able to identify half of the edges in the true network with a fairly low FP rate.

For the Consensus and Meta-analysis approaches, the robustness of an interaction can be identified using the confidence level or consensus threshold attached to its edge. The 'robustness' of an edge in a Consensus network indicates in how many datasets it is found. Thus we can view a set of Consensus networks as a single network with each edge having a consensus threshold, or as a set of networks, each generated at a different consensus threshold. The 'robustness' attached to a Meta-analysis edge is slightly different, as it incorporates the original bootstrapped confidences; in this case it represents the strength of the edge's confidence over all the individual input networks. This is particularly useful for visualisation of the learnt networks. Figure 5.1b shows the learnt Consensus network (obtained from input networks thresholded at a confidence threshold of 0.1) with edges shaded according to their consensus threshold. It can be seen by eye that there is a relationship between the more robust edges and the true network (shown in Figure 5.1a).

Figure 5.2 shows that Data1, Data2 and Data3 are much closer to the true network than Data4 (since they have higher AUC values). Data4 contains many FP edges. Upon closer inspection of dataset 4, we find that its randomly selected time lag length parameter is much larger than for the other datasets, perhaps explaining why the performance of Data4 is weaker. To eliminate the influence of dataset 4 we ran the Normalisation Only, Meta-analysis and Consensus approaches on datasets 1-3 only. Over the three datasets, Normalisation Only and Meta-analysis perform much better, their mean AUCs increasing to 0.82 and 0.74 respectively. In fact, Normalisation Only outperforms all other networks with statistical significance p < 0.01, whereas the Consensus and Meta-analysis approaches are still unable to significantly outperform Data2. The difference between the performance of the Consensus and Meta-analysis approaches is no longer statistically significant. Since Data4 contains FP edges with high bootstrap confidences, Meta-analysis and Normalisation Only perform far more reliably when dataset 4 is removed from the input. By comparison, the Consensus approach is not so greatly affected by the removal of Data4 (see Figure 5.2). Since the Consensus approach identifies consistencies across the set of individual dataset networks, it is able to discard the false positives introduced by Data4.

5.4.3 Application: E. coli SOS response network

The second application considered is an example of a single transcriptional module in E. coli: the SOS repair system. The module consists of approximately 30 genes and one repressor TF, LexA. UV irradiation and other DNA-damaging agents are known to trigger the induction of the stress-related SOS response, a coordinated increase in the level of expression of the set of genes, which is negatively regulated by LexA (Quillardet et al., 2003). We selected a number of these genes (based on data availability) to form a sub-network (see Figure 5.3). Table 5.2 provides a summary of the four selected datasets, which are all focused on experiments related to the SOS response. The datasets each originate from different research groups and microarray platforms, including both 2-channel technology (cDNA microarrays) and single-channel arrays (Affymetrix oligonucleotide microarrays). For the Affymetrix data, in order to create an equivalent to cDNA microarray log ratio values, we subtracted a gene's average log expression level across experiments from its log expression level in a given experiment, allowing different genes to be compared with each other.
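This transformation amounts to gene-wise centring of the log expression values. A minimal sketch, assuming the average is taken across all experiments in the dataset (the original description is ambiguous on this point), is:

import numpy as np

def affy_to_log_ratio(expr):
    """Convert single-channel (Affymetrix) log expression values into
    log-ratio-like values, comparable to 2-channel cDNA data, by
    subtracting each gene's average log expression across experiments.
    `expr` is an (n_genes, n_experiments) array of log expression levels."""
    return expr - expr.mean(axis=1, keepdims=True)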

Figure 5.3: E. coli SOS response transcriptional module: true network structure


Dataset                     Description                   Platform     No. Observations
Courcelle et al. (2001)     UV irradiation                cDNA         15
Faith et al. (2007)         Various                       Affymetrix   254
Khil et al. (2002)          DNA damage                    cDNA         8
Sangurdekar et al. (2006)   Various inc. UV irradiation   cDNA         240

Table 5.2: Summary of E. coli datasets

[Plot: mean AUC (y-axis, approx. 0.45 to 0.6) against consensus bootstrap threshold, 0.1 to 0.9 (x-axis), plus the Normalisation Only and Meta-analysis results, shown for all datasets and for the 2 largest datasets only; horizontal lines mark the mean AUC of the individual networks (Faith, Sangurdekar, Courcelle, Khil).]

Figure 5.4: AUC performance of learnt E. coli networks

Figure 5.4 compares the mean AUC of each aggregation approach against the others and against the mean AUC of each individual input network (shown as horizontal lines). We also compare the combination of all datasets against the combination of a subset of the datasets (where the subset is chosen based on the performance of the input networks). Figure 5.4 shows that the Consensus networks generated from sets of input networks thresholded at lower bootstrap confidences perform most successfully of the aggregation approaches (the best results are obtained with a bootstrap confidence threshold of 0.1). In this case the Consensus approach obtains a mean AUC of 0.58, outperforming three of the four individual input networks and the Normalisation Only and Meta-analysis approaches (with statistical significance p < 0.01). The low bootstrap threshold may be explained by the fact that there are very few edges with a high confidence level (e.g. over 0.5 or 0.6), and these only occur in the Faith and Sangurdekar networks, for which the datasets contain a larger number of observations. Meta-analysis obtains a mean AUC of 0.52 (significantly outperforming only one of the four individual networks), whilst the mean AUC for Normalisation Only is just 0.47 and it is significantly outperformed by two of the individual dataset networks.

We believe that the nature of the SOS module plays a part in the high number of FP edges and relatively low AUC, in comparison to the results on synthetic data. It is a sparse network (in fact a Naïve Bayes model), so all variables are correlated, becoming independent conditional on the regulator LexA. This makes it more difficult to identify spurious interactions. Figure 5.5 shows a Consensus network (with a consensus threshold of 1.0, generated from input PDAGs calculated at a bootstrap confidence threshold of 0.1). Whilst interactions between LexA and six genes are found, there are many other discovered interactions; the UVR family, for example, are obviously related. In previous experiments on the Courcelle dataset we were able to identify the regulator LexA consistently from a group of candidate transcription factors (regulator genes) for each target gene using BNs (Peeling et al., 2007). However, identifying the regulator when choosing from within a group of correlated genes is far more challenging. This of course also has a bearing on the calculation of FP edges between the learnt models and the 'true' network. In addition, it is likely that the 'true' network is in fact incomplete, which helps explain why the absolute performance of all input networks is much lower in comparison to the synthetic data experiments.

Similarly to the synthetic data, some datasets perform better than others. In this case, the networks generated from the datasets with relatively small numbers of observations, Courcelle and Khil, perform more weakly, obtaining AUCs of 0.49 and 0.44 respectively. We ran Normalisation Only, Meta-analysis and Consensus on the Faith and Sangurdekar networks only. This improved the results for the Consensus approach, increasing the mean AUC to 0.62.



Figure 5.5: E. coli consensus network generated from 2 datasets (0.1 bootstrap threshold) — 1.0 consensus threshold (all edges appear in both datasets)

It outperforms both the Faith and Sangurdekar networks with p = 0.025. Meta-analysis also improves, its mean AUC increasing from 0.52 to 0.57, but it is unable to outperform the Faith network.

On synthetic data (especially on the three 'best' datasets), the simple Normalisation Only approach produced one of the best performing networks. However, on the E. coli data the Normalisation Only approach does not obtain such successful results. In fact, the Normalisation Only networks are the worst performing networks, and do worse in terms of AUC than 3 of the individual dataset networks. This may be explained by the fact that the synthetic data are not generated to contain any experimental or platform biases, whereas these are inherent in the real E. coli data.

5.4.4 Application: yeast heat stress network

We take the example of 9 transcription factors (TFs) related to heat-shock response from Wang et al. (2006) in order to evaluate the algorithm on a sub-network of a manageable size and make a comparison between the two methods¹.

¹ In Wang et al. (2006) they use a network of 10 TFs. We remove the gene SOK2 due to the many missing values in some of the datasets.



Figure 5.6: Yeast heat-stress regulatory network: true network structure

Two of the TFs selected (HSF1 and SKN7) are known to be directly involved in heat-shock response and are documented as regulating 4 TFs among the 9. The sub-network is shown in Figure 5.6. We use microarray datasets that are publicly available on the YeastBASE expression database. Most of the selected datasets are from studies that include heat-shock response experiments; see Table 5.3. Figure 5.7 compares the mean AUC of each aggregation approach against the others and against the mean AUC of each individual dataset network (shown as horizontal lines). As before, we also compare the combination of all datasets against the combination of a subset of the datasets (where the subset is chosen based on the performance of the input networks). Once again, the Consensus network set (generated from all input networks at a low bootstrap confidence threshold of 0.1) obtains the best results of the aggregating approaches, outperforming all individual input networks and obtaining a mean AUC of 0.53. Using the paired t-test, we find this network set outperforms 3 of the 5 individual input networks with statistical significance p < 0.01. The Meta-analysis and Normalisation Only networks obtain mean AUCs of only 0.46 and 0.47 respectively; they are significantly outperformed by the Consensus network set and three of the five individual input networks.


Dataset                    Description                                      Platform   No. Obs
Beissbarth et al. (2000)   Heat-shock response                              cDNA       12
Eisen et al. (1998)        Cold-shock and heat-shock response               cDNA       14
Gasch et al. (2000)        Environmental changes inc. heat-shock response   cDNA       173
Grigull et al. (2004)      Heat-shock response                              cDNA       27
Spellman et al. (1998)     Cell-cycle                                       cDNA       73

Table 5.3: Summary of yeast datasets

Comparison of the AUC for each individual input network shows that three of the networks perform noticeably poorly. If we remove these networks from the input to the algorithms we find a marked improvement for all aggregation approaches (see Figure 5.7). The Consensus approach obtains the best results, with a mean AUC of 0.55, whilst the individual networks for the Gasch and Spellman datasets obtain mean AUCs of 0.53, a statistically significant difference with p = 0.10. In this case, we find the best Consensus networks are generated when the input PDAGs have been obtained by thresholding the bootstrapped networks at relatively higher thresholds of 0.3-0.4. This is because the Gasch and Spellman networks have higher confidences attached to their edges than the networks generated from the other three datasets. The Meta-analysis and Normalisation Only approaches also show an improvement, so much so that there is no statistically significant difference in the AUC for the networks generated by them and by the Consensus approach.

In Figure 5.7, we find that there is a noticeable dip in AUC for the Consensus network at the 0.2 bootstrap threshold. This is explained by the fact that there is a peak in edge confidences between 0.1 and 0.2 in the individual input networks (data not shown). Whilst a 0.2-thresholded individual PDAG includes the same edges as a 0.3 PDAG, the lower threshold means that additional FP edges may also be included, causing the AUC to decrease. Similarly, when lowering the threshold from 0.2 to 0.1, more edges are included, but in this case they are TP edges, causing an increase in AUC.


[Plot: mean AUC (y-axis, approx. 0.42 to 0.58) against consensus bootstrap threshold, 0.1 to 0.9 (x-axis), plus the Normalisation Only and Meta-analysis results, shown for all datasets and for the 2 best datasets only; horizontal lines mark the mean AUC of the individual networks (Gasch, Spellman, Grigull, Eisen, Beissbarth).]

Figure 5.7: AUC performance of learnt yeast networks

In comparison to the work by Wang et al. (2006), both the Consensus and Meta-analysis networks are more successful based on our performance criteria. The Wang et al. network obtains a TP rate of 0.17 and a FP rate of 0.75. In comparison, our Consensus networks (from all networks with a bootstrap threshold of 0.1) obtain mean TP and FP rates of 0.58 and 0.54 respectively at a 0.8 consensus threshold, and 0.16 and 0.09 at a 1.0 consensus threshold. Figure 5.8 shows such a Consensus network (0.8 consensus threshold), which contains 13 TP edges and 7 FP edges. This network shows which edges are more robust (i.e. found in more individual input networks). We should also point out that Wang et al. only use some of the time-series in the Gasch dataset to generate their aggregate network, whereas our Consensus network is generated from a broader set of studies.



Figure 5.8: Yeast consensus network generated from all datasets (0.1 bootstrap threshold) — 0.8 consensus threshold

5.5 Discussion

The purpose of this chapter has been to investigate whether post-learning aggregation for generating GRNs from multiple microarray datasets (that is, learning models from each dataset and combining the models) can produce better results than concatenating the datasets after scale normalisation and then learning the model (a simple pre-learning aggregation method). We have presented two novel post-learning aggregation approaches for combining multiple microarray datasets to generate GRNs and compared them against a simple pre-learning aggregation approach, as well as against the performance of the individual input networks generated from each dataset. Each of the new approaches is based on aggregating high-level features of BN models that have been generated from a set of individual microarray datasets. Thus, they possess the benefits of post-learning aggregation approaches, meaning they can be used to combine datasets generated by different platforms, research groups and laboratories, and do not necessarily require normalisation of the datasets, which can be complicated for cross-platform microarray datasets.


Meta-analysis BNs combine confidence levels attached to network edges using an inverse-variance weighted method, whilst Consensus BNs identify regulatory interactions that are found consistently across all datasets. Both methods produce networks with a measure of 'robustness' attached to each edge, which in a Consensus network indicates in how many datasets the edge is found. The 'robustness' attached to a Meta-analysis edge is slightly different, as it incorporates the original bootstrapped confidences; in this case it represents the strength of the edge's confidence over all the individual input networks.

The pre- and post-learning aggregation approaches were compared with each other as well as against the performance of the individual input networks. On clean, unbiased synthetic data a simple pre-learning aggregation approach (the Normalisation Only network) performs very well, significantly outperforming both the Consensus and Meta-analysis networks and the individual input networks. However, on real data, which is biased and generally noisier, this did not hold. In fact, Normalisation Only often performed worse than many of the networks generated from a single dataset. On the E. coli data, we found that the Meta-analysis and Consensus networks both provided a significant improvement over Normalisation Only. In particular, the Consensus approach increased the AUC by over 0.1. On the yeast sub-network, the absolute increase in AUC was not as great, but was still statistically significant. Thus, on the basis of the experiments presented in this chapter, post-learning aggregation does provide an advantage over concatenating normalised datasets when learning from multiple real microarray datasets.

Whilst Consensus and Meta-analysis outperform Normalisation Only when learning from multiple microarray datasets, we also found that unless the worst-performing datasets were removed, the networks produced by the post-learning aggregation approaches did not always outperform all the individual input networks. This leads to the question: is there a benefit to learning from multiple microarray datasets if the combined models do not outperform all individual dataset models? We believe so. When little is known about the datasets, post-learning aggregation can be used to identify the more robust and consistent interactions across datasets and to filter out noisy and spurious relationships. The Consensus approach identifies consistencies amongst the collection of datasets and so is least affected by poorly performing input networks. On the other hand, since Meta-analysis is a weighted-averaging technique, in which edges with a high confidence level are given more influence, it can work well with only one well-performing dataset, as the influence of lower-confidence edges is weak. Conversely, however, its performance can easily be influenced by a single dataset that contains false positives and negatives with high confidence levels.

Since the Consensus approach is more robust to poorly performing input networks, and is therefore more consistent in the resulting network performance, we conclude that it provides the greatest benefits for combining multiple microarray datasets. However, there is room for improvement in the method. For example, it would be desirable to reduce the number of parameters. When it is used in conjunction with bootstrapping to learn the input networks, the user is required to choose both a bootstrap and a consensus threshold (although the final network can be viewed with edge 'robustness' rather than choosing a consensus threshold). In comparison, Meta-analysis is relatively simple and 'parameter-free', since the bootstrap confidences are used directly to compute the aggregated network (however, if the user wishes to extract a PDAG, a threshold must be chosen).

Additionally, it was found that the datasets which generated the worst performing networks were generally those with a small number of samples (at least, in the case of real data). Including these datasets with a small number of samples can actually have a negative effect by shifting focus from a larger dataset. Therefore it may be advantageous to only accept datasets that are more reliable or of higher quality (e.g. those containing a larger number of samples), or at least to lessen the influence of lower-quality datasets. In the following chapter these issues are addressed with a generalised Consensus approach that requires fewer parameters, together with input dataset/network selection and weighting, which allows different strengths of influence to be assigned to each input dataset/network.

Chapter 6

Incorporating network reliability

6.1 Introduction

The previous chapter presented Consensus Bayesian networks, an approach for combining network models generated from different microarray gene expression datasets. Although experimental results on real microarray datasets indicated that a Consensus model produced from a set of datasets is more accurate than a model generated from a single dataset, the method does have shortcomings. It is a parameter-heavy method, relying on the user to select confidence thresholds for bootstrapped input networks and the overall consensus threshold for the output network. Furthermore, the experimental results also indicated that in some cases, using only a subset of available datasets can produce a better Consensus model than when using all available datasets. This chapter addresses these limitations by presenting the following contributions:

• An improved Consensus approach with reduced parameters — Consensus Bootstrapped Bayesian Networks (CBBNs)

• A simple measure of network reliability

• The incorporation of network reliability measures into the Consensus method in order to take into account the reliability or quality of each input network


The Consensus approach is extended in two different ways. Firstly, a generalised Consensus approach is presented that can act directly on input networks with confidences attached to each edge. This means that the more robust bootstrapped network models (rather than PDAG models produced by thresholding a bootstrapped network) can be used as direct inputs to the Consensus algorithm. As well as removing a level of parameters (i.e. the selection of a confidence level at which to threshold the bootstrapped network), this should improve the reliability and performance of the output Consensus model by allowing all the information in the bootstrapped network to be used.

Second, two methods are introduced for incorporating network reliability measures into the Consensus approach, in order to place more influence on certain input networks. Experimental results presented in Chapter 5 found that the reliability of networks varies across the datasets from which they are generated. For example, the performance of the Consensus network can be better when it is generated from networks for a subset of the available datasets, rather than the total set, especially if some datasets are particularly noisy or biased. In order to address the question of whether the use of reliability measures can improve the final Consensus model, this chapter considers how to assess the reliability of networks and compares different weighting- and selection-based methods for varying the influence of input networks within the Consensus method.

The remainder of the chapter is organised as follows. In Section 6.2 the extended Consensus approach is explained in more detail. Following this, in Section 6.3, methods for measuring the reliability of networks are considered, and two methods are introduced for including network reliability measures in the Consensus method. This section also includes a brief literature review on considering bias when learning or modelling from multiple datasets. A comparison of the different methods for incorporating network reliability is presented in Section 6.4, where experimental results on synthetic and real data are included and discussed. Finally, Section 6.5 presents the overall conclusions from this chapter.


6.2 Consensus Bootstrapped Bayesian Networks

This section presents a generalised Consensus approach, Consensus Bootstrapped Bayesian networks (CBBNs), that can act directly on input networks with confidences attached to each edge. In the application we consider, these confidences are generated using a bootstrapping approach. The output Consensus network, at a given consensus threshold, is also a network with confidences attached to each edge. This means that the user can select a level of robustness (consensus threshold) for the Consensus network, and the resulting output network then provides a confidence for the existence of each edge. Since the generalised approach is based on networks with edge confidences, it also removes the problem of assigning uncertain edge directions that arises in the original algorithm. Section 6.2.1 describes the algorithm in detail, whilst Section 6.2.2 explains how weighting can be incorporated into the method to allow each input network to have a different influence on the final Consensus network. Section 6.2.3 compares the original Consensus approach with the new algorithms detailed here, highlighting the key differences.

6.2.1 Algorithm

The CBBNs approach takes as input a group of bootstrapped networks and a consensus threshold t, and as output it returns a Consensus network which is of the same form as the input networks: each edge has a confidence attached. In the input bootstrapped networks, the confidence attached to each edge is referred to as the input confidence. In the Consensus network we refer to the confidence level attached to each edge as the consensus confidence.

The algorithm proceeds as follows. For the input set of bootstrapped networks, the edge between each pair of nodes is considered in turn. For each possible edge a set C of input confidences is created, which contains the input confidences for that edge from each input network. Next, we order the input confidences in C from highest to lowest. Then a subset Cmax ⊆ C is created, where Cmax contains the highest input confidences in C, and the size of this subset is based on the consensus threshold t and the number of input networks n, such that |Cmax| = floor(nt). Recall that the consensus threshold t is between 0 and 1 (and can be thought of as a percentage¹), so |Cmax| ≤ n. Then, we define the consensus confidence, the confidence for each edge in the final Consensus network at consensus threshold t, as the minimum input confidence in the subset Cmax. In this way, the consensus confidence is taken from the set of input confidences. This makes the process analogous to the original approach: in the original approach, an edge exists in the Consensus network if it appears in a proportion of the input networks that exceeds the consensus threshold; in the new approach, an edge's consensus confidence is defined by the minimum input confidence among the input networks with the highest input confidences, where the proportion of these input networks exceeds the consensus threshold. By following this procedure for each possible edge in the network, we build a Consensus network where each edge has an attached consensus confidence. A Consensus network can be formed for each consensus threshold t. The algorithm is detailed step-by-step in Algorithm 3.

This means that for a set of n networks, a 1/n consensus threshold leads the consensus confidence for each edge to be the maximum input confidence for that edge across all input networks (i.e. the maximum input confidence in the set C). Conversely, under a 1.0 (100%) consensus threshold, the consensus confidence for each edge is the minimum input confidence. In general, a low consensus threshold leads to higher consensus confidences in the Consensus network, whilst a high consensus threshold leads to lower consensus confidences. This is analogous to the original Consensus approach, where under a 1/n consensus threshold an edge need only exist in one of the input networks to be included in the Consensus network, so there are likely to be more edges in the Consensus network than under a 1.0 consensus threshold, where an edge in the Consensus network must appear in every input network. In this respect, the benefit of the new CBBNs approach over the original approach is that a consensus confidence is attached to each edge in the Consensus network, which provides more information on the reliability of each edge. For example, if a particular edge attains a high input confidence in each input network, then it will still attain a high consensus confidence in the Consensus network at a low consensus threshold.

¹ Note that in this chapter, the consensus threshold may be referred to as a number between 0 and 1, or as the equivalent percentage, which can be a more intuitive form of notation in this instance.


Consensus Bootstrapped Bayesian Networks
Input: Set of n bootstrapped networks, consensus threshold t (between 0 and 1)
Output: Consensus network

for each pair of nodes i, j do
    1. Create set C = {cij1, ..., cijn}, where cijk is the input confidence for the edge i → j in the kth input network
    2. Reorder and re-index the input confidences in C from highest to lowest, so that cij1 is the highest input confidence and cijn is the lowest
    3. Create subset Cmax ⊆ C such that A = |Cmax| = floor(nt) and Cmax = {cij1, ..., cijA}
    4. Define the edge consensus confidence as Conij = min(Cmax)
end

Algorithm 3: Consensus Bootstrapped Bayesian networks
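A compact Python sketch of Algorithm 3, operating on stacked edge-confidence matrices, might look as follows. It is illustrative only; the small epsilon guards against floating-point error in floor(nt), and the toy matrices reproduce the worked example for the edge from node 1 to node 3 in Figure 6.1.

import numpy as np

def cbbn(conf_matrices, t):
    """Consensus Bootstrapped Bayesian Networks (Algorithm 3).
    conf_matrices: list of n equally-shaped arrays of edge input confidences.
    t:             consensus threshold in (0, 1].
    Returns, for each edge, the minimum of its floor(n*t) highest input
    confidences across the n networks."""
    stacked = np.stack(conf_matrices)                # shape (n, ...)
    n = stacked.shape[0]
    a = max(1, int(np.floor(n * t + 1e-9)))          # |Cmax| = floor(nt)
    ordered = np.sort(stacked, axis=0)[::-1]         # descending per edge
    return ordered[a - 1]                            # a-th highest confidence

# Edge 1 -> 3 with input confidences 0.81, 0.79 and 0.45:
nets = [np.array([[0.0, 0.81]]), np.array([[0.0, 0.79]]), np.array([[0.0, 0.45]])]
for t in (1/3, 2/3, 1.0):
    print(t, cbbn(nets, t)[0, 1])                    # 0.81, 0.79, 0.45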

The CBBNs process on an example set of input networks is shown in Figure 6.1. On the top line are the three input bootstrapped networks, displayed as matrices of edge input confidences between node pairs. The bottom line shows three Consensus networks, each for a different consensus threshold (t = 1/3, t = 2/3 and t = 1). The Consensus networks are also displayed as matrices of the edge consensus confidences. We can see that in the Consensus network for the lowest consensus threshold, t = 1/3, each edge has the maximum input confidence for that edge across all input networks. In the Consensus network for the middle consensus threshold, t = 2/3, each edge has the second ordered input confidence for that edge across all input networks. Finally, in the Consensus network for the highest consensus threshold, t = 1, each edge has the minimum input confidence for that edge across all input networks. On the right of the figure, the set C of ordered input confidences across the input networks for the edge from node 1 to node 3 is shown. Each confidence is linked to its corresponding consensus threshold. For example, for the Consensus network at t = 2/3, the size of the subset Cmax is |Cmax| = nt = 3 × (2/3) = 2. Thus Cmax = {0.81, 0.79}, and so the consensus confidence for edge 1 → 3 at t = 2/3 is min(Cmax) = 0.79.

6.2.2 Incorporating weights

In the CBBNs method described in the previous section, each input network has an equal influence in the Consensus algorithm. When we incorporate weighting, each input network can be given a weight that is proportional to its influence on the final Consensus network. The calculation of input network weights is addressed in Section 6.3. For now, we assume that the weights for all input networks sum to 1.

It is fairly simple to generalise the CBBNs algorithm to deal with unequal influences of the input networks. The process is identical to the CBBNs approach described in Section 6.2.1, except that each edge input confidence is paired with its network weight. The algorithm proceeds as follows (see also Algorithm 4 for step-by-step details). Each edge is considered in turn.


[Figure: three input bootstrapped networks shown as 4 × 4 matrices of edge input confidences, with the corresponding output Consensus networks, also shown as matrices of consensus confidences, for consensus thresholds 1/3 (maximum confidence), 2/3 and 3/3 (minimum confidence). For the edge from node 1 to node 3, the ordered input confidences are C = {0.81, 0.79, 0.45}, giving consensus confidences 0.81, 0.79 and 0.45 at thresholds 1/3, 2/3 and 3/3 respectively.]

Figure 6.1: Consensus Bootstrapped Bayesian networks. On the top line are the three input bootstrapped networks, displayed as matrices of edge input confidences between node pairs. The bottom line shows the Consensus Bootstrapped BNs, also displayed as matrices of edge consensus confidences, for three consensus thresholds (from 1/n to 1 at steps of 1/n, where n = 3). On the right, the set C of ordered input confidences from the input networks for the edge from node 1 to node 3 is shown, and the table gives the consensus confidence associated with each consensus threshold.


[Figure: five input confidence-weight pairs for a single edge, ordered by confidence:

Network weight   Edge confidence
0.1              0.24
0.3              0.13
0.4              0.12
0.1              0.11
0.1              0.10

with the resulting consensus confidence indicated for consensus thresholds of 10%, 40%, 80%, 90% and 100%.]

Figure 6.2: Weighted Consensus networks: deciding the consensus confidence for a single edge. The input confidences from five input networks are ordered descendingly, paired with the network weights that range from 0.1 to 0.4. The consensus confidences associated with a range of consensus thresholds from 10% to 100% (0.1 to 1.0) are indicated. For example, the consensus confidence 0.24 is associated with the lowest consensus threshold of 10%, whilst the consensus confidence 0.1 is associated with the highest consensus threshold of 100%.

For each edge, a set C is created of input confidence and network weight pairs, where the kth pair contains the input confidence for that edge in the kth input network and the weight wk of the kth input network. The pairs in C are ordered descendingly by input confidence, as in the CBBNs approach. Then a subset Cmax ⊆ C of this ordered set is created, which contains the pairs with the highest input confidences. The size of the subset Cmax depends on the network weights and the consensus threshold t: the sum of the network weights of the pairs in subset Cmax must equal or exceed the consensus threshold t. Then, the edge consensus confidence is the minimum input confidence in the subset Cmax.

Weighted Consensus Networks
Input: Set of n bootstrapped networks, each with an attached weight wi indicating its influence, such that w1 + ... + wn = 1, and a consensus threshold t (between 0 and 1)
Output: Consensus network

for each pair of nodes i, j do
    1. Create set C = {(cij1, w1), ..., (cijn, wn)}, where cijk is the edge confidence for the edge i → j in the kth input network and wk is the weight of the kth input network
    2. Reorder and re-index the confidence-weight pairs in C from highest to lowest confidence, so that cij1 is the highest edge confidence and cijn is the lowest
    3. Create subset Cmax ⊆ C such that Cmax = {(cij1, w1), ..., (cijA, wA)}, where w1 + ... + wA ≥ t
    4. Define the edge consensus confidence as Conij = minconf(Cmax), the minimum confidence in Cmax
end

Algorithm 4: Weighted Consensus networks

The subset selection process is illustrated in Figure 6.2 for a single edge example. In this case there are five input networks with weights ranging from 0.1 to 0.4. The network weight and edge input confidence pairs are ordered by input confidence so that C = {(0.1, 0.24), (0.3, 0.13), (0.4, 0.12), (0.1, 0.11), (0.1, 0.1)}. We build the subset Cmax by considering the pairs in turn, from the highest input confidence pair to the lowest. For example, for the consensus threshold 0.8 (80%) we build Cmax by continually adding pairs (in confidence order from highest to lowest) until the weights sum to or exceed 0.8. In this case, Cmax = {(0.1, 0.24), (0.3, 0.13), (0.4, 0.12)}, since the weights sum to 0.8. The consensus confidence for this edge is then 0.12, the minimum confidence in the subset. Another way to view this is to say that for consensus thresholds from 0 to 0.1, Cmax contains the pair (0.1, 0.24) and the consensus confidence is 0.24. For consensus thresholds from 0.11 to 0.4, Cmax contains the pairs (0.1, 0.24) and (0.3, 0.13), meaning that the consensus confidence is 0.13, and so on.

The Weighted Consensus networks approach updates the CBBNs approach so that input networks with higher weights have more influence over each edge's consensus confidence than input networks with lower weights. For example, for the single edge example shown in Figure 6.2, the input network with the highest input confidence of 0.24 only has a weight of 0.1, so it only influences the final Consensus network for a small interval of consensus thresholds (0-0.1, or 0-10%, in this particular example). By contrast, the input network with weight 0.4 has an influence over a larger interval of consensus thresholds (41-80%). In the CBBNs approach, the input confidences from each network have an influence over equal intervals of the consensus threshold. For the same example, using the CBBNs approach, the input network with the highest confidence of 0.24 would influence the consensus confidence over a larger interval of consensus thresholds, from 0-20%. However, by assigning a low weight to this network, its influence can be reduced. This may be appropriate if the user has less confidence in the reliability of this network. Section 6.3 discusses how the input network weights can be calculated, based on each network's reliability.
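A direct transcription of this per-edge procedure, reproducing the worked example of Figure 6.2, might look as follows in Python (an illustrative sketch; the final fallback return simply guards against floating-point round-off when the weights sum to slightly less than 1):

def weighted_consensus_confidence(pairs, t):
    """Consensus confidence for a single edge under Weighted Consensus
    networks (Algorithm 4). `pairs` is a list of (weight, confidence)
    tuples, one per input network, with the weights summing to 1."""
    ordered = sorted(pairs, key=lambda p: p[1], reverse=True)  # by confidence
    total = 0.0
    for weight, conf in ordered:
        total += weight
        if total >= t:       # Cmax now covers the consensus threshold
            return conf      # min confidence in Cmax = last pair added
    return ordered[-1][1]    # guard against float round-off at t = 1.0

pairs = [(0.1, 0.24), (0.3, 0.13), (0.4, 0.12), (0.1, 0.11), (0.1, 0.10)]
for t in (0.1, 0.4, 0.8, 0.9, 1.0):
    print(t, weighted_consensus_confidence(pairs, t))
# prints 0.24, 0.13, 0.12, 0.11 and 0.10, matching Figure 6.2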


6.2.3 Comparison to the original Consensus algorithm

Whilst the CBBNs algorithm draws on the same algorithmic basis as the original approach detailed in Chapter 5 (the idea that an interaction between nodes is more reliable if it is exhibited in more datasets or networks), the algorithms detailed in the previous sections (6.2.1 and 6.2.2) are essentially new methods that act on different inputs and produce a different type of output. Whilst the original method uses PDAGs to represent the input and output networks, the CBBNs and Weighted CBBNs approaches take as input bootstrapped networks (i.e. networks with confidences attached to each edge) and also output a single network that has confidences attached to each edge. This allows the use of more robust bootstrapped network models as input, which should help to improve the reliability of the output Consensus network. Whilst it is possible to use bootstrapped networks with the original Consensus approach, the algorithm requires PDAGs as the input network format, so it is necessary to threshold the bootstrapped networks first in order to produce PDAGs. This leads to another layer of parameters, as the user must select a confidence threshold for each input data source. This highlights the key benefit of CBBNs and Weighted CBBNs: the more robust bootstrapped networks can be used directly to produce a Consensus model. Since the inputs and outputs of the original and updated algorithms are different, it is not straightforward (or appropriate) to make a direct performance comparison between them. Instead, we compare the CBBNs and Weighted CBBNs approaches against the individual input networks and against a network generated from an aggregate dataset formed by combining the input datasets followed by scale normalisation (a pre-learning aggregation approach), as carried out in Chapter 5 for the original algorithm.

6.3 Measuring network reliability

This section considers how a measure of a network's quality or reliability can be calculated, and how this can be integrated into the CBBNs or Weighted Consensus algorithms described in the previous section. First, a method based on network predictive accuracy for measuring network reliability is presented in Section 6.3.1. Following this, in Section 6.3.2 we describe how prediction-based network reliability measures can be translated into weights for use with the Weighted Consensus networks approach. In Section 6.3.3 a further method is presented, which uses prediction-based network reliability measures to select and discard the input networks for use with the CBBNs algorithm. The final part of this section (Subsection 6.3.4) considers related work on varying the influence of individual inputs when learning or modelling from multiple data sources.

6.3.1 A prediction-based reliability measure for networks

The robustness, or ability to generalise, of a network can be measured by how well its node values can be predicted over independent datasets. A network that can predict node values with high accuracy on other independent datasets can be said to be more robust and reliable. This is a fairly intuitive measure of robustness: in particular, it shows that the network does not overfit the dataset from which it was generated, since it performs well on other datasets. The procedure for node value prediction over an unseen, independent dataset is described in detail in Section 3.4.2. A brief reminder is provided here. The conditional probability distributions for each node are estimated using the same dataset from which the network structure was learnt. Then the parameterised network model is used to predict the value of each node, based on the values of its influencing nodes in the network, over samples from independent datasets. The success of prediction for each node is measured using prediction accuracy, the proportion of samples where the predicted value is correct.

To calculate prediction-based reliability measures for a set of networks, the following method is used. For each network, prediction accuracy is calculated for each node, over a random sample of observations equally distributed across the other datasets (those from which the other networks were generated). Then a median prediction accuracy is calculated for each network, taken across all nodes (the median is used to account for outliers: nodes that perform unusually badly or well). In this chapter, the median predictive accuracy is used as the measure of network reliability. Note that in the experiments described later in this chapter, since the input networks are generated from a bootstrap set of DAGs, the prediction accuracy for each node and network is averaged across the bootstrap set of networks.

In general, a positive correlation can be found between the median predictive accuracy of a network (where the median is taken across all nodes in the network) on independent data samples and the AUC of the network when compared against the true (documented) network. For example, Figure 6.3 shows the correlation between median predictive accuracy and AUC for a collection of synthetic datasets generated from the same network structure. The Pearson correlation coefficient for the data in this plot is ρ = 0.59 (p = 0.00002). This relationship provides further motivation for the use of predictive accuracy as a measure of network quality or reliability.
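A sketch of this reliability measure is given below. Here predict(node, sample) is an assumed helper standing in for the parameterised network's node-value prediction described in Section 3.4.2; it is not an actual interface from this thesis.

import numpy as np

def median_predictive_accuracy(predict, test_samples):
    """Median per-node prediction accuracy of one network over samples
    drawn from the other, independent datasets. `predict(node, sample)`
    returns the network's predicted value of `node` given the values of
    its influencing nodes in `sample` (assumed helper)."""
    n_nodes = test_samples.shape[1]
    accuracies = []
    for node in range(n_nodes):
        hits = sum(predict(node, row) == row[node] for row in test_samples)
        accuracies.append(hits / len(test_samples))
    # The median damps the effect of nodes that predict unusually well or badly.
    return float(np.median(accuracies))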


Figure 6.3: Correlation between network median predictive accuracy and AUC for a collection of synthetic datasets. The synthetic datasets were generated based on the same true network, using differential equations and with different levels of added noise (see Section 6.4.1 for more details).


6.3.2 Prediction-based network weighting

The prediction-based reliability measures for a set of networks can be translated into network weights, compatible with the Weighted Consensus networks approach described in Section 6.2.2. Simply, each network's weight is calculated as its median prediction accuracy as a proportion of the total sum of median prediction accuracies across the set of all networks:

\[ w_i = \frac{a_i}{\sum_{k=1}^{N} a_k} \]

where N is the number of networks, wi is the weight for network i and ai is the median predictive accuracy for network i. This method of calculation ensures that all weights sum to 1, as required by the Weighted Consensus networks algorithm. Each weight then represents the proportional influence of that network in the Consensus algorithm.
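In code this normalisation is a one-liner; the following sketch is purely illustrative:

def prediction_weights(median_accuracies):
    """Translate median predictive accuracies into Weighted Consensus
    weights: each weight is a network's accuracy as a proportion of the
    total, so the weights sum to 1."""
    total = sum(median_accuracies)
    return [a / total for a in median_accuracies]

print(prediction_weights([0.55, 0.40, 0.30]))   # [0.44, 0.32, 0.24]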

6.3.3 Prediction-based network selection

Experimental results presented in the previous chapter indicated that in some cases, using only a subset of the available input datasets/networks can produce a better performing Consensus model than using all available datasets. Therefore, as an alternative, the prediction-based reliability measures for a set of networks can also be used to select a subset of input networks for the CBBNs approach (as described in Section 6.2.1, where each network has an equal influence on the Consensus process). In this case, instead of weighting the influence of individual networks, the least reliable networks are simply discarded from the input. Using the prediction-based reliability measure, the input networks can be ranked from most reliable to least reliable. There are then a number of ways in which a line can be drawn between the most and least reliable networks, and the least reliable networks discarded from the input (both strategies are sketched in code after this list):

• Networks-above-x selection. In this case a threshold x for the median predictive accuracy is selected. All networks with median predictive accuracy ≥ x are selected, and networks with median predictive accuracy < x are discarded.

• Best-n-networks selection. In this case, n is a number or proportion of the available networks. This is a simple method where the top n networks (i.e. the n networks with the highest median prediction accuracy) are selected as input, and the remaining networks are discarded.
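Both selection strategies reduce to simple operations on the list of median predictive accuracies, as in this illustrative sketch (the function names and example accuracies are our own):

def networks_above_x(accuracies, x):
    """Networks-above-x selection: keep every network whose median
    predictive accuracy is at least x."""
    return [i for i, a in enumerate(accuracies) if a >= x]

def best_n_networks(accuracies, n):
    """Best-n-networks selection: keep the n networks with the highest
    median predictive accuracies."""
    ranked = sorted(range(len(accuracies)), key=lambda i: accuracies[i], reverse=True)
    return sorted(ranked[:n])

acc = [0.52, 0.31, 0.47, 0.44]          # median accuracies of four networks
print(networks_above_x(acc, 0.4))       # [0, 2, 3]
print(best_n_networks(acc, 2))          # [0, 2]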

One issue with these network selection methods is that they introduce another parameter into the Consensus process. Later in this chapter, the methods of network weighting and network selection are compared to see which is more effective at improving the final Consensus model. A further goal of the comparison is to consider how the parameters for each network selection approach (i.e. the median predictive accuracy threshold x and the number or proportion of input networks n) can be chosen in order to optimise the final Consensus network performance.

6.3.4 Related work

Whilst using weights to vary the influence of each dataset when learning or modelling from multiple data sources is not new, most previous research has focused on combining classifiers, whereas we are concerned with learning and combining network structures. In particular, the idea of combining models has some similarities to ensemble learning, such as boosting (Schapire, 2003), which aims to combine several weak classifiers into one strong classifier. In general the weak classifiers are combined according to some weighting that is related to their accuracy. Similarly, Bayesian Model Averaging (BMA) (Hoeting et al., 1999) is a technique that calculates a (weighted) average over the posterior distributions of a set of potential models. There has been research on using BMA with BNs for classification and prediction. However, BMA is not model combination; instead it is designed to address uncertainty in model selection given a particular dataset. Both types of technique are designed to improve classification or prediction for a model over a single dataset. By contrast, we are concerned with combining models generated from multiple datasets, which have their own biases and levels of noise. This means that we may have a high quality dataset requiring a high weight, and a low quality dataset requiring a low weight. Ensemble learning and BMA have not been designed to deal with this type of input.

6.4 Network selection or network weighting: a comparison

This section reports on the experiments performed to evaluate the use of the CBBNs approach on multiple microarray datasets. The standard CBBNs approach is compared with the prediction-based weighted and prediction-based selection Consensus approaches. It is also compared against the performance of the individual input networks and the 'Normalisation Only' approach (a network generated from a combined dataset formed by concatenating and normalising the individual datasets), as was carried out in the previous chapter for the original Consensus algorithm. An additional objective of these experiments is to evaluate which is the better method for prediction-based selection in conjunction with the CBBNs approach: networks-above-x selection or best-n-networks selection. In order to do this, a number of Consensus networks are generated for prediction-based selection, based on the use of different subsets of input networks. Section 6.4.1 details the datasets used and the experimental design. Sections 6.4.2-6.4.3 present results on synthetic and real microarray datasets.

6.4.1 Datasets and experiment design

In order to examine the relationship between input network quality and the performance of the resulting Consensus network, experiments were first carried out on synthetic microarray datasets to provide a controlled setting, before moving on to evaluation on a real data application. In this section, the different types of datasets used are described, and the experiments performed and the evaluation are explained. The synthetic datasets are based on a synthetic network, generated from a regulatory network structure of 13 genes. This is the same network that is used for the experiments detailed in the previous chapter (see Section 5.4.2).


Figure 6.4: Adding Gaussian noise to a clean dataset (panels: clean data, and added noise of variance 0.1, 0.2, 0.4 and 0.8)

In order to investigate how network quality affects the Consensus approach, a number of further datasets were generated by adding Gaussian noise, with various variances, to each dataset. The effects of the various noise levels on one gene in dataset 1 are shown in Figure 6.4. This means we have clean datasets, generated directly from the differential equations, and datasets with added noise. Each dataset was discretised using an equal-frequency method with three bins. Input bootstrapped networks were then generated from each dataset. Each Consensus approach was run on different collections of input networks, which are detailed in Appendix A, Table A.1. Each set of input networks contains 4 networks, where each network is generated from a version of one of the original four datasets (no set contains more than one version of each dataset). An additional 4 collections of datasets were also generated, where Gaussian noise of different variances (randomly selected between 0.1 and 0.8) was added to each gene. These are denoted as ‘variable noise by gene’. The sets of networks are ordered in Table A.1 by their collective level of reliability, which is measured as the median of median predictive accuracies for the input networks generated from each dataset in the set. In general, the Gaussian noise applied across the datasets increases as the collective level of reliability of the network set decreases.

For the real data application, we use the same five yeast microarray expression datasets used in the previous chapter (see Section 5.4.4). This is a sub-network of 9 regulatory genes that are related to heat-shock response. The microarray datasets are publicly available on the YeastBASE expression database — see Table 5.3 for more details. The learnt networks are evaluated by comparing them to documented gene interactions, obtained from the online YEASTRACT database (Teixeira et al., 2006).
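The two generation steps described above, noise addition and equal-frequency discretisation, are simple to sketch. The following is a minimal illustration (not the thesis code) using the noise variances and three-bin discretisation mentioned in the text.

import numpy as np

def add_noise(data, variance, rng):
    """Add zero-mean Gaussian noise of the given variance to a clean
    expression profile or matrix."""
    return data + rng.normal(scale=np.sqrt(variance), size=data.shape)

def discretise_equal_frequency(values, bins=3):
    """Equal-frequency discretisation: choose cut points at quantiles so
    each bin receives roughly the same number of samples."""
    cuts = np.quantile(values, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(values, cuts)  # state labels 0..bins-1

rng = np.random.default_rng(1)
clean = np.sin(np.linspace(0, 4 * np.pi, 200))  # one gene, 200 time points
for variance in (0.1, 0.2, 0.4, 0.8):
    states = discretise_equal_frequency(add_noise(clean, variance, rng))
    print(variance, np.bincount(states))  # roughly 67 samples per bin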


As described earlier, there are 18 collections of datasets for the synthetic networks and one set of real yeast gene expression datasets. For each collection of datasets — which is referred to as a case — a bootstrapped network is learnt for each individual dataset. For each case, the bootstrapped networks are then used as inputs to the Consensus approach. The following network types are generated for each case:

• Standard CBBNs method — generated from all input networks — referred to as the CBBN network

• Prediction-based weighted network, generated from all input networks — referred to as the weighted network

• Prediction-based selection networks, generated from subsets of all input networks, where the networks selected are those with the highest median predictive accuracies — referred to as the selection networks

• A bootstrapped network generated from a concatenated and scale-normalised dataset (referred to as the Normalisation Only network), in order to facilitate comparison to pre-learning aggregation

For the prediction-based selection method, a number of different networks are generated. For example, since there are four input networks in total, two subsets would be used to generate Consensus networks. The first subset would contain the three networks with highest median prediction accuracies, and the second subset would contain the two networks with the highest median prediction accuracies. This is because one objective of these experiments is to establish the best method for prediction-based selection. By examining the Consensus network performance results for prediction-based selection, in conjunction with the median prediction accuracies of the selected input networks, we can establish whether networks-above-x selection or best-n-networks selection works best, and comment on how the parameters x and n can be chosen effectively. The performance of the Consensus networks is evaluated by comparing them against the true network in terms of TP and FP interactions. As described in Section 3.4.1, this comparison can be represented by a single AUC value.

A higher AUC value (assuming values are above 0.5) indicates a better performing network. Note that a different network is generated for each consensus threshold from 0 to 1 (0-100%). This means that the AUC value may vary across consensus thresholds for each different approach. Therefore in the comparison, for each approach we consider the maximum AUC value achieved and the corresponding consensus threshold.
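This evaluation loop can be made concrete with a short sketch. It uses one simple reading of the Consensus combination, in which each input network is reduced to a set of edges and an edge's support is the weighted fraction of networks containing it; the CBBN variant feeds bootstrap confidences in directly, so this is an illustration rather than the thesis implementation, and the function names are assumptions.

def consensus_edges(networks, threshold, weights=None):
    """Edges whose weighted support across the input networks meets the
    consensus threshold. Each network is a set of edges; an edge's
    support is the total weight of the networks containing it (weights
    default to equal, giving the fraction of networks)."""
    if weights is None:
        weights = [1.0 / len(networks)] * len(networks)
    all_edges = set().union(*networks)
    return {e for e in all_edges
            if sum(w for w, net in zip(weights, networks) if e in net)
            >= threshold}

def best_threshold(networks, auc_fn, weights=None, steps=100):
    """Scan consensus thresholds 1..steps (%), score each resulting
    network with auc_fn (a comparison against the true network), and
    return the maximum AUC and the threshold interval achieving it."""
    scores = [(t, auc_fn(consensus_edges(networks, t / steps, weights)))
              for t in range(1, steps + 1)]
    best = max(auc for _, auc in scores)
    hits = [t for t, auc in scores if auc == best]
    return best, (min(hits), max(hits))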

6.4.2 Application: synthetic networks

The AUC performance comparison plot (top) in Figure 6.5 shows the maximum AUC performance for each approach by network set (referred to as sets 1-18). In addition, the AUC of the best performing input network is recorded on the plot. Recall that a Consensus network of each type is generated for each consensus threshold from 0-1 (0-100%). This means that the maximum AUC performance (as shown in the plot) relates to a specific consensus threshold or interval of consensus thresholds. The consensus threshold interval comparison plot (bottom) in Figure 6.5 shows the length of the consensus threshold interval for which each approach achieves its maximum AUC value. In Appendix A, Table A.2 shows the same information in tabular form. For each approach and set of networks, the maximum AUC value and the consensus threshold or interval of thresholds for which this is achieved are listed. Additionally, this table lists the number of input datasets for which the best performance is achieved with the prediction-based selection approach.

6.4.2.1 Performance comparison of approaches

The AUC performance comparison plot (top) in Figure 6.5 (with full details given in Table A.2) shows the following:

• The CBBN approach outperforms, or equals, the performance of the best performing input network in 12 of the 18 cases.

• The CBBN approach outperforms, or equals, the performance of the Normalisation only network in 16 of the 18 cases.



Figure 6.5: Synthetic results: across all network sets, a comparison of the maximum AUC (top panel) and the length of the consensus threshold interval for which this is achieved (bottom panel)


• The prediction-based weighting and selection approaches outperform or equal the CBBN approach in all 18 cases. In 16 of these cases, both the weighting and selection approaches also outperform or equal the performance of the best performing input network. In 17 of these cases, both the weighting and selection approaches also outperform or equal the performance of the Normalisation only network.

• The prediction-based weighting approach outperforms the selection approach in 2 cases, and the prediction-based selection approach outperforms the weighting approach in 3 cases. In all other cases both approaches have an equal maximum AUC.

Based on these results, we can see that the two new approaches, prediction-based weighting and selection, always improve on or at least equal the performance of the standard CBBN. In particular, the new approaches are able to outperform the best performing input network and the Normalisation only network in more cases than the standard CBBN. However, there is less of a difference in performance between the weighting and selection methods — in the majority of cases they achieve equal maximum AUCs. In order to obtain a p-value indicating the statistical significance of these findings, a paired t-test was used, the results of which are shown in Table 6.1. All Consensus-based approaches outperform the Normalisation only network with p ≤ 0.003. The weighting and selection approaches also both outperform the best input network with statistical significance (p=0.016 and 0.006 respectively), whilst the CBBN approach only obtains a p-value of 0.27 for outperforming the best input network. The weighting and selection approaches also outperform the CBBN method with statistical significance (p=0.008 and 0.009 respectively), but there is no significant difference between the weighting and selection approaches (p=0.63).

The AUC performance comparison plot (top) in Figure 6.5 also shows a relationship to the collective level of reliability of the network sets. Recall that the network sets are ordered in terms of their collective reliability (i.e. overall, the networks in set 1 have a higher prediction accuracy than those in set 18). It can be seen in this plot that in general the maximum AUC decreases as the collective level of reliability decreases, which is what we would expect. However, we also notice that although the weighting and selection approaches equal or outperform the CBBN and best input network in the majority of cases, the largest increases in performance are found with the most reliable collections of datasets (e.g. sets 1-6). This implies that whilst the weighting and selection Consensus approaches are beneficial for combining datasets with any level of noise, combining better quality data can produce even greater increases in performance.


Approach 1    Approach 2            p-value
CBBN          Normalisation only    0.003
Weighting     Normalisation only    0.0003
Selection     Normalisation only    0.0004
CBBN          Best input network    0.27
Weighting     Best input network    0.016
Selection     Best input network    0.006
Weighting     CBBN                  0.008
Selection     CBBN                  0.009
Weighting     Selection             0.63

Table 6.1: Synthetic results: statistical significance performance comparison between approaches. Using a paired t-test, this table shows the p-value for whether Approach 1 significantly outperforms Approach 2.
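The comparisons in Table 6.1 are one-sided paired t-tests over the per-set maximum AUC values of two approaches. A minimal sketch with SciPy follows; the AUC arrays here are synthetic placeholders, not the thesis results.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Placeholder maximum-AUC values for 18 network sets (NOT the thesis data)
auc_cbbn = rng.uniform(0.6, 0.9, size=18)
auc_weighting = np.clip(auc_cbbn + rng.normal(0.02, 0.02, size=18), 0.0, 1.0)

# One-sided paired t-test: does Approach 1 (weighting) outperform Approach 2?
t_stat, p_value = stats.ttest_rel(auc_weighting, auc_cbbn, alternative="greater")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")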

6.4.2.2 Selection of the consensus threshold

The comparison of approaches discussed in the previous section (Section 6.4.2.1) is based on the maximum AUC achieved by each method, where the maximum AUC corresponds to a specific interval of consensus thresholds. Figure 6.5 shows that this interval varies by network set. Therefore, an issue that is raised for all approaches is: how do we predict the exact consensus threshold where the maximum AUC value is to be found? In order to consider this in more detail, Figure 6.6 compares the AUC performance for the different approaches, across all consensus thresholds from 0-1 (0-100%), for a selection of the network sets. In these plots we can see how the performance of each method can vary considerably across the range of consensus thresholds.



Figure 6.6: Synthetic networks: performance comparison of the approaches for a selection of individual network sets. The x-axis indicates the consensus threshold (between 0-100%) whilst the y-axis indicates the network’s AUC value

First, we can see that the performance of the prediction-based weighting approach varies considerably by consensus threshold. The thresholds where the AUC changes correspond to ‘weighting boundaries’ — thresholds at which an additional network is able to influence the final Consensus network. In general, these weighting boundaries correspond to the areas around the standard equal-weighted boundaries — at around the 25%, 50% and 75% thresholds. This is because in these cases the network weights are relatively close to being equal weights. However, even in these cases, the use of weights can cause a significant improvement in AUC. For example, in sets 4 and 5, there is a significant peak at

around the 25% threshold for the AUC value of the weighted Consensus network. In particular, in set 5 at around the 25% threshold, the AUC value rises from around 0.77 (CBBN) to 0.85 (prediction-based weighting). However, there are also intervals of consensus threshold where the weighting can cause a significant decrease in AUC. For example, in set 4 at around the 25% consensus threshold, there is a significant decrease in the AUC value just prior to its significant increase.

Therefore, selection of the optimum consensus threshold for prediction-based weighting is not a trivial task. In comparison, for the CBBN or prediction-based selection approaches, we know that the AUC value will only change at specific consensus thresholds (e.g. 25%, 50% and 75% where there are four input networks). There are many more thresholds that can change the weighted Consensus network, due to different combinations of the weights. Although in some cases (e.g. sets 2, 4 or 5) the weighted network equals or outperforms the selection networks, using network selection provides a key advantage — consistency. Where prediction-based selection improves the AUC, it does so for a much larger interval of consensus thresholds. For example, in set 5, the prediction-based weighted Consensus network shows the highest AUC value for a specific consensus threshold, whereas the prediction-based selection Consensus network shows the same high AUC value for the 0-50% thresholds. This is also highlighted in the consensus threshold interval comparison plot (bottom) in Figure 6.5, which shows that the prediction-based selection network always achieves the maximum AUC for a larger interval than the prediction-based weighting network.

For prediction-based selection, the results indicate that there could be a relationship between the reliability of the individual networks and the optimum consensus threshold. For example in set 1 (see Figure 6.6), which comprises networks generated from clean data, the optimum consensus thresholds are between 50 and 100%. In set 2, which contains networks generated from data with a small amount of added noise, consensus thresholds between 25 and 50% provide the best AUC values for all approaches. In sets 3-6, which comprise noisier datasets, the optimum consensus thresholds are low — 25% or under. This finding could demonstrate that the consensus threshold is dependent on the coherency of the input network set.

A set of high predictive accuracies implies a set of input networks that are coherent — that is, they fit well across all the individual datasets. A set of input networks with low predictive accuracies may not be so coherent, and the low accuracies may indicate that the networks are quite different. In this case, when a low consensus threshold is chosen it means that edges appearing in only a few networks have a better chance of appearing in the final Consensus network, which can be beneficial when the networks are quite varied. Conversely, in the case where the set of input networks is highly coherent, high consensus edges are more likely to be robust and reliable. This is because the networks are more likely to be similar, so low consensus edges are more likely to be due to noise. However, the relatively small number of cases and limited scope of the synthetic network used means no firm conclusions can be drawn here. Section 6.5 proposes how this can be investigated further, using machine learning for the discovery of patterns in a larger range of network sets.
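The ‘weighting boundary’ behaviour noted above can be made precise: since an edge’s weighted support is the sum of the normalised weights of the networks containing it, the weighted Consensus network can only change at thresholds equal to a subset sum of the weights. A small sketch (with illustrative weights) enumerates these change points.

from itertools import combinations

def change_thresholds(weights):
    """All thresholds at which a weighted Consensus network can change:
    the possible subset sums of the normalised network weights."""
    total = sum(weights)
    sums = set()
    for r in range(1, len(weights) + 1):
        for combo in combinations([w / total for w in weights], r):
            sums.add(round(sum(combo), 6))
    return sorted(sums)

# Equal weights over four networks: changes only at 25%, 50%, 75% and 100%
print(change_thresholds([1, 1, 1, 1]))
# Slightly unequal weights: many more change points near those boundaries
print(change_thresholds([0.22, 0.26, 0.27, 0.25]))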

6.4.2.3 Method comparison for the network selection approach

Prediction-based selection involves improving the final Consensus network by the selection of a subset of the input networks, based on their reliability (as measured using prediction accuracy). In Section 6.3.3, two methods were proposed for selecting a subset of input networks — networks-above-x (selecting networks with prediction accuracy greater than x) and best-n-networks (selecting the best n networks, when they are ordered by their median prediction accuracy). Figure 6.7 provides boxplots that show the distribution of x and n for n = 2, 3 and 4. Further details are provided in Appendix A. For each network set, Table A.3 shows the optimum number of selected input networks (n) and the highest median prediction accuracy of the networks that have not been selected (this is the threshold x). The lowest median predictive accuracy of the selected networks is also shown. Firstly, the boxplots and table clearly show that there is no single value for x, i.e. it is not possible to choose a prediction accuracy threshold x for network selection that is appropriate to all cases, since the distributions of highest median predictive accuracies overlap for n = 2, 3 and 4. (Table A.3 shows that the highest median predictive accuracy of the unselected networks varies from 0.15 to 0.44.)


Given the variability of the dataset noise in each set, this is not surprising — a set of very noisy networks (e.g. set 18) will all have much lower predictive accuracies than a set of more reliable networks (e.g. set 1). Therefore it is not appropriate to select a single threshold x, as it could lead to the selection of no input networks. However, further inspection of Figure 6.7a reveals that there is a relationship between n and x. Where fewer networks are selected, the selected networks have a higher threshold for x, the highest median predictive accuracy of the networks not selected. The same pattern is also seen if we compare the lowest median predictive accuracy of the selected networks with n (see Figure 6.7b). Together, these patterns imply that network sets with higher predictive accuracies (i.e. higher reliability) require a lower number of input networks, whilst network sets with lower predictive accuracies (i.e. lower reliability) require more input networks. This is the type of pattern that we would expect — when we have higher-quality data, we do not need so much of it, and conversely when we have lower-quality data, we need more of it in order to build the best performing model. However, as concluded previously for the selection of the consensus threshold, the number of cases and limited scope of the network used in these experiments means no firm conclusions can be drawn here. In particular, the boxplots in Figure 6.7 show that a single set of thresholds on the network predictive accuracies cannot be derived for the selection of n, the number of input networks, since the distributions overlap. However, deriving some general rules may be possible based on a larger and broader range of datasets. Therefore, Section 6.5 proposes the use of machine learning to discover patterns in a larger range of network sets, in order to guide the choice of n.
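Both selection methods are straightforward to state over the networks’ median predictive accuracies. The sketch below is an illustration under that summary; the accuracy values used are examples only.

def networks_above_x(accuracies, x):
    """networks-above-x: indices of input networks whose median
    predictive accuracy exceeds the threshold x."""
    return [i for i, acc in enumerate(accuracies) if acc > x]

def best_n_networks(accuracies, n):
    """best-n-networks: indices of the n input networks with the
    highest median predictive accuracies."""
    ranked = sorted(range(len(accuracies)),
                    key=lambda i: accuracies[i], reverse=True)
    return ranked[:n]

accs = [0.25, 0.26, 0.31, 0.30, 0.31]    # five illustrative input networks
print(networks_above_x(accs, 0.27))      # [2, 3, 4]
print(best_n_networks(accs, 3))          # [2, 4, 3] (the same three networks)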

6.4.3 Application: yeast heat-stress network

In this section, the new Consensus approaches are applied to a set of real yeast microarray datasets that have been generated by different heat-stress microarray studies. Figure 6.8 shows the AUC for each Consensus approach, across all consensus thresholds from 0 to 1, as well as the Normalisation only network and the AUC of the best performing individual input network (which is the network generated from the Spellman dataset).



Figure 6.7: Prediction-based selection. (a) shows boxplots of the distribution of x, the highest median prediction accuracy of the networks that have not been selected, by n, the optimum number of selected networks, whilst (b) shows boxplots of the distribution of the lowest median prediction accuracy of the networks that have been selected, by n, the optimum number of selected networks



Figure 6.8: Yeast heat stress network: comparison of approaches by AUC

The prediction-based selection approach uses the best three networks in order to achieve the maximum AUC. Figure 6.8 shows that:

• The best performing input network (Spellman) achieves an AUC of 0.566

• The Normalisation only network achieves an AUC of 0.471

• The CBBN achieves a maximum AUC of 0.574 between 61-80% consensus thresholds

• The prediction-based weighting network achieves a maximum AUC of 0.623 at the 78% consensus threshold

• The prediction-based selection network achieves a maximum AUC of 0.613 between 34-67% consensus thresholds

The prediction-based weighting network achieves the highest AUC, making an improvement of almost 0.05 in AUC over the CBBN approach. However, this is only at a single consensus threshold of 78%.


Dataset       Median predictive accuracy   Prediction-based weight   Individual network AUC
Beissbarth    0.25                         0.17                      0.43
Eisen         0.26                         0.18                      0.43
Gasch         0.31                         0.22                      0.50
Grigull       0.30                         0.21                      0.49
Spellman      0.31                         0.22                      0.57

Table 6.2: Yeast heat stress datasets: prediction-based reliability measures

Figure 6.8 shows that it is around the thresholds relating to equal weights (20%, 40%, 60%, 80%) where the prediction-based weighting network attains the largest increases in AUC. This is because the prediction-based weights vary only slightly from a uniform distribution (see Table 6.2). The prediction-based selection network also achieves a significant improvement in the maximum AUC when compared to the CBBN or input networks. The best result is achieved when the best three input networks are selected (where ‘best’ is defined by the highest median prediction accuracies, giving the Gasch, Grigull and Spellman networks; see Table 6.2). This network attains a maximum AUC of 0.613 between the consensus thresholds of 34% and 67%. This shows that the removal of the two noisiest datasets can significantly improve the performance of the final Consensus network. Although the prediction-based weighting Consensus network attains a slightly higher maximum AUC of 0.623 at a 78% consensus threshold, this is at a single consensus threshold. As noted in the results for the synthetic network, prediction-based selection improves the network performance more consistently across the consensus thresholds. Additionally, as input networks are removed in the selection approach, this reduces the number of different consensus threshold intervals that need to be considered, making the selection of the ‘best’ consensus threshold potentially much easier.
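The weights in Table 6.2 are consistent with each network’s median predictive accuracy normalised to sum to one across the datasets; the measure itself is defined in Section 6.3, so the normalisation below is an assumption inferred from the table rather than the thesis definition.

def prediction_based_weights(accuracies):
    """Normalise median predictive accuracies so the weights sum to one.
    Assumed normalisation, consistent with the values in Table 6.2."""
    total = sum(accuracies)
    return [acc / total for acc in accuracies]

accs = {"Beissbarth": 0.25, "Eisen": 0.26, "Gasch": 0.31,
        "Grigull": 0.30, "Spellman": 0.31}   # from Table 6.2
for name, w in zip(accs, prediction_based_weights(list(accs.values()))):
    print(f"{name}: {w:.2f}")  # 0.17, 0.18, 0.22, 0.21, 0.22 as in the table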


6.5 Discussion

This chapter has furthered the Consensus approach for combining multiple microarray datasets, resulting in three main contributions. First, the introduction of an improved Consensus approach, Consensus Bootstrapped Bayesian Networks (CBBNs), which allows the more robust bootstrapped network models (rather than PDAG models produced by thresholding a bootstrapped network) to be used as direct inputs to the Consensus algorithm. Second, this chapter has presented an examination of how network reliability can affect the final Consensus network. A measure of reliability for sets of networks, based on the prediction of node values, has been introduced. Finally, methods for incorporating the use of network reliability measures into the Consensus algorithm have been proposed. Prediction-based network weighting allows different weights to be assigned to each network in order to vary the influence of each input network, whilst prediction-based network selection uses only a subset of the available input networks, based on their reliability, in order to produce a Consensus model. This chapter has also presented a comparison of these different Consensus approaches, using synthetic and real microarray datasets.

Experiments were performed on 18 collections of synthetic microarray expression datasets and one group of real yeast microarray datasets. The experiments on the groups of synthetic datasets have allowed the Consensus approaches to be evaluated over many different cases, where the noise levels in each dataset are varied. For example, in some cases all datasets were very noisy, whereas in other cases each dataset was of high quality, and in other cases noise levels were mixed across the datasets. Next, the conclusions of the comparison are summarised with respect to the original aims:

• In general, the new CBBNs approach outperforms the pre-learning aggregation ‘Normalisation only’ network and all individual input networks. In the synthetic data based experiments, the CBBN approach outperformed all input networks in 66% of cases, and the Normalisation only network in 89% of cases. The CBBN approach outperformed all input networks and the Normalisation only network in the real yeast data case.


• Prediction-based weighting improves the performance of the Consensus network, i.e. in general, the prediction-based weighted Consensus network outperforms the CBBNs approach. In all cases presented in the comparison, prediction-based weighting was able to equal or improve the performance of the Consensus network, in comparison to the standard CBBNs approach. A paired t-test indicated that there was a significant improvement in performance.

• Selecting a subset of input networks creates a performance improvement in the Consensus network, i.e. in general, prediction-based selection outperforms the standard CBBNs approach. In all cases presented in the comparison, prediction-based selection was able to equal or improve the performance of the Consensus network, in comparison to the standard CBBNs approach. A paired t-test indicated that there was a significant improvement in performance.

• In general, prediction-based weighting or prediction-based selection consistently produce the best performing Consensus network, but there is no significant difference in performance between weighting and selection. Despite this, prediction-based selection provides an important benefit — consistency of performance across a larger interval of consensus thresholds. Often, the weighting approach would attain a maximum AUC at a specific consensus threshold, whereas prediction-based selection could attain the same maximum AUC for a larger range of consensus thresholds. This is potentially advantageous when trying to select the optimum consensus threshold.

This establishes that the prediction-based selection approach provides improvement in terms of network performance, and in addition the performance of the final Consensus network is more consistent over consensus thresholds. However, one original aim remains unanswered with respect to the selection approach: how is the number of input networks to be chosen? The results from synthetic data showed that there is a relationship between the reliability of the available

input networks and the optimum number of these networks to be selected; essentially, fewer input networks are required with high-quality data, and conversely with lower-quality data, more input datasets are required in order to build the best performing model.

The final original objective refers to the selection of the optimal consensus threshold. The AUC performance of the final Consensus network, whichever approach is used, can vary considerably by consensus threshold. However, for prediction-based selection in particular, the experimental results indicated that there could be a relationship between the reliability of the individual input networks and the optimum consensus threshold. More reliable sets of input networks attained their maximum AUC values with higher consensus thresholds than noisier, less reliable sets of input networks. However, the experiments presented in this chapter have been carried out on a limited set of data. For example, the synthetic network sets are all based on the same, small-scale network, and noise has been added in an artificial manner. Each case contained four available input datasets/networks. In order to draw firmer conclusions, a larger set of different datasets, based on different networks, would be required.

Finally, to close this chapter, the author proposes the use of machine learning to discover patterns between network reliability and the optimum number of input networks or consensus threshold, based on a larger and broader range of network set cases. Whilst it can be straightforward to spot patterns in network quality by eye or simple analysis for a small group of cases, for a larger set of cases this is infeasible. This motivates the use of machine learning to automatically discover patterns and relationships of interest, based on a set of example cases. Further to this, machine learning techniques could be used to induce heuristics, or rules of thumb, for the selection of input networks or the consensus threshold. Exploratory research to investigate this proposal is presented in Chapter 7. Rule learning provides the most natural representation for the type of heuristics we wish to learn, since if-then rules are transparent and human-readable. Most importantly, this means that the learnt heuristics can be easily applied by the user to the problem of network and consensus threshold selection. Chapter 7 presents the rule-learning methodology and preliminary results for inferring

heuristics based on the synthetic case results presented in this section, together with further network cases.

Chapter 7

Inducing heuristics for the Consensus approach

7.1 Introduction

Chapter 6 investigated the concept of network reliability and how this can impact the Consensus network model. A key conclusion was that prediction-based network selection can be used to improve the Consensus network performance in many cases. This is where only a subset of the available input networks is selected in order to produce the best performing Consensus model. However, an important problem remains concerning how to select the optimal number of input networks. Additionally, one important parameter needs to be chosen by the user: the consensus threshold. The consensus threshold allows the user to select the level of ‘robustness’ of the final Consensus network. For example, a low threshold of 25% will allow network edges with limited support in the final model, whereas a threshold of 100% requires each edge to appear in each input network. However, how do we choose the most appropriate consensus threshold based on the input datasets/networks available, or the consensus threshold that produces the best performing model?


This chapter presents preliminary research in addressing these issues by using machine learning to induce heuristics, ‘rules of thumb’, in the form of classification rules, for selecting input networks and choosing different consensus thresholds. The remainder of this chapter is organised as follows. Section 7.2 describes the problem of constructing heuristics in more detail and motivates the use of machine learning. In Section 7.3, rule learning, the chosen machine learning method, is described. Learnt heuristics are presented in Section 7.4, followed by the application of the learnt heuristics to examples from real microarray expression datasets in Section 7.5. Finally, the chapter concludes with a summary and discussion in Section 7.6.

7.2 Motivation

Heuristics, or ‘rules of thumb’, can assist the user in selecting the input networks and consensus threshold for the Consensus approach. Such heuristics must be based on information that can be extracted from the input networks themselves. For example, the input network AUCs cannot be used, as these are based on knowledge of the true network, which is usually unavailable. Instead, the heuristics should be based on measures of input network quality. This research uses the network predictive accuracy (measured on the other input datasets) as a measure of network quality, as explained in Chapter 6 (see Section 6.3). When constructing heuristics for network and consensus threshold selection, it makes sense to examine the patterns in quality amongst the input networks — which is, in this case, measured using the network predictive accuracies. For example, consider Figure 7.1, which shows the predictive accuracies of the input networks for synthetic cases 1-6 (from Chapter 6, see Section 6.4). In these plots the y-axis represents the network’s median predictive accuracy. Note that we are only concerned with comparison on the y-axis and the x-axis is not used. The black lines indicate the predictive accuracies for input networks, and the red line shows a predictive accuracy of 0.33 (which represents a ‘random’ accuracy threshold, since the expression values are discretised into three bins). Recall that each of these sets is based on a total of four input networks, and for sets 1, 2, 3 and 6, the optimal number of input networks is 2. For sets 4 and 5,

the optimal number of input networks is 4 (this can be seen in Figure 6.6). Some patterns can be seen immediately by eye. For example, in general, the cases for which two input networks is optimal have higher predictive accuracies than those for which four input networks is optimal. Therefore, based on this information, a heuristic for selecting two input networks may be ‘if at least two networks have predictive accuracy above 0.33 then two input networks is the optimal selection’.


Figure 7.1: Patterns in input network predictive accuracies for synthetic cases 1-6. The y-axis represents the network’s median predictive accuracy. Black lines indicate the predictive accuracies for input networks. The red line shows a predictive accuracy of 0.33

In order to construct reliable heuristics, a large set of different cases is required. Whilst it can be straightforward to spot patterns in network quality by eye for a

small group of cases, for a larger set of cases this is infeasible. This motivates the use of machine learning to automatically induce the heuristics, based on a set of example cases. Concept learning (Mitchell, 1997) is a machine learning approach that can be used to induce a general description of a target class or category of objects, based on a set of example cases, which are labelled as positive or negative instances of that particular category. For example, for the problem of learning heuristics for the selection of input networks or consensus thresholds, the target classes would be the optimal number of input networks, or the optimal interval of consensus thresholds. There are many different techniques that can be applied as concept learners. The various techniques are usually differentiated by their method for representing the learnt concept description. For example, decision trees represent the learnt description as a tree-structure model. Bayesian networks can also be used as a concept learning technique, where one node in the network represents the target concept. Our chosen concept learning technique is rule learning, where the learnt description is represented as a set of if-then rules. Rule learning provides the most natural representation for the type of heuristics we wish to learn, since if-then rules are transparent and human-readable. Most importantly, this means that the learnt heuristics can be easily applied by the user to the problem of network and consensus threshold selection. The following section describes rule learning in more detail.

7.3 Rule learning

This section introduces the rule-based concept learning approach. Before describing the algorithm for inducing rules in more detail, in Section 7.3.1 we begin with a description of the algorithm inputs and outputs, relating them to an example that is based on the type of heuristic we wish to induce. Following this, in Section 7.3.2, the basic rule learning algorithm is described. Section 7.3.3 motivates the use of relational rule learning. Finally, Section 7.3.4 introduces ACE, a data mining system that implements relational rule learning, which is used to learn the rule-based heuristics in this chapter.


7.3.1 Inputs and Outputs

In order to learn a set of rules that describe a target concept class or category, the following inputs to the algorithm are required:

• A target class

• A set of training examples, labelled as positive and negative instances of the target class

• A dataset of observations, or attributes, of the training examples

Table 7.1 shows how these inputs relate to the problem of interest — learning heuristics for the selection of input networks or consensus thresholds — using a small, ‘toy’ example. In this example, the target class is ‘two-networks’ — representing cases where the optimal network selection is the two input networks with the highest predictive accuracies. First, in order to induce a set of rules that describe this class, a set of training examples is required. In Chapter 6, Section 6.4 presented a number of sets of synthetic networks and the results of applying the Consensus approach to them. These cases can be used as training examples, where each set is labelled as ‘positive’ if two networks form the optimal input network selection, and ‘negative’ otherwise. In Table 7.1 the first six sets (synthetic cases 1-6) are shown. Finally, as well as the positive/negative label, further attributes or observations are required for each training example. These attributes will be used to form the body of the if-then rule. For this problem, the attributes are the measures of quality for each input network — i.e. the predictive accuracy of each input network. Note that the input networks are ordered by their predictive accuracies. The output of the algorithm is a set of if-then rules that satisfy the positively-labelled training examples, but do not satisfy the negatively-labelled examples. An if-then rule takes the format IF body THEN head, where the body is a list of attribute conditions to be satisfied, and the head is the target concept. For the problem described in Table 7.1, an example rule is shown in Figure 7.2. This rule satisfies all the positive examples and none of the negative examples. It says that if the predictive accuracy of Network 1 exceeds 0.34, then the optimum number of input networks is 2. Since the network accuracies are ordered by size, this is equivalent to saying that if at least one network has predictive accuracy greater than 0.34, then the optimum number of input networks is 2.


Target concept: two-networks

Training example   Class label   Attributes — predictive accuracies
                                 Network 1   Network 2   Network 3   Network 4
Set 1              +             0.50        0.46        0.44        0.40
Set 2              +             0.51        0.48        0.39        0.39
Set 3              +             0.40        0.36        0.35        0.33
Set 4              -             0.33        0.23        0.22        0.16
Set 5              -             0.30        0.18        0.18        0.18
Set 6              +             0.36        0.33        0.19        0.18

Table 7.1: Learning heuristics for network selection: inputs to the rule learning algorithm

IF Network 1 > 0.34 THEN two-networks

Figure 7.2: Learning heuristics for network selection: example output rule

7.3.2 Algorithm basics

Classic rule learning algorithms such as CN2 and FOIL (Clark & Niblett, 1989; Quinlan & Cameron-Jones, 1993) use two key concepts as the basis of their algorithms — the sequential covering loop and the general-to-specific search. The idea of sequential covering is to learn a set of rules, rule-by-rule. At each step, a rule is learnt that satisfies, or covers, as many positive training examples as possible, and no negative examples. After each rule is learnt, the positive training examples that are satisfied by the rule are removed from the set of examples. Then, if positive training examples remain in the set, the learning process continues and a new rule is learnt that again covers as many positive training examples as possible, and no negative examples. This iterative process continues until all

positive training examples have been satisfied — in other words, no more positive training examples remain in the set. Sequential covering is a method for iteratively learning a set of rules, but how exactly is each rule in the set constructed? The most popular approach for learning an individual rule is the general-to-specific search. In this approach, a search through the space of possible if-then rules is performed. Each rule encountered during the search is scored based on how many positive and negative training examples it covers. The search space is structured using the general-to-specific ordering. Figure 7.3 shows an example search space for the learning heuristics application. At the beginning of the search, very general rules that cover many examples (both positive and negative) are considered. Usually, the search begins with the most general rule possible, which is the rule with the empty body. This rule classifies every example as the target concept, so it covers all positive examples, but also all negative examples. As the search progresses, the rule is specialised by adding attribute conditions to the body. The addition of attribute conditions to the rule body should reduce the number of examples that it covers. The idea is to reduce the negative examples covered by the rule, whilst maintaining a high number of covered positive examples. The search will stop when a pre-defined stopping condition is met. This may be when no negative examples are covered by the rule, or when the rule achieves a particular score that is based on the number of positive and negative training examples it covers.
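The two ideas just described, the sequential covering loop and the greedy general-to-specific search, can be sketched compactly. The following is a minimal illustration in the style of CN2/FOIL, not the ACE system used later in this chapter; examples are dictionaries of attribute values and conditions are predicates over one example.

def learn_one_rule(positives, negatives, conditions):
    """General-to-specific search for a single rule: start from the empty
    body and greedily add the condition that keeps the most positives
    while excluding the most negatives, until no negatives are covered."""
    body, pos, neg = [], list(positives), list(negatives)
    available = list(conditions)
    while neg and pos and available:
        best = max(available, key=lambda c: (sum(c(e) for e in pos),
                                             -sum(c(e) for e in neg)))
        available.remove(best)
        body.append(best)
        pos = [e for e in pos if best(e)]
        neg = [e for e in neg if best(e)]
    return body, pos, neg

def sequential_covering(positives, negatives, conditions):
    """Learn rules one at a time, removing the positives each new rule
    covers, until none remain or no consistent rule can be found."""
    rules, remaining = [], list(positives)
    while remaining:
        body, covered, neg = learn_one_rule(remaining, negatives, conditions)
        if neg or not covered:
            break  # no rule covers the remaining positives cleanly
        rules.append(body)
        remaining = [e for e in remaining if not all(c(e) for c in body)]
    return rules

# Toy usage mirroring Table 7.1: thresholds on the best network's accuracy
conds = [lambda e, t=t: e["net1"] > t for t in (0.30, 0.34, 0.40)]
pos = [{"net1": 0.50}, {"net1": 0.40}, {"net1": 0.36}]
neg = [{"net1": 0.33}, {"net1": 0.30}]
rules = sequential_covering(pos, neg, conds)
print(len(rules), len(rules[0]))  # 1 rule with 1 condition (net1 > 0.34)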

7.3.3 Relational rule learning

In this research, we use a particular rule learning approach, known as relational rule learning, or Inductive Logic Programming (ILP) (Dzeroski & Lavrac, 2001; Lavrac & Dzeroski, 1994). ILP implements the same rule-learning algorithmic basis as described in Section 7.3.2, but uses first-order predicate logic (FOPL) to represent if-then rules. This means the rules are able to be more expressive than standard if-then rules, as they can contain variables and also make use of more complex data structures such as lists. The latter is particularly useful for the learning heuristics application, as it means the network predictive accuracies can be stored in an unordered list. This is beneficial as it means it is easier to deal with different numbers of input networks in each example.


IF {} THEN two-networks

IF {Network1 < Y1} THEN two-networks
IF {Network2 < Y2} THEN two-networks
IF {Network3 < Y3} THEN two-networks
...

IF {Network1 < Y1, Network2 < Y2} THEN two-networks
IF {Network1 < Y1, Network3 < Y3} THEN two-networks
IF {Network1 < Y1, Network4 < Y4} THEN two-networks
...

Figure 7.3: General-to-specific search for rules

ILP also makes use of ‘background knowledge’ as input, in addition to the training examples data, in order to construct rules. This comprises the features, or attribute conditions, that are used in the body of the learnt if-then rules, and since they are represented using FOPL they can be complex yet represented in a simple way. In order to show how FOPL can represent the training examples, background knowledge (attributes) and output rules, we extend the ‘toy’ example for the learning heuristics application, which was used earlier in this section. Table 7.2 shows a set of training examples, with associated example data and background knowledge. The example data — the predictive accuracies of each network — are represented in list format, where networks are unordered — allowing the representation of cases with different numbers of input networks. For example, Set 23 has three input networks, and Set 34 has six input networks. At the bottom of the table, three background knowledge features are listed. The first is n_networks_above(N, PredAccBound). This returns ‘true’ for a training example when N and PredAccBound are substituted for a number of networks and a predictive accuracy respectively, and there are N input networks with predictive accuracy above or equal to PredAccBound. For example, n_networks_above(4, 0.4), where N is 4 and PredAccBound is 0.4, satisfies the training examples Sets 1, 2 and 34. Similarly, n_networks_below(N, PredAccBound) returns ‘true’ for a training example when there are N input networks with predictive accuracy below or equal to PredAccBound. Finally, the background knowledge feature predacc_cluster(PredAccBound1, PredAccBound2) returns ‘true’ for a training example when there is a group of network predictive accuracies that are very close to each other (within 0.02), bounded by PredAccBound1 below and PredAccBound2 above. Note that it would be possible to represent these features in a standard propositional (non-relational) rule learning algorithm. However, FOPL provides an easier and more natural representation. As described earlier, an if-then rule takes the format IF body THEN head, where the body is a list of conditions to be satisfied, and the head is the target concept. In FOPL, this is represented as follows:

head :- body


Target concept: two-networks

Training example   Class label   Predictive accuracies
Set 1              +             [0.50, 0.46, 0.44, 0.40]
Set 2              +             [0.51, 0.46, 0.44, 0.40]
Set 3              +             [0.40, 0.36, 0.35, 0.33]
Set 4              -             [0.33, 0.23, 0.22, 0.16]
Set 5              -             [0.30, 0.18, 0.18, 0.18]
Set 6              +             [0.36, 0.33, 0.19, 0.18]
Set 19             +             [0.30, 0.32, 0.30, 0.28]
Set 23             -             [0.27, 0.32, 0.33]
Set 34             -             [0.52, 0.53, 0.47, 0.49, 0.50, 0.50]

Background knowledge
n_networks_above(N, PredAccBound)
n_networks_below(N, PredAccBound)
predacc_cluster(PredAccBound1, PredAccBound2)

Table 7.2: Learning heuristics for network selection: FOPL inputs to the rule learning algorithm


An example rule is shown in Figure 7.4. This rule says that two input networks is optimal if there are at least 4 networks with predictive accuracy above 0.26, and at least 4 networks with predictive accuracy below 0.32, and there is a cluster of accuracies bounded below at 0.28. This rule satisfies one positive training example — Set 19 — and no negative examples.

two-networks :- n_networks_above(4, 0.26), n_networks_below(4, 0.32), predacc_cluster(0.28, OpenBound).

Figure 7.4: Learning heuristics for network selection: example output rule, represented using FOPL
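For intuition, the background-knowledge features can be sketched as ordinary predicates over a list of accuracies. The cluster predicate below is one plausible reading of the description in the text (a group of at least two accuracies within 0.02, bounded by the two arguments), not the exact ACE definition; the function names mirror Table 7.2.

def n_networks_above(accs, n, bound):
    """True if at least n input networks have predictive accuracy >= bound."""
    return sum(a >= bound for a in accs) >= n

def n_networks_below(accs, n, bound):
    """True if at least n input networks have predictive accuracy <= bound."""
    return sum(a <= bound for a in accs) >= n

def predacc_cluster(accs, lower, upper, width=0.02):
    """True if a group of two or more accuracies lies within `width` of
    each other, bounded below by `lower` and above by `upper`."""
    group = sorted(a for a in accs if lower <= a <= upper)
    return len(group) >= 2 and group[-1] - group[0] <= width

# Checking the rule of Figure 7.4 against Set 19 from Table 7.2
set19 = [0.30, 0.32, 0.30, 0.28]
print(n_networks_above(set19, 4, 0.26),    # True
      n_networks_below(set19, 4, 0.32),    # True
      predacc_cluster(set19, 0.28, 0.30))  # True: the group {0.28, 0.30, 0.30}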

7.3.4 ACE

The particular implementation of ILP that is used in this research is the ICL (inductive concept learning) component of the ACE data mining system (Blockeel et al., 2002; De Raedt et al., 2001). ACE uses the sequential covering loop and general-to-specific search to induce rules. It also has a number of other advantages. It uses a beam search when constructing individual rules. This means it considers multiple paths simultaneously in the search space, and therefore it is less likely to learn sub-optimal rules. It has mechanisms for dealing with noise, in particular by allowing induced rules to cover a number of negative training examples. It also has a number of different scoring functions that can be used to assess the accuracy of rules. Some scoring functions favour more general rules that cover a large number of positive examples, whilst still covering some negative examples, whilst other scoring functions favour specificity, by requiring that the number of covered negative examples is very small (which can be at the expense of the number of positive examples covered). ACE represents the training examples, attributes or background knowledge, and output rules using FOPL. The application of ACE to learn heuristics for the selection of input networks and consensus thresholds is presented next in Section 7.4.


7.4 Experimental results

This section reports on the rule learning carried out to produce heuristics for (a) the optimal number of input networks for the prediction-based selection approach, and (b) the optimal consensus threshold. First, in Section 7.4.1 the algorithm inputs — the target classes/concepts, training examples and background knowledge — are described. The parameter settings in ACE are also detailed. Following this, Section 7.4.2 details the learnt rules for 2 network inputs and the associated consensus thresholds. (Further results are provided in Appendix A.) In Section 7.4.3 the results are discussed in more detail.

7.4.1 Rule learning inputs

For the set of training examples, the results from Chapter 6 (Section 6.4) are used. This provides 18 training examples, each based on a synthetic true network and a set of synthetic microarray datasets, as described in Section 6.4.1. Additionally, further training examples were generated using SynTReN (Van den Bulcke et al., 2006), an application for generating synthetic regulatory networks and microarray expression datasets. This enables learning to be carried out on a wider range of different collections of microarray datasets, and using synthetic networks means the results can still be evaluated against a ‘gold standard’ true network. Given a global true network, SynTReN is able to generate synthetic microarray datasets based on the entire network, or a subnetwork. Subnetworks are generated using ‘cluster-addition’ or ‘neighbour-addition’ methods. The SynTReN application can be downloaded with sample global networks for yeast and E. coli, which have been built from databases of documented interactions. This means that extracted subnetworks are more likely to present true patterns of gene regulation, rather than an arbitrary network structure. In SynTReN, gene interactions are modelled using Michaelis-Menten and Hill kinetics. The user can also define the amount of biological and experimental noise that is present in the data. A number of different cases (36) were generated using SynTReN. Each case was based on a different randomly-generated subnetwork from the yeast or E. coli global networks and contained between three and six microarray datasets.

Each subnetwork was limited to a maximum of 30 genes for efficiency reasons. Each dataset had a varying number of samples — from 10 to 100 — and random values set (between 1% and 50%) for biological and experimental noise. This means that for each training example, the maximum possible number of input networks is between 3 and 6. Combining the original synthetic network examples with those generated using SynTReN provides a total of 54 training examples. Each example, with its data attributes (i.e. the predictive accuracy of each input network), is listed in Table A.4 in Appendix A.

The target classes for rule learning relate to the original objectives: to infer heuristics for the optimal number of input networks, and the consensus threshold. First, we focus on the number of input networks. A class label is assigned to each training example, based on the optimal number of input networks for that case. If a prediction-weighted or individual input network has a better performance, these are ignored for these experiments, as the purpose is to find the optimal number of input networks for the prediction-based network selection Consensus approach. The assigned labels can be seen in Table A.4. Where the performance of the Consensus network on a different number of inputs is almost equal (where the AUCs are closer than 0.01), multiple labels may be assigned. For example, in Set 2, 2 or 4 network inputs are equally successful, so both labels are assigned. In total, out of all training examples 16 are labelled as two-networks, 17 are labelled as three-networks, 17 are labelled as four-networks, 8 are labelled as five-networks and 4 are labelled as six-networks. Rule learning is performed for each possible class label, where for each class the positive training examples are those examples assigned with that label, and the negative training examples are all the examples without that label.

The second objective of learning heuristics relates to finding the optimal consensus threshold. The consensus threshold intervals are dependent on the total number of input networks. For example, if there are four input networks, there are four possible consensus thresholds: 1-25%, 26-50%, 51-75% and 76-100%. Therefore, a set of heuristics is learnt based on each network selection class label: two-networks, three-networks, etc. The classes for the optimal consensus threshold were divided as follows:


2 datasets: low: 1-50%, high: 51-100%
3 datasets: low: 1-33%, mid: 34-66%, high: 67-100%
4 datasets: thres1: 1-25%, thres2: 26-50%, thres3: 51-75%, thres4: 76-100%

Note that due to the low number of training examples labelled as five-networks or six-networks, learning the heuristics for the optimal consensus threshold based on five or six network inputs was not carried out. Each training example was assigned a class label for the optimal consensus threshold (see Table A.4). When learning the consensus threshold, for each class label combination the positive training examples are those examples assigned with both class labels, and the negative training examples are all those examples labelled with the network selection class label, but without the consensus threshold class label. For example, when learning a heuristic for the low consensus threshold for two-networks examples, positive training examples are those labelled low and two-networks, and negative training examples are those labelled high and two-networks. Note that when learning for the consensus threshold, the data associated with each example includes only the predictive accuracies for the input networks used, and not those for all available networks. For example, if there were 4 possible input networks, but learning was carried out for the consensus threshold on 2 input networks, only the predictive accuracies for those 2 networks can be used in a learnt rule for the consensus threshold.

As well as class-labelled training examples, it is also necessary to provide background knowledge to the learning algorithm. The background knowledge is used as features for the body of the rules. Table 7.3 provides a list of the background knowledge used for learning, together with a natural language description for each feature. The first few features are fairly simple, relating to the total number of possible input networks, or the number of input networks with predictive accuracy above or below a certain bound. However, predacc_cluster(PredAccBound1, PredAccBound2) and predacc_gap(LowerBound1, LowerBound2, UpperBound1, UpperBound2) are slightly more complex. predacc_cluster satisfies training examples when there is a group of network predictive accuracies that are very close to each other (within 0.02), bounded by PredAccBound1 below and PredAccBound2 above.

predacc_gap satisfies training examples where there is a large gap (greater than 0.07) between ordered predictive accuracies. For example, this would be satisfied by sets 2, 4, 5 and 6 in Figure 7.1. The accuracies at the lower end of the ‘gap’ are bounded between LowerBound1 and LowerBound2, and accuracies at the upper end of the ‘gap’ are bounded between UpperBound1 and UpperBound2. The ACE data mining system (see Section 7.3.4) is used to learn rules based on the labelled training examples and background knowledge. ACE has a number of user-defined parameters. In the learning experiments carried out, the default parameters were used, except for the following. The minimum positive coverage was increased to 2, in order to avoid very specific rules that satisfy only one positive training example each. There are two possible rule scoring mechanisms that can be used to direct the search. These are the weighted relative accuracy (WRA), which favours more specific rules that satisfy few negative examples, and the standard m-estimate, which favours more general rules that cover larger numbers of positive examples, but this also risks covering more negative examples. Although we favour more general rules, we did use WRA where the m-estimate produced rules that covered a large number of negative examples (> 10).

7.4.2 Rule learning outputs

Next, the heuristics learnt are presented and commented on. In this section the rules for 2 network inputs are detailed. Learnt rules for 3-6 network inputs are covered in Appendix A. The following rules were learnt for two-networks, using the weighted relative accuracy scoring mechanism in ACE:

two-networks :- total_networks_below(4), n_networks_above(2,0.33), n_networks_below(3,0.49).
Covers 13/16 positive examples and 5/39 negative examples.


Feature : Description

total_networks_above(N) : The total number of possible input networks is greater than or equal to N.
total_networks_below(N) : The total number of possible input networks is less than or equal to N.
total_networks(N) : The total number of possible input networks is exactly N.
n_networks_above(N, PredAccBound) : There are at least N input networks with predictive accuracy greater than or equal to PredAccBound.
n_networks_below(N, PredAccBound) : There are at least N input networks with predictive accuracy less than or equal to PredAccBound.
predacc_cluster(PredAccBound1, PredAccBound2) : There is a group of network predictive accuracies that are within 0.02 of each other, bounded by PredAccBound1 below and PredAccBound2 above.
predacc_gap(LowerBound1, LowerBound2, UpperBound1, UpperBound2) : There is a gap of over 0.07 in the ordered network predictive accuracies, such that accuracies at the lower end of the gap are bounded between LowerBound1 and LowerBound2 and accuracies at the upper end are bounded between UpperBound1 and UpperBound2.

Table 7.3: Learning heuristics: background knowledge features


This rule means that:
IF the total number of possible input networks is 4 or below
AND there are at least 2 networks with predictive accuracy above or equal to 0.33
AND there are at least 3 networks with predictive accuracy below or equal to 0.49
THEN two networks is the optimal input.

two-networks :- total_networks_below(5), n_networks_above(4,0.26), n_networks_below(4,0.32), predacc_cluster(0.28,OpenBound).
Covers 2/3 positive examples and 0/39 negative examples.
This rule means that:
IF the total number of possible input networks is 5 or below
AND there are at least 4 networks with predictive accuracy above or equal to 0.26
AND there are at least 4 networks with predictive accuracy below or equal to 0.32
AND there is a cluster of accuracies, bounded below at 0.28
THEN two networks is the optimal input.
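Read procedurally, each learnt rule is simply a conjunction of counting checks over the vector of input network predictive accuracies. A minimal Python rendering of the first rule above (the helper names are invented; this is not the ACE representation):

def n_networks_above(accs, n, bound):
    return sum(a >= bound for a in accs) >= n

def n_networks_below(accs, n, bound):
    return sum(a <= bound for a in accs) >= n

def two_networks_rule_1(accs):
    # Fires when two input networks are predicted to be the optimal selection.
    return (len(accs) <= 4                       # total_networks_below(4)
            and n_networks_above(accs, 2, 0.33)
            and n_networks_below(accs, 3, 0.49))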

The first rule is fairly general, covering almost all the positive examples, although it also covers some negative examples. This rule only satisfies examples with 4 or fewer possible input networks, which is the majority of training examples. Although the second rule does not cover any negative examples, it is very specific, as it covers only 2 positive examples.

Next, rules were learnt to establish which consensus threshold to apply, low or high, in order to obtain the optimal network performance:

low :- n_networks_below(1,0.43).
Covers 11/12 positive examples and 1/4 negative examples.


This rule means that:
IF one network (of the two with the highest predictive accuracies) has predictive accuracy below or equal to 0.43
THEN the optimal consensus threshold is low (1-50%).

high :- n_networks_above(2,0.46).
Covers 3/4 positive examples and 1/12 negative examples.
This rule means that:
IF both networks have predictive accuracy above or equal to 0.46
THEN the optimal consensus threshold is high (51-100%).

In general, these rules specify that cases with higher predictive accuracies require a high consensus threshold, and cases with lower predictive accuracies require a low consensus threshold. The boundary between 'lower' and 'higher' predictive accuracies is found to lie between 0.43 and 0.46.

7.4.3 Discussion

This section has presented the learning of heuristics, in the form of if-then rules, for the selection of input networks and the consensus threshold for the prediction-based selection Consensus approach. The training examples were based on the synthetic cases generated for the experiments performed in Chapter 6, plus new networks and datasets generated using SynTReN. Each example was assigned class labels based on the best-performing prediction-based selection Consensus network and consensus threshold. The background knowledge was based on simple features, such as the total number of possible input networks and the number of input networks with predictive accuracy above/below a certain bound. More complex background knowledge features representing 'clusters' and 'gaps' in the input network predictive accuracies were also used.

First, rules were learnt for network selection, that is, to find the optimal number of input networks. The target classes were two-networks, three-networks, four-networks, five-networks and six-networks. Note that when selecting two networks, for example, we assume that these two networks are those with the highest predictive accuracies.

The heuristics learnt are mainly based on the absolute values of the predictive accuracies. For example, a common rule structure is based on the requirement that a certain number of networks must have predictive accuracies between given bounds. Another feature of many rules is that they specify the total number of possible input networks in the example. This implies a dependency of the optimal number of input networks on the total number of available input networks, which might be expected. However, it also means that the learnt heuristics are slightly biased towards the examples with four possible input networks, as these make up the majority of the training examples. There were also not enough examples with a total number of input networks greater than 4 to construct reliable heuristics for five-networks or six-networks.

Second, rules were learnt for the selection of the optimal consensus threshold. Since the number of possible consensus threshold intervals is dependent on the number of input networks, a different set of rules was learnt for each number of input networks. A pattern is evident across the heuristics learnt for consensus threshold selection: cases where there are more networks with higher predictive accuracies require a higher consensus threshold, and cases with more networks that have lower predictive accuracies require a lower consensus threshold. The criteria for 'high' and 'low' predictive accuracies depend upon the number of consensus threshold intervals, and are stated in the learnt rules. This finding demonstrates that the consensus threshold is dependent on the coherency of the input network set. High predictive accuracies imply that a set of input networks is coherent, that is, the networks fit across all the individual datasets from which they were generated. A set of input networks with low predictive accuracies may not be so coherent, and the low accuracies may indicate that the networks are quite different. In this case, when a low consensus threshold is chosen, edges appearing in only a few networks have a better chance of appearing in the final Consensus network, which can be beneficial when the networks are quite varied. Conversely, in the case where the set of input networks is highly coherent, high consensus edges are more likely to be robust and reliable. This is because the networks are more likely to be similar, so low consensus edges are more likely to be due to noise.


Test case           Predictive accuracies          Heuristics outcome       Actual outcome
                                                   (input / threshold)      (input / threshold)
Yeast heat-stress   [0.25,0.26,0.31,0.30,0.32]     2 / low                  3 / mid
Yeast cell-cycle    [0.35,0.36,0.34,0.32]          4 / thres2               4 / thres2
E. coli SOS         [0.43,0.56,0.41,0.61]          - / high                 2 / high

Table 7.4: Learning heuristics: test case outcomes

7.5 Testing the heuristics on real data examples

The learnt heuristics presented in Section 7.4 are based on a set of synthetic training examples. This section discusses the application of the learnt rules to real data examples. In machine learning, testing learnt concept descriptions on unseen, real examples is a key element of the learning process. In particular, it evaluates whether the learnt model overfits the training examples. Three real data examples are used to evaluate the heuristics. The three test cases are described and commented on in detail in Appendix A; this section provides a summary and discussion of the test case results.

A summary of the findings is shown in Table 7.4. The input network selection is correctly predicted in 1/3 cases and the consensus threshold is correctly predicted in 2/3 cases. This is a small set of test cases, but it does indicate that heuristics can assist in selecting input networks and consensus thresholds. In particular, the last two test cases confirm the general pattern of higher network predictive accuracies implying higher consensus thresholds. However, the real data evaluation has also highlighted two important issues:

• Limited coverage of the synthetic training examples. For example, in the E. coli test case, no rules for network selection fit the example. In order to generate reliable and general heuristics, a considerable number of training examples are required, covering a wide range of networks and dataset noise levels.


In these experiments, due to running time constraints, only 54 training examples were available, which proved not to be enough to cover one of the test cases in this evaluation.

• 'Overlapping' heuristics. In the yeast cell-cycle test case, a number of heuristics for network selection were applicable. In this situation, there may need to be a process for resolving exactly how many networks to use; for example, choosing the smallest indicated number of input networks for simplicity.

7.6 Summary

This chapter has presented the use of machine learning to induce rule-based heuristics, which can assist the user in selecting the optimal number of networks and the consensus threshold for input to the Consensus algorithm. Using established relational rule learning techniques, such heuristics were successfully induced from synthetic training examples.

The learnt rules revealed interesting patterns in the input network predictive accuracies (a measure of network reliability or coherence) that could be used in deciding the number of input networks or the consensus threshold. For example, a common rule structure was based on the requirement that a certain number of networks must have predictive accuracies between given bounds. A particular pattern of this type was evident across the heuristics learnt for consensus threshold selection: cases where there are more networks with higher predictive accuracies require a higher consensus threshold, and cases with more networks that have lower predictive accuracies require a lower consensus threshold. This finding implies an interesting consequence: the consensus threshold is dependent on the coherence, or similarity, within the set of input networks. A feature of many rules for input network selection is the specification of the total number of available input networks. Whilst some dependency of the optimal number of input networks on the total number of available input networks might be expected, it may also indicate some overfitting of the training examples.


The research presented in this chapter has been only the initial step in generating usable heuristics for selecting parameters for the Consensus approach. Whilst the test data evaluation showed that the learnt heuristics can be easily and usefully applied to real data, they were only able to make fully successful predictions in the yeast cell-cycle network case. The evaluation established that the set of training examples was not broad enough to cover some of the test cases. For example, in these experiments the number of available input networks only varied between three and six. In order to construct a set of heuristics that can be applied to almost any situation, the learning process would need many more cases, varying across the base network, the number of available datasets and the microarray data noise levels. This would allow global patterns to be uncovered.

Chapter 8

Conclusions

This chapter draws together the conclusions reached based on the research presented in this thesis. First, the main contributions are summarised. This is followed by a discussion of the limitations of the research presented. Finally, potential avenues for further research are presented, based on both addressing the research limitations and extending the applicability of the work.

8.1 Thesis contributions

Gene Regulatory Networks (GRNs) describe how the expression level, or activity, of genes affects the expression of other genes. Microarray technology allows the expression of thousands of genes to be measured simultaneously and is the major source of data for reverse-engineering GRN models based on gene expression levels. Microarrays are widely used, which has led to many publicly available datasets of gene expression measurements and subsequently an explosion of research in the reverse-engineering of GRN models based on microarray-generated data. However, the technology has a number of limitations as a data source for the modelling of GRNs, due to concerns over reliability and the reproducibility of experimental results. The research presented in this thesis has focused on the use of additional data sources - either complementary prior knowledge or multiple microarray studies - to alleviate these limitations in using a single microarray study for the reverse-engineering of GRNs. The following subsections summarise the contributions made with respect to reverse-engineering Bayesian network based models of GRNs.


8.1.1 Incorporation of prior knowledge

The drawbacks of using only microarray data to reconstruct GRNs can be alleviated by incorporating other complementary data sources as prior knowledge in the modelling process. Many other data sources contribute to the available knowledge on GRNs: experimentally derived data, such as transcription factor binding site locations or protein-protein interaction data, as well as text-based knowledge, which includes information locked in scientific papers.

This thesis has presented some of the first research in the incorporation of prior knowledge generated from a large body of scientific papers into Bayesian network based GRN models. The use of advanced text-mining techniques means information contained in a huge number of documents (for example, from a database of papers such as Medline) can be represented in a simple gene-pair association matrix format. Chapter 4 presented a method for integrating this information into the learning of Bayesian network based GRN models by translating it into a prior probability distribution over candidate network structures.

An empirical evaluation, using data and networks for three different organisms, showed that the use of literature-based prior knowledge can improve both the number of true regulatory interactions present and the predictive performance of the learnt model, in comparison to a network learnt solely from expression data. In particular, varying the influence of the prior knowledge on the learning process was considered through the use of weighting. The experimental results indicated that careful weighting of the prior knowledge does appear to be needed. A low weighting on the prior knowledge can mean many spurious gene interactions are included in networks. Conversely, a high weighting may result in the inclusion of network edges that do not reflect regulatory relationships, as there is less emphasis on the expression data. Furthermore, the most suitable weighting value may be related to the amount of reliable prior knowledge available. Where there is less literature, for example with the human organism, the best results were obtained when less weight was assigned to the prior knowledge, in comparison to the yeast and E. coli organisms, which required higher prior weights for the best results. These are well-studied organisms for which there is a large amount of literature.

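As an illustration of the kind of weighting being discussed, the following hedged sketch shows one common way a weighted literature prior can enter a network structure score; the exact formulation used in Chapter 4 is not reproduced here, and the function and variable names are invented:

def structure_log_score(data_log_likelihood, edges, prior_scores, w):
    # prior_scores maps a candidate edge (regulator, target) to a literature
    # association score; w controls the influence of the prior knowledge.
    prior_term = sum(prior_scores.get(edge, 0.0) for edge in edges)
    return data_log_likelihood + w * prior_term

# A larger w pulls the structure search towards literature-supported edges;
# w = 0 recovers learning from expression data alone.
score = structure_log_score(-1234.5, [("lexA", "recA")],
                            {("lexA", "recA"): 0.8}, w=2.0)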

8.1.2 Combining multiple microarray datasets

The rapid increase in publicly available microarray data provides the opportunity to produce GRN models based on multiple microarray datasets. Such models are potentially more robust, command greater confidence, and place less reliance on a single dataset. Chapter 5 introduced the concepts of pre- and post-learning aggregation. In pre-learning approaches, such as using simple scale normalisation prior to the concatenation of datasets, a model is learnt from a combined dataset, whilst in post-learning aggregation individual models are learnt from each dataset and the models are then combined. The resulting combined model represents prominent features which occur in all, or a subset of, the individual dataset models. A key advantage of the post-learning aggregation framework is that it can combine microarray datasets generated by different platforms, research groups and laboratories without requiring normalisation. This thesis has presented two novel post-learning aggregation methods for Bayesian network based GRN modelling: Bayesian networks meta-analysis is based on combining statistical confidences that are attached to network edges, whilst Consensus Bayesian networks identifies consistent network features across all datasets.

Chapter 5 compared pre- and post-learning aggregation through an empirical evaluation on synthetic and real collections of microarray datasets. This demonstrated that both post-learning aggregation methods can produce GRN models that improve on models learnt from a single dataset or a combined dataset (i.e. a pre-learning aggregation approach). Overall, Consensus Bayesian networks was the better-performing aggregation approach, as it identifies consistencies amongst the collection of input networks and so is least affected by those networks that perform poorly. However, there is room for improvement in the method; in particular, it is a parameter-heavy method when used in conjunction with bootstrapped input networks. An additional finding was that in some cases, using only a subset of the available datasets produced a better-performing Consensus model than using all available datasets.

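The contrast between the two post-learning methods can be sketched in a few lines; the edge names, confidences and combination steps below are toy simplifications, not the algorithms as implemented:

# Bootstrap edge confidences from three per-dataset Bayesian networks.
edge_conf = {
    ("lexA", "recA"): [0.9, 0.8, 0.7],   # consistently well supported
    ("lexA", "umuC"): [0.6, 0.1, 0.2],   # supported in only one dataset
}

# Meta-analysis flavour: pool the per-dataset confidences for each edge.
pooled = {edge: sum(c) / len(c) for edge, c in edge_conf.items()}

# Consensus flavour: keep edges present (confidence >= 0.5) in at least a
# consensus fraction of the input networks (here, 2 out of 3).
consensus_edges = {edge for edge, c in edge_conf.items()
                   if sum(conf >= 0.5 for conf in c) >= 2}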

8.1.3 Incorporating dataset selection or weighting based on reliability

In order to further investigate the effect of input dataset quality when combining multiple microarray datasets, Chapter 6 presented a further development of the Consensus Bayesian networks approach. An improved Consensus approach was presented - Consensus Bootstrapped Bayesian Networks (CBBNs) - which allows the more robust bootstrapped network models to be used as direct inputs to the Consensus algorithm, reducing the number of parameters. Additional improvements to the technique also allow the incorporation of weighting of the input networks based on their reliability or quality. A measure of reliability for sets of networks, based on the prediction of node values, was introduced, together with methods for incorporating network reliability measures into the Consensus algorithm. Prediction-based network weighting allows different weights to be assigned to each network in order to vary the influence of each input network, whilst prediction-based network selection uses only a subset of the available input networks, chosen by their reliability, to produce a Consensus model.

An empirical evaluation compared prediction-based network selection (i.e. using individual input networks generated by only a subset of available datasets) with prediction-based weighting. This demonstrated that network selection provides a more consistent improvement in GRN model performance than network weighting.

However, an important problem remained concerning how to select the optimal number of input networks and the consensus threshold. Whilst relationships between the reliability of input networks and the optimum consensus threshold were indicated in the analysis of the experimental results, the small scale of the evaluation meant no firm conclusions could be established. As a first step in addressing this issue, Chapter 7 presents preliminary research exploring the use of machine learning to induce heuristics, 'rules of thumb', in the form of classification rules, for selecting the input networks and the consensus threshold. This alleviates the issue of parameter choice for the user.

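The difference between the two options can be expressed in a few lines (toy accuracy values; the normalised-weight scheme shown is an assumption, not the thesis's exact weighting):

# Predictive accuracies of four candidate input networks.
accs = {"net1": 0.44, "net2": 0.41, "net3": 0.30, "net4": 0.22}

# Prediction-based weighting: every network contributes, scaled by accuracy.
total = sum(accs.values())
weights = {name: acc / total for name, acc in accs.items()}

# Prediction-based selection: only the k most reliable networks are used.
k = 2
selected = sorted(accs, key=accs.get, reverse=True)[:k]   # ['net1', 'net2']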

8.2 Limitations

Empirical evaluation has shown that incorporating literature-based prior knowledge, or combining multiple microarray datasets using Consensus Bayesian networks, can provide an improvement over the use of a single microarray dataset for the reverse-engineering of Bayesian network based GRN models. However, there are a number of limitations to the research presented in this thesis, which are discussed in this section.

First, the techniques presented have only been designed for and tested with static Bayesian networks. In static Bayesian networks, temporal interactions are not considered. As discussed in Section 3.2.3, modelling temporal interactions in gene regulation is important, as it allows cyclic behaviour to be represented. Additionally, time-delayed gene interactions cannot be represented in static Bayesian networks. However, developing the techniques for static models has provided a solid foundation from which to extend the techniques for dynamic behaviour. This is discussed in Section 8.3 as possible future work.

A second point is that all empirical evaluations in this thesis have used discretised microarray gene expression datasets. Discretisation is commonly applied to gene expression data when reverse-engineering GRN models and is of particular benefit with real datasets that have a small number of samples and/or can be noisy. It is a simple method that still allows complex regulatory structures to be modelled whilst avoiding the need to deal with complex parameterised continuous distributions. However, it means that conclusions cannot be drawn about the broader use of the techniques when applied to continuous gene expression data.
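For concreteness, the kind of discretisation referred to above can be as simple as per-gene quantile binning into three states; this generic sketch is an assumption, as the thesis's exact discretisation scheme is not restated here:

import numpy as np

# Toy expression matrix: rows are genes, columns are samples.
expr = np.random.default_rng(0).normal(size=(10, 50))

# Per-gene tercile binning into three discrete states: 0 (low), 1 (mid), 2 (high).
low, high = np.quantile(expr, [1 / 3, 2 / 3], axis=1, keepdims=True)
states = (expr > low).astype(int) + (expr > high).astype(int)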

Furthermore, in the evaluation of Consensus Bayesian networks, the groups of multiple microarray datasets used were relevant to the network under consideration (for example, E. coli datasets from DNA damage experiments were used for the SOS response module). The use of different types of experimental studies with the Consensus process may lead to problems concerning the semantics of the input networks. In other words, the edges in different input networks may have subtly different meanings, based upon the different experiments from which they have been generated. Additionally, the networks considered were small in size. Therefore, the applicability of the techniques to a broader range of microarray studies and larger global-size networks has not been considered.

8.3 Further work

The following subsections outline potential avenues for future research, which are based on addressing the limitations discussed in the previous section, and/or extending the applicability of the techniques presented in this thesis.

8.3.1 Extension of modelling techniques

Further work could also involve extending the modelling techniques in a number of ways. As discussed in Section 8.2, the techniques presented in this thesis have only been used with static Bayesian networks, although modelling temporal behaviour is important for GRNs. In particular, adding temporal modelling should improve the directionality of learnt interactions and allow cyclic behaviour to be introduced to the models. Temporal information can be incorporated through time nodes and dynamic BNs (which were briefly discussed in Chapter 3). The methodology for the incorporation of prior knowledge should be directly applicable to dynamic BNs. However, the Consensus Bayesian networks algorithms may need to be adjusted to deal with temporal nodes.

8.3.2 Optimising the weighting influence of prior knowledge

As discussed in section 8.1.1, when incorporating prior knowledge into Bayesian network based GRN models the most suitable weighting value may be related to the amount of reliable prior knowledge available. It would be useful to investigate this further in order to develop a heuristic for setting the value of the prior weight.

183 8.3 Further work

8.3.3 Applicability of Consensus Bayesian networks to diverse microarray studies

As discussed in section 8.2, Consensus Bayesian networks has only been evaluated with small networks on sets of similar microarray studies. Further research could investigate whether more diverse datasets could be combined as effectively, and for larger global networks. Some improvements to the technique may be required, for example by using additional nodes to represent the experiment or study type.

8.3.4 Combining prior knowledge and multiple microarray datasets

In this thesis, the incorporation of prior knowledge and the combination of multiple microarray datasets for the reverse-engineering of GRN models have been considered separately. A next step is to integrate both multiple microarray datasets and prior knowledge in the same learning process. There are a number of ways in which this could be carried out. For example, for use with Consensus Bayesian networks, prior knowledge could be integrated separately into each individual input network using the methods described in Chapter 4. An alternative would be to integrate the prior knowledge into the combined network produced by Consensus Bayesian networks. Since Consensus Bayesian networks is based on combining networks, the technique has the potential to integrate many other heterogeneous types of data, provided that network models can be built from these datasets. Therefore, other data sources or expert knowledge, such as transcription factor binding sites, protein-protein interaction data and textual information extracted from scientific literature, could be incorporated into the combined GRN model in this way.

8.3.5 New expression data technology

High-throughput technology for gene expression is constantly evolving. Microarray technology was developed in the 1990s and first used for a genome-wide expression study of yeast in 1997. Now, a decade later, new technologies are emerging for measuring the expression of huge numbers of genes. Most notable are SAGE (Serial Analysis of Gene Expression), LongSAGE and MPSS (Massively Parallel Signature Sequencing).

These technologies represent the 'digital' age of gene expression profiling and can provide more reliable and less noisy data: rather than the estimate of expression available with microarrays, they can give an absolute count of the mRNA produced by the expression of a gene.

These technologies are new and expensive and are not yet used as widely as microarrays. As a result, few datasets exist in the public domain. However, it should be relatively easy to adapt current microarray analysis techniques to work on these new types of expression data. Therefore, in future the techniques presented in this thesis could be adapted for use on these new types of expression data. For example, Consensus Bayesian networks could be used to combine models learnt from both new types of expression data and microarray expression data.

Appendix A

Additional tables and results

This appendix contains additional tables and results relating to Chapters 6 and 7.

A.1 Chapter 6 additional tables and results

Table A.1 details the different collections of synthetic input networks for the experiments in Chapter 6. Table A.2 provides full details of the performance comparison of different approaches on the synthetic networks. Finally, Table A.3 gives full details for the method comparison results on prediction-based selection.


Network   Gaussian noise variance added to each dataset      Collective
set ID    1        2        3        4                       reliability level
1         clean    clean    clean    clean                   0.45
2         0.1      0.1      0.1      0.1                     0.44
3         0.1      0.1      0.1      0.8                     0.40
4         0.2      0.2      0.1      0.1                     0.38
5         0.2      0.2      0.2      0.2                     0.36
6         0.2      0.2      0.8      0.2                     0.35
7         0.2      0.8      0.1      0.1                     0.31
8         0.8      0.4      0.1      0.1                     0.31
9         variable noise by gene                             0.31
10        variable noise by gene                             0.31
11        0.1      0.1      0.8      0.4                     0.30
12        variable noise by gene                             0.28
13        variable noise by gene                             0.27
14        0.4      0.1      0.1      0.4                     0.26
15        0.1      0.2      0.4      0.8                     0.26
16        0.2      0.4      0.8      0.4                     0.23
17        0.4      0.4      0.4      0.4                     0.23
18        0.8      0.8      0.8      0.8                     0.18

Table A.1: Collections of input networks for synthetic datasets (generated by differential equations). This table provides details of the Gaussian variance added to each of the datasets for each set of input networks. The ‘collective reliability’ of a network set is measured by the median of the median predictive accuracies for the networks generated from each dataset. Note that in general, the Gaussian noise applied across the datasets increases as the collective reliability level of the network set decreases.
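The 'collective reliability' measure described in the caption is straightforward to compute. A minimal sketch with invented accuracy values (in the experiments, these come from the networks learnt from each dataset):

from statistics import median

# Predictive accuracies of the networks generated from each of four datasets.
per_dataset_accs = [[0.41, 0.44, 0.46],
                    [0.38, 0.40, 0.42],
                    [0.35, 0.37, 0.39],
                    [0.30, 0.33, 0.36]]

# Median over datasets of each dataset's median network predictive accuracy:
# per-dataset medians are 0.44, 0.40, 0.37, 0.33, giving 0.385 overall.
collective_reliability = median(median(accs) for accs in per_dataset_accs)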

Network   CBBN                  Prediction-based        Prediction-based selection   Best input      Normalisation
set ID    max AUC (threshold)   weighting:              no. datasets / max AUC       network only:   max AUC
                                max AUC (threshold)     (threshold)                  max AUC
1         0.83 (76-100%)        0.84 (75%)              2 / 0.84 (51-100%)           0.82            0.69
2         0.85 (26-50%)         0.88 (28%)              2 / 0.85 (1-50%)             0.82            0.77
3         0.87 (1-25%)          0.88 (14-28%)           3 / 0.88 (1-33%)             0.82            0.76
4         0.81 (1-25%)          0.87 (25,26%)           2 / 0.84 (1-50%)             0.75            0.73
5         0.76 (1-25%)          0.84 (25%)              2 / 0.84 (1-50%)             0.74            0.75
6         0.75 (1-25%)          0.75 (1-13%)            2 / 0.84 (1-50%)             0.74            0.63
7         0.70 (1-25%)          0.74 (48-52%)           3 / 0.74 (34-66%)            0.75            0.72
8         0.75 (26-50%)         0.75 (30-43%)           4 / 0.75 (26-50%)            0.75            0.71
9         0.63 (51-75%)         0.63 (57-70%)           4 / 0.63 (51-75%)            0.63            0.63
10        0.70 (26-50%)         0.73 (27,28%)           2 / 0.74 (1-50%)             0.77            0.71
11        0.82 (1-25%)          0.84 (23-31%)           2 / 0.85 (1-50%)             0.82            0.70
12        0.70 (26-50%)         0.70 (32-40%)           4 / 0.70 (26-50%)            0.66            0.53
13        0.67 (1-25%)          0.67 (15-24%)           3 / 0.67 (1-33%)             0.68            0.62
14        0.79 (51-75%)         0.79 (56-68%)           4 / 0.79 (51-75%)            0.82            0.72
15        0.79 (1-25%)          0.81 (18-31%)           2 / 0.81 (1-50%)             0.77            0.77
16        0.76 (1-25%)          0.76 (1-18%)            4 / 0.76 (1-25%)             0.72            0.60
17        0.70 (26-50%)         0.70 (35-40%)           4 / 0.70 (26-50%)            0.70            0.83
18        0.62 (1-25%)          0.62 (1-19%)            4 / 0.62 (1-25%)             0.60            0.53

Each entry shows the maximum AUC achieved, with the corresponding consensus threshold in parentheses.

Table A.2: Synthetic networks: approaches performance comparison summary

Network   Optimum no. of       Highest median predictive     Lowest median predictive
set ID    input networks (n)   accuracy of unselected        accuracy of selected
                               networks (x)                  networks
1         2                    0.44                          0.46
2         2                    0.40                          0.48
3         3                    0.20                          0.39
4         2                    0.39                          0.39
5         2                    0.36                          0.36
6         2                    0.34                          0.34
7         3                    0.22                          0.28
8         4                    0.20                          0.20
9         4                    0.18                          0.18
10        2                    0.32                          0.32
11        2                    0.25                          0.34
12        4                    0.16                          0.17
13        3                    0.34                          0.35
14        4                    0.25                          0.25
15        2                    0.19                          0.33
16        4                    0.17                          0.17
17        4                    0.15                          0.15
18        4                    0.16                          0.16

Table A.3: Method comparison for prediction-based selection

A.2 Chapter 7 additional tables and results

A.2.1 Training examples for learning

Table A.4 lists the training examples and attributes for the learning heuristics application presented in Chapter 7.


Training example   Predictive accuracies           Class labels (input networks : consensus threshold)
Set 1              [0.50,0.46,0.44,0.40]           two-networks : high
Set 2              [0.51,0.46,0.44,0.40]           two-networks : low; four-networks
Set 3              [0.40,0.36,0.35,0.33]           two-networks : low
Set 4              [0.33,0.23,0.22,0.16]           four-networks : thres1
Set 5              [0.30,0.18,0.18,0.18]           four-networks : thres1
Set 6              [0.36,0.33,0.19,0.18]           two-networks : low; three-networks : mid; four-networks : thres1
Set 7              [0.35,0.23,0.23,0.18]           four-networks : thres1
Set 8              [0.41,0.41,0.40,0.19]           three-networks : low; four-networks : thres1
Set 9              [0.35,0.35,0.15,0.25]           two-networks : low
Set 10             [0.30,0.21,0.32,0.35]           four-networks : thres2
Set 11             [0.35,0.21,0.28,0.35]           three-networks : mid
Set 12             [0.43,0.39,0.28,0.36]           two-networks : low
Set 13             [0.37,0.35,0.16,0.34]           two-networks : low
Set 14             [0.36,0.25,0.27,0.25]           four-networks : thres1
Set 15             [0.33,0.32,0.18,0.29]           two-networks : low
Set 16             [0.33,0.26,0.17,0.30]           four-networks : thres2
Set 17             [0.32,0.26,0.15,0.30]           three-networks : low; four-networks : thres1
Set 18             [0.33,0.31,0.18,0.30]           four-networks : thres3
Set 19             [0.30,0.32,0.30,0.28]           two-networks : low
Set 20             [0.38,0.41,0.37,0.35,0.41]      four-networks : thres2
Set 21             [0.38,0.37,0.35,0.41]           three-networks : high
Set 22             [0.35,0.35,0.35,0.37]           three-networks : high
Set 23             [0.35,0.35,0.37]                two-networks : high
Set 24             [0.30,0.28,0.25,0.26,0.26]      two-networks : low; five-networks : -
Set 25             [0.27,0.32,0.33]                three-networks : low
Set 26             [0.38,0.36,0.33,0.34]           three-networks : low
Set 27             [0.36,0.33,0.34]                two-networks : low


Training example   Predictive accuracies             Class labels (input networks : consensus threshold)
Set 28             [0.44,0.43,0.41,0.43,0.44]        three-networks : low
Set 29             [0.44,0.41,0.43,0.44]             two-networks : low
Set 30             [0.30,0.28,0.34,0.32,0.29,0.28]   six-networks : -
Set 31             [0.30,0.34,0.32,0.29,0.28]        five-networks : -
Set 32             [0.45,0.49,0.30,0.49]             two-networks : high
Set 33             [0.45,0.49,0.49]                  two-networks : high
Set 34             [0.49,0.49,0.49]                  three-networks : high
Set 35             [0.57,0.69,0.55,0.66,0.69]        five-networks : -
Set 36             [0.57,0.69,0.66,0.69]             four-networks : thres4
Set 37             [0.82,0.84,0.84,0.85,0.82,0.82]   five-networks : -
Set 38             [0.82,0.84,0.85,0.82,0.82]        four-networks : thres3
Set 39             [0.30,0.29,0.32]                  three-networks : low
Set 40             [0.30,0.25,0.24,0.30,0.25]        five-networks : -
Set 41             [0.30,0.25,0.24,0.30]             four-networks : thres1
Set 42             [0.30,0.25,0.24]                  three-networks : low
Set 43             [0.50,0.50,0.43]                  three-networks : mid
Set 44             [0.52,0.53,0.47,0.49,0.50,0.50]   six-networks : -
Set 45             [0.53,0.47,0.49,0.50,0.50]        five-networks : -
Set 46             [0.53,0.49,0.50,0.50]             three-networks : mid
Set 47             [0.49,0.50,0.50]                  three-networks : low
Set 48             [0.27,0.26,0.26,0.30]             four-networks : thres1
Set 49             [0.26,0.26,0.30]                  three-networks : low
Set 50             [0.33,0.34,0.29,0.32,0.34,0.31]   six-networks : -
Set 51             [0.33,0.29,0.32,0.34,0.31]        five-networks : -
Set 52             [0.33,0.32,0.34,0.31]             four-networks : thres2
Set 53             [0.32,0.34,0.31]                  three-networks : mid
Set 54             [0.30,0.28,0.27,0.25,0.26,0.26]   six-networks : -

Table A.4: Learning heuristics: training examples and class labels


A.2.2 Learnt heuristic rules

In this section, learnt heuristic rules for 3-6 network inputs are presented.

A.2.2.1 Rules for 3 network inputs

The following rules were learnt for three-networks, using the standard m-estimate scoring mechanism in ACE:

three-networks :- total_networks_below(3), n_networks_below(1,0.31).
Covers 5/17 positive examples and 0/38 negative examples.
This rule means that:
IF the total number of possible input networks is 3 or below
AND there is at least 1 network with predictive accuracy below or equal to 0.31
THEN three networks is the optimal input.


three-networks :- total_networks(4), n_networks_above(4,0.18), n_networks_below(4,0.43), predacc_cluster(OpenBound,0.35).
Covers 5/12 positive examples and 1/38 negative examples.
This rule means that:
IF the total number of possible input networks is 4
AND there are at least 4 networks with predictive accuracy above or equal to 0.18
AND there are at least 4 networks with predictive accuracy below or equal to 0.43
AND there is a cluster of accuracies, bounded above at 0.35
THEN three networks is the optimal input.

three-networks :- total_networks_below(5), n_networks_above(2,0.44), n_networks_below(2,0.50), predacc_cluster(OpenBound,0.43).
Covers 5/7 positive examples and 3/38 negative examples.
This rule means that:
IF the total number of possible input networks is 5 or below
AND there are at least 2 networks with predictive accuracy above or equal to 0.44
AND there are at least 2 networks with predictive accuracy below or equal to 0.50
AND there is a cluster of accuracies, bounded above at 0.43
THEN three networks is the optimal input.

three-networks :- total networks below(5), n networks above(2,0.44), n networks below(2,0.50), predacc cluster(OpenBound,0.43). Covers 5/7 positive examples and 3/38 negative examples This rule means that: IF the total number of possible input networks is 5 or below AND there are at least 2 networks with predictive accuracy above or equal to 0.44 AND there are at least 2 networks with predictive accuracy below or equal to 0.50 AND there is a cluster of accuracies, bounded above at 0.43 THEN three networks is the optimal input

These three rules effectively divide the positive training examples into categories by the number of possible input networks: the first rule covers cases with 3 networks, the second covers only cases with 4 networks, and the third covers cases with 5 or fewer input networks. This means they are fairly specific rules. Similarly to the rules for two-networks, they are mainly based on bounding the predictive accuracies of the input networks.


Next, rules were learnt to establish which consensus threshold to apply, low, mid or high, in order to obtain the optimal network performance with three input networks:

low :- n_networks_below(3,0.33).
Covers 5/9 positive examples and 0/8 negative examples.
This rule means that:
IF all 3 networks have predictive accuracy below or equal to 0.33
THEN the optimal consensus threshold is low (1-33%).

mid :- n_networks_below(1,0.31), n_networks_above(1,0.34).
Covers 3/5 positive examples and 0/12 negative examples.
This rule means that:
IF there is at least 1 network with predictive accuracy below or equal to 0.31
AND there is at least 1 network with predictive accuracy above or equal to 0.34
THEN the optimal consensus threshold is mid (34-66%).

high :- n_networks_above(3,0.35), n_networks_below(2,0.49).
Covers 3/3 positive examples and 2/14 negative examples.
This rule means that:
IF all 3 networks have predictive accuracy above or equal to 0.35
AND there are at least 2 networks with predictive accuracy below or equal to 0.49
THEN the optimal consensus threshold is high (67-100%).

Similarly to the consensus threshold rules for two-networks, these rules specify that cases with higher predictive accuracies require a higher consensus threshold, and cases with lower predictive accuracies require a lower consensus threshold.

The boundary between 'lower' and 'higher' predictive accuracies appears to lie at around 0.34. For example, in the rule for the low threshold, all three networks should have predictive accuracy below or equal to 0.33. For the mid threshold there should be at least one network with accuracy above 0.34, but also one network with predictive accuracy below 0.31. Finally, the rule for the high threshold specifies that all 3 input networks should have predictive accuracy above 0.35.

A.2.2.2 Rules for 4 network inputs

The following rules were learnt for four-networks, using the standard m-estimate scoring mechanism in ACE:

four-networks :- predacc_gap(LB1,LB2,UB1,0.27), n_networks_below(3,0.31).
Covers 8/17 positive examples and 0/38 negative examples.
This rule means that:
IF there is a 'gap' in the predictive accuracies, where the accuracies at the upper end of the gap are bounded above by 0.27
AND there are at least 3 networks with predictive accuracy below or equal to 0.31
THEN four networks is the optimal input.

four-networks :- total_networks(4), n_networks_above(4,0.18), n_networks_below(1,0.32), predacc_cluster(0.44,OpenBound).
Covers 5/9 positive examples and 3/38 negative examples.
This rule means that:
IF the total number of possible input networks is 4
AND there are at least 4 networks with predictive accuracy above or equal to 0.18
AND there is at least 1 network with predictive accuracy below or equal to 0.32
AND there is a cluster of accuracies, bounded below at 0.44
THEN four networks is the optimal input.


The first rule covers examples with very low predictive accuracies: there is a gap, where the accuracies at the higher end of the gap are below 0.27, and at least three networks have predictive accuracy below 0.31. The second rule covers cases where the accuracies are higher; for example, there is a cluster of accuracies around 0.44.

Next, rules were learnt to establish which consensus threshold to apply, thres1 or thres2, in order to obtain the optimal network performance. Rules were not learnt for the other thresholds, due to a lack of positive training examples in these classes: there are only 2 examples for thres3, and 1 example for thres4.

thres1 :- n_networks_above(2,0.19), n_networks_below(2,0.27).
Covers 7/8 positive examples and 1/9 negative examples.
This rule means that:
IF there are at least 2 networks with predictive accuracy above or equal to 0.19
AND there are at least 2 networks with predictive accuracy below or equal to 0.27
THEN the optimal consensus threshold is thres1 (1-25%).

thres2 :- n_networks_below(2,0.39), n_networks_above(1,0.34).
Covers 5/5 positive examples and 3/12 negative examples.
This rule means that:
IF there are at least 2 networks with predictive accuracy below or equal to 0.39
AND there is at least 1 network with predictive accuracy above or equal to 0.34
THEN the optimal consensus threshold is thres2 (26-50%).

Again, these rules specify that cases with higher predictive accuracies require a higher consensus threshold, and cases with lower predictive accuracies require a lower consensus threshold. For a threshold between 1-25%, there should be at least two networks with predictive accuracy above 0.19 and two networks with predictive accuracy under 0.27.

For a threshold between 26-50%, there should be at least one network with predictive accuracy above 0.34 and one network with predictive accuracy under 0.39.

A.2.2.3 Rules for 5 or 6 network inputs

Finally, rules for five-networks and six-networks were constructed, using the WRA scoring mechanism in ACE:

five-networks :- total_networks(5), n_networks_above(4,0.46).
Covers 3/8 positive examples and 0/47 negative examples.
This rule means that:
IF there are 5 possible input networks
AND there are at least 4 networks with predictive accuracy above or equal to 0.46
THEN five networks is the optimal input.

five-networks :- total_networks(5), n_networks_below(1,0.29).
Covers 3/5 positive examples and 0/47 negative examples.
This rule means that:
IF there are 5 possible input networks
AND there is at least 1 network with predictive accuracy below or equal to 0.29
THEN five networks is the optimal input.

six-networks :- total_networks(6).
Covers 4/4 positive examples and 2/51 negative examples.
This rule means that:
IF the total number of possible input networks is 6
THEN six networks is the optimal input.

Due to a lack of training examples for five-networks and six-networks, these rules are not very useful. They mainly rely on the number of possible input networks, since there are so few examples with five or six possible input networks.

Although the rules for five-networks do include other conditions, they are very specific rules and likely to overfit the training data.

A.2.3 Test case results

In this section, the full details of applying the learnt heuristics to the real data test cases are provided.

A.2.3.1 Yeast: heat-stress network

The first real data example is the yeast heat-stress network used in Chapters 5 and 6. This is a sub-network of 9 regulatory genes related to heat-shock response. The five microarray datasets are publicly available on the YeastBASE expression database (see Table 5.3 for more details). The learnt networks are evaluated by comparing them to documented gene interactions obtained from the online YEASTRACT database (Teixeira et al., 2006).

The heuristics presented in Section 7.4.3 were applied to this case in order to find the optimal network selection and consensus threshold. A comparison of the heuristics' predictions with the actual outcomes is shown in Table 7.4. Figure A.1 shows the AUC performance of the prediction-based selection networks, each based on a different number of selected input networks.

The learnt heuristics are not able to correctly predict the number of input networks as 3. This is because the example matches the second rule for two-networks, which allows for five available input networks.
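To see why that rule fires here, the heat-stress accuracies can be pushed through its conditions; this walkthrough uses the simplified adjacent-pair cluster test from the earlier sketch, so it is an illustration rather than the ACE evaluation itself:

accs = sorted([0.25, 0.26, 0.31, 0.30, 0.32])

# predacc_cluster(0.28, OpenBound): two accuracies within 0.02, lower one >= 0.28.
cluster = any(hi - lo <= 0.02 and lo >= 0.28 for lo, hi in zip(accs, accs[1:]))

fires = (len(accs) <= 5                           # total_networks_below(5)
         and sum(a >= 0.26 for a in accs) >= 4    # n_networks_above(4, 0.26)
         and sum(a <= 0.32 for a in accs) >= 4    # n_networks_below(4, 0.32)
         and cluster)
print(fires)   # True: the rule predicts two networks, although 3 is optimal here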

A.2.3.2 Yeast: cell-cycle network

The second real data case again concerns yeast microarray expression datasets, but this time focuses on the cell-cycle network. The sub-network consists of 19 genes involved in the cell-cycle process (Spellman et al., 1998). Four publicly available microarray datasets were used, each generated by experiments focusing on the yeast cell-cycle (more details can be found in Table A.5). The heuristics presented in Section 7.4.3 were applied to this case in order to find the optimal network selection and consensus threshold.


[Figure A.1: Test case: yeast heat-stress network AUC plot. AUC is plotted against consensus threshold (%), with curves for all networks and for the 4-, 3- and 2-network selections.]

A comparison of the heuristics' predictions with the actual outcomes is shown in Table 7.4. Figure A.2 shows the AUC performance of the prediction-based selection networks, each based on a different number of selected input networks.

The learnt heuristics are able to correctly predict the number of input networks as 4 and the consensus threshold as thres2 (between 26-50%). This case fits the second rule for four-networks, which requires all four network predictive accuracies to be above 0.18 and one network predictive accuracy to be above 0.32.

Dataset                  Description   Number of observations
Pramila et al. (2006)    Cell-cycle    50
Pramila et al. (2002)    Cell-cycle    13
Spellman et al. (1998)   Cell-cycle    77
Zhu et al. (2000)        Cell-cycle    13

Table A.5: Summary of yeast cell-cycle datasets


[Figure A.2: Test case: yeast cell-cycle network AUC plot. AUC is plotted against consensus threshold (%), with curves for all datasets and for the 3- and 2-network selections.]

It also meets the criteria for thres2, where one network predictive accuracy must be above 0.34 and two must be below 0.39. However, we also note that this case meets the criteria for two-networks and three-networks as well. This is not unexpected, recalling that some training examples were given multiple labels when two or more Consensus networks (with different numbers of network inputs) all performed very well. In this test case, the Consensus networks generated from both two and four input networks outperform all input networks, and their AUCs are within 0.015 of each other. Additionally, the consensus threshold predicted for two network inputs is low, which is optimal in this case.

A.2.3.3 E. coli: SOS response network

The final real data example is the E. coli SOS-response network, used in chap- ter 5. This is a subnetwork of 19 target genes and one transcriptional repressor, LexA (see Figure 5.3 in Chapter 5). Table 5.2 provides a summary of the four

200 A.2 Chapter 7 additional tables and results

[Figure A.3: Test case: E. coli SOS-response network AUC plot. AUC is plotted against consensus threshold (%), with curves for all networks and for the 3- and 2-network selections.]

The heuristics presented in Section 7.4.3 were applied to the case in order to find the optimal network selection and consensus threshold. A comparison of the heuristics' predictions with the actual outcomes is shown in Table 7.4. Figure A.3 shows the AUC performance of the prediction-based selection networks, each based on a different number of selected input networks.

The learnt heuristics are not able to predict the number of input networks at all: the example does not fit any of the rules for two-networks, three-networks or four-networks. This is because this case has four networks that all have reasonably high predictive accuracies, and there are few similar synthetic cases among the training examples. However, the relatively high predictive accuracies do imply that a higher consensus threshold is more appropriate. Indeed, the actual optimal number of input networks is 2, and the heuristics are able to predict the correct consensus threshold, since the example matches the second rule for the high threshold with two input networks, which is indicated for higher predictive accuracies, as we have in this example.

A high consensus threshold is also optimal for a three-network selection Consensus network, and close to optimal for four input networks.

Glossary

Area Under the ROC Curve (AUC) A global measure of classifier performance, often used in classification problems. In this thesis, AUC is used as a measure of comparison between learnt network models and the true network. AUC takes a value between 0 and 1

Associative Concept Space (ACS) A biological text mining tool, used in this thesis, that extends the simple co-occurrence technique

Bayesian network (BN) A method for representing the dependencies among a set of variables. A BN has two components: a qualitative representation of the network, in the form of a directed acyclic graph (DAG), and conditional probability distributions associated with each variable, which quantify the nature of each dependency

Bayesian Network Meta-Analysis A post-learning aggregation method for combining BNs, based on combining confidence levels for network edges from the input BNs

Bootstrapping The process of estimating a quantity by sampling from an approximating distribution. In this thesis, bootstrapping is used to construct BNs from resampled microarray datasets


Concept In the context of the research in this thesis, a biomedical concept may be a single-word object found in the biomedical literature, such as a gene or organism name, or a common multiple-word combination

Concept profile For use with the ACS, a concept profile is a vector of biological concepts with weights, where each weight describes the strength of the association between a concept and the concept to which the profile belongs

Confidence level In this thesis, confidence levels are assigned to network edges based on a bootstrapping procedure, where an estimate of the confidence level for each edge is computed by the proportion of networks that contain that edge

Consensus Bayesian Networks (CBNs) A post-learning aggregation method for combining BNs. A CBN contains consistent network edges across input BNs

Consensus Bootstrapped Bayesian networks (CBBNs) A post-learning aggregation method, analogous to CBNs, but for combining BN models produced using a bootstrap approach

GenBank An annotated collection of all publicly available DNA sequences, hosted by the National Center for Biotechnology Information (NCBI)

Gene expression A process by which genes are copied and translated to proteins. Since proteins are involved in every cellular process, it is gene expression that allows all cellular processes to occur


Gene Ontology (GO) An initiative with the aim of standardising the representation of gene and gene product attributes across species and databases, by providing a controlled vocabulary of terms for describing gene product characteristics

Gene Regulatory Network (GRN) A network graph that describes how genes interact in terms of regulation and expression

Intensity log-ratio A common measure of gene expression

Kyoto Encyclopedia of Genes and Genomes (KEGG) An online database of biological systems

Literature-based gene concept profiling A biological text-mining technique, developed by the Biosemantics Association, for calculating the distances between genes and other biological concepts. It makes use of the ACS

Meta-analysis A set of statistical methods, originating in medical statistics, for combining the results of several studies that address a set of related research hypotheses

Microarray A high-throughput technology for measuring expression levels for thousands of genes simultaneously

mRNA The copy (transcript) of a gene produced during the gene expression process. Microarray technology measures expression levels by detecting the abundance of mRNA present in a sample


Network model reliability The reliability, or ability to generalise, of a network model can be measured based on how well its variable values can be predicted over independent datasets. A network that can predict values with high accuracy on other independent datasets can be said to be more reliable

Normalisation In the context of microarray data, normalisation is the transformation of expression levels to adjust for systematic variations (arising from variation in the technology rather than true biological variations) so that measurements from two different microarray samples can be directly compared

Post-learning aggregation When reverse-engineering from multiple data sources, reverse-engineering is performed on each data source separately. The resulting models are then aggregated

Posterior probability distribution This distribution represents the knowledge or belief about an uncertain quantity, after observation of the data

Pre-learning aggregation When reverse-engineering from multiple data sources, the data is aggregated prior to the learning/reverse-engineering process

Prediction accuracy In parameter estimation with BNs, this is the proportion of samples where the predicted discrete states are correct

Prior knowledge Pre-existing knowledge. In the context of the research presented in this thesis, it is information known prior to the data collected from a microarray experiment


Prior probability distribution This distribution represents the knowledge or belief about an uncertain quantity, prior to observation of the data

Protein-protein interaction data is generated from technology for detecting physical interactions between proteins. Physically interacting proteins and gene expression are closely linked

PubMed/Medline A digital archive of biomedical and life sciences journal literature

Reverse-engineering Discovery of an underlying model or process based on data or observations

Text-mining A family of techniques for the automatic extraction of information and knowledge from text-based sources. Relevant to the research presented in this thesis, many methods have been developed specifically for the mining of specialist biological literature

Transcription factor A gene that initiates the regulation of itself or other genes. Strictly speaking, the transcription factor is the protein that is produced when the gene becomes expressed

Transcription Factor Binding Site (TFBS) location data is generated from technology for discovering the binding sites of transcription factors. Since a gene becomes expressed when a transcription factor binds to a segment of DNA close to it, if the location of binding sites for a particular transcription factor can be identified, then potential target genes can be found


Weighted Consensus Networks (WCNs) A post-learning aggregation method, analogous to CBBNs, but which utilises weights that represent the quality/reliability of each input bootstrapped BN model

References

Aerts, S., Haeussler, M., van Vooren, S., Griffith, O., Hulpiau, P., Jones, S., Montgomery, S., Bergman, C. & The Open Regulatory Annotation Consortium (2008). Text-mining assisted regulatory annotation. Genome Biology, 9, R31. 40

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723. 58

Almasri, E., Larsen, P., Chen, G. & Dai, Y. (2008). Incorporating literature knowledge in Bayesian network for inferring gene networks with gene expression data. In Bioinformatics Research and Applications (LNCS 4983), 184–195. 72

Bader, J., Chaudhuri, A., Rothberg, J. & Chant, J. (2004). Gaining confidence in high-throughput protein interaction networks. Nature Biotechnology, 22, 78–85. 35

Bailey, P., Downes, M., Lau, P., Harris, J., Chen, S.L., Hamamori, Y., Sartorelli, V. & Muscat, G.E.O. (1999). The nuclear receptor corepressor N-CoR regulates differentiation: N-CoR directly interacts with MyoD. Molecular Endocrinology, 13, 1155–1168. 93

Bar-Joseph, Z., Gerber, G.K., Lee, T.I., Rinaldi, N.J., Yoo, J.Y., Robert, F., Gordon, D.B., Fraenkel, E., Jaakkola, T.S., Young, R.A. & Gifford, D.K. (2003). Computational discovery of gene modules and regulatory networks. Nature Biotechnology, 21, 1337–1342. 27, 65

Beissbarth, T., Fellenberg, K., Brors, B., Arribas-Prat, R., Boer, J., Hauser, N., Scheideler, M., Hoheisel, J., Schutz, G., Poustka, A. & Vingron, M. (2000). Processing and quality control of DNA array hybridization data. Bioinformatics, 16.

Bellman, R. (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press. 31

Bernard, A. & Hartemink, A. (2005). Informative structure priors: Joint learning of dynamic regulatory networks from multiple types of data. In Proceedings of the Pacific Symposium on Biocomputing, vol. 10, 459–470. 39, 72, 74

Biosemantics Association (2009). http://www.biosemantics.org. 71

Blockeel, H., Dehaspe, L., Demoen, B., Janssens, G., Ramon, J. & Vandecasteele, H. (2002). Improving the efficiency of inductive logic programming through the use of query packs. Journal of Artificial Intelligence Research, 16, 135–166. 166

Castelo, R. & Siebes, A. (2000). Priors on network structures: Biasing the search for Bayesian networks. International Journal of Approximate Reasoning, 24, 39–57. 73, 75

Causton, H.C., Quackenbush, J. & Brazma, A. (2003). Microarray Gene Expression Data Analysis: A Beginner's Guide. Blackwell Publishing. 4, 9, 16, 24, 25, 26, 31

Chen, L., Wang, S. & Chen, R.S. (2008). A method to detect gene co- expression clusters from multiple microarrays. Progress in Biochemistry and Biophysics, 35, 914–920. 32

Chickering, D. (1995). A transformational characterization of equivalent Bayesian network structures. In Proceedings of Uncertainty in Artificial Intelligence 11. 50

Choi, J., Yu, U., Kim, S. & Yoo, O. (2003). Combining multiple microarray studies and modeling interstudy variation. Bioinformatics, 19, 84–90. 32

Choi, J., Yu, U., Yoo, O. & Kim, S. (2005). Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics, 21, 4348–4355. 32

Cho, R., Campbell, M., Winzeler, E., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T., Gabrielian, A., Landsman, D., Lockhart, D. & Davis, R. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell, 2, 65–73. 68

Clark, P. & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261–283. 161

Cleveland, W. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. 20

Conlon, E., Song, J. & Liu, J. (2006). Bayesian models for pooling microarray studies with multiple sources of replications. BMC Bioinformatics, 7. 32, 103

Courcelle, J., Khodursky, A., Peter, B., Brown, P. & Hanawalt, P. (2001). Comparative gene expression profiles following UV exposure in wild- type and SOS-deficient Escherichia coli. Genetics, 158, 41–64.

Datta, N. (2002). Modulation of MDM2/p53 and cyclin-activating kinase during the megakaryocyte differentiation of human erythroleukemia cells. Experimental Hematology, 30, 158–165. 93

De Raedt, L., Blockeel, H., Dehaspe, L. & Laer, W.V. (2001). Three companions for data mining in first order logic. In S. Dzeroski & N. Lavrac, eds., Relational Data Mining, 105–139, Springer-Verlag. 166

DeRisi, J., Iyer, V. & Brown, P. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278, 680–686. 3, 12

DerSimonian, R. & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7, 177–188. 104

Dzeroski, S. & Lavrac, N. (2001). Relational Data Mining. Springer, Berlin. 162

Efron, B. & Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman & Hall. 59

Eisen, M., Spellman, P., Brown, P. & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. PNAS, 95, 14863–14868. 25

Elnitski, L., Jin, V., Farnham, P. & Jones, S. (2006). Locating mammalian transcription factor binding sites: A survey of computational and experimental techniques. Genome Research, 16, 1455–1464. 35

Faith, J., Hayete, B., Thaden, J., Mogno, I., Wierzbowski, J., Cottarel, G., Kasif, S., Collins, J. & Gardner, T. (2007). Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biology, 5. 63, 89

Filkov, V. (2006). Identifying gene regulatory networks from gene expression data. In S. Aluru, ed., Handbook of Computational Molecular Biology, Chapman and Hall. 27

Friedman, N., Murphy, K. & Russell, S. (1998). Learning the structure of dynamic probabilistic networks. In Uncertainty in Artificial Intelligence, vol. 14. 68

Friedman, N., Goldszmidt, M. & Wyner, A. (1999). Data analysis with Bayesian networks: A bootstrap approach. In Proceedings of 15th Annual Conference on Uncertainty in Artificial Intelligence. 59

Friedman, N., Linial, M., Nachman, I. & Pe’er, D. (2000). Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7, 601–620. 4, 29, 51, 53, 65, 66, 67, 101

Gao, F., Foat, B. & Bussemaker, H. (2004). Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data. BMC Bioinformatics, 5. 39

Gasch, A., Spellman, P., Kao, C., Carmel-Harel, O., Eisen, M., Storz, G., Botstein, D. & Brown, P. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, 4241–4257. 26

Ge, H., Liu, Z., Church, G. & Vidal, M. (2001). Correlation between transcriptome and interactome mapping data from S. cerevisiae. Nature Genetics, 29, 482–486. 34

GenExpDB (2008). University of Oklahoma E. coli Community’s Gene Expres- sion Database. http://chase.ou.edu/oubcf/. 30

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J. & Caligiuri, M. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537. 25

Gregoire, S., Xiao, L., Nie, J., Zhang, X., Xu, M., Li, J., Wong, J., Seto, E. & Yang, X.J. (2007). Histone deacetylase 3 interacts with and deacetylates myocyte enhancer factor 2. Molecular and Cellular Biology, 27, 1280–1295. 93

Grigull, J., Mnaimneh, S., Pootoolal, J., Robinson, M. & Hughes, T. (2004). Genome-wide analysis of mRNA stability using transcription inhibitors and microarrays reveals post-transcriptional control of ribosome biogenesis factors. Mol. Cell. Biol., 24, 5534–5547.

Hanley, J. & McNeil, B. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36. 64

Hartemink, A., Gifford, D., Jaakkola, T. & Young, R. (2002). Bayesian methods for elucidating genetic regulatory networks. IEEE Intelligent Systems, 17, 37–43. 51, 53, 66

Heyer, L., Kruglyak, S. & Yooseph, S. (1999). Exploring expression data: Identification and analysis of coexpressed genes. Genome Research, 9, 1106–1115. 24

Hoeting, J., Madigan, D., Raftery, A. & Volinsky, C. (1999). Bayesian model averaging: a tutorial. Statistical Science, 14, 382–417. 137

Hu, P., Greenwood, C. & Beyene, J. (2006). Statistical methods for meta-analysis of microarray data: A comparative study. Information Systems Frontiers, 8, 8–20. 32, 103

Imoto, S., Goto, T. & Miyano, S. (2002). Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. In Pacific Symposium on Biocomputing, vol. 7, 175–186. 67

Imoto, S., Higuchi, T., Goto, T., Kuhara, S. & Miyano, S. (2003). Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks. In Proceedings of the IEEE Computer Science Bioinformatics Conference (CSB’03), 104–113. 39, 40, 71, 72

Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N., Chung, S., Emili, A., Snyder, M., Greenblatt, J. & Gerstein, M. (2003). A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302, 449–453. 34

Jarvinen, A., Hautaniemi, S., Edgren, H., Auvinen, P., Saarela, J., Kallioniemi, O. & Monni, O. (2004). Are data from different gene expression microarrays comparable? Genomics, 83, 1164–1168. 32

Jelier, R., Jenster, G., Dorssers, L.C.J., van der Eijk, C.C., van Mulligen, E.M., Mons, B. & Kors, J.A. (2005). Co-occurrence based meta-analysis of scientific texts: retrieving biological relationships between genes. Bioinformatics, 21, 2049–2058. 37

Jelier, R., Jenster, G., Dorssers, L., Wouters, B., Hendriksen, P., Mons, B., Delwel, R. & Kors, J. (2007). Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation. BMC Bioinformatics, 8. 38, 70, 77

Jelier, R., Schuemie, M., Roes, P., van Mulligen, E. & Kors, J. (2008). Literature-based concept profiles for gene annotation: The issue of weighting. International Journal of Medical Informatics, 77, 354–362. 78

Jenssen, T.K., Lægreid, A., Komorowski, J. & Hovig, E. (2001). A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28, 21–28. 37

Kanehisa, M. & Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28, 27–30. 36

Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T. & Yamanishi, Y. (2008). KEGG for linking genomes to life and the environment. Nucleic Acids Research, 36, D480–D484. 36

Kauffman, S. (1969). Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology, 22, 437–467. 27

Khan, J., Saal, L., Bittner, M., Jiang, Y., Gooden, G., Glatfelter, A. & Meltzer, P. (2002). Expression profiling in cancer using cDNA microarrays. Methods in Molecular Medicine, 68, 205–222. x, 15

Khil, P. & Camerini-Otero, R. (2002). Over 1000 genes are involved in the DNA damage response of Escherichia coli. Molecular Microbiology, 44, 89–105.

Kim, S., Imoto, S. & Miyano, S. (2003). Inferring gene networks from time series microarray data using dynamic Bayesian networks. Briefings in Bioin- formatics, 4, 228–235. xi, 56, 68

Krallinger, M. & Valencia, A. (2005). Text-mining and information-retrieval services for molecular biology. Genome Biology, 6. 37

Krzanowski, W. & Hand, D. (2009). ROC Curves for Continuous Data. Chapman and Hall. 63

Kuo, W., Jenssen, T., Butte, A.J., Ohno-Machado, L. & Kohane, I. (2002). Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics, 18, 405–412. 31

Larsen, P., Almasri, E., Chen, G. & Dai, Y. (2007). A statistical method to incorporate biological knowledge for generating testable novel gene regulatory interactions from microarray experiments. BMC Bioinformatics, 8. 72

Lavrac, N. & Dzeroski, S. (1994). Inductive Logic Programming: Techniques and Applications. Ellis Horwood, New York. 162

Lee, H., Hsu, A., Sajdak, J., Qin, J. & Pavlidis, P. (2004). Coexpression analysis of human genes across many microarray data sets. Genome Research, 14, 1085–1094. 32

Lee, M.L.T., Kuo, F.C., Whitmore, G.A. & Sklar, J. (2000). Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. PNAS, 97, 9834–9839. 31

Lee, T., Rinaldi, N., Robert, F., Odom, D., Bar-Joseph, Z., Gerber, G., Hannett, N., Harbison, C., Thompson, C., Simon, I., Zeitlinger, J., Jennings, E., Murray, H., Gordon, D., Ren, B., Wyrick, J., Tagne, J., Volkert, T., Fraenkel, E., Gifford, D. & Young, R. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 799–804. 35, 39

Li, S., Wu, L. & Zhang, Z. (2006). Constructing biological networks through combined literature mining and microarray analysis: a LMMA approach. Bioinformatics, 22, 2143–2150. 40

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, 281–297, University of California Press. 25

Mandal, M., Bandyopadhyay, D., Goepfert, T.M. & Kumar, R. (1998). Interferon induces expression of cyclin-dependent kinase-inhibitors p21WAF1 and p27Kip1 that prevent activation of cyclin-dependent kinase by CDK-activating kinase (CAK). Oncogene, 16, 217–225. 93

MAQC consortium (2006). The microarray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology, 24, 1151–1161. 4, 31, 70

Matzkevich, I. & Abramson, B. (1992). The topological fusion of Bayes nets. In Proceedings of Uncertainty in Artificial Intelligence 8, 191–198, Morgan Kaufmann. 105

McAdams, H. & Arkin, A. (1997). Stochastic mechanisms in gene expression. PNAS, 94, 814–819. 30

McCray, A. & Miller, A. (1998). Making the conceptual connections: the Unified Medical Language System (UMLS) after a decade of research and development. J. Am. Med. Inf. Ass., 4, 484–500. 77

McDonald, J. (2008). Handbook of Biological Statistics. Sparky House Publishing, Baltimore, Maryland, USA. 83

McEntyre, J. & Ostell, J. (2002-2005). The NCBI Handbook. 36

Mitchell, T. (1997). Machine Learning. McGraw-Hill. 43, 50, 159

Murphy, K. (2001a). The Bayes Net Toolbox for Matlab. Computing Science and Statistics, 33. 48, 49, 59

Murphy, K. (2001b). An introduction to graphical models. http://www.cs.ubc.ca/~murphyk/Papers/intro_gm.pdf. xi, 43, 44, 45, 46, 48, 51

Murphy, K. (2002). Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. thesis, University of California, Berkeley. 47

Murphy, K. & Mian, S. (1999). Modelling gene expression data using dynamic Bayesian networks. Tech. rep., Computer Science Division, University of California. 68

Nariai, N., Tamada, Y., Imoto, S. & Miyano, S. (2005). Estimating gene regulatory networks and protein-protein interactions of Saccharomyces cerevisiae from multiple genome-wide data. Bioinformatics, 21, 206–212. 34, 38

Needham, C.J., Bradford, J.R., Bulpitt, A.J. & Westhead, D.R. (2007). A primer on learning in Bayesian networks for computational biology. PLoS Computational Biology, 3, e129. 49

Ng, S., Tan, S. & Sundararajan, V. (2003). On combining multiple microarray studies for improved functional classification by whole-dataset feature selection. Genome Informatics, 14, 44–53. 32

Nikitin, A., Egorov, S., Daraselia, N. & Mazo, I. (2003). Pathway Studio: the analysis and navigation of molecular networks. Bioinformatics, 19, 2155–2157. 72

Novichkova, S., Egorov, S. & Daraselia, N. (2003). MedScan, a natural language processing engine for MEDLINE abstracts. Bioinformatics, 19, 1699–1706. 72

Ong, I., Glasner, J. & Page, D. (2002). Modelling regulatory pathways in E. coli from time series expression profiles. Bioinformatics, 18, 241–248. 68

Park, D. & Wang, X. (2006). Toward a general framework for microarray data comparison. In Proceedings of the 6th IEEE International Conference on Computer and Information Technology (CIT’06). 21

Pearl, J. (1991). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA, USA. 4, 43, 45, 57

Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA. 51, 57

Pearl, J. & Verma, T. (1991). A theory of inferred causation. In Proceedings of Knowledge Representation and Reasoning 2, 441–452, Morgan Kaufmann, New York, USA. 50

Peeling, E. & Tucker, A. (2007). Consensus gene regulatory networks: combining multiple microarray gene expression datasets. In AIP Conference Proceedings vol. 940, The 3rd International Symposium on Computational Life Sciences (COMPLIFE 2007). 8

Peeling, E., Tucker, A. & ’t Hoen, P. (2007). Discovery of local regulatory structure from microarray gene expression data using Bayesian networks. In Proceedings of the Annual Workshop on Intelligent Data Analysis in bioMedicine And Pharmacology (IDAMAP). 115

Pe’er, D., Regev, A., Elidan, G. & Friedman, N. (2001). Inferring subnetworks from perturbed expression profiles. In Proceedings of the Ninth International Conference on Intelligent Systems for Molecular Biology (ISMB 2001). 101

Pe’er, D., Tanay, A. & Regev, A. (2006). MinReg: A scalable algorithm for learning parsimonious networks in yeast and mammals. Journal of Machine Learning Research, 7, 167–189. 4, 29, 51, 67

Pennock, D. & Wellman, M. (1999). Graphical representations of consensus belief. In Proceedings of Uncertainty in Artificial Intelligence 15, 531–538, Morgan Kaufmann. 105

Pournara, I. (2005). Reconstructing gene networks by passive and active Bayesian learning. Ph.D. thesis. 58

Pramila, T., Miles, S., GuhaThakurta, D., Jemiolo, D. & Breeden, L. (2002). Conserved homeodomain proteins interact with MADS box protein Mcm1 to restrict ECB-dependent transcription to the M/G1 phase of the cell cycle. Genes and Development, 16, 3034–3045.

Pramila, T., Wu, W., Miles, S., Noble, W.S. & Breeden, L.L. (2006). The forkhead transcription factor Hcm1 regulates segregation genes and fills the S-phase gap in the transcriptional circuitry of the cell cycle. Genes and Development, 20, 2266–2278. 84

Quackenbush, J. (2001). Computational analysis of microarray data. Nature Reviews Genetics, 2, 418–427. 17

Quackenbush, J. (2002). Microarray data normalization and transformation. Nature Genetics, 32. 20

Quillardet, P., Rouffaud, M. & Bouige, P. (2003). DNA array analysis of gene expression in response to UV irradiation in Escherichia coli. Research in Microbiology, 154, 559–572. 112

Quinlan, J. & Cameron-Jones, R. (1993). FOIL: A midterm report. In P. Brazdil, ed., Proceedings of the 6th European Conference on Machine Learning (LNAI 667), 3–20, Springer-Verlag. 161

Redestig, H., Weicht, D., Selbig, J. & Hannah, M.A. (2007). Transcription factor target prediction using multiple short expression time series from Arabidopsis thaliana. BMC Bioinformatics, 8. 33

Rhodes, D., Barrette, T., Rubin, M., Ghosh, D. & Chinnaiyan, A. (2002). Meta-analysis of microarrays: Inter-study validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Research, 62, 4427–4433. 32

Saito, R., Suzuki, H. & Hayashizaki, Y. (2003). Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics, 19, 756–763. 35

Salgado, H., Gama-Castro, S., Peralta-Gil, M., Diaz-Peredo, E., Sanchez-Solano, F., Santos-Zavaleta, A., Martinez-Flores, I., Jimenez-Jacinto, V., Bonavides-Martinez, C., Segura-Salazar, J., Martinez-Antonio, A. & Collado-Vides, J. (2006). RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Research, 34. 62, 88, 107

Sangurdekar, D., Srienc, F. & Khodursky, A. (2006). A classification based framework for quantitative description of large-scale microarray data. Genome Biology, 7, R32. 88

Schapire, R.E. (2003). The boosting approach to machine learning: An overview. In D.D. Denison, M.H. Hansen, C. Holmes, B. Mallick & B. Yu, eds., Nonlinear Estimation and Classification, Springer. 137

Schlitt, T. & Brazma, A. (2007). Current approaches to gene regulatory network modelling. BMC Bioinformatics, 8. 27

Schuemie, M., Chichester, C., Lisacek, F., Coute, Y., Roes, P., Sanchez, J., Kors, J. & Mons, B. (2007a). Assignment of protein function and discovery of novel nucleolar proteins based on automatic analysis of MEDLINE. Proteomics, 7, 921–931. 38, 70, 77

Schuemie, M., Jelier, R. & Kors, J. (2007b). Peregrine: Lightweight gene name normalization by dictionary lookup. In Proceedings of the Biocreative 2 workshop. 77

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. 57

Segal, E., Shapira, M., Pe’er, D., Botstein, D., Koller, D. & Friedman, N. (2003a). Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics, 34, 168–176. 4, 27, 29, 39, 65, 67

Segal, E., Wang, H. & Koller, D. (2003b). Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics, 19, 264–272. 39

Segal, E., Yelensky, R. & Koller, D. (2003c). Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics, 19, 273–282. 39

Segal, E., Pe’er, D., Regev, A., Koller, D. & Friedman, N. (2005). Learning module networks. Journal of Machine Learning Research, 6, 557–588. 67

SGD-project (2008). Saccharomyces genome database. http://www.yeastgenome.org/. 30

Shachter, R. (1986). Evaluating influence diagrams. Operations Research, 34, 871–882. 106

Shi, Y., Mitchell, T. & Bar-Joseph, Z. (2007). Inferring pairwise regulatory relationships from multiple time series datasets. Bioinformatics, 23, 755–763. 33

Shmulevich, I., Dougherty, E., Kim, S. & Zhang, W. (2002). Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics, 18, 261–274. 29

Smyth, G. & Speed, T. (2003). Normalization of cDNA microarray data. Methods, 31, 265–273. 18

Somorjai, R., Dolenko, B. & Baumgartner, R. (2003). Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics, 19, 1484–1491. 4

Spellman, P., Sherlock, G., Zhang, M., Iyer, V., Anders, K., Eisen, M., Brown, P., Botstein, D. & Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297. 26, 66, 68, 83, 198

Spirtes, P., Glymour, C. & Scheines, R. (2000). Causation, Prediction, and Search (2nd ed.). MIT Press, New York, N.Y, USA. 57

Stapley, B. & Benoit, G. (2000). Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in MEDLINE abstracts. In Proceedings of the Pacific Symposium on Biocomputing. 37

Steele, E. & Tucker, A. (2008). Consensus and meta-analysis regulatory networks for combining multiple microarray gene expression datasets. Journal of Biomedical Informatics, 41, 914–926. 8

Sterrenburg, E., van der Wees, C., White, S., Turk, R., de Menezes, R., van Ommen, G., den Dunnen, J. & ’t Hoen, P. (2006). Gene expression profiling highlights defective myogenesis in DMD patients and a possible role for bone morphogenetic protein 4. Neurobiology of Disease, 23, 228–236. 92

Stoica, P. (2004). On information criteria and the generalized likelihood ratio test of model order selection. IEEE Signal Processing Letters, 11. 58

Sutton, A., Abrams, K., Jones, D., Sheldon, T. & Song, F. (2000). Methods for Meta-Analysis in Medical Research. Wiley, Chichester, UK. 32, 103, 105

Suwannaroj, S. & Niranjan, M. (2008). Enhancing automatic construction of gene subnetworks by integrating multiple sources of information. Journal of Signal Processing Systems, 50, 331–340. 40

Tan, P.K., Downey, T.J., Spencer Jr, E.L., Xu, P., Fu, D., Dimitrov, D.S., Lempicki, R.A., Raaka, B.M. & Cam, M.C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31, 5676–5684. 4, 31, 70

Teixeira, M., Monteiro, P., Jain, P., Tenreiro, S., Fernandes, A., Mira, N., Alenquer, M., Freitas, A., Oliveira, A. & Sá-Correia, I. (2006). The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic Acids Research, 34, D446–D451. 62, 83, 107, 139, 198

The Gene Ontology Consortium (2000). Gene ontology: tool for the unification of biology. Nature Genetics, 25, 25–29. 36

Twyman, R. (2003). Gene expression. Published online by the Wellcome Trust. 2, 10, 12

Van den Bulcke, T., Leemput, K.V., Naudts, B., van Remortel, P., Ma, H., Verschoren, A., Moor, B.D. & Marchal, K. (2006). SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics, 7. 167

van der Eijk, C.C., van Mulligen, E.M., Kors, J.A., Mons, B. & van den Berg, J. (2004). Constructing an associative concept space for literature-based discovery. Journal of the American Society for Information Science and Technology, 55, 436–444. 37

Vohradsky, J. (2001). Neural model of the genetic network. The Journal of Biological Chemistry, 276, 36168–36173. 2

Wang, Y., Joshi, T., Zhang, X., Xu, D. & Chen, L. (2006). Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics, 22, 2413–2420. 33, 106, 116, 119

Werhli, A. & Husmeier, D. (2007). Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Statistical Applications in Genetics and Molecular Biology, 6. 40, 72

Werhli, A., Grzegorczyk, M. & Husmeier, D. (2006). Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics, 22, 2523–2531. 58

Wernisch, L. (2002). Can replication save noisy microarray data? Comparative and Functional Genomics, 3, 372–374. 31

Wickham, H. (2004). Normalisation and visualisation of microarrays. Master's thesis, University of Auckland, New Zealand. 20

Woo, Y., Affourtit, J., Daigle, S., Viale, A., Johnson, K., Naggert, J. & Churchill, G. (2004). A comparison of cDNA, oligonucleotide, and Affymetrix GeneChip gene expression microarray platforms. Journal of Biomolecular Techniques, 15, 276–284. 31

Workman, C., Jensen, L., Jarmer, H., Berka, R., Gautier, L., Nielsen, H., Saxild, H., Nielsen, C., Brunak, S. & Knudsen, S. (2002). A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biology, 3. 20

Xu, X., Wang, L. & Ding, D. (2004). Learning module networks from genome- wide location and expression data. FEBS Letters, 578, 297–304. 39

Yan, X., Mehan, M., Huang, Y., Waterman, M., Yu, P. & Zhou, X. (2007). A graph-based approach to systematically reconstruct human transcriptional regulatory modules. Bioinformatics, 23, i577–i586. 32

Yang, Y., Buckley, M. & Speed, T. (2001). Analysis of cDNA microarray images. Briefings in Bioinformatics, 2, 341–349. 16

Yauk, C., Berndt, M., Williams, A. & Douglas, G. (2004). Comprehensive comparison of six microarray technologies. Nucleic Acids Research, 32. 32

Yu, H., Luscombe, N., Qian, J. & Gerstein, M. (2003). Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends in Genetics, 19, 422–427. 68

Yuen, T., Wurmbach, E., Pfeffer, R., Ebersole, B. & Sealfon, S. (2002). Accuracy and calibration of commercial oligonucleotide and custom cDNA microarrays. Nucleic Acids Research, 30. 31

Zhu, G., Spellman, P., Volpe, T., Brown, P., Botstein, D., Davis, T. & Futcher, B. (2000). Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth. Nature, 406, 90–94.

Zou, M. & Conzen, S. (2005). A new dynamic Bayesian network approach for identifying gene regulatory networks from time course microarray data. Bioinformatics, 21, 71–79. 26, 68
