Gene Regulatory Network Inference Using Machine Learning Techniques
GENE REGULATORY NETWORK INFERENCE USING MACHINE LEARNING TECHNIQUES

Stephanie Kamgnia Wonkap

A Thesis in the Department of Computer Science and Software Engineering

Presented in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy (Computer Science)
Concordia University
Montréal, Québec, Canada

August 26 2020

© Stephanie Kamgnia Wonkap, 2020

Concordia University
School of Graduate Studies

This is to certify that the thesis prepared

By: Miss. Stephanie Kamgnia Wonkap
Entitled: Gene Regulatory Network Inference using Machine Learning Techniques

and submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science) complies with the regulations of this University and meets the accepted standards with respect to originality and quality.

Signed by the final examining committee:

Chair: Dr. Liangzhu Wang
External Examiner: Dr. Mathieu Blanchette
Examiner: Dr. Leila Kosseim
Examiner: Dr. Malcolm Whiteway
Examiner: Dr. Volker Haarslev
Supervisor: Dr. Gregory Butler

Approved: Dr. Leila Kosseim, Graduate Program Director
Date of Defence: August 26th, 2020
Dr. Amir Asif, Dean, Faculty of Engineering and Computer Science

Abstract

Gene Regulatory Network Inference using Machine Learning Techniques
Stephanie Kamgnia Wonkap, Ph.D.
Concordia University, 2020

Systems Biology is a field that models complex biological systems in order to better understand the working of cells and organisms. One of the systems modeled is the gene regulatory network that plays the critical role of controlling an organism's response to changes in its environment. Ideally, we would like a model of the complete gene regulatory network. In recent years, several advances in technology have permitted the collection of an unprecedented amount and variety of data such as genomes, gene expression data, time-series data, and perturbation data.
This has stimulated research into computational methods that reconstruct, or infer, models of the gene regulatory network from the data. Many solutions have been proposed, yet there remain open challenges in utilising the range of available data, as it is inherently noisy and must be integrated by the inference techniques. The thesis seeks to contribute to this discourse by investigating challenges of performance, scale, and data integration.

We propose a new algorithm, BENIN, that views network inference as feature selection to address issues of scale, uses elastic net regression for improved performance, and adapts elastic net to integrate different types of biological data. The BENIN algorithm is benchmarked on a synthetic dataset from the DREAM4 challenge, and on real expression data for the human HeLa cell cycle. On the DREAM4 dataset, BENIN outperformed all DREAM4 competitors on the size 100 subchallenge, and is also competitive with more recent state-of-the-art methods. Moreover, on the HeLa cell cycle data, BENIN could infer known regulatory interactions and propose new interactions that warrant further experimental investigation.

Keywords: gene regulatory network, network inference, feature selection, elastic net regression.

Acknowledgments

First of all, I would like to thank my academic supervisor, Dr. Gregory Butler, for his precious advice that helped me accomplish this project and become a better researcher. Moreover, I would like to thank him for providing me with financial support.

I would like to thank my mother, Bernadette, and my father, Emmanuel, for showing me the path through my Ph.D. Your prayers, love, and support helped me get through these tough times. Thank you for the education that made me the strong woman I am today. I would like to thank my sister Nathalie, who has been my role model since our childhood. I made it: I am a doctor like you.
Thank you to my sister Helene, who was always there; our phone calls and our discussions helped during all these years. I will not forget my sister Diana, my nephews, my older brother Armand, and my two little brothers Joan and Gracien. Every one of you has played an essential role in this journey.

To my beloved husband Samir: you were my rock through this journey. Thank you for having my back, and for being such a good confidant and my motivation. I think that if you had not been there, I would not have been able to accomplish this. I thank God for having you in my life.

Last but not least, I thank my colleagues from office 11.411. I will never forget all the good times: the sushi times, our potluck dinners, our late discussions, and all our fun times. I will miss these times and I wish you all the best in life.

Contents

List of Figures
List of Tables
List of Terms and Abbreviations

1 Introduction
  1.1 Gene Regulation
  1.2 Gene Regulatory Network
  1.3 Problem Statement
  1.4 Motivation
  1.5 Challenge in Gene Regulatory Network Inference
  1.6 Limitation of State-of-the-Art
  1.7 Contribution
  1.8 Organization of the Thesis

2 Background
  2.1 Background for Network Inference
  2.2 Feature Selection
  2.3 Resources Available for Network Inference
  2.4 Assessment and Validation of Network Inference
  2.5 Computational Methods
  2.6 Conclusion

3 BENIN
  3.1 The BENIN Algorithm
  3.2 Experimental Validation
  3.3 Computational Complexity
  3.4 Results and Discussion
  3.5 Conclusion

4 BENIN: Application to the HeLa Cell cycle
  4.1 Introduction
  4.2 Background
  4.3 Building a gold-standard
  4.4 Material
  4.5 Method
  4.6 Results and Discussion
  4.7 Conclusion

5 Conclusion
  5.1 Recap
  5.2 Contributions
  5.3 Limitations
  5.4 Future Work

Bibliography

A Background
  A.1 IUPAC degenerate base symbols

B BENIN
  B.1 BENIN parameters setting
  B.2 BENIN results

C BENIN: Application to Human HeLa Cell Cycle GRN
  C.1 Data
  C.2 Other

List of Figures

1 Organization of an operon in prokaryotes
2 Tryptophan regulation in E. coli
3 Eukaryotic gene structure
4 Chromatin in eukaryotic cells
5 Gene regulatory network abstraction
6 Different representations of binding sites
7 DNA microarray experiment
8 RNA-seq experiment
9 Confusion matrix
10 Procedure to identify regulon
11 Step for regulatory network inference
12 Example DREAM4 Input for BENIN
13 Effect of the noise in location data
14 Influence of BENIN parameters
15 A subnetwork from 100-nodes network 4
16 The Eukaryotic cell cycle
17 From nucleus to DNA sequence
18 Steps for retrieving knockdown data
19 Steps for collecting promoter sequences
20 Steps for collecting protein sequences
22 Snapshot of FIMO output
24 Inference of GRN controlling HeLa cell cycle through BENIN
21 BED file and BETA-minus output
23 Differential Expression analysis output
25 Effect of τ on BENIN performance
26 Precision-recall curves for BENIN
27 ROC curves for BENIN
28 Precision-recall curves for BENIN +orthology
29 ROC curves for BENIN +orthology
30 Orthologous Regulatory Network From mouse
31 Edge Distribution
32 Global score Distribution for the DREAM4 size 100 subchallenge
33 Global score Distribution for the DREAM4 size 10 subchallenge

List of Tables

1 Motifs Finding Methods
2 Reverse-Engineering Methods
3 Description of DREAM4 size 10 and size 100 networks
4 Motifs and errors type
5 BENIN execution time on the DREAM4
6 DREAM4 size 100 performance with KO expression
7 DREAM4 size 100 performance with Location data
8 Global score on the DREAM4 size 100 subchallenge
9 DREAM4 size 10 performance with Location data
10 DREAM4 size 10 performance with KO expression
11 Global score on the DREAM4 size 10 subchallenge
12 Motif prediction confidence (median rank)
13 Cell cycle Transcription Factors
14 Characteristics of our Human "gold-standard" network
15 Missing transcription factors in our "gold-standard network"
16 Mouse gene regulatory network
17 BENIN execution time
18 BENIN performance
19 Transcription factor and target gene
20 Inference from BENIN +combined+max
21 Inference from BENIN +orthology
22 List of Degenerate IUPAC base symbols
23 BENIN General Parameter setting
24 BENIN +KO parameters on size 100 subchallenge
25 BENIN +Location parameters setting on size 100 subchallenge
26 BENIN +KO parameters on size 10 subchallenge
27 BENIN +Location data parameters on size 10 subchallenge
28 List of HeLa Peak Files
29 List of knockdown datasets
30 Information Motif and Transcription Factor
31 Cell cycle genes
32 Cell Cycle Transcription factor
33 Knockdown Data from Gene Expression Omnibus
34 Edges repetition in networks from HumanBase
35 Edges repetition in Garcia networks
36 HeLa "gold-standard" network - Positive links
37 HeLa "gold-standard" network - Negative links
38 Duplicate regulatory.