Computational Methods for Cis-Regulatory Module Discovery
A thesis presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Master of Science

Xiaoyu Liang
November 2010

© 2010 Xiaoyu Liang. All Rights Reserved.

This thesis titled Computational Methods for Cis-Regulatory Module Discovery by XIAOYU LIANG has been approved for the School of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

Lonnie R. Welch
Professor of Electrical Engineering and Computer Science

Dennis Irwin
Dean, Russ College of Engineering and Technology

ABSTRACT

LIANG, XIAOYU, M.S., November 2010, Computer Science
Computational Methods for Cis-Regulatory Module Discovery
Director of Thesis: Lonnie R. Welch

In a gene regulation network, the binding of a transcription factor to a short region of non-coding sequence is believed to be the key event that triggers or represses gene expression. Further analysis has revealed that, in higher organisms, multiple transcription factors work together and bind multiple sites located near one another in the genomic sequence, rather than working alone and binding a single anchor. These groups of binding sites in the non-coding region are called cis-regulatory modules. Identifying cis-regulatory modules is important for modeling gene regulation networks. In this thesis, two methods are proposed for addressing this problem, and a widely accepted evaluation is applied to assess their performance. Additionally, two practical case studies were completed and are reported as applications of the proposed methods.

Approved: Lonnie R. Welch, Professor of Electrical Engineering and Computer Science

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my advisor, Dr. Lonnie R.
Welch, for introducing me to the area of bioinformatics, and for his guidance and encouragement on this thesis and my research projects.

I would like to extend my deep gratitude to Dr. Frank Drews, Dr. Sarah Wyatt, and Dr. Razvan Bunescu for all their help and advice on my research, and for serving as my thesis committee members. I especially thank Dr. Razvan Bunescu for teaching me machine learning and guiding me in developing the HAC algorithm that is proposed in this thesis.

I would also like to thank Dr. Klaus Ecker for participating in my research projects, and for his valuable advice, discussion, and friendly help.

I would also like to thank past and present members of the Bioinformatics group, including Jens Litchenberg, Kyle Kurz, Rami Alouran, Lee Nau, Joshua Welch, Daniel Evans, Kaiyu Shen, Paul Burns, Lev Neimen, Zekai Huang, Eric Petri, Nathaniel George, and Josiah Seaman, for all their help and collaboration. Special thanks to Jens Litchenberg for all his advice and encouragement, and for being a wonderful landlord.

I would like to thank Dr. Wyatt and her lab again for providing the data and initial analysis for the plant gravitropic signal pathway study, which is presented as one of the case studies in this thesis.

I would also like to thank Dr. Eric Grotewold and Dr. Alper Yilmaz for providing a priceless internship opportunity, and for cooperating on the case study comparing rice and Arabidopsis thaliana, which is presented in this thesis. I learned a lot about both theory and practical applications, and really enjoyed the experience of working in their lab.

I deeply thank my parents for always being supportive, and for their love and understanding.

I would also like to thank Stephane J. Litchenberg for correcting my grammatical errors and for sharing her friendship.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
  1.1 Background
  1.2 Problem Statement
  1.3 Overview of Thesis
CHAPTER 2: LITERATURE REVIEW
  2.1 With Prior Knowledge of TFBSs
  2.2 Without Prior Knowledge of TFBSs
CHAPTER 3: PROBLEM DEFINITION
  3.1 Introduction
  3.2 Terminologies
CHAPTER 4: WORDSEEKER
  4.1 WordSeeker Design
  4.2 Integration with WordSeeker
CHAPTER 5: ALGORITHMIC APPROACHES FOR MODULE DISCOVERY
  5.1 Enumerative Module Discovery Method
    5.1.1 Motivation
    5.1.2 Algorithm
    5.1.3 Complexity Analysis
  5.2 Hierarchical Agglomerative Clustering (HAC) Module Discovery Method
    5.2.1 Motivation
    5.2.2 Algorithm
    5.2.3 Complexity Analysis
CHAPTER 6: EVALUATION
  6.1 Benchmark Datasets
    6.1.1 TRANSCompel
    6.1.2 Muscle Dataset
    6.1.3 Liver Dataset
  6.2 Evaluation of Scoring Functions
    6.2.1 AP1-NFAT
    6.2.2 AP1-Ets
    6.2.3 AP1-NFkappaB
    6.2.4 CEBP-NFkappaB
    6.2.5 Ebox-Ets
    6.2.6 Ets-AML
    6.2.7 IRF-NFkappaB
    6.2.8 NFkappaB-HMGIY
    6.2.9 PU1-IRF
    6.2.10 Sp1-Ets
    6.2.11 Discussion
  6.3 Evaluation of Prediction Performance
    6.3.1 Statistical Values
    6.3.2 Statistical Formulas
    6.3.3 Results and Discussion
CHAPTER 7: CASE STUDY
  7.1 Plant Gravitropic Signal