Computational Methods for Cis-Regulatory Module Discovery
A thesis presented to
the faculty of the Russ College of Engineering and Technology of Ohio University
In partial fulfillment
of the requirements for the degree
Master of Science
Xiaoyu Liang
November 2010
© 2010 Xiaoyu Liang. All Rights Reserved.
This thesis titled
Computational Methods for Cis-Regulatory Module Discovery
by
XIAOYU LIANG
has been approved for
the School of Electrical Engineering and Computer Science
and the Russ College of Engineering and Technology by
Lonnie R. Welch
Professor of Electrical Engineering and Computer Science
Dennis Irwin
Dean, Russ College of Engineering and Technology
ABSTRACT
LIANG, XIAOYU, M.S., November 2010, Computer Science
Computational Methods for Cis-regulatory Module Discovery
Director of Thesis: Lonnie R. Welch
In a gene regulatory network, the binding of a transcription factor to a short region of non-coding sequence is believed to be the key event that triggers or represses gene expression. Further analysis has revealed that, in higher organisms, multiple transcription factors work together and bind multiple sites located near one another in the genomic sequence, rather than working alone and binding a single anchor. These groups of binding sites in the non-coding region are called cis-regulatory modules. Identifying cis-regulatory modules is important for modeling gene regulatory networks. In this thesis, two methods are proposed for addressing this problem, and a widely accepted evaluation is applied to assess their performance. Additionally, two practical case studies that apply the proposed methods are reported.
Approved: ______
Lonnie R. Welch
Professor of Electrical Engineering and Computer Science
ACKNOWLEDGEMENTS
I would like to express my sincere gratitude to my advisor, Dr. Lonnie R. Welch, for introducing me to the area of Bioinformatics, and for his guidance and encouragement on this thesis and my research projects.
I would like to extend my deep gratitude to Dr. Frank Drews, Dr. Sarah Wyatt, and Dr. Razvan Bunescu for all their help and advice on my research, and for serving as my thesis committee members. I especially thank Dr. Razvan Bunescu for teaching me machine learning and guiding me in developing the HAC algorithm that is proposed in this thesis. I would also like to thank Dr. Klaus Ecker for participating in my research projects, and for his valuable advice, discussion, and friendly help.
I would also like to thank past and present members of the Bioinformatics group, including Jens Litchenberg, Kyle Kurz, Rami Alouran, Lee Nau, Joshua Welch, Daniel Evans, Kaiyu Shen, Paul Burns, Lev Neimen, Zekai Huang, Eric Petri, Nathaniel George, and Josiah Seaman, for all your help and collaboration. Special thanks to Jens Litchenberg for all his advice and encouragement, and for being a wonderful landlord.
I would like to thank Dr. Wyatt and her lab again for providing data and the initial analysis for the plant gravitropic signal pathway study, which is presented as one of the case studies in this thesis.
I would also like to thank Dr. Eric Grotewold and Dr. Alper Yilmaz for providing a priceless internship opportunity, and for cooperating in the case study comparing rice and Arabidopsis thaliana, which is presented in this thesis. I have learned a lot about both theory and practical applications, and really enjoyed the experience of working in their lab.
I thank my parents deeply for always being supportive, and for their love and understanding.
I would also like to thank Stephane J. Litchenberg for correcting my grammatical errors and sharing friendship.
TABLE OF CONTENTS
Page
ABSTRACT ...... 3
ACKNOWLEDGEMENTS ...... 4
LIST OF TABLES ...... 10
LIST OF FIGURES ...... 12
LIST OF ABBREVIATIONS ...... 14
CHAPTER 1: INTRODUCTION ...... 15
1.1 Background ...... 15
1.2 Problem Statement ...... 17
1.3 Overview of Thesis ...... 19
CHAPTER 2: LITERATURE REVIEW ...... 20
2.1 With Prior Knowledge of TFBSs...... 21
2.2 Without Prior Knowledge of TFBSs ...... 25
CHAPTER 3: PROBLEM DEFINITION...... 28
3.1 Introduction ...... 28
3.2 Terminologies ...... 28
CHAPTER 4: WORDSEEKER ...... 35
4.1 WordSeeker Design ...... 35
4.2 Integration with WordSeeker ...... 38
CHAPTER 5: ALGORITHMIC APPROACHES FOR MODULE DISCOVERY ...... 40
5.1 Enumerative Module Discovery Method ...... 41
5.1.1 Motivation ...... 41
5.1.2 Algorithm ...... 43
5.1.3 Complexity Analysis ...... 50
5.2 Hierarchical Agglomerative Clustering (HAC) Module Discovery Method ...... 52
5.2.1 Motivation ...... 52
5.2.2 Algorithm ...... 52
5.2.3 Complexity Analysis ...... 58
CHAPTER 6: EVALUATION ...... 59
6.1 Benchmark Datasets ...... 59
6.1.1 TRANSCompel ...... 60
6.1.2 Muscle Dataset ...... 61
6.1.3 Liver Dataset ...... 62
6.2 Evaluation of Scoring Functions...... 63
6.2.1 AP1-NFAT ...... 64
6.2.2 AP1-Ets ...... 70
6.2.3 AP1-NFkappaB ...... 75
6.2.4 CEBP-NFkappaB ...... 79
6.2.5 Ebox-Ets ...... 82
6.2.6 Ets-AML ...... 85
6.2.7 IRF-NFkappaB ...... 91
6.2.8 NFkappaB-HMGIY ...... 97
6.2.9 PU1-IRF ...... 97
6.2.10 Sp1-Ets ...... 102
6.2.11 Discussion ...... 107
6.3 Evaluation of Prediction Performance ...... 108
6.3.1 Statistical Values ...... 108
6.3.2 Statistical Formulas ...... 110
6.3.3 Results and Discussion ...... 111
CHAPTER 7: CASE STUDY...... 121
7.1 Plant Gravitropic Signal Transduction ...... 121
7.1.1 Introduction ...... 121
7.1.2 Methods and Results ...... 122
7.1.3 Conclusions ...... 130
7.2 Comparison of Rice and Arabidopsis Thaliana ...... 130
7.2.1 Introduction ...... 131
7.2.2 Methods and Results ...... 131
7.2.3 Conclusion ...... 140
CHAPTER 8: CONCLUSIONS ...... 141
8.1 Summary of results ...... 141
8.2 Future Work ...... 141
REFERENCES ...... 143
APPENDIX 1: TRANSCOMPEL BENCHMARK DATASET ...... 150
APPENDIX 2: MUSCLE BENCHMARK DATASET...... 155
APPENDIX 3: LIVER BENCHMARK DATASET ...... 157
APPENDIX 4: RESULT FROM FOR MUSCLE DATASET ...... 158
APPENDIX 5: GO ANALYSIS RESULT FOR NINE CLUSTERS OF GENES ...... 166
APPENDIX 6: GO ANALYSIS RESULT FOR THE WHOLE GENE LIST ...... 169
LIST OF TABLES
Page
Table 1: The summarization of benchmark datasets ...... 60
Table 2: Detailed information of TRANSCompel benchmark data ...... 61
Table 3: Detailed information of Muscle benchmark dataset ...... 62
Table 4: Detailed information of Liver benchmark dataset ...... 63
Table 5: Result summarization for AP1-NFAT dataset ...... 65
Table 6: Detailed result of AP1-NFAT dataset ...... 66
Table 7: Result summarization for AP1-Ets dataset ...... 70
Table 8: Detailed result of AP1-Ets dataset ...... 71
Table 9: Result summarization for AP1-NFkappaB dataset ...... 76
Table 10: Detailed result of AP1-NFkappaB dataset ...... 76
Table 11: Result summarization for CEBP-NFkappaB dataset ...... 80
Table 12: Detailed result of CEBP-NFkappaB dataset ...... 80
Table 13: Result summarization for Ebox-Ets dataset ...... 82
Table 14: Detailed result of Ebox-Ets dataset ...... 83
Table 15: Result summarization for Ets-AML dataset ...... 85
Table 16: Detailed result of Ets-AML dataset ...... 86
Table 17: Result summarization for IRF-NFkappaB dataset ...... 91
Table 18: Detailed result of IRF-NFkappaB dataset ...... 92
Table 19: Result summarization for NFkappaB-HMGIY dataset ...... 97
Table 20: Result summarization for PU1-IRF dataset ...... 98
Table 21: Detailed result of PU1-IRF dataset ...... 98
Table 22: Result summarization for Sp1-Ets dataset ...... 103
Table 23: Detailed result of Sp1-Ets dataset ...... 103
Table 24: Result of Enumerative Method for Muscle Dataset ...... 113
Table 25: Result of Enumerative Method for Liver Dataset ...... 115
Table 26: Result of HAC Method of Muscle Dataset ...... 116
Table 27: Result of HAC Method of Liver Dataset ...... 117
Table 28: Significant Words ...... 124
Table 29: Significant Modules ...... 126
Table 30: AGRIS Look-up results ...... 128
Table 31: TRANSFAC Look-up results ...... 129
Table 32: Common Words Shared by Arabidopsis and Rice ...... 133
Table 33: TRANSFAC Look-up results ...... 134
Table 34: AGRIS Functional Look-up Result for Known Motif AGAA[ACGT] ...... 134
Table 35: Top 25 modules of Arabidopsis thaliana ...... 138
Table 36: Common Modules shared by Arabidopsis thaliana and rice ...... 139
Table 37: Distance distribution map of common modules ...... 139
Table 38: Density distribution map of common modules ...... 139
LIST OF FIGURES
Page
Figure 1: An example of Span ...... 29
Figure 2: An example of the number of occurrence times ...... 31
Figure 3 : Example 3...... 32
Figure 4 : Architecture of WordSeeker Pipeline...... 39
Figure 5 : An observed example of words patterns...... 42
Figure 6 : Flow chart of the enumerative module discovery method...... 43
Figure 7 : An example of a PWM presented motif...... 44
Figure 8 : Pseudo code for the function of PWMs read and filter...... 45
Figure 9 : The pseudo code for the function of enumerating combinations...... 46
Figure 10: An example for describing cog. ...... 56
Figure 11 : Word description of HAC algorithm...... 57
Figure 12 : High level pseudo code for HAC algorithm...... 57
Figure 13 : Visualized result of AP1-NFAT for Markov order of 0...... 67
Figure 14 : Visualized result of AP1-NFAT of Markov order of 1...... 68
Figure 15 : Visualized result of AP1-NFAT of Markov order of 2...... 69
Figure 16 : Visualized result of AP1-Ets of Markov order of 0...... 72
Figure 17 : Visualized result of AP1-Ets of Markov order of 1...... 73
Figure 18 : Visualized result of AP1-Ets of Markov order of 2...... 74
Figure 19 : Visualized result of AP1-NFkappaB of Markov order of 0...... 77
Figure 20 : Visualized result of AP1-NFkappaB of Markov order of 1...... 78
Figure 21 : Visualized result of CEBP-NFkappaB of Markov order of 0...... 81
Figure 22 : Visualized result of Ebox-Ets of Markov order of 0...... 84
Figure 23 : Visualized result of Ets-AML of Markov order of 0...... 87
Figure 24 : Visualized result of Ets-AML of Markov order of 1...... 88
Figure 25 : Visualized result of Ets-AML of Markov order of 2...... 89
Figure 26 : Visualized result of Ets-AML of Markov order of 3...... 90
Figure 27 : Visualized result of IRF-NFkappaB of Markov order of 0...... 93
Figure 28 : Visualized result of IRF-NFkappaB of Markov order of 1...... 94
Figure 29 : Visualized result of IRF-NFkappaB of Markov order of 2...... 95
Figure 30 : Visualized result of IRF-NFkappaB of Markov order of 3...... 96
Figure 31 : Visualized result of PU1-IRF of Markov order of 0...... 99
Figure 32 : Visualized result of PU1-IRF of Markov order of 1...... 100
Figure 33 : Visualized result of PU1-IRF of Markov order of 2...... 101
Figure 34 : Visualized result of Sp1-Ets of Markov order of 0...... 104
Figure 35 : Visualized result of Sp1-Ets of Markov order of 1...... 105
Figure 36 : Visualized result of Sp1-Ets of Markov order of 2...... 106
Figure 37 : The relationship between reality and prediction...... 110
Figure 38 : Measuring Formulas ...... 111
Figure 39 : Evaluation result of the muscle benchmark dataset...... 118
Figure 40 : Evaluation result of the liver benchmark dataset...... 119
Figure 41 : Motifs Comparison for Rice and Arabidopsis thaliana...... 136
LIST OF ABBREVIATIONS
GRN -- Gene Regulatory Network
TFs -- Transcription Factors
TSSs -- Transcription Start Sites
TFBSs -- Transcription Factor Binding Sites
TFBMs -- Transcription Factor Binding Modules
CRMs -- Cis-Regulatory Modules
PWMs -- Position Weight Matrices
HMM -- Hidden Markov Model
EM -- Expectation Maximization
HAC -- Hierarchical Agglomerative Clustering
TP -- True Positive
FP -- False Positive
TN -- True Negative
FN -- False Negative
Sn -- Sensitivity
Sp -- Specificity
PPV -- Positive Predictive Value
ASP -- Average Site Performance
PC -- Performance Coefficient
CC -- Correlation Coefficient
CHAPTER 1: INTRODUCTION
1.1 Background
A Gene Regulatory Network (GRN) is a complicated system which controls gene expression. It is the functional network that decides which gene gets expressed in each cell, and in each stage of development and differentiation. Several phases are involved in this control, and most tasks are completed via transcription controls [1]. Obviously, understanding this regulatory mechanism will not only lead to a higher level comprehension of life, but also benefit the development of disease treatments and other industries related to biology.
Over the last several decades, scientists have been trying to understand the GRN. With advances in biology, especially molecular biology, an unprecedented amount of knowledge has accumulated in this realm. It is known that gene expression is triggered by Transcription Factors (TFs) binding to specific regions in the non-coding genomic sequence, which may be located either kilobases away from the promoter region [1] or near the Transcription Start Sites (TSSs) [2]. The specific region that a TF binds, called a Transcription Factor Binding Site (TFBS), is usually a short sequence (≈ 10 bp) and is mostly located upstream of the target gene. TFBSs are also called cis-regulatory elements, in which the term “cis” means that this short DNA region is usually located on the same side as the target gene [3]. The binding of TFs to TFBSs is therefore considered the central process of the gene regulation system. To understand gene regulation mechanisms, it is necessary to simulate or model this whole pathway. Therefore, detecting and identifying TFBSs is the entry point to understanding the whole system.
Numerous software programs for motif discovery have been conceived and developed to detect single TFBSs. However, in most organisms, especially higher organisms, TFs do not work alone; they usually work together with nearby TFs, binding multiple TFBSs for gene regulation [4].
The number of these groups of TFBSs, which are called Transcription Factor Binding Modules (TFBMs) or Cis-Regulatory Modules (CRMs), is estimated at ten times the number of genes [3]. Clearly, these CRMs are a basic necessity for modeling the gene regulation network. Several common features shared by CRMs have been analyzed and reported, such as the distance between single binding sites inside the module, module span, density of binding sites, number of individual binding sites contained in a module, and GC content [2, 3, 5, 6, 7, 8, 9]. Even though more features of CRMs continue to be explored and reported, the key working mechanism of the GRN remains elusive. Nevertheless, although these reported features are not sufficient to unveil the gene regulation mystery, they can be used as a guide for developing computational module discovery tools that identify CRMs within massive background sequences.
1.2 Problem Statement
As stated above, identifying CRMs is necessary for understanding the GRN. The ideal approach to recognizing these functional elements and modules is biological experimentation, which can not only identify but also practically verify them. Unfortunately, biological experiments are time-consuming and demand substantial investments in both materials and manpower. Coupled with the uncertainties involved, this makes it impossible to finish the task by biological experiments alone.
In the past decades, computational methods have made it possible to discover putative CRMs, which can guide subsequent biological experiments and be verified or falsified by them. Thus, a computational method is desirable; nonetheless, it is difficult to conceive an efficient one. A variety of factors contribute to the difficulty, including the lack of a model of the regulation mechanism, the degeneracy of motifs, limited understanding of the functions carried by evolutionarily conserved regions, and the complexity of non-regulatory sequences inside regulatory regions [2].
Hence, the primary problem is how to design and develop computational methods to detect and predict CRMs. Furthermore, the expected method should be not merely a possible solution, but an accurate and effective one. The most straightforward method is to use known TFBSs and cluster them together based on “rules”, which are reported as properties shared by CRMs and are used to measure the similarity among multiple binding sites. However, although more and more features have been reported and continue to be explored, the systematic process of the gene regulatory network is still a mystery. Furthermore, the complete set of TFBSs has not been identified or experimentally verified for any species. Last, a method that depends on known TFBSs leads to severely biased performance because of the impact of the quality of the input Position Weight Matrices (PWMs) that are usually used for representing TFBSs. The detailed reason is explained in the next chapter. Because of these limitations, a method that requires prior knowledge of TFBSs is not a suitable solution and will not be employed in this thesis.
To develop a high-quality approach, several sub-problems have to be stated and solved: how to avoid too many false positive predictions without missing significant clusters; how to deal with insufficient known data; and how to make the most of current knowledge of the gene regulation network. To address this complicated problem, two methods are proposed in this thesis. One employs an enumerative algorithm to report over-represented modules; the other adopts a supervised machine learning method for clustering putative binding sites. Both methods can be executed either with or without taking advantage of known TFBSs. For the first time, the approaches presented in this thesis combine these two feasible solutions (with and without using known TFBSs) and try to achieve the benefits of both.
1.3 Overview of Thesis
This thesis is structured into eight chapters. Chapter one introduces the biological background of this research and the relevance of the problem. Chapter two reviews current solutions in this realm, along with their comparative advantages and disadvantages. Chapter three formally describes the problem that is solved in this thesis. Chapter four describes a genomic analysis toolkit, WordSeeker, from its design to each functional component; WordSeeker is also the host for the two proposed methods. Chapter five describes two approaches for solving the problem. For each approach, the motivation, algorithm, and complexity analysis are explained in order. Chapter six focuses on the assessment of the proposed methods, in which scoring functions and prediction performance are evaluated separately. Chapter seven presents two case studies: (1) identifying functional cis-regulatory elements and modules in Arabidopsis thaliana, and (2) analyzing the similarity pattern between rice (japonica and indica) and Arabidopsis thaliana. Chapter eight summarizes the results and discusses future work.
CHAPTER 2: LITERATURE REVIEW
In this chapter, related work in this field is reviewed, and the comparative advantages and disadvantages are discussed. Traditionally, methods are divided into two categories according to whether they require prior knowledge of known TFBSs or TFs as input. In the following sections, methods for solving the same problem are reviewed based on these two classes.
As stated above, using known motifs or known TFBSs as the input for module discovery is the simplest and most direct solution. Several public databases, such as TRANSFAC [10], AGRIS [11, 12], JASPAR [13], and REDfly [14], provide experimentally verified TFBSs or TFs of different organisms, which can be used as initial input for module discovery tools. With known TFBSs or TFs, the research objective can be focused on the common features among single TFBSs shared by CRMs. However, apart from several tissues or species that have been studied relatively comprehensively, such as liver [7], muscle [15], and Drosophila [16], most TFBSs and the rules of how TFBSs are involved in the gene regulation pathway are barely known for most species. Using known TFBSs might therefore be a suitable solution for a specific tissue or organism, but it is not appropriate for addressing the general problem. Additionally, the degeneracy of motifs and the non-unique correspondence between PWMs and motifs make the problem more complicated. Furthermore, this kind of method requires prior knowledge as input; in other words, it assumes that the problem of detecting TFBSs has been solved. This raises a new problem: the performance of CRM discovery is heavily affected by the quality of the input TFBSs, which is also shown by the benchmark evaluation in chapter six. On the other hand, not using known TFBSs avoids these problems, although it might be less efficient for species that have been intensively studied.
The following two sections review the two classes of tools respectively. Due to the limited space, only the nine tools that are used for comparing the proposed methods are reviewed in detail.
2.1 With Prior Knowledge of TFBSs
All tools reviewed in this section require known TFBSs, or PWMs that represent motifs, as input. In this category, the problem solved is CRM search, or recognition, rather than CRM discovery. Some of them even require as input a list of TFBSs or TFs that are known to work together, and identifying such a list is itself still an open issue. Moreover, most of them employ Hidden Markov Models (HMMs) for detecting putative module regions; however, one drawback of HMMs is that it is hard to control the relations between TFBSs inside each module, which are among the most important features of modules. Also, the optimization of parameters needs training data, the scarcity of which is one of the primary limitations in this realm. Furthermore, user-defined parameters can easily cause over-fitting, so that the methods are not suitable for general problems. In the following, each paragraph gives a general description of one method.
Motif Cluster Alignment Search Tool (MCAST) [17]: The algorithm requires as input a database of DNA sequences and a set of motifs that are believed to work together in regulating target genes. Building on an established HMM algorithm, Meta-MEME [18], the algorithm recognizes CRMs by searching for the corresponding matches within the sequences, and outputs predicted CRMs together with their scores. The whole method is based on an alignment strategy and a novel scoring function that handles matches and gaps by giving them appropriate rewards or penalties. As the authors themselves state, the algorithm performs a “database search task” (p. ii17) rather than module discovery.
Stubb [19]: Stubb requires as input a set of cooperating TFs represented as PWMs, and outputs a list of regions with scores that reflect the likelihood of each region being a module. The algorithm employs HMMs to exploit the statistical features of generating specific sequences from the input TFs, and uses the Expectation Maximization (EM) algorithm to determine the best parameters. Besides the space and order conservation that are used as features of CRMs, phylogenetic comparisons are considered and explored as well. Although alignments among organisms that are close in phylogenetic distance are popular in the area of detecting functional elements, how to deal with the matches in the conserved regions is still an open question. Furthermore, the EM algorithm requires training data to help decide the best parameter settings; thus, insufficient training data is a main issue for this supervised method.
Cister [20]: Cister is an HMM-based method that requires PWMs, a query sequence, and parameters as inputs. The algorithm considers several features of CRMs, including the gap between adjacent binding sites inside a module, the number of binding sites contained in a module, and the span, which can be defined by users. The algorithm aims to find cis-regulatory regions by assessing the strength of clusters of individual cis-elements. As with other HMM-based algorithms, an important shortcoming is that, due to the lack of experimental evidence, the parameters chosen for this kind of method rest on several assumptions.
MSCAN [21]: Like the preceding methods, MSCAN requires a set of PWMs as input. Based on the frequencies carried by the PWMs, the algorithm assesses the significance level of each possible binding site represented by a PWM hit. After this first filtering step, candidate binding sites are ready to be combined and statistically evaluated as modules. Within a predefined fixed-size window, groups of candidate binding sites that have higher density are assumed to be modules; this is the criterion by which MSCAN selects functional regulatory regions. One drawback of this algorithm, which cannot be ignored, is that it is able to handle only one genomic sequence.
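The fixed-window density criterion described for MSCAN can be illustrated with a minimal sketch. This is not MSCAN's implementation: the function name `dense_windows`, the choice of anchoring each window at a candidate site, and the `min_hits` cutoff are assumptions made only for illustration, and the statistical evaluation of hits is omitted.

```python
from typing import List, Tuple


def dense_windows(hit_starts: List[int], window: int, min_hits: int) -> List[Tuple[int, int]]:
    """Return (window_start, hit_count) for hit-anchored windows holding >= min_hits hits.

    Hypothetical helper, not MSCAN's code: each window begins at a candidate
    site and covers the half-open interval [start, start + window).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    starts = sorted(hit_starts)
    dense = []
    right = 0  # index of the first hit that lies outside the current window
    for left, start in enumerate(starts):
        while right < len(starts) and starts[right] < start + window:
            right += 1
        count = right - left  # number of hits with index in [left, right)
        if count >= min_hits:
            dense.append((start, count))
    return dense


# Six candidate site positions on one sequence (made-up values).
example = dense_windows([10, 15, 22, 300, 310, 900], window=50, min_hits=2)
```

With these made-up positions, the windows anchored at 10, 15, and 300 are reported, while the isolated site at 900 is not.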
Cluster-Buster [22]: Like other approaches in this category, Cluster-Buster requires a list of motifs and DNA sequences as inputs. The algorithm then tries to recognize the subsequence that has the maximal log-likelihood ratio. Features considered in this algorithm include gap, span, and occurrence times. With a user-defined threshold, possible locations are selected and output as the putative module regions. Only sequences shorter than 100 kb can be uploaded, which is not feasible for real sets of co-promoter regions. Moreover, besides the threshold used for selecting module regions, a few parameters, such as the gap and the occurrence times, have to be defined by users without any guidance. As with other methods in this category, the main drawback of this method is the way it deals with known motifs.
Compo [23]: Given a list of input PWMs, Compo enumerates all possible composite motifs, then outputs a list of qualified modules, each with a confidence score, based on specific constraints. The first step of Compo deals with the input PWMs: it accepts motifs that have a high number of occurrences and a low expectation against the background model, given a threshold. After filtering out insignificant motifs, it enumerates all possible combinations using a tree structure. A specified sliding window is used for restricting distances inside modules, and unqualified modules, which are represented as branches in the tree structure, are pruned. The list of qualified modules is then evaluated, and finally output with confidence scores. Some aspects of this approach are worth discussing: firstly, although there are several methods to calculate the threshold, such as likelihoods, how to decide the threshold is still an unsolved issue; secondly, the choice of the sliding window's size is a difficult problem that can affect the predicted results. Apart from the pre-processing of PWMs, the basic idea of Compo is similar to the enumerative method proposed in this thesis. The difference between them is that, instead of deciding the parameters intuitively, the enumerative method provides all information as features associated with putative modules.
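The general idea of enumerating motif combinations under a window constraint can be sketched as follows. This is only an illustration of the concept, not Compo's tree-based algorithm and not the enumerative method of this thesis: the function `count_motif_combinations`, the `(start, motif_name)` hit format, and the `max_size` limit are hypothetical, and scoring against a background model is left out.

```python
from collections import Counter
from itertools import combinations
from typing import FrozenSet, List, Tuple

Hit = Tuple[int, str]  # (start position, motif name); an assumed input format


def count_motif_combinations(hits: List[Hit], window: int, max_size: int = 3) -> Counter:
    """Count sets of distinct motifs whose hits all start within one window.

    Each group of hits is anchored at its leftmost hit, so every group is
    counted exactly once.
    """
    ordered = sorted(hits)
    counts: Counter = Counter()
    for i, (anchor_start, anchor_name) in enumerate(ordered):
        # Later hits that start inside [anchor_start, anchor_start + window).
        rest = [h for h in ordered[i + 1:] if h[0] < anchor_start + window]
        for extra in range(1, max_size):  # number of hits added to the anchor
            for combo in combinations(rest, extra):
                names: FrozenSet[str] = frozenset([anchor_name] + [n for _, n in combo])
                if len(names) == extra + 1:  # keep only combinations of distinct motifs
                    counts[names] += 1
    return counts


# Made-up hits for three motifs A, B, and C on one sequence.
demo_hits = [(5, "A"), (12, "B"), (20, "C"), (200, "A"), (210, "B")]
demo_counts = count_motif_combinations(demo_hits, window=30)
```

In this toy input the pair {A, B} occurs twice within a 30 bp window, which is the kind of repeated co-occurrence an enumerative approach would report for further evaluation.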
2.2 Without Prior Knowledge of TFBSs
In this section, methods that do not require prior knowledge of TFBSs are reviewed. Usually, they first detect candidate TFBSs by cooperating with other algorithms, and then use a built-in model to discover modules that are statistically over-represented or phylogenetically conserved.
Module Searcher [24]: Module Searcher is the method most similar to the enumerative algorithm proposed in this thesis. The algorithm aims to find the best combinations of transcription factor binding sites without using prior knowledge of motifs, by cooperating with MotifScanner [25]. By assessing the quality of each combination, the algorithm determines whether the combination should be reported as a CRM. The features considered in the assessment include the overlap between individual binding sites and the span of a module. As a required input, a set of co-regulated sequences is compared with a pre-stored database to extract the respective conserved regions between mouse and human. This characteristic yields outstanding and steady performance on specific problems, but also limits the scope of its applications.
Composite Module Analyst (CMA) [26]: CMA takes two groups of promoter sequences as input: one is a set of promoter regions of co-expressed or co-regulated genes, and the other is a set of promoters of genes whose expression differs significantly. These two sequence sets are used as positive and negative training examples, respectively. By integrating the Match™ [27] algorithm, which predicts potential binding sites from the input sequences, CMA does not require TFBSs as initial input. It employs a multi-component fitness function to find the best pair of binding sites, taking into account span, order conservation, and maximal and minimal gap. The sliding window, span, orientation, and distance rules are all pre-defined by users or derived from the training data. In terms of the underlying idea, CMA is similar to the supervised algorithm proposed in this thesis. Both try to find the best combinations of putative TFBSs with a built-in feature vector, which is designed for exploring and evaluating the quality of potential modules. Besides considering different traits, HAC supports combinations of any size n, while CMA supports only pairs of TFBSs.
CisModule [28]: CisModule employs a Gibbs sampling approach and a hierarchical mixture model to detect motifs and CRMs at the same time. It aligns input sequences to detect TFBSs, and assumes that CRMs can be detected by searching for the co-occurrence of multiple TFBSs. Starting from random initial states, the algorithm iteratively executes different alignments and updates parameters until it obtains the marginal distribution. Besides a set of sequences, the algorithm also requires two input parameters: the number of TFBSs contained in a module and the span. However, there is not sufficient information to suggest values for the general problem, so selecting these two parameters is a challenge. Furthermore, only the gap feature is considered as a characteristic of modules. Despite these shortcomings, this algorithm provides a new perspective on solving the problem: it is not necessary to detect binding sites first; the binding sites and modules can be detected concurrently.
CHAPTER 3: PROBLEM DEFINITION
3.1 Introduction
To our knowledge, there is no standard or clear model for describing the CRM discovery problem. Different approaches focus on different aspects of CRMs. Some detect co-regulatory regions upstream in the non-coding areas by searching for evolutionarily conserved sequences, while others cluster known or putative individual TFBSs that cooperate in gene regulation [6]. In this thesis, a formal model, adapted from the model defined by Klaus Ecker and Lonnie Welch [29], is used.
In this proposed model, a CRM is defined as a cluster of TFBSs, which are predicted to work together in GRN and have co-regulatory functions. The formal definition of a module $M$ is [29]:

$$M = ((P_1, L_1), \ldots, (P_k, L_k)) \qquad (1)$$

where $k$ is the number of single TFBSs contained in the module $M$, $P_1 \le P_2 \le \ldots \le P_k$ are the start positions of the TFBSs in the considered genomic sequence, and $L_1, \ldots, L_k$ are the respective word lengths.
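As a minimal sketch of how a module in the sense of definition (1) might be represented in code, the following stores the pairs of start position and word length and checks the ordering constraint on the start positions. The class name `Module`, the helper `make_module`, and the validation rules are assumptions for illustration and are not part of the model in [29].

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Site = Tuple[int, int]  # (P_i, L_i): start position and word length of one TFBS


@dataclass(frozen=True)
class Module:
    """A module M = ((P_1, L_1), ..., (P_k, L_k)) with non-decreasing starts."""

    sites: Tuple[Site, ...]

    def __post_init__(self) -> None:
        if not self.sites:
            raise ValueError("a module needs at least one TFBS")
        starts = [p for p, _ in self.sites]
        if starts != sorted(starts):
            raise ValueError("start positions must satisfy P_1 <= ... <= P_k")
        if any(length <= 0 for _, length in self.sites):
            raise ValueError("word lengths must be positive")

    @property
    def k(self) -> int:
        """Number of single TFBSs contained in the module."""
        return len(self.sites)


def make_module(sites: Iterable[Site]) -> Module:
    """Hypothetical convenience constructor that sorts sites by start position."""
    return Module(tuple(sorted(sites)))


# Three made-up binding sites given in arbitrary order.
m = make_module([(40, 8), (12, 6), (25, 10)])
```

Sorting on construction keeps the ordering constraint of the definition satisfied regardless of the order in which candidate sites are found.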
3.2 Terminologies
This section introduces the terminology used for describing the features of modules and describes how to compute these values.
1. Span is the largest region in genomic sequence that is covered by a module. In a specific module, it is the number of nucleotides from the start position of the first TFBS to the end position of the last TFBS. The formal definition for counting the span of a module M is: