Computational Methods for Cis-Regulatory Module Discovery

A thesis presented to

the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment

of the requirements for the degree

Master of Science

Xiaoyu Liang

November 2010

© 2010 Xiaoyu Liang. All Rights Reserved.


This thesis titled

Computational Methods for Cis-Regulatory Module Discovery

by

XIAOYU LIANG

has been approved for

the School of Electrical Engineering and Computer Science

and the Russ College of Engineering and Technology by

Lonnie R. Welch

Professor of Electrical Engineering and Computer Science

Dennis Irwin

Dean, Russ College of Engineering and Technology

ABSTRACT

LIANG, XIAOYU, M.S., November 2010, Computer Science

Computational Methods for Cis-Regulatory Module Discovery

Director of Thesis: Lonnie R. Welch

In a gene regulatory network, the binding of a transcription factor to a short region of non-coding sequence is believed to be the key event that activates or represses gene expression. Further analysis has revealed that, in higher organisms, multiple transcription factors work together, binding multiple nearby sites in the genomic sequence, rather than acting alone at a single anchor. These groups of binding sites in the non-coding region are called cis-regulatory modules. Identifying cis-regulatory modules is important for modeling the gene regulatory network. In this thesis, two methods are proposed for addressing the problem, and a widely accepted evaluation is applied to assess their performance. Additionally, two practical case studies were completed and are reported as applications of the proposed methods.

Approved: ______

Lonnie R. Welch

Professor of Electrical Engineering and Computer Science


ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my advisor, Dr. Lonnie R. Welch, for introducing me to the field of Bioinformatics, and for his guidance and encouragement throughout this thesis and my research projects.

I would like to extend my deep gratitude to Dr. Frank Drews, Dr. Sarah Wyatt, and Dr. Razvan Bunescu for all their help and advice on my research, and for serving as my thesis committee members. I especially thank Dr. Razvan Bunescu for teaching me machine learning and for guiding me in developing the HAC algorithm proposed in this thesis. I would also like to thank Dr. Klaus Ecker for participating in my research projects, and for his valuable advice, discussions, and friendly help.

I would also like to thank past and present members of the Bioinformatics group, including Jens Litchenberg, Kyle Kurz, Rami Alouran, Lee Nau, Joshua Welch, Daniel Evans, Kaiyu Shen, Paul Burns, Lev Neimen, Zekai Huang, Eric Petri, Nathaniel George, and Josiah Seaman, for all their help and collaboration. I especially thank Jens Litchenberg for his advice and encouragement, and for being a wonderful landlord.

I would like to thank Dr. Wyatt and her lab again for providing the data and initial analysis for the case study of the plant gravitropic signal pathway presented in this thesis.

I would also like to thank Dr. Eric Grotewold and Dr. Alper Yilmaz for providing a priceless internship opportunity, and for cooperating on the case study comparing rice and Arabidopsis thaliana, which is presented in this thesis. I learned a lot in both theory and practical application, and really enjoyed the experience of working in their lab.

I deeply thank my parents for their constant support, love, and understanding.

I would also like to thank Stephane J. Litchenberg for correcting my grammatical errors and for their friendship.


TABLE OF CONTENTS

Page

ABSTRACT ...... 3

ACKNOWLEDGEMENTS ...... 4

LIST OF TABLES ...... 10

LIST OF FIGURES ...... 12

LIST OF ABBREVIATIONS ...... 14

CHAPTER 1: INTRODUCTION ...... 15

1.1 Background ...... 15

1.2 Problem Statement ...... 17

1.3 Overview of Thesis ...... 19

CHAPTER 2: LITERATURE REVIEW ...... 20

2.1 With Prior Knowledge of TFBSs...... 21

2.2 Without Prior Knowledge of TFBSs ...... 25

CHAPTER 3: PROBLEM DEFINITION...... 28

3.1 Introduction ...... 28

3.2 Terminologies ...... 28

CHAPTER 4: WORDSEEKER ...... 35

4.1 WordSeeker Design ...... 35

4.2 Integration with WordSeeker ...... 38

CHAPTER 5: ALGORITHMIC APPROACHES FOR MODULE DISCOVERY ...... 40

5.1 Enumerative Module Discovery Method ...... 41

5.1.1 Motivation ...... 41

5.1.2 Algorithm ...... 43

5.1.3 Complexity Analysis ...... 50

5.2 Hierarchical Agglomerative Clustering (HAC) Module Discovery Method ...... 52

5.2.1 Motivation ...... 52

5.2.2 Algorithm ...... 52

5.2.3 Complexity Analysis ...... 58

CHAPTER 6: EVALUATION ...... 59

6.1 Benchmark Datasets ...... 59

6.1.1 TRANSCompel ...... 60

6.1.2 Muscle Dataset ...... 61

6.1.3 Liver Dataset ...... 62

6.2 Evaluation of Scoring Functions...... 63

6.2.1 AP1-NFAT ...... 64

6.2.2 AP1-Ets ...... 70

6.2.3 AP1-NFkappaB ...... 75

6.2.4 CEBP-NFkappaB ...... 79

6.2.5 Ebox-Ets ...... 82

6.2.6 Ets-AML ...... 85

6.2.7 IRF-NFkappaB ...... 91

6.2.8 NFkappaB-HMGIY ...... 97

6.2.9 PU1-IRF ...... 97

6.2.10 Sp1-Ets ...... 102

6.2.11 Discussion ...... 107

6.3 Evaluation of Prediction Performance ...... 108

6.3.1 Statistical Values ...... 108

6.3.2 Statistical Formulas ...... 110

6.3.3 Results and Discussion ...... 111

CHAPTER 7: CASE STUDY...... 121

7.1 Plant Gravitropic Signal Transduction ...... 121

7.1.1 Introduction ...... 121

7.1.2 Methods and Results ...... 122

7.1.3 Conclusions ...... 130

7.2 Comparison of Rice and Arabidopsis Thaliana ...... 130

7.2.1 Introduction ...... 131

7.2.2 Methods and Results ...... 131

7.2.3 Conclusion ...... 140

CHAPTER 8: CONCLUSIONS ...... 141

8.1 Summary of results ...... 141

8.2 Future Work ...... 141

REFERENCES ...... 143

APPENDIX 1: TRANSCOMPEL BENCHMARK DATASET ...... 150

APPENDIX 2: MUSCLE BENCHMARK DATASET...... 155

APPENDIX 3: LIVER BENCHMARK DATASET ...... 157

APPENDIX 4: RESULT FOR MUSCLE DATASET ...... 158

APPENDIX 5: GO ANALYSIS RESULT FOR NINE CLUSTERS OF GENES ...... 166

APPENDIX 6: GO ANALYSIS RESULT FOR THE WHOLE GENE LIST ...... 169


LIST OF TABLES

Page

Table 1: The summarization of benchmark datasets ...... 60

Table 2: Detailed information of TRANSCompel benchmark data ...... 61

Table 3: Detailed information of Muscle benchmark dataset ...... 62

Table 4: Detailed information of Liver benchmark dataset ...... 63

Table 5: Result summarization for AP1-NFAT dataset ...... 65

Table 6: Detailed result of AP1-NFAT dataset ...... 66

Table 7: Result summarization for AP1-Ets dataset ...... 70

Table 8: Detailed result of AP1-Ets dataset ...... 71

Table 9: Result summarization for AP1-NFkappaB dataset ...... 76

Table 10: Detailed result of AP1-NFkappaB dataset ...... 76

Table 11: Result summarization for CEBP-NFkappaB dataset ...... 80

Table 12: Detailed result of CEBP-NFkappaB dataset ...... 80

Table 13: Result summarization for Ebox-Ets dataset ...... 82

Table 14: Detailed result of Ebox-Ets dataset ...... 83

Table 15: Result summarization for Ets-AML dataset ...... 85

Table 16: Detailed result of Ets-AML dataset ...... 86

Table 17: Result summarization for IRF-NFkappaB dataset ...... 91

Table 18: Detailed result of IRF-NFkappaB dataset ...... 92

Table 19: Result summarization for NFkappaB-HMGIY dataset ...... 97

Table 20: Result summarization for PU1-IRF dataset ...... 98

Table 21: Detailed result of PU1-IRF dataset ...... 98

Table 22: Result summarization for Sp1-Ets dataset ...... 103

Table 23: Detailed result of Sp1-Ets dataset ...... 103

Table 24: Result of Enumerative Method for Muscle Dataset ...... 113

Table 25: Result of Enumerative Method for Liver Dataset ...... 115

Table 26: Result of HAC Method of Muscle Dataset ...... 116

Table 27: Result of HAC Method of Liver Dataset ...... 117

Table 28: Significant Words ...... 124

Table 29: Significant Modules ...... 126

Table 30: AGRIS Look-up results ...... 128

Table 31: TRANSFAC Look-up results ...... 129

Table 32: Common Words Shared by Arabidopsis and Rice ...... 133

Table 33: TRANSFAC Look-up results ...... 134

Table 34: AGRIS Functional Look-up Result for Known Motif AGAA[ACGT] ...... 134

Table 35: Top 25 modules of Arabidopsis thaliana ...... 138

Table 36: Common Modules shared by Arabidopsis thaliana and rice ...... 139

Table 37: Distance distribution map of common modules ...... 139

Table 38: Density distribution map of common modules ...... 139


LIST OF FIGURES

Page

Figure 1: An example of Span ...... 29

Figure 2: An example of the number of occurrence times ...... 31

Figure 3 : Example 3...... 32

Figure 4 : Architecture of WordSeeker Pipeline...... 39

Figure 5 : An observed example of words patterns...... 42

Figure 6 : Flow chart of the enumerative module discovery method...... 43

Figure 7 : An example of a PWM presented motif...... 44

Figure 8 : Pseudo code for the function of PWMs read and filter...... 45

Figure 9 : The pseudo code for the function of enumerating combinations...... 46

Figure 10: An example for describing cog ...... 56

Figure 11: Word description of HAC algorithm...... 57

Figure 12 : High level pseudo code for HAC algorithm...... 57

Figure 13 : Visualized result of AP1-NFAT for Markov order of 0...... 67

Figure 14 : Visualized result of AP1-NFAT of Markov order of 1...... 68

Figure 15 : Visualized result of AP1-NFAT of Markov order of 2...... 69

Figure 16 : Visualized result of AP1-Ets of Markov order of 0...... 72

Figure 17 : Visualized result of AP1-Ets of Markov order of 1...... 73

Figure 18 : Visualized result of AP1-Ets of Markov order of 2...... 74

Figure 19 : Visualized result of AP1-NFkappaB of Markov order of 0...... 77

Figure 20: Visualized result of AP1-NFkappaB of Markov order of 1...... 78

Figure 21 : Visualized result of CEBP-NFkappaB of Markov order of 0...... 81

Figure 22 : Visualized result of Ebox-Ets of Markov order of 0...... 84

Figure 23 : Visualized result of Ets-AML of Markov order of 0...... 87

Figure 24 : Visualized result of Ets-AML of Markov order of 1...... 88

Figure 25 : Visualized result of Ets-AML of Markov order of 2...... 89

Figure 26 : Visualized result of Ets-AML of Markov order of 3...... 90

Figure 27 : Visualized result of IRF-NFkappaB of Markov order of 0...... 93

Figure 28 : Visualized result of IRF-NFkappaB of Markov order of 1...... 94

Figure 29 : Visualized result of IRF-NFkappaB of Markov order of 2...... 95

Figure 30 : Visualized result of IRF-NFkappaB of Markov order of 3...... 96

Figure 31 : Visualized result of PU1-IRF of Markov order of 0...... 99

Figure 32 : Visualized result of PU1-IRF of Markov order of 1...... 100

Figure 33 : Visualized result of PU1-IRF of Markov order of 2...... 101

Figure 34 : Visualized result of Sp1-Ets of Markov order of 0...... 104

Figure 35 : Visualized result of Sp1-Ets of Markov order of 1...... 105

Figure 36 : Visualized result of Sp1-Ets of Markov order of 2...... 106

Figure 37 : The relationship between reality and prediction...... 110

Figure 38 : Measuring Formulas ...... 111

Figure 39 : Evaluation result of the muscle benchmark dataset...... 118

Figure 40 : Evaluation result of the liver benchmark dataset...... 119

Figure 41 : Motifs Comparison for Rice and Arabidopsis thaliana...... 136


LIST OF ABBREVIATIONS

GRN -- Gene Regulatory Network

TFs -- Transcription Factors

TSSs -- Transcription Start Sites

TFBSs -- Transcription Factor Binding Sites

TFBMs -- Transcription Factor Binding Modules

CRMs -- Cis -Regulatory Modules

PWMs -- Position Weight Matrices

HMM -- Hidden Markov Model

EM -- Expectation Maximization

HAC -- Hierarchical Agglomerative Clustering


TP -- True Positive

FP -- False Positive

TN -- True Negative

FN -- False Negative

Sn -- Sensitivity

Sp -- Specificity

PPV -- Positive Predictive Value

ASP -- Average Site Performance

PC -- Performance Coefficient

CC -- Correlation Coefficient

CHAPTER 1: INTRODUCTION

1.1 Background

A Gene Regulatory Network (GRN) is the complex system that controls gene expression. It is the functional network that decides which genes are expressed in each cell, and at each stage of development and differentiation. Several phases are involved in this control, and most of it is accomplished at the level of transcription [1]. Understanding this regulatory mechanism will not only lead to a deeper comprehension of life, but will also benefit the development of disease treatments and other biology-related industries.

Over the last several decades, scientists have been working to understand the GRN. With advances in biology, especially molecular biology, modern science has accumulated an unprecedented amount of knowledge in this realm. It is known that gene expression is triggered by the activation of Transcription Factors (TFs) binding to specific regions in the non-coding genomic sequence, which may be located either kilobases away from the promoter region [1] or near the Transcription Start Sites (TSSs) [2]. The specific region that a TF binds, called a Transcription Factor Binding Site (TFBS), is usually a short sequence (≈ 10 bp), and is mostly located upstream of the target gene. TFBSs are also called cis-regulatory elements, where the term "cis" means that this short specific DNA region is usually located on the same side as the target gene [3]. The binding of TFs to TFBSs is therefore considered a central process of the gene regulation system. To understand gene regulation mechanisms, it is necessary to simulate or model this whole pathway; therefore, detecting and identifying TFBSs is the entry point to understanding the whole system.

Numerous software programs for motif discovery have been conceived and developed to detect single TFBSs. However, in most organisms, especially higher organisms, TFs do not work alone; they usually work together with nearby TFs, binding multiple TFBSs to regulate genes [4].

These groups of multiple TFBSs are called Transcription Factor Binding Modules (TFBMs) or Cis-Regulatory Modules (CRMs), and their number is estimated to be ten times the number of genes [3]. Clearly, CRMs are a basic necessity for modeling the gene regulatory network. Several common features shared by cis-regulatory modules have been analyzed and reported, such as the distance between single binding sites inside a module, the module span, the density of binding sites, the number of individual binding sites contained in a module, and GC content [2, 3, 5, 6, 7, 8, 9]. Even though more features of CRMs have recently been, and continue to be, explored and reported, the key working mechanism of the GRN remains elusive. On the other hand, although these reported features are not sufficient to unveil the mystery of gene regulation, they can serve as a guide for developing computational module discovery tools to identify CRMs within massive background sequences.


1.2 Problem Statement

As stated above, identifying CRMs is both important and necessary for understanding the GRN. The ideal approach to recognizing these functional elements and modules is biological experimentation, which can not only identify but also practically verify them. Unfortunately, biological experiments are time-consuming and demand large investments in both materials and manpower. Coupled with their uncertainties, it is impossible to finish the task by biological experiments alone.

In the past decades, with the help of computational methods, it has become possible to discover putative CRMs, which can guide subsequent biological experiments and be verified or falsified by them. Thus, a computational method is desirable; nonetheless, it is difficult to conceive an efficient one. A variety of factors contribute to this difficulty, including the lack of a model of the regulatory mechanism, the degeneracy of motifs, limited understanding of the functions carried by evolutionarily conserved regions, and the complexity of non-regulatory sequences inside regulatory regions [2].

Hence, the primary problem is how to design and develop computational methods to detect and predict CRMs. Furthermore, the desired method should be not just a possible solution, but an accurate and effective one. The most straightforward method is to take known TFBSs and cluster them together based on "rules", that is, properties reported to be shared by CRMs, which are used to measure the similarity among multiple binding sites. Although more and more features have been reported and continue to be explored, the systematic process of the gene regulatory network is still a mystery. Furthermore, no one has experimentally verified all TFBSs for any species. Finally, a method that depends on known TFBSs leads to severely biased performance because of the impact of the quality of the input Position Weight Matrices (PWMs) that are usually used to represent TFBSs. The detailed reasons are explained in the next chapter. Because of these limitations, a method requiring prior knowledge of TFBSs is not a suitable solution and is not employed in this thesis.

To develop a high quality approach, several sub-problems have to be stated and solved: how to avoid too many false positive predictions without missing significant clusters; how to deal with insufficient known data; and how to make the most of current knowledge of the gene regulatory network. To address this complicated problem, two methods are proposed in this thesis. One employs an enumerative algorithm to report over-represented modules; the other adopts a supervised machine learning method to cluster putative binding sites. Both methods can be executed either with or without taking advantage of known TFBSs. For the first time, the approaches presented in this thesis combine these two feasible solutions (with/without using known TFBSs) and try to achieve the benefits of both.


1.3 Overview of Thesis

This thesis is structured into eight chapters. Chapter one introduces the biological background of this research and the relevance of the problem. Chapter two reviews existing solutions in this realm, along with their comparative advantages and disadvantages. Chapter three methodically describes the problem that is solved in this thesis. Chapter four describes a genomic analysis toolkit, WordSeeker, from its design to each functional component; it is also the host of the two proposed methods. Chapter five describes two approaches for solving the problem; for each approach, the motivation, algorithm, and complexity analysis are explained in order. Chapter six focuses on the assessment of the proposed methods, in which scoring functions and prediction performance are evaluated separately. Chapter seven presents two case studies: (1) identifying functional cis-regulatory elements and modules in Arabidopsis thaliana, and (2) analyzing the similarity pattern between rice (japonica and indica) and Arabidopsis thaliana.


CHAPTER 2: LITERATURE REVIEW

In this chapter, related work in this field is reviewed, and comparative advantages and disadvantages are discussed. Traditionally, all methods can be divided into two categories according to whether they require prior knowledge of known TFBSs or TFs as input. In the following sections, methods for solving the problem are reviewed according to these two classes.

As stated above, using known motifs or known TFBSs as the input for module discovery is the simplest and most direct solution. Several public databases, such as TRANSFAC [10], AGRIS [11, 12], JASPAR [13], and REDfly [14], provide experimentally verified TFBSs or TFs of different organisms, and can be used as initial input for module discovery tools. With known TFBSs or TFs, the research objective can be focused on the common features of single TFBSs shared by CRMs. However, aside from the few tissues and species that have been studied relatively comprehensively, such as liver [7], muscle [15], and Drosophila [16], most TFBSs, and the rules of how TFBSs are involved in the gene regulation pathway, are barely known for most species. Thus, using known TFBSs might be a suitable solution for a specific tissue or organism, but it is not appropriate for the general problem. Additionally, the degeneracy of motifs and the non-uniqueness of the mapping between PWMs and motifs make the problem more complicated. Furthermore, this kind of method requires prior knowledge as input; in other words, it assumes that the problem of detecting TFBSs has already been solved. This raises a new problem: the performance of CRM discovery is impacted heavily by the quality of the input TFBSs, as demonstrated by the benchmark evaluation in chapter six. On the other hand, not using known TFBSs avoids these problems, although it might be less efficient for species that have been intensively studied.

The following two sections review the two classes of tools respectively. Due to limited space, only the nine tools used for comparison with the proposed methods are reviewed in detail.

2.1 With Prior Knowledge of TFBSs

All tools reviewed in this section require as input known TFBSs, or PWMs that represent motifs. In this category, the problem solved is CRM search, or recognition, rather than CRM discovery. Some tools even require as input a list of TFBSs or TFs that are known to work together, which is itself an outstanding issue. Moreover, most of them employ Hidden Markov Models (HMMs) to detect putative module regions; however, one drawback of HMMs is that it is hard to control the relations between TFBSs inside each module, which is one of the most important features of modules. Also, parameter optimization needs training data, which is one of the primary limitations in this realm. Furthermore, user-defined parameters can easily cause overfitting, so that the methods are not appropriate for general problems. In the following, each paragraph gives a general description of one method.

22

Motif Cluster Alignment Search Tool (MCAST) [17]: The algorithm requires as input a database of DNA sequences and a set of motifs believed to work together in regulating target genes. By inheriting and extending an established HMM algorithm, Meta-MEME [18], it recognizes CRMs by searching for corresponding matches within the sequences, and outputs predicted CRMs scored by novel scoring functions. The whole method is based on an alignment strategy and a novel scoring function conceived to handle matches and gaps by assigning them appropriate rewards or penalties. Rather than a module discovery method, as the authors themselves state, the algorithm performs a "database search task" (p. ii17).

Stubb [19]: Stubb requires as input a set of cooperating TFs represented as PWMs, and outputs a list of regions with scores that reflect their likelihood of being modules. The algorithm employs HMMs to model the statistical features of generating specific sequences from the input TFs, and uses the Expectation Maximization (EM) algorithm to determine the best parameters. Besides the spacing and order conservation used as features of CRMs, phylogenetic comparisons are considered and explored as well. Although alignments among similar organisms that are close in phylogenetic distance are popular in the area of detecting functional elements, how to deal with matches in the conserved regions is still an open question. Furthermore, the EM algorithm requires training data to decide the best parameter settings; thus, insufficient training data is a main issue for this supervised method.

23

Cister [20]: Cister is an HMM-based method that requires PWMs, a query sequence, and parameters as inputs. The algorithm considers several features of CRMs, including the gap between adjacent binding sites inside a module, the number of binding sites contained in a module, and the span, which can be defined by users. It searches for cis-regulatory regions by assessing the strength of clusters of individual cis-elements. As with other HMM-based algorithms, due to the lack of experimental evidence, one important shortcoming is that the parameter choices rest on several assumptions.

MSCAN [21]: Like the former methods, MSCAN requires a set of PWMs as input, and based on the frequencies carried by the PWMs, the algorithm assesses the significance level of each possible binding site represented by the input PWM hits. After this first filtering step, candidate binding sites extracted from the input PWMs are combined and statistically evaluated as modules. Within a predefined, fixed-size window, groups of candidate binding sites with higher density are assumed to be modules; this is also the criterion by which MSCAN selects functional regulatory regions. One drawback that cannot be ignored is that the algorithm can handle only one genomic sequence.

24

Cluster-Buster [22]: Like the other approaches in this category, it requires a list of motifs and DNA sequences as inputs. The algorithm then attempts to recognize the subsequence that has the maximal log likelihood ratio. Features considered include gap, span, and occurrence times. With a user-defined threshold, possible locations are selected and output as putative module regions. The algorithm accepts only sequences shorter than 100 kb, which is not feasible for real sets of co-promoter regions. Moreover, besides the threshold used for selecting module regions, a few parameters, such as the gap and the occurrence times, have to be defined by users without any guidance. As with the other methods in this category, the main drawback of this method is the way it relies on known motifs.

Compo [23]: Given a list of input PWMs, Compo enumerates all possible composite motifs, then outputs a list of qualified modules, each associated with a confidence score based on specific constraints. The first step of Compo is to process the input PWMs: it accepts motifs that have a high number of occurrences with low expectation against the background model, under a given threshold. After filtering out insignificant motifs, it enumerates all possible combinations with a tree structure. A specified sliding window is used to restrict distances inside modules, and unqualified modules, each represented as a branch in the tree structure, are pruned. The list of qualified modules is then evaluated and finally output with confidence scores. Some aspects of this approach are worth discussing. First, although there are several methods to calculate the threshold, such as likelihoods, how to decide the threshold is still an unsolved issue. Second, the choice of the sliding window's size is a difficult problem that can affect the predicted results. Apart from the pre-processing of the PWMs, the basic idea of Compo is similar to the enumerative method proposed in this thesis. The difference is that, instead of deciding the parameters intuitively, the enumerative method provides all the information as features associated with the putative modules.

2.2 Without Prior Knowledge of TFBSs

In this section, methods that do not require prior knowledge of TFBSs are reviewed. Usually, they first detect candidate TFBSs by cooperating with other algorithms, and then use a built-in model to discover modules that are statistically over-represented or phylogenetically conserved.

Module Searcher [24]: Module Searcher is the method most similar to the enumerative algorithm proposed in this thesis. The algorithm aims to find the best combinations of transcription factor binding sites without prior knowledge of motifs by cooperating with MotifScanner [25]. By assessing the quality of each combination, the algorithm determines whether the combination should be reported as a CRM. The features considered in the assessment include the overlap between individual binding sites and the span of a module. As a required input, a set of co-regulated sequences is compared with a pre-stored database to extract the conserved regions between mouse and human. This characteristic yields outstanding and steady performance on specific problems, but it also limits the scope of the method's applications.

Composite Module Analyst (CMA) [26]: CMA takes two groups of promoter sequences as input: one is a set of promoter regions of co-expressed or co-regulated genes, and the other is a set of promoters of genes whose expression differs significantly. These two input sequence sets are used as positive and negative training examples, respectively. By integrating the Match [27] algorithm, which can predict potential binding sites from input sequences, CMA does not require TFBSs as initial input. It employs a multi-component fitness function to find the best pairs of binding sites, considering span, order conservation, and maximal and minimal gap. The sliding window, span, orientation, and distance rules are all pre-defined by users or by the training data. In terms of its underlying idea, CMA is similar to the supervised algorithm proposed in this thesis: both try to find the best combinations of putative TFBSs with a built-in feature vector designed to explore and evaluate the quality of potential modules. Besides considering different traits, HAC supports combinations of any size n, while CMA supports only pairs of TFBSs.

CisModule [28]: CisModule employs a Gibbs sampling approach and a hierarchical mixture model to detect motifs and CRMs at the same time. It aligns input sequences to detect TFBSs, on the premise that CRMs can be detected by searching for the co-occurrence of multiple TFBSs. Starting from random initial states, the algorithm iteratively executes different alignments and updates the parameters until obtaining the marginal distribution. Besides a set of sequences, the algorithm also requires two input parameters: the number of TFBSs contained in a module, and the span. However, there is not sufficient information to suggest how to choose these parameters for the general problem, so the strategy for selecting them is a challenge. Furthermore, another problem of this algorithm is that only the gap feature is considered as a characteristic of modules. Despite these shortcomings, this algorithm provides a new perspective on the problem: it is not necessary to detect binding sites first; binding sites and modules can be detected concurrently.


CHAPTER 3: PROBLEM DEFINITION

3.1 Introduction

To our knowledge, there is neither a standard nor a clear model for describing the CRM discovery problem. Different approaches focus on different aspects of CRMs. Some detect co-regulated regions upstream in non-coding areas by searching for evolutionarily conserved sequences, while others cluster known or putative individual TFBSs that cooperate in gene regulation [6]. In this thesis, a formal model adapted from the model defined by Klaus Ecker and Lonnie Welch [29] is used.

In this proposed model, a CRM is defined as a cluster of TFBSs that are predicted to work together in the GRN and to have co-regulatory functions. The formal definition of a module M is [29]:

M = ((P1, L1), (P2, L2), ..., (Pk, Lk))    (1)

where k is the number of single TFBSs contained in the module M; P1 ≤ P2 ≤ ... ≤ Pk are the start positions of the TFBSs in the considered genomic sequence; and L1, ..., Lk are the respective word-lengths.

3.2 Terminologies

This section introduces terminologies that are used for describing modules’ features.

Additionally, how to compute these values is described.

1. Span is the largest region in genomic sequence that is covered by a module. In a specific module, it is the number of nucleotides from the start position of the first TFBS to the end position of the last TFBS. The formal definition for counting the span of a module M is:

Span(M) = (Pk + Lk − 1) − P1 + 1 = Pk + Lk − P1    (2)

An example, shown in Figure 1, illustrates how span is calculated. Assume the word-length of each TFBS is the same, set to 6 bp. Then, according to equation (2), Span(M) = P3 + L3 − P1 = 18 + 6 − 1 = 23 (bp).

Example 1

Module M: W1nnnW2nnW3

Figure 1: An example of Span. In this example, module M consists of three single TFBSs, presented by Wk, where k = 1, 2, and 3; n stands for nucleotides between adjacent TFBSs; and all TFBSs are assumed to have a fixed word-length of 6.

2. Density is the ratio of the number of nucleotides covered by TFBSs to the span. It measures the compactness of TFBSs contained in a module. For a module M with k TFBSs, where Li is the length of the ith TFBS, the equation for computing density is:

Density(M) = (Σ_{i=1..k} Li) / Span(M)    (3)

For the example shown in Figure 1, and according to equation (3), the density is:

Density(M) = (Σ_{i=1..3} Li) / Span(M) = 18/23 ≈ 78.3%.

3. Gap, also called distance, is the number of nucleotides between any two adjacent TFBSs. It is another way to measure the compactness of TFBSs in a module. If a module consists of more than two TFBSs, the gap of the module is the minimum distance between any two adjacent TFBSs inside the considered module. Assume there is a module M, where Pi is the start position and Li is the length of the ith TFBS; then the gap is:

Gap(M) = min (P_{i+1} − Pi − Li),  i = 1, ..., k−1    (4)

For the example shown in Figure 1, and according to equation (4), Gap(M) = 2, which is the distance between W2 and W3.
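The three quantities above can be computed directly from the (start, length) representation of definition (1). The following Python sketch is written here purely for illustration (it is not part of the thesis code) and implements equations (2)–(4):

```python
# A module is a list of (start, length) pairs, as in definition (1):
# M = ((P1, L1), ..., (Pk, Lk)), with starts in ascending order.

def span(module):
    (p1, _), (pk, lk) = module[0], module[-1]
    return pk + lk - p1                               # equation (2)

def density(module):
    return sum(l for _, l in module) / span(module)   # equation (3)

def gap(module):
    # minimum distance between any two adjacent TFBSs, equation (4)
    return min(module[i + 1][0] - module[i][0] - module[i][1]
               for i in range(len(module) - 1))
```

For the module of Figure 1, M = ((1,6), (10,6), (18,6)), this yields span 23, density 18/23, and gap 2, matching the worked examples.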

4. Order conservation is a Boolean property indicating whether the arrangement of individual TFBSs in a module must be preserved.

5. Number of occurrence times counts the total number of times the module occurs. A functional module is expected to be either statistically over-represented or under-represented, so it is necessary to record the occurrence number for later comparison against expected values. Obviously, the number of occurrence times differs depending on the setting of order conservation. Take Figure 2 as Example 2 to illustrate this situation. Assume that module M consists of three individual binding sites, W1, W2, and W3. If order conservation is considered, meaning the arranged order cannot be disturbed (in other words, W1 has to be followed by W2, and W2 has to be followed by W3), then the number of occurrence times of module M is 1. Otherwise, all possible combinations of the three TFBSs W1, W2, and W3 are considered instances of module M, so the number of occurrence times of module M is 4.

Example 2: >Seq1 W1nnnW2nnW3nnW2nnW1

Figure 2: An example of the number of occurrence times. It assumes that the module M consists of three TFBSs, presented as W1, W2, and W3. And n stands for nucleotides located between any two adjacent TFBSs.

6. Sequence hit is the number of sequences in which the module occurs. In addition to the number of occurrence times of the module itself, the question of how many sequences are covered is interesting as well, especially when the query object is a set of co-regulated or co-expressed promoters. If the assumption that co-expressed or co-regulated genes share similar promoter structures is true, then a module that occurs many times in only one sequence is much less interesting than a module that occurs in every sequence with a smaller number of occurrence times.

The example shown in Figure 3 is designed to show how to calculate all the concepts introduced above.

Example 3:
>Seq 1 W1nnnW2nnnnW3
>Seq 2 W1nnW2nnnnnnnW3
>Seq 3 W1nnnnW3nnW2

Figure 3: Example 3. There are three sequences, Seq1, Seq2, and Seq3, displayed in FASTA format. Wk refers to different TFBSs, and a module M is assumed to consist of W1, W2, and W3. And n stands for a nucleotide that is not contained in any TFBS. All TFBSs have the same length, which is assigned to be 6.

Span: since the module M occurs more than once, each occurrence of module M has a corresponding Span. According to equation (2):

In Seq 1: M = ((1,6), (10,6), (20,6)), so Span(M) = P3 + L3 − P1 = 20 + 6 − 1 = 25.

In Seq 2: M = ((1,6), (9,6), (22,6)), so Span(M) = P3 + L3 − P1 = 22 + 6 − 1 = 27.

In Seq 3: M = ((1,6), (11,6), (19,6)), so Span(M) = P3 + L3 − P1 = 19 + 6 − 1 = 24.

So, among these three sequences, the maximum Span(M) = 27 bp, which occurs in Seq 2, while the minimum Span(M) = 24 bp, which occurs in Seq 3. Note that if order conservation is turned on, there is no module M in Seq 3; the minimum Span(M) = 25 then occurs in Seq 1, while the maximum Span(M) is still the 27 bp that occurs in Seq 2.

Density: since the target module occurs more than once in the dataset, similarly to the span, the density is calculated for every occurrence. The module M has three TFBSs, each of length 6. According to equation (3), the density is calculated for each sequence separately. Note that if order conservation is turned on, Seq 3 does not include module M, and it is then unnecessary to calculate Density(M) in Seq 3.

In Seq 1: Density(M) = (Σ_{i=1..3} Li) / Span(M) = 18/25 = 72%;

In Seq 2: Density(M) = (Σ_{i=1..3} Li) / Span(M) = 18/27 ≈ 66.7%;

In Seq 3 (if not considering order conservation): Density(M) = (Σ_{i=1..3} Li) / Span(M) = 18/24 = 75%.

Gap: similar to the two terms computed above, the gap is calculated for every occurrence of module M. Again, if order conservation is considered, then no module occurs in Seq 3, and it is unnecessary to calculate Gap in Seq 3. According to equation (4), with i = 1, 2:

In Seq 1: Gap(M) = min (P_{i+1} − Pi − Li) = min(3, 4) = 3.

In Seq 2: Gap(M) = min (P_{i+1} − Pi − Li) = min(2, 7) = 2.

In Seq 3 (if not considering order conservation): Gap(M) = min (P_{i+1} − Pi − Li) = min(4, 2) = 2.

Number of occurrence times: if order conservation is considered, the number of occurrence times is 2, since the ordered module (W1, W2, W3) occurs only in Seq 1 and Seq 2; otherwise, the total number is 3: (W1, W2, W3) as a module occurs twice, and (W1, W3, W2) as a module occurs once.

Sequence hits: with order conservation, (W1, W2, W3) as a module occurs in 2 sequences, and (W1, W3, W2) occurs in 1 sequence; otherwise, the module hits all 3 sequences.

All terms illustrated in this chapter are used either as parameters that users may define in the enumerative method, or as features for constructing the similarity function in the HAC method.


CHAPTER 4: WORDSEEKER

This chapter introduces a genomic analysis software suite, WordSeeker, which aids biologists in analyzing their genomic data by identifying, in non-coding regions, functional elements that are involved in the GRN. The number of sequenced organisms has been increasing, and continues to increase. However, genome sequences do not carry much information by themselves unless they are annotated. Numerous bioinformatics tools have been developed for annotating genomic sequences, such as BLAST [30], GeneMark [31], and Pfam [32], and they focus on analyzing and predicting the locations of genes, gene structures, and protein-coding regions, despite the fact that the fraction of coding regions, especially in higher organisms, is much smaller than that of non-coding regions, and genes' regulatory mechanisms are believed to be located inside the non-coding regions. Currently, multiple computational tools aim at promoter regions rather than coding regions, and some of them even focus on non-coding regions that are further away from genes.

WordSeeker is one of these tools.

4.1 WordSeeker Design

WordSeeker was conceived and developed from the assumption that functional elements are statistically over- or under-represented relative to the large amount of background sequence. The biological evidence supporting this assumption is that regions highly conserved during evolution are usually functional [1, 33]. The whole approach combines several functional components into one software suite.

According to their functions, the components can be divided into two parts: the basic function part and the post-processing part. The basic function part, also called word counting, contains three sub-components, which are also three steps executed in order: word enumerating, word scoring, and word selecting. The word enumerating part searches all possible words carried by the query sequence(s), which can be either uploaded by users or selected from a pre-stored database. In the meantime, the number of occurrence times and the sequence hits are counted and stored in the data structure.

Three different types of data structures, radix tree [34], suffix tree [35], and suffix array [36], are designed and implemented in the pipeline for the purpose of speeding up the enumeration part. It is well known that, for an exhaustive search algorithm, the main issues are the time and space consumed, and striking the balance between the two is the key to providing an efficient algorithm with high scalability. The three data structures have their own merits depending on input size and word length. Thus, the choice of data structure is exposed as a parameter that users can switch depending on the features of their queries.
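The counting step itself can be sketched with a plain hash map in place of the radix/suffix structures; the specialized data structures exist precisely because this naive version does not scale, but the quantities tracked (occurrence counts and sequence hits) are the same. The function name below is hypothetical, not WordSeeker's API:

```python
from collections import defaultdict

def count_words(sequences, k):
    """Enumerate all k-mers in the input sequences, recording for each
    word its total number of occurrence times and its sequence hits."""
    occurrences = defaultdict(int)   # word -> total occurrence count
    hit_sets = defaultdict(set)      # word -> indices of sequences containing it
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            w = seq[i:i + k]
            occurrences[w] += 1
            hit_sets[w].add(idx)
    seq_hits = {w: len(s) for w, s in hit_sets.items()}
    return occurrences, seq_hits
```

For example, for sequences ["ACGACG", "ACGT"] and k = 3, the word ACG has 3 occurrence times and 2 sequence hits.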

Because of the assumption stated above, WordSeeker not only provides word enumeration, but also provides scoring functions for evaluating every word's statistical significance. Word scoring builds a Markov model, with an order set by users, for calculating the probability and expected value of each word's occurrence. By comparing the expected values with the observed values, multiple scoring functions, including O/E, OlnO/E, S/Es, and SlnS/Es, are provided for presenting the significance level of each word, where O denotes the observed number of occurrence times, E the expected number of occurrence times, S the sequence hits, and Es the expected number of sequence hits.

The last step in the basic function part is word selecting. In accordance with user-defined selection rules, including word length, minimum and maximum occurrence times, filtering options, preferred scores, etc., word selecting ranks words according to the selected score and outputs the words that satisfy all requirements. These output words are considered putative regulatory elements. A p-value is also provided with each word to present the confidence level. Furthermore, with its filtering functions, WordSeeker is able to handle annotated sequences: it allows users to filter out 'N' characters and lowercase letters, which often indicate repeats and unknown regions that are biologically uninformative.

In addition to the basic function, several post-processing parts have been integrated into WordSeeker and perform their functions as parts of the whole analysis pipeline. Currently, the post-processing functions comprise word clustering, word distribution, module discovery, and functional look-up. Briefly, word clustering categorizes words according to a pre-set Hamming distance [37] or edit distance [38] from seed words selected in the word-selecting phase. From the word clusters, PWMs can be composed, and motifs can then be reported. Word distribution offers detailed information about the location of each word in the format of a map, and will be viewable through GBrowse [39] in the future. Module discovery, the main topic of this thesis, is introduced in detail in the next chapter, from the algorithm to the performance. Lastly, functional look-up is the latest part integrated into WordSeeker; it helps biologists look up whether the putative words or modules have been reported in the public databases TRANSFAC and AGRIS.
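As an illustration of the word-clustering idea, a seed word can gather all words within a chosen Hamming distance. This is a sketch under the stated definition of Hamming distance, not WordSeeker's actual implementation, and the function names are hypothetical:

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_around_seed(seed, words, max_dist):
    """Collect all words of the same length as the seed whose Hamming
    distance from the seed does not exceed max_dist."""
    return [w for w in words
            if len(w) == len(seed) and hamming(seed, w) <= max_dist]
```

A cluster built this way can then be stacked column-by-column to compose a PWM for motif reporting.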

4.2 Integration with WordSeeker

The enumerative module discovery method has been integrated into WordSeeker and serves as a part of the post-processing, as stated above. By inheriting the data structure, it is easy to obtain the seed words that users want to use for CRM analysis. Also, by calling the scoring function, it is unnecessary to rebuild the mathematical model to re-calculate the probability and expected value of each word. The architecture chart of the whole pipeline, generated by Jens Litchenberg, is shown in Figure 4.

Figure 4: Architecture of the WordSeeker pipeline. The blue blocks represent components that have already been integrated into the WordSeeker pipeline. Components in other colors are either in the development phase or waiting to be integrated into WordSeeker. This architecture chart was generated by Jens Litchenberg.


CHAPTER 5: ALGORITHMIC APPROACHES FOR MODULE DISCOVERY

Chapter 2 reviewed several popular approaches in this field. Most of them rely on known TFs, TFBSs, or PWMs. Potential problems of using prior knowledge of TFs or TFBSs have been discussed in chapter 2. Additionally, the imprecise model widely used for representing TFs is another reason why relying on known TFs is not an ideal choice. TRANSFAC [10] and JASPAR [13] are the most popular databases that researchers can access to obtain published TF references and sequences. Both databases represent TFs as PWMs; however, this representation raises further issues and makes a difficult problem even more burdensome. Different PWMs might be developed for the same TF; on the other hand, different TFs might have similar PWMs [12]. An algorithm that takes known PWMs as input has to have sufficient knowledge to utilize the presented information and apply proper filtering; however, there is no comprehensive understanding of the regulatory mechanism, which means there are no uniform standards or answers to the question.

This chapter introduces the two methods proposed in this thesis for solving the problem. Both can be executed either with or without prior knowledge of TFBSs: one enumerates all possible combinations of individual TFBSs, and the other uses training data to tune the parameter set and pursue more accurate performance on specific problems.


5.1 Enumerative Module Discovery Method

5.1.1 Motivation

As its name suggests, this method employs an enumerative algorithm and reports all possible combinations of putative regulatory elements, which are obtained from WordSeeker. The idea behind this method is that functional regulatory modules might be highly conserved during evolution. Also, genes that are expressed or suppressed in similar situations might have similar regulatory mechanisms. Based on these assumptions, statistically over-represented regions should be interesting for biologists and provide a guide for future biological experiments. Thus, this method does not simply enumerate all possible modules; it also provides different scoring functions for evaluating statistical significance and explores gap and density features for each reported module.

According to the observation of Rigoutsos et al. [40], in the untranslated regions of numerous genes, unusual words occur in certain patterns, as shown in Figure 5. The phenomenon that specific pyknons occur in certain patterns can easily be distinguished. For example, the words highlighted in pink and green always occur together in the same arrangement: the pink one is always followed by the green one at a distance of 3 nucleotides. This visual observation provides the original and intuitive motivation to develop an enumerative method. Beyond the example illustrated below, other limitations, including insufficient training data and ignorance of the gene regulatory model, not only cause more difficulties for module discovery, but also show the necessity of, and motivation for, developing an enumerative method.

Figure 5: An observed example of word patterns. The highlighted words are pyknons, which have at least 40 copies in non-coding regions. The grey notation -(xx)- indicates the distance between adjacent pyknons in number of nucleotides. Several pyknons occur in a certain pattern; this observation provides a motivation for developing an enumerative method. The figure is adapted from [40].


5.1.2 Algorithm

The enumerative module discovery algorithm proceeds in several steps; the flow chart is shown in Figure 6:

Figure 6: Flow chart of the enumerative module discovery method. This image illustrates the steps the algorithm performs for discovering CRMs. The initial steps completed by other parts of the WordSeeker pipeline are not included.


The first step of the algorithm is to read the input sequences, which are usually co-expressed non-coding regions. It also reads a word list, which is a set of either putative/known TFBSs or PWMs. Both are obtained from the built-in data structure in the WordSeeker framework. If a set of PWMs is chosen rather than a list of TFBSs, then a user-defined threshold is employed to filter out unsatisfactory binding sites. For example (Figure 7), the PWM exhibits 36 possible binding sites: AACCA, AACCC, AACCG, AACGA, AACGC, AACGG, AGCCA, AGCCC, AGCCG, AGCGA, AGCGC, AGCGG, ATCCA, ATCCC, ATCCG, ATCGA, ATCGC, ATCGG, CACCA, CACCC, CACCG, CACGA, CACGC, CACGG, CGCCA, CGCCC, CGCCG, CGCGA, CGCGC, CGCGG, CTCCA, CTCCC, CTCCG, CTCGA, CTCGC, CTCGG.

          A     C     G     T
[AC]     1/2   1/2    0     0
[AGT]    1/5    0    3/5   1/5
C         0     1     0     0
[CG]      0    1/4   3/4    0
[ACG]    1/5   2/5   2/5    0

Figure 7: An example of a motif presented as a PWM. The regular expression of the motif presented by this PWM is [AC][AGT]C[CG][ACG]. The four columns show the frequency of each nucleotide at each position.


If a user sets 6/200 as the threshold for selecting binding sites, meaning a binding site is retained only if its frequency is equal to or higher than 6/200, then 18 of the 36 binding sites are retained: AACGC, AACGG, AGCCC, AGCCG, AGCGA, AGCGC, AGCGG, ATCGC, ATCGG, CACGC, CACGG, CGCCC, CGCCG, CGCGA, CGCGC, CGCGG, CTCGC, CTCGG. The pseudo code of this function is shown in Figure 8:

PWM_read(i, P, array):
Note:
  P – the word length that the PWM presents
  i – the current position index in P; the initial value of i is 0
  array – a vector storing the nucleotides chosen so far
  freq – a double variable recording the frequency of the current word
  threshold – a pre-set value
  words – the returned list of words extracted from the PWM whose frequency is no lower than the threshold

if (i == P.size()-1) {
    for each n in {A,C,G,T} with freq_n > 0 {
        freq = freq * freq_n;
        if (freq >= threshold) {
            array.push_back(n);
            words.push_back(array);
            array.pop_back();
        }
        freq = freq / freq_n;
    }
    return;
}
for each n in {A,C,G,T} with freq_n > 0 {
    freq = freq * freq_n;
    array.push_back(n);
    PWM_read(i+1, P, array);
    array.pop_back();
    freq = freq / freq_n;
}
return;

Figure 8: Pseudo code for the PWM read-and-filter function.
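A runnable counterpart to the PWM read-and-filter step can be written as follows. This is an illustrative sketch, not the thesis code; it uses exact rational arithmetic so that words whose frequency equals the threshold exactly (such as 6/200 in the Figure 7 example) are not lost to floating-point rounding:

```python
from fractions import Fraction
from itertools import product

def pwm_filter(pwm, threshold):
    """pwm: one dict per position, mapping nucleotide -> frequency
    (zero-frequency nucleotides omitted).  Returns every word the PWM can
    represent whose frequency (the product of per-position frequencies)
    meets the threshold."""
    words = []
    for combo in product(*(sorted(col) for col in pwm)):
        freq = 1
        for pos, n in enumerate(combo):
            freq *= pwm[pos][n]
        if freq >= threshold:
            words.append("".join(combo))
    return words

# The PWM of Figure 7: [AC][AGT]C[CG][ACG]
figure7_pwm = [
    {"A": Fraction(1, 2), "C": Fraction(1, 2)},
    {"A": Fraction(1, 5), "G": Fraction(3, 5), "T": Fraction(1, 5)},
    {"C": Fraction(1, 1)},
    {"C": Fraction(1, 4), "G": Fraction(3, 4)},
    {"A": Fraction(1, 5), "C": Fraction(2, 5), "G": Fraction(2, 5)},
]
```

With threshold 0 this enumerates all 36 binding sites; with threshold 6/200 it keeps the 18 sites listed above.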

After extracting all qualified binding sites, the next step is to detect the locations of each word by using a sliding window. The position information is recorded and applied in further steps. Next, considering all parameters set by users, the algorithm exhausts all possible combinations using an iterative algorithm; the pseudo code is shown in Figure 9. With the help of the recorded positions, non-existent and unsatisfactory modules are filtered out, as described in the flow chart of Figure 6.

Enumeration:
Note:
  n – the pre-set dimension of the module
  W – the input word list
  M – the modules list to be returned
  sM – a temporary module under construction

Function MakeCombinations(W, n, M) {
    if (n == 0) { M.push_back(sM); return; }
    if (W.size() < n) return;
    // include the first word of W in sM, then recurse on the rest
    sM.push_back(W.front());
    MakeCombinations(W.rest(), n-1, M);
    sM.pop_back();
    // or skip the first word of W
    MakeCombinations(W.rest(), n, M);
}

Figure 9: The pseudo code for the function of enumerating combinations. This function enumerates all possible combinations of single TFBSs by using an iterative algorithm.
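The enumeration of all d-dimensional candidate modules can also be sketched with the standard library; `make_combinations` below is a hypothetical stand-in for the thesis function, shown only to make the combinatorics concrete:

```python
from itertools import combinations

def make_combinations(word_list, d):
    """Enumerate every candidate module of dimension d, i.e. every
    unordered combination of d distinct words from the word list."""
    return [list(c) for c in combinations(word_list, d)]
```

For a word list of n words this produces C(n, d) candidates, e.g. 6 pairs from 4 words, which is why the position-based filtering that follows is essential to keep the search tractable.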

For the purpose of selecting statistically over-represented modules, three scoring functions are implemented, and their output is associated with the predicted modules. Fundamentally, the probability of each word is the key to calculating the expected sequence hits for each module, and is also the basis for establishing the assessment criteria.

The way of calculating the probability was introduced previously. In the module discovery part, the probability of each putative binding site is extracted from the built-in data structure. The expected value of sequence hits represents how many sequences are expected to contain a specific module. Before doing the calculation, an assumption needs to be made: the occurrence probability of any nucleotide is independent at each position. All calculations in the further steps are based on this assumption.

Let wi be a word and Zj be a binary random variable defined by:

Zj(wi) = 1 if wi occurs in sequence Sj, and 0 otherwise    (5)

Suppose that there are s sequences, and the length of sequence Sj is nj. The character w is used to denote a single word, |w| is its length, and p(w) is the probability of the word w. Then, formally, given a word w, the expected number of sequences that contain the word is given by [41]:

Es = E[Σ_{j=1..s} Zj(w)] = Σ_{j=1..s} E[Zj(w)] = Σ_{j=1..s} P(Zj(w) = 1) = Σ_{j=1..s} (1 − (1 − p(w))^(nj − |w| + 1))    (6)


The above equation shows how to calculate the expected sequence hits for a single word; the expected number of sequence hits for two words, w1 and w2, is [41]:

Es = Σ_{j=1..s} (1 − (1 − p(w1))^(nj − |w1| + 1)) (1 − (1 − p(w2))^(nj − |w2| + 1))    (7)

Assuming that in sequence Sj the probability of a word w is position-independent and denoted p(w), then (1 − p(w))^(nj − |w| + 1) is the probability that w does not occur in Sj, so 1 − (1 − p(w))^(nj − |w| + 1) is the probability that w occurs in sequence Sj. Thus, the product (1 − (1 − p(w1))^(nj − |w1| + 1)) (1 − (1 − p(w2))^(nj − |w2| + 1)) is the probability that w1 and w2 both occur in sequence Sj. Equation (7) can easily be generalized to the case of k ≥ 2 words. Let W = {w1, …, wk} be a set of words; then the expected value of sequence hits for W is:

Es = E[Σ_{j=1..s} Zj(W)] = Σ_{j=1..s} P(Zj(W) = 1)    (8)

So far, the key question of equation (8) is how to compute P(Zj(W) = 1), the probability that all words of the set W occur in a specific sequence. Since the occurrence of each word is assumed independent, this probability is simply the product of the probabilities of the individual words. The question thus reduces to calculating the probability of a single word w. As introduced above, 1 − (1 − p(w))^(nj − |w| + 1) is the probability that w occurs in sequence Sj. Then the probability for a set of words W is:

P(Zj(W) = 1) = Π_{i=1..k} P(Zj(wi) = 1) = Π_{i=1..k} (1 − (1 − p(wi))^(nj − |wi| + 1))    (9)

Thus, the generalized formula for the expected value of sequence hits for a set of words is:

Es = Σ_{j=1..s} Π_{i=1..k} (1 − (1 − p(wi))^(nj − |wi| + 1))    (10)
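The expected-sequence-hits calculation translates directly into code. The sketch below assumes, as in the derivation, that each word's per-position occurrence probability p(w) is given and that a word of length |w| has nj − |w| + 1 candidate start positions in a sequence of length nj; it is an illustration, not the WordSeeker implementation:

```python
def expected_sequence_hits(words, p, seq_lengths):
    """Expected number of sequences containing every word in `words`:
    Es = sum over sequences of the product over words of
         1 - (1 - p(w))^(n_j - |w| + 1),
    i.e. the probability that all words occur in that sequence, under
    the independence assumptions stated in the text."""
    es = 0.0
    for n in seq_lengths:
        prob_all = 1.0
        for w in words:
            positions = n - len(w) + 1
            prob_all *= 1.0 - (1.0 - p[w]) ** positions
        es += prob_all
    return es
```

As a sanity check, a word with probability 1 is expected in every sequence, and any module containing a word with probability 0 has expectation 0.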

For each module M, the observed sequence hits are denoted Os(M) and the expected value of sequence hits is Es(M). These two values are compared for the purpose of selecting statistically over-represented modules. Three scoring functions, named S1, S2, and S3, are conceived and associated with the output modules. They are defined:

S1(M) = Os(M) / Es(M)    (11)

S2(M) = Os(M) · ln(Os(M) / Es(M))    (12)

S3(M) = (Os(M) − Es(M))² / Es(M)    (13)

S1 gives the most intuitive impression by comparing observed values to expected values directly: if the value of S1 is higher than 1, then the observed value is higher than the expected one. However, this naïve comparison is not sufficient for achieving the goal. Hence, two more scoring functions are conceived. S2 weights the ratio of S1 by the observed sequence hits; S2 is also used as a main evaluation function for putative binding sites in WordSeeker's basic function part, where it performs effectively in discovering binding sites. Finally, S3 is commonly used as the test statistic in Pearson's chi-squared test [42]; a p-value can easily be obtained from a chi-squared table, so the significance level is easily captured.
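The three scores are simple to compute once Os(M) and Es(M) are known. In this sketch, S2 is assumed to take the S·ln(S/Es) form that WordSeeker's basic scoring functions use, and S3 the Pearson chi-squared form; both are my reading of the thesis formulas rather than verbatim code:

```python
import math

def s1(obs, exp):
    return obs / exp                      # direct observed/expected ratio

def s2(obs, exp):
    return obs * math.log(obs / exp)      # S*ln(S/Es)-style score (assumed form)

def s3(obs, exp):
    return (obs - exp) ** 2 / exp         # Pearson chi-squared test statistic
```

Note that s3 is symmetric in over- and under-representation (both give large values), while s1 and s2 distinguish the direction of the deviation.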

In addition to the above-mentioned steps, two optional features are provided: the density distribution and the distance distribution map. The terminology of density and distance, and the necessity of offering this information, were introduced in chapter 3. Here, the focus is on the format that carries these two optional features. For the distance distribution file, six ranges are conceived for displaying the information, and within each range the number of occurrence times is counted and output. The six ranges are 0-20, 21-40, 41-60, 61-80, 81-100, and more than 100 base pairs. For the density distribution, similarly, six categories are constructed based on percentages: <10%, 10%-20%, 20%-30%, 30%-40%, 40%-50%, and more than 50%. Again, for each category, the number of occurrence times is reported.
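The binning scheme is a straightforward lookup. The sketch below assumes the boundary conventions implied by the range labels (the thesis does not state which endpoint is inclusive, so that choice is an assumption here):

```python
def distance_bin(gap):
    """Map a gap (in bp) to one of the six distance ranges."""
    if gap > 100:
        return ">100"
    for upper, label in [(20, "0-20"), (40, "21-40"), (60, "41-60"),
                         (80, "61-80"), (100, "81-100")]:
        if gap <= upper:
            return label

def density_bin(density):
    """Map a density in [0, 1] to one of the six density categories."""
    if density >= 0.5:
        return ">50%"
    for upper, label in [(0.1, "<10%"), (0.2, "10%-20%"), (0.3, "20%-30%"),
                         (0.4, "30%-40%"), (0.5, "40%-50%")]:
        if density < upper:
            return label
```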

5.1.3 Complexity Analysis

As with any exhaustive approach, time and space consumption are among the greatest concerns. This section analyzes the time and space complexity of the enumerative module discovery method.

Firstly, the time complexity is the sum of the time consumed by each part shown in Figure 6. Since the words and sequences are directly extracted from the existing data structure, the analysis of time complexity begins with finding the positions, and proceeds through the remaining steps in order.

Assume that there are n words, m sequences, and the required dimension is d. The time complexity for finding positions is O(nm), because every word has to be searched through every sequence. Making combinations needs O(n(n−1)···(n−d+1)) = O(n^d), as shown in the pseudo code. After combination, the algorithm checks not only the combination of words, but also the positions of each combination's components, for the purpose of preventing or allowing overlaps and creating a distance and/or density distribution map on request. So the time complexity of this part is O(C × Π_{i=1..d} p_i), where p_i is the number of positions at which the ith word of a module occurs, and C is the total number of d-dimensional combinations in the worst case. Thereby, the overall time complexity is O(nm) + O(n^d) + O(C × Π_{i=1..d} p_i). Secondly, the positions of the words are stored in a vector of vectors, which takes O(Σ_{i=1..n} p_i) space; the combination results are stored in a vector, which consumes O(C) space. So the total space consumption is O(Σ_{i=1..n} p_i) + O(C).


5.2 Hierarchical Agglomerative Clustering (HAC) Module Discovery Method

5.2.1 Motivation

Besides the enumerative method, another method has been designed and developed for module discovery: the HAC method, which inherits both its approach and its name from the Hierarchical Agglomerative Clustering (HAC) algorithm [43]. HAC is a classic and straightforward machine learning approach for classifying items that have multiple dimensions of features, which is a good fit for the module discovery problem. Several reported features of CRMs, as stated above, can be considered as basic components for developing the similarity function. Since all features have their own unique measurements, it is impossible to compare different features directly; HAC provides a feasible way to handle this.

As a machine learning algorithm, it requires training data for deciding parameters. Although insufficient training data makes this kind of method unsuitable for general problems, for well-understood problems or species this method has advantages that cannot be replaced. Furthermore, combining the two methods makes the whole proposed approach more comprehensive and flexible.

5.2.2 Algorithm

The basic idea of this method is that every individual instance is initially considered a cluster. Based on a similarity function, also called the quality function, the two most similar clusters among all entities are merged into a new cluster iteratively, until there is only one cluster or the similarity requirement, also called the quality threshold, can no longer be satisfied. A feature vector, which includes several distinctive characteristics, is generated for each cluster and used for evaluating its quality. Decisions are made by a quality function, which is a weighted sum of all features. Specifically, for module discovery, the initial clusters are the single TFBSs; then, based on the conceived feature vector, the two most similar binding sites are joined together to form a new module. This process is repeated until either the pre-set quality threshold can no longer be achieved or all binding sites are gathered into one module.
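The merge loop can be sketched generically as follows. This is a minimal illustration of the greedy agglomerative scheme just described, not the thesis code; `quality` stands for any quality function scoring a candidate merged cluster, and the names are hypothetical:

```python
from itertools import combinations

def hac(items, quality, threshold):
    """Greedy agglomerative clustering: start with singleton clusters and
    repeatedly merge the best-scoring pair, stopping when no merge meets
    the quality threshold or only one cluster remains."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best_pair, best_q = None, threshold
        for a, b in combinations(range(len(clusters)), 2):
            q = quality(clusters[a] + clusters[b])
            if q >= best_q:          # best merge so far that meets the threshold
                best_pair, best_q = (a, b), q
        if best_pair is None:        # no merge satisfies the quality threshold
            break
        a, b = best_pair
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters
```

For module discovery, `items` would be the single TFBSs and `quality` the weighted feature-vector function described below in this chapter.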

The distinctive characteristics of CRMs, introduced in chapter 3, are all considered and included as features in the feature vector. Moreover, since genomic regulatory analysis is usually interested in several co-regulated promoters, features that distinguish a single sequence from a cluster of sequences are of interest. Thus, two categories of features are conceived for the feature vector, or quality function: features within a single sequence, and features across all input sequences.

In the following text, each feature is illustrated in detail, with examples for specific explanation.

Features in a specific sequence. Four features are considered significant and necessary to measure similarity in this category: number of TFBSs, maximum span, minimum gap, and maximum gap.

Number of TFBSs is the number of TFBSs contained in the cluster under evaluation. For the example shown in Figure 1, assuming module M is under evaluation, the number of TFBSs of this potential new cluster is 3.

Maximum span is the maximum number of nucleotides that the evaluated cluster could cover in the specific sequence. Span is defined in chapter 3. The reason for this feature is that an evaluated cluster may occur more than once in a sequence, and the arrangement of the same components can differ, so one module can have different spans; however, it is unnecessary and inefficient to investigate all possible spans.

Minimum gap and maximum gap. Gap was introduced in chapter 3. As the names suggest, these are the smallest and largest gaps among all occurrences of the evaluated cluster. If the minimum gap is allowed to be negative, then overlaps among binding sites are allowed; with different penalty settings, overlaps can be prevented or permitted. If the potential cluster has more than one entity, both the minimum and maximum gap are taken over all pairs of adjacent entities.

Features across all input sequences. Three features are included in this category: minimum gap, collection of gaps, and sequence hits.

Minimum gap is the minimum number of nucleotides between any adjacent TFBSs across all copies of the evaluated module. The definition of this term is the same as that of the minimum gap in a specific sequence; the only difference is that the per-sequence version inspects only one sequence, while this one has to explore all sequences to determine the value.

Collection of gaps (cog). As is known, multiple TFs usually work together and bind multiple binding sites located in a nearby region. Rather than inspecting only the minimum and maximum gap, it is more informative to explore the overall situation of the gaps; thus, the distribution of all gaps in a module is assessed. Furthermore, considering that input sequences may have different lengths, it is not fair to penalize gaps of the same length equally in sequences of different lengths. So a parameter λ, which can be set automatically or manually, is introduced for normalizing over different sequence lengths and making cog values comparable. The formula of cog is:

cog = \sum_{i=1}^{s} (λ / L_i) \sum_{j} g_{i,j}    (14)

where L_i is the length of sequence S_i, i = 1, …, s, and g_{i,j} is the j-th gap of the module occurrence in S_i. Example 4, shown in Figure 10, is designed to describe this concept. In this example, assume that W1, W2, and W3 together constitute a potential module, which is being evaluated at this point, and all three words have the same word length of 6 bps. Then the length of Seq1 is equal to 23, while the length of Seq3 is equal to 34. There is no occurrence of the potential module in Seq2. So, according to equation 14, cog = (3 + 2) × λ/23 + (3 + 4 + 3) × λ/34.

Example 4:
>Seq1 W1nnn W2nn W3
>Seq2 W1nnnnn W3
>Seq3 W1nnn W2nnnn W3nnn W2

Figure 10: An example for describing cog. There are three sequences in Example 4. Wk stands for a single TFBS.
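A minimal sketch of the cog computation, under the reading that cog sums, for every sequence containing the module, λ divided by the sequence length times the total of that sequence's gaps. The occurrence map and λ value below are illustrative assumptions:

```python
def cog(occurrences, lam):
    """Collection-of-gaps score.

    `occurrences` maps each sequence that contains the candidate module to a
    (sequence_length, list_of_gaps) pair; `lam` is the normalization
    parameter λ. Sequences without an occurrence (e.g. Seq2 in Example 4)
    simply do not appear in the map and contribute nothing.
    """
    return sum(lam * sum(gaps) / length
               for length, gaps in occurrences.values())

# Example 4: Seq1 has gaps 3 and 2 within 23 bps; Seq3 has gaps 3, 4, 3 within 34 bps
value = cog({"Seq1": (23, [3, 2]), "Seq3": (34, [3, 4, 3])}, lam=1.0)
```

With λ = 1 this yields 5/23 + 10/34, so longer sequences penalize the same gap total less, which is the stated purpose of λ.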

Number of sequences (nos). This feature is also called sequence hits, which was described in chapter 2. The reason for assessing this feature was also stated in previous chapters. Take the example shown in Figure 10: the potential module, which consists of W1, W2, and W3, occurs in all sequences except Seq2, so its nos is equal to 2.

After the feature vector is established, the quality function (Q) is applied for assessing the clusters' similarity. The equation is:

Q = \sum_{x} W_x F_x    (15)

W_x is the weight for feature x, and F_x is the feature's value. The weight of every feature is assigned after learning from known datasets. The principle for deciding the weights is neither to miss significant clusters nor to produce too many false positive predictions. A word description of the algorithm is shown in Figure 11, while the high-level pseudo code is shown in Figure 12.
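The quality function of equation (15) is a plain weighted sum over the feature vector, which can be sketched as follows; the feature names and weight values here are purely illustrative, not the learned weights:

```python
def quality(features, weights):
    """Quality function Q: a weighted sum of feature values.

    `features` and `weights` map feature names to numbers; in practice the
    weights would be tuned on known datasets to balance sensitivity against
    false positives.
    """
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values and weights for one candidate cluster
q = quality({"num_tfbss": 3, "max_span": 120, "min_gap": 3, "nos": 2},
            {"num_tfbss": 1.0, "max_span": -0.01, "min_gap": -0.05, "nos": 2.0})
```

Negative weights on span and gap features express the intuition that tightly packed clusters should score higher; positive weights reward more sites and more sequence hits.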

While the number of clusters is more than 1 and the quality requirement can be satisfied, do:
    For every cluster in the cluster array (initially, each cluster consists of a single transcription factor binding site):
        Generate all possible combined clusters.
    For each possible cluster:
        Summarize all features.
        Calculate its quality.
    If the maximum quality is bigger than the threshold:
        Pick the cluster that has the maximum quality score.
        Use this new cluster to replace the smaller ones that are its components, and push it into the cluster array.
    Else:
        Exit the loop and return the cluster array.

Figure 11: Word description of the HAC algorithm. The text in this figure briefly describes the workflow of the HAC algorithm.

Notes:
– C: a set of possible clusters
– F: a set of clusters (initially, each TFBS is considered as a cluster)
– τ: a pre-set quality threshold
– Q: a quality function

1. while F.size > 1 do
2.     let C = { Fi ∪ Fj | Fi, Fj ∈ F; i, j = 1 to F.size };
3.     for each Ci ∈ C
4.         do Q(Ci);
5.     let max_q = maximum { Q(Ci) }, where Ci = Fp ∪ Fq;
6.     if max_q ≥ τ
7.         do replace Fp, Fq with Fp ∪ Fq
8.     else
9.         return F;

Figure 12: High-level pseudo code for the HAC algorithm. This figure illustrates the high-level pseudo code for the HAC algorithm, extended directly from the word description.
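The pseudo code of Figure 12 can be rendered as a short Python function. This is a sketch, not the actual implementation: `quality` and `tau` are supplied by the caller, and the choice among equally scored merges is an arbitrary implementation detail:

```python
from itertools import combinations

def hac_modules(tfbss, quality, tau):
    """Greedy agglomerative clustering of TFBSs into modules.

    `tfbss` is an iterable of single binding sites; `quality` scores a
    candidate cluster (a frozenset of TFBSs); `tau` is the threshold τ.
    """
    F = [frozenset([t]) for t in tfbss]   # each TFBS starts as its own cluster
    while len(F) > 1:
        # C: all pairwise unions of current clusters
        candidates = [(quality(a | b), a, b) for a, b in combinations(F, 2)]
        max_q, a, b = max(candidates, key=lambda t: t[0])
        if max_q < tau:                    # no merge is good enough: stop
            return F
        # replace the two component clusters with their union
        F = [c for c in F if c not in (a, b)] + [a | b]
    return F
```

For instance, with `quality=len` (cluster size as a stand-in score) and `tau=2`, three binding sites merge into a single three-site module; with `tau=4`, no merge passes the threshold and the three singletons are returned unchanged.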

5.2.3 Complexity Analysis

This algorithm is inherited from a classic machine learning method, so the analysis starts from that classic algorithm. The naive HAC algorithm takes O(n² × (n − 1)) = O(n³) running time, since it has to scan an n × n matrix n − 1 times [44]. In the specific case of module discovery, the main function has the same complexity, O(n_j³), where n_j is the number of words occurring together in a specific sequence S_j, although the number of iterations is usually smaller than n_j − 1 because of the pre-set threshold. For multiple sequences, the time complexity is the sum over all sequences S_j, j = 1, …, m, of O(n_j³), where m is the number of sequences. The whole algorithm is applied once for each sequence, and the stationary memory consumption is the position table for the input word list, which is O(n × m). Memory consumption for the cluster result is released after each round, and in the worst case takes O(C(n, n/2)) = O(n!/((n/2)!(n/2)!)). Thus, the total space complexity is O(n × m) + O(C(n, n/2)).
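The two growth rates discussed above can be made concrete with a small illustrative sketch (not part of the implementation):

```python
import math

def naive_hac_ops(n):
    """Operation count for naive HAC: the n-by-n similarity matrix is
    scanned on each of the n - 1 merge rounds, giving O(n^3) time."""
    return (n - 1) * n * n

def worst_case_clusters(n):
    """Worst-case number of distinct clusters of n words retained at once,
    C(n, n/2) = n! / ((n/2)! (n/2)!), which dominates the space bound."""
    return math.comb(n, n // 2)

# e.g. 10 words: on the order of 900 matrix-cell visits,
# and at most C(10, 5) = 252 clusters held in memory at once
```

In practice the pre-set threshold τ stops the merging early, so both counts are loose upper bounds.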


CHAPTER 6: EVALUATION

6.1 Benchmark Datasets

The benchmark dataset was obtained from an assessment paper [9], in which the authors used three experimentally verified datasets to evaluate the performance of eight popular published tools. By using the same benchmark datasets and evaluation system, it is convenient to compare prediction performance with other popular tools in the same field. Three benchmark datasets were included in the assessment paper: the first was extracted from TRANSCompel [45], a public database providing information on verified CRMs; the second was from liver tissue [7]; and the third was from muscle cells [8]. Modules found in these three datasets are all verified by biological experiments. More information about these three datasets, including the number of sequences, the number of modules, the average span of modules, and the maximum number of TFBSs, is summarized in Table 1. For the two proposed module discovery methods, the liver and muscle benchmark datasets were employed for assessing prediction performance, and the results were compared with other published tools. The TRANSCompel dataset was used for comparing the different scoring functions of the enumerative module discovery method. The reason for this arrangement will be explained in the corresponding section.


Table 1

The summarization of benchmark datasets. For each dataset, the number of contained sequence sets, the number of contained modules, the average span of contained modules, and the maximum number of TFBSs included in one module are summarized in this table.

Dataset        Sequence Sets   Modules   Average Span (bp)   Max. TFBSs in a Module
TRANSCompel    10              81        33                  2
Muscle         1               24        120                 8
Liver          1               14        96                  9

6.1.1 TRANSCompel

The benchmark dataset of TRANSCompel has 10 sequence sets, and every module is constituted of exactly two TFBSs, which is a rare condition and not suitable as a general problem setting. Therefore, this dataset was not employed for evaluating the prediction performance of the proposed methods; however, for the same reason it is suitable for measuring the different scoring functions. The sequence set AP1-Ets is the largest, containing 16 sequences, and has the maximum number of modules, 17. Across the whole dataset, the span of modules ranges from 12 to 135 bps, with average spans from 16 to 84 bps. The detailed information of each module, including the sequences of the CRMs and the contained TFBSs, is shown in APPENDIX 1.


Table 2

Detailed information of the TRANSCompel benchmark dataset. For each sequence set included in the TRANSCompel benchmark dataset, this table shows the number of sequences, the number of modules, the average sequence length, the minimum, maximum, and average span of modules, and the maximum number of TFBSs contained in one module.

Sequence set     Sequences   Modules   Avg. seq. length (bp)   Min span (bp)   Max span (bp)   Avg. span (bp)   Max TFBSs in a module
AP1-Ets          16          17        929                     14              99              27               2
AP1-NFAT         8           11        862                     14              19              16               2
AP1-NFkappaB     7           8         933                     18              135             53               2
CEBP-NFkappaB    8           8         916                     44              118             84               2
Ebox-Ets         4           6         872                     16              50              25               2
Ets-AML          5           5         811                     13              30              19               2
IRF-NFkappaB     6           6         891                     23              71              43               2
NFkappaB-HMGIY   6           7         899                     10              32              13               2
PU1-IRF          5           5         906                     12              14              13               2
Sp1-Ets          7           8         827                     16              117             37               2

6.1.2 Muscle Dataset

Muscle is one of the few well-understood tissues [8], so it has become the model tissue in the field of CRM study. In this dataset, 24 modules are embodied in 24 sequences. The largest module has 8 individual TFBSs as components, while the smallest one has 2 TFBSs. Except for the sequence M13631, the remaining sequences all have a length of 1000 bps. The sequence M13631 has 269 bps and contains a module that consists of 4 single binding sites. Modules embodied in this dataset consist of at least 2 TFBSs, with an average of 3.5. The span of modules varies from 14 to 294 bps. The summarization is shown in Table 3. The detailed information of each module is shown in APPENDIX 2.

Table 3

Detailed information of the Muscle benchmark dataset. Summary of the 24 sequences and 24 modules included in the muscle benchmark dataset: the average length of sequences; the minimum, maximum, and average span of all contained modules; and the maximum and average number of TFBSs contained in one module.

Sequences   Modules   Avg. seq. length (bp)   Min span (bp)   Max span (bp)   Avg. span (bp)   Max TFBSs in a module   Avg. TFBSs
24          24        851                     14              294             120              8                       3.5

6.1.3 Liver Dataset

Besides muscle, liver is another relatively well-understood tissue [7], so it is often used as a model or as training data for module identification. As with the other benchmark datasets, the modules contained in the liver dataset were obtained from and verified by biological experiments. There are 14 modules embodied in 12 sequences. Except for the sequence M19524, which has a length of 943 bps, the sequences all have a length of 1000 bps. Module spans range from 22 to 176 bps, a variation that is not as large as in the muscle dataset, although the average number of contained TFBSs is higher; thus, the density of modules contained in the liver dataset is relatively higher. The information of the liver dataset and its modules is summarized in Table 4, and the detailed information of every contained module is shown in APPENDIX 3.


Table 4

Detailed information of the Liver benchmark dataset. Summary of the 12 sequences and 14 modules included in the liver benchmark dataset: the average length of sequences; the minimum, maximum, and average span of all contained modules; and the maximum and average number of TFBSs contained in one module.

Sequences   Modules   Avg. seq. length (bp)   Min span (bp)   Max span (bp)   Avg. span (bp)   Max TFBSs in a module   Avg. TFBSs
12          14        995                     22              176             96               9                       3.6

6.2 Evaluation of Scoring Functions

In this section, the three conceived scoring functions are assessed on the TRANSCompel benchmark dataset, for the purpose of selecting the best-performing one, which can guide users' further analysis. Detailed information about the TRANSCompel dataset was introduced in the preceding sections and is shown in Table 2 and APPENDIX 1.

The same procedure was executed for each of the 10 subsets separately. Firstly, the basic function of WordSeeker, word counting, was applied with a fixed word length of 6 and a variable Markov order from 0 to 4, since there is no strong evidence for how to select the best Markov order, and the maximum order is (word length − 2). A P-value smaller than 0.05 was used as the threshold for selecting significant words.

Secondly, the enumerative module discovery method with a fixed dimension of 2 was applied to the selected words. Then, the three scoring functions were compared against the correctly predicted modules. A better scoring function is expected to assign a higher rank to the target known modules, also called correctly predicted modules. Thus, by comparing the ranks of the correctly predicted modules when sorted by each specific score, the performance of the different scoring functions can be measured. In the following sections, the test results for the ten subsets are reported one by one.
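The two-step pipeline (significance filtering, then pairwise enumeration) can be sketched as follows. This is an illustrative sketch, not the WordSeeker implementation, and the word p-values below are placeholder values:

```python
from itertools import combinations

def significant_words(word_pvalues, alpha=0.05):
    """Keep the words whose word-counting p-value passes the threshold."""
    return [w for w, p in word_pvalues.items() if p < alpha]

def enumerate_modules(words, dimension=2):
    """Enumerate candidate modules as unordered word combinations of the
    given dimension (dimension 2 was used in this evaluation)."""
    return [frozenset(c) for c in combinations(words, dimension)]

# Hypothetical p-values from the word-counting phase
words = significant_words({"TAATCA": 0.001, "TTTCCT": 0.02, "ACGTAC": 0.30})
modules = enumerate_modules(words)   # one candidate pair survives
```

Each candidate module would then be scored by the three functions and the ranks of the correct predictions compared.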

6.2.1 AP1-NFAT

As introduced above, the same process was repeated 10 times. Since AP1-NFAT is the first instance of the evaluation procedure, the comparison step is described in detail in this section. After running word counting and the enumerative module discovery method, the results from both phases were obtained. The number of words with P-value smaller than 0.05, the number of enumerated modules, and the matched sites are summarized in Table 5.

Take the Markov order of 0 as an example: 344 words pass the P-value threshold, and 20 of them match known binding sites. Because a fixed length is chosen for word counting, the same predicted word can cover multiple different known binding sites whose lengths are shorter than 6 bps, and vice versa. Note that the term “match” means “exactly the same” or “fully contains/is contained”. The “Matched Percentage” is the number of matched words divided by the number of predicted words; in the case of Markov order 0, the matched percentage equals 20 ÷ 344 = 5.81%. The “Cover Sites” column tells how many known binding sites are correctly predicted; in this example, 14 known binding sites out of 21 are covered by predictions. For the 344 predicted words, the enumerative module discovery method was applied, and 56341 modules were obtained. Among these 56341 modules, 5 are correctly matched with known modules. Again, because of the fixed word length and the variable actual length, the same known module can have more than one correct prediction. Out of the 11 known modules, these 5 predicted modules cover 3 of them, and the ratio is shown in the “Cover Modules” column. For Markov orders of 3 and 4, which have insufficient or no intersection with the target sites, there obviously cannot be a correct module prediction, so it is unnecessary to run the enumerative module discovery algorithm.
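The matched-percentage arithmetic can be written out explicitly; the figures in the comment are the Markov order 0 values reported for this dataset:

```python
def matched_percentage(matched, predicted):
    """Percentage of predicted words that match known binding sites."""
    return 100.0 * matched / predicted

# Markov order 0 on AP1-NFAT: 20 of the 344 significant words match
pct = matched_percentage(20, 344)    # ≈ 5.81%
```

The “Cover Sites” and “Cover Modules” ratios are the complementary view: how many of the known sites or modules the predictions recover, e.g. 14/21 sites and 3/11 modules here.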

Detailed results for the different Markov orders are displayed in Table 6. As explained before, one known module can be matched by multiple predicted modules, so the rank shown in Table 6 is the highest one when there is more than one putative module corresponding to a target.

Table 5

Result summarization for the AP1-NFAT dataset. Display of the parameters applied for each test, and the results from both word counting and the enumerative module discovery. Each column is described in the text.

#   Length   Markov Order   Number of Words   Matched Words   Matched %   Cover Sites   Number of Modules   Correct Predictions   Cover Modules
1   6        0              344               20              5.81%       14/21         56341               5                     3/11
2   6        1              196               11              5.61%       12/21         16562               7                     3/11
3   6        2              139               7               5.04%       7/21          8013                5                     2/11
4   6        3              96                1               1.04%       2/21          -                   -                     -
5   6        4              11                0               -           -             -                   -                     -

66

Table 6

Detailed result of the AP1-NFAT dataset. Display of the matched known modules, the correctly predicted modules, and the ranks assigned by each scoring function. For Markov orders of 3 and 4 there are too few matched binding sites to generate correct predictions, so the table only shows results for Markov orders of 0, 1, and 2.

Markov Order   Seq ID   Known Module         Correct Prediction   Rank by S1   Rank by S2   Rank by S3
0              X03021   ttaatca tttcctc      TAATCA_TTTCCT        29900        23570        23920
0              X14473   agaaattcc agagtca    AGAAAT_GAGTCA        8727         5239         5512
0              X03020   attaatca catttcctc   CATTTC_TAATCA        45589        45121        45121
1              X03021   ttaatca tttcctc      TTAATC_TTTCCT        9159         4141         6628
1              X14473   ttgaaaat gtgtaat     GAAAAT_GTGTAA        2592         142          690
1              X03020   attaatca catttcctc   ATTAAT_TTTCCT        7936         3339         5389
2              X03021   ttaatca tttcctc      TAATCA_TTTCCT        5460         2508         4456
2              X03020   attaatca catttcctc   ATTAAT_TTTCCT        5303         2359         4268

The result table was visualized to give a more intuitive impression, shown in Figure 13 to Figure 15, corresponding to the Markov orders. For each score, two bar charts were generated for the ranks of the correctly predicted modules: one illustrates the rank distribution; the other displays the cumulative count of True Positive (TP) hits (the definition of TP is introduced in section 6.3). An effective scoring function is expected to assign TPs a higher rank; thus, by comparing the rank distribution and the shape of the cumulative curve, a judgment can easily be made. For convenience of reference, S1, S2, and S3 are used to stand for the S/E, S·ln(S/E), and (S − E)/E scores, respectively.
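Under the reading that S1 = S/E, S2 = S·ln(S/E), and S3 = (S − E)/E, with S the observed and E the expected number of module occurrences, the three scores can be sketched as follows (the example counts are illustrative):

```python
import math

def scores(S, E):
    """The three module scoring functions compared in this section:
    S1 = S/E, S2 = S*ln(S/E), S3 = (S - E)/E, where S is the observed
    and E the expected number of module occurrences (E > 0, S > 0)."""
    s1 = S / E
    s2 = S * math.log(S / E)
    s3 = (S - E) / E
    return s1, s2, s3

# A module observed 8 times against 2 expected occurrences
s1, s2, s3 = scores(8, 2.0)
```

All three grow with over-representation; S2 additionally weights the ratio by the absolute count S, so frequent over-represented modules are favored over rare ones with the same ratio.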


Figure 13 : Visualized result of AP1-NFAT for Markov order of 0. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 0. Three rows present the results of S1, S2, and S3 in order.

Figure 14 : Visualized result of AP1-NFAT of Markov order of 1. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 1. Three rows present the results of S1, S2, and S3 in order.


Figure 15 : Visualized result of AP1-NFAT of Markov order of 2. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 2. Three rows present the results of S1, S2, and S3 in order.


As illustrated by Figure 13 to Figure 15, the advantage of S2 is exhibited clearly, except for the Markov order of 0. In Figure 13 there is only a slight difference between S2 and S3; by examining the detailed result table (Table 6), S2 is slightly better than S3. For example, for the target module “AGAAATTCC_AGAGTCA”, both S2 and S3 reached their highest rank with the predicted module “AGAAAT_GAGTCA”, but S2 ranked the correct prediction 5239, while S3 ranked it 5512.

6.2.2 AP1-Ets

AP1-Ets contains 16 sequences; detailed information about the embodied modules and TFBSs is summarized in APPENDIX 1(1.a). The summarized results for predicted binding sites and modules are reported in Table 7. According to the summary, Markov order of 2 achieved the highest matched percentage for predicting binding sites, although the highest “cover modules” value was reached by Markov order of 0. For Markov orders of 3 and 4, there was no correct prediction in the module discovery phase, due to the few correctly predicted binding sites.

Table 7

Result summarization for the AP1-Ets dataset. Display of the parameters applied for each test, and the results from both word counting and the enumerative module discovery. Each column is described in the text.

#   Length   Markov Order   Number of Words   Matched Words   Matched %   Cover Sites   Number of Modules   Correct Predictions   Cover Modules
1   6        0              495               26              5.25%       16/26         121078              47                    7/17
2   6        1              204               10              4.90%       13/26         19292               11                    4/17
3   6        2              159               10              6.29%       13/26         11506               7                     5/17
4   6        3              124               3               2.42%       3/26          6938                -                     -
5   6        4              24                0               -           -             -                   -                     -

Figure 16 to Figure 18 visualize the results for Markov orders of 0, 1, and 2, in order. Apart from the Markov order of 0, S2 is clearly superior to the other two. For the Markov order of 0, it is hard to identify a better one from either the detailed result table or the image.

Table 8

Detailed result of the AP1-Ets dataset. Display of the matched known modules, the correctly predicted modules, and the ranks assigned by each scoring function.

Markov Order   Seq ID     Known Module          Correct Prediction   Rank by S1   Rank by S2   Rank by S3
0              L05187     gtgagtca ttcct        GAGTCA_TTTCCT        1384         1335         1326
0              L10616     tgagtca cccttcctgcc   GAGTCA_CTTCCT        2239         1707         1707
0              X12641     aggaaa tgaggtca       TGAGGT_AGGAAA        7276         9906         9565
0              D10051     aggaa tgagtca         GAGTCA_GAGGAA        2793         3474         3371
0              L36024     tgactca tcttcctg      GACTCA_TTCCTG        18577        18296        18296
0              AF039399   tgagtca aagggaag      GGGAAG_GAGTCA        2262         3114         2924
0              X02910     ttcct atgag ctcat     GAGCTC_TTCCTG        16707        16824        16826
1              D10051     aggaa tgagtca         GAGGAA_GAGTCA        9239         1265         4160
1              L05187     gtgagtca ttcct        GAGTCA_TTTCCT        8966         486          3148
1              L10616     tgagtca cccttcctgcc   CTTCCT_GAGTCA        10273        1732         5152
1              X03020     ttaatca tttcctc       TAATCA_TTTCCT        8827         4500         6273
2              D10051     aggaa tgagtca         AGGAAG_GAGTCA        8243         3122         5852
2              L05187     gtgagtca ttcct        TGAGTC_TTTCCT        7535         1415         4114
2              L36024     tgactca tcttcctg      GACTCA_CTTCCT        7246         2221         4547
2              L10616     tgagtca cccttcctgc    CTTCCT_GAGTCA        6090         320          2129
2              X03020     ttaatca tttcctc       TAATCA_TTTCCT        6397         3025         4549


Figure 16 : Visualized result of AP1-Ets of Markov order of 0. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 0. Three rows present the results of S1, S2, and S3 in order.


Figure 17 : Visualized result of AP1-Ets of Markov order of 1. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 1. Three rows present the results of S1, S2, and S3 in order.


Figure 18 : Visualized result of AP1-Ets of Markov order of 2. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 2. Three rows present the results of S1, S2, and S3 in order.


6.2.3 AP1-NFkappaB

The AP1-NFkappaB dataset has 7 sequences including 8 modules. Detailed information about the embodied modules and TFBSs is summarized in APPENDIX 1(1.c). The summarized results for predicted binding sites and modules are reported in Table 9. For Markov orders of 2, 3, and 4, there were no correctly predicted modules, due to the unsuccessful prediction of binding sites. Between Markov orders of 0 and 1, the highest matched percentages for predicted binding sites and for modules were both achieved by the higher Markov order. Table 10 shows the detailed result for each score.

Figure 19 and Figure 20 present the results for Markov orders of 0 and 1 in graphs, and the advantage of the score S2 is obvious, even for the Markov order of 0. S1 performed the worst among the three scores. Take the module “CTGACATCA_GGGGATTTCCT” as an example: S2 ranked it 29th out of 7301 modules; S3 ranked it 920; and S1 ranked it 2761, far inferior to the other two.


Table 9

Result summarization for the AP1-NFkappaB dataset. Display of the parameters applied for each test, and the results from both word counting and the enumerative module discovery. Each column is described in the text.

#   Length   Markov Order   Number of Words   Matched Words   Matched %   Cover Sites   Number of Modules   Correct Predictions   Cover Modules
1   6        0              199               12              6.03%       9/15          18380               7                     2/8
2   6        1              131               10              7.63%       10/15         7301                6                     3/8
3   6        2              104               3               2.88%       3/15          4286                -                     -
4   6        3              60                2               3.33%       2/15          1498                -                     -
5   6        4              11                1               9.09%       -             -                   -                     -

Table 10

Detailed result of the AP1-NFkappaB dataset. Display of the matched known modules, the correctly predicted modules, and the ranks assigned by each scoring function.

Markov Order   Seq ID   Known Module             Correct Prediction   Rank by S1   Rank by S2   Rank by S3
0              M64485   ctgacatca ggggatttcct    TGACAT_GATTTC        4844         157          1030
0              V00534   tgacatag gggaaattcctc    TGACAT_TTCCTC        4294         733          1868
1              M64485   ctgacatca ggggatttcct    TGACAT_GATTTC        2761         29           920
1              M64485   ctgacatca tggatattcc     TGACAT_GATATT        8966         486          3148
1              V00534   tgacatag gggaaattcctc    TGACAT_TTCCTC        3647         515          1768


Figure 19 : Visualized result of AP1-NFkappaB of Markov order of 0. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 0. Three rows present the results of S1, S2, and S3 in order.


Figure 20 : Visualized result of AP1- NFkappaB of Markov order of 1. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 1. Three rows present the results of S1, S2, and S3 in order.


6.2.4 CEBP-NFkappaB

The CEBP-NFkappaB dataset has 8 sequences, and each one contains one module. The detailed information about the TFBSs and modules of this dataset is summarized in APPENDIX 1(1.d). The summarized results for predicted binding sites and modules are reported in Table 11. The highest matched percentage of correctly predicted binding sites was obtained by Markov order of 1, although it produced no correctly predicted modules. Interestingly, Markov order of 2 has more correct predictions than Markov order of 1, although the lower Markov order reported more modules. Table 12 shows the detailed result for each score.

Figure 21 shows the result for Markov order of 0. It is hard to tell the differences among the three scores from either the image or the result table. Since there was only one correct module prediction for Markov order of 2, it is unnecessary to visualize that result; it is easier to compare the result table directly. From the result table, S1 performed better than S2 and S3, although the advantage was not obvious.


Table 11

Result summarization for the CEBP-NFkappaB dataset. Display of the parameters applied for each test, and the results from both word counting and the enumerative module discovery. Each column is described in the text.

#   Length   Markov Order   Number of Words   Matched Words   Matched %   Cover Sites   Number of Modules   Correct Predictions   Cover Modules
1   6        0              388               9               2.32%       10/16         73448               8                     3/8
2   6        1              170               5               2.94%       6/16          13389               -                     -
3   6        2              134               3               2.24%       7/16          8258                1                     1/8
4   6        3              114               3               2.63%       3/16          5874                -                     -
5   6        4              8                 -               -           -             -                   -                     -

Table 12

Detailed result of the CEBP-NFkappaB dataset. Display of the matched known modules, the correctly predicted modules, and the ranks assigned by each scoring function.

Markov Order   Seq ID     Known Module                          Correct Prediction   Rank by S1   Rank by S2   Rank by S3
0              L05921     acacaactggga gggactttcc               CTTTCC_AACTGG        20967        20972        20967
0              Z11749     cattgagcaatct ggattttccc              TGAGCA_GGATTT        37857        37857        37857
0              M98536     tgcggatgaaga_aaccatgca ggggctttcc     GAAGAA_CTTTCC        49037        49037        49059
2              AY008847   gaaattccc atgttgcaa                   TTGCAA_GAAATT        3434         4045         3612


Figure 21 : Visualized result of CEBP-NFkappaB of Markov order of 0. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 0. Three rows present the results of S1, S2, and S3 in order.

6.2.5 Ebox-Ets

The dataset of Ebox-Ets has 4 sequences with 8 embodied modules. The detailed information on the sequences of TFBSs and modules is summarized in APPENDIX 1(1.e). The summarized results for predicted binding sites and modules are reported in Table 13. The matched percentage of correctly predicted binding sites obtained by Markov order of 0 was much higher than the others: it reached 6.28%, while the others were all lower than 2%. Furthermore, Markov order of 0 was the only test that yielded correct module predictions, although Markov orders of 1, 2, and 3 all hit known binding sites. Table 14 shows the detailed result for each score. Figure 22 shows the visualized result: S1 performed slightly better than the other two, while S2 and S3 performed almost the same.

Table 13

Result summarization for the Ebox-Ets dataset. Display of the parameters applied for each test, and the results from both word counting and the enumerative module discovery. Each column is described in the text.

#   Length   Markov Order   Number of Words   Matched Words   Matched %   Cover Sites   Number of Modules   Correct Predictions   Cover Modules
1   6        0              191               12              6.28%       7/11          16476               12                    3/4
2   6        1              137               1               0.73%       1/11          -                   -                     -
3   6        2              106               2               1.89%       2/11          4770                -                     -
4   6        3              71                1               1.41%       1/11          -                   -                     -
5   6        4              8                 -               -           -             -                   -                     -


Table 14

Detailed result of the Ebox-Ets dataset. Display of the matched known modules, the correctly predicted modules, and the ranks assigned by each scoring function.

Markov Order   Seq ID   Known Module                   Correct Prediction   Rank by S1   Rank by S2   Rank by S3
0              V01523   agcagctggc ggaag               GGAAGC_AGCAGC        2435         2951         2916
0              X15943   ggaa cagctg                    CAGCTG_CCGGAA        7886         9267         9249
0              U11854   gtctgctgacc ccttcctctttt       CTGCTG_CTCTTT        3673         3934         3927


Figure 22 : Visualized result of Ebox-Ets of Markov order of 0. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 0. Three rows present the results of S1, S2, and S3 in order.


6.2.6 Ets-AML

This dataset has 5 sequences, and each sequence has one module. The detailed information on the sequences of TFBSs and modules is summarized in APPENDIX 1(1.f). The summarized results for predicted binding sites and modules are reported in Table 15. Except for the Markov order of 4, all the other orders hit at least 7 of the 9 known binding sites. The Markov order of 1 achieved the highest matched percentage; furthermore, its module predictions covered all known modules. Table 16 shows the detailed result for each score. Figure 23 to Figure 26 show the visualized results. For the Markov order of 0, the three scoring functions performed almost identically according to the image; examining the detailed result table suggests that S1 is slightly better, although not every matched module is displayed in the table. For the remaining Markov orders, S2 performed clearly better than the other two, and S1 was the worst.

Table 15

Result summarization for the Ets-AML dataset. Display of the parameters applied for each test, and the results from both word counting and the enumerative module discovery. Each column is described in the text.

#   Length   Markov Order   Number of Words   Matched Words   Matched %   Cover Sites   Number of Modules   Correct Predictions   Cover Modules
1   6        0              232               7               3.02%       9/9           24539               5                     4/5
2   6        1              138               8               5.80%       9/9           7843                6                     5/5
3   6        2              123               7               5.69%       8/9           6238                5                     4/5
4   6        3              85                4               4.71%       7/9           2920                3                     3/5
5   6        4              14                -               -           -             -                   -                     -


Table 16

Detailed result of the Ets-AML dataset. Display of the matched known modules, the correctly predicted modules, and the ranks assigned by each scoring function.

Markov Order   Seq ID   Known Module           Correct Prediction   Rank by S1   Rank by S2   Rank by S3
0              D14816   aaccaca gaggaa         ACCACA_GAGGAA        21653        21653        21653
0              X07177   caggat gtggttt         TGGTTT_CAGGAT        1482         1488         1486
0              X59486   caggat tgtggttt        TGTGGT_CAGGAT        5274         6029         6029
0              S68887   tgtggt ggggaa          TGTGGT_GGGGAA        4            14           4
1              D14816   aaccaca gaggaa         GAGGAA_ACCACA        7418         7443         7440
1              X07177   caggat gtggttt         TGGTTT_CAGGAT        3936         893          2215
1              X59486   caggat tgtggttt        CAGGAT_TGTGGT        4951         3318         4417
1              J02255   caggatat tgtggtaa      GGATAT_TGTGGT        3796         5848         5312
1              S68887   tgtggt ggggaa          GGGGAA_TGTGGT        2255         18           960
2              X07177   caggat gtggttt         TGGTTT_CAGGAT        4045         1198         2922
2              X59486   caggat tgtggttt        CAGGAT_TGTGGT        3865         2801         3535
2              J02255   caggatat tgtggtaa      GGATAT_TGTGGT        3614         4645         4378
2              S68887   tgtggt ggggaa          GGGGAA_TGTGGT        1991         19           873
3              X07177   caggat gtggttt         GTGGTT_CAGGAT        2112         1666         1992
3              X59486   caggat tgtggttt        TGTGGT_CAGGAT        2447         2006         2301
3              S68887   tgtggt ggggaa          TGTGGT_GGGGAA        1221         39           640


Figure 23 : Visualized result of Ets-AML of Markov order of 0. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 0. Three rows present the results of S1, S2, and S3 in order.


Figure 24 : Visualized result of Ets-AML of Markov order of 1. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 1. Three rows present the results of S1, S2, and S3 in order.


Figure 25 : Visualized result of Ets-AML of Markov order of 2. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 2. Three rows present the results of S1, S2, and S3 in order.


Figure 26 : Visualized result of Ets-AML of Markov order of 3. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 3. Three rows present the results of S1, S2, and S3 in order.


6.2.7 IRF-NFkappaB

The IRF-NFkappaB dataset has 6 modules within 6 sequences. Detailed information on the sequences, TFBSs, and modules is summarized in APPENDIX 1(1.g). The summarized results for predicted binding sites and modules are reported in Table 17. The highest matched percentage of correctly predicted binding sites is obtained with Markov order of 1, which also covers all known modules, with 34 correctly predicted modules. In addition to Markov order of 1, Markov order of 0 also covers all known modules, with 85 correct predictions. Table 18 shows the detailed result of each score, and Figure 27 to Figure 30 show the visualized results for Markov orders 0 to 3.

For Markov order of 0, the shape of the cumulative curve shows that S1 is slightly better than the others, while S2 performs best, with a clear advantage, for the remaining Markov orders.

Table 17

Result summarization for the IRF-NFkappaB dataset. Display of the parameters applied for each test and the results from both word counting and module discovery; each column is described in the text.

Test  Length  Markov order  Words  Matched words  Matched %  Sites covered  Modules  Correct predictions  Modules covered
1     6       0             268    32             11.94%     12/12          34023    85                   6/6
2     6       1             146    20             13.70%     12/12          9107     34                   6/6
3     6       2             116    8              6.90%      9/12           5747     5                    3/6
4     6       3             77     8              10.39%     9/12           2501     6                    4/6
5     6       4             4      -              -          -              -        -                    -


Table 18

Detailed result of the IRF-NFkappaB dataset. Display of the matched known modules, the correctly predicted modules, and the rank assigned by each of the three scoring functions (listed in the order S/E, SlnSE, and the third score). Where the scores rank different correct predictions, each prediction is listed with its rank.

Markov order 0:
  X70675    known: agtttcttttcc, ggaaactccc           predicted: GGAAAC_TTTCTT   ranks: 1209, 2057, 1982
  AB006745  known: ctctttctctttca_cttttct, gggactccc  predicted: GGGACT_CTTTCT   ranks: 503, 444, 443
  V00534    known: gagaagtgaaag_tg, gggaaattcc        predicted: AGAAGT_GGGAAA   ranks: 58, 132, 106
  L09126    known: gggattttcc, tatttcacttt            predicted: ATTTCA_TTTTCC   ranks: 1003, 1960, 1843
  M12483    known: ggggattcccc, cagtttcactt           predicted: GGGGAT_CAGTTT   ranks: 155, 207, 200
  D83956    known: tggggattcccca, agtttcacttct        predicted: TTTCAC_TGGGGA   ranks: 323, 330, 327
Markov order 1:
  X70675    known: agtttcttttcc, ggaaactccc           predicted: TTTTCC_GAAACT   ranks: 5812, 2301, 4154
  AB006745  known: ctctttctctttca_cttttct, gggactccc  predicted: GGACTC_TTTCAC   ranks: 5308, 664, 2868
  V00534    known: gagaagtgaaag_tg, gggaaattcc        per score: GGAAAT_AAGTGA (3058), GAAGTG_GGGAAA (479), GGAAAT_AAGTGA (2425)
  L09126    known: gggattttcc, tatttcacttt            per score: TATTTC_GGGATT (3206), ATTTCA_TTTTCC (1938), TATTTC_GGGATT (2832)
  L09126    known: ggggattcccc, cagtttcactt           per score: TTTCAC_GGGATT (5099), TTCCCC_TTTCAC (998), TTCCCC_TTTCAC (3458)
  D83956    known: tggggattcccca, agtttcacttct        predicted: TTTCAC_TGGGGA   ranks: 5775, 896, 3300
Markov order 2:
  L09126    known: gggattttcc, tatttcacttt            predicted: ATTTCA_GGGATT   ranks: 5496, 5575, 5563
  L09126    known: ggggattcccc, cagtttcactt           predicted: GGATTC_TTCACT   ranks: 4023, 1413, 2862
  D83956    known: tggggattcccca, agtttcacttct        predicted: TTCACT_TGGGGA   ranks: 4675, 1020, 3093
Markov order 3:
  AB006745  known: ctctttctctttca_cttttct, gggactccc  predicted: GGACTC_TTCACT   ranks: 1362, 132, 724
  L09126    known: gggattttcc, tatttcacttt            predicted: ATTTCA_GGGATT   ranks: 2272, 2347, 2318
  L09126    known: ggggattcccc, cagtttcactt           predicted: GGATTC_TTCACT   ranks: 1374, 435, 948
  D83956    known: tggggattcccca, agtttcacttct        predicted: TTCACT_TGGGGA   ranks: 1711, 304, 1030


Figure 27 : Visualized result of IRF-NFkappaB of Markov order of 0. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 0. Three rows present the results of S1, S2, and S3 in order.


Figure 28 : Visualized result of IRF-NFkappaB of Markov order of 1. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 1. Three rows present the results of S1, S2, and S3 in order.


Figure 29 : Visualized result of IRF-NFkappaB of Markov order of 2. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 2. Three rows present the results of S1, S2, and S3 in order.


Figure 30 : Visualized result of IRF-NFkappaB of Markov order of 3. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 3. Three rows present the results of S1, S2, and S3 in order.


6.2.8 NFkappaB-HMGIY

The NFkappaB-HMGIY dataset has 7 modules within 6 sequences. Detailed information about the sequences, TFBSs, and modules is summarized in APPENDIX 1(1.h). The summarized results for predicted binding sites and modules are reported in Table 19. The highest matched percentage of correctly predicted binding sites is achieved with Markov order of 4. However, this is the only test case in which none of the five Markov orders produces any correctly predicted modules.

Table 19

Result summarization for the NFkappaB-HMGIY dataset. Display of the parameters applied for each test and the results from both word counting and module discovery; each column is described in the text.

Test  Length  Markov order  Words  Matched words  Matched %  Sites covered  Modules  Correct predictions  Modules covered
1     6       0             242    13             5.37%      5/13           26851    -                    -
2     6       1             141    13             9.22%      3/13           8155     -                    -
3     6       2             106    8              7.55%      3/13           4579     -                    -
4     6       3             64     6              9.38%      4/13           1527     -                    -
5     6       4             8      1              12.5%      1/13           -        -                    -

6.2.9 PU1-IRF

The PU1-IRF dataset has 5 modules within 5 sequences. Detailed information on the sequences, TFBSs, and modules is summarized in Appendix 1(1.i). The summarized results for predicted binding sites and modules are reported in Table 20. The highest matched percentage of correctly predicted binding sites is achieved with Markov order of 1, which also covers the most known modules, although Markov order of 0 covers the most known binding sites. Table 21 shows the detailed result of each score, and Figure 31 to Figure 33 show the visualized results for Markov orders 0 to 2.

Apart from Markov order of 0, for which the three scores perform identically, the result table and the images for the remaining two orders show that S2 performs best, by a slight margin: the ranks it assigns to the correct predictions are generally, though not always, better than those of the other scores.

Table 20

Result summarization for the PU1-IRF dataset. Display of the parameters applied for each test and the results from both word counting and module discovery; each column is described in the text.

Test  Length  Markov order  Words  Matched words  Matched %  Sites covered  Modules  Correct predictions  Modules covered
1     6       0             209    8              3.83%      4/9            21736    7                    2/5
2     6       1             122    6              4.92%      3/9            6785     5                    1/5
3     6       2             115    4              3.48%      2/9            6030     3                    1/5
4     6       3             78     3              3.85%      1/9            -        -                    -
5     6       4             11     -              -          -              -        -                    -

Table 21

Detailed result of the PU1-IRF dataset. Display of the matched known modules, the correctly predicted modules, and the rank assigned by each of the three scoring functions (listed in the order S/E, SlnSE, and the third score).

Markov order 0:
  U26540   known: ggtttc, ttcc           predicted: GGTTTC_TTCCTC   ranks: 13178, 13178, 13178
  M66390   known: gttttcatttc, ttcctc    predicted: TTTTCA_TTCCTC   ranks: 20756, 20756, 20756
Markov order 1:
  U26540   known: ggtttc, ttcc           predicted: TTCCAC_GGTTTC   ranks: 2988, 2712, 3363
Markov order 2:
  U26540   known: ggtttc, ttcc           predicted: ATTTCC_GGTTTC   ranks: 4060, 3198, 3393


Figure 31 : Visualized result of PU1-IRF of Markov order of 0. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 0. Three rows present the results of S1, S2, and S3 in order.


Figure 32 : Visualized result of PU1-IRF of Markov order of 1. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 1. Three rows present the results of S1, S2, and S3 in order.


Figure 33 : Visualized result of PU1-IRF of Markov order of 2. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 2. Three rows present the results of S1, S2, and S3 in order.


6.2.10 Sp1-Ets

The Sp1-Ets dataset has 7 sequences containing 8 modules; detailed information about the embedded modules and TFBSs is summarized in Appendix 1(1.j). The summarized results for predicted binding sites and modules are reported in Table 22. For Markov orders of 3 and 4, there are no correctly predicted modules, owing to the unsuccessful prediction of binding sites. Among the remaining three Markov orders, Markov order of 0 achieves the highest matched percentage for both predicted binding sites and modules, by a significant margin. Table 23 shows the detailed results of each score.

Figure 34 to Figure 36 present the results for Markov orders of 0, 1, and 2. The score S2 performs best, although its advantage is slight at Markov order of 0. S1 performs the worst of the three scores, especially at the higher Markov orders.

Take the module "CCTCCT_TTCCTC" as an example: S2 ranked it 29th out of 7301 modules, S3 ranked it 920th, and S1 ranked it 2761st, far inferior to the other two.


Table 22

Result summarization for the Sp1-Ets dataset. Display of the parameters applied for each test and the results from both word counting and module discovery; each column is described in the text.

Test  Length  Markov order  Words  Matched words  Matched %  Sites covered  Modules  Correct predictions  Modules covered
1     6       0             283    29             10.25%     12/15          36046    21                   5/8
2     6       1             214    13             6.07%      7/15           20174    4                    1/8
3     6       2             183    8              4.37%      5/15           13858    3                    1/8
4     6       3             121    1              0.83%      1/15           -        -                    -
5     6       4             20     1              5%         1/15           -        -                    -

Table 23

Detailed result of the Sp1-Ets dataset. Display of the matched known modules, the correctly predicted modules, and the rank assigned by each of the three scoring functions (listed in the order S/E, SlnSE, and the third score).

Markov order 0:
  U13399   known: aaagggaactga, agggtgggg     predicted: AACTGA_AGGGTG   ranks: 18420, 18519, 18519
  S71507   known: acaggaat, ctcgccc           predicted: AGGAAT_TCGCCC   ranks: 17640, 17846, 17846
  D87541   known: acttcctc, ggctcctcctcc      predicted: CCTCCT_TTCCTC   ranks: 3757, 3818, 3786
  U13399   known: gaagggcgggga, aaagggaactga  predicted: GCGGGG_AGGGAA   ranks: 25673, 23431, 23431
  M84757   known: ttcctt, gaggcagggc          predicted: TTCCTT_AGGCAG   ranks: 31229, 31229, 31229
Markov order 1:
  D87541   known: acttcctc, ggctcctcctcc      predicted: CCTCCT_TTCCTC   ranks: 12378, 5858, 8896
Markov order 2:
  D87541   known: acttcctc, ggctcctcctcc      predicted: CCTCCT_TTCCTC   ranks: 11283, 5688, 8847


Figure 34 : Visualized result of Sp1-Ets of Markov order of 0. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 0. Three rows present the results of S1, S2, and S3 in order.


Figure 35 : Visualized result of Sp1-Ets of Markov order of 1. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 1. Three rows present the results of S1, S2, and S3 in order.


Figure 36 : Visualized result of Sp1-Ets of Markov order of 2. The figure’s left side illustrates the TP’s rank distribution, while the right side shows the cumulative rank result for each score with Markov order of 2. Three rows present the results of S1, S2, and S3 in order.


6.2.11 Discussion

Analyzing the above 10 module discovery tests shows that S2 is the best scoring function in most test cases, although its advantage is not obvious in a few of them, such as the Sp1-Ets dataset. For a few test cases at Markov order of 0, including the IRF-NFkappaB, Ebox-Ets, and Ets-AML datasets, S1 performs slightly better than the other two. S3 usually takes second place, regardless of which Markov order is used and regardless of whether first place goes to S1 or S2.

In addition to comparing the three scoring functions for module discovery, the results also raise a concern about the selection of Markov order. Starting from word counting, a lower Markov order admits more qualified words, which enlarges the input word list for module discovery. The input size certainly affects the result, and one would generally expect a larger word list to lead to a better result, unless the quality of the selected words degrades as the list grows. The result tables of the test cases, however, are not consistent with this expectation.

Specifically, take the AP1-NFAT test (Table 5) as an example: Markov order of 0 produces 56341 modules, while Markov order of 1 produces 16562. Markov order of 0 would therefore be expected to yield more correct predictions; however, it reached only 5, while Markov order of 1 reached 7. Furthermore, Markov orders of 1 and 2 achieved the same coverage of known modules, even though Markov order of 2 reported far fewer modules than Markov order of 1. Similar situations occurred for the AP1-Ets, AP1-NFkappaB, CEBP-NFkappaB, Ets-AML, and IRF-NFkappaB datasets. The most plausible explanation for this inconsistency is that, although lower Markov orders admit more qualified words, the quality of those words is poorer than that of the words selected by higher Markov orders.

6.3 Evaluation of Prediction Performance

Module discovery is an open and difficult problem. Given the limited knowledge in this realm, it is arduous to evaluate performance comprehensively. In this thesis, a widely used and accepted evaluation system for assessing the performance of prediction tools was applied to the two proposed approaches; although the system remains controversial, it is the most comprehensive one known to us. The TRANSCompel benchmark dataset does not represent a general situation, whereas the muscle and liver datasets reflect a more generic one. Furthermore, muscle and liver exhibit different module patterns; for example, as stated above, the modules in the liver dataset are more condensed than those in the muscle dataset. Both datasets were therefore chosen for this evaluation, to make the test results more comprehensive. The next two sections introduce the statistical values and formulas applied in this evaluation.

6.3.1 Statistical Values

Four statistical terms need to be defined for comparing the known data with any predicted result: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). These four terms are assumed to cover all combinations of predicted and known situations, although this assumption is itself the main source of the controversy. Figure 37 describes the relationship of these four quantities graphically. There are two levels at which they can be calculated: the nucleotide level and the motif level. For module discovery, the nucleotide level is meaningless, because binding sites, not nucleotides, are the basic components of a module; thus, only motif-level statistics are applied in the assessment of module discovery tools. The motif-level definitions of the four terms are given by Tompa et al. [46]:

TP: Both predicted and experimentally verified modules

FP: Predicted modules, but not experimentally verified

TN: Neither predicted nor experimentally verified modules

FN: Experimentally verified modules, but not predicted
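At the motif level, these four definitions amount to set operations over a universe of candidate modules. The following sketch is illustrative only (it is not part of the thesis software, and the set-based data layout is an assumption):

```python
def confusion_counts(predicted, known, universe):
    """Motif-level confusion counts. Each element of `universe` is a
    candidate module; `predicted` and `known` are subsets of it."""
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)                      # predicted and verified
    fp = len(predicted - known)                      # predicted, not verified
    tn = len(set(universe) - predicted - known)      # neither
    fn = len(known - predicted)                      # verified, not predicted
    return tp, fp, tn, fn

# Toy example with single-letter module labels:
tp, fp, tn, fn = confusion_counts({"a", "b", "c"}, {"b", "c", "d"},
                                  {"a", "b", "c", "d", "e"})
```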

The given known binding sites are presented in the form of PWMs, so before module discovery can be performed, the PWMs have to be converted into binding sites. PWMs not only encode a regular expression for each motif but also carry frequency-weighted information. However, the benchmark dataset provides no sufficient guidance on how to use this information, so all possible binding sites were enumerated from the regular expressions and used as the input word list.


Figure 37 : The relationship between reality and prediction. This 2 × 2 table reflects the relationship between prediction and reality; it is adapted from [47].

6.3.2 Statistical Formulas

To measure the accuracy of predictions, the four quantities introduced above, TP, TN, FP, and FN, are adopted as the building blocks of the statistical measures. In Figure 37, the upper-left and lower-right corners are the two cells where prediction agrees with reality, from which Sensitivity (Sn) and Positive Predictive Value (PPV) are easily derived, as shown in Figure 38. Sn is the proportion of known modules that are correctly predicted, while PPV is the proportion of predicted modules that are correct. Specificity (Sp) is the proportion of known non-modules that are correctly predicted. Average Site Performance (ASP) is the average of PPV and Sn, summarizing the performance of correct predictions against all predictions. The Performance Coefficient (PC) is the ratio of correctly predicted modules to the union of predicted and known modules. The last accuracy measure, traditionally the main one, is the Correlation Coefficient (CC), which is derived from the Pearson product-moment correlation coefficient [48]. In statistics, this formula is widely used for measuring the correlation between two binary variables; its value ranges from -1 to +1, inclusive, and two variables are usually considered tightly correlated when the value lies between 0.5 and 1.0 or between -1.0 and -0.5.

Sn = TP / (TP + FN)            PPV = TP / (TP + FP)

Sp = TN / (TN + FP)            PC = TP / (TP + FN + FP)

ASP = (Sn + PPV) / 2

CC = (TP × TN − FN × FP) / √((TP + FN)(TN + FP)(TP + FP)(TN + FN))

Figure 38 : Measuring formulas. The formulas employed for measuring the accuracy of prediction; each formula is explained in detail in the text.
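The measures in Figure 38 follow directly from the four counts. This is a minimal sketch for illustration, not part of the evaluation system itself:

```python
import math

def accuracy_measures(tp, tn, fp, fn):
    """Compute the accuracy measures of Figure 38 from the
    motif-level counts TP, TN, FP, FN."""
    sn = tp / (tp + fn)                      # Sensitivity
    ppv = tp / (tp + fp)                     # Positive Predictive Value
    sp = tn / (tn + fp)                      # Specificity
    pc = tp / (tp + fn + fp)                 # Performance Coefficient
    asp = (sn + ppv) / 2                     # Average Site Performance
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom         # Correlation Coefficient
    return {"Sn": sn, "PPV": ppv, "Sp": sp, "PC": pc, "ASP": asp, "CC": cc}

# Example with arbitrary counts:
m = accuracy_measures(tp=30, tn=50, fp=10, fn=10)
```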

6.3.3 Results and Discussion

The prediction performance of the two methods presented in this thesis is compared with eight popular published module discovery tools that were assessed by Klepper et al. [9], as well as with Compo [23], developed by Drablos et al. To ensure a fair comparison, the same evaluation system was applied, and the input data and known modules are the same as in Klepper et al.'s tests. As required, there are two input files: a set of sequences and a word list. In this case, a list of PWMs was provided instead of a binding site list, so before running the module discovery algorithms, the PWMs were first converted into exact words. This transformation was completed in three steps: first, the PWMs were converted into regular expressions; second, all words represented by the regular expressions were exhaustively enumerated; finally, the sequence files were searched and words that did not occur were discarded. After this preparation, 2041 binding sites were obtained from the four PWMs provided for the liver dataset, and 50 sites were obtained from the five PWMs provided for the muscle dataset. The following two sections report the evaluated results of the enumerative method and the HAC method, respectively.

6.3.3.1 Enumerative Module Discovery Prediction Performance

First, as stated above, for both the muscle and liver benchmark datasets, WordSeeker's basic function, word counting, was applied to extract the words obtained from the PWMs, since the probability of each word needed to be computed by the built-in HMM. The word lengths of the muscle dataset range from 11 to 13, while the liver dataset's are 11, 14, and 15; word counting was therefore run three times for each dataset, once per word length, with the Markov order set to the word length minus 2. The enumerative module discovery method was then executed on the word counting results, with dimensions corresponding to the dimension range of each dataset. Specifically, the module dimensions of the muscle dataset are 2, 3, 4, 5, 6, and 8, while the liver dataset has dimensions of 2, 3, 4, 5, 6, and 9. For each dataset, the results generated for the different dimensions were gathered together, and the top n modules were selected by the rank assigned by the SlnSE score, which performed best in the scoring function evaluation.
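Pooling the modules reported for every dimension and keeping the top n by SlnSE can be sketched as below. The (module, score) tuple layout and the example scores are assumptions for illustration, not the actual WordSeeker output format:

```python
def top_n_modules(results_by_dimension, n):
    """Merge the module lists produced for each dimension and return
    the n highest-scoring modules by SlnSE (higher is better)."""
    merged = [m for modules in results_by_dimension.values() for m in modules]
    merged.sort(key=lambda rec: rec[1], reverse=True)  # rec = (module, slnse)
    return merged[:n]

# Hypothetical modules and scores, for illustration only:
results = {2: [("AACTGA_AGGGTG", 4.1), ("TTCCTT_AGGCAG", 2.7)],
           3: [("CCTCCT_TTCCTC_GGCTCC", 5.3)]}
best = top_n_modules(results, 2)
```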

For the muscle benchmark dataset, the 50 putative binding sites were applied with dimensions of 2, 3, 4, 5, 6, and 8; the top 25 and top 50 modules were then selected. The seven statistical values were calculated for each result set, shown in Table 24, and compared with the other eight tools; Figure 39 shows the result. From the result table (Table 24), the performance of the enumerative method is competitive, although it is not the best among all tools: it ranks among the top 4 for all scores except Sp and CC.

Table 24

Result of the Enumerative Method for the Muscle Dataset. The evaluation result for the prediction performance of the enumerative module discovery method on the muscle dataset. The number highlighted in purple is the highest score achieved by the other tools, and the number in yellow is the highest score achieved by the enumerative method.

Muscle           ASP    Sp     PC     Sn     PPV    CC
Cluster-Buster   0.638  0.944  0.466  0.673  0.602  0.588
Cister           0.448  0.802  0.24   0.614  0.282  0.306
MCAST            0.615  0.846  0.273  0.826  0.404  0.504
MSCAN            0.591  0.99   0.332  0.357  0.824  0.51
Stubb            0.548  0.913  0.368  0.621  0.475  0.476
ModuleSearcher   0.493  0.975  0.291  0.348  0.638  0.425
CisModule        0.142  0.819  0.072  0.175  0.109  -0.005
CMA              0.461  0.867  0.278  0.571  0.352  0.358
Compo            0.608  0.432  0.398  0.761  0.455  0.196
top50            0.616  0.552  0.429  0.696  0.536  0.232
top25            0.619  0.69   0.387  0.667  0.571  0.237


The same procedure was carried out for the liver benchmark dataset; however, due to the high number of putative binding sites (2041) and limited memory, it could not be run through all dimensions, so dimensions from 2 to 4 were selected. The top 100 modules were chosen, and the comparison with the other eight tools is shown in Table 25. The performance is clearly not satisfactory; examining the formulas in depth suggests that a high number of false positives is the likely cause. While analyzing the four input PWMs, it was found that the regular expression for the CEBP motif is NNNT[GT]NNNNNNNNN if no threshold is used when reading the PWMs. This reading is obviously too redundant, and, due to the lack of related knowledge, there is no clear way to select a threshold. To avoid over-fitting, the CEBP motif was discarded from the input list, and the whole process was repeated for the remaining PWMs. This left 103 putative binding sites, and the top 100 modules were again selected for performance evaluation. The assessed result is shown in Table 25, and Figure 40 illustrates it. Clearly, apart from the Sp and Sn scores, all other measures improved greatly; this variation was caused by the decline of TN that accompanied the decrease of FP.


Table 25

Result of the Enumerative Method for the Liver Dataset. The evaluation result for the prediction performance of the enumerative module discovery method on the liver dataset. The number highlighted in purple is the highest score achieved by the other tools, and the number in yellow is the highest score achieved by the enumerative method.

Liver             ASP    Sp     PC     Sn     PPV    CC
Cluster-Buster    0.638  0.611  0.459  0.567  0.708  0.172
Cister            0.689  0.444  0.525  0.7    0.677  0.146
MCAST             0      0      0      0      0      0
MSCAN             0.525  0.833  0.273  0.3    0.75   0.149
Stubb             0.61   0.5    0.436  0.567  0.654  0.065
ModuleSearcher    0.67   0.722  0.486  0.567  0.773  0.281
CisModule         0      0      0      0      0      0
CMA               0.743  0.389  0.585  0.8    0.686  0.206
Compo             0.533  0.667  0.333  0.4    0.667  0.067
Top 100           0.509  0.979  0.021  0.95   0.068  0.012
Top 100 modified  0.538  0.467  0.133  0.935  0.14   0.121

6.3.3.2 HAC Module Discovery Prediction Performance

Since the HAC module discovery method needs training data for setting its parameters, both the liver and muscle datasets were divided into several groups; one group was used for testing and the rest were assigned as training data. The whole process was repeated several times, depending on the number of groups, and the results of the individual test cases were added together as the overall result for the HAC algorithm. Take the muscle dataset as an example: the 24 sequences were broken into 4 groups of 6 sequences each, with the grouping performed randomly, without any preference or criterion. The evaluation was then repeated four times, so that every group was treated as the test case exactly once. For each round, the statistical values TP, TN, FP, and FN were counted, and all seven formulas were computed. Test results for the muscle and liver benchmark datasets, alongside the performance of the other nine tools, are shown in Table 26 and Table 27, respectively.
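The grouping procedure described above amounts to k-fold cross-validation with pooled confusion counts. A minimal sketch under assumed data layouts; `run_test` is a placeholder for the HAC training-and-testing step, not the thesis implementation:

```python
import random

def make_folds(sequences, k, seed=0):
    """Randomly split the sequences into k equally sized groups,
    with no preference or criterion."""
    rng = random.Random(seed)
    shuffled = sequences[:]
    rng.shuffle(shuffled)
    size = len(shuffled) // k
    return [shuffled[i * size:(i + 1) * size] for i in range(k)]

def cross_validate(sequences, k, run_test):
    """run_test(train, test) must return a (TP, TN, FP, FN) tuple;
    the counts of all k rounds are pooled into one overall result."""
    folds = make_folds(sequences, k)
    totals = [0, 0, 0, 0]
    for i, test in enumerate(folds):
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        for idx, v in enumerate(run_test(train, test)):
            totals[idx] += v
    return tuple(totals)

# e.g. 24 muscle sequences -> 4 groups of 6:
folds = make_folds([f"seq{i}" for i in range(24)], k=4)
```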

According to the result tables, the HAC algorithm performs competitively with the other methods, and is even the best under two evaluation criteria, ASP and PC; Figure 39 shows this comparison. Although HAC did not achieve the best score in every measure on the muscle dataset, it placed in the top 3 for all scores except Sp. On the liver benchmark dataset, the HAC algorithm achieved the highest Sp score, although its overall performance was not as good as on the muscle dataset.

Table 26

Result of the HAC Method for the Muscle Dataset. The evaluation result for the prediction performance of the HAC algorithm on the muscle dataset. The number highlighted in purple is the highest score achieved by the comparison group, and the number in yellow is the highest score achieved by the HAC algorithm.

Muscle           ASP    Sp     PC     Sn     PPV    CC
Cluster-Buster   0.638  0.944  0.466  0.673  0.602  0.588
Cister           0.448  0.802  0.24   0.614  0.282  0.306
MCAST            0.615  0.846  0.273  0.826  0.404  0.504
MSCAN            0.591  0.99   0.332  0.357  0.824  0.51
Stubb            0.548  0.913  0.368  0.621  0.475  0.476
ModuleSearcher   0.493  0.975  0.291  0.348  0.638  0.425
CisModule        0.142  0.819  0.072  0.175  0.109  -0.005
CMA              0.461  0.867  0.278  0.571  0.352  0.358
Compo            0.608  0.432  0.398  0.761  0.455  0.196
HAC test 1       0.667  0.667  0.5    0.667  0.667  0.333
HAC test 2       0.857  0.75   0.75   0.857  0.857  0.607
HAC test 3       0.375  0.938  0.2    0.25   0.5    0.25
HAC test 4       0.889  1      0.778  0.778  1      0.789
HAC              0.745  0.865  0.588  0.69   0.8    0.567


Table 27

Result of the HAC Method for the Liver Dataset. The evaluation result for the prediction performance of the HAC algorithm on the liver dataset, compared with the other nine tools. The number highlighted in purple is the highest score achieved by the comparison group, and the number in yellow is the highest score achieved by the HAC algorithm.

Liver            ASP    Sp      PC      Sn     PPV    CC
Cluster-Buster   0.638  0.611   0.459   0.567  0.708  0.172
Cister           0.689  0.444   0.525   0.7    0.677  0.146
MCAST            0      0       0       0      0      0
MSCAN            0.525  0.833   0.273   0.3    0.75   0.149
Stubb            0.61   0.5     0.436   0.567  0.654  0.065
ModuleSearcher   0.67   0.722   0.486   0.567  0.773  0.281
CisModule        0      0       0       0      0      0
CMA              0.743  0.389   0.585   0.8    0.686  0.206
Compo            0.533  0.667   0.333   0.4    0.667  0.067
HAC test 1       0.221  0.9626  0.0952  0.119  0.323  0.129
HAC test 2       0.246  0.802   0.121   0.333  0.160  0.100
HAC test 3       0.293  0.961   0.148   0.191  0.394  0.211
HAC              0.214  0.899   0.12    0.213  0.216  0.113

6.3.3.3 Discussion

To display the results more intuitively, Figure 39 and Figure 40 show the comparison for the muscle and liver benchmark datasets, respectively. Both methods are clearly competitive, although the performance on the liver benchmark dataset is not as satisfactory as on the muscle dataset. For the muscle benchmark dataset, the HAC algorithm placed in the top 3 for every score except Sp, and was the best in the ASP and PC scores. The Sp score is determined by TN and FP; a high number of FP is likely the reason for the poor showing there. The enumerative algorithm was not as good as the HAC algorithm, but it also placed in the top 5 for most scores, except Sp and CC. For the liver benchmark dataset, the two methods behaved very differently. Before filtering the PWMs, the enumerative method achieved the highest scores in Sp and Sn. After the adjustment of the PWMs, the overall rank of the enumerative method improved for every measure except the Sp and Sn scores, which had reached their highest values before the modification. That deterioration was due to the reduction in TN, which decreased along with the reduction in FP. The HAC algorithm's performance was slightly below average, except that it reached the second-best Sp score.


Figure 39 : Evaluation result of the muscle benchmark dataset. In the above figure, different methods are displayed in different colors. Besides the enumerative and HAC algorithms proposed in this thesis, the other nine tools’ performances are illustrated as well.


On the other hand, this result also provides evidence that the performance of methods that rely heavily on prior knowledge of TFBSs/TFs depends on the quality of the input binding sites, as stated in Chapter 2. Specifically, for the liver dataset, 2041 binding sites were extracted from the PWMs without using any threshold; this set of binding sites contained too much false positive information, which caused the poor performance under the evaluation system. After a slight modification to the treatment of the PWMs, the performance improved greatly in most scores, except Sp and Sn. This significant improvement illustrates how unreliable it is to use prior knowledge of TFs for addressing the general problem.


Figure 40 : Evaluation result of the liver benchmark dataset. In the above figure, different methods are displayed in different colors. Besides the enumerative and HAC algorithms proposed in this thesis, the other nine tools' performances are illustrated as well.

Another point worth discussing is the debate over this evaluation system. According to the introduction of the system, FP is devised to penalize incorrect predictions; however, a predicted module that does not fall within the range of known modules is not necessarily wrong. Thus, it is not proper to penalize an unknown module that is predicted as a putative one; indeed, detecting modules that are not yet known is the very goal of this field. Besides the treatment of FP, the methods for counting TN and FN are also controversial.

CHAPTER 7: CASE STUDY

7.1 Plant Gravitropic Signal Transduction

This case study was done in cooperation with Dr. Wyatt's lab, and the result has been published [49]. Kaiyu Shen and Dr. Wyatt completed the biological experiments; Jens Lichtenberg and I did the computational discovery section.

7.1.1 Introduction

Gravity plays an essential role in the world of plants, influencing almost everything from the growing direction of a seed to the positioning of plant organs. The term gravitropism describes the movement that plants make in response to the earth's gravity. The movement can be divided into three steps, in order: perception, signal transduction, and growth response [50].

The aim of this case study was to discover putative regulatory elements and modules that are involved in gravitropic signal transduction. Several factors have been proposed as related to this response, such as cytoplasmic pH [51], cytoskeletal rearrangements [52], inositol 1,4,5-triphosphate [53], and reactive oxygen species [54, 55]. Most experiments have focused on the physical movements involved in the reaction to gravity and have ignored genome-wide analysis. In this case study, the attention focused on genome-scale analysis, aiming to identify regulatory elements and modules under transcriptional control.


The raw data were obtained from Kimbrough et al. [56]. Briefly, 7-day-old Arabidopsis thaliana seedlings were collected; the gravity stimulus group was produced by rotating them 135°, while the control group was produced by oscillating them gently for 5 s. For each group, RNA was extracted at six time points: 0, 2, 5, 15, 30, and 60 minutes after treatment.

7.1.2 Methods and Results

7.1.2.1 Microarray Analysis

The R package collection Bioconductor [58] was employed for analyzing the raw microarray data. The details of how the raw data were processed are documented in [49], so this section only reports the analysis results. First, quality control was applied to the raw data; after filtering out poor-quality data, a list of 154 genes remained and was considered as candidate genes for further analysis. The identified genes were then categorized into clusters according to their expression levels, yielding ten gene clusters. Finally, after discarding the empty cluster, nine clusters were obtained and used for the whole pipeline.

7.1.2.2 GO Analysis

GOstat [59] analysis was applied to the nine clusters in order to investigate the functional similarity among the clustered genes. Additionally, the complete gene list was assessed as well; the P-Value threshold for matched results was adjusted from 0.1 to 0.2, since no output was generated when 0.1 was chosen. Due to limited space, the GO analysis results for the nine clusters are reported in APPENDIX 5, and the result for the whole gene list is reported in APPENDIX 6.

7.1.2.3 Statistically Over-represented Words

With the help of the AGRIS database [11], the promoter regions of the genes in each cluster were retrieved. WordSeeker, the genomic analysis suite introduced in Chapter 5, was applied to all promoter sequences in order to enumerate all words with a specific length of 6 nucleotides. A Markov order of 4 was selected for computing the expected number of occurrences of every enumerated word; the reason for using the order-4 Markov model is explained in [57]. Equations (16) and (17) give the formulas used for computing the expected number of occurrences and the expected number of sequence hits for a word w, respectively, where p is the probability of word w, l_i is the length of sequence i, and v is the length of word w, which is fixed at 6 nucleotides in this case.

(16) E_o(w) = Σ_i (l_i − v + 1) · p

(17) E_s(w) = Σ_i [1 − (1 − p)^(l_i − v + 1)]
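These expected-value formulas can be sketched in Python. This is a minimal illustration under the stated definitions, not WordSeeker's implementation; the function names are mine, and p is assumed to come from the background Markov model:

```python
def expected_occurrences(p, seq_lens, v):
    # E_o(w): expected number of occurrences of a word of length v with
    # background probability p, summed over the (l_i - v + 1) start
    # positions of every sequence.
    return sum((l - v + 1) * p for l in seq_lens)

def expected_sequence_hits(p, seq_lens, v):
    # E_s(w): expected number of sequences containing the word at least
    # once, treating the start positions in each sequence as independent.
    return sum(1 - (1 - p) ** (l - v + 1) for l in seq_lens)
```

For example, a 2-letter word with p = 0.5 in a single sequence of length 5 has 4 candidate positions, giving E_o = 2.0 and E_s = 1 − 0.5^4 = 0.9375.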

Based on the expected values, the SlnSE score, S · ln(S/E_s), and the P-Value, shown in equation (18), are calculated:

(18) P-Value = 1 − Σ_{k=0}^{S−1} C(|P|, k) · p^k · (1 − p)^{|P|−k}

where |P| is the number of sequences.

The SlnSE score was used for sorting the enumerated words. Among the top 5 words of each cluster, a word was selected and considered significant if it either ranked in the top 2 or had a P-Value smaller than 0.05. Table 28 shows all selected significant words for each cluster.
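The selection rule above can be sketched as follows. This is a hypothetical helper (names are mine), assuming each word arrives as a (word, SlnSE, P-Value) tuple:

```python
import math

def slnse(s, e_s):
    # SlnSE score: S * ln(S / E_s), with S the observed number of sequence
    # hits and E_s the expected number; defined as 0 when the word is absent.
    return s * math.log(s / e_s) if s > 0 else 0.0

def significant_words(words, top=5, keep_top=2, alpha=0.05):
    # Rank by SlnSE, look at the top `top` words of the cluster, and keep
    # those that are in the top `keep_top` or have P-Value below alpha.
    ranked = sorted(words, key=lambda w: w[1], reverse=True)[:top]
    return [w for i, w in enumerate(ranked) if i < keep_top or w[2] < alpha]
```

Applied to the Cluster 1 values of Table 28, all four words survive: the first two by rank, the last two by P-Value.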

Table 28.

Significant Words. Among the top 5 over-represented words of each cluster based on SlnSE score, words that are either in the top 2 or have a P-Value smaller than 0.05 were selected and considered as significant words.

Word     SlnSE      P-Value
Cluster 1
TCCCAT   6.64755    0.060484
TGATAC   6.00179    0.06048
CGAACC   5.078417   0.029237
TAAGCC   5.07163    0.048075
Cluster 2
TAACTC   7.69913    0.020566
CCAACC   7.46182    0.024886
GGCTTA   6.59961    0.027108
TCTAAG   6.49013    0.027051
Cluster 3
GCTCTA   7.40976    0.020465
AGATAG   6.22494    0.162696
ACCTCT   5.97253    0.038065
Cluster 4
AGATCA   8.01895    0.295071
TCTAAC   7.60798    0.068925
GTATCC   7.3534     0.04836
Cluster 5
CTCATG   8.01649    0.004801
GTATCT   7.26391    0.097377
AGAATC   6.93299    0.032026
GGATAC   6.34436    0.022551
Cluster 6
CTCTCC   7.018      0.083491
CAATAC   6.49897    0.09034
GCATCG   6.44926    0.025044
TGTAAC   5.95994    0.031052
Cluster 7
CTTTCG   6.59517    0.038953
ATCTGA   6.40078    0.019043
Cluster 8
TCATTC   6.69103    0.073941
CTTAAC   6.36019    0.074444
GTGAAT   6.36019    0.022049
Cluster 9
GAGTAT   6.31452    0.014108
GGAAGC   5.82544    0.021967
CATCTT   5.61811    0.017162
CCTTTC   5.61743    0.026461
ACCTTC   5.61312    0.026563


7.1.2.4 Word-based Clustering

Besides individual significant words, the pipeline also constructed motifs by selecting top-ranked words as seeds. All enumerated words within a pre-defined Hamming distance of a seed word were identified and clustered together for constructing motif logos; the Hamming distance was set to 1 for this case study. The motif logos were built from position weight matrices with the TFBS Perl module developed by Lenhard and Wasserman [60].
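The Hamming-distance clustering step can be illustrated with a minimal sketch (illustrative only; WordSeeker's actual implementation and the PWM/logo construction via the TFBS module are not reproduced here):

```python
def hamming(a, b):
    # Number of positions at which two equal-length words differ.
    return sum(x != y for x, y in zip(a, b))

def cluster_words(seed, words, max_dist=1):
    # Gather all enumerated words within max_dist mismatches of the seed;
    # the resulting set can then be stacked into a position weight matrix.
    return [w for w in words
            if len(w) == len(seed) and hamming(w, seed) <= max_dist]
```

For example, with seed TAACTC and max_dist=1, TAACTG joins the cluster while a word with four mismatches does not.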

7.1.2.5 Module Discovery

The enumerative module discovery method was applied for discovering CRMs, with a dimension of two selected for enumerating word combinations. Detailed information about this method is given in Chapter 4 and is not repeated in this section. All combinations of word pairs were output together with the expected and observed numbers of sequence hits and multiple scoring functions, as shown in equations (10), (11), (12), and (13).

In this case, the top 25 statistically over-represented words of each cluster were chosen for module discovery. After sorting based on SlnSE score, the top 10 modules from each cluster containing at least one word from the list of significant words (Table 28) were considered as putative modules and are shown in Table 29. The density and distance distributions were also explored for the putative modules. Note that while most modules have low density, with long distances between the words inside a module, the module CCTCAC_GAGTAT has a much higher average density of 39.21%, with an average distance of 179 bp.
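The dimension-2 enumeration can be illustrated with a small sketch that counts only the observed co-occurring sequence hits; the expected values and the scoring functions of equations (10)-(13), as well as gap/density handling, are omitted:

```python
from itertools import combinations

def module_hits(words, sequences):
    # For every unordered pair of words (dimension 2), count the number of
    # sequences in which both words occur. These observed hits are then
    # compared with expected values to score candidate modules.
    hits = {}
    for w1, w2 in combinations(words, 2):
        hits[f"{w1}_{w2}"] = sum(1 for s in sequences if w1 in s and w2 in s)
    return hits
```

With words AA and CC and sequences AACC, AAGG, CCGG, only the first sequence contains both words, so AA_CC scores one hit.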

Table 29.

Significant Modules. Modules presented in this table were selected from the top 10 over-represented modules of each cluster based on SlnSE score. These modules contain at least one significant word. For each module, the SlnSE score, the contained significant word, the average density, and the average distance are shown as well.

Module          SlnSE     Significant Word   Avg. Density   Avg. Distance
Cluster 1
ATTCCG_TGATAC   5.17295   TGATAC             2.22%          403
Cluster 2
GAACTA_CCAACC   11.7777   CCAACC             2.04%          895
TAGAAT_CCAACC   11.2629   CCAACC             5.59%          422
TTACTC_TCTAAG   10.54     TCTAAG             22.72%         549
ATGAAG_CCAACC   10.3734   CCAACC             6.11%          316
TTTGCA_TAACTC   9.88363   TAACTC             7.27%          631
TTTGCA_CCAACC   9.68001   CCAACC             2.62%          753
Cluster 3
ATTAGC_GCTCTA   11.2158   GCTCTA             8.70%          518
CCCTCC_GCTCTA   11.2053   GCTCTA             15.40%         893
GCCATA_GCTCTA   10.5222   GCTCTA             2.56%          552
AGGATA_GCTCTA   9.82138   GCTCTA             7.72%          909
AAACGC_ACCTCT   9.74673   ACCTCT             7.16%          891
TGAGTT_GCTCTA   9.73652   GCTCTA             4.91%          673
Cluster 4
GCTATA_GTATCC   13.4204   GTATCC             6.10%          715
GTAGAA_GTATCC   12.5818   GTATCC             2.18%          1047
GCAATG_GTATCC   12.3738   GTATCC             2.13%          916
CCACAA_TCTAAC   12.058    TCTAAC             10.55%         463
TCTTAT_AGATCA   12.0445   GTATCT; AGATCA     2.58%          739
Cluster 5
GTATCT_CTCATG   14.7114   CTCATG             21.51%         543
GCTTAT_GTATCT   14.08     GTATCT             6.84%          670
GTTCAC_GGATAC   13.7514   GGATAC             2.41%          913
GGATAC_CTCATG   13.2243   CTCATG             6.46%          933
TACAAG_CTCATG   12.4226   CTCATG             3.15%          905
AGATGT_CTCATG   12.2607   CTCATG             2.90%          728
GTTTTC_CTCATG   12.2258   CTCATG             3.96%          704
GTTCAC_CTCATG   11.716    CTCATG             3.13%          727
Cluster 6
CACTCT_CTCTCC   11.7325   CTCTCC             17.47%         807
AGTGAC_CTCTCC   11.37     CTCTCC             3.84%          981
TCATAG_CAATAC   11.3198   CAATAC             1.87%          1045

Table 29 (continued).

Module          SlnSE     Significant Word   Avg. Density   Avg. Distance
Cluster 6 (continued)
TGTAAC_CTCTCC   11.1742   TGTAAC; CTCTCC     1.92%          926
AACGAT_TGTAAC   10.9469   TGTAAC             1.43%          1091
CATTTC_TGTAAC   10.4234   TGTAAC             3.16%          862
Cluster 7
ATCACA_CTTTCG   11.5734   CTTTCG             3.70%          960
TAGCTT_CTTTCG   11.1166   CTTTCG             16.84%         1118
CAACGT_ATCTGA   10.6013   ATCTGA             2.80%          738
CTTAAG_ATCTGA   10.5579   ATCTGA             6.86%          559
ATCACA_ATCTGA   10.4263   ATCTGA             8.24%          545
GTAACC_ATCTGA   10.3261   ATCTGA             1.94%          1060
GTCTAA_ATCTGA   10.2115   ATCTGA             3.06%          788
Cluster 8
GTGAAT_TCATTC   12.0743   GTGAAT; TCATTC     2.93%          752
ATTAAC_CTTAAC   11.3839   CTTAAC             3.14%          780
TTACAC_GTGAAT   10.794    GTGAAT             19.27%         403
ACATTG_CTTAAC   10.5886   CTTAAC             3.78%          507
GTGAAT_CTTAAC   10.4807   GTGAAT; CTTAAC     6.08%          702
Cluster 9
ACCTTC_GAGTAT   12.4362   GAGTAT             1.81%          682
CACCGA_GGAAGC   11.6845   GGAAGC             1.59%          806
CAACTC_CCTTTC   11.1518   CCTTTC             1.85%          828
CCTCAC_GAGTAT   10.6285   GAGTAT             39.21%         179
CATCTT_GAGTAT   10.4563   CATCTT; GAGTAT     2.84%          655
CTGACA_GAGTAT   10.1879   GAGTAT             1.53%          835
CATCTT_GGAAGC   9.97153   CATCTT; GGAAGC     2.60%          621
CTATGT_GGAAGC   9.95253   GGAAGC             0.70%          1758
CCTCAC_ACCTTC   9.93358   ACCTTC             2.24%          977
CGAATC_CCTTTC   9.91723   CCTTTC             2.19%          613

7.1.2.6 Functional Look-up

In this section, predicted cis-regulatory elements were compared to currently known TFBSs obtained from the AGRIS database [11, 12]. Words that belong either to the significant words or to modules were compared to the list of known TFBSs in the AGRIS database, and the matched binding sites are shown in Table 30.


Table 30

AGRIS Look-up Results. Information shown in this table was extracted from the AGRIS database [11, 12]. For each matched significant word, the name of the known binding site, the consensus motif, and the reference literature are displayed.

Cluster 1: no matches.
Word     SlnSE     P-Value    Name                          Consensus Motif                 Reference
Cluster 2
TAACTC   7.69913   0.020566   MYB2 binding site motif       TAACT(G/C)GTT                   [61]
Cluster 3
AGATAG   6.22494   0.162696   GATA promoter motif           [AT]GATA[GA]                    [62]
Cluster 4
TCTAAC   7.60798   0.068925   MRE motif in CHS              TCTAACCTACCA                    [63]
Cluster 5
GTATCT   7.26391   0.097377   EIL1, EIL2, EIL3 BS in ERF1   TTCAAGGGGGCATGTATCTTGAA         [64]
                              EIN3 BS in ERF1               GGATTCAAGGGGGCATGTATCTTGAATCC
Clusters 6, 7, 8: no matches.
Cluster 9
CCTTTC   5.61743   0.026461   CArG2 motif in AP3            CTTACCTTTCATGGATTA              [65]

In addition to the AGRIS-based look-up, two other established public TFBS databases, TRANSFAC [10] and JASPAR [13], were also used for comparing the word and module elements, by means of a modification of an approach developed by Jacox and Elnitski [66]. The sequences of each cluster were annotated with known TRANSFAC binding sites and subsequently analyzed for overlap between the significant words and those annotations. Each significant word's matching level was assessed as the ratio of actual matches to its total number of occurrences, and a threshold of 0.75 was applied to limit the results to significant matches. The result is shown in Table 31.
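The matching-level computation can be sketched as follows. This is a simplification that treats annotation overlap as exact position membership; the actual approach of Jacox and Elnitski, which works on annotated intervals, is more involved:

```python
def match_level(word_occurrences, annotations, threshold=0.75):
    # word_occurrences: list of (seq_id, start) positions of a word.
    # annotations: set of (seq_id, start) positions covered by known TFBSs.
    # A word counts as a significant match when the fraction of its
    # occurrences overlapping annotated binding sites reaches the threshold.
    if not word_occurrences:
        return False, 0.0
    ratio = (sum(1 for occ in word_occurrences if occ in annotations)
             / len(word_occurrences))
    return ratio >= threshold, ratio
```

For instance, a word with four occurrences, three of which fall inside annotated sites, reaches exactly the 0.75 cutoff and is kept.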

Table 31.

TRANSFAC Look-up Results. For the sequences of each cluster, a modification of a published approach was applied, and the matching level was assessed for each significant word. Only words with a matching level greater than or equal to 0.75 are considered significant matches and shown.

Word     S    O    TF_ID          TF_Name      Score_Range     TFBS         Matches
Cluster 1
TGATAC   7    7    MA0035         Gata1        90.93 - 90.93   NGATNN       14/14
TGATAC   7    7    MA0037         Gata3        86.31 - 86.31   HGATWR       14/14
Cluster 2: no matches.
Cluster 3
AGATAG   10   10   MA0035         Gata1        94.35 - 94.35   NGATNN       18/18
AGATAG   10   10   MA0037         Gata3        100.00 - 100.00 HGATWR       18/18
AGATAG   10   10   V$GATA3_01     Gata3        85.26 - 95.56   NNGATWDNN    16/18
AGATAG   10   10   V$GATA6_01     Gata6        85.51 - 92.33   NNHGATWNNN   18/18
AGATAG   10   10   V$GATA_Q6      Gata         91.43 - 98.09   WGATARN      18/18
Cluster 4
AGATCA   21   25   V$HNF4_Q6_02   HNF4         88.87 - 88.87   AGKYCA       48/48
AGATCA   21   25   V$HNF4_Q6_03   HNF4         90.67 - 90.67   NGDBCA       48/48
GTATCC   12   13   MA0035         Gata1        93.07 - 93.07   NGATNN       19/19
Cluster 5
CTCATG   14   20   MA0089         TCF11-MafG   87.05 - 87.05   NATGAC       29/29
GTATCT   14   16   MA0035         Gata1        92.07 - 92.07   NGATNN       26/26
GTATCT   14   16   MA0037         Gata3        87.55 - 87.55   HGATWR       26/26
AGAATC   17   26   V$STAT5A_04    STAT5A       85.03 - 88.96   NNNTTCYN     32/38
GGATAC   8    9    MA0035         Gata1        93.07 - 93.07   NGATNN       11/11
Cluster 6
GCATCG   6    6    MA0035         Gata1        97.24 - 97.24   NGATNN       6/6
Cluster 7
CTTTCG   9    10   MA0080         SPI1         85.47 - 85.47   VGGAAS       21/21
ATCTGA   11   15   V$CAP_01       CAP          85.57 - 94.08   NCABHNNN     25/25
Cluster 8
TCATTC   13   17   V$CAP_01       CAP          89.34 - 97.86   NCABHNNN     31/31
TCATTC   13   17   V$GEN_INI2_B   GEN INI2     85.83 - 100.00  BBNCANTB     24/31

Table 31 (continued).

Word     S    O    TF_ID          TF_Name    Score_Range      TFBS       Matches
Cluster 8 (continued)
TCATTC   13   17   V$GEN_INI_B    GEN INI    86.69 - 100.00   NBNCANTB   24/31
Cluster 9
CATCTT   5    6    V$CAP_01       CAP        86.97 - 93.14    NCABHNNN   10/12
GGAAGC   4    4    V$CETS168_Q6   CETS168    85.92 - 100.00   CMGGAAGY   6/7
GGAAGC   4    4    V$PEA3_Q6      PEA3       89.16 - 91.24    ACWTCCK    6/7
GGAAGC   4    4    V$STAT3_02     STAT3      91.22 - 98.25    NNNTTCCN   7/7

7.1.3 Conclusions

In this case study, starting from microarray analysis and applying genomic analysis software, regulatory genomic signatures were identified from sets of related genes. After running the pipeline and analyzing the results, thirty-two words (Table 28) from the top 10 words of each cluster were selected and considered as putative cis-regulatory elements due to their over-representation. Besides analyzing individual words, significant pairs (each containing at least one over-represented word) were explored. Finally, 55 modules (Table 29) were chosen as candidate CRMs. Interestingly, among the 55 candidate modules, 6 modules had both components over-represented.

7.2 Comparison of Rice and Arabidopsis Thaliana

Word landscape analysis of Arabidopsis thaliana has been completed through the cooperation of Dr. Grotewold's lab and Dr. Welch's lab, and the result has been published [41]. All information and sequences of Arabidopsis' introns, UTRs, and intergenic regions are available at The Arabidopsis Information Resource (TAIR) [67]; rice's UTRs, introns, and intergenic regions are available at the Rice Genome Annotation project [68, 69]. Dr. Grotewold's lab provided both the rice and Arabidopsis genome segments.

7.2.1 Introduction

Arabidopsis thaliana is one of the best-annotated plant genomes, so it usually serves as a model organism. It has one of the smallest genome sequences, consisting of 125 Mbp across five chromosomes [70]. In [41], seven genome segments from non-coding regions, including 5'UTR, 3'UTR, introns, core-promoter, proximal-promoter, and distal-promoter, were applied to the word landscape analysis pipeline. The promoter region is the region upstream of the gene TSS, and it was divided into three groups depending on the distance from the TSS: core-promoter regions are defined as the locations from [+1, -100] relative to the TSS; proximal-promoter regions are located from [-101, -1000]; and distal-promoter regions are located from [-1001, -3000]. For both Arabidopsis and rice, these seven segments were extracted from the genome sequences. The goal of this case study is to identify similar or distinct patterns between the two species by performing the same analysis on the seven segments and comparing the results. However, due to limited time, only one segment, the proximal-promoter region, has been used for the comparison; the rest will be done in the future.

7.2.2 Methods and Results

7.2.2.1 Word Counting

The basic word counting function of the WordSeeker package was applied to the proximal-promoter regions of both Arabidopsis and rice. A word length of 8 and a Markov order of 6 were chosen for enumerating all subsequences (words). At a word length of 8, the number of possible words is 4^8 = 65,536, because there are four possible letters (Adenine (A), Guanine (G), Cytosine (C), and Thymine (T)) at each of the 8 positions. Neither species has missing words; in other words, the proximal-promoter regions of the two species include all possible 8-letter words. For each word, from the observed number of occurrences (o), the number of sequence hits (s), and the expected number of sequence hits (Es), the SlnSE score was computed and used as the basis for word sorting. The top 100 words were selected from both species and compared with each other in order to find common over-represented words; 13 words were found to be shared (Table 32). An additional score, distance, which evaluates how differently the same word scores in the two species based on their SlnSE scores, was provided, and equation (19) illustrates how the distance is computed:

(19) distance = ×
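The word-level comparison described above, intersecting the two species' top-100 lists, can be sketched as follows (a hypothetical helper; function and parameter names are mine):

```python
def common_top_words(rice_scores, arab_scores, top=100):
    # rice_scores / arab_scores: dicts mapping word -> SlnSE score.
    # Returns the words shared by the top-`top` lists of both species.
    top_r = sorted(rice_scores, key=rice_scores.get, reverse=True)[:top]
    top_a = sorted(arab_scores, key=arab_scores.get, reverse=True)[:top]
    return sorted(set(top_r) & set(top_a))
```

Run on the full word counting results with top=100, this intersection is what yields the 13 shared words of Table 32.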


Table 32

Common Words Shared by Arabidopsis and Rice. This table displays the words shared by the top 100 words of both species, sorted by SlnSE score, together with the distance score and other information extracted from the word counting results of the two species.

           Rice                             Arabidopsis thaliana
Word       SlnSE     S      O      Rank     SlnSE     S      O      Rank    Distance
AATATATT   2064.5    4392   4938   1        565.916   3652   4112   4       0.018266
TACAAAAT   425.705   2437   2585   82       369.485   2605   2839   12      0.0943061
TAGAAAGT   469.565   1061   1114   59       174.289   798    816    96      0.0411501
ATTTTTCA   590.843   3395   3613   30       257.476   2343   2494   23      0.0387279
GTATATAA   454.306   1763   2064   64       228.021   1497   1558   41      0.0470064
TGAAAAAT   617.906   3400   3652   27       313.607   2376   2518   14      0.0405354
AAAAATTA   519.163   5456   6169   41       260.84    4773   5593   22      0.043995
TTATATAA   776.074   2540   2721   10       648.487   3107   3405   3       0.0626011
TTATATAG   697.074   1624   1700   19       172.425   1287   1331   99      0.030871
CTATATAA   949.344   2065   2187   2        216.358   1413   1461   45      0.0261178
ATTTCTTA   465.112   1706   1787   60       253.551   1951   2058   27      0.0486147
TAATTTTT   499.137   5431   6087   47       401.145   4682   5364   9       0.0714315
GATATATC   643.667   1052   1116   22       194.819   659    673    62      0.0333761

7.2.2.2 Functional Look-up

Functional look-up was applied to the 13 common words (Table 32) against the public databases of known binding sites, TRANSFAC and AGRIS. One word out of the 13 hit known binding sites in the TRANSFAC database twice (Table 33). Although no words matched any known motifs in the AGRIS database, the known motif obtained from the TRANSFAC look-up result, AGAA[ACGT], had one match in the AGRIS database (Table 34). As AGRIS reports, the matched motif (Table 34) belongs to the Heat Stress Transcription Factor (Hsf) family and is involved in the signaling pathway of the response to heat and chemical stress [71].


Table 33

TRANSFAC Functional Look-up Result for Common Words. The table displays the look-up result of the 13 common words in the TRANSFAC database. Only one word hit known binding sites, although there were two corresponding TRANSFAC IDs.

Word       TRANSFAC ID   Family     Known Motif
TAGAAAGT   M00029        F$HSF_01   AGAA[ACGT]
TAGAAAGT   M00028        I$HSF_01   AGAA[ACGT]

Table 34

AGRIS Functional Look-up Result for the Known Motif AGAA[ACGT]. The table shows the look-up result in the AGRIS database of the known motif extracted from Table 33. The known binding site's name, consensus sequence, and reference are listed.

Name                      Consensus Motif   Reference
HSEs binding site motif   AGAANNTTCT        [71]

7.2.2.3 Motifs Comparison

In addition to the comparison of common words, motifs constructed from PWMs and generated by word clustering were examined for both species. Word clustering scanned all words against seed words and gathered the words similar to each seed; the similarity was determined by the user-defined matrix and distance, which was set to a Hamming distance of 1 in this case study. The top 5 motif logos are displayed in Figure 41. Although visual inspection is not comprehensive, it is obvious that Arabidopsis thaliana is more AT-rich than rice. Besides the no. 1 motif from rice matching the no. 4 motif from Arabidopsis thaliana, the no. 2 motif from rice, CTATATAA, and the no. 4 motif from Arabidopsis, TTATATAA, are reverse complementary, and both resemble the TATA box. For rice itself, the no. 1 motif, AATATATT, and the no. 2, CTATATAA, are reverse complementary, and both are probably TATA boxes; for Arabidopsis itself, two pairs of motifs, no. 1 and no. 2, and no. 3 and no. 4, are reverse complementary with each other. Due to limited space, only the top 5 motifs are shown here. Note that the sequences written in this paragraph are the dominant words rather than the regular expressions of the motifs.


Figure 41 : Motifs Comparison for Rice and Arabidopsis thaliana. For each species, the top 25 words sorted by SlnSE score were selected as seeds, and applied for the word clustering. The top 5 motifs from each species are displayed in the figure: the left side is from rice, and the right side is from Arabidopsis thaliana .


7.2.2.4 Module Discovery

The top 100 words from the word counting results of both Arabidopsis and rice, selected based on SlnSE score, were applied to module discovery with a fixed dimension of 2. The resulting top 100 modules, also sorted by SlnSE score, were then compared with the common words (Table 32) to find how many modules consist of common words. No common word is contained in rice's top 100 modules, while most of the modules in Arabidopsis' top 100 contain at least one common word. The top 25 modules from Arabidopsis are displayed in Table 35 together with the contained common words.

Additionally, the top 100 modules from the two species were compared in order to identify common modules; however, there was no common module among them. The top 250 modules were then selected and the comparison repeated, and two exactly matched modules were identified (Table 36).

Besides this, 9 modules in rice's top 250 overlap with 90 modules from Arabidopsis' top 250 by sharing one word. The distance and density distribution maps were generated for the exactly matched modules and are shown in Tables 37 and 38; however, nothing stood out as a distinct feature between the species.
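The module comparison, both exact matches and one-word overlaps, can be sketched as follows (a hypothetical helper; module names are assumed to be word pairs joined by an underscore, as in Table 36):

```python
def overlapping_modules(mods_a, mods_b):
    # Modules are word pairs like "TAATTTTT_AATATATT". Two modules overlap
    # when they share at least one word; exact matches share both.
    exact = sorted(set(mods_a) & set(mods_b))
    words_b = {w for m in mods_b for w in m.split("_")}
    partial = [m for m in mods_a if m not in exact
               and any(w in words_b for w in m.split("_"))]
    return exact, partial
```

Applied to the two species' top-250 lists, the exact matches correspond to Table 36, and the partial list captures the one-word overlaps.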


Table 35

Top 25 Modules of Arabidopsis thaliana. The top 25 modules, sorted by descending SlnSE score, were selected after applying module discovery to the top 100 words from Arabidopsis thaliana, and are shown in this table together with the matched common words, which are shared by the top 100 words of both rice and Arabidopsis.

Module              Os     Es        S/E       SlnSE     (S-E)^2/E   Matched common words
TAAAATTT_ATTTTTTA   911    471.457   1.93231   600.089   409.789
AAATTTTA_ATTTTTTA   881    462.511   1.90482   567.705   378.657
AAAATTTA_ATTTTTTA   882    468.915   1.88094   557.221   363.902
AAAAATTA_ATTTTTTA   953    534.511   1.78294   551.084   327.651     AAAAATTA
TAATTTTT_ATTTTTTA   920    508.319   1.80989   545.803   333.414     TAATTTTT
AAAATTTA_AAATTTTA   987    578.797   1.70526   526.78    287.889
AAATTTTA_TAAAATTT   984    581.935   1.69091   516.863   277.792
AATATATT_ATTTTTTA   741    370.022   2.00258   514.578   371.935     AATATATT
AAAAATTA_TAATTTTT   1138   725.109   1.56942   512.903   235.108     AAAAATTA; TAATTTTT
TAAATTTT_ATTTTTTA   854    473.318   1.80428   503.999   306.176
AAAATTTA_TAAAATTT   970    589.992   1.64409   482.271   244.759
AAAAATTA_TAAAATTT   1059   672.525   1.57466   480.829   222.092     AAAAATTA
AAAATTTA_AAAAATTA   1053   668.899   1.57423   477.814   220.561     AAAAATTA
AAATTTTA_TAATTTTT   1007   627.435   1.60495   476.402   229.617     TAATTTTT
TAATTTTT_TAAAATTT   1009   639.571   1.57762   460.021   213.39      TAATTTTT
AAAATTTA_TAATTTTT   1001   636.123   1.5736    453.817   209.292     TAATTTTT
AAAATTTA_AATATATT   809    463.055   1.74709   451.385   258.454     AATATATT
TAAATTTT_TAAAATTT   953    595.532   1.60025   448.062   214.57
AAAAATTA_AAATTTTA   1018   659.764   1.54298   441.52    194.513     AAAAATTA
TAAATTTT_AAATTTTA   935    584.232   1.60039   439.682   210.598
TAATTTTT_AATATATT   844    501.966   1.68139   438.559   233.058     AATATATT; TAATTTTT
TAAATTTT_AAAAATTA   1032   675.18    1.52848   437.851   188.572     AAAAATTA
AAATTTTA_AATATATT   793    456.731   1.73625   437.521   247.579     AATATATT
TAAATTTT_AAAATTTA   937    592.322   1.58191   429.74    200.572
AAATTTTA_TTATATAA   684    368.269   1.85734   423.495   270.689     TTATATAA


Table 36

Common Modules Shared by Arabidopsis thaliana and Rice. In addition to the modules, the corresponding scores and ranks from each species' result are reported in the table.

                    Rice                                                      Arabidopsis thaliana
Module              S     E         S/E       Sln(S/E)   Rank   (S-E)^2/E     S     E         S/E       Sln(S/E)   Rank   (S-E)^2/E
AAAAATTA_AATATATT   819   335.403   2.44184   731.163    188    697.269       830   527.831   1.57247   375.7      33     172.984
TAATTTTT_AATATATT   763   334.953   2.27793   628.152    243    547.013       844   501.966   1.68139   438.559    21     233.058

Table 37

Distance Distribution Map of Common Modules. For the two common modules shared by Arabidopsis and rice, the distance distribution map is reported.

                    Rice                                                        Arabidopsis thaliana
Module              <=20  20~40  40~60  60~80  80~100  >100  Min  Max  Avg      <=20  20~40  40~60  60~80  80~100  >100  Min  Max  Avg
AAAAATTA_AATATATT   60    56     94     34     50      772   0    870  257      85    66     50     54     43      857   0    864  261
TAATTTTT_AATATATT   56    68     76     53     54      657   0    882  246      69    64     65     64     51      839   0    821  263

Table 38

Density Distribution Map of Common Modules. For the two common modules shared by Arabidopsis and rice, the density distribution map is reported.

                    Rice                                                 Arabidopsis thaliana
Module              <10%  10%~20%  20%~30%  30%~40%  40%~50%  >=50%      <10%  10%~20%  20%~30%  30%~40%  40%~50%  >=50%
AAAAATTA_AATATATT   698   150      105      37       26       50         720   218      74       40       25       78
TAATTTTT_AATATATT   577   178      87       45       25       52         742   194      85       39       30       62
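The distance-distribution binning used in Table 37 can be sketched as follows (a hypothetical helper; the bin edges follow the table's column headings):

```python
def distance_distribution(distances, edges=(20, 40, 60, 80, 100)):
    # Bin the word-to-word distances of a module's instances into the
    # ranges of Table 37 (<=20, 20~40, ..., 80~100, >100) and report
    # the minimum, maximum, and average distance alongside the bins.
    bins = [0] * (len(edges) + 1)
    for d in distances:
        for i, e in enumerate(edges):
            if d <= e:
                bins[i] += 1
                break
        else:
            bins[-1] += 1  # beyond the last edge: the >100 bin
    return bins, min(distances), max(distances), sum(distances) / len(distances)
```

An analogous binning over per-instance density values, with edges at 10%, 20%, 30%, 40%, and 50%, would produce the rows of Table 38.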


7.2.3 Conclusion

Currently, among the seven genomic segments from both species, only the basic comparison of the proximal-promoter regions has been completed, and based on the currently available results it is hard to identify a particular distinction or similarity between rice and Arabidopsis thaliana, although such patterns are believed to exist. Nevertheless, there are several interesting findings worth considering for further analysis. First, the motif logos clearly revealed that Arabidopsis thaliana is more AT-rich than rice among the significant putative binding sites. Second, both of the two common modules (Table 36) contained the word AATATATT; furthermore, the other two words contained in the common modules, AAAAATTA and TAATTTTT, are reverse complementary with each other. While the story is not clear at this stage, it might be instructive for future research.


CHAPTER 8: CONCLUSIONS

8.1 Summary of results

Within this thesis, two computational methods for identifying and discovering CRMs have been presented. One employs an enumerative algorithm in order to provide rich information about putative modules, so that users can take maximum advantage of the existing information and make appropriate decisions; the other is a supervised method that needs training data for setting its parameters, so that the results fit specific organisms better. Furthermore, through a comprehensive evaluation system, the two methods were compared with nine other tools, and the comparison showed that they are competitive. Additionally, a genomic toolkit, WordSeeker, has been introduced as the carrier for these two approaches; this software suite provides a comprehensive computational pipeline for functional regulatory analysis.

The enumerative module discovery method has also been applied to two biological case studies. Although the predictions have not yet been verified or falsified by biological experiments, they provide a preliminary guide and putative candidates.

8.2 Future Work

There are several suggestions for improving prediction accuracy, running time and space efficiency, and future practical applications.


First of all, for improving prediction accuracy: the training process of the current version of the HAC algorithm was performed manually; an automatic learning algorithm is believed to be helpful for improving the parameter settings and leading the algorithm to more accurate predictions. Furthermore, exploring more features, or setting different feature vectors for different organisms, would achieve a better fit and bring better results. For the enumerative method, allowing users to define the gap, density, and distance thresholds for filtering combinations of binding sites would make the approach more flexible.

Secondly, regarding space efficiency, the vector data type currently used for recording most of the necessary information is not space efficient, and a better data structure design would help. In addition, a parallelized version of module discovery would greatly reduce the running time by taking advantage of multi-core processing systems.

Thirdly, only the enumerative method has currently been integrated into the WordSeeker toolkit; the HAC algorithm needs to be integrated in the future.

Lastly, the other segments of rice and Arabidopsis thaliana need to be analyzed for the purpose of identifying similar or distinct patterns.


REFERENCES

[1]. Harvey Lodish, Arnold Berk, Paul Matsudaira, Chris A. Kaiser, Monty Krieger and Matthew P. Scott (2006): Molecular Cell Biology, 5th ed. W. H. Freeman and Company.

[2]. Irina Abnizova and Walter R. Gilks (2005): Studying statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the eukaryotic genomes. Briefings in Bioinformatics, 7(1): 48-54, doi:10.1093/bib/bbk004.

[3]. Eric H. Davidson (2006): The Regulatory Genome: Gene Regulatory Networks in Development and Evolution. Page 8.

[4]. Andreas Wagner (1999): Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes. BIOINFORMATICS , 15(10): 776-784.

[5]. Long Li, Qianqian Zhu, Xin He, Saurabh Sinha and Marc S. Halfon (2007): Large-scale analysis of transcriptional cis-regulatory modules reveals both common features and distinct subclasses. Genome Biology, 8:R101, doi:10.1186/gb-2007-8-6-r101.

[6]. Isidore Rigoutsos, Tien Huynh, Kevin Miranda, Aristotelis Tsirigos, Alice McHardy, and Daniel Platt (2006): Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes. PNAS , 103(17): 6605–6610.

[7]. William Krivan and Wyeth W. Wasserman (2001): A predictive model for regulatory sequences directing liver-specific transcription. Genome Research , 11:1559-1566, doi:10.1101/gr.180601.

[8]. Wyeth W. Wasserman and James W. Fickett (1998): Identification of Regulatory Regions which Confer Muscle-Specific Gene Expression. J. Mol. Biol. , 278: 167-181.

[9]. Kjetil Klepper, Geir K. Sandve, Osman Abul, Jostein Johansen and Finn Drablos (2008): Assessment of composite motif discovery methods. BMC Bioinformatics, 9:123, doi:10.1186/1471-2105-9-123.

[10]. V. Matys, E. Fricke, R. Geffers, E. Gößling, M. Haubrock, R. Hehl, K. Hornischer, D. Karas, A. E. Kel, O. V. Kel-Margoulis, D.-U. Kloos, S. Land, B. Lewicki-Potapov, H. Michael, R. Munch, I. Reuter, S. Rotert, H. Saxel, M. Scheer, S. Thiele and E. Wingender (2003): TRANSFAC®: transcriptional regulation, from patterns to profiles. Nucleic Acids Research, 31(1), doi:10.1093/nar/gkg108.


[11]. Ramana V. Davuluri, Hao Sun, Saranyan K. Palaniswamy, Nicole Matthews, Carlos Molina, Mike Kurtz and Erich Grotewold (2003): AGRIS: Arabidopsis Gene Regulatory Information Server, an information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC Bioinformatics, 4:25, available from: http://www.biomedcentral.com/1471-2105/4/25.

[12]. Saranyan K. Palaniswamy, Stephen James, Hao Sun, Rebecca S. Lamb, Ramana V. Davuluri, and Erich Grotewold (2006): AGRIS and AtRegNet: a platform to link cis-regulatory elements and transcription factors into regulatory networks. Plant Physiology, 140: 818-829.

[13]. Jan Christian Bryne, Eivind Valen, Man-Hung Eric Tang, Troels Marstrand, Ole Winther, Isabelle da Piedade, Anders Krogh, Boris Lenhard and Albin Sandelin (2008): JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Research, 36: D102-D106, doi:10.1093/nar/gkm955.

[14]. Marc S. Halfon, Steven M. Gallo and Casey M. Bergman (2007): REDfly 2.0: an integrated database of cis-regulatory modules and transcription factor binding sites in Drosophila . Nucleic Acids Research, 1-5, doi:10.1093/nar/gkm876.

[15]. Wyeth W. Wasserman and James W. Fickett (1998): Identification of regulatory regions which confer muscle-specific gene expression. Journal of Molecular Biology, 278: 167-181.

[16]. Benjamin P. Berman, Yutaka Nibu, Barret D. Pfeiffer, Pavel Tomancak, Susan E. Celniker, Michael Levine, Gerald M. Rubin, and Michael B. Eisen (2002): Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. PNAS , 99(2): 757-762.

[17]. Timothy L. Bailey and William Stafford Noble (2003): Searching for Statistically Significant Regulatory Modules. Bioinformatics, 19: ii16-ii25, doi:10.1093/bioinformatics/btg1054.

[18]. William N. Grundy, Timothy L. Bailey, Charles P. Elkan, and Michael E. Baker (1997): Meta-MEME: Motif-based Hidden Markov Models of Protein Families. CABIOS, 13(4): 397-406.

[19]. Saurabh Sinha, Erik van Nimwegen and Eric D. Siggia (2003): A probabilistic method to detect regulatory modules. Bioinformatics, 19(1): i292-i301.

[20]. Martin C. Frith, Ulla Hansen, and Zhiping Weng (2001): Detection of Cis-element Clusters in Higher Eukaryotic DNA. Bioinformatics, 17(10): 878-889.

[21]. Ojvind Johansson, Wynand Alkema, Wyeth W. Wasserman, and Jens Lagergren (2003): Identification of Functional Clusters of Transcription Factor Binding Motifs in Genome Sequences: the MSCAN algorithm. Bioinformatics, 19(1): i169-i176.

[22]. Martin C. Frith, Michael C. Li, and Zhiping Weng (2003): Cluster-Buster: Finding Dense Clusters of Motifs in DNA Sequences. Nucleic Acids Research, 31(13): 3666-3668.

[23]. Geir K. Sandve, Osman Abul, and Finn Drablos (2008): Compo: Composite Motif Discovery Using Discrete Models. BMC Bioinformatics, 9:527, doi:10.1186/1471-2105-9-527.

[24]. Stein Aerts, Peter Van Loo, Gert Thijs, Yves Moreau, and Bart De Moor (2003): Computational Detection of Cis-Regulatory Modules. Bioinformatics, 19(2): ii5-ii14.

[25]. Stein Aerts, Gert Thijs, Bert Coessens, Mik Staes, Yves Moreau, and Bart De Moor (2003): Toucan: Deciphering the Cis-regulatory Logic of Coregulated Genes. Nucleic Acids Research, 31(6): 1753-1764.

[26]. A. Kel, T. Konovalova, T. Waleev, E. Cheremushkin, O. Kel-Margoulis, and E. Wingender (2006): Composite Module Analyst: a Fitness-based Tool for Identification of Transcription Factor Binding Site Combinations. Bioinformatics, 22(10): 1190-1197.

[27]. A. E. Kel, E. Gößling, I. Reuter, E. Cheremushkin, O. V. Kel-Margoulis, and E. Wingender (2003): MATCH: a Tool for Searching Transcription Factor Binding Sites in DNA Sequences. Nucleic Acids Research, 31(13): 3576-3579.

[28]. Qing Zhou and Wing H. Wong (2004): CisModule: De novo Discovery of Cis-regulatory Modules by Hierarchical Mixture Modeling. PNAS, 101(33): 12114-12119.

[29]. Klaus Ecker and Lonnie Welch (2009): A concept for ab initio prediction of cis-regulatory modules. In Silico Biology, 9, 0024. http://www.bioinfo.de/isb/2009/09/0024/

[30]. Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L. Madden (2009): BLAST+: Architecture and Applications. BMC Bioinformatics, 10:421, doi:10.1186/1471-2105-10-421.

[31]. Vardges Ter-Hovhannisyan, Alexandre Lomsadze, Yury O. Chernoff, and Mark Borodovsky (2008): Gene Prediction in Novel Fungal Genomes Using an ab Initio Algorithm with Unsupervised Training. Genome Research, 18: 1979-1990.

[32]. Lachlan Coin, Alex Bateman, and Richard Durbin (2003): Enhanced protein domain discovery by using language modeling techniques from speech recognition. PNAS, 100(8): 4516-4520.

[33]. Eric H. Davidson (2001): Genomic Regulatory Systems: Development and Evolution. San Diego, Academic Press.

[34]. Donald R. Morrison (1968): PATRICIA—Practical Algorithm to Retrieve Information Coded in Alphanumeric. Journal of the ACM (JACM) , 15 (4): 514-534.

[35]. Edward M. McCreight (1976): A Space-Economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2): 262-272.

[36]. Udi Manber and Gene Myers (1993): Suffix Arrays: a New Method for On-line String Searches. SIAM Journal on Computing, 22(5): 935-948.

[37]. Richard W. Hamming (1950): Error Detecting and Error Correcting Codes. The Bell System Technical Journal , XXIX (2).

[38]. Vladimir Levenshtein (1965): Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics-Doklady, 10(8).

[39]. Lincoln D. Stein, Christopher Mungall, Shengqiang Shu, Michael Caudy, Marco Mangone, Allen Day, Elizabeth Nickerson, Jason E. Stajich, Todd W. Harris, Adrian Arva, and Suzanna Lewis (2002): The Generic Genome Browser: A Building Block for a Model Organism System Database. Genome Research , 12(10):1599-1610.

[40]. Isidore Rigoutsos, Tien Huynh, Kevin Miranda, Aristotelis Tsirigos, Alice McHardy, and Daniel Platt (2006): Short Blocks From the Non-coding Parts of the Human Genome Have Instances Within Nearly All Known Genes and Relate to Biological Processes. PNAS , 103 (17): 6605-6610.

[41]. Jens Lichtenberg, Alper Yilmaz, Joshua D. Welch, Kyle Kurz, Xiaoyu Liang, Frank Drews, Klaus Ecker, Stephen S. Lee, Matt Geisler, Erich Grotewold, and Lonnie R. Welch (2009): The Word Landscape of the Non-coding Segments of the Arabidopsis thaliana Genome. BMC Genomics, 10:463, doi:10.1186/1471-2164-10-463.

[42]. M. S. Nikulin (1973): Chi-Square Test for Continuous Distributions with Shift and Scale Parameters. Theory of Probability and Its Applications, XVIII(3).

[43]. N. Jardine and R. Sibson (1967): The Construction of Hierarchic and Non-Hierarchic Classifications. The Computer Journal, 11(2): 177.

[44]. William H. E. Day (1984): Efficient Algorithms for Agglomerative Hierarchical Clustering Methods. Journal of Classification , 1:7-24.


[45]. Olga V. Kel-Margoulis, Alexander E. Kel, Ingmar Reuter, Igor V. Deineko, and Edgar Wingender (2002): TRANSCompel: a database on composite regulatory elements in eukaryotic genes. Nucleic Acids Research, 30(1): 332-334.

[46]. Martin Tompa, Nan Li, Timothy L. Bailey, George M. Church, Bart De Moor, Eleazar Eskin, Alexander V. Favorov, Martin C. Frith, Yutao Fu, W. James Kent, Vsevolod J. Makeev, Andrei A. Mironov, William Stafford Noble, Giulio Pavesi, Graziano Pesole, Mireille Regnier, Nicolas Simonis, Saurabh Sinha, Gert Thijs, Jacques van Helden, Mathias Vandenbogaert, Zhiping Weng, Christopher Workman, Chun Ye, and Zhou Zhu (2005): Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites. Nature Biotechnology, 23(1).

[47]. Moises Burset and Roderic Guigo (1996): Evaluation of Gene Structure Prediction Programs. Genomics , 34:353-367.

[48]. Joseph Lee Rodgers and W. Alan Nicewander (1988): Thirteen Ways to Look at the Correlation Coefficient. The American Statistician , 42(1): 59-66.

[49]. Xiaoyu Liang, Kaiyu Shen, Jens Lichtenberg, Sarah Wyatt, and Lonnie R. Welch (2010): An Integrated Bioinformatics Approach to the Discovery of Cis-Regulatory Elements Involved in Plant Gravitropic Signal Transduction. International Journal of Computational Bioscience, 1(1).

[50]. Fred D. Sack (1991): Plant Gravity Sensing. International Review of Cytology-a Survey of Cell Biology, 127:193-252.

[51]. Amie C. Scott and Nina S. Allen (1999): Changes in cytosolic pH within Arabidopsis root columella cells play a key role in the early signaling pathway for root gravitropism. Plant Physiol , 121(4):1291-1298.

[52]. Elison B. Blancaflor and Patrick H. Masson (2003): Plant gravitropism: Unraveling the ups and downs of a complex process. Plant Physiol , 133(4):1677-1690.

[53]. Imara Y. Perera, Ingo Heilmann, and Wendy F. Boss (1999): Transient and sustained increases in inositol 1,4,5-trisphosphate precede the differential growth response in gravistimulated maize pulvini. Proc Natl Acad Sci USA , 96(10):5838-5843.

[54]. Jung H. Joo, Yun S. Bae, and June S. Lee (2001): Role of auxin-induced reactive oxygen species in root gravitropism. Plant Physiol , 126(3):1055-1060.

[55]. A. M. Clore, S. M. Doore, and S. M. N. Tinnirello (2008): Increased levels of reactive oxygen species and expression of a cytoplasmic aconitase/iron regulatory protein 1 homolog during the early response of maize pulvini to gravistimulation. Plant Cell and Environment, 31(1): 144-158.

[56]. Jeffery M. Kimbrough, Raul Salinas-Mondragon, Wendy F. Boss, Christopher S. Brown and Heike W. Sederoff (2004): The fast and transient transcriptional network of gravity and mechanical stimulation in the Arabidopsis root apex. Plant Physiol , 136(1):2790-2805.

[57]. Jens Lichtenberg, Edwin Jacox, Joshua D. Welch, Kyle Kurz, Xiaoyu Liang, Mary Q. Yang, Frank Drews, Klaus Ecker, Stephen S. Lee, Laura Elnitski, and Lonnie R. Welch (2009): Word-based Characterization of Promoters Involved in Human DNA Repair Pathways. BMC Genomics , 10 (Suppl 1):S18.

[58]. Robert C. Gentleman, Vincent J. Carey, Douglas M. Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, Kurt Hornik, Torsten Hothorn, Wolfgang Huber, Stefano Iacus, Rafael Irizarry, Friedrich Leisch, Cheng Li, Martin Maechler, Anthony J. Rossini, Gunther Sawitzki, Colin Smith, Gordon Smyth, Luke Tierney, Jean YH. Yang and Jianhua Zhang (2004): Bioconductor: open software development for computational biology and bioinformatics. Genome Biol , 5(10):R80.

[59]. Rainer Breitling, Patrick Armengaud, Anna Amtmann, and Pawel Herzyk (2004): Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett , 573(1-3):83-92.

[60]. Boris Lenhard and Wyeth W. Wasserman (2002): TFBS: Computational Framework for Transcription Factor Binding Site Analysis. Bioinformatics, 18: 1135-1136.

[61]. Cathie Martin and Javier Paz-Ares (1997): MYB Transcription Factors in Plants. Trends in Genetics , 13(2):67-73.

[62]. Graham R. Teakle, Iain W. Manfield, John F. Graham, and Philip M. Gilmartin (2002): Arabidopsis thaliana GATA Factors: Organisation, Expression and DNA-binding Characteristics. Plant Molecular Biology, 50: 43-57.

[63]. Ulrike Hartmann, William J. Valentine, John M. Christie, John Hays, Gareth I. Jenkins, and Bernd Weisshaar (1998): Identification of UV/Blue Light-response Elements in the Arabidopsis thaliana Chalcone Synthase Promoter Using a Homologous Protoplast Transient Expression System. Plant Molecular Biology, 36: 741-754.

[64]. Roberto Solano, Anna Stepanova, Qimin Chao, and Joseph R. Ecker (1998): Nuclear Events in Ethylene Signaling: a Transcriptional Cascade Mediated by ETHYLENE-INSENSITIVE3 and ETHYLENE-RESPONSE-FACTOR1. Genes Dev , 12(23):3703-3714.


[65]. Joline J. Tilly, David W. Allen and Thomas Jack (1996): The CArG Boxes in the Promoter of the Arabidopsis Floral Organ Identity Gene APETALA3 Mediate Diverse Regulatory Effects. Development , 125:1647-1657.

[66]. Edwin Jacox and Laura Elnitski (2008): Finding Occurrences of Relevant Functional Elements in Genomic Signatures. International Journal of Computational Science, 2(5): 599-606.

[67]. Seung Y. Rhee, William Beavis, Tanya Z. Berardini, Guanghong Chen, David Dixon, Aisling Doyle, Margarita Garcia-Hernandez, Eva Huala, Gabriel Lander, Mary Montoya, Neil Miller, Lukas A. Mueller, Suparna Mundodi, Lenore Reiser, Julie Tacklind, Dan C. Weems, Yihe Wu, Iris Xu, Daniel Yoo, Jungwon Yoon, and Peifen Zhang (2003): The Arabidopsis Information Resource (TAIR): A Model Organism Database Providing a Centralized, Curated Gateway to Arabidopsis Biology, Research Materials, and Community. Nucleic Acids Research, 31(1): 224-228.

[68]. Shu Ouyang, Wei Zhu, John Hamilton, Haining Lin, Matthew Campbell, Kevin Childs, Francoise Thibaud-Nissen, Renae L. Malek, Yuandan Lee, Li Zheng, Joshua Orvis, Brian Haas, Jennifer Wortman, and C. Robin Buell (2006): The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic Acids Research, 35(Database issue): D846-851.

[69]. Qiaoping Yuan, Shu Ouyang, Aihui Wang, Wei Zhu, Rama Maiti, Haining Lin, John Hamilton, Brian Haas, Razvan Sultana, Foo Cheung, Jennifer Wortman, and C. Robin Buell (2005): The Institute for Genomic Research Osa1 Rice Genome Annotation Database. Plant Physiology , 138: 18-26.

[70]. The Arabidopsis Genome Initiative (2000): Analysis of the Genome Sequence of the Flowering Plant Arabidopsis thaliana . Nature , 408: 796-815.

[71]. Lutz Nover, Kapil Bharti, Pascal Döring, Shravan Kumar Mishra, Arnab Ganguli, and Klaus-Dieter Scharf (2001): Arabidopsis and the Heat Stress Transcription Factor World: How Many Heat Stress Transcription Factors Do We Need? Cell Stress Chaperones , 6(3):177-189.


APPENDIX 1: TRANSCOMPEL BENCHMARK DATASET

Detailed information about the modules contained in the ten subsets of the TRANSCompel benchmark dataset is summarized in the following tables (1.a to 1.j). Sequence highlighted in blue is a binding site that is part of the specified module, and red highlighting indicates overlap between two adjacent binding sites.

1.a. AP1-Ets
Columns: Seq ID, Module, Start position, End position, Span (bp), TFBSs contained
AF039399 tgagtca ggtttggggtgggattattttagtt aagggaag 481 520 40 tgagtca aagggaag
AF179904 agcggatgtggctaaggctgagtca 488 512 25 tgagtca agcggatgt
D10051 aggaa gc tgagtca 494 507 14 tgagtca aggaa
D13263 tgtgtcatttcctt 492 505 14 tttcctt tgtgtca
J02288 aggaag tg actaac 813 826 14 tgactaac aggaagtg
L05187 gtgagtcatgtgtgacaggtgaggaatgaaaacagagt 471 529 59 gtgagtca gcccgagagcttctat ttcct ttcct
L10616 tgagtca ggcttc cccttcctgcc 559 582 24 tgagtca cccttcctgcc
L19440 tgactca tgctgacaatcttct tcctt 767 793 27 tcttcctt tgactca
L36024 tgactcatcttcctg 493 507 15 tcttcctg tgactca
M16567 gaggatgt tataaagca tgagtca 439 462 24 tgagtca gaggatgt
M21162 tgagtaatgcgtccaggaag 166 185 20 tgagtaa caggaag
X00523 acaggata tg actct 148 162 15 tgactct acaggatatg
X02910 ttcctccagatgagctcat 486 504 19 atgagctcat ttcct
X03020 ttaatca tttcctc 513 526 14 tttcctc ttaatca
X12641 aggaaatgaagtcatctgtcctctcagcaatcagcatga 373 471 99 tgaatcat cagcctccagccaagtaaccctggagtcatgagagctgc aggaaa taggggagcaacatgaatcat
X12641 aggaaatgaagtca 373 386 14 aggaaa tgaagtca
Y11874 aggaaatgaggtca 494 507 14 aggaaa tgaggtca
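In these tables the start and end positions are inclusive, so the span in base pairs equals end - start + 1 (for example, the AF039399 module runs from position 481 to 520, a span of 40 bp). As a one-line check (the helper name is hypothetical):

```python
def span_bp(start: int, end: int) -> int:
    # Start and end are inclusive coordinates, so both endpoints count.
    return end - start + 1
```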

1.b. AP1-NFAT
Columns: TFBS contained, Seq ID, Module, Start position, End position, Nucleotides, TFBS contained
ttaatca X03021 ttaatcatttcctc 521 534 14 tttcctc
tgagctca X02910 tgagctcatgggtttctcc 496 514 19 ggg tttctcc
tgatgtca L07488 tgatgtcatctttcca 416 431 16 tctttcca
tgactct L07488 tgactcttgctttcct 546 561 16 ctttcct
tgtttca X00695 aggaaaaactgtttca 430 445 16 aggaaaaa
gttttcca U90652 tgagctacagttttcca 577 593 17 tgagcta
ggaaa D14461 ggaaaccctgagtttca 842 858 17 agtttca
ggaaaattt X14473 ggaaaatttgtttca 8 22 15 tgtttca
agaaattcc X14473 agaaattccagagagtca 130 147 18 agagtca
ttgaaaat X14473 ttgaaaatatgtgtaat 198 214 17 gtgtaat
attaatca X03020 attaatcatttcctc 512 526 15 catttcctc

1.c. AP1-NFkappaB
Columns: Seq ID, Module, Start position, End position, Nucleotides, TFBSs contained
M11847 tgaggtcaggggtggggaagcccagggctggggat 752 792 41 ctggggattcccca tcccca tgaggtca
X00695 aaagaaattc caa agagtcat 550 570 21 agagtcat aaagaaattc
V00534 tgacataggaaaactgaaagggagaagtgaaagt 185 230 46 gggaaattcctc ggg aaattcctc tgacatag
M28130 tgactca ggtttgccctgaggggatgggccatcagt 473 528 56 ggaatttcct tgcaaatcgtggaatttcct tgactca
AC002428 tgatgtcagggtttttcc 492 509 18 tgatgtca gggtttttcc
J04238 tgactct gcaccctcctccccaactccatttcctttgc 243 377 135 tgactct ttcctccggcaggcggattacttgcccttacttgtcat gggttttcc ggcgactgtccagctttgtgccaggagcctcgcagg ggttgatgggattggggttttcc
M64485 ctgacatcattgtaattttaagcatcgtggatattccc 465 535 71 ctgacatca gggaaagtttttggatgccattggggatttcct ggggatttcct
M64485 ctgacatca ttgtaattttaagcat cg tggatattcc 465 501 37 tggatattcc ctgacatca


1.d. CEBP-NFkappaB
Columns: Seq ID, Module, Start position, End position, Nucleotides, TFBSs contained
AF111163 ggaaaatcct ctgaacctgtaagaagagaacac 442 559 118 ggaaaatcct agccggcatggacacacccttacccttagtctca tttcacaa gttcccaccaagacacagagcatttcctgtgcctt ttccgc ta tttcacaa
AY008847 gaaattcccc cagaaggttttgagagttgttttca 479 522 44 gaaattcccc atgttgcaa atgttgcaa
D63333 aaaattcccccagaatgttttgacactagttttca 814 857 44 aaaattcccc gtgttgcaa gtgttgcaa
L05921 acacaactggga taaatgacccgggatgaaga a 22 119 98 acacaactggga accaccggcatccaggaacttgtcttagaccagt gggactttcc ttgtaggggaaatgacctgca gggactttcc
M17796 tggaaatgcctagatggcgcaatctggggaaag 35 132 98 tggaaatgcc aagatgtacatgaaggaaaagttatcttctgaaa attatgcaag gagaaattatgggtaagtgggattatgcaag
M98536 tgcggat gaagaaaccatgca tgtccgggaagc 463 538 76 tgcggatgaagaaaccatgca ctcttctgtgctttcctaggggaaatgacctgagg ggggctttcc ggctttcc
Y00081 acattgcacaatct taataaggtttccaatcagcc 453 547 95 acattgcacaatct ccacccgctctggccccaccctcaccctccaaca gggattttcc aagatttatcaaatgt gggattttcc
Z11749 cattgagcaatcttaataaggtttccaatcagccc 806 900 95 cattgagcaatct ggattttccc cacccgctctggccccacccccaccctccaacaa agatttatcaaatgtgggattttccc

1.e. Ebox-Ets
Columns: Seq ID, Module, Start position, End position, Nucleotides, TFBSs contained
M15653 cacatg gcccgagagctg cat ccg 288 311 24 cacatg catccg
X15943 ggaagcaaaggggcagctg 491 509 19 ggaa cagctg
U11854 gtctgctgacc agtgcggttaagcaaga 884 933 50 gtctgctgacc gagtccatttccttcctctttt ccttcctctttt
V01523 tgtggc aaggcta tttggggaa 403 424 22 tgtggc tttggggaa
V01523 caggaagca ggtca tgtggc 389 408 20 caggaagca tgtggc
V01523 agcagctggc aggaag 380 395 16 agcagctggc ggaag


1.f. Ets-AML
Columns: Seq ID, Module, Start position, End position, Nucleotides, TFBSs contained
X07177 caggatgtggttt 622 634 13 caggat gtggttt
J02255 caggatat ctgtggtaa 622 638 17 caggatat tgtggtaa
D14816 aaccacaaaaccagaggaggaa 490 511 22 aaccaca gaggaa
X59486 cagga tgtggttt 135 147 13 caggat tgtggttt
S68887 tgtggttgccttgcctagctaaaa 408 437 30 tgtggt ggggaa ggggaa

1.g. IRG-NFkappaB
Columns: Seq ID, Module, Start position, End position, Nucleotides, TFBSs contained
AB006745 ctctttctctttcacttttct gttagctgggg 600 642 43 ctctttctctttcacttttct ttgggactccc gggactccc
V00534 gagaagtgaaagt gggaaattcc 206 228 23 gagaagtgaaagtg gggaaattcc
L09126 gggattttcc ctctctctgtttgttccttttc 42 99 58 gggattttcc ccctaacactgtcaatatttcacttt tatttcacttt
M12483 ggggattcccc atctcct cagtttcactt 379 407 29 ggggattcccc cagtttcactt
X70675 agtttcttttccattttgtgttttcattttatg 70 140 71 agtttcttttcc acagcaacaagtgtttggtgtcttttgtgg ggaaactccc aaactccc
D83956 tggggattcccca ctcccctg agt ttcactt 402 434 33 tggggattcccca ct agtttcacttct

1.h. NFkappaB-HMGIY
Columns: Seq ID, Module, Start position, End position, Nucleotides, TFBSs contained
X03021 ggagattcca 466 475 10 att ggagattcca
M65005 ggg aattt cc 229 238 10 gggaatttcc aattt
X00695 aaagaaattc 550 559 10 aaagaaattc aaatt
V00534 ggg aaatt cc 219 228 10 aaatt gggaaattcc
L09126 gggactctcc ctttgggaacagtt atgca 928 959 32 gggactctcc aaa atgcaaaa
M64485 gggg atttc ct 525 535 11 ggggatttcct atttc
M64485 gg atattcc c 493 502 10 ggatattccc atattcc


1.i. PU1-IRF
Columns: TFBS contained, Seq ID, Module, Start position, End position, Nucleotides, TFBS contained
aaggaa X54550 aaggaagtgaaa 423 434 12 gtgaaa
gaggaa X15878 gaggaactgaaaac 449 462 14 tgaaaac
gttttcatttc M66390 gttttcatttcctc 923 936 14 ttcctc
ggt ttc U26540 ggtttcacttcc 495 506 12 ttcc
ctttc AF172169 ctttcacttcctc 655 667 13 ttcctc

1.j. Sp1-Ets
Columns: Seq ID, Module, Start position, End position, Nucleotides, TFBSs contained
D87541 acttcctc tttcggcggggcggcccg 651 696 46 acttcctc gcctggccggctcctcctcc ggctcctcctcc
J02275 aaggaagtgggcgtggt 152 168 17 aaggaag tgggcgtggt
M60058 ccggaagcaaccagcccacc 491 510 20 ccggaagc cccacc
M84757 ttccttgaggcagggc 493 508 16 ttcctt gaggcagggc
S71507 acaggaatgacctggtgcctcgccc 3 27 25 acaggaat ctcgccc
U13399 gaagggcgggga cagttgaggggg 36 152 117 gaagggcgggga tggaatagggacggcagcagggaa aaagggaactga ccagatagcatgctgctgagaagaa aaaaagacattggtttaggtcagga accaaaa aaagggaactga
U13399 aaagggaactga gtggctgtgaa a 141 172 32 aaagggaactga gggtgggg agggtgggg
X14304 ggaagcaaccagcccacca 491 509 19 ggaa cccacca


APPENDIX 2: MUSCLE BENCHMARK DATASET

In the following table, the detailed information of modules embodied in Muscle benchmark dataset is displayed. Besides the module’s sequence, the contained TFBSs, the starting and ending positions and the span are all summarized and shown. The highlighted part is the contained TFBSs. Start End Contained Seq ID Module Span position position TFBSs caggtg J04699 caggtgcacattcc 804 817 14 cattcc gccccaccccctgcataccaaagtccccagcacaatcaccaggtttaac cccctttaaaaata J04971 740 808 69 tttgtc cccctttaaaaata gccccaccccctgca K01464 taaaaataact aaggtaagggccatgtgggtaggggaggtggtgtgag 280 573 294 cacgcg acggtcctgtctctcctctatctgcccatcggccctttggggaggaggaa taaaataact atgtgcccaaggactaaaaaaggcctggagccagaggggctagggcta ccaaatttagg agcagacctttcatgggcaaacctcagggctgctgtcctcctgtcacctc cctcctgtcacctccagag cagagccaagggatcaaaggaggaggagccagacaggagggatggg cctttcatgg agggagggtcccagcagatgactccaaatttaggcagcaggcacgcgg gaggaaat aatg ggaatg taaaaaa L21905 ctataatagccacaggattaacatagcaggcattgtctttctctgactata 436 539 104 caggtg gggtgggtattatgtgttcatcaaccatcctaaaaatacccggtaaacag ctataata gtg M13483 ccctatttgg ccatccccctgactgccccctcccctt ccttacatgg tctgg 249 435 187 caactg gggctccctggctgatcctctcccctgcccttggctccatgaatggcctcg gccccccacccctgcccc gcagtcctagcgggtgcgaaggggaccaaataaggcaaggtggcaga ccaaataagg ccgggccccccacccctgcccccggctgctccaactg ccatgaatgg ccctatttgg ccttacatgg M13631 ccttatacgg cccggcctcgctcacctgggccgcggccaggagcg ccttc 84 216 133 ccttctttgg tttgggcagcgcggggccggggccgcgccgggcccgacacccaaatat ccaaatatgg ggcgacggccggggccgcattcctgggggccgggc ggccgggc ccttatacgg M20543 ccatatacgg cccggcccgcgttacctgggaccgggccaacccgc tcctt 483 619 137 ccaaatatgg ctttggtcaacgcaggggacccgggcgggggcccaggccgcgaaccgg tccttctttg ccgagggagggggctctagtgcccaacac ccaaatatgg ccatatacgg M21390 ccatgtaaggaggcaaggcctggggacacccgagatgcctggttataat 122 291 170 gttataattaacccag taacccagacatgtggctgctccccccccccaacacctgctgcctgagcc catgtg tcacccccaccccggtgcctgggtcttaggctctgtacaccatggaggag cacctg aagctcgctctaaaaataacc ccatgtaagg tctaaaaataacc M22381 
catctgtccttggccatttgctgagtctcctagctggaaaaagaggtgta 498 718 221 cagctg ggaccgaagcaactaagtttgagggtgtccagtctctgttgagacacttt cagctg tgagggtgtcctgtctctggtgggactattccatctttcgtcccctggctgg catttg cccatgtaatctgagcccagcattgtacatatcctgggaacagctgaca catctg atgcagtggtcagacagctg M57905 ccaaaatagcagctcacaagtgttgcattcctctctgggcgccgggcac 153 207 55 cattcct attcct cattcct ccaaaatagc M62404 taaaaataactg aggtaagggcctgggtaggggaggtggtgtgagacg 277 553 277 aaggccat ctcctgtctctcctctatctgcccatcggccctttggggaggaggaatgtg cctcctgtcacctccagagc cccaaggactaaaaaaaggccatggagccagaggggcgagggcaaca gaggaat gacctttcatgggcaaaccttggggccctgctgtcctcctgtcacctccag ccaaatttagg agccaagggatcaaaggaggaggagccaggacaggagggaagtggg cctttcatgg agggagggtcccagcagaggactccaaatttagg taaaaataactg M63391 ctataaata cccgctctggtatttg gggttgg cagctgt 282 320 39 ctataaata cagctgt M84685 ctatatataaagctgggtcgacttatgtcaccgcactaattaaatgccat 488 540 53 catctg ctg ctatatataaa 156

M95800 cacatg taatccactggaaacgtcttgatgtgcagcaacagcttagagg 430 561 132 cagttg ggggctcaggtttctgtggcgttggctatatttatctctgggttcatgccag cacatg cagggagggtt taaatggcacccag cagttg ctatatttat U02285 ccatatacgg cccggcccgtgttacctgggctcaggccaggcctc tcctt 559 695 137 ccatatacgg ctttggtcagcgcaggggacccgggcggggacccaggccttgaactggt ccaaatatgg cgggggagggggctctagtgcccaacacccaaatatgg tccttctttg U18131 ctatatataaa gctgggtcgacttatgtcaccgcactaattaaatgc cat 275 327 53 ctatatataaa ctg catctg V01218 ccatatacgg cccggtccggtcctagctacctgggccagggccagttctc 122 258 137 tccttctttg tccttctttggtcagtgcaggagacccgggcgggacccaggctgagaag ccatatacgg cagccgaagggactctagtgcccaacacccaaatatgg ccaaatatgg X05632 aaaaataactgaggtaagggccatggcagggtgggaggcggtgtgag 38 240 203 aaaaataactg aaggtccagtcttcccagctatctgctcatcagccctttgaaggggagga cctttcaag atgtgcccaaggactaaaaaaaggccgtggagccagagaggctgggg gaggaat cagcagacctttcaagggcaaatcaggggccctgctgtcctcctgtcacc cctcctgtcacctccagag tccagag X12971 cagctg cacccggctggtgtctctt ccttttatag tcagcag cagttg 690 737 48 ccttttatag cagctg cagttg X14726 gcagcaggtg caaaaatggagctgcgcaggcagaagagtgatcgtcat 436 545 110 cttttaaaaataa ttttaaaatccccaccagctggcgaagcaacaggtgcctaattcctcatc gcagcaggtg ttttaaaaataa caggtg cagctg X59034 cagctg tcatgcgggcaca caggtga tgtaagacaatagctgtggagt c 641 694 54 caggtga agctg cagctg cagctg X62155 ctatatttatctctggttccatgccagcggggagggtttaaatggcaccca 264 321 58 ctatatttat gcagttg cagttg X67686 ccatatacgg cctg gtccggtcctagctacctgggccagggccagtcctc 652 825 174 cattctt tccttctttggtcagtgcaggagacccgggcggggacccaggctgagaa tccttctttg ccagccgaaggaagggactctagtgcccgacacccaaatatggcttgg gggcgg gaagggcagcaacattcttcggggcgg ccaaatatgg ccatatacgg X73887 cagctg gtcccccgacaa caggtg ca cattcc 403 434 32 cagctg caggtg cattcc


APPENDIX 3: LIVER BENCHMARK DATASET

In this table, the detailed information of modules embodied in Liver benchmark dataset is displayed. Besides the module’s sequence, the contained TFBSs, the starting and ending positions and the span are all summarized and shown. The green highlighted part is the corresponding TFBSs, while the red one means overlaps. Start End Seq ID Module Span Contained TFBSs position position gtaaacaatgagttcatccctagtttgttcattctaa gtaaacaatg attaataagtaac AF236668 tcttgagcagattaataagtaacctgctgcctcagc 838 934 97 attggct aggaacagggagctgatattggct gttatggatta ac cactgtttgtc cactgtttgtctatggagagggaggcctcagtgctg U47685 869 939 71 ggttatggattaact aagcaaatatttgt agggccaagcaaatatttgtggttatggattaact ttgtggttat AF033857 ctcatggattatgattaactcaacctt ctgcacatga 779 868 90 ctcatggattatgatt tgcaatggttgggtaat agataaacaagaaagaactgataaacctgcaatg aactcaacctt cttcaactt gttgggtaatcttcaactt M15657 aggagtacggaaatcgttc ttt gtcattacccatcc 192 289 98 aggagtacggaaatc tgttttatgattaaca caggttgtcctcctgtctccttgtggtgaacattggc gttc ctgtgaccc tgttttatgattaaca M29301 attctgaaag ct aaattgcatt 497 518 22 aaattgcatt attctgaaag attgctcaat acaataacctttgactgtg tgttacaa gttacaatatttat attgctcaat M29301 tatttatttattcctatcagtagttagtttcacaacag 637 734 98 gttaatgattctt tgttacaat actagagaat gttaatgattctt gtttcacaac L09674 act tattgat tagattc cc atcaata 568 593 26 tattgat atcaata acttattgattagattc S85346 gttaatcagaaaa cagatccttattttctatggcagcata 436 611 176 gttaatcagaaaa cttactcaa agtattttaatgtc tgcgaaccctgtcactaacacacattc cttactcaataac ttttaagggaaaaaaatgcttctgtgctctagttttaaaat gcaaaggtatgatgttatttgtcaccatgcccaaaaaagt ccttactcaataac aagaagcat gcc aa agttaat cattggcc ctgctg gtttttgag acaaacg agtacatggccgatcaggctgtttttgtgtgcctgtt gccaaagttaat tgtgtgc AF051355 tttctattttacgtaaatcaccctgaacatgtttgcat 647 821 175 caacct ccaaaga caacctactggtgatgcacctttgatcaatacatttt agttaatcattggcc aagaagcatgcc agacaaacgtggtttttgagtccaaaga tgtttgc X16152 gtttatcagtgac tagtcattgattcgaagcatgtg 883 940 58 gtttatcagtgac gtgagggtgaggaaat agg gtg 
aggaaatactgacttt actgacttt aagccagtgtagaaa tggaccttttgcaatcc aagccagtgtagaaaagcaaacaggtcaggcccg agcaaa tg X16152 ggaggcgccctttggaccttttgcaatcctggcgct 83 168 86 gcgctcttgcagcctg cgggaggcgccctttg cttgcagcctgggctt ggctt ga ccttttg L13460 cctgtggacttagttcaaggccagttactaccacttt 540 639 100 caagttaat cctgtggacttagttca tttttttctaatagaatgaacaaatggctaattgttt tgtttgct aggcc gctttgtcaaccaagctcaagttaat gtaggttacttattctcc ttttg cacatttcgtagagcgagtgttccgatactctaatct caaggttcatat cacatttcgtagagcg M19524 ccctaggcaaggttcatatttgtgtaggttacttatt 137 237 101 tttgtgtaggtta agt ctccttttgttgactaagtcaataatc aagtcaataatc gtgtaggttacttattct ccttttgttga M60197 acaaagttaatgattaaaacctcccagac tgtcctc 469 612 144 gcagagattaataatt acaaagttaatgatta gacattgcccaggggtatctatttgccttgatgagg gatgaat aaacctcccagac atattactgtctttctttggggacaaatctgactgtc ccttgcttctgtgcagagattaataattgatgaat


APPENDIX 4: RESULTS FOR THE MUSCLE DATASET

[Table: known cis-regulatory modules versus modules found by the proposed methods. Each found module is a set of binding-site words (e.g. CTCCATATACGGC, ACCCAAATATGGC, GGCTATATAAAA, GGACAGGTGCAG, GAAGAAGAGGTG, CACATTCCTGCT, GTAGCCACCCG, AAACAGGTGCAG, CATTCCT, TACTATCAAAGGG, GACTAAAAAAGGC, GGGGAGGTGGTG, AAGGAGGAGGAG, TCCTTTGGATGGC, GCCCTTTGAAGGG, AAGGAACTGGAG, GACCAAATAAGGC, CTCCATGAATGGC, GCCCCCTCCCC), and each module is reported together with the sequences in which its words occur, the range of each occurrence, and the matched words shown in context within the sequence fragments. The occurrences cover the sequences M20543 (483, 619), U02285 (559, 695), V01218 (122, 258), X67686 (652, 825), X14726 (436, 545), X59034 (641, 694), X73887 (403, 434), J04699 (804, 817), L21905 (436, 539), M13631 (84, 216), M57905 (153, 207), M13483 (249, 435), K01464 (280, 573), M62404 (277, 553), and X05632 (38, 240).]
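The occurrences reported in the table above amount to locating each of a module's binding-site words inside a target sequence. A minimal sketch of that lookup (the `find_sites` helper is a hypothetical illustration, not part of the thesis software; the fragment and words are taken from the M20543 row of the table):

```python
def find_sites(sequence, words):
    """Return (word, offset) pairs for every exact occurrence of each word."""
    sequence = sequence.lower()
    hits = []
    for w in words:
        w = w.lower()
        start = sequence.find(w)
        while start != -1:
            hits.append((w, start))
            start = sequence.find(w, start + 1)
    return hits

# Fragment of sequence M20543 and two module words from the table above.
fragment = ("ccatatacggcccggcccgcgttacctgggaccgggccaacccgctccttctttggtcaacgcagggg"
            "acccgggcgggggcccaggccgcgaaccggccgagggagggggctctagtgcccaacacccaaatatgg")
module = ["CCATATACGG", "CCAAATATGG"]
print(find_sites(fragment, module))
```

A real module scanner would additionally check the reverse-complement strand and allow degenerate positions; exact string matching is only the simplest case.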


APPENDIX 5: GO ANALYSIS RESULT FOR NINE CLUSTERS OF GENES

The following table (adapted from [49]) shows the GO analysis result for the nine clusters of genes. Each enriched GO term is listed as (group count / total count), i.e., the number of annotated genes in the cluster group against the number of annotated genes overall.

Cluster 1 (Group 1; 14 genes)
GO:0007165 signal transduction (2/996)
GO:0007154 cell communication (2/1098)
Genes: at1g02340.1, at3g07890.1

Cluster 2 (Group 1; 18 genes)
GO:0005507 copper ion binding (3/118)
GO:0016020 membrane (10/7444)
GO:0012505 endomembrane system (7/4603)
GO:0031225 anchored to membrane (2/240)
Genes: at5g37990.1, at2g07695.1, at4g39830.1, at3g60270.1, at4g11190.1, at2g33050.1, at3g16530.1, at1g78460.1, at1g79680.1, at1g70990.1

Cluster 3 (Group 1; 16 genes)
GO:0004601 peroxidase activity (2/120)
GO:0016684 oxidoreductase activity, acting on peroxide as acceptor (2/120)
GO:0006979 response to oxidative stress (2/218)
Genes: at5g15180.1, at3g03670.1

Cluster 4 (28 genes)
Group 1: GO:0008289 lipid binding (2/163); genes at4g33550.1, at5g59320.1
Group 2: GO:0022402 cell cycle process (2/185); GO:0007049 cell cycle (2/208); genes at2g32590.1, at3g11520.1

Cluster 5 (Group 1; 21 genes)
GO:0006869 lipid transport (3/117)
GO:0008289 lipid binding (3/163)
GO:0012505 endomembrane system (9/4603)
GO:0016020 membrane (10/7444)
GO:0006952 defense response (3/683)
Genes: at5g23400.1, at5g25610.1, at5g53870.1, at4g12470.1, at3g20820.1, at1g62510.1, at3g22120.1, at2g30540.1, at5g39110.1, at3g24510.1

Cluster 6 (17 genes)
Group 1: GO:0006118 electron transport (3/681); GO:0006091 generation of precursor metabolites and energy (3/829); GO:0019825 oxygen binding (2/248); genes at4g37370.1, at4g31970.1, at1g26380.1
Group 2: GO:0005783 endoplasmic reticulum (2); GO:0006869 lipid transport (2/117); GO:0012505 endomembrane system (7/4603); GO:0008289 lipid binding (2/163); GO:0016020 membrane (8/7444); genes at1g09080.1, at4g37370.1, at4g36670.1, at2g16005.1, at4g31970.1, at4g29020.1
Group 3: GO:0006810 transport (4/1952); GO:0051234 establishment of localization (4/1971); GO:0051179 localization (4/1981); genes at1g26380.1, at5g01870.1, at1g12090.1, at3g15980.1

Cluster 7
Group 1: GO:0004601 peroxidase activity (3/120); GO:0016684 oxidoreductase activity, acting on peroxide as acceptor (3/120); GO:0050832 defense response to fungus (2/85); GO:0009620 response to fungus (2/134); GO:0016491 oxidoreductase activity (4/1507); GO:0006952 defense response (2/683); genes at5g39580.1, at1g20620.1, at2g37130.1, at4g21850.1
Group 2: GO:0008289 lipid binding (3/163); GO:0006869 lipid transport (2/117); GO:0016020 membrane (7/7444); genes at5g39580.1, at4g12550.1, at5g57220.1, at2g37130.1, at4g12510.1, at2g38530.1, at2g05540.1
Group 3: GO:0051707 response to other organism (3/482); GO:0009607 response to biotic stimulus (3/525); GO:0051704 multi-organism process (3/547); genes at5g39580.1, at2g37130.1, at4g39950.1
Group 4: GO:0006091 generation of precursor metabolites and energy (3/829); GO:0006118 electron transport (2/681); genes at5g57220.1, at2g07698.1, at2g46750.1
Group 5: GO:0019825 oxygen binding (2/248); genes at5g57220.1, at4g39950.1
Group 6: GO:0012505 endomembrane system (6/4603); genes at5g39580.1, at4g12550.1, at5g57220.1, at4g12510.1, at2g37130.1, at2g05540.1

Cluster 8 (Group 1; 16 genes)
GO:0016020 membrane (12/7444)
GO:0031224 intrinsic to membrane (4/779)
GO:0044425 membrane part (4/1212)
GO:0012505 endomembrane system (7/4603)
GO:0005623 cell (13/15514)
GO:0044464 cell part (13/15514)
GO:0031225 anchored to membrane (2/240)
Genes: at1g70710.1, at5g23840.1, at3g43720.1, at3g28550.1, at1g06120.1, at3g06460.1, at3g20570.1, at2g39510.1, at5g12940.1, at5g49770.1, at3g04320.1, at3g05020.1, at1g47600.1

Cluster 9 (Group 1; 5 genes)
GO:0031072 heat shock protein binding (2/148)
GO:0005515 protein binding (3/2275)
Genes: at3g30450.1, at2g17060.1, at2g14140.1


APPENDIX 6: GO ANALYSIS RESULT FOR THE WHOLE GENE LIST

The following table (adapted from [49]) shows the GO analysis result for the whole gene list; the groups were generated by GOstat. Each enriched GO term is listed as (group count / total count).

Group 1
GO:0012505 endomembrane system (49/4603)
GO:0016020 membrane (65/7444)
GO:0006869 lipid transport (9/117)
GO:0044464 cell part (81/15514)
GO:0005623 cell (81/15514)
GO:0005507 copper ion binding (5/118)
GO:0031225 anchored to membrane (6/240)
GO:0043170 macromolecule metabolic process (10/6920)
GO:0005622 intracellular (18/9003)
GO:0050832 defense response to fungus (3/85)
GO:0044424 intracellular part (17/8514)
GO:0043283 biopolymer metabolic process (6/4744)
GO:0043227 membrane-bounded organelle (13/7166)
GO:0043231 intracellular membrane-bounded organelle (13/7164)
Genes: at1g74500.1, at1g75780.1, at2g33050.1, at3g08970.1, at3g60270.1, at3g15980.1, at4g33550.1, at3g07890.1, at1g26380.1, at2g07696.1, at5g06720.1, at5g25610.1, at2g38530.1, at1g06120.1, at1g77210.1, at1g18830.1, at5g12940.1, at3g16530.1, at5g15180.1, at3g04320.1, at5g39110.1, at2g32590.1, at3g05020.1, at1g09080.1, at5g47990.1, at1g47600.1, at4g36670.1, at5g01870.1, at3g47380.1, at3g61890.1, at4g12550.1, at2g16005.1, at3g03670.1, at5g65600.1, at5g26260.1, at5g23840.1, at1g78460.1, at5g57220.1, at3g43720.1, at3g06460.1, at5g11210.1, at2g39510.1, at5g49770.1, at2g05540.1, at3g11520.1, at4g29020.1, at1g20620.1, at5g04160.1, at2g04070.1, at5g23400.1, at4g31970.1, at3g24510.1, at5g39580.1, at3g22120.1, at3g29970.1, at4g39830.1, at1g70990.1, at3g20570.1, at2g07698.1, at1g79680.1, at4g12510.1, at2g17060.1, at4g11190.1, at5g37990.1, at1g12090.1, at4g12470.1, at2g07695.1, at4g17785.1, at1g70710.1, at3g20820.1, at4g30270.1, at1g61500.1, at5g64510.1, at5g53870.1, at3g28550.1, at4g37370.1, at1g62510.1, at4g28100.1, at4g28710.1, at2g37130.1, at2g30540.1

Group 2
GO:0008289 lipid binding (11/163)
Genes: at4g33550.1, at1g12090.1, at3g43720.1, at3g22120.1, at5g01870.1, at4g12550.1, at4g12510.1, at4g12470.1, at1g62510.1, at2g38530.1, at5g59320.1

Group 3
GO:0006118 electron transport (13/681)
GO:0006091 generation of precursor metabolites and energy (14/829)
Genes: at3g60270.1, at2g07695.1, at5g53870.1, at2g45550.1, at1g26410.1, at2g46750.1, at4g37370.1, at5g47990.1, at2g07698.1, at4g31970.1, at5g57220.1, at3g20570.1, at2g30540.1, at1g26380.1

Group 4
GO:0004601 peroxidase activity (6/120)
GO:0016684 oxidoreductase activity, acting on peroxide as acceptor (6/120)
GO:0016491 oxidoreductase activity (16/1507)
Genes: at1g20620.1, at2g07695.1, at1g64590.1, at5g39580.1, at1g66800.1, at5g06720.1, at1g26410.1, at5g15180.1, at1g06120.1, at4g39830.1, at2g37130.1, at4g21850.1, at3g03670.1, at2g30540.1, at1g26380.1, at2g37540.1

Group 5
GO:0019825 oxygen binding (6/248)
Genes: at5g57220.1, at4g37370.1, at4g31970.1, at2g45550.1, at5g47990.1, at4g39950.1

Group 6
GO:0006952 defense response (9/683)
Genes: at5g23400.1, at2g17060.1, at2g33050.1, at5g39580.1, at4g12470.1, at2g37130.1, at4g11190.1, at3g20820.1, at4g23670.1

Group 7
GO:0044237 cellular metabolic process (18/9054)
Genes: at1g66800.1, at1g79680.1, at4g39950.1, at1g06120.1, at1g09080.1, at1g01480.1, at5g38020.1, at4g11190.1, at4g17785.1, at5g65600.1, at2g07698.1, at1g68530.1, at5g49770.1, at3g61890.1, at1g61500.1, at1g74500.1, at3g08970.1, at2g07696.1

Group 8
GO:0044238 primary metabolic process (19/9160)
Genes: at5g49770.1, at1g66800.1, at2g07698.1, at1g79680.1, at5g24210.1, at3g61890.1, at1g09080.1, at1g06120.1, at1g01480.1, at1g47600.1, at5g38020.1, at4g17785.1, at4g11190.1, at5g65600.1, at1g68530.1, at1g61500.1, at3g08970.1, at2g07696.1, at1g74500.1
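Enrichment of the kind tabulated above (a group count of k annotated genes among the n genes of a group, against a total count of K annotated genes genome-wide) is typically scored with a one-sided hypergeometric test, the statistic GOstat reports. A minimal sketch, using the Cluster 5 GO:0006869 (lipid transport) counts from Appendix 5; the function name and the genome-wide gene total N = 15514 are illustrative assumptions, not values from the thesis:

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the GO term
    (one-sided hypergeometric test, as used by GO enrichment tools)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# Cluster 5, GO:0006869 lipid transport: group count 3 of 21 genes,
# total count 117; N = 15514 is an assumed genome-wide annotated-gene count.
p = hypergeom_enrichment_p(k=3, n=21, K=117, N=15514)
print(f"{p:.2e}")
```

In practice such raw p-values are corrected for multiple testing (e.g., Benjamini-Hochberg) across all tested GO terms before a term is called enriched.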

at3g03670.1 at 2g30540.1 at1g26380.1 at2g37540.1 GO:0019825 at5g57220.1 at4g37370.1 at4g31970.1 6 248 oxygen binding 5 at2g45550.1 at5g47990.1 at4g39950.1 GO:0006952 at5g23400.1 at2g17060.1 at2g33050.1 9 683 defense response 6 at5g39580.1 at4g12470. 1 at2g37130.1 at4g11190.1 at3g20820.1 at4g23670.1 cellular metabolic GO:0044237 at1g66800.1 at1g79680.1 at4g39950.1 18 9054 process at1g06120.1 at1g09080.1 at1g01480.1 7 at5g38020.1 at4g11190.1 at4g17785.1 at5g65600.1 at2g07698.1 at1g68530.1 at5g49770.1 at3g61890.1 at1g61500.1 at1g74500.1 at3g08970.1 at2g07696.1 primary metabolic GO:0044238 at5g49770.1 at1g66800.1 at2g07698.1 19 9160 process at1g79680.1 at5g24210.1 at3g61890.1 at1g09080.1 at1g06120.1 at1g0 1480.1 8 at1g47600.1 at5g38020.1 at4g17785.1 at4g11190.1 at5g65600.1 at1g68530.1 at1g61500.1 at3g08970.1 at2g07696.1 at1g74500.1