Graphic Network based Methods in Discovering TFBS Motifs

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Lizhi Li

Graduate Program in Biophysics

The Ohio State University

2012

Master's Examination Committee:

Kun Huang, Advisor

Victor Jin

Copyright by

Lizhi Li

2012

Abstract

Finding the motifs of transcription factor binding sites (TFBS) is essential to understanding many biological processes in a cell. Current motif discovery algorithms can be divided into three categories: word enumeration methods, probabilistic methods, and the more recently developed graphic network based methods. Graphic network based methods have shown advantages over the other two categories in prediction accuracy, sensitivity and specificity.

This thesis gives a comprehensive overview of the main motif discovery methods currently in use, with a particular focus on graphic network based methods. In addition, a study of the TFBS motifs of E2F1, a well-known transcription factor, is performed by applying graphic network based algorithms.

Dedication

This document is dedicated to my family.

Vita

July 2002 ...... No. 1 Hengyang County High

2006 ...... B.S. Biological Science, Peking University

2009 ...... M.Eng. Biotechnology, Osaka University

2009 to present ...... Graduate Teaching Associate, Biophysics Graduate Program, The Ohio State University

Fields of Study

Major Field: Biophysics

Table of Contents

Chapter 1: Introduction

1.1 Protein-DNA Interactions

1.2 Transcription Factor

1.3 TFBS and DNA Motif

1.4 Position Weight Matrix (PWM)

Chapter 2: TFBS Motif Discovery Methods

2.1 Probabilistic Methods

2.1.1 MEME (Multiple EM for Motif Elicitation)

2.1.2 AlignACE

2.2 Word Enumeration Based Methods

2.2.1 Weeder

2.2.2 Motif Discovery Scan (MDscan)

2.3 Challenges of Current Motif Discovery Methods

Chapter 3: Graphic Network Based Methods

3.1 MotifCut: A Graphic Network Based Method

3.1.1 Graph Construction

3.1.2 Finding the MDS

3.2 Discussion of MotifCut and Other Graphic Network Based Methods

Chapter 4: TFBS Motifs Discovery Using Weighted Graph

4.1 Motivation

4.2 Work Flow

4.2.1 Sequence Selection and Processing

4.2.2 SSM (Similarity Score Matrix) Calculation

4.2.3 Graph Construction

4.2.4 Merging the Subgraphs

4.2.5 Analysis of the MDS

4.3 Result

Chapter 5: Summary and Future Work

References

List of Tables

Table 1. A Position Weight Matrix

Table 2. The constructed MDS with density score

Table 3. The PWM of an MDS obtained from the study of E2F1 TFBS using the weighted graph algorithm

List of Figures

Figure 1. Sample MEME output

Figure 2. Breakdown of an input sequence into a set of 2(k-1)-mers

Figure 3. Examples of the output of eQCM

Chapter 1: Introduction

A transcription factor is a protein that regulates the expression of a gene by binding to a transcription factor binding site (TFBS). TFBSs are usually sequences between 5 and 20 bp long that show conservation and follow specific patterns called DNA motifs. In biological research, finding TFBSs is an important problem because experimental results confirming transcription factor binding are lacking under most conditions (Naughton et al., 2006). The development of computational methods that detect these binding sites accurately would benefit the understanding of complex biological processes such as differentiation, development and oncogenesis.

1.1 Protein-DNA Interactions

A cell is driven by numerous interactions between macromolecules. These interactions include, but are not limited to, protein-protein, protein-nucleic acid and nucleic acid-nucleic acid interactions; among them, protein-DNA interactions play a crucial role.

To understand the role that protein-DNA interactions play, it is necessary to examine how proteins interact with DNA, which proteins exist in protein-DNA complexes, and which nucleic acid sequences are required to assemble these complexes (“Overview of Protein-Nucleic Acid Interactions”).

Proteins interact with DNA through four major forces: hydrogen bonds, salt bridges, entropic effects and dispersion forces. These forces help a protein bind to DNA in a sequence-specific or non-sequence-specific manner (“Overview of Protein-Nucleic Acid Interactions”).

Understanding the mechanism by which proteins interact with DNA, and identifying the patterns of DNA sequences that interact with a given protein, are crucial to comprehending how protein-DNA interactions regulate cellular processes.

1.2 Transcription Factor

Transcription is the process by which DNA is copied to form a new messenger RNA (mRNA), which is responsible for protein synthesis, or other RNAs involved in cellular processes such as RNA interference. The level of gene transcription contributes strongly to a cell’s morphological and functional attributes (“Basal Transcription Factor Information”). Hence, it is very important to understand the mechanism of transcriptional regulation comprehensively.

A transcription factor (TF) is a protein that binds to a gene at specific sites and regulates its expression. A TF can act either as an activator, by promoting the recruitment of other transcription-related proteins, or as a repressor, by blocking the expression of the gene (“Basal Transcription Factor Information”). Through this mechanism, transcription factors make the regulation of gene expression possible at the molecular level.

1.3 TFBS and DNA Motif

Because transcriptional regulation relies significantly on TFs binding, in a sequence-specific manner, to DNA sites close to the target gene, the evolution and regulatory mechanism of these interactions can be understood by precisely determining where the proteins are bound in a genome.

Previous research shows that mutations in transcription factor binding sites (TFBS) can result in cell dysfunction. Revealing the mechanism of TFBS regulation is therefore key to understanding transcriptional regulation and phenotypic variability (Mahadevan et al., 2008).

There are usually three major steps in identifying TFBS on a genome scale. First, binding sites are identified experimentally. Second, a sequence model, that is, the pattern of DNA sequences to which a TF specifically binds, is constructed to represent the binding sites of the TF. Third, new binding sites are discovered using this model (Narlikar et al., 2010).

A motif is defined as a concise representation, or model, of the sites for a TF, given a collection of experimentally determined binding sites (Hannenhalli, 2008). For example, if a TF binds to the DNA sites CAGATAAAC, CAGATGAAC and CAGATCATC, this collection of sites can be described by the consensus motif CAGATNANC, where ‘N’ indicates any base.
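The consensus example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `consensus_with_wildcards` is hypothetical, and it assumes the simplest possible rule, namely that a position is written as 'N' whenever the aligned sites disagree there.

```python
def consensus_with_wildcards(sites):
    """Return a consensus string; positions where the sites disagree become 'N'."""
    if len({len(site) for site in sites}) != 1:
        raise ValueError("all sites must have the same length")
    consensus = []
    for column in zip(*sites):
        # A column is conserved when every site carries the same base there.
        consensus.append(column[0] if len(set(column)) == 1 else "N")
    return "".join(consensus)


sites = ["CAGATAAAC", "CAGATGAAC", "CAGATCATC"]
print(consensus_with_wildcards(sites))  # CAGATNANC
```

Real consensus notations often use the full IUPAC ambiguity codes rather than a single wildcard; the sketch keeps only 'N' to match the example in the text.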

Nowadays, the major experimental method for determining the genomic binding locations of sequence-specific regulatory proteins is chromatin immunoprecipitation (ChIP) (Solomon and Varshavsky, 1985). In the ChIP assay, proteins are cross-linked to their DNA-binding sites in vivo and then immunopurified from fragmented chromatin. Next, the bound DNA is investigated genome-wide by microarray hybridization (ChIP-chip) or deep sequencing (ChIP-seq) (Albert et al., 2007).

1.4 Position Weight Matrix (PWM)

A position weight matrix (PWM) describes a motif by capturing the inherent variability of sequence patterns (“RSA-tools - Tutorials - Position-specific scoring matrices”). A PWM is derived from a set of aligned sequences that are usually closely related. Given the sequences below, we can derive a PWM by calculating the frequency of each nucleotide at each position; the resulting PWM is shown in Table 1.

Sequence 1: CAGGTTGGA

Sequence 2: ACAGTCAGT

Sequence 3: GAGGTAAAC

Sequence 4: TCCGTAAGT

Sequence 5: TAGGTCATT

Sequence 6: TAGGTACTG

Sequence 7: ATGGTAACC

Sequence 8: CAGGTATAC

Sequence 9: TGTGTGAGT

Sequence 10: AAGGTAAGT

Position   1     2     3     4     5     6     7     8     9
A          0.3   0.6   0.1   0.00  0.00  0.6   0.7   0.2   0.1
C          0.2   0.2   0.1   0.00  0.00  0.2   0.1   0.1   0.3
G          0.1   0.1   0.7   1.00  0.00  0.1   0.1   0.5   0.1
T          0.4   0.1   0.1   0.00  1.00  0.1   0.1   0.2   0.5

Table 1. A Position Weight Matrix

Each number in the above matrix indicates the observed probability of a given nucleotide at a given position. For instance, the nucleotide "A" is observed at position 1 with a probability of 0.3.
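The frequency calculation behind Table 1 can be sketched in Python as follows. The function name `position_weight_matrix` is an assumption made for illustration; the sketch computes plain per-position frequencies, without pseudocounts or background correction, which is what the table reports.

```python
from collections import Counter


def position_weight_matrix(sequences, alphabet="ACGT"):
    """Return {base: [frequency at each position]} for aligned sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    pwm = {base: [] for base in alphabet}
    for column in zip(*sequences):
        counts = Counter(column)  # how often each base occurs in this column
        for base in alphabet:
            pwm[base].append(counts[base] / len(sequences))
    return pwm


sequences = [
    "CAGGTTGGA", "ACAGTCAGT", "GAGGTAAAC", "TCCGTAAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACC", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]
pwm = position_weight_matrix(sequences)
for base in "ACGT":
    print(base, " ".join(f"{p:.1f}" for p in pwm[base]))
```

Applied to the ten sequences listed above, the printed rows agree with the values in Table 1.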

The PWM is an appealing model due to its simplicity and wide applicability. However, the PWM representation assumes that each position within a binding site is independent of the others, which may not be true in reality.

Chapter 2: TFBS Motif Discovery Methods

Computational biologists have devoted considerable effort to the problem of de novo motif detection. The problem can be stated as follows: find the statistically over-represented short DNA sequences within a set of DNA sequences that are believed to share a common binding site. A typical example is searching for transcription factor binding sites in the promoters of co-expressed genes. This problem is often referred to as the motif-finding problem (Salmon-Divon et al., 2010).

The motif-finding problem has important biological applications, such as the discovery of regulatory elements that determine the timing, location, and level of gene transcription (Fratkin et al., 2006). The sharp increase in the number of sequenced genomes and the amount of genome-scale experimental data now requires computational techniques to address this problem more efficiently.

Over the past few years, numerous tools based on different algorithms have become available for motif prediction. Most current motif-finding algorithms can be categorized as either word enumeration methods or probabilistic (pattern recognition based) methods. In addition, methods with novel strategies have been developed, including machine learning methods, such as Chip-Motifs (Jin et al., 2007), and graphic network based methods, which will be described in the next chapter.

2.1 Probabilistic Methods

Probabilistic methods use a probabilistic model to represent binding sites in the form of a PWM, followed by an optimization procedure to find motifs. For example, MotifSampler (Thijs et al., 2001), AlignACE (Hughes et al., 2000), and BioProspector (Liu et al., 2001) search the PWM space using a Gibbs sampling algorithm, whereas MEME (Bailey et al., 1995) finds overrepresented sequences using an expectation-maximization (EM) strategy. However, these methods are affected by local optima and are sensitive to disturbances in the input sequences (Shida, 2006). Recently, DEME (Redhead, 2007) and Seeder (Fauteux, 2008) were developed to find “discriminative” motifs, in which a motif is treated as a feature that leads to a good classification between a positive set of sequences containing binding sites and a negative (background) set of sequences (Zhang et al., 2011).

2.1.1 MEME (Multiple EM for Motif Elicitation)

MEME is a widely used method that employs the expectation-maximization algorithm (Chuong and Serafim, 2008) to find DNA or protein motifs. The method requires the user to input a set of candidate sequences that may share some unknown sequence motifs. For example, a set of promoters from co-expressed or orthologous genes is likely to contain binding sites for the same transcription factor. Similarly, a set of proteins that interact with the same protein may contain similar domains (Bailey et al., 2006).

Issues with Motif Discovery in MEME

MEME's discovery algorithm searches for a set of similar short sequences within a set of much longer sequences. The results improve when the input sequences are short. However, because binding sites are short and degenerate, discovering TFBS motifs in a set of DNA sequences is difficult. In addition, the promoter is difficult to locate for some genes. For these reasons, MEME is not suited to whole-genome TFBS motif discovery, because the TFBS motifs become statistically invisible in the context of the whole genome (Bailey et al., 2006). Besides, MEME does not consider gaps, deletions and insertions, which makes it a rigid and insensitive TFBS motif prediction method (“Current Protocols in Bioinformatics: Discovering Novel Sequence Motifs with MEME”).

The sample output of MEME is shown in Figure 1.

Figure 1. Sample MEME output. This portion of a MEME HTML output form shows a protein motif that MEME has discovered in the input sequences (Bailey et al., 2006).

2.1.2 AlignACE

AlignACE is an algorithm, implemented in C, that finds multiple motifs in a given set of input DNA sequences. AlignACE is based on a Gibbs sampling algorithm previously applied to finding protein sequence motifs (Neuwald et al., 1995), with the following improvements. First, the motif model was changed to fit the source genome through its non-site sequence frequencies. Second, both strands of the input sequences are considered, and overlapping sites are never allowed. Third, simultaneous searching for multiple motifs was replaced by a step-by-step search approach.

2.2 Word Enumeration Based Methods

The second class of motif-finding methods applies word enumeration algorithms; examples include MDscan (Liu et al., 2002), MULTIPROFILER (Keich et al., 2002), and Weeder (Pavesi et al., 2004).

This class of methods exhaustively enumerates k-mers (DNA sequences of k nucleotides) in the input sequences using different strategies. For example, MDscan uses a deterministic greedy algorithm and Weeder employs a suffix tree for the enumeration. A more recently developed method, MoSDi (Alahmad et al., 2010), uses a compound Poisson distribution to approximate the distribution of motif frequency.
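The basic operation shared by this class of methods, exhaustive k-mer enumeration, can be illustrated with a naive sliding window. This sketch is not the strategy of any of the tools named above (it uses neither a suffix tree nor a greedy search); the function name `count_kmers` and the toy sequences are assumptions made purely for illustration.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping k-mer across all input sequences."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts


toy_sequences = ["ACGTACGT", "TTACGTAA"]
kmer_counts = count_kmers(toy_sequences, 4)
print(kmer_counts["ACGT"])  # 3
```

A full word enumeration method would additionally compare such counts against a background expectation to decide which words are statistically over-represented.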

2.2.1 Weeder

Weeder is a software package for finding TFBS motifs. It adopts parameters and a statistical evaluation designed for the identification of conserved TFBS. Weeder does not rely on any prior knowledge in the discovery process, which makes the discovery of novel motifs possible. Instead, it tries to statistically differentiate the similarities among the input sequences from the similarities among a set of randomly built sequences (Pavesi et al., 2006).

The principle Weeder applies is that a set of k-mers similar to one another should be detected in a group of nonhomologous sequences that contain a motif. This set of k-mers could be instances of binding sites of the same transcription factor. However, such a set of k-mers should not be found when analyzing a random set of sequences that share no motif (Pavesi et al., 2006).

Weeder represents a motif by a consensus, that is, a sequence built from the most frequent nucleotide at each position of the sites. Weeder considers all k-mers that differ from the consensus within a threshold number of substitutions to be valid instances of the motif. Based on this idea, the k-mers of the input sequences are grouped, and each group is evaluated with a measure of significance. In this step, a species-specific background model built from the oligonucleotide distribution of all promoter regions from multiple species is applied. Finally, the raw results are analyzed and processed to obtain the motifs that are most likely to represent conserved TFBS (Pavesi et al., 2006).
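The notion of "valid instances of a consensus" can be made concrete with a Hamming-distance scan. The following is a minimal sketch and not Weeder's implementation: the names `hamming` and `motif_instances`, the toy sequences, and the choice to treat the threshold as inclusive are all assumptions made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def motif_instances(sequences, consensus, max_substitutions):
    """List (sequence index, start, k-mer) for k-mers close to the consensus."""
    k = len(consensus)
    hits = []
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            # Assumption: the substitution threshold is inclusive.
            if hamming(kmer, consensus) <= max_substitutions:
                hits.append((seq_index, start, kmer))
    return hits


toy_sequences = ["TTCAGATAAACGG", "ACAGATGAACTT", "GGGGGGGGGGGG"]
hits = motif_instances(toy_sequences, "CAGATAAAC", max_substitutions=1)
print(hits)  # [(0, 2, 'CAGATAAAC'), (1, 1, 'CAGATGAAC')]
```

The statistical evaluation of each group of instances against a background model, which is the core of the method described above, is deliberately left out of this sketch.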

2.2.2 Motif Discovery Scan (MDscan)

MDscan uses a statistical scoring function derived from a Bayesian formulation to thoroughly scan the most highly ChIP-enriched fragments, generating multiple candidate motif patterns, and then updates and refines the candidate motifs using other, less likely sequences (Liu et al., 2002).

The intuition behind MDscan is to first search for similar words that occur in highly ChIP-enriched sequences. Because these sequences have a higher signal-to-noise ratio, they are more likely to contain motifs. From these words, a position-specific motif matrix can be initialized, and the motif is then updated and refined using the whole set of input sequences.

Besides its application to ChIP experiments, MDscan can also be used to find DNA motifs in other experiments, as long as a subgroup of the sequences contains relatively more abundant motif sites (Liu, et al., 2004).

2.3 Challenges of Current Motif Discovery Methods

Current motif discovery methods have many problems. The major problem of word enumeration based methods is their low computational efficiency. In addition, the effort to guarantee global optimality limits the application of word enumeration based methods to short motifs, which prevents them from discovering longer and more complex motifs (Shashidhara et al., 2010).

The main drawback of probabilistic algorithms is that they may be unable to find the TFBS when the distribution of nucleotides in a motif is not statistically significantly different from that of the background sequences (Zhang et al., 2011).

Many probabilistic methods are based on the PWM, which assumes the independence of each nucleotide position in the motif. However, previous research has suggested that a position might depend significantly on another position in a motif (Naughton et al., 2006). There are two explanations for this dependency. First, a change in the sequence of a binding site may result in a compensatory change elsewhere in the TFBS. Second, if the motifs in a set are evolutionarily related, then a phylogeny of k-mers within the motif is expected (Naughton et al., 2006).

Additionally, probabilistic methods do not consider the probability density contributed by small subclusters, which are inherently difficult to model parametrically (Naughton et al., 2006).

Chapter 3: Graphic Network Based Methods

A recent evaluation of several probabilistic and word enumeration based methods shows that the computational prediction of TFBS motifs is less than 50% accurate, which indicates that there is still considerable room for improvement (Reddy et al., 2007).

In view of this, numerous novel algorithms have been developed to improve the precision and power of motif discovery. Among these, graphic network based algorithms are an effective way to address the motif-discovery problem. Well-established graphic network based methods include BSG (Reddy et al., 2007), MotifCut (Fratkin et al., 2006) and MotifClick (Zhang et al., 2011).

Graphic network based methods formulate motif finding as a search for the maximum density subgraph (MDS) of a graph whose nodes are the words in the input sequences, and whose edges connect similar words. The resulting optimization can be performed in polynomial time.

A representative graphic network based method, MotifCut, is discussed in the following sections as an example to illustrate the core principle that graphic network based methods share.

3.1 MotifCut: A Graphic Network Based Method

MotifCut is a method that does not rely on parameters; instead, the model size grows with the size of the input. MotifCut first converts the input sequences into a collection of k-mers that covers every nucleotide in the sequences. Overlapping or duplicate k-mers are each treated as distinct k-mers. These k-mers form the set of vertices, V, in a graph G = (V, E) representing the input. For every pair of vertices vi, vj, an edge with weight wij is created and stored in E. The weights are the nucleotide similarities between the k-mers. Because motif-containing sequences differ from background sequences, the edges connecting pairs of k-mers that are unlikely to appear in the background are up-weighted (Fratkin et al., 2006).

Because TFBS motifs are conserved, and because an edge in this graph represents the similarity between two k-mers, the k-mers that represent a motif can be recognized as the portion of the graph with the greatest edge density. This observation forms the basis of the MotifCut algorithm: the motif-discovery problem becomes an MDS search problem.

3.1.1 Graph Construction

The main idea of the MotifCut algorithm is to find the MDS in a specially constructed weighted graph G = (V, E), where V is the set of all k-mers in the input and E is the set of weighted edges.

Let ai1, ai2, . . ., aim denote the nucleotides of input sequence i, and let vl = [aij, aij+1, …, aij+k-1] represent a vertex in V, where k is the size of the k-mers. The constructed graph G can be regarded as a fully connected graph with a weight assigned to each edge.
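The construction of a fully connected k-mer graph can be sketched as follows. The sketch makes a simplifying assumption that is not MotifCut's actual weighting: edge weights are plain counts of matching positions, with no background-based up-weighting. The function names `kmers` and `build_kmer_graph` are hypothetical.

```python
from itertools import combinations


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_graph(sequences, k):
    """Return (vertices, edges) of a fully connected weighted k-mer graph.

    Duplicate k-mers are kept as distinct vertices, so vertices are referred
    to by index; edges maps an index pair (i, j), i < j, to a weight.
    """
    vertices = [kmer for seq in sequences for kmer in kmers(seq, k)]
    edges = {}
    for i, j in combinations(range(len(vertices)), 2):
        # Assumed weight: number of positions at which the two k-mers agree.
        edges[(i, j)] = sum(x == y for x, y in zip(vertices[i], vertices[j]))
    return vertices, edges


vertices, edges = build_kmer_graph(["ACGTA", "ACGTT"], k=4)
print(vertices)       # ['ACGT', 'CGTA', 'ACGT', 'CGTT']
print(edges[(1, 3)])  # 3
```

Because the graph is fully connected, the number of edges grows quadratically with the number of k-mers, which is why the size of the input matters for this family of methods.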

3.1.2 Finding the MDS

To find the MDS, MotifCut employs a modified parametric flow algorithm (Gallo et al., 1988). It applies an iterative push/relabel algorithm to find the max-flow and min-cut (Papadimitriou and Steiglitz, 1998).

MotifCut adds two additional vertices to the original graph G: the source s and the sink t. The graph density is computed as $\lambda = W / n$, where $n$ is the number of vertices and $W$ is the total weight of all the edges. The sum of the weights of the edges that connect to vi is denoted d(vi). If d(vi) ≥ λ, s is connected to vi; otherwise vi is connected to t. Vertices connected to the sink are discarded, a new λ is calculated from the remaining vertices, and the algorithm is rerun. Finally, the algorithm converges to the set of vertices that form the MDS (Gallo et al., 1988).
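The iteration described above can be imitated with a simple degree-thresholding loop. This is a deliberately simplified sketch that follows the prose literally: it does not implement the parametric flow or push/relabel machinery, and it is not guaranteed to return the exact maximum density subgraph. The function name `peel_to_dense_subgraph` and the toy graph are assumptions made for illustration.

```python
def peel_to_dense_subgraph(vertices, weights):
    """Repeatedly drop vertices whose weighted degree is below the density.

    vertices: iterable of hashable vertex labels (at least one).
    weights:  dict mapping (u, v) pairs to non-negative edge weights.
    Returns (remaining vertices, density of the remaining subgraph).
    """
    current = set(vertices)
    if not current:
        raise ValueError("at least one vertex is required")
    while True:
        # Keep only edges whose two endpoints are both still present.
        inside = {(u, v): w for (u, v), w in weights.items()
                  if u in current and v in current}
        lam = sum(inside.values()) / len(current)  # lambda = W / n
        degree = dict.fromkeys(current, 0.0)
        for (u, v), w in inside.items():
            degree[u] += w
            degree[v] += w
        keep = {v for v in current if degree[v] >= lam}
        if keep == current:  # nothing was discarded: converged
            return current, lam
        current = keep


# Toy graph: a heavy triangle a-b-c with a light tail a-d-e.
toy_weights = {("a", "b"): 5, ("a", "c"): 5, ("b", "c"): 5,
               ("a", "d"): 1, ("d", "e"): 1}
dense, density = peel_to_dense_subgraph("abcde", toy_weights)
print(sorted(dense), density)  # ['a', 'b', 'c'] 5.0
```

With non-negative weights the loop always terminates with a non-empty set, because the average weighted degree equals 2λ and so at least one vertex always satisfies d(v) ≥ λ.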

3.2 Discussion of MotifCut and Other Graphic Network Based Methods

A crucial property of MotifCut is that the principle it applies differs from that of PWM models. Therefore, the motifs it discovers can be significantly different. PWM based methods impose many assumptions on the motif structure, which can result in missed motifs. In MotifCut’s graph network based formulation, two significantly different k-mers can belong to the same subgraph, as long as the k-mers that connect them are sufficiently edge-dense. Therefore, two related but substantially different k-mers can be part of the same motif in the presence of a set of intermediary k-mers.

In conclusion, graphic network based methods have two main advantages over PWM-based methods and most existing methods in general:

First, the motif model of graphic network based algorithms is nonparametric and hence can accommodate the more complicated structures found in reality, such as nucleotide dependencies.

Second, motifs can be discovered in long inputs after optimization of the models (Fratkin et al., 2006).

Since both word enumeration based and probabilistic methods have many problems in predicting TFBS motifs efficiently and accurately, the application of novel methods, such as graphic network based methods, is becoming increasingly important.

Chapter 4: TFBS Motifs Discovery Using Weighted Graph

4.1 Motivation

Numerous algorithms and tools have been developed to discover TFBS using various approaches. Previous graph based research, such as MotifCut or MotifClick, has shown great improvement in both sensitivity and specificity over the traditional word enumeration and probabilistic methods. Both methods investigate TFBS by building a weighted graph composed of vertices (representing k-mers) and weighted edges (representing the similarity scores between pairs of k-mers). The problem of motif discovery then becomes a clustering problem that aims to find the maximum density subgraphs (MDS).

In this research, the well-known transcription factor E2F1 was studied using a weighted graph method.

4.2 Work Flow

4.2.1 Sequence Selection and Processing

Using the Transcriptional Regulatory Element Database (http://rulai.cshl.edu/cgi-bin/TRED/tred.cgi?process=analysisMotifDWEForm), 50 E2F1 target genes were selected. For each of the 50 target genes, a 1000 bp promoter sequence spanning positions -700 to 299 relative to the transcription start site was retrieved.

Next, each 1000 bp sequence was converted into a collection of 2(k-1)-mers (Figure 2). In our experiments, k was chosen to be 5.

Figure 2. Breakdown of an input sequence into a set of 2(k-1)-mers. Each pair of adjacent 2(k-1)-mers overlaps by k-1 letters. Each 2(k-1)-mer is used as a vertex to construct a graph (adapted from Zhang et al., 2011).

After the conversion, 249 2(k-1)-mers (k = 5) were generated for each 1000 bp sequence. In total, there are 12450 2(k-1)-mers.
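The counts above can be checked with a short sketch of the windowing step. It assumes, following the caption of Figure 2, that each window has length 2(k-1) and that consecutive windows overlap by k-1 letters, so the window start advances by k-1. The function name `split_into_windows` and the placeholder sequence are hypothetical.

```python
def split_into_windows(seq, k):
    """Split seq into 2(k-1)-mers, adjacent windows overlapping by k-1 letters."""
    width = 2 * (k - 1)
    step = k - 1  # assumed: the start advances by the non-overlapping part
    return [seq[i:i + width] for i in range(0, len(seq) - width + 1, step)]


promoter = "ACGT" * 250  # placeholder 1000 bp sequence, not real data
windows = split_into_windows(promoter, k=5)
print(len(windows))       # 249
print(50 * len(windows))  # 12450
```

For k = 5 the windows are 8 letters long and start every 4 letters, which gives (1000 - 8) / 4 + 1 = 249 windows per sequence and 12450 windows for 50 sequences.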

4.2.2 SSM (Similarity Score Matrix) Calculation

Each of the 12450 2(k-1)-mers is now treated as a vertex. To establish a weighted graph, we calculate the weights of all edges. The weight of each edge is defined as the similarity score between its pair of vertices. In this research, the similarity score is calculated simply by checking, position by position, whether the two 2(k-1)-mers share the same nucleotide: each matching position adds 1 point to the score, and each mismatching position adds 0. For example, the two sequences ACCATCGGG and TCCATGGGC share the same nucleotide at six positions, namely positions 2, 3, 4, 5, 7 and 8 (counting from left to right), so their similarity score is 6. Sequences are compared only in the left (upstream) to right (downstream) direction, and complementary strands are not considered. For instance, if two sequences are completely complementary to each other, the similarity score is not 9, but 0.

The formula for calculating the similarity score is shown below:

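$$S_{a,b} = \sum_{l=1}^{L} s_l, \qquad s_l = \begin{cases} 0 & \text{if } a_l \neq b_l \\ 1 & \text{if } a_l = b_l \end{cases}$$

where $a$ and $b$ represent two vertices (k-mers), $L$ is their length, $a_l$ refers to the nucleotide at position $l$ in sequence $a$, and $a_l \in \{A, C, G, T\}$.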

In this way, a 12450 × 12450 matrix containing the similarity score between each pair of the 12450 2(k-1)-mers was built.
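The similarity score and the resulting matrix can be sketched in Python as follows. The function names `similarity_score` and `similarity_matrix` are assumptions made for illustration, and the sketch uses nested lists for clarity rather than an array library; how the diagonal entries were treated in the actual study is not stated, so here each sequence is simply scored against itself.

```python
def similarity_score(a, b):
    """Number of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y)


def similarity_matrix(words):
    """Symmetric matrix of pairwise similarity scores."""
    n = len(words)
    ssm = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # The score is symmetric, so each pair is computed only once.
            ssm[i][j] = ssm[j][i] = similarity_score(words[i], words[j])
    return ssm


print(similarity_score("ACCATCGGG", "TCCATGGGC"))  # 6
print(similarity_score("ACGTACGTA", "TGCATGCAT"))  # 0 (fully complementary)
```

For 12450 vertices the matrix has about 155 million entries, so a practical implementation would likely use a compact numeric array; the nested-list form is only meant to show the computation.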

4.2.3 Graph Construction

After the SSM was obtained, it was input into a graph clustering program called eQCM (Xiang et al., 2012). The basic idea of eQCM is to cluster the vertices according to the edge weight between each pair of vertices. The output is a set of dense subgraphs, each consisting of its vertices and a density score. Examples of the output format are shown in Figure 3.

Figure 3. Examples of the output of eQCM.

4.2.4 Merging the Subgraphs

In the previous step, a set of subgraphs with at least 2 vertices each was obtained. To obtain the MDS, these subgraphs are then merged. The merging criterion is that two subgraphs are merged if the number of vertices they share is equal to or greater than half the number of vertices in the smaller subgraph. In other words, if at least half of the vertices of a subgraph are shared with another subgraph, then this subgraph can be merged into the other subgraph.
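The merging rule can be sketched with Python sets. The text does not specify the order in which pairs are examined or whether merging is repeated, so this sketch assumes that pairs are scanned in list order and that merging is repeated until no pair satisfies the criterion. The function name `merge_subgraphs` and the toy vertex sets are hypothetical.

```python
def merge_subgraphs(subgraphs):
    """Merge vertex sets that share at least half of the smaller set."""
    merged = [set(s) for s in subgraphs]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                shared = len(merged[i] & merged[j])
                smaller = min(len(merged[i]), len(merged[j]))
                if 2 * shared >= smaller:  # shared >= half of the smaller set
                    merged[i] |= merged[j]
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan after every merge
    return merged


toy_subgraphs = [{1, 2, 3, 4}, {3, 4, 5}, {7, 8}, {8, 9, 10, 11}]
print(merge_subgraphs(toy_subgraphs))  # [{1, 2, 3, 4, 5}, {7, 8, 9, 10, 11}]
```

The factor of one half is the threshold that the thesis later identifies as a parameter still to be optimized, so it would be natural to expose it as an argument in a fuller implementation.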

4.2.5 Analysis of the MDS

After the merging step, a set of MDS is obtained. For each MDS, a PWM can be obtained by simple statistical summarization.

4.3 Result

Here we examine the largest MDS identified by the weighted graph algorithm. Its vertices, denoted by vertex number (from 1 to 12450), are shown in Table 2.

4443   7197   1007   146    139    1361   7019   10765  3341   10883
3224   12010  9515   6627   4887   6986   1393   1408   1451   1478
1489   5915   11390  8005   1379   1401   1412   1426   1433   1444
1471   1482   1486   727    1493   6989   1386   1397   1437   1448
1458   1475   2314   1365   1369   1376   1405   1416   1455   6209

Density: 6.38449

Table 2. The constructed MDS with density score.

The numbers in Table 2 correspond to k-mers. For example, 146 means the 146th k-mer out of the total of 12450 k-mers.

The density of a subgraph is calculated as follows:

$$D = \frac{\sum_{i=1}^{n_e} W_i}{n_v},$$

where $D$ represents the density, $W_i$ is the weight of edge $E_i$, and $n_e$ and $n_v$ refer to the number of edges and vertices, respectively, in a subgraph G = (V, E).

Based on this MDS, the set of DNA sequences corresponding to the vertex numbers is obtained. Using the PWM calculation described in Section 1.4, a PWM is derived from this set of DNA sequences; the result is presented in Table 3.

Position   1     2     3     4     5     6     7     8     9
A          1     0.56  0.02  0.32  0     0.6   1     0.34  0.08
C          0     0.04  0     0     0.04  0.08  0     0.12  0.06
G          0     0.36  0.98  0.68  0.96  0.3   0     0.52  0.82
T          0     0.04  0     0     0     0.02  0     0.02  0.04

Table 3. The PWM of an MDS obtained from the study of E2F1 TFBS using the weighted graph algorithm.

From this table, we can see that positions 1 and 7 are extremely conserved, being occupied exclusively by A, while positions 3, 5 and 9 are also highly conserved. From this PWM, a consensus sequence can be derived, which is listed below:

AAGGGAAGG
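The derivation of this consensus from Table 3 can be sketched as follows, assuming the consensus simply takes the most probable nucleotide at each position. The function name `consensus_from_pwm` is hypothetical; the matrix values are copied from Table 3.

```python
pwm = {
    "A": [1, 0.56, 0.02, 0.32, 0, 0.6, 1, 0.34, 0.08],
    "C": [0, 0.04, 0, 0, 0.04, 0.08, 0, 0.12, 0.06],
    "G": [0, 0.36, 0.98, 0.68, 0.96, 0.3, 0, 0.52, 0.82],
    "T": [0, 0.04, 0, 0, 0, 0.02, 0, 0.02, 0.04],
}


def consensus_from_pwm(pwm):
    """Pick the most probable base at each position of the matrix."""
    length = len(next(iter(pwm.values())))
    return "".join(
        max(pwm, key=lambda base: pwm[base][pos]) for pos in range(length)
    )


print(consensus_from_pwm(pwm))  # AAGGGAAGG
```

A plain consensus hides the weaker positions: at positions 2, 4, 6 and 8 the second most frequent nucleotide still accounts for roughly a third of the sequences, which is information the full PWM retains.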

Chapter 5: Summary and Future Work

Graphic network based methods show great advantages among current motif-discovery methods by incorporating more biological meaning into the motif discovery process.

Due to time limitations, much work is left for future study to fully investigate the approach.

First, now that the MDS has been obtained, the PWM calculated from the MDS needs to be compared with the results from other algorithms such as MEME, MDscan and AlignACE. In addition, the results need to be compared with all experimentally confirmed E2F1 motifs.

Second, the algorithm for calculating the similarity score still needs improvement to account for insertions or deletions in sequences.

Last but not least, to achieve the optimal prediction result, the threshold for merging subgraphs into the MDS still needs to be optimized.

References

Albert, I., Mavrich, T.N., Tomsho, L.P., Qi, J., Zanton, S.J., Schuster, S.C., Pugh, B.F.(2007) Translational and rotational settings of H2A.Z nucleosomes across the Saccharomyces cerevisiae genome. Nature 446:572–576.

An Introduction to Position Specific Scoring Matrices. Retrieved from http://bioinformatica.upf.edu/T13/MakeProfile.html

Alahmad Y, Taverna M, Mobdi H, Duboeuf J, Grégoire A, Rancé I, Tran NT. A validated capillary electrophoresis method to check for batch-to-batch consistency during recombinant human glycosylated interleukin-7 production campaigns. J Pharm Biomed Anal. 2010 Mar 11; 51(4):882-8. Epub 2009 Sep 15.

Ben-Gal I, Shani A, Gohr A, Grau J, Arviv S, Shmilovici A, Posch S, Grosse I. Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics. 2005 Jun 1; 21(11):2657-66. Epub 2005 Mar 29.

Brian T. Naughton, Eugene Fratkin, Serafim Batzoglou and Douglas L. Brutlag. A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites. Nucleic Acids Research, 2006, Vol. 34, No. 20

Christos H. Papadimitriou, Kenneth Steiglitz (1998). "6.1 The Max-Flow, Min-Cut Theorem". Combinatorial Optimization: Algorithms and Complexity. Dover. pp. 120–128. ISBN 0486402584.

Chuong B Do & Serafim Batzoglou. What is the expectation maximization algorithm. Nature Biotechnology 26, 897 - 899 (2008)

Current Protocols in Bioinformatics: Discovering Novel Sequence Motifs with MEME. Retrieved from http://www.sdsc.edu/~tbailey/MEME-protocol-draft2/protocols.html

Fauteux F, Blanchette M, Strömvik MV. Seeder: discriminative seeding DNA motif discovery. Bioinformatics. 2008 Oct 15; 24(20):2303-7. Epub 2008 Aug 21.

Fratkin E, Naughton BT, Brutlag DL, Batzoglou S. MotifCut: regulatory motifs finding with maximum density subgraphs. Bioinformatics. 2006 Jul 15;22(14):e150-7.

Gallo E, Stanzani M, Landi G, Craveri A. Vascular disease of the supra-aortic trunk (SAT). Clinical, velocimetric and risk factors findings in a group of elderly subjects. Minerva Cardioangiol. 1988 Dec;36(12):601-4.

Hannenhalli S. Eukaryotic transcription factor binding sites--modeling and integrative search methods. Bioinformatics. 2008 Jun 1; 24(11):1325-31. Epub 2008 Apr 21.

Hughes JD, Estep PW, Tavazoie S, Church GM. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000 Mar 10; 296(5):1205-14.

Jin VX, O'Geen H, Iyengar S, Green R, Farnham PJ. Identification of an OCT4 and SRY regulatory module using integrated computational and experimental genomics approaches. Genome Res. 2007 Jun; 17(6):807-17.

Kadonaga, J.T. (2004). Regulation of RNA polymerase II transcription by sequence-specific DNA binding factors. Cell, 116, 247–257.

Kazuhito Shida. GibbsST: a Gibbs sampling method for motif discovery with enhanced resistance to local optima. BMC Bioinformatics 2006, 7:486 doi:10.1186/1471-2105-7-486

Keich U, Pevzner PA. Finding motifs in the twilight zone. Bioinformatics. 2002 Oct; 18(10):1374-81.

Keich U, Pevzner PA. Subtle motifs: defining the limits of motif finding algorithms. Bioinformatics. 2002 Oct; 18(10):1382-90.

Liu XS, Brutlag DL, Liu JS. An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nat Biotechnol. 2002 20(8):835-9.

Luscombe NM, Austin SE, Berman HM, Thornton JM. An overview of the structures of protein-DNA complexes. Genome Biol. 2000; 1(1):REVIEWS001. Epub 2000 Jun 9.

Mahadevan R, Lovley DR. The degree of redundancy in metabolic genes is linked to mode of metabolism. Biophys J. 2008 Feb 15; 94(4):1216-20. Epub 2007 Nov 2.

Narlikar L, Gordân R, Hartemink AJ. A nucleosome-guided map of transcription factor binding sites in yeast. PLoS Comput Biol. 2007 Nov; 3(11):e215. Epub 2007 Sep 24.

Naughton BT, Fratkin E, Batzoglou S, Brutlag DL. A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites. Nucleic Acids Res. 2006;34(20):5730-9. Epub 2006 Oct 13.

Neuwald AF, Liu JS, Lawrence CE. Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci. 1995 Aug; 4(8):1618-32.

Nicholas M Luscombe, Susan E Austin, Helen M Berman and Janet M Thornton, An overview of the structures of protein-DNA complexes. Genome Biology 2000, 1(1):reviews001.1–001.37

Overview of Protein-Nucleic Acid Interactions. Retrieved from http://www.piercenet.com/browse.cfm?fldID=A6C04192-4535-4618-B372-98D97A7A21F8#intro

Reddy TE, DeLisi C, Shakhnovich BE (2007) Binding Site Graphs: A New Graph Theoretical Framework for Prediction of Transcription Factor Binding Sites. PLoS Comput Biol 3(5): e90. doi:10.1371/journal.pcbi.0030090

Redhead E, Bailey TL. Discriminative motif discovery in DNA and protein sequences using the DEME algorithm. BMC Bioinformatics. 2007 Oct 15; 8:385.

RSA-tools - Tutorials - Position-specific scoring matrices. Retrieved from http://rsat.ulb.ac.be/tutorials/tut_PSSM.html

Salmon-Divon M, Dvinge H, Tammoja K, Bertone P. PeakAnalyzer: genome-wide annotation of chromatin binding and modification loci. BMC Bioinformatics. 2010 Aug 6; 11:415.

Sathyapriya R, Vijayabaskar MS, Vishveshwara S (2008) Insights into Protein–DNA Interactions through Structure Network Analysis. PLoS Comput Biol 4(9): e1000170. doi:10.1371/journal.pcbi.1000170

Shashidhara LS, Smith AG. Expression and subcellular location of the tetrapyrrole synthesis enzyme porphobilinogen deaminase in light-grown Euglena gracilis and three nonchlorophyllous cell lines. Proc Natl Acad Sci U S A. 1991 Jan 1;88(1):63-7.

Solomon MJ, Varshavsky A. Formaldehyde-mediated DNA-protein crosslinking: a probe for in vivo chromatin structures. Proc Natl Acad Sci U S A. 1985 Oct; 82(19):6470-4.

Steve Buratowski. Basal Transcription Factor Information. Retrieved from http://tfiib.med.harvard.edu/transcription/basaltx.html

Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouzé P, Moreau Y. A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics. 2001 Dec; 17(12):1113-22.

Timothy L. Bailey, Nadya Williams, Chris Misleh and Wilfred W. Li. MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Research, 2006, Vol. 34, Web Server issue W369–W373 doi:10.1093/nar/gkl198

Transcriptional Regulatory Element Database. Retrieved from http://rulai.cshl.edu/cgi-bin/TRED/tred.cgi?process=analysisMotifDWEForm

Zhang S, Li S, Niu M, Pham PT, Su Z. MotifClick: prediction of cis-regulatory binding sites via merging cliques. BMC Bioinformatics. 2011 Jun 16; 12:238.

Xiang Y., Zhang C-Q, Huang K. Predicting glioblastoma prognosis networks using weighted gene co-expression network analysis on TCGA data. To appear, BMC Bioinformatics, 2012.
