University of Connecticut OpenCommons@UConn
Doctoral Dissertations University of Connecticut Graduate School
1-17-2014 Machine Learning Approaches to Transcription Factor Binding Site Search and Visualization Chih Lee University of Connecticut - Storrs, [email protected]
Recommended Citation
Lee, Chih, "Machine Learning Approaches to Transcription Factor Binding Site Search and Visualization" (2014). Doctoral Dissertations. 304. https://opencommons.uconn.edu/dissertations/304

Machine Learning Approaches to Transcription Factor Binding Site Search and Visualization
Chih Lee, Ph.D. University of Connecticut, 2014
ABSTRACT
A transcription factor (TF) is a protein or protein complex. It regulates the expression of its target genes by physically binding to the regulatory regions of these genes. The binding sites of a TF naturally share a common pattern or motif with one another. Given known binding sites of a TF, a TF model can be built to scan sequences for putative binding sites. This is known as a transcription factor binding site (TFBS) search problem. In this dissertation, we investigate the TFBS search problem using machine learning approaches.
In general, the known binding sites of a TF are of variable lengths and have to be aligned before a model can be built. Transcription factor binding site alignment is considered an unsupervised learning problem since no other information about the unaligned binding sites is given. We propose an algorithm that considers the lengths of TFBSs and dependencies of nucleotide positions in a binding site. The novel method is named LASAGNA (Length-Aware Site Alignment Guided by Nucleotide Association).
Studies often utilize TFBS search tools to predict the binding sites of a TF in a DNA sequence when binding sites found by assays are not available. The analysis often involves TF model collection, promoter sequence retrieval and visualization, requiring several tools to accomplish. To accelerate TFBS analyses, we developed a novel integrated webtool named LASAGNA-Search. This user-friendly tool allows users to perform the analysis without leaving the site.
TFBS search methods are considered supervised learning algorithms since they learn from example binding sites of a TF. Most of the TFBS search methods consider only known binding sites of a TF and hence deal with one-class classification problems. However, non-binding sites contain information about the TF as well. When non-binding sites are available, searching for TFBSs becomes a two-class classification problem. We propose two novel methods named the negative-to-positive vector and the optimal discriminating vector methods, utilizing both binding sites and non-binding sites.

Machine Learning Approaches to Transcription Factor Binding Site Search and Visualization
Chih Lee
M.Sc. Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2005
A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy at the University of Connecticut
2014

Copyright by
Chih Lee
2014
APPROVAL PAGE
Doctor of Philosophy Dissertation
Machine Learning Approaches to Transcription Factor Binding Site Search and Visualization
Presented by Chih Lee, M.Sc. CSIE
Major Advisor Chun-Hsi Huang
Associate Advisor Jinbo Bi
Associate Advisor Sanguthevar Rajasekaran
Associate Advisor Daniel Schwartz
Associate Advisor Dong-Guk Shin
University of Connecticut 2014
TABLE OF CONTENTS
Introduction : Machine Learning and Computational TFBS Analysis
  0.1. Supervised and Unsupervised Learning
  0.2. Computational TFBS Analysis
    0.2.1. Transcription Factor Binding Site Alignment
    0.2.2. Transcription Factor Binding Site Search
    0.2.3. Chapter Organization

Ch. 1 : LASAGNA
  1.1. Background
  1.2. Methods
    1.2.1. The Search Module
    1.2.2. The LASAGNA Algorithm
    1.2.3. LASAGNA for ChIP-seq Data
    1.2.4. Scoring a Putative Binding Site
  1.3. Results and Discussion
    1.3.1. Comparison of Alignment Algorithms
    1.3.2. Comparison of TFBS Search Methods
    1.3.3. Application of LASAGNA-ChIP to ChIP-seq Data
    1.3.4. LASAGNA is Simple and Effective
  1.4. Conclusions

Ch. 2 : LASAGNA-Search
  2.1. Background
  2.2. Materials and Methods
    2.2.1. Modules
    2.2.2. TF Model Collections
  2.3. Results and Discussion
    2.3.1. User Interface
    2.3.2. Comparison of Features to Existing Webtools
    2.3.3. Evaluation of Precomputed TF Models
  2.4. Future Directions

Ch. 3 : Searching for TFBSs in Vector Spaces
  3.1. Background
  3.2. Methods
    3.2.1. Data Sets
    3.2.2. Notation
    3.2.3. Embedding Short Sequences in Vector Spaces
    3.2.4. Searching for TFBSs in Vector Spaces
    3.2.5. The NPV Method
    3.2.6. The ODV Method
    3.2.7. The PSSM and ULPB Methods
  3.3. Results and Discussion
    3.3.1. Performance Assessment and Evaluation Metrics
    3.3.2. Prokaryotic Transcription Factor Binding Sites
    3.3.3. Eukaryotic Transcription Factor Binding Sites
    3.3.4. Motif Subtype Identification in Vector Spaces
    3.3.5. Independent Validation on ChIP-seq Data
  3.4. Conclusions

Bibliography
LIST OF FIGURES

1.2.1 An Illustration of LASAGNA with Ka = 0
1.2.2 LASAGNA-ChIP Flowchart
1.3.1 Overall ROC Curves for the Three Alignment Algorithms
1.3.2 Distribution of Ks by Species and Conserved Domain (LASAGNA)
1.3.3 Distribution of Ks by Species and Conserved Domain (ClustalW2)
1.3.4 Distribution of Ks by Species and Conserved Domain (MEME)
1.3.5 Comparison of the PSSM Method Dependent on LASAGNA to SiTaR

2.2.1 Architecture of LASAGNA-Search
2.2.2 Comparison of Scoring Strategies Using TF Models Collected by LASAGNA-Search
2.2.3 An Inferred Gene Regulatory Network of Human Genes TP53 and MYB
2.3.1 Input Page of LASAGNA-Search
2.3.2 Result Page of LASAGNA-Search
2.3.3 Visualization of Hits in the UCSC Genome Browser
2.3.4 Comparison of Precomputed TF Models by Average Precision
2.3.5 Comparison of Precomputed TF Models by Accuracy

3.2.1 Illustration of Embedding a Short Sequence in Vector Space
3.2.2 Illustration of the NPV Method
3.2.3 The Orthogonal Projection of s onto t
3.3.1 Comparison of the PSSM, ULPB, NPV and ODV Methods on the RegulonDB Data Set
3.3.2 Comparison of the PSSM, ULPB, NPV and ODV Methods on the JASPAR Data Set
3.3.3 Illustration of the kNPV Method
3.3.4 The kNPV (kODV) Method Versus the NPV (ODV) Method
LIST OF TABLES

1.3.1 TFBSs in TRANSFAC Public Database by Species
1.3.2 Species-wise and Overall Comparisons between LASAGNA and ClustalW2
1.3.3 Comparison of Two Groups of TFs Divided According to Results on LASAGNA and ClustalW2
1.3.4 Species-wise and Overall Comparisons between LASAGNA and MEME
1.3.5 Comparison of Two Groups of TFs Divided According to Results on LASAGNA and MEME
1.3.6 Distribution of the 1751 Binding Sites of 90 TFs in TRANSFAC Public Database
1.3.7 List of 38 ChIP-seq Experiments
1.3.8 Sequence Logos of Motifs Found by LASAGNA-ChIP and MEME

2.2.9 Summary of TF Model Collections
2.3.10
2.3.11 Human ENCODE Tracks Used for Validating TF Models
2.3.12 Mouse ENCODE Tracks Used for Validating TF Models
2.3.13 Performance of LASAGNA-Search TF Models Validated by ENCODE Human ChIP-seq Data
2.3.14 Performance of LASAGNA-Search TF Models Validated by ENCODE Mouse ChIP-seq Data
2.3.15 Performance of MAPPER2 TF Models Validated by ENCODE Human ChIP-seq Data
2.3.16 Performance of MAPPER2 TF Models Validated by ENCODE Mouse ChIP-seq Data

3.2.17 Statistics of the E. coli TFs in RegulonDB
3.2.18 Statistics of TFs in the JASPAR Database
3.3.19 Results of Independent Validation on ChIP-seq Data
Introduction
Machine Learning and Computational Transcription Factor Binding Site Analysis
0.1 Supervised and Unsupervised Learning
Machine learning problems in general fall into two categories, supervised learning and unsupervised learning. The classification problem is a supervised learning problem. In a classification problem, each data instance is tagged with a class label. The labeled instances “supervise” the learning process, which produces a classifier. The classifier is able to distinguish one class from another and hence can be used to classify a data instance whose class is unknown. For example, a data instance can be a set of variables describing a person, while the class labels are normal and cancer. The trained classifier can then predict whether a person has cancer based on these variables. The clustering problem, on the other hand, is an unsupervised learning problem. In this case, data instances are not tagged with class labels. A clustering algorithm divides data instances into groups based on the pairwise similarity between data instances. One example is to group cancer patients so as to identify cancer subtypes.
Embedding data instances in $\mathbb{R}^p$ is usually a first step in machine learning, where $p$ is the number of variables describing an instance. For some data, an instance is readily described by $p$ real variables. Microarray data [75] is one example, where each array is described by the expression of $p$ genes. Oftentimes, biological data instances are not readily described by $p$ real variables, and conversion is needed to embed each instance in $\mathbb{R}^p$. For instance, to manipulate sequence data in Euclidean space, each sequence needs to be embedded in $\mathbb{R}^p$. Once the instances are embedded in $\mathbb{R}^p$, it is often desirable to manipulate them in a lower-dimensional subspace $\mathbb{R}^k$, where $k < p$. A subspace can be found such that certain requirements are satisfied. Principal component analysis (PCA) [39], an unsupervised learning approach, finds a subspace such that a desired portion of variance in the data is retained in the subspace while pairwise similarity between data instances is approximately preserved. Previous studies showed success of PCA in population structure analysis [46], ethnicity inference [52] and haplogroup inference [88]. With additional information, data instances can be categorized into two or more classes. In this case, data instances can be placed in a subspace found by Fisher’s discriminant analysis (FDA), which can be viewed as a supervised learning method. In this subspace, class centers are separated as far from one another as possible while class members are pulled as close to their respective class centers as possible. It was demonstrated that FDA can be used to visually identify the best similarity measure between two short DNA sequences [48] in searching for transcription factor binding sites.
PCA and FDA systematically find a subspace such that certain requirements are satisfied. Domain knowledge, however, can also be utilized to find subspaces. Lee and Huang [49] approached the transcription factor (TF) binding site (TFBS) search problem by first embedding short sequences of length $l$, or $l$-mers, in a vector space, where $l$ is the length of TFBSs. TFBSs can then be searched in subspaces identified by domain knowledge. Specifically, these are promising subspaces implied by previous studies. Similar to information retrieval [1], a query vector is used to search for TFBSs. Two supervised approaches to constructing a query vector were proposed. One is named the negative-to-positive vector (NPV) method. The other is named the optimal discriminating vector (ODV) method. NPV and ODV are supervised methods because they learn from known TFBSs and non-TFBSs. It was shown that, in this framework, the best subspace can be identified for each individual TF. This important advantage contributes to the superior performance of the NPV and ODV methods over a state-of-the-art method named ULPB [69].
0.2 Computational TFBS Analysis
A transcription factor is a protein that regulates the expression of its target genes by physically binding to the regulatory regions of these genes. The binding sites of a transcription factor (TF) naturally share similarity with one another. The common pattern shared among the binding sites of a TF is called a motif. In general, there are two approaches to computational motif analysis, de novo motif discovery [3, 85, 5, 11, 76, 79, 66, 4, 55, 91, 89, 29] and transcription factor binding site (TFBS) search [63, 13, 34, 69, 24]. As the name suggests, de novo motif discovery algorithms find over-represented patterns in sequences without prior knowledge of the binding TFs. The input to these algorithms is usually the upstream region sequences of genes putatively co-regulated by one or more common TFs. The output is one or more motifs or patterns whose instances are over-represented in the input sequences. On the other hand, a TFBS search algorithm takes binding site sequences of a TF as input. It learns from these known binding sites and builds a TF model out of them. The TF model can then be used to scan sequences for putative binding sites. While the two approaches are tightly connected, we focus on the TFBS search problem and assume that a TF has known binding sites available.
0.2.1 Transcription Factor Binding Site Alignment
A typical TFBS search algorithm requires aligned TFBSs. This requirement allows simple representations of TF models. Types of TF models include consensus sequences, position-specific scoring matrices (PSSMs) [77], etc. The PSSM method is a widely used method among the available TFBS search algorithms. Given aligned binding sites of a TF, the TF model is essentially a $4 \times l$ matrix, where $l$ is the length of the binding sites. Column $i$ of the matrix stores the scores of matching the $i$th letter in a sequence of length $l$ (an $l$-mer) to nucleotides A, C, G and T, respectively. The score of an $l$-mer is then calculated by summing up the scores of letter 1 through letter $l$. Once constructed, the matrix of a TF can be stored in a database to scan sequences for binding sites of the TF in the future without resorting to the actual binding sites. In fact, many tools [74, 40, 71, 13, 83, 90, 41] depend on matrices stored in at least one of the databases, JASPAR [10], RegulonDB [27] and TRANSFAC [59]. Since a matrix is constructed from aligned binding sites, we cannot overemphasize the quality of TFBS alignments.
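The PSSM description above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in any of the cited tools: the function names `build_pssm`, `score_lmer` and `scan`, the uniform background of 0.25 per nucleotide, and the pseudocount of 1 are all assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pssm(sites, pseudocount=1.0):
    """Build a 4 x l log-odds matrix (stored as a list of l columns)
    from aligned, equal-length sites. Assumes a uniform background."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("sites must be aligned to the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pssm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        pssm.append({u: math.log(((counts[u] + pseudocount) / total) / 0.25)
                     for u in ALPHABET})
    return pssm


def score_lmer(pssm, lmer):
    """Sum the scores of letter 1 through letter l."""
    return sum(column[u] for column, u in zip(pssm, lmer))


def scan(pssm, sequence):
    """Score every l-mer of a sequence; returns (start, score) pairs."""
    l = len(pssm)
    return [(i, score_lmer(pssm, sequence[i:i + l]))
            for i in range(len(sequence) - l + 1)]
```

Calling `scan(build_pssm(sites), promoter)` and keeping the windows whose scores exceed a chosen cutoff gives the putative binding sites; the choice of cutoff is outside the scope of this sketch.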
Databases such as JASPAR, TRANSFAC and ORegAnno [32] contain DNA segments bound by TFs. These DNA segments can be seen as TFBSs with some irrelevant bases on one or both sides because of the resolutions of the techniques used to obtain TFBSs. The DNA segments belonging to a TF are therefore unaligned variable-length sequences. While the DNA segments for most TFs in the JASPAR database are aligned, this is not the case for the TRANSFAC Public and ORegAnno databases. About 53% (983 out of 1867) of the TFs in the TRANSFAC Public database (release 7.0) have unaligned variable-length DNA segments. Moreover, nearly 78% (1447 out of 1867) of TFs having curated DNA segments do not have a matrix. Focusing on TFs with variable-length DNA segments, about 71% (669 out of 983) of them do not have a matrix. On the other hand, the ORegAnno database stores experimentally validated DNA segments bound by TFs but does not provide matrices. About 31% (175 out of 572) of the TFs therein have variable-length DNA segments. In the absence of a matrix, to search for binding sites of these TFs using a matrix-dependent tool, one needs to first align the curated DNA segments for each TF. In the rest of this dissertation, we refer to (variable-length) DNA segments containing TFBSs as (variable-length) TFBSs for simplicity.
ChIP-seq data represents another important source of TFBSs. ChIP-seq [38] is a high-throughput technique used to determine the in vivo binding affinities of a transcription factor to DNA on the whole-genome scale. The raw data from a ChIP-seq experiment is typically processed by a peak-finding algorithm to identify DNA segments containing binding signal peaks. These variable-length DNA segments can be seen as TFBSs with excessive irrelevant bases on both sides. To locate the actual TFBS within each DNA segment, a de novo motif discovery tool can be used to identify the motifs present in those DNA segments.
0.2.2 Transcription Factor Binding Site Search
One assumption the PSSM representation makes is that positions in a binding site are independent, which is often not the case. Osada et al. [63] exploited dependence between positions by considering nucleotide pairs in scoring methods. It was shown that incorporating nucleotide pairs significantly improved the performance of a method, meaning that most transcription factors studied demonstrated correlation between positions in a binding site. This result was reinforced in a recent study [69], in which the authors showed correlations between two nucleotides within a binding site by plotting the mutual information matrix. A novel scoring method called the ungapped likelihood under positional background (ULPB) method was proposed in this study. The ULPB method models a TFBS by two first-order Markov chains and scores a candidate binding site by the likelihood ratio produced by the two Markov chains. Leave-one-out cross-validation results on 22 TFs with 20 or more binding sites showed that ULPB is superior to the methods compared in their work.
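To make the idea of scoring by a likelihood ratio of two first-order Markov chains concrete, the following sketch trains one position-specific chain on binding sites and another on background $l$-mers, then scores a candidate by the log-likelihood ratio. It is only an illustration of the general technique and is not the ULPB method of [69]: the way the background chain is defined, the pseudocount, and all function names are assumptions made for this example.

```python
import math

ALPHABET = "ACGT"


def train_chain(lmers, pseudocount=1.0):
    """Position-specific first-order Markov chain over equal-length l-mers:
    P(s_1) plus one transition table P(s_i | s_{i-1}) per position i > 1."""
    l = len(lmers[0])
    first = {u: pseudocount for u in ALPHABET}
    trans = [{u: {v: pseudocount for v in ALPHABET} for u in ALPHABET}
             for _ in range(l - 1)]
    for s in lmers:
        first[s[0]] += 1
        for i in range(1, l):
            trans[i - 1][s[i - 1]][s[i]] += 1
    total = sum(first.values())
    first = {u: c / total for u, c in first.items()}
    for table in trans:
        for u in ALPHABET:
            row_total = sum(table[u].values())
            table[u] = {v: c / row_total for v, c in table[u].items()}
    return first, trans


def log_likelihood(chain, s):
    """Log-probability of s under a chain returned by train_chain."""
    first, trans = chain
    ll = math.log(first[s[0]])
    for i in range(1, len(s)):
        ll += math.log(trans[i - 1][s[i - 1]][s[i]])
    return ll


def log_likelihood_ratio(site_chain, background_chain, s):
    """Positive values mean s looks more like a site than like background."""
    return log_likelihood(site_chain, s) - log_likelihood(background_chain, s)
```

Because each transition table conditions on the preceding nucleotide, this kind of model captures dependence between adjacent positions, which a plain PSSM ignores.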
The PSSM method and similar methods consider only known binding sites of a TF, while information contained in non-binding sites is not exploited. This is because explicit use of negative examples in the TFBS search problem is hindered by the vast number of non-binding sites of a transcription factor. This is further aggravated by the low specificity of some transcription factors, where a binding site may be more similar to a non-binding site than to some other binding sites. Owing to these issues, previous studies involving negative examples are limited and the roles of negative examples remain unclear. In a review article, Hannenhalli [34] surveyed work on improved motif models and integrative methods. None of these reviewed studies [34], however, investigated the use of negative examples on top of true TFBSs. While introducing improved benchmarks for computational motif discovery, Sandve et al. [72] described algorithms for finding optimal motif models using both positive and negative TFBSs. Three models were compared using the proposed benchmarks. However, no methods relying on only positive examples were compared. In a previous study, Do and Wang [18] formulated the TFBS search problem as a classification problem, proposed a novel similarity measure, and investigated three classification techniques. Five-fold CV results showed that learning vector quantization performed better than P-Match [13], which requires only positive examples. The evaluation, however, was done on only 8 human transcription factors and 8 artificial ones. It is not clear how the results on the small set of 8 real TFs can be generalized to other TFs. Lee and Huang [48] proposed to visualize TFBSs in the context of negative examples. It was shown that, using negative examples explicitly, the visualization technique affords identification of a better similarity measure between short sequences in a TFBS search problem.
0.2.3 Chapter Organization
In this dissertation, we investigate transcription factor binding site search and visualization using machine learning techniques. Typically, a TFBS search algorithm cannot process and learn from unaligned binding sites. Therefore, the TFBS search problem consists of two sub-problems. One is aligning the known variable-length binding sites of a TF, while the other is searching for novel binding sites given the aligned TFBSs. TFBS alignment can be seen as an unsupervised learning problem since no other information about the unaligned TFBSs is given. In Chapter 1, we describe an algorithm that considers the lengths of TFBSs and iteratively applies a TFBS search method to scan unaligned TFBSs [50, 51, 47]. That is, a supervised learning algorithm is used at each iteration to deal with this unsupervised learning problem. The method is named LASAGNA (Length-Aware Site Alignment Guided by Nucleotide Association). We show that LASAGNA significantly outperforms two widely used algorithms for TFBS alignment. Coupling LASAGNA with a TFBS search method, we further show that it performs better than an alignment-free method accepting variable-length input TFBSs.
Studies often utilize TFBS search tools to predict the binding sites of a TF in a DNA sequence when binding sites found by assays are not available. The analysis often involves TF model collection, promoter sequence retrieval and visualization, requiring several tools to accomplish. To accelerate TFBS analyses, we developed a novel integrated webtool, allowing users to perform the analysis without leaving the site. Important features include accepting unaligned variable-length TFBSs, collections of 1792 TF models, automatic promoter sequence retrieval, visualization in the UCSC Genome Browser [19] and gene regulatory network inference/visualization based on binding specificities. We describe this user-friendly webtool in detail in Chapter 2.
Most of the TFBS search methods consider only known binding sites of a TF and hence deal with one-class classification problems. However, non-binding sites contain information about the TF as well. When non-binding sites are available, searching for TFBSs becomes a two-class classification problem. In Chapter 3, we describe the NPV and ODV methods, which utilize both binding sites and non-binding sites of a TF. We show the performance gain of the NPV and ODV methods over the ULPB method through cross-validation experiments and independent validation experiments on the whole-genome scale.
Chapter 1
LASAGNA: An Unsupervised Learning Algorithm for TFBS Alignment
1.1 Background
In this chapter, we describe a novel TFBS alignment algorithm named LASAGNA (Length-Aware Site Alignment Guided by Nucleotide Association). The algorithm is based on the hypothesis that binding sites of a TF share a core [12], a short and highly conserved stretch of DNA. Hence, a binding site can be seen as a core with some irrelevant bases on one or both sides. In general, shorter sites tend to contain fewer irrelevant bases and are easier to align. For this reason, we progressively align the binding sites from the shortest to the longest ones. The algorithm further exploits dependence between two positions in a binding site. Dependence between positions has been shown to boost performance of TFBS search algorithms [63, 69] as well as protein structural motif recognition [44]. To the best of our knowledge, this idea has never been applied to multiple sequence alignment. We further describe a more sophisticated version, named LASAGNA-ChIP, for aligning peak sequences produced by ChIP-seq experiments.
To compare algorithms for TFBS alignment, we conduct cross-validation (CV) experiments on 4771 binding sites of 189 TFs across 5 species extracted from the TRANSFAC Public database (release 7.0). We compare LASAGNA to ClustalW2 [81, 45] and MEME [3]. ClustalW2 [81, 45] is a widely used multiple sequence alignment tool. It aligns sequences by first constructing a guide tree based on pairwise sequence alignments. The guide tree then determines the order in which the sequences are aligned. Although ClustalW2 was not specifically designed for aligning TFBSs, it was used to produce gapped TFBS alignments in creating the MAPPER database [58] as well as to produce both gapped and gapless TFBS alignments in [69]. ClustalW2 and other similar algorithms focus on producing structurally correct alignments, while other improved algorithms rely on structural or homology information [61]. ClustalW2 can be viewed as a representative of these algorithms when no information other than sequences is available.
MEME [3], on the other hand, is a widely adopted de novo motif discovery tool. It is used to find over-represented patterns or motifs in input sequences. It is capable of locating the motif instances in each input sequence, while it can also handle the case when an input sequence does not contain an instance of the motif. Since the binding sites of a TF share a common pattern or motif, a de novo motif discovery tool can be used to locate the motif instance in each binding site. This shared motif can then be used to align the binding sites. Therefore, in this dissertation, we view MEME as a TFBS alignment tool rather than a de novo motif discovery tool. In fact, MEME is employed by the PAZAR database [64] to dynamically align TFBSs and generate PSSMs. It is also used by 5 out of 6 tools compared in [80] for ChIP-seq data analysis. We show that LASAGNA significantly outperforms ClustalW2 (p-value: $1.22 \times 10^{-15}$) and MEME (p-value: $3.55 \times 10^{-15}$).
To scan promoters for new TFBSs based on variable-length known TFBSs, we couple a PSSM method with LASAGNA, denoted by LASAGNA-PSSM. That is, the input variable-length TFBSs are aligned by LASAGNA and a PSSM model is built from the alignment. It is useful to compare an alignment-based TFBS search method to an alignment-free method. Therefore, we further compare LASAGNA-PSSM to SiTaR [24], which accepts variable-length input TFBSs. To the best of our knowledge, SiTaR is the only alignment-free method capable of handling variable-length input TFBSs at the time of writing. Cross-validation results on 90 TFs whose binding sites can be located in their respective genomes indicate that LASAGNA-PSSM is significantly more precise at fixed recall rates (p-value: $2.66 \times 10^{-8}$). The recall-precision curve also shows that our method is consistently more precise at any recall rate and more sensitive at any precision.
Finally, we demonstrate the application of LASAGNA-ChIP to ChIP-seq data using 38 mouse ChIP-seq experiments. We show that, assuming the one-per-sequence model, LASAGNA-ChIP is comparable to MEME in revealing the motif of the ChIPed TF or its cofactor. For both LASAGNA-ChIP and MEME, the ChIPed TF motif was found in 31 experiments, while a cofactor motif was found in 3 experiments. While the two methods differ in the remaining 4 experiments, the motifs found have similar information content and may belong to unknown cofactors.
1.2 Methods
We describe our novel alignment algorithm in this section. LASAGNA utilizes a search module to align a new binding site with a partial alignment. Thus, we introduce the search module followed by the LASAGNA algorithm.
1.2.1 The Search Module
The search module of LASAGNA is a function learned from a (partial) TFBS alignment to score $l$-mers. It considers nucleotide pairs in addition to individual nucleotides so as to exploit dependence between positions. We introduce our choice of the search module, the PSSM model described in [63]. We denote it by $\mathrm{PSSM}^a_K(\cdot)$ in this chapter.
Suppose that a PSSM is constructed from aligned sequences of length $l$. The score of letter $u$ at position $i$ is given by

$$M_i(u) = \log \frac{f_i(u)}{f(u)},$$

where $f_i(u)$ is the probability of observing letter $u$ at position $i$ and $f(u)$ is the background probability of seeing letter $u$. Similarly, the score of a pair of letters $(u, v)$ at position $(i, j)$ is given by

$$M_{i,j}(u, v) = \log \frac{f_{i,j}(u, v)}{f(u, v)},$$

where $f_{i,j}(u, v)$ is the probability of observing nucleotide pair $(u, v)$ at position $(i, j)$ and $f(u, v)$ is the background probability of seeing the pair. The score of $s$, a sequence of length $l$, is then

$$\mathrm{PSSM}_K(s) = \sum_{i=1}^{l} M_i(s_i) + \sum_{k=1}^{K} \sum_{i=1}^{l-k} M_{i,j}(s_i, s_j), \qquad (1.2.1)$$

where $s_i$ denotes the $i$th letter of $s$, $j = i + k$ and $K$ is the scope parameter defined in [63]. The parameter $K$ controls how far apart a pair of nucleotides can be. When $K = 1$, only adjacent nucleotide pairs are scored. We define $\mathrm{PSSM}_0(s) = \sum_{i=1}^{l} M_i(s_i)$, that is, we do not score nucleotide pairs when $K = 0$.
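Equation (1.2.1) can be illustrated with a short Python sketch. It is written under several assumptions that are not taken from the text: a uniform background ($f(u) = 1/4$, $f(u, v) = 1/16$), a pseudocount of 1 when estimating $f_i(u)$ and $f_{i,j}(u, v)$, gap-free training sites, and the hypothetical function names `train_pssm_k` and `pssm_k_score`. Positions are 0-based in the code, so the loop `range(l - k)` corresponds to $i = 1, \dots, l - k$.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def train_pssm_k(sites, K, pseudocount=1.0):
    """Estimate M_i(u) for every position and M_{i,j}(u, v) for every
    pair of positions at most K apart, from aligned gap-free sites."""
    l, n = len(sites[0]), len(sites)
    single = []
    for i in range(l):
        counts = Counter(s[i] for s in sites)
        single.append({u: math.log(((counts[u] + pseudocount)
                                    / (n + 4 * pseudocount)) / 0.25)
                       for u in ALPHABET})
    pair = {}
    for k in range(1, K + 1):
        for i in range(l - k):
            j = i + k
            counts = Counter((s[i], s[j]) for s in sites)
            pair[(i, j)] = {(u, v): math.log(((counts[(u, v)] + pseudocount)
                                              / (n + 16 * pseudocount)) / (1 / 16))
                            for u in ALPHABET for v in ALPHABET}
    return single, pair


def pssm_k_score(model, s, K):
    """PSSM_K(s): single-letter scores plus pair scores with j = i + k."""
    single, pair = model
    l = len(single)
    total = sum(single[i][s[i]] for i in range(l))
    for k in range(1, K + 1):
        for i in range(l - k):
            total += pair[(i, i + k)][(s[i], s[i + k])]
    return total
```

With `K = 0` the inner loops are skipped and the score reduces to the ordinary PSSM score, matching the definition of $\mathrm{PSSM}_0$.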
Our search module is a variant of (1.2.1). Let

$$M'_i(u) = \begin{cases} \min_x M_i(x) & \text{if } u \text{ is the gap letter} \\ M_i(u) & \text{otherwise} \end{cases}$$

and

$$M'_{i,j}(u, v) = \begin{cases} \min_{x,y} M_{i,j}(x, y) & \text{if } u \text{ or } v \text{ is the gap letter} \\ M_{i,j}(u, v) & \text{otherwise.} \end{cases}$$

The search module is defined as follows:

$$\mathrm{PSSM}^a_K(s) = \sum_{i=1}^{l} M'_i(s_i) + \sum_{k=1}^{K} \sum_{i=1}^{l-k} M'_{i,j}(s_i, s_j), \qquad (1.2.2)$$

where superscript $a$ denotes alignment as this module is used in our alignment algorithm.
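The gap handling in (1.2.2) amounts to substituting the worst available score whenever a gap letter is involved. A minimal sketch follows, assuming the score tables are supplied as Python dictionaries (one per position for $M_i$, one per position pair for $M_{i,j}$) and that `-` is the gap letter; these representations and the function names are illustrative assumptions.

```python
GAP = "-"


def gap_aware_single(column, u):
    """M'_i(u): the minimum score of the column for a gap, else M_i(u)."""
    return min(column.values()) if u == GAP else column[u]


def gap_aware_pair(table, u, v):
    """M'_{i,j}(u, v): the minimum pair score if either letter is a gap."""
    return min(table.values()) if GAP in (u, v) else table[(u, v)]


def pssm_a_score(single, pair, s, K):
    """PSSM^a_K(s) of equation (1.2.2), with 0-based positions."""
    l = len(single)
    total = sum(gap_aware_single(single[i], s[i]) for i in range(l))
    for k in range(1, K + 1):
        for i in range(l - k):
            total += gap_aware_pair(pair[(i, i + k)], s[i], s[i + k])
    return total


# Toy tables for l = 2 and K = 1 (values chosen only for illustration).
single = [{"A": 1.0, "C": -1.0, "G": -2.0, "T": 0.0},
          {"A": 0.5, "C": 0.0, "G": -0.5, "T": -1.5}]
pair = {(0, 1): {(u, v): 0.0 for u in "ACGT" for v in "ACGT"}}
pair[(0, 1)][("A", "A")] = 2.0
pair[(0, 1)][("G", "T")] = -3.0
```

Penalizing a gap with the minimum score means that a window hanging off the end of a site is never rewarded for the missing letters, which discourages placements that overlap the alignment only partially.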
1.2.2 The LASAGNA Algorithm
The algorithm is based on the idea that the binding sites of a TF share a common core, a conserved short DNA sequence. A binding site can then be seen as a core with a few irrelevant bases on one or both sides. Assuming that each binding site fully contains the core, the shorter a binding site, the fewer irrelevant bases it contains. Therefore, we progressively align the binding sites by aligning the shortest binding site with the already aligned binding sites until all the binding sites are aligned.
The algorithm takes a set of unaligned binding sites, $U$, and parameter $K_a$ as inputs. Let $A$ denote the set of aligned binding sites. A binding site in $A$ may have gap letters added to one or both ends as a result of the alignment. The algorithm works as follows:

(i) Initialize $A$ to $\{s\}$, where $s$, the seed site, is a shortest binding site arbitrarily chosen from $U$. Remove $s$ from $U$.

(ii) (a) Build $\mathrm{PSSM}^a_{K_a}(\cdot)$ from $A$. Let the length of this PSSM be $l$.

     (b) Remove the shortest binding site $s$ from $U$.

     (c) Create $S$, the augmented sequence of $s$, by adding $l - 1$ gap letters to both ends of $s$.

     (d) Score each $l$-mer of $S$ by $\mathrm{PSSM}^a_{K_a}(\cdot)$ to find the highest-scoring one.

     (e) Repeat (c) and (d) with the reverse-complement of $s$. That is, the opposite strand is also considered.

(iii) Add $s$ to $A$ if the highest-scoring $l$-mer over both strands resides in $s$. Otherwise, add its reverse-complement to $A$. Gap letters are added to one or both ends of sequences in $A$. This ensures that they are all of the same length and each column of the alignment has at least one non-gap letter.

(iv) Repeat (ii) and (iii) until $U$ is empty.
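The steps above can be sketched as follows for the special case $K_a = 0$. This is a simplified illustration under stated assumptions, not the reference implementation: gaps are skipped when counting letters, the background is uniform, a pseudocount of 1 is used, ties among equally short sites are broken by their best score, and all function names are hypothetical.

```python
import math

ALPHABET, GAP = "ACGT", "-"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    return s.translate(COMPLEMENT)[::-1]


def build_pssm(aligned, pseudocount=1.0):
    """Log-odds columns from gapped rows; gaps are ignored when counting."""
    pssm = []
    for i in range(len(aligned[0])):
        letters = [row[i] for row in aligned if row[i] != GAP]
        total = len(letters) + 4 * pseudocount
        pssm.append({u: math.log(((letters.count(u) + pseudocount) / total) / 0.25)
                     for u in ALPHABET})
    return pssm


def score(pssm, lmer):
    """PSSM^a with K_a = 0: a gap receives the minimum score of its column."""
    return sum(min(col.values()) if u == GAP else col[u]
               for col, u in zip(pssm, lmer))


def best_placement(pssm, site):
    """Steps (c) to (e): scan both strands of the gap-augmented site.
    Returns (score, offset, oriented_site), where offset is the alignment
    column at which the oriented site starts (it may be negative)."""
    l = len(pssm)
    best = None
    for oriented in (site, reverse_complement(site)):
        padded = GAP * (l - 1) + oriented + GAP * (l - 1)
        for start in range(len(padded) - l + 1):
            sc = score(pssm, padded[start:start + l])
            if best is None or sc > best[0]:
                best = (sc, (l - 1) - start, oriented)
    return best


def lasagna(sites):
    unaligned = sorted(sites, key=len)
    aligned = [unaligned.pop(0)]                      # step (i): seed site
    while unaligned:                                  # step (iv)
        pssm = build_pssm(aligned)                    # step (ii)(a)
        shortest = len(unaligned[0])
        candidates = [s for s in unaligned if len(s) == shortest]
        placements = [(best_placement(pssm, s), s) for s in candidates]
        (_, offset, oriented), chosen = max(placements, key=lambda p: p[0][0])
        unaligned.remove(chosen)                      # step (ii)(b)
        left = max(0, -offset)                        # step (iii): pad with gaps
        aligned = [GAP * left + row for row in aligned]
        new_row = GAP * (offset + left) + oriented
        width = max(len(aligned[0]), len(new_row))
        aligned = [row.ljust(width, GAP) for row in aligned + [new_row]]
    return aligned
```

For example, `lasagna(["GCGCCAAA", "CGCGCCAAA", "TTTGGCGC"])` flips the third site to its reverse-complement and pads the two shorter rows with one gap on the left so that all three rows share the same columns.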
In step (ii)(b), there may be more than one shortest binding site in $U$. To break the tie, we use $\mathrm{PSSM}^a_{K_a}(\cdot)$ to scan each of the shortest ones. The site $s$ containing the highest-scoring $l$-mer is removed from $U$ to align with sequences in $A$. In the unlikely case of two or more shortest binding sites in $U$ sharing the same highest score, one is arbitrarily chosen. Figure 1.2.1 illustrates an iteration of the algorithm.

[Figure 1.2.1: An Illustration of LASAGNA with $K_a = 0$. Panels: (a) the current alignment $A$ and the unaligned sites $U$; (b) the PSSM built from $A$; (c) the alignment after the shortest site in $U$ has been added to $A$.]
An alignment may be trimmed before building a PSSM. We describe one way of trimming aligned TF binding sites using two simple measures. Let l be the length of the aligned binding sites. We first compute and denote the percentage of non-gap letters at position i of the alignment by Ci, for i = 1, 2,..., l. The information content (IC) at each position is then computed with small sample correction described in [73]. That is,
\[ \mathrm{IC}_i = \max\Big\{0,\; 2 + \sum_{u \in \{A, C, G, T\}} f_i(u) \log_2 f_i(u) - \hat{e}(n_i)\Big\}, \]
where i ∈ {1, 2, ..., l}, ni is the number of non-gap letters at position i and ê(·) gives the approximated sampling error. Let Cmin and ICmin be the cutoff thresholds. The alignment is examined from the left end to the right until the
first position j satisfying both Cj > Cmin and ICj > ICmin is encountered. The positions preceding j are trimmed off. The trimming is similarly applied to the right end.
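The trimming rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the alignment is a list of equal-length strings with "-" as the gap letter, f_i(u) is taken to be the frequency of u among the non-gap letters of column i, and the sampling error is approximated by ê(n) = 3 / (2 ln 2 · n), a commonly used small-sample correction for DNA that may differ from the exact form in [73]. The function names are hypothetical.

```python
import math

GAP = "-"
BASES = "ACGT"

def column_stats(alignment, i):
    """Return (C_i, IC_i): non-gap fraction and corrected IC of column i."""
    column = [seq[i] for seq in alignment]
    letters = [c for c in column if c != GAP]
    n_i = len(letters)
    if n_i == 0:
        return 0.0, 0.0
    coverage = n_i / len(column)
    # Sum of f * log2(f) over the four bases (zero-frequency terms contribute 0).
    plogp = 0.0
    for u in BASES:
        f = letters.count(u) / n_i
        if f > 0:
            plogp += f * math.log2(f)
    # Assumed small-sample correction: e(n) ~= (4 - 1) / (2 * ln 2 * n).
    e_hat = 3.0 / (2.0 * math.log(2) * n_i)
    return coverage, max(0.0, 2.0 + plogp - e_hat)

def trim(alignment, c_min, ic_min):
    """Drop columns before the first and after the last column that
    satisfies both C_j > c_min and IC_j > ic_min."""
    good = []
    for j in range(len(alignment[0])):
        coverage, ic = column_stats(alignment, j)
        if coverage > c_min and ic > ic_min:
            good.append(j)
    if not good:
        return ["" for _ in alignment]
    return [seq[good[0]:good[-1] + 1] for seq in alignment]

toy_alignment = ["--ACGT-", "-AACGT-", "--ACGTA", "--ACGT-"]
trimmed = trim(toy_alignment, c_min=0.4, ic_min=0.0)
```

With the toy alignment, the sparsely covered columns on both ends are removed and only the four fully covered columns remain.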
1.2.3 LASAGNA for ChIP-seq Data
Although LASAGNA is not specifically designed as a de novo motif discovery algorithm, a more sophisticated version, named LASAGNA-ChIP, is capable of handling ChIP-seq data. Here, we refer to the previous section and describe the additional steps that are necessary for aligning ChIP-seq peak sequences. The flowchart in Figure 1.2.2 gives an overview of LASAGNA-ChIP.
Before aligning ChIP-seq peak sequences, each sequence is clipped to 100 bp surrounding the signal peak. This is a common practice since, for most peak sequences (> 90%), the actual TFBS is usually found within 50 bp of the called peak [38]. In step iia, we trim the partial alignment A if it contains more than two sequences. Unlike TFBSs found in databases such as TRANSFAC, a peak sequence contains many more irrelevant bases flanking the core, even after clipping. The trimming procedure described in the previous section is used, where Cmin (ICmin) is set to the mean Ci (ICi) over all the columns of A. The resulting alignment is further trimmed by IC such that it has at most 15 columns and the columns on both ends have positive IC. In step iib, if there are more than 5 shortest binding sites in U, only 5 are arbitrarily chosen to break the tie by similarity to PSSM^a_{K_a}(·).

Figure 1.2.2: LASAGNA-ChIP Flowchart
Overview of the LASAGNA-ChIP algorithm: ChIP-seq peak sequences are clipped to 100 bp around the signal peak; six seed sites (the shortest site and arbitrary sites 1 to 5) are each aligned by LASAGNA with trimming to give initial alignments 0 to 5; the refinement process turns these into refined alignments 0 to 5; the alignment with the highest IC is selected.
The alignment A obtained by the modified procedure is further refined as follows:
(i) Set T to A trimmed to l columns as described above.
(ii) Build PSSM^a_{K_a}(·) out of T.
(iii) Initialize R, the refined partial alignment, to {}.
(iv) For each peak sequence s,
(a) Create S, the augmented sequence of s, by adding l − 1 gap letters to both ends of s.
(b) Score each l-mer of S by PSSM^a_{K_a}(·) to find the highest scoring one.
(c) Let s be its reverse-complement and repeat a–b.
(d) Add s to R if the highest scoring l-mer resides in s. Otherwise, add
its reverse-complement to R. Gap letters are added to one or both
ends of sequences in R.
(v) Set A to R and repeat i–v until the sum of IC across columns of T does
not change in 3 iterations.
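Step iv of the refinement can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes a position-independent PSSM (the Ka = 0 case) stored as a list of per-column score dictionaries, assumes gap letters score 0, and breaks strand ties in favor of the given strand. All function names are hypothetical.

```python
GAP = "-"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse-complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def best_window(pssm, s):
    """Return (score, offset) of the best l-mer in the gap-augmented s.
    offset is the window start relative to the first base of s."""
    l = len(pssm)
    S = GAP * (l - 1) + s + GAP * (l - 1)
    best = None
    for i in range(len(S) - l + 1):
        score = sum(col.get(ch, 0.0) for col, ch in zip(pssm, S[i:i + l]))
        if best is None or score > best[0]:
            best = (score, i - (l - 1))
    return best

def realign(pssm, sequences):
    """Orient each sequence by its best-scoring strand, then pad with gap
    letters so that the best windows occupy the same columns."""
    placed = []
    for s in sequences:
        fwd = best_window(pssm, s)
        rev = best_window(pssm, revcomp(s))
        placed.append((s, fwd[1]) if fwd[0] >= rev[0] else (revcomp(s), rev[1]))
    left = max(offset for _, offset in placed)
    rows = [GAP * (left - offset) + seq for seq, offset in placed]
    width = max(len(row) for row in rows)
    return [row + GAP * (width - len(row)) for row in rows]

# Toy PSSM rewarding the motif ACG (+1 per match, -1 per mismatch).
toy_pssm = [{b: (1.0 if b == m else -1.0) for b in "ACGT"} for m in "ACG"]
refined = realign(toy_pssm, ["TTACGT", "ACGA", "CCGTAA"])
```

In the toy example the third sequence is added as its reverse-complement because the motif scores higher on the opposite strand.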
For ChIP-seq peak sequences, the shortest sequence may miss or contain only a fraction of the core. Hence, using the shortest sequence as the seed site sometimes results in an alignment with less IC. For this reason, five additional sequences are arbitrarily chosen as the seed site to produce 5 additional alignments. Among the 6 alignments, the one with the most IC after trimming is chosen as the final alignment.
1.2.4 Scoring a Putative Binding Site
Although a PSSM suggests the length of a putative binding site, we do not restrict the length of a candidate binding site to the length of the PSSM. A putative binding site could be of any reasonable length. If a true binding site is flanked by a few irrelevant bases, this sequence should be given a
relatively high score compared to those of non-binding sites. Therefore, to score a putative binding site s, we slide s through the PSSM as described in the previous section. The score of sequence s is given by
\[ \mathrm{Score}_{K_s}(s) = \max_{i \in \{1, 2, \ldots, l + l_s - 1\}} \mathrm{PSSM}_{K_s}\big(S_{i:(i+l-1)}\big), \tag{1.2.1} \]

where l is the length of the PSSM, l_s is the length of s, S denotes the augmented sequence of s with l − 1 gap letters on both ends and PSSM_{K_s}(·) is defined in (1.2.1).
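As a rough illustration of Eq. 1.2.1, the sketch below slides a sequence through a PSSM after padding it with l − 1 gap letters on both ends. It assumes the simplest case of a position-independent PSSM (Ks = 0) represented as a list of per-column score dictionaries, and it assumes that a gap letter contributes 0 to the score; the names `pssm_score` and `score_site` are hypothetical.

```python
GAP = "-"

def pssm_score(pssm, lmer):
    """Score one l-mer; gap letters contribute 0 (an assumption)."""
    return sum(col.get(ch, 0.0) for col, ch in zip(pssm, lmer))

def augmented_windows(s, l):
    """All l-mers of s padded with l - 1 gap letters on both ends."""
    S = GAP * (l - 1) + s + GAP * (l - 1)
    return [S[i:i + l] for i in range(len(S) - l + 1)]

def score_site(pssm, s):
    """Maximum PSSM score over the l + l_s - 1 windows of Eq. 1.2.1."""
    return max(pssm_score(pssm, w) for w in augmented_windows(s, len(pssm)))

# Toy PSSM rewarding the motif ACG (+1 per match, -1 per mismatch).
toy_pssm = [{b: (1.0 if b == m else -1.0) for b in "ACGT"} for m in "ACG"]
long_site_score = score_site(toy_pssm, "TTACGTT")   # motif with flanking bases
short_site_score = score_site(toy_pssm, "CG")       # site shorter than the PSSM
```

Because of the padding, a candidate shorter than the PSSM can still be scored against the columns it covers, and a candidate with a few flanking bases receives the score of its best window.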
1.3 Results and Discussion
1.3.1 Comparison of Alignment Algorithms
1.3.1.1 Data sets
We downloaded all the TF binding sites from the TRANSFAC Public database (release 7.0). The binding sites were grouped by species and TF. Binding sites having fewer than 4 nucleotides were discarded. TFs of each species were filtered such that each TF has at least 10 binding sites. This ensures that each TF has enough binding sites to construct a PSSM. The numbers of TFs and TFBSs are listed in Table 1.3.1.
To facilitate experiments, we planted each TFBS in a 2000 base random sequence simulated by a first-order Markov chain of the species in question.
Table 1.3.1: TFBSs in TRANSFAC Public Database by Species

Species                    # TFs¹   # TFBSs²
Homo sapiens               68       1984
Mus musculus               53       966
Rattus norvegicus          26       633
Drosophila melanogaster    29       935
Saccharomyces cerevisiae   13       253
Overall                    189      4771

¹ The total number of TFs
² The total number of TFBSs
Except for Saccharomyces cerevisiae, the Markov chain of a species was learned from promoter sequences in the UCSC Genome Browser database
[19]. For Saccharomyces cerevisiae, the promoter sequences were retrieved from the SCPD [93] using the yeast gene list in euGenes [30].
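The background simulation and planting step could look roughly like the sketch below. It is an illustration under assumptions: transition counts are estimated directly from example sequences without smoothing, and a site is planted by overwriting a random window of the background (the text does not say whether the site overwrites or is inserted). The function names and the tiny training set are hypothetical.

```python
import random
from collections import defaultdict

def learn_first_order(sequences):
    """Count initial letters and letter-to-letter transitions."""
    init = defaultdict(int)
    trans = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        if not seq:
            continue
        init[seq[0]] += 1
        for a, b in zip(seq, seq[1:]):
            trans[a][b] += 1
    return init, trans

def _draw(counts, rng):
    """Sample one letter with probability proportional to its count."""
    letters = sorted(counts)
    return rng.choices(letters, weights=[counts[x] for x in letters], k=1)[0]

def simulate(init, trans, length, rng):
    """Generate a sequence from the first-order Markov chain."""
    seq = [_draw(init, rng)]
    while len(seq) < length:
        successors = trans.get(seq[-1])
        # Fall back to the initial distribution if no successor was observed.
        seq.append(_draw(successors if successors else init, rng))
    return "".join(seq)

def plant(background, site, rng):
    """Overwrite a random window of the background with the site."""
    start = rng.randrange(len(background) - len(site) + 1)
    return background[:start] + site + background[start + len(site):], start

rng = random.Random(0)
init, trans = learn_first_order(["ACGTACGT", "TTGACA"])  # stand-in promoters
background = simulate(init, trans, 2000, rng)
planted, position = plant(background, "GCGCCAAA", rng)
```

In practice the chain would be learned from the promoter sequences of the species in question rather than from the two toy strings used here.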
1.3.1.2 Performance assessment and evaluation metrics
Since the purpose of aligning TFBSs is to construct a PSSM, the quality of an alignment is best measured by the search performance of the PSSM. The performance of a TFBS search method is evaluated by ν-fold CV. Consider a TF with n binding sites. The n TFBSs are first divided into ν sets, each of which contains ⌊n/ν⌋ or ⌊n/ν⌋ + 1 TFBSs. At each iteration of the ν-fold CV, one of the ν TFBS sets, called the test TFBS set Ptest, is left out. The rest of the TFBSs are aligned to build a PSSM. Each test TFBS in Ptest is then planted in a 2000 base random sequence and scanned by the PSSM, scoring each l-mer, where l is the length of the test TFBS. We score both the forward and reverse strands of an l-mer and assign the higher score to it. An l-mer is considered a hit if it shares more than ⌊l/2⌋ bases with the test TFBS. The l-mers can then be divided into two sets, H and N, where H is the set of hits and N is considered the set of non-binding sites. The score of the test TFBS is the highest score of hits in H. For each test TFBS t ∈ Ptest, we find its rank relative to all the non-binding sites in N. Formally, the rank of binding site t equals 1 + |{s ∈ N | Score_{K_s}(s) ≥ Score_{K_s}(t)}|.
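The ranking procedure can be sketched as below. This is an illustration only: `score_fn` stands in for whatever PSSM scoring function is used, "shares more than ⌊l/2⌋ bases" is interpreted as positional overlap with the planted site, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse-complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def rank_test_site(sequence, site_start, l, score_fn):
    """Rank of a planted test TFBS among the non-binding l-mers.

    Each l-mer gets the higher of its forward and reverse-strand scores.
    l-mers overlapping the planted site by more than floor(l / 2) bases
    are hits (set H); all other l-mers are non-binding sites (set N).
    Assumes site_start is a valid position, so H is never empty.
    """
    hits, non_sites = [], []
    for i in range(len(sequence) - l + 1):
        lmer = sequence[i:i + l]
        score = max(score_fn(lmer), score_fn(revcomp(lmer)))
        overlap = max(0, l - abs(i - site_start))
        (hits if overlap > l // 2 else non_sites).append(score)
    site_score = max(hits)
    return 1 + sum(1 for s in non_sites if s >= site_score)

# Toy scoring function: number of positions matching the motif ACGT.
def toy_score(lmer):
    return sum(a == b for a, b in zip(lmer, "ACGT"))

toy_sequence = "TTTTACGTTTTT"          # ACGT planted at position 4
best_rank = rank_test_site(toy_sequence, 4, 4, toy_score)
worst_rank = rank_test_site(toy_sequence, 4, 4, lambda lmer: 0)
```

A rank of 1 means no non-binding l-mer scores at least as high as the test site; with an uninformative scoring function every non-binding l-mer ties with the site and the rank is one more than the size of N.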
After the ν-fold CV, we end up with n ranks, each of which corresponds to a
TFBS. We use the area under the ROC curve (AUC) to gauge the quality of alignment. The ROC curve is a plot of true positive rate (TPR) against false positive rate (FPR), displaying the trade-off between TPR and FPR. We refer readers to [23] for an introduction to this metric. In this study, ν = 10 for all the CV experiments.
1.3.1.3 Comparison with ClustalW2
In general, gapless alignment is preferred over gapped alignment for aligning TFBSs. Because of the nature of ClustalW2, the alignment of TFBSs may contain gaps in the middle of some binding sites. This is disadvantageous to ClustalW2 as the PSSM method does not allow insertion of gaps into the sequence being scanned. Hence, we turned off gaps by setting the gap opening penalty parameters to a large value, i.e., we set both GAPOPEN and PWGAPOPEN to 100000. Indeed, results indicated that overall the “gapped” ClustalW2 performs slightly worse than the “gapless” variant (p-value: 0.277). For both LASAGNA and ClustalW2, parameter Ks in Eq. 1.2.1 was searched from 0 to min{10, lmin − 1} for each TF and the value producing the highest AUC was used, where lmin is the minimal length of the TFBSs. For LASAGNA, parameter Ka of the LASAGNA algorithm was set to Ks as the two parameters are closely related.
We conducted 10-fold CV on each TF. The overall ROC curves are shown in Figure 1.3.1. The ROC curves are based on the ranks of 4771 TFBSs of 189 TFs. They show that LASAGNA has an invariably higher true positive rate than ClustalW2. The AUC score was calculated for each TF and for each method. To gauge the significance of the difference, the Wilcoxon signed-rank test [87] was performed for each species. The tests showed that LASAGNA is consistently better than ClustalW2 across the 5 species. Table 1.3.2 shows the test results. Overall, LASAGNA performed significantly better than ClustalW2 in terms of AUC scores. The species-wise p-values show that LASAGNA is significantly better (p-value < 0.05) than ClustalW2 for aligning TFBSs of each of the 5 species.
Figure 1.3.1: Overall ROC Curves for the Three Alignment Algorithms
Two panels plot true positive rate against false positive rate for LASAGNA, ClustalW2 and MEME: one over false positive rates from 0.000 to 0.020 and one over false positive rates from 0.0 to 0.6.

To better understand the results, we split the 189 TFs into two groups. One contains TFs on which LASAGNA performed better than ClustalW2 and the other contains the rest of the TFs. Three factors are examined for each TF: the number of TFBSs, and the mean and standard deviation of TFBS length. For each factor, we looked for a difference between the two groups. Table 1.3.3 shows the comparisons. It can be seen that LASAGNA produces better alignments when a TF has fewer binding sites but the difference is not significant. The mean and standard deviation of TFBS length are the two more important factors. We believe that LASAGNA is well-suited for aligning TFBSs that are longer and more variable in length.
1.3.1.4 Comparison with MEME
The MEME tool in the MEME Suite 4.8.1 was used. The parameter minw, minimal width of motifs, was set to the smaller of 6 and the minimal length of input TFBSs. The option revcomp to search the reverse strand was turned on. Finally, the parameter minsites was set to the number of input TFBSs since a common motif is supposed to appear at least once in each TFBS. To ensure that MEME functions properly, binding sites shorter than 8 bases were padded with gap letters since genomic locations are not available for most TFBSs.

Table 1.3.2: Species-wise and Overall Comparisons between LASAGNA and ClustalW2

Species           # better¹     # ties²   # TFs³   p-value⁴
H. sapiens        54 (79.4%)    0         68       4.42 × 10^-7
M. musculus       42 (79.2%)    0         53       1.41 × 10^-5
D. melanogaster   22 (75.9%)    0         29       9.89 × 10^-4
S. cerevisiae     9 (69.2%)     1         13       3.88 × 10^-2
R. norvegicus     20 (76.9%)    1         26       1.54 × 10^-3
Overall           147 (77.8%)   2         189      1.22 × 10^-15

¹ Number of TFs on which LASAGNA performs better than ClustalW2.
² Number of TFs on which LASAGNA and ClustalW2 have the same performance.
³ Total number of TFs for a species.
⁴ Wilcoxon signed-rank test p-value.
The experiments were carried out in the same manner as the ClustalW2 experiments. The overall ROC curve in Figure 1.3.1 indicates that LASAGNA has consistently higher true positive rates than MEME across different false positive rates. The overall and species-wise comparisons between LASAGNA and MEME in Table 1.3.4 show that LASAGNA performed significantly better than MEME. To gain some insights into the difference between LASAGNA and MEME, we similarly examined the three factors used to compare LASAGNA and ClustalW2. As seen in Table 1.3.5, the number of input TFBSs is the only significant (p-value < 0.05) factor out of the three. The reasons are not clear but may be investigated in the future. Moreover, it will be helpful to identify other (biologically meaningful) factors that can better explain the performance difference.

Table 1.3.3: Comparison of Two Groups of TFs Divided According to Results on LASAGNA and ClustalW2

Factor                  Group 1¹ mean   Group 2² mean   p-value³
# TFBSs⁴                25.07483        25.83333        0.1409
Mean of TFBS length     18.78626        17.56167        0.08451
SD of TFBS length⁵      8.180204        6.921905        0.06295

¹ LASAGNA performed better than ClustalW2 on TFs in this group.
² ClustalW2 performed better than or equal to LASAGNA on TFs in this group.
³ Wilcoxon signed-rank test p-value.
⁴ Number of binding sites for each TF.
⁵ Standard deviation of binding site length for each TF.
1.3.1.5 Distribution of Ks
In Figures 1.3.2, 1.3.3 and 1.3.4, for LASAGNA, ClustalW2 and MEME, we show the distribution of Ks for a TF by species and conserved domain. Overall, we observe that small values are preferred for all three methods. By visual inspection, LASAGNA appears more similar to MEME than ClustalW2 in the usage of Ks. It can be seen that the usage of Ks differs among different conserved domains. Related conserved domains such as ZF-H2C2_2 and ZF-C2H2 display similar patterns. This is not surprising as conserved domains in a protein are often computationally predicted. Hence, a protein is likely to possess related conserved domains. While overall the distributions seem method-dependent, we observe that, for ZF-H2C2_2 and ZF-C2H2, the distributions center around 4 across all three methods. Finally, we note that these observations are preliminary and more TFs are needed to draw statistically sound conclusions.

Table 1.3.4: Species-wise and Overall Comparisons between LASAGNA and MEME

Species           # better¹     # ties²   # TFs³   p-value⁴
H. sapiens        41 (60.3%)    0         68       7.83 × 10^-3
M. musculus       41 (77.4%)    0         53       8.79 × 10^-6
D. melanogaster   26 (89.7%)    0         29       1.02 × 10^-7
S. cerevisiae     10 (76.9%)    3         13       2.96 × 10^-3
R. norvegicus     23 (88.5%)    1         26       1.73 × 10^-4
Overall           141 (74.6%)   4         189      3.55 × 10^-15

¹ Number of TFs on which LASAGNA performs better than MEME.
² Number of TFs on which LASAGNA and MEME have the same performance.
³ Total number of TFs for a species.
⁴ Wilcoxon signed-rank test p-value.
1.3.2 Comparison of TFBS Search Methods
1.3.2.1 Data sets
To compare with an alignment-free TFBS search method, SiTaR [24], we retrieved real promoter sequences embedding TFBSs. Specifically, we followed the curated location of each binding site in the TRANSFAC Public database (release 7.0) to retrieve the 1000-base sequences flanking the binding site. We discarded binding sites that cannot be found in the proximity of the curated locations. The retrieved binding sites were grouped by TF and TFs having fewer than 10 binding sites were removed. After filtering, we ended up with 90 TFs and 1751 binding sites. A TF may be present in more than one species as homologs and hence the binding sites of a TF may be located in genomes of multiple species. The species and respective numbers of binding sites are shown in Table 1.3.6.

Table 1.3.5: Comparison of Two Groups of TFs Divided According to Results on LASAGNA and MEME

Factor                  Group 1¹ mean   Group 2² mean   p-value³
# TFBSs⁴                23.33333        30.85417        0.03196
Mean of TFBS length     18.33468        19.04125        0.3007
SD of TFBS length⁵      7.95844         7.730625        0.1846

¹ LASAGNA performed better than MEME on TFs in this group.
² MEME performed better than or equal to LASAGNA on TFs in this group.
³ Wilcoxon signed-rank test p-value.
⁴ Number of binding sites for each TF.
⁵ Standard deviation of binding site length for each TF.
1.3.2.2 Performance assessment and evaluation metrics
To compare with SiTaR [24], we adopt the same ν-fold CV process used to compare LASAGNA with ClustalW2 and MEME. However, we do not assume that a TFBS search method scores all the l-mers in a promoter sequence, where l is the length of binding sites. Instead, a TFBS search method scans a promoter sequence and predicts a list of binding sites with respective scores. The predicted binding sites may be of different lengths, which is the case for SiTaR.

Table 1.3.6: Distribution of the 1751 Binding Sites of 90 TFs in TRANSFAC Public Database

Species                    # TFBSs¹
Homo sapiens               735
Mus musculus               346
Rattus norvegicus          278
Saccharomyces cerevisiae   158
Drosophila melanogaster    155
Gallus gallus              73
Bos taurus                 5
Sus scrofa                 1

¹ Total number of TFBSs.

We describe how a hit is determined. Let the length of a predicted binding site be l and the length of the test TFBS in question be l_s. The predicted binding site is considered a hit to the test TFBS if the overlap between the two sequences is more than ⌊l_s/2⌋ bases as in [24]. In case this is not possible, i.e., l ≤ ⌊l_s/2⌋, the predicted binding site must be embedded in the true one to be deemed a hit.
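The hit rule for variable-length predictions can be written compactly. The sketch below assumes sites are given as 0-based start positions plus lengths on the same sequence; the function name `is_hit` is hypothetical.

```python
def is_hit(pred_start, pred_len, true_start, true_len):
    """Decide whether a predicted site is a hit to the test TFBS.

    A prediction longer than floor(true_len / 2) must overlap the test
    TFBS by more than floor(true_len / 2) bases. A shorter prediction
    cannot reach that overlap, so it must lie entirely inside the TFBS.
    """
    pred_end = pred_start + pred_len
    true_end = true_start + true_len
    overlap = max(0, min(pred_end, true_end) - max(pred_start, true_start))
    threshold = true_len // 2
    if pred_len > threshold:
        return overlap > threshold
    return true_start <= pred_start and pred_end <= true_end

overlapping = is_hit(10, 8, 12, 10)     # 6 shared bases, threshold 5
disjoint = is_hit(0, 8, 12, 10)         # no shared bases
embedded_short = is_hit(14, 3, 12, 10)  # short prediction inside the site
straddling_short = is_hit(10, 3, 12, 10)  # short prediction crossing the edge
```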
Using the n ranks of TFBSs from ν-fold CV, we compute recall (true positive rate), precision and the Fβ-measure, where β = 0.5 as in [24]. Let the recall rate be r. The number of TFBSs recalled by the method is p_T = n × r. Let the number of non-binding sites or false positives introduced be p_F. The precision is given by p_T / (p_T + p_F), while

\[ F_\beta = (1 + \beta^2) \times \frac{\text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}. \]
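A small worked example of these formulas follows. The function name is hypothetical, and how the false-positive count is obtained from the ranks is left outside the sketch since it is not spelled out here.

```python
def precision_and_f_beta(n, recall, false_positives, beta=0.5):
    """Return (precision, F_beta) given n TFBSs, a recall rate and the
    number of false positives; p_T = n * recall as in the text."""
    p_t = n * recall
    if p_t + false_positives == 0:
        return 0.0, 0.0
    precision = p_t / (p_t + false_positives)
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return precision, 0.0
    f_beta = (1 + beta ** 2) * precision * recall / denominator
    return precision, f_beta

# 20 TFBSs, half of them recalled, 10 false positives introduced.
precision, f_half = precision_and_f_beta(20, 0.5, 10)
```

With β = 0.5 the measure weights precision more heavily than recall; when precision and recall are equal, Fβ equals that common value.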
1.3.2.3 Comparison with an alignment-free method
We conducted 10-fold CV on the aforementioned 90 TFs. The PSSM method dependent on LASAGNA (LASAGNA-PSSM) was compared to SiTaR [24]. LASAGNA considered both strands of a sequence when aligning binding sites. The parameters Ka = Ks were determined in the same way as in comparing LASAGNA to ClustalW2. An alignment was trimmed with Cmin = 0.4 and ICmin = 0 before constructing a PSSM as described in the method section on the LASAGNA algorithm. The PSSM method uses a cutoff score to predict TFBSs. The cutoff score is set to the minimal score of the constituting binding sites of the PSSM. The SiTaR method has a mismatch parameter and the maximal value allowed by its webtool is 5. We searched in the range from 0 to 5 to find the mismatch value giving the highest Fβ-measure for each TF.
In terms of the Fβ-measure, no significant difference was found between the two methods (p-value: 0.392 [87]). To ensure a fair comparison, we fixed the recall rate for each TF and compared the precision achieved by LASAGNA-PSSM and SiTaR. The recall rate was set to the lower of the recall rates attained by LASAGNA-PSSM and SiTaR. The TF c-Jun (AC: T00132) was excluded from comparison because SiTaR did not recover any TFBS. Figure 1.3.5a shows the plot of precision by LASAGNA-PSSM against that by SiTaR. At fixed recall rates, LASAGNA-PSSM is more precise than SiTaR on 65 out of 89 TFs (p-value: 2.66 × 10^-8). Figure 1.3.5b shows the plots of precision against recall based on all the recalled TFBSs by each method. It can be seen that LASAGNA-PSSM is consistently more precise than SiTaR at the same recall rate. Moreover, LASAGNA-PSSM recovered substantially more TFBSs than SiTaR at the same precision.
Results reported in [24] showed that SiTaR is highly precise and sensitive.
Although SiTaR accepts variable-length binding sites, all the experiments presented in [24] used fixed-length binding sites as inputs. It is therefore not clear how SiTaR performs on TFs having variable-length binding sites. It is also not clear whether SiTaR preprocesses highly variable-length binding sites as this was not stated in [24]. These issues however are not the focus of this dissertation.
1.3.3 Application of LASAGNA-ChIP to ChIP-seq Data
To demonstrate the use of LASAGNA-ChIP on ChIP-seq data, we retrieved mouse ChIP-seq data produced by the Encyclopedia of DNA Elements (ENCODE) project [14] from the UCSC Genome Browser [19]. All the 38 peak files in the Narrow Peaks format that match the pattern ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/wgEncode*Tfbs*Pk* were downloaded on Oct. 12, 2012, where “*” is the wildcard character matching zero or more characters. These files give the signal peak location besides the start and end for each peak and hence the corresponding signal files do not need to be processed by a peak-finding algorithm. Four distinct cell types and 17 distinct target TFs are present in the 38 ChIP-seq experiments. Table 1.3.7 lists, for each ChIP-seq experiment, the cell, target TF, number of peaks as well as the minimum, maximum, mean and standard deviation of peak length. We observe that the peak length varies greatly. The mean peak length can be as long as 1124, while the highest standard deviation is nearly 876.
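Clipping a peak to 100 bp around its signal peak could be done roughly as sketched below. The sketch assumes the standard ENCODE narrowPeak layout of 10 tab-separated columns, in which the last column is the summit offset from the start coordinate (−1 when no summit was called); database dump files may carry an extra leading column that would need to be dropped first. The function name, the midpoint fallback and the decision not to clamp the window to the peak boundaries are assumptions.

```python
def clip_peak(line, half_width=50):
    """Return (chrom, start, end) of the window around the signal peak.

    Assumes 10 tab-separated narrowPeak fields: chrom, chromStart,
    chromEnd, name, score, strand, signalValue, pValue, qValue, peak.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    offset = int(fields[9])
    # Fall back to the peak midpoint when no summit is given (offset -1).
    summit = start + offset if offset >= 0 else (start + end) // 2
    return chrom, max(0, summit - half_width), summit + half_width

with_summit = clip_peak("chr1\t1000\t1600\tpeak1\t500\t.\t12.3\t-1\t-1\t250")
without_summit = clip_peak("chr1\t1000\t1600\tpeak2\t500\t.\t12.3\t-1\t-1\t-1")
```

The returned coordinates can then be used to fetch the corresponding 100 bp sequence from the genome assembly.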
Figure 1.3.2: Distribution of Ks by Species and Conserved Domain (LASAGNA)
Panels (count against K, for K = 0 to 10): Overall, Homo sapiens, Saccharomyces cerevisiae, Mus musculus, Rattus norvegicus, Drosophila melanogaster, HOX, BZIP_1, BRLZ, ZF-C4, BZIP_2, HOLI, ZF-H2C2_2, POU, ZF-C2H2.
Figure 1.3.3: Distribution of Ks by Species and Conserved Domain (ClustalW2)
Panels (count against K, for K = 0 to 10): Overall, Homo sapiens, Saccharomyces cerevisiae, Mus musculus, Rattus norvegicus, Drosophila melanogaster, HOX, BZIP_1, BRLZ, ZF-C4, BZIP_2, HOLI, ZF-H2C2_2, POU, ZF-C2H2.
Figure 1.3.4: Distribution of Ks by Species and Conserved Domain (MEME)
Panels (count against K, for K = 0 to 10): Overall, Homo sapiens, Saccharomyces cerevisiae, Mus musculus, Rattus norvegicus, Drosophila melanogaster, HOX, BZIP_1, BRLZ, ZF-C4, BZIP_2, HOLI, ZF-H2C2_2, POU, ZF-C2H2.
Figure 1.3.5: Comparison of the PSSM Method Dependent on LASAGNA to SiTaR
Panels: (a) LASAGNA-PSSM precision against SiTaR precision, (b) precision against the number of recalled TFBSs for LASAGNA-PSSM and SiTaR.
Table 1.3.7: List of 38 ChIP-seq Experiments

Track                                                      Cell    TF       # Peaks  Min  Max    Mean    Std
wgEncodeSydhTfbsCh12Bhlhe40nb100IggrabPk                   CH12    BHLHE40  50576    1    10148  628.9   630.3
wgEncodeSydhTfbsMelBhlhe40cIggrabPk                        MEL     BHLHE40  11919    1    5609   581.5   584.7
wgEncodeCaltechTfbsC2c12CebpbFCntrl50bE2p60hPcr1xPkRep1    C2C12   CEBPB    7936     100  1397   304.5   101.3
wgEncodeCaltechTfbsC2c12CebpbFCntrl50bPcr1xPkRep1          C2C12   CEBPB    11434    104  1455   349.7   125.3
wgEncodeCaltechTfbsC2c12CebpbFCntrl50bE2p60hPcr1xPkRep2    C2C12   CEBPB    7339     119  1390   351.1   101.6
wgEncodeSydhTfbsMelEts1IggrabPk                            MEL     ETS1     14368    1    9720   818.5   643.4
wgEncodeSydhTfbsCh12Ets1IggrabPk                           CH12    ETS1     41149    1    10484  676.1   652.2
wgEncodeCaltechTfbsC2c12Fosl1sc605FCntrl36bPcr1xPkRep1     C2C12   FOSL1    6507     125  1035   346.5   105.5
wgEncodeSydhTfbsMelGata1Dm2p5dStdPk                        MEL     GATA1    23793    1    4461   461.4   251.0
wgEncodeSydhTfbsMelGata1IggratPk                           MEL     GATA1    60511    1    5608   473.6   362.4
wgEncodeSydhTfbsCh12CjunIggrabPk                           CH12    JUN      16619    1    6938   563.9   396.6
wgEncodeSydhTfbsMelJundIggrabPk                            MEL     JUND     1638     2    4946   563.1   409.0
wgEncodeSydhTfbsCh12JundIggrabPk                           CH12    JUND     14821    1    5473   585.3   463.2
wgEncodeSydhTfbsCh12Mafkab50322IggrabPk                    CH12    MAFK     38914    1    8918   423.0   434.7
wgEncodeSydhTfbsMelMafkab50322IggrabPk                     MEL     MAFK     11369    1    6308   386.3   415.9
wgEncodeSydhTfbsMelMafkDm2p5dStdPk                         MEL     MAFK     1191     28   9421   466.1   370.1
wgEncodeSydhTfbsEse14MafkStdPk                             ES-E14  MAFK     14597    16   6152   390.3   217.0
wgEncodeSydhTfbsMelMaxIggrabPk                             MEL     MAX      10079    1    7565   899.0   816.9
wgEncodeSydhTfbsCh12MaxIggrabPk                            CH12    MAX      31133    1    9066   741.0   693.3
wgEncodeCaltechTfbsC2c12MaxFCntrl50bPcr1xPkRep1            C2C12   MAX      8930     103  2242   674.4   262.9
wgEncodeCaltechTfbsC2c12MaxFCntrl50bE2p60hPcr1xPkRep1      C2C12   MAX      2085     109  1390   464.3   149.4
wgEncodeSydhTfbsMelCmybh141IggrabPk                        MEL     MYB      4019     1    6956   751.8   618.8
wgEncodeSydhTfbsMelCmybsc7874IggrabPk                      MEL     MYB      4019     1    6956   751.8   618.8
wgEncodeSydhTfbsCh12CmycIggrabPk                           CH12    MYC      19356    1    9735   896.7   734.7
wgEncodeSydhTfbsMelCmycIggrabPk                            MEL     MYC      6297     1    10242  1124.5  876.5
wgEncodeCaltechTfbsC2c12Sc32758FCntrl50bE2p7dPcr1xPkRep1   C2C12   MYOD1    3247     103  1019   380.6   115.4
wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bE2p24hPcr2xPkRep1  C2C12   MYOD1    20160    87   2338   386.3   159.5
wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bPcr2xPkRep1        C2C12   MYOD1    13040    92   1997   474.0   199.9
wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bE2p60hPcr2xPkRep1  C2C12   MYOD1    29226    67   5176   504.8   265.0
wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bE2p60hPcr2xPkRep1  C2C12   MYOG     32173    66   3463   504.0   257.4
wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bE2p24hPcr2xPkRep1  C2C12   MYOG     24360    65   1968   339.8   156.1
wgEncodeCaltechTfbsC2c12Sc12732FCntrl50bE2p7dPcr1xPkRep1   C2C12   MYOG     6885     118  1515   423.8   124.9
wgEncodeCaltechTfbsC2c12SrfFCntrl32bE2p24hPcr2xPkRep1      C2C12   SRF      2610     177  2672   549.4   237.5
wgEncodeSydhTfbsMelTbpIggmusPk                             MEL     TBP      25442    1    10597  795.6   705.6
wgEncodeSydhTfbsCh12TbpIggmusPk                            CH12    TBP      17492    1    9544   779.4   567.9
wgEncodeCaltechTfbsC2c12Tcf3FCntrl32bE2p5dPcr2xPkRep1      C2C12   TCF3     10172    65   2026   324.4   136.7
wgEncodeCaltechTfbsC2c12Usf1FCntrl50bE2p60hPcr1xPkRep1     C2C12   USF1     6742     100  1165   359.5   120.0
wgEncodeCaltechTfbsC2c12Usf1FCntrl50bPcr1xPkRep1           C2C12   USF1     8579     108  1893   415.2   128.2

It is useful to know if LASAGNA-ChIP is able to align peak sequences and reveal the motif of the ChIPed TF. To align peak sequences, parameter
Ka was searched from 0 to 8 to obtain the alignment with the highest IC.
MEME was also used to align peak sequences because it is often the method of choice. In fact, MEME is used by 5 out of 6 tools compared in [80] for ChIP-seq data analysis. The MEME parameters are described in Section 1.3.1 (Comparison of Alignment Algorithms), where the one-per-sequence model is assumed. To ensure that both methods finish within reasonable time, for each experiment, we randomly sampled 300 peaks for alignment. We did not distinguish large peaks from small ones because ChIP-seq experiments require large numbers of cells and hence “a small peak could represent very strong binding in only a subset of the cells” [22].

For each alignment, we searched for the resulting motif in 386 UniPROBE mouse motifs and 398 motifs derived from all the matrices in the TRANSFAC Public database. The search was accomplished by the software TOMTOM [33]. We used Pearson correlation as the distance measure, required a minimal overlap of 5 nucleotides, and set the E-value cut-off to 5. Table 1.3.8 shows, for each ChIP-seq experiment, the sequence logos of motifs found by LASAGNA-ChIP and MEME. The matching motifs found by TOMTOM are listed under each sequence logo [15] by E-value. In case more than 10 significant motifs were found, only the 10 most significant ones are shown. For each ChIP-seq experiment, the one matching the ChIPed TF is highlighted in yellow, while the one matching a cofactor of the ChIPed TF is highlighted in blue if the ChIPed TF is not found.
We first notice that overall the motifs found by LASAGNA-ChIP and MEME are very similar by visual inspection. No significant difference is observed in terms of motif IC (p-value: 0.1252). For both LASAGNA-ChIP and MEME, the ChIPed TF motifs were found for 31 experiments. Among the other 7 experiments are one MYB in MEL cells, all the ETS1 in CH12 and MEL cells, one JUND in MEL cells, one MAX in C2C12 cells, all the TBP in CH12 and
MEL cells. Interestingly, LASAGNA-ChIP and MEME differ only for 4 out of these 7 experiments. They are one ETS1 in CH12 cells, one MAX in C2C12 cells and two TBP in CH12 and MEL cells. Although LASAGNA-ChIP and
MEME differ in these cases, the found motifs still warrant further analyses.
For instance, the motif for ETS1 in CH12 cells found by LASAGNA-ChIP resembles the secondary motif of Gabpa (highlighted in green), which is a known paralog.
For the other 3 out of the 7 experiments, LASAGNA-ChIP and MEME produced similar motifs. The one found for MYB in MEL resembles those of GATA proteins. This agrees with a recent study reporting that MYB and GATA-3 cooperatively regulate IL-13 by direct binding to a conserved GATA-3 response element [42]. Since this motif is based on 300 peak sequences, it is likely that the two proteins similarly regulate other genes in MEL cells. The motif for ETS1 in MEL cells also matches those of GATA proteins. Cooperation between ETS1 and GATA-3 in regulating IL-5 was also suggested [9, 86]. Finally, while the motif for JUND in MEL cells matches two motifs in the TRANSFAC and UniPROBE databases, the matches are likely false positives since no literature support was found.
While it is not specifically designed to be a de novo motif discovery method, LASAGNA-ChIP aligns all the peak sequences and finds the most informative motif. The assumption that a motif instance is present in each peak sequence may not hold for some experiments. Because of several possible binding models [22], two or more motifs may be present in subsets of the peak sequences. Discovery of more than one motif will be enabled for LASAGNA-ChIP in the near future.

Table 1.3.8: Sequence Logos of Motifs Found by LASAGNA-ChIP and MEME
[Sequence logos (WebLogo 3.2) and the lists of factor names printed beneath each logo are not reproduced; the information content (IC, in bits) reported for each motif is listed below.]

TF       Cell    LASAGNA-ChIP IC  MEME IC  Experiment
BHLHE40  MEL     7.14             7.21     wgEncodeSydhTfbsMelBhlhe40cIggrabPk
BHLHE40  CH12    8.32             8.19     wgEncodeSydhTfbsCh12Bhlhe40nb100IggrabPk
CEBPB    C2C12   10.41            10.4     wgEncodeCaltechTfbsC2c12CebpbFCntrl50bE2p60hPcr1xPkRep1
CEBPB    C2C12   9.56             9.79     wgEncodeCaltechTfbsC2c12CebpbFCntrl50bE2p60hPcr1xPkRep2
CEBPB    C2C12   10.39            10.49    wgEncodeCaltechTfbsC2c12CebpbFCntrl50bPcr1xPkRep1
JUN      CH12    8.0              7.91     wgEncodeSydhTfbsCh12CjunIggrabPk
MYB      MEL     8.2              8.21     wgEncodeSydhTfbsMelCmybh141IggrabPk
MYB      MEL     8.17             8.23     wgEncodeSydhTfbsMelCmybsc7874IggrabPk
MYC      CH12    7.96             7.98     wgEncodeSydhTfbsCh12CmycIggrabPk
MYC      MEL     8.44             8.45     wgEncodeSydhTfbsMelCmycIggrabPk
ETS1     CH12    7.58             7.99     wgEncodeSydhTfbsCh12Ets1IggrabPk
ETS1     MEL     8.63             8.71     wgEncodeSydhTfbsMelEts1IggrabPk
FOSL1    C2C12   11.59            11.19    wgEncodeCaltechTfbsC2c12Fosl1sc605FCntrl36bPcr1xPkRep1
GATA1    MEL     9.22             9.2      wgEncodeSydhTfbsMelGata1Dm2p5dStdPk
GATA1    MEL     9.52             9.5      wgEncodeSydhTfbsMelGata1IggratPk
JUND     CH12    7.97             8.37     wgEncodeSydhTfbsCh12JundIggrabPk
JUND     MEL     16.59            14.47    wgEncodeSydhTfbsMelJundIggrabPk
MAFK     ES-E14  13.47            13.47    wgEncodeSydhTfbsEse14MafkStdPk
MAFK     MEL     13.14            13.01    wgEncodeSydhTfbsMelMafkDm2p5dStdPk
MAFK     CH12    8.76             8.7      wgEncodeSydhTfbsCh12Mafkab50322IggrabPk
MAFK     MEL     8.65             8.43     wgEncodeSydhTfbsMelMafkab50322IggrabPk
MAX      C2C12   7.76             7.96     wgEncodeCaltechTfbsC2c12MaxFCntrl50bE2p60hPcr1xPkRep1
MAX      C2C12   7.67             7.91     wgEncodeCaltechTfbsC2c12MaxFCntrl50bPcr1xPkRep1
MAX      CH12    7.93             7.79     wgEncodeSydhTfbsCh12MaxIggrabPk
MAX      MEL     8.44             8.4      wgEncodeSydhTfbsMelMaxIggrabPk
MYOG     C2C12   9.93             9.96     wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bE2p24hPcr2xPkRep1
MYOG     C2C12   10.49            10.44    wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bE2p60hPcr2xPkRep1
MYOG     C2C12   12.31            12.26    wgEncodeCaltechTfbsC2c12Sc12732FCntrl50bE2p7dPcr1xPkRep1
MYOD1    C2C12   12.35            12.39    wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bE2p24hPcr2xPkRep1
MYOD1    C2C12   11.58            11.29    wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bE2p60hPcr2xPkRep1
MYOD1    C2C12   10.82            10.75    wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bPcr2xPkRep1
MYOD1    C2C12   13.04            12.89    wgEncodeCaltechTfbsC2c12Sc32758FCntrl50bE2p7dPcr1xPkRep1
SRF      C2C12   8.84             8.84     wgEncodeCaltechTfbsC2c12SrfFCntrl32bE2p24hPcr2xPkRep1
TBP      CH12    7.74             6.81     wgEncodeSydhTfbsCh12TbpIggmusPk
TBP      MEL     8.38             8.06     wgEncodeSydhTfbsMelTbpIggmusPk
TCF3     C2C12   11.32            11.02    wgEncodeCaltechTfbsC2c12Tcf3FCntrl32bE2p5dPcr2xPkRep1
USF1     C2C12   12.16            12.0     wgEncodeCaltechTfbsC2c12Usf1FCntrl50bE2p60hPcr1xPkRep1
USF1     C2C12   12.4             12.34    wgEncodeCaltechTfbsC2c12Usf1FCntrl50bPcr1xPkRep1

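The IC values in Table 1.3.8 are reported in bits. As a minimal sketch of how such a value could be obtained from a gapless alignment, the snippet below sums, over the alignment columns, the quantity 2 minus the column entropy. This assumes a uniform background and no small-sample correction; the exact definition used for the table is not restated in this section, so the function name `information_content` and the formula are illustrative assumptions.

```python
import math
from collections import Counter


def information_content(aligned_sites):
    """Total information content (bits) of a gapless DNA alignment.

    Assumption: each column contributes 2 + sum_b p_b * log2(p_b) bits,
    i.e. a uniform background and no small-sample correction.
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    total = 0.0
    for column in zip(*aligned_sites):
        counts = Counter(column)
        n = len(column)
        # Shannon entropy of the observed base frequencies in this column.
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total
```

Under this definition a fully conserved column contributes 2 bits and a column with all four bases equally frequent contributes 0 bits.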
1.3.4 LASAGNA is Simple and Effective
Unlike MEME and similar methods, the order in which the input sequences are aligned is crucial to LASAGNA and ClustalW. ClustalW relies on a guide tree based on pairwise alignments to decide the order. LASAGNA, on the other hand, depends on the length of a sequence and its similarity to the partial alignment. LASAGNA-ChIP is well-suited for a TF whose shortest site misses the core or contains only a fraction of it. We, however, observed no significant difference between LASAGNA and LASAGNA-ChIP on TFBSs in the TRANSFAC Public database. This is because, for these TFBSs, a shortest site often fully contains the core. Hence, our assumption holds true in general.
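To make the role of the alignment order concrete, here is a deliberately simplified sketch of a shortest-first, gapless, greedy alignment. It is not the LASAGNA scoring scheme: counting identical bases against the partial alignment stands in for the model-based similarity described in Chapter 1, and the function names (`revcomp`, `overlap_matches`, `align_shortest_first`) are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def overlap_matches(cand, offset, placed):
    """Count identical bases between cand (placed at offset) and the partial alignment."""
    score = 0
    for seq, start in placed:
        lo = max(offset, start)
        hi = min(offset + len(cand), start + len(seq))
        for pos in range(lo, hi):
            if cand[pos - offset] == seq[pos - start]:
                score += 1
    return score


def align_shortest_first(sites):
    """Greedy gapless alignment; returns (oriented_site, offset) pairs.

    Offsets are column positions relative to the first base of the shortest site.
    """
    ordered = sorted(sites, key=len)          # the shortest site seeds the alignment
    placed = [(ordered[0], 0)]
    for site in ordered[1:]:
        left = min(start for _, start in placed)
        right = max(start + len(seq) for seq, start in placed)
        best = None
        for cand in (site, revcomp(site)):
            # Try every offset that overlaps the current alignment by at least one column.
            for offset in range(left - len(cand) + 1, right):
                score = overlap_matches(cand, offset, placed)
                if best is None or score > best[0]:
                    best = (score, cand, offset)
        placed.append((best[1], best[2]))
    return placed
```

For example, `align_shortest_first(["TTGATAAGC", "GATAAG", "CTTATCAA"])` seeds the alignment with `GATAAG`, then places the reverse complement of `CTTATCAA` and finally `TTGATAAGC`, both two columns to the left of the seed.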
For ChIP-seq data, the assumption that short sequences contain fewer irrelevant bases flanking the core may not hold. However, we observe that, under the one-per-sequence model, LASAGNA-ChIP performed comparably to MEME in aligning ChIP-seq peak sequences. We tried other orders, such as from the longest sequence to the shortest one, and found that aligning the shortest sequence first does have its advantage (data not shown). Also, we note that, for 11 out of 38 experiments, the peak sequences are all at least 100 bp long (see Table 1.3.7) and hence all the peak sequences are 100 bp long after clipping. This implies that LASAGNA-ChIP is capable of handling sequences of the same length.
LASAGNA-ChIP, MEME and similar methods produce gapless alignments and therefore have their limits. When a TF binds to two cores separated by a variable-length spacer, these methods are expected to align the canonical TFBSs containing spacers of the most prevalent length. These binding patterns are also known as two-block motifs. Gapped alignment or explicit modeling [8] is needed to correctly align TFBSs of this nature.
1.4 Conclusions
We proposed LASAGNA, a novel alignment algorithm specifically designed for aligning variable-length transcription factor binding sites. Cross-validation results on 189 TFs and 4771 TFBSs indicated that LASAGNA significantly outperformed ClustalW2 (p-value: $1.22 \times 10^{-15}$) and MEME (p-value: $3.55 \times 10^{-15}$). This is because LASAGNA was specifically designed for aligning variable-length TFBSs. Based on the success of LASAGNA, we developed LASAGNA-ChIP, which is capable of handling sequences produced by ChIP-chip and ChIP-seq experiments. While ClustalW2 is better suited for producing structurally correct alignments, LASAGNA-ChIP, MEME and similar methods can be used to align sequences produced by ChIP-chip or ChIP-seq experiments.
We compared LASAGNA-PSSM, the PSSM method dependent on LASAGNA, to SiTaR, an alignment-free TFBS search method. Cross-validation experiments were conducted on 1751 TFBSs of 90 TFs for both methods. The results showed that, at fixed recall rates, LASAGNA-PSSM is significantly more precise than SiTaR (p-value: $2.66 \times 10^{-8}$). The recall-precision curve showed that our method is consistently more precise at any recall rate or more sensitive at any precision.
We conclude that the LASAGNA algorithm is simple and effective in aligning variable-length binding sites. It has been integrated into a user-friendly webtool for TFBS search called LASAGNA-Search. The tool currently stores precomputed PSSM models for 189 TFs and 133 TFs built from TFBSs in the TRANSFAC Public database (release 7.0) and the ORegAnno database (08Nov10 dump), respectively. In the future, more sources of experimentally validated TFBSs, such as the PAZAR database, will be incorporated into the webtool, making variable-length TFBSs more accessible to scientists in the field.
Chapter 2 LASAGNA-Search: A User-friendly Webtool for Transcription Factor Binding Site Search and Visualization
2.1 Background
In this chapter, we describe a user-friendly webtool named LASAGNA-Search for transcription factor binding site search. We use the term position-specific weight matrix (PWM) to refer to the $4 \times l$ matrix described in the introduction chapter, where $l$ is the length of the binding sites of the TF of interest. The term position-specific scoring matrix (PSSM) is used to refer to the method that scores binding sites based on a PWM or an alignment.
A typical TFBS search webtool takes a PWM and a promoter sequence as inputs and outputs putative binding sites. Many webtools implement useful features in addition to this basic function. Some tools accept variable-length binding sites [69, 24, 3] instead of a PWM. Some tools offer precomputed models built from PWMs or TFBSs so users do not need to collect PWMs or TFBSs to use a tool [67, 10, 13, 82, 78, 31]. Some tools adopt a TFBS search method that exploits position dependence [69, 67]. Some tools offer promoter sequence retrieval or integrate a sequence retrieval tool [67, 83, 2, 82, 78]. For result visualization, the MAPPER2 database [67] supports visualizing hits in the UCSC Genome Browser [19] for three species. It is also desirable to visualize the predicted binding specificities as a gene regulatory network (GRN) [31]. However, no single webtool offers all of these useful features.
To incorporate all the aforementioned useful features, we implemented LASAGNA-Search: a webtool for TFBS search and visualization. LASAGNA-Search accepts variable-length TFBSs in addition to PWMs. It offers 1792 precomputed models based on TFBSs and PWMs collected from the TRANSFAC Public, JASPAR, ORegAnno and UniPROBE databases. Its search module exploits position dependence for a TFBS-based model whenever performance gain is indicated by cross-validation. Automatic promoter sequence retrieval is supported for 15 species at LASAGNA-Search, which enables visualization of search results in the UCSC Genome Browser for the 15 species. Search results can also be visualized along promoter sequences locally at LASAGNA-Search for any species. Finally, a GRN can be constructed from search results and visualized locally with various options.
2.2 Materials and Methods
[Diagram. Model input: user TFBSs, user PWM, TFBS-based collections and PWM-based collections (TRANSFAC, JASPAR CORE, ORegAnno, UniPROBE, PAZAR). Promoter input: gene ID/symbol, mRNA accession, user promoter. Processing: alignment module (trim alignment, build model), promoter retrieval module, search module. Output: search results and previous search results, from which a gene regulatory network can be inferred. Visualization: HTML table, plain-text table, local image, UCSC Genome Browser, gene regulatory network.]
Figure 2.2.1: Architecture of LASAGNA-Search
The search module of LASAGNA-Search takes TF models and promoter sequences as inputs. TFBS-based collections contain precomputed TF models from TFBSs, while PWM-based collections include precomputed TF models from PWMs. The alignment module aligns user-provided (variable-length) TFBSs and the alignment may be manually trimmed before model building. Users may also input a PWM for model building. Promoter sequences may be provided by users or automatically retrieved by the promoter retrieval module using the NCBI Gene ID, the official symbol or an mRNA accession number of a gene. Results produced by the search module may be displayed in an HTML or tab-delimited table. LASAGNA-Search offers visualization of the results as local images, while the results can also be displayed in the UCSC Genome Browser as custom tracks. A gene regulatory network can be inferred from search results and visualized locally.
Figure 2.2.1 shows the architecture of LASAGNA-Search. We introduce the major components in the following sections.
2.2.1 Modules
2.2.1.1 Alignment Module
The alignment module aligns variable-length TFBSs so a model can be built from the alignment. It implements the LASAGNA algorithm detailed in Chapter 1, which has been extensively compared to ClustalW2 [45] and MEME [3] with favorable outcomes (see Section 1.3.1).
2.2.1.2 Search Module
The search module takes a TF model and a promoter sequence as inputs. The TF model specifies $l$, the length of a putative binding site, and a parameter $K$, and gives $M_i(u)$ and $M_{ij}(u, v)$ for $u \in \{A, C, G, T\}$, $i \in [1, l - k]$ and $j \in [i + 1, i + K]$, as seen in (1.2.1). Depending on the TF, scores of nucleotide pairs may contribute to the score of a sequence. This is controlled by the parameter $K \geq 0$, the maximal distance between a nucleotide pair. The value of $K$ is TF-dependent and is determined by cross-validation. Hence, $K$ is greater than 0 only if nucleotide pairs improve the search performance for a TF.
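The following sketch illustrates how a model of this kind could score a site. Equation (1.2.1) is not reproduced in this section, so the additive form used here (single-position terms plus pair terms for positions at most $K$ apart) is an assumption, as are the function name `pssm_k_score`, the zero-based indexing and the dictionary-based data layout.

```python
def pssm_k_score(site, M, M_pair, K):
    """Score a site with single-position terms plus pair terms up to distance K.

    Assumed layout (hypothetical, zero-based):
      M[i][u]               score of base u at position i
      M_pair[(i, j)][(u, v)] score of bases u, v at positions i < j <= i + K
    Pairs that are missing from M_pair contribute nothing.
    """
    if len(site) != len(M):
        raise ValueError("site length must equal model length")
    score = sum(M[i][u] for i, u in enumerate(site))
    last = len(site) - 1
    for i in range(len(site)):
        for j in range(i + 1, min(i + K, last) + 1):
            score += M_pair.get((i, j), {}).get((site[i], site[j]), 0.0)
    return score
```

With `K = 0` the inner loop never runs and the score reduces to the usual position-independent PSSM sum, which matches the statement that pair terms are used only when cross-validation favors them.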
Conventionally, it is assumed that the first letter of an $l$-mer is aligned with the first position of a TF model and the $l$-mer is scored accordingly. Unlike the conventional approach, we align an $l$-mer with a TF model by sliding the $l$-mer and its reverse complement through the model such that the overlap between the two is at least one nucleotide, as described in Section 1.2.4. Using the framework described in Section 2.3.3, we found that this is significantly better than the conventional approach for locating TFBSs (see Figure 2.2.2). Moreover, this approach allows easy scoring of an $l$-mer by a cluster of TF models of different widths. Scoring by a cluster of TF models has been shown to outperform using only the best model in the cluster [62] and hence is a feature to be included in the near future.
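A minimal sketch of the sliding strategy, restricted to single-position scores for brevity. Section 1.2.4 is not reproduced here, so two choices are assumptions made for illustration: positions of the sequence that fall outside the model contribute nothing, and the best score over both strands and all shifts is returned. The function names are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def best_sliding_score(lmer, M):
    """Slide lmer and its reverse complement through model M.

    M is a list of dicts (base -> score), one per model position.
    Returns (score, strand, shift), where shift is the model position
    aligned with the first base of the (possibly reverse-complemented) l-mer.
    """
    width = len(M)
    best = None
    for strand, seq in (("+", lmer), ("-", revcomp(lmer))):
        # Shifts from -(len - 1) to width - 1 guarantee an overlap of >= 1 base.
        for shift in range(-(len(seq) - 1), width):
            score = 0.0
            for k, base in enumerate(seq):
                pos = shift + k
                if 0 <= pos < width:      # bases outside the model are ignored
                    score += M[pos][base]
            if best is None or score > best[0]:
                best = (score, strand, shift)
    return best
```

The conventional approach corresponds to evaluating only `shift = 0` on each strand; the loop above simply widens that search.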
[Scatter plots, one point per TF; x axis: Slide, y axis: Not slide. (A) Average Precision: 70 TFs, p-value: 1.5e-05. (B) Accuracy: 70 TFs, p-value: 4.1e-06.]
Figure 2.2.2: Comparison of Scoring Strategies Using TF Models Collected by LASAGNA-Search
The search module of LASAGNA-Search (x axis) scores a binding site by sliding a putative site through a TF model, while the conventional approach (y axis) does not. The evaluation framework is described in section “Evalu- ation of precomputed TF models” of the main article. Each point in a plot corresponds to a TF, whose binding sites can be predicted by more than one model. The average performance across the models is used to plot the point. Average precision is used as the performance measure to generate (A), whereas accuracy is the performance measure for (B).
For each putative binding site or hit, the search module computes the score and the p-value, the probability of observing a score higher than or equal to the score under a background distribution. We adopt the 0th-order Markov model, also known as the independent multinomial model, as the background model. To estimate the background distribution for a PSSM$_K$ model, we consider only the binding sites or PWM used to build the model. We adopt this conservative way of estimating the score distribution because it is harder for a non-binding site to get a low p-value in this distribution.
For K = 0, the exact score distribution can be efficiently computed by convolution [36]. However, this is not the case for K > 0 since the score can no longer be seen as a sum of independent variables. Consequently, we compute p-values using empirical score distributions. The empirical score distribution of a model is obtained by scanning random promoter sequences simulated by the background model. Specifically, we focus on only those PSSMK scores in the upper 5% and hence scores falling outside the upper 5% are assigned a p-value of 0.05+. The smallest non-zero p-value is $2.5 \times 10^{-5}$ and a p-value of 0 implies any number lower than $2.5 \times 10^{-5}$.
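For the K = 0 case, the convolution idea can be sketched as follows. This is an illustration only and not necessarily the procedure of [36]: it assumes the column scores have been rounded to integers so that the distribution has finitely many support points, and the names score_distribution and p_value are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def score_distribution(int_pssm: List[Dict[str, int]],
                       background: Dict[str, float]) -> Dict[int, float]:
    """Exact distribution of the total score under an independent background.

    Columns are independent for a 0th-order model, so the distribution of the
    sum is obtained by convolving the per-column score distributions one by one.
    """
    dist: Dict[int, float] = {0: 1.0}
    for column in int_pssm:
        nxt: Dict[int, float] = defaultdict(float)
        for score, prob in dist.items():
            for base, q in background.items():
                nxt[score + column[base]] += prob * q
        dist = dict(nxt)
    return dist


def p_value(dist: Dict[int, float], score: int) -> float:
    """Probability of observing a score higher than or equal to `score`."""
    return sum(prob for s, prob in dist.items() if s >= score)
```

For K > 0 the column contributions are no longer independent, which is why the text falls back on empirical distributions obtained from simulated sequences.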
While the p-values are not corrected for multiple testing, they are useful for ordering hits found by different TF models. To take into account the length of the promoter sequence in which a hit is found, an E-value is computed for the hit. An E-value gives the expected number of times a hit of the same or higher score is found in the promoter sequence by chance. Let L be the length of the promoter sequence and l be the length of the putative binding site. Then $\text{E-value} = \text{p-value} \times (L - l + 1)$, which is approximately $\text{p-value} \times L$ when $L \gg l$.
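The E-value definition translates directly into code. The sketch below is a plain transcription of the formula; the function name e_value and the input check are additions for illustration.

```python
def e_value(p_value: float, promoter_length: int, site_length: int) -> float:
    """Expected number of chance hits with the same or a higher score.

    The promoter contains (L - l + 1) candidate start positions for a site
    of length l, each exceeding the score with probability `p_value`.
    """
    if site_length > promoter_length:
        raise ValueError("site cannot be longer than the promoter sequence")
    return p_value * (promoter_length - site_length + 1)
```

As an example, a 6-mer hit with p-value 0.000125 in a 1000-base promoter has 995 candidate positions, so e_value(0.000125, 1000, 6) is about 0.124.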
2.2.1.3 Promoter Retrieval Module
Currently, LASAGNA-Search supports retrieving promoter sequences for 15 species: Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Saccharomyces cerevisiae, Caenorhabditis elegans, Caenorhabditis briggsae, Bos taurus, Sus scrofa, Ovis aries, Gallus gallus, Canis lupus familiaris, Felis catus, Xenopus (Silurana) tropicalis and Danio rerio. Users may enter the NCBI Gene identifier (ID), the official gene symbol or an mRNA accession number of a gene to retrieve its upstream promoter region. The upstream region of a gene is specified by positions relative to the transcription start site (TSS) obtained from the UCSC Genome Browser [19]. Information in the NCBI Gene database is used for conversion between Gene IDs and symbols.
2.2.1.4 Gene Regulatory Network Inference
LASAGNA-Search automatically constructs a gene regulatory network based on search results. A directed edge from a TF model to a gene is established if at least one significant hit is found in the promoter region of the gene by the TF model. The lowest p-value of these hits is used to compute the weight on this edge. That is, the thickness of the edge is proportional to −log p-value. In case the coding genes of a TF model are known, these genes may be added to the network with dotted arrows from the genes to the TF model. To simplify the network, the node for a TF model may be removed, leaving only its coding genes in the network. Figure 2.2.3 shows an example network of human genes TP53 and MYB. Visualization of gene regulatory networks at LASAGNA-Search is enabled by Cytoscape Web [57]. We describe how the networks in Figure 2.2.3 were generated in Section 2.3.1.
[Figure 2.2.3 image: network panels A, B and C.]
Figure 2.2.3: An Inferred Gene Regulatory Network of Human Genes TP53 and MYB
A small gene regulatory network of human genes TP53 and MYB inferred from scanning the promoters (950bp upstream to 50bp downstream) using two TP53 TF models and two MYB TF models. Genes are denoted by green ellipses and TF models are represented by red octagons. (A) The inferred network containing the 2 genes and the 4 TF models. (B) The dotted edges from the 2 coding genes to the 4 TF models are established in this network. (C) The simplified network after removing 4 nodes. These nodes are removed because the two TP53 TF models are coded by the TP53 gene and likewise for MYB.
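The edge construction can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the LASAGNA-Search implementation: the logarithm base (10), the floor applied to p-values reported as 0, and the names infer_edges, simplify and P_FLOOR are all assumptions of the sketch.

```python
import math
from typing import Dict, Iterable, List, Tuple

# Assumption: p-values reported as 0 are floored before taking the logarithm.
P_FLOOR = 2.5e-5

Edge = Tuple[str, str]


def infer_edges(hits: Iterable[Tuple[str, str, float]],
                p_threshold: float) -> Dict[Edge, float]:
    """Map (TF model, gene) to an edge weight of -log10(lowest p-value).

    `hits` holds (model, gene, p_value) triples; an edge exists only if at
    least one hit meets the significance threshold.
    """
    best_p: Dict[Edge, float] = {}
    for model, gene, p in hits:
        if p <= p_threshold:
            key = (model, gene)
            best_p[key] = min(p, best_p.get(key, 1.0))
    return {key: -math.log10(max(p, P_FLOOR)) for key, p in best_p.items()}


def simplify(edges: Dict[Edge, float],
             coding_genes: Dict[str, List[str]]) -> Dict[Edge, float]:
    """Replace each mapped TF model by its coding genes, keeping the largest weight."""
    simple: Dict[Edge, float] = {}
    for (model, target), weight in edges.items():
        for source in coding_genes.get(model, []):
            key = (source, target)
            simple[key] = max(weight, simple.get(key, 0.0))
    return simple
```

In the simplified network, a TF model whose coding gene is also its target becomes an edge from the gene to itself, that is, a self-loop.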
2.2.2 TF Model Collections
LASAGNA-Search currently offers 6 precomputed TF model collections. The collections are categorized by the type of data used to build a model. Table 2.2.9 lists the type and number of models for each collection. To facilitate gene regulatory network visualization, we mapped TF models to genes coding for the TFs. The number of models that can be mapped is also listed in Table 2.2.9 for each collection. Models in the TFBS-based collections were built from unaligned TFBSs, while models in the PWM-based collections were built from PWMs. We describe the two categories in the following sections.
Table 2.2.9: Summary of TF Model Collections
Database       Type   Models   Mapped Models¹
TRANSFAC       TFBS   189      188
ORegAnno       TFBS   133      132
PAZAR          TFBS   66       66
TRANSFAC       PWM    398      366
JASPAR CORE    PWM    476      457
UniPROBE       PWM    530      524
Total                 1792     1733
¹ Models of TFs whose coding genes were found.
2.2.2.1 TFBS-based Collections
We collected experimentally validated transcription factor binding sites from the TRANSFAC Public database and the ORegAnno database. In these two collections, binding sites of a TF were not collected across species.
TF models are non-redundant in the sense that a TF of a species has only one model based on all the available binding sites in a database. The binding sites of a TF were aligned by the alignment module to build a model. We built one model for each TF because, for most TFs, the binding affinity can be explained by only one model [92]. In case a TF recognizes more than one motif [25], we rely on database curators to distinguish binding sites belonging to distinct motifs. Moreover, the TFBS-based collections are complemented by our PWM-based collections, which offer more than one model for some TFs.
Binding sites of 5 species were collected from the TRANSFAC Public database (release 7.0) [59]: Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster and Saccharomyces cerevisiae. For each species, a TF was included in our collection if it has at least 10 binding sites. In total, binding sites for 189 TFs across 5 species were collected. Although TRANSFAC does build PWMs for TFs, 72 (38.1%) of these TFs do not have any PWMs in TRANSFAC.
Besides the 5 species present in the TRANSFAC collection, binding sites of Caenorhabditis elegans and Caenorhabditis briggsae were collected from the ORegAnno database (08Nov10 dump) [32]. Being an open-annotation database, ORegAnno allows users to adopt the role of curators and contribute binding sites and other types of annotations to the database. A nice feature is that users can enter an NCBI or Ensembl ID for each gene or transcription factor mention. This feature allows easy mapping of distinct mentions of the same TF to a unique database ID, so binding sites of a TF contributed by different users can be easily merged. Nevertheless, many TF mentions in ORegAnno are not accompanied by a database ID. In this case, we automatically assign the NCBI Gene ID to a TF mention by consulting the NCBI Gene database. We note that this is not always possible since a TF mention may be the symbol of one gene and a synonym of another and hence cannot be uniquely mapped. Ambiguity was manually resolved when a TF mention could not be uniquely mapped to an NCBI Gene ID. Still, some TF mentions are protein complexes and hence cannot be identified by a single gene ID. These mentions were semi-automatically collapsed. Finally, binding sites of 133 TFs across 7 species were collected, where each TF has at least 10 binding sites.
The PAZAR database [64] offers a platform for users to start curation projects. A record stores one annotation for one sequence from either a TF-gene interaction or a gene expression experiment. Hence, binding sites of a TF can be extracted from TF-gene interaction records in the PAZAR projects. Since more than one project may curate binding sites of a particular TF, we aggregated records containing TF-gene interaction information from all the public projects. All the files in the general feature format dated 20120117 were downloaded. We grouped TFBSs by TF and species, that is, human TF A and mouse TF A are considered two TFs.
A binding site was filtered out if it was shorter than 4 or longer than 1000 bases. To verify a binding site, we searched for it in the vicinity of the curated genomic location in the reference genome. The binding site was discarded if it could not be located within 5 bases of the curated location. As we collected binding sites for a TF across all the public projects in PAZAR, a TFBS may be curated by more than one project, resulting in multiple copies of the TFBS in our collection. Therefore, for each pair of overlapping binding sites, we kept only the shorter one if the overlap is more than 80% of the shorter one in length. A model was built for a TF if it has at least 10 binding sites.
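The length filter and the overlap rule can be sketched as follows. The sketch assumes that all sites belong to one TF on one chromosome and are given as half-open (start, end) intervals, and it applies the pairwise rule greedily from the shortest site upward; the order in which pairs were actually resolved is not stated in the text. The names overlap_len and filter_sites are hypothetical.

```python
from typing import List, Tuple

Site = Tuple[int, int]  # half-open genomic interval [start, end)


def overlap_len(a: Site, b: Site) -> int:
    """Number of bases shared by two intervals (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def filter_sites(sites: List[Site], min_len: int = 4, max_len: int = 1000,
                 max_overlap: float = 0.8) -> List[Site]:
    """Drop sites of unusual length, then near-duplicate longer copies.

    Sites are visited from shortest to longest, so every kept site is no
    longer than the candidate; the candidate is dropped if it overlaps a
    kept site by more than `max_overlap` of that shorter site's length.
    """
    ok = [s for s in sites if min_len <= s[1] - s[0] <= max_len]
    ok.sort(key=lambda s: (s[1] - s[0], s[0]))
    kept: List[Site] = []
    for cand in ok:
        if all(overlap_len(cand, k) <= max_overlap * (k[1] - k[0]) for k in kept):
            kept.append(cand)
    return kept
```

With the sites (100, 110), (100, 120), (200, 203) and (300, 310), the 3-base site fails the length filter and (100, 120) is dropped because it fully covers the shorter site (100, 110).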
The LASAGNA-ChIP algorithm [50] was used to align the binding sites of a TF since some of the projects contain TFBSs identified by ChIP-seq and ChIP-chip experiments. As reported in [38], about 94% of the actual binding sites can be located within 50 bases of signal peaks. However, no clipping was done for sequences produced by ChIP-seq experiments since information about the signal peak is not available in PAZAR. The new collection contains 66 TF models, 39, 20 and 7 of which are human, mouse and rat, respectively.
As seen in Table 2.2.9, nearly all the TF models in the two collections were mapped to TF coding genes. Only one model in each collection remains unmapped due to lack of information in the source databases. They are ETF (T00270) and MYF in TRANSFAC and ORegAnno, respectively.
2.2.2.2 PWM-based Collections
In addition to binding sites, we also collected position-specific weight matrices (PWMs) from the TRANSFAC Public database, the JASPAR CORE database [10] and the UniPROBE database [60]. A PWM is a $4 \times l$ matrix, where l is the length of binding sites. Each element in column i of a PWM is usually the count or probability of a nucleotide at position i. PWMs are valuable resources for various reasons. One reason is that most PWMs in TRANSFAC and JASPAR were built by domain experts. For instance, some PWMs in TRANSFAC and JASPAR CORE were based on binding sites of more than one species because of cross-species conservation (e.g. TRANSFAC matrix M00152). Moreover, a PWM in TRANSFAC may be based on binding sites of more than one TF because of similar binding specificities (e.g. TRANSFAC matrix M00158). Another reason is that for some techniques no binding sites but only matrices are produced. The UniPROBE database, for example, stores data from protein binding microarray (PBM) experiments [6]. The PBM technique assigns a binding specificity score to each 10-mer sequence variant. Berger and Bulyk [6], however, do not suggest setting a specificity cut-off threshold to report binding sites. Instead, PWMs are produced by the Seed-and-Wobble algorithm.
From the UniPROBE database, we collected 530 PWMs of 6 species: Homo sapiens, Mus musculus, Saccharomyces cerevisiae, Caenorhabditis elegans, Plasmodium falciparum and Cryptosporidium parvum. These 530 PWMs correspond to 414 non-redundant TFs (proteins or protein complexes). We collected 476 PWMs from the JASPAR CORE database, where the PWMs were categorized into 6 species groups: Vertebrates, Insects, Plants, Fungi, Nematodes and Urochordates. Finally, 398 PWMs were collected from the TRANSFAC Public database and grouped into Vertebrates, Insects, Plants, Fungi, Nematodes and Bacteria.
According to Table 2.2.9, the PWM-based collections contain more unmapped TF models than the TFBS-based collections. Lack of information in the source databases is the major reason. Some matrices such as MA0102.1 and MA0061.1 in the JASPAR CORE database were built from TFBSs of more than one species but accession numbers of the homologous proteins are not available. Other matrices in the TRANSFAC and JASPAR CORE databases have protein accession numbers available, but records of the corresponding coding genes cannot be found in the NCBI Gene database. These proteins often belong to species such as Pisum sativum and Triticum aestivum, which are not as well studied as model organisms.
2.3 Results and Discussion
In this section, we introduce the user interface, followed by a comparison of features to existing webtools, and evaluation of precomputed TF models in LASAGNA-Search and MAPPER2. Finally, we discuss future directions for improving LASAGNA-Search.
2.3.1 User Interface
2.3.1.1 Input Page
The LASAGNA-Search input page is divided into three parts, one for TF model input, one for promoter sequence input and one for result filtering parameter input. Figure 2.3.1a shows a screenshot of the input page. Two options are available for result filtering. One is setting a p-value threshold so that only hits with equal or lower p-values will be reported. The other is setting k so that only k hits with the highest scores will be reported.
For TF model input, LASAGNA-Search accepts variable-length TFBSs for model building. Users may input TFBSs in the FASTA format. The TFBSs will be aligned on clicking the “Start Searching” button. The PWM and sequence logo [15] of the automatically trimmed alignment will be displayed. Users may choose to further trim the alignment or recover previously trimmed columns. Figure 2.3.1c shows the user interface for TFBS alignment trimming. In addition to TFBSs, users may input a PWM for model building. LASAGNA-Search recognizes formats used by JASPAR, TRANSFAC and UniPROBE.
LASAGNA-Search currently offers two ways of selecting models in the TFBS-based and PWM-based collections. One is to browse each model collection, while the other is to search by keywords for models in all the collections. To browse a collection, users may click the radio button for the collection to browse models by species or species group. A model can then be added to the “shopping cart” by marking the model with a tick. Removing a tick mark will remove the corresponding model from the “shopping cart”. To search for models, users may enter one or more keywords and click the “Search” button. The models found will be displayed in a list and can be similarly selected or removed (see Figure 2.3.1b for an example). The number of selected models is displayed on the input page. Users may click the “Show” button to view these models and remove the unwanted ones.
For promoter sequence input, users may input their own promoter sequences in the FASTA format. Alternatively, users may retrieve promoter sequences by NCBI Gene IDs, gene symbols, or mRNA accession numbers. On clicking the “Search” button, LASAGNA-Search will display the matching promoters. Figure 2.3.1d shows the promoters found by keywords CCND1 and MYB. Users may choose to examine only promoters of a particular species. Only the matching human promoters are listed in Figure 2.3.1d after applying the filter. Promoters are selected in a manner similar to selecting TF models. Finally, users may also select from a list of randomly sampled promoters of a chosen species.
2.3.1.2 Result Page
The result page is organized into 5 tabs. The first tab displays hits on all the promoter sequences, whereas the second tab displays hits pertaining to one promoter sequence at a time. The third tab shows the gene regulatory network inferred from search results. The fourth tab allows importing previous search results to be merged with the current search results. The last tab contains the inputs, including the selected TF models, the selected promoters and the search parameters. Figure 2.3.2 shows an example result page with the second tab named “Promoter view” showing.
Only hits meeting the specified criterion are reported in the first and second tabs. For each hit, the model name, sequence, 0-based position, strand, score, p-value and E-value are reported. Hits found in the same promoter sequence can be sorted by model name, sequence, position, strand, p-value and E-value by clicking the respective column header. By default, the hits are displayed in an HTML table. Users may click a button on the result page to obtain the table in the tab-delimited format. Previous search results in the tab-delimited format can be easily imported to the current search results. This is particularly useful when additional TFs of interest to the user are identified after an initial search.
Users may display the hits along the promoter sequence, where the −log p-value of each hit is used as the height to plot a box. This allows easy visualization of the predicted binding sites by a model in the context of those by the other models. Finally, the hits can be saved in GFF (general feature format) or the bedGraph format for visualization in the UCSC Genome Browser [19]. Links are provided for each promoter sequence to automatically create a custom track and redirect users to the UCSC Genome Browser. Figure 2.3.3 shows a custom track of putative binding sites predicted by LASAGNA-Search in the context of 4 other relevant tracks.
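Exporting a hit for the genome browser requires converting its 0-based position within the promoter into genome coordinates. The sketch below writes one bedGraph-style data line (chromosome, 0-based start, exclusive end, value). It assumes the promoter sequence is given on the forward genomic strand with a known 1-based start coordinate, uses −log10 of the p-value as the value, and floors p-values reported as 0; the function name and the floor are assumptions, and this is not the export code of LASAGNA-Search.

```python
import math

# Assumption: p-values reported as 0 are floored before taking the logarithm.
P_FLOOR = 2.5e-5


def hit_to_bedgraph(chrom: str, promoter_start_1based: int, hit_pos_0based: int,
                    site_length: int, p_value: float) -> str:
    """Format one hit as a tab-separated bedGraph data line.

    bedGraph intervals are 0-based and half-open, so the 1-based promoter
    start is shifted by one before adding the hit's offset.
    """
    start = promoter_start_1based - 1 + hit_pos_0based
    end = start + site_length
    height = -math.log10(max(p_value, P_FLOOR))
    return f"{chrom}\t{start}\t{end}\t{height:.3f}"
```

For instance, a 6-base hit at 0-based position 125 of a promoter starting at chr6:135501503 with p-value 0.000125 maps to the interval 135501627 to 135501633 with a height of 3.903. Promoters of genes on the reverse strand would need the offset to be counted from the other end of the sequence, which this sketch does not handle.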
The automatically inferred GRN can be displayed and manipulated by clicking the tab named “Gene regulatory network”. To produce a sparser network, users may set a more stringent p-value than the one used to filter hits. Users may show only nodes belonging to one or more species listed under “Filter by species”. Figure 2.2.3a shows the network after restricting the species to Homo sapiens. Users may choose to display the TF coding genes by checking “Map TFs to coding genes”. Figure 2.2.3b shows the resulting network. While 6 nodes are present in the GRN in Figure 2.2.3b, there are essentially only 2 genes and their products in the network. When a GRN involves more genes, it may be desirable to simplify the GRN, replacing the TF models with their respective coding genes. Figure 2.2.3c displays the simplified two-node GRN generated by checking “Simple network”. We note that a GRN can be simplified only after the TF models are mapped to coding genes.
2.3.2 Comparison of Features to Existing Webtools
LASAGNA-Search was designed to allow users to scan promoters for TFBSs without leaving the LASAGNA-Search page. Many features of LASAGNA-Search were developed for user convenience. Hence, without knowledge of PWM or TFBS databases and promoter sequence retrieval tools, users can start searching for binding sites in a promoter sequence and visualize the hits in the UCSC Genome Browser immediately. There are several integrative TFBS search webtools available. It is useful to compare LASAGNA-Search with the existing webtools to better understand the advantages and disadvantages of LASAGNA-Search and suggest future work to improve LASAGNA-Search. Table 2.3.10 summarizes the comparison of LASAGNA-Search to matrix-scan and the search engine of the MAPPER2 database for TFBS search.
In terms of input TF models, LASAGNA-Search and MAPPER2 have large libraries of TF models, while users need to collect PWMs before using matrix-scan. Users may input a PWM or unaligned TFBSs to LASAGNA-Search for model building. On the other hand, while matrix-scan accepts PWMs, neither matrix-scan nor MAPPER2 accepts unaligned TFBSs. For promoter sequences, all three tools accept sequences in FASTA, while matrix-scan handles sequences in 5 additional formats. Automatic sequence retrieval for matrix-scan is accomplished by interfacing with two tools, “retrieve sequence” and “retrieve EnsEMBL sequence”, on the same website. These two tools are capable of retrieving sequences in a wide range of species and can be used with any TFBS search tool. LASAGNA-Search and MAPPER2 offer integrated promoter retrieval tools supporting 15 and 3 species, respectively.
Visualization of predicted binding sites is usually tightly connected with the promoter sequence retrieval used by a tool. This is because to create a custom track in the UCSC Genome Browser, the genome build (release version) and coordinates in the genome must be known for a promoter sequence. For LASAGNA-Search and MAPPER2, hits found on any promoter sequences retrieved by the provided tool can be visualized with ease in the UCSC Genome Browser. Therefore, 15 and 3 species are supported by LASAGNA-Search and MAPPER2, respectively. Visualizing hits found by matrix-scan in the UCSC Genome Browser is possible only when the genome build and coordinates are specified in the FASTA header of the promoter sequence. Headers of sequences retrieved by the aforementioned two tools, however, do not contain the required information enabling visualization of hits in the UCSC Genome Browser.
Gene regulatory network inference from search results is only available at LASAGNA-Search among the three integrative webtools. PAINT [31] offers a similar function by integrating Match [40] in the TRANSFAC Public or Professional databases and promoter sequence retrieval for human, mouse and rat. Compared to PAINT, LASAGNA-Search contains 1792 TF models from 5 source databases and retrieves promoters for 15 species. The major difference between LASAGNA-Search and PAINT, however, is that LASAGNA-Search keeps track of the coding genes of TF models. This is an important feature because it allows visualization of self-regulation as self-loops and merging nodes for TF models coded by the same genes.
Finally, it is useful to compare LASAGNA-Search to other relevant webtools. The MEME Suite [2] offers web interfaces to 4 TFBS search tools with access to whole-genome promoter sequences. However, these tools have no access to the PWM database in the suite, nor do they scan promoters of specific genes and offer visualization of hits. Two tools motivated by evolutionary conservation are COTRASIF [82] and ReXSpecies2 [78]. COTRASIF collects 138 JASPAR and 398 TRANSFAC PWMs and offers whole-genome Ensembl promoter sequences. However, it does not allow selection of gene-specific promoter sequences nor does it offer visualization. ReXSpecies2, on the other hand, sources PWMs from JASPAR, scans promoters of specific genes and allows visualization in the UCSC Genome Browser. However, it focuses only on human and mouse, and selecting individual PWMs requires use of regular expression-like patterns.
2.3.3 Evaluation of Precomputed TF Models
Since MAPPER2 is the most comparable webtool to LASAGNA-Search, it is useful to compare the TF model collections offered by LASAGNA-Search and MAPPER2 on a whole-genome basis. As the MAPPER2 database stores, for each TF model, hits from scanning the 10Kbp upstream region of each transcript, we scanned the same sequences using TF models offered by LASAGNA-Search. We have no access to the profile hidden Markov models [20] used by MAPPER2 and the dynamic scanning interface offered by MAPPER2 was not functioning at the time of writing. Fortunately, MAPPER2 allows downloading of the top-1000 hits for each model. We hence limited the comparison to the top-1000 hits produced by each TF model.
To evaluate model performance, human and mouse ChIP-seq data from the ENCODE project [68] were used as the gold standard. Hence, we considered models for human and mouse TFs. The comparison was performed on a per-TF basis and all the TFs that can be validated were included. Tables 2.3.11 and 2.3.12 list the ChIP-seq tracks (experiments) by TF for human and mouse, respectively. We associated each TF with models that can be used to predict its binding sites. Each of the 1000 hits produced by a model was checked against the ChIP-seq peaks of the TF. Because ChIP-seq peaks are much longer than TFBSs, a hit is marked as a true positive if it is completely covered by a peak in at least one experiment. Otherwise, a hit is marked as a false positive.
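The labeling rule for hits can be sketched as below. The sketch assumes hits and peaks are (chromosome, start, end) triples in the same half-open coordinate system and uses a simple linear scan over peaks; the names is_true_positive and label_hits are hypothetical, and the performance measures computed from these labels are not reproduced here.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def is_true_positive(hit: Interval,
                     peaks_by_experiment: Sequence[Sequence[Interval]]) -> bool:
    """True if some peak in at least one experiment completely covers the hit."""
    chrom, start, end = hit
    return any(
        p_chrom == chrom and p_start <= start and end <= p_end
        for peaks in peaks_by_experiment
        for (p_chrom, p_start, p_end) in peaks
    )


def label_hits(hits: Sequence[Interval],
               peaks_by_experiment: Sequence[Sequence[Interval]]) -> List[bool]:
    """Label each hit of a model as a true positive (True) or false positive (False)."""
    return [is_true_positive(h, peaks_by_experiment) for h in hits]
```

A hit that only partially overlaps a peak is labeled a false positive under this rule. For genome-wide peak sets, sorting peaks or using an interval index would avoid the linear scan.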
Figure 2.3.1: Input Page of LASAGNA-Search
User interface of the LASAGNA-Search input page. (A) The input page when the “Enter known TFBSs” radio button is checked. (B) Selecting TF models by keyword search. The models matching keywords p53 and STAT3 are listed in the table. (C) Manual trimming of TFBS alignments. The automatically trimmed alignment of 11 NF-Y binding sites is presented as a PWM and a sequence logo for manual trimming. Users may choose to further trim the alignment or to recover previously trimmed columns by clicking the 4 buttons on the bottom. (D) Selecting promoters by keyword search. Promoters found by the keywords CCND1 and MYB are shown in the table. Only promoters belonging to Homo sapiens are shown.
[Figure 2.3.2 image: screenshot of the LASAGNA-Search result page (http://biogrid.engr.uconn.edu/lasagna_search/lasagna...) with the tabs “All results”, “Promoter view”, “Gene regulatory network”, “Import results” and “Inputs”. Title bars are shown for the CCND1 promoter (Homo sapiens, chr11, + strand, TSS 69455873, NM_053056) and the MYB promoter (Homo sapiens, chr6, + strand, TSS 135502453, NM_001130173). The MYB panel is expanded and contains the two tables below.]
Transcripts (all Homo sapiens, chr6, + strand, TSS 135502453, symbol MYB, range -950 to +50): NM_001130173, NM_001130172, NM_001161656, NM_001161657, NM_001161658, NM_001161659, NM_001161660, NM_005375
Hits on gi|224589818:135501503-135502502 Homo sapiens chromosome 6, GRCh37.p9 Primary Assembly:
Name               Sequence   Position (0-based)   Strand   Score   p-value    E-value
AML-1a (M00271)    AGCGGT     501                  -        7.31    0          0
AML-1a (M00271)    AGCGGT     275                  +        7.31    0          0
STAT5A (M00499)    GAGTTCTG   1                    +        8.85    0          0
AML-1a (M00271)    TGCGGT     125                  +        7.2     0.000125   0.124
Figure 2.3.2: Result Page of LASAGNA-Search
The LASAGNA-Search result page with the “Promoter view” tab showing. Users may examine the hits on individual promoter sequences by clicking the respective title bars. The content for the MYB promoter is shown. The first table lists the transcripts corresponding to this sequence. The second table displays the hits found on this promoter sequence.
[Figure 2.3.3 image: UCSC Genome Browser view of hg19 chr6:135,501,800-135,502,600 with the tracks “RefSeq Genes” (MYB), “LASAGNA-Search Predicted TFBSs”, “H1-hESC H3K4me3 Histone Mods by ChIP-seq Signal from ENCODE/Broad”, “K562 H3K4me3 Histone Mods by ChIP-seq Signal from ENCODE/Broad” and “GERP scores for mammalian alignments”. Labels in the predicted TFBS track: AP-2(M00189), c-Myb(M00183), Myb(MA0100.1), SRF(M00215), NF-kappaB(M00194), Tcfap2a(UP00005.Tcfap2a.secondary), Srf(UP00077.Srf.primary), NF-kappaB(M00208), NFKB1(MA0105.1), SRF(M00186), NF-kappaB(p65)(M00052), NF-kappaB(p50)(M00051), E2F(M00024), NF-kappaB(M00054), E2F1(MA0024.1), NF-kappaB(MA0061.1), E2F(M00050), E2F(M00516).]
Figure 2.3.3: Visualization of Hits in the UCSC Genome Browser
The hits were produced by scanning the 950bp upstream to 50bp downstream promoter region of human gene MYB with 27 TF models. Only significant hits (p-value ≤ 0.001) were retained. The hits in the GFF format are displayed in pack mode with 4 other tracks. Hits predicted by the same TF model are connected by a line. The RefSeq Genes track shows the relative hit positions to gene MYB. Two histone methylation tracks and the GERP (Genomic Evolutionary Rate Profiling) track [16] are also shown. The GERP score is a measure of evolutionary conservation and TFs are known to preferentially bind motif instances in conserved regions [14]. Histone methylation has been shown to be the most important factor in predicting the general binding preference of TFs [21].
Table 2.3.10
                        LASAGNA-Search              matrix-scan                      MAPPER2
TF Model
  User PWM              Yes                         Yes                              No
  User TFBSs            Yes, unaligned,             No                               Yes, aligned TFBSs.
                        variable-length TFBSs.
  Model collection      Yes, 1792 TFBS-based and    Not available                    Yes, 1017 TFBS-based
                        PWM-based models.                                            models.
Promoter Sequence
  Format                FASTA                       FASTA and 5 other formats.       FASTA
  Retrieval tool        Yes, built-in for 15        Yes, comprehensive species       Yes, built-in for 3
                        species.                    coverage by tools retrieve       species.
                                                    sequence and retrieve EnsEMBL
                                                    sequence on the same website.
Search Result
  Filtering             p-value                     p-value                          E-value
  Local visualization   Yes                         Yes                              Yes
  Visualization in      Yes, supports 15 species.   Limited. The build,              Yes, supports 3
  UCSC Genome Browser                               coordinates and orientation      species.
                                                    must be specified in the
                                                    FASTA sequence header
  GRN inference         Yes                         No                               No
Table 2.3.11: Human ENCODE Tracks Used for Validating TF Models
TF Antibody Track
ATF1 Atf106325 wgEncodeSydhTfbsK562Atf106325StdPk
ATF2 Atf2sc81188 wgEncodeHaibTfbsGm12878Atf2sc81188V0422111PkRep1
wgEncodeHaibTfbsGm12878Atf2sc81188V0422111PkRep2
wgEncodeHaibTfbsH1hescAtf2sc81188V0422111PkRep1
wgEncodeHaibTfbsH1hescAtf2sc81188V0422111PkRep2
ATF3 Atf3 wgEncodeHaibTfbsA549Atf3V0422111Etoh02PkRep1
wgEncodeHaibTfbsA549Atf3V0422111Etoh02PkRep2
wgEncodeHaibTfbsGm12878Atf3Pcr1xPkRep1
wgEncodeHaibTfbsGm12878Atf3Pcr1xPkRep2
wgEncodeHaibTfbsH1hescAtf3V0416102PkRep1
wgEncodeHaibTfbsH1hescAtf3V0416102PkRep2
wgEncodeHaibTfbsHct116Atf3V0422111PkRep1
wgEncodeHaibTfbsHct116Atf3V0422111PkRep2
wgEncodeHaibTfbsHepg2Atf3V0416101PkRep1
wgEncodeHaibTfbsHepg2Atf3V0416101PkRep2
wgEncodeHaibTfbsK562Atf3V0416101PkRep1
wgEncodeHaibTfbsK562Atf3V0416101PkRep2
wgEncodeSydhTfbsK562Atf3StdPk
CEBPB Cebpb wgEncodeSydhTfbsA549CebpbIggrabPk
wgEncodeSydhTfbsH1hescCebpbIggrabPk
wgEncodeSydhTfbsHelas3CebpbIggrabPk
wgEncodeSydhTfbsHepg2CebpbForsklnStdPk
wgEncodeSydhTfbsHepg2CebpbIggrabPk
wgEncodeSydhTfbsImr90CebpbIggrabPk
wgEncodeSydhTfbsK562CebpbIggrabPk
Cebpbsc150 wgEncodeHaibTfbsA549Cebpbsc150V0422111PkRep1
wgEncodeHaibTfbsA549Cebpbsc150V0422111PkRep2
wgEncodeHaibTfbsEcc1Cebpbsc150V0422111PkRep1
wgEncodeHaibTfbsEcc1Cebpbsc150V0422111PkRep2
wgEncodeHaibTfbsGm12878Cebpbsc150V0422111PkRep1
wgEncodeHaibTfbsGm12878Cebpbsc150V0422111PkRep2
wgEncodeHaibTfbsHct116Cebpbsc150V0422111PkRep1
wgEncodeHaibTfbsHct116Cebpbsc150V0422111PkRep2
wgEncodeHaibTfbsHepg2Cebpbsc150V0416101PkRep1
wgEncodeHaibTfbsHepg2Cebpbsc150V0416101PkRep2
wgEncodeHaibTfbsK562Cebpbsc150V0422111PkRep1
wgEncodeHaibTfbsK562Cebpbsc150V0422111PkRep2
wgEncodeHaibTfbsMcf7Cebpbsc150V0422111PkRep1
wgEncodeHaibTfbsMcf7Cebpbsc150V0422111PkRep2
CREB1 Creb1sc240 wgEncodeHaibTfbsA549Creb1sc240V0416102Dex100nmPkRep1
wgEncodeHaibTfbsA549Creb1sc240V0416102Dex100nmPkRep2
wgEncodeHaibTfbsEcc1Creb1sc240V0422111PkRep1
wgEncodeHaibTfbsEcc1Creb1sc240V0422111PkRep2
wgEncodeHaibTfbsGm12878Creb1sc240V0422111PkRep1
wgEncodeHaibTfbsGm12878Creb1sc240V0422111PkRep2
wgEncodeHaibTfbsH1hescCreb1sc240V0422111PkRep1
wgEncodeHaibTfbsH1hescCreb1sc240V0422111PkRep2
wgEncodeHaibTfbsHepg2Creb1sc240V0422111PkRep1
wgEncodeHaibTfbsHepg2Creb1sc240V0422111PkRep2
wgEncodeHaibTfbsK562Creb1sc240V0422111PkRep1
wgEncodeHaibTfbsK562Creb1sc240V0422111PkRep2
CUX1 Cdpsc6327 wgEncodeSydhTfbsGm12878Cdpsc6327IggmusPk
wgEncodeSydhTfbsK562Cdpsc6327IggrabPk
E2F1 E2f1 wgEncodeSydhTfbsHelas3E2f1StdPk
Hae2f1 wgEncodeSydhTfbsHelas3Hae2f1StdPk
wgEncodeSydhTfbsMcf7Hae2f1UcdPk
E2F4 E2f4 wgEncodeSydhTfbsGm12878E2f4IggmusPk
wgEncodeSydhTfbsHelas3E2f4StdPk
wgEncodeSydhTfbsK562E2f4UcdPk
wgEncodeSydhTfbsMcf10aesE2f4TamHvdPk
ELK1 Elk112771 wgEncodeSydhTfbsGm12878Elk112771IggmusPk
wgEncodeSydhTfbsHelas3Elk112771IggrabPk
wgEncodeSydhTfbsK562Elk112771IggrabPk
ELK4 Elk4 wgEncodeSydhTfbsHek293Elk4UcdPk
wgEncodeSydhTfbsHelas3Elk4UcdPk
EP300 P300 wgEncodeHaibTfbsA549P300V0422111Etoh02PkRep1
wgEncodeHaibTfbsA549P300V0422111Etoh02PkRep2
wgEncodeHaibTfbsEcc1P300V0422111PkRep1
wgEncodeHaibTfbsEcc1P300V0422111PkRep2
wgEncodeHaibTfbsGm12878P300Pcr1xPkRep1
wgEncodeHaibTfbsGm12878P300Pcr1xPkRep2
wgEncodeHaibTfbsH1hescP300V0416102PkRep1
wgEncodeHaibTfbsH1hescP300V0416102PkRep2
wgEncodeHaibTfbsHepg2P300V0416101PkRep1
wgEncodeHaibTfbsHepg2P300V0416101PkRep2
wgEncodeHaibTfbsMcf7P300V0422111PkRep1
wgEncodeHaibTfbsMcf7P300V0422111PkRep2
wgEncodeHaibTfbsSknshP300V0422111PkRep1
wgEncodeHaibTfbsSknshP300V0422111PkRep2
wgEncodeHaibTfbsSknshraP300V0416102PkRep1
wgEncodeHaibTfbsSknshraP300V0416102PkRep2
wgEncodeHaibTfbsT47dP300V0416102Dm002p1hPkRep1
wgEncodeHaibTfbsT47dP300V0416102Dm002p1hPkRep2
wgEncodeSydhTfbsGm12878P300IggmusPk
wgEncodeSydhTfbsK562P300IggrabPk
P300sc582 wgEncodeSydhTfbsHepg2P300sc582IggrabPk
P300sc584 wgEncodeSydhTfbsGm12878P300sc584IggmusPk
P300sc584sc48343 wgEncodeSydhTfbsK562P300sc584sc48343IggrabPk
P300sc584sc584 wgEncodeSydhTfbsHelas3P300sc584sc584IggrabPk
ESR1 Eraa wgEncodeHaibTfbsEcc1EraaV0416102Bpa1hPkRep1
wgEncodeHaibTfbsEcc1EraaV0416102Bpa1hPkRep2
wgEncodeHaibTfbsT47dEraaV0416102Bpa1hPkRep1
wgEncodeHaibTfbsT47dEraaV0416102Bpa1hPkRep2
Eralphaa wgEncodeHaibTfbsEcc1EralphaaV0416102Est10nm1hPkRep1
wgEncodeHaibTfbsEcc1EralphaaV0416102Est10nm1hPkRep2
wgEncodeHaibTfbsEcc1EralphaaV0416102Gen1hPkRep1
wgEncodeHaibTfbsEcc1EralphaaV0416102Gen1hPkRep2
wgEncodeHaibTfbsT47dEralphaaPcr2xGen1hPkRep1
wgEncodeHaibTfbsT47dEralphaaPcr2xGen1hPkRep2
wgEncodeHaibTfbsT47dEralphaaV0416102Est10nm1hPkRep1
wgEncodeHaibTfbsT47dEralphaaV0416102Est10nm1hPkRep2
ETS1 Ets1 wgEncodeHaibTfbsA549Ets1V0422111Etoh02PkRep1
wgEncodeHaibTfbsA549Ets1V0422111Etoh02PkRep2
wgEncodeHaibTfbsGm12878Ets1Pcr1xPkRep1
wgEncodeHaibTfbsGm12878Ets1Pcr1xPkRep1V2
wgEncodeHaibTfbsGm12878Ets1Pcr1xPkRep2
wgEncodeHaibTfbsGm12878Ets1Pcr1xPkRep2V2
wgEncodeHaibTfbsK562Ets1V0416101PkRep1
wgEncodeHaibTfbsK562Ets1V0416101PkRep2
FOS Cfos wgEncodeSydhTfbsGm12878CfosStdPk
wgEncodeSydhTfbsHelas3CfosStdPk
wgEncodeSydhTfbsHuvecCfosUcdPk
wgEncodeSydhTfbsK562CfosStdPk
wgEncodeSydhTfbsMcf10aesCfosEtoh01HvdPk
wgEncodeSydhTfbsMcf10aesCfosTam112hHvdPk
wgEncodeSydhTfbsMcf10aesCfosTam14hHvdPk
wgEncodeSydhTfbsMcf10aesCfosTamHvdPk
Efos wgEncodeUchicagoTfbsK562EfosControlPk
FOSL1 Fosl1 wgEncodeHaibTfbsHct116Fosl1V0422111PkRep1
wgEncodeHaibTfbsHct116Fosl1V0422111PkRep2
Fosl1sc183 wgEncodeHaibTfbsH1hescFosl1sc183V0416102PkRep1
wgEncodeHaibTfbsH1hescFosl1sc183V0416102PkRep2
wgEncodeHaibTfbsK562Fosl1sc183V0416101PkRep1
wgEncodeHaibTfbsK562Fosl1sc183V0416101PkRep2
FOXM1 Foxm1sc502 wgEncodeHaibTfbsEcc1Foxm1sc502V0422111PkRep1
wgEncodeHaibTfbsEcc1Foxm1sc502V0422111PkRep2
wgEncodeHaibTfbsGm12878Foxm1sc502V0422111PkRep1
wgEncodeHaibTfbsGm12878Foxm1sc502V0422111PkRep2
wgEncodeHaibTfbsMcf7Foxm1sc502V0422111PkRep1
wgEncodeHaibTfbsMcf7Foxm1sc502V0422111PkRep2
wgEncodeHaibTfbsSknshFoxm1sc502V0422111PkRep1
wgEncodeHaibTfbsSknshFoxm1sc502V0422111PkRep2
GABPA Gabp wgEncodeHaibTfbsA549GabpV0422111Etoh02PkRep1
wgEncodeHaibTfbsA549GabpV0422111Etoh02PkRep2
wgEncodeHaibTfbsGm12878GabpPcr2xPkRep1
wgEncodeHaibTfbsGm12878GabpPcr2xPkRep2
wgEncodeHaibTfbsH1hescGabpPcr1xPkRep1
wgEncodeHaibTfbsH1hescGabpPcr1xPkRep2
wgEncodeHaibTfbsHelas3GabpPcr1xPkRep1
wgEncodeHaibTfbsHelas3GabpPcr1xPkRep2
wgEncodeHaibTfbsHepg2GabpPcr2xPkRep1
wgEncodeHaibTfbsHepg2GabpPcr2xPkRep2
wgEncodeHaibTfbsHl60GabpV0422111PkRep1
wgEncodeHaibTfbsHl60GabpV0422111PkRep2
wgEncodeHaibTfbsK562GabpV0416101PkRep1
wgEncodeHaibTfbsK562GabpV0416101PkRep2
wgEncodeHaibTfbsMcf7GabpV0422111PkRep1
wgEncodeHaibTfbsMcf7GabpV0422111PkRep2
wgEncodeHaibTfbsSknshGabpV0422111PkRep1
wgEncodeHaibTfbsSknshGabpV0422111PkRep2
GATA3 Gata3 wgEncodeHaibTfbsA549Gata3V0422111PkRep1
wgEncodeHaibTfbsA549Gata3V0422111PkRep2
wgEncodeHaibTfbsMcf7Gata3V0422111PkRep1
wgEncodeHaibTfbsMcf7Gata3V0422111PkRep2
wgEncodeHaibTfbsSknshGata3V0422111PkRep1
wgEncodeHaibTfbsSknshGata3V0422111PkRep2
wgEncodeSydhTfbsMcf7Gata3UcdPk
Gata3sc268 wgEncodeHaibTfbsT47dGata3sc268V0416102Dm002p1hPkRep1
wgEncodeHaibTfbsT47dGata3sc268V0416102Dm002p1hPkRep2
Gata3sc269 wgEncodeSydhTfbsMcf7Gata3sc269UcdPk
Gata3sc269sc269 wgEncodeSydhTfbsShsy5yGata3sc269sc269UcdPk
HNF4A Hnf4a wgEncodeSydhTfbsHepg2Hnf4aForsklnStdPk
Hnf4asc8987 wgEncodeHaibTfbsHepg2Hnf4asc8987V0416101PkRep1
wgEncodeHaibTfbsHepg2Hnf4asc8987V0416101PkRep2
HSF1 Hsf1 wgEncodeSydhTfbsHepg2Hsf1ForsklnStdPk
IRF1 Irf1 wgEncodeSydhTfbsK562Irf1Ifna30StdPk
wgEncodeSydhTfbsK562Irf1Ifna6hStdPk
wgEncodeSydhTfbsK562Irf1Ifng30StdPk
wgEncodeSydhTfbsK562Irf1Ifng6hStdPk
JUN Cjun wgEncodeSydhTfbsH1hescCjunIggrabPk
wgEncodeSydhTfbsHelas3CjunIggrabPk
wgEncodeSydhTfbsHepg2CjunIggrabPk
wgEncodeSydhTfbsHuvecCjunStdPk
wgEncodeSydhTfbsK562CjunIfna30StdPk
wgEncodeSydhTfbsK562CjunIfna6hStdPk
wgEncodeSydhTfbsK562CjunIfng30StdPk
wgEncodeSydhTfbsK562CjunIfng6hStdPk
wgEncodeSydhTfbsK562CjunIggrabPk
wgEncodeSydhTfbsK562CjunStdPk
JUNB Ejunb wgEncodeUchicagoTfbsK562EjunbControlPk
JUND Ejund wgEncodeUchicagoTfbsK562EjundControlPk
Jund wgEncodeHaibTfbsA549JundV0416102Etoh02PkRep1
wgEncodeHaibTfbsA549JundV0416102Etoh02PkRep2
wgEncodeHaibTfbsH1hescJundV0416102PkRep1
wgEncodeHaibTfbsH1hescJundV0416102PkRep2
wgEncodeHaibTfbsHct116JundV0422111PkRep1
wgEncodeHaibTfbsHct116JundV0422111PkRep2
wgEncodeHaibTfbsHepg2JundPcr1xPkRep1
wgEncodeHaibTfbsHepg2JundPcr1xPkRep2
wgEncodeHaibTfbsMcf7JundV0422111PkRep1
wgEncodeHaibTfbsMcf7JundV0422111PkRep2
wgEncodeHaibTfbsSknshJundV0422111PkRep1
wgEncodeHaibTfbsSknshJundV0422111PkRep2
wgEncodeHaibTfbsT47dJundV0422111PkRep1
wgEncodeHaibTfbsT47dJundV0422111PkRep2
wgEncodeSydhTfbsGm12878JundIggrabPk
wgEncodeSydhTfbsGm12878JundStdPk
wgEncodeSydhTfbsH1hescJundIggrabPk
wgEncodeSydhTfbsHelas3JundIggrabPk
wgEncodeSydhTfbsHepg2JundIggrabPk
wgEncodeSydhTfbsK562JundIggrabPk
wgEncodeSydhTfbsSknshJundIggrabPk
MAX Max wgEncodeHaibTfbsA549MaxV0422111PkRep1
wgEncodeHaibTfbsA549MaxV0422111PkRep2
wgEncodeHaibTfbsEcc1MaxV0422111PkRep1
wgEncodeHaibTfbsEcc1MaxV0422111PkRep2
wgEncodeHaibTfbsH1hescMaxV0422111PkRep1
wgEncodeHaibTfbsH1hescMaxV0422111PkRep2
wgEncodeHaibTfbsHct116MaxV0422111PkRep1
wgEncodeHaibTfbsHct116MaxV0422111PkRep2
wgEncodeHaibTfbsHepg2MaxV0422111PkRep1
wgEncodeHaibTfbsHepg2MaxV0422111PkRep2
wgEncodeHaibTfbsK562MaxV0416102PkRep1
wgEncodeHaibTfbsK562MaxV0416102PkRep2
wgEncodeHaibTfbsMcf7MaxV0422111PkRep1
wgEncodeHaibTfbsMcf7MaxV0422111PkRep2
wgEncodeHaibTfbsSknshMaxV0422111PkRep1
wgEncodeHaibTfbsSknshMaxV0422111PkRep2
wgEncodeSydhTfbsA549MaxIggrabPk
wgEncodeSydhTfbsGm12878MaxIggmusPk
wgEncodeSydhTfbsGm12878MaxStdPk
wgEncodeSydhTfbsH1hescMaxUcdPk
wgEncodeSydhTfbsHelas3MaxIggrabPk
wgEncodeSydhTfbsHelas3MaxStdPk
wgEncodeSydhTfbsHepg2MaxIggrabPk
wgEncodeSydhTfbsHuvecMaxStdPk
wgEncodeSydhTfbsK562MaxIggrabPk
wgEncodeSydhTfbsK562MaxStdPk
wgEncodeSydhTfbsNb4MaxStdPk
MEF2A Mef2a wgEncodeHaibTfbsGm12878Mef2aPcr1xPkRep1
wgEncodeHaibTfbsGm12878Mef2aPcr1xPkRep2
wgEncodeHaibTfbsK562Mef2aV0416101PkRep1
wgEncodeHaibTfbsK562Mef2aV0416101PkRep2
wgEncodeHaibTfbsSknshMef2aV0422111PkRep1
wgEncodeHaibTfbsSknshMef2aV0422111PkRep2
MYC Cmyc wgEncodeSydhTfbsA549CmycIggrabPk
wgEncodeSydhTfbsH1hescCmycIggrabPk
wgEncodeSydhTfbsHelas3CmycStdPk
wgEncodeSydhTfbsK562CmycIfna30StdPk
wgEncodeSydhTfbsK562CmycIfna6hStdPk
wgEncodeSydhTfbsK562CmycIfng30StdPk
wgEncodeSydhTfbsK562CmycIfng6hStdPk
wgEncodeSydhTfbsK562CmycIggrabPk
wgEncodeSydhTfbsK562CmycStdPk
wgEncodeSydhTfbsMcf10aesCmycEtoh01HvdPk
wgEncodeSydhTfbsMcf10aesCmycTam14hHvdPk
wgEncodeSydhTfbsNb4CmycStdPk
NFATC1 Nfatc1sc17834 wgEncodeHaibTfbsGm12878Nfatc1sc17834V0422111PkRep1
wgEncodeHaibTfbsGm12878Nfatc1sc17834V0422111PkRep2
NFE2 Nfe2 wgEncodeSydhTfbsK562Nfe2StdPk
Nfe2sc22827 wgEncodeSydhTfbsGm12878Nfe2sc22827StdPk
NFIC Nficsc81335 wgEncodeHaibTfbsEcc1Nficsc81335V0422111PkRep1
wgEncodeHaibTfbsEcc1Nficsc81335V0422111PkRep2
wgEncodeHaibTfbsGm12878Nficsc81335V0422111PkRep1
wgEncodeHaibTfbsGm12878Nficsc81335V0422111PkRep2
wgEncodeHaibTfbsHepg2Nficsc81335V0422111PkRep1
wgEncodeHaibTfbsHepg2Nficsc81335V0422111PkRep2
wgEncodeHaibTfbsSknshNficsc81335V0422111PkRep1
wgEncodeHaibTfbsSknshNficsc81335V0422111PkRep2
NFYA Nfya wgEncodeSydhTfbsGm12878NfyaIggmusPk
wgEncodeSydhTfbsHelas3NfyaIggrabPk
wgEncodeSydhTfbsK562NfyaStdPk
NFYB Nfyb wgEncodeSydhTfbsGm12878NfybIggmusPk
wgEncodeSydhTfbsHelas3NfybIggrabPk
wgEncodeSydhTfbsK562NfybStdPk
NR2F2 Nr2f2sc271940 wgEncodeHaibTfbsHepg2Nr2f2sc271940V0422111PkRep1
wgEncodeHaibTfbsHepg2Nr2f2sc271940V0422111PkRep2
wgEncodeHaibTfbsK562Nr2f2sc271940V0422111PkRep1
wgEncodeHaibTfbsK562Nr2f2sc271940V0422111PkRep2
wgEncodeHaibTfbsMcf7Nr2f2sc271940V0422111PkRep1
wgEncodeHaibTfbsMcf7Nr2f2sc271940V0422111PkRep2
NR3C1 Gr wgEncodeHaibTfbsA549GrPcr1xDex500pmPkRep1
wgEncodeHaibTfbsA549GrPcr1xDex500pmPkRep2
wgEncodeHaibTfbsA549GrPcr1xDex50nmPkRep1
wgEncodeHaibTfbsA549GrPcr1xDex50nmPkRep2
wgEncodeHaibTfbsA549GrPcr1xDex5nmPkRep1
wgEncodeHaibTfbsA549GrPcr1xDex5nmPkRep2
wgEncodeHaibTfbsA549GrPcr2xDex100nmPkRep1
wgEncodeHaibTfbsA549GrPcr2xDex100nmPkRep2
wgEncodeHaibTfbsEcc1GrV0416102Dex100nmPkRep1
wgEncodeHaibTfbsEcc1GrV0416102Dex100nmPkRep2
Grp20 wgEncodeSydhTfbsHepg2Grp20ForsklnStdPk
PAX5 Pax5c20 wgEncodeHaibTfbsGm12878Pax5c20Pcr1xPkRep1
wgEncodeHaibTfbsGm12878Pax5c20Pcr1xPkRep2
wgEncodeHaibTfbsGm12891Pax5c20V0416101PkRep1
wgEncodeHaibTfbsGm12891Pax5c20V0416101PkRep2
wgEncodeHaibTfbsGm12892Pax5c20V0416101PkRep1
wgEncodeHaibTfbsGm12892Pax5c20V0416101PkRep2
Pax5n19 wgEncodeHaibTfbsGm12878Pax5n19Pcr1xPkRep1
wgEncodeHaibTfbsGm12878Pax5n19Pcr1xPkRep2
POU2F2 Pou2f2 wgEncodeHaibTfbsGm12878Pou2f2Pcr1xPkRep1
wgEncodeHaibTfbsGm12878Pou2f2Pcr1xPkRep2
wgEncodeHaibTfbsGm12878Pou2f2Pcr1xPkRep3
wgEncodeHaibTfbsGm12891Pou2f2Pcr1xPkRep1
wgEncodeHaibTfbsGm12891Pou2f2Pcr1xPkRep2
RELA Nfkb wgEncodeSydhTfbsGm10847NfkbTnfaIggrabPk
wgEncodeSydhTfbsGm12878NfkbTnfaIggrabPk
wgEncodeSydhTfbsGm12891NfkbTnfaIggrabPk
wgEncodeSydhTfbsGm12892NfkbTnfaIggrabPk
wgEncodeSydhTfbsGm15510NfkbTnfaIggrabPk
wgEncodeSydhTfbsGm18505NfkbTnfaIggrabPk
wgEncodeSydhTfbsGm18526NfkbTnfaIggrabPk
wgEncodeSydhTfbsGm18951NfkbTnfaIggrabPk
wgEncodeSydhTfbsGm19099NfkbTnfaIggrabPk
wgEncodeSydhTfbsGm19193NfkbTnfaIggrabPk
RXRA Rxra wgEncodeHaibTfbsGm12878RxraPcr1xPkRep1
wgEncodeHaibTfbsGm12878RxraPcr1xPkRep2
wgEncodeHaibTfbsH1hescRxraV0416102PkRep1
wgEncodeHaibTfbsH1hescRxraV0416102PkRep2
wgEncodeHaibTfbsHepg2RxraPcr1xPkRep1
wgEncodeHaibTfbsHepg2RxraPcr1xPkRep2
wgEncodeHaibTfbsSknshRxraV0422111PkRep1
wgEncodeHaibTfbsSknshRxraV0422111PkRep2
SP1 Sp1 wgEncodeHaibTfbsA549Sp1V0422111Etoh02PkRep1
wgEncodeHaibTfbsA549Sp1V0422111Etoh02PkRep2
wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1
wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep2
wgEncodeHaibTfbsH1hescSp1Pcr1xPkRep1
wgEncodeHaibTfbsH1hescSp1Pcr1xPkRep2
wgEncodeHaibTfbsHct116Sp1V0422111PkRep1
wgEncodeHaibTfbsHct116Sp1V0422111PkRep2
wgEncodeHaibTfbsHepg2Sp1Pcr1xPkRep1
wgEncodeHaibTfbsHepg2Sp1Pcr1xPkRep2
wgEncodeHaibTfbsK562Sp1Pcr1xPkRep1
wgEncodeHaibTfbsK562Sp1Pcr1xPkRep2
SPI1 Pu1 wgEncodeHaibTfbsGm12878Pu1Pcr1xPkRep1
wgEncodeHaibTfbsGm12878Pu1Pcr1xPkRep2
wgEncodeHaibTfbsGm12878Pu1Pcr1xPkRep3
wgEncodeHaibTfbsGm12891Pu1Pcr1xPkRep1
wgEncodeHaibTfbsGm12891Pu1Pcr1xPkRep2
wgEncodeHaibTfbsHl60Pu1V0422111PkRep1
wgEncodeHaibTfbsHl60Pu1V0422111PkRep2
wgEncodeHaibTfbsK562Pu1Pcr1xPkRep1
wgEncodeHaibTfbsK562Pu1Pcr1xPkRep2
SREBF1 Srebp1 wgEncodeSydhTfbsGm12878Srebp1IggrabPk
wgEncodeSydhTfbsHepg2Srebp1InslnStdPk
wgEncodeSydhTfbsHepg2Srebp1PravastStdPk
SRF Srf wgEncodeHaibTfbsEcc1SrfV0422111PkRep1
wgEncodeHaibTfbsEcc1SrfV0422111PkRep2
wgEncodeHaibTfbsGm12878SrfPcr2xPkRep1
wgEncodeHaibTfbsGm12878SrfPcr2xPkRep2
wgEncodeHaibTfbsGm12878SrfV0416101PkRep1
wgEncodeHaibTfbsGm12878SrfV0416101PkRep2
wgEncodeHaibTfbsH1hescSrfPcr1xPkRep1
wgEncodeHaibTfbsH1hescSrfPcr1xPkRep2
wgEncodeHaibTfbsHct116SrfV0422111PkRep1
wgEncodeHaibTfbsHct116SrfV0422111PkRep2
wgEncodeHaibTfbsHepg2SrfV0416101PkRep1
wgEncodeHaibTfbsHepg2SrfV0416101PkRep2
wgEncodeHaibTfbsK562SrfV0416101PkRep1
wgEncodeHaibTfbsK562SrfV0416101PkRep2
wgEncodeHaibTfbsMcf7SrfV0422111PkRep1
wgEncodeHaibTfbsMcf7SrfV0422111PkRep2
STAT1 Stat1 wgEncodeSydhTfbsGm12878Stat1StdPk
wgEncodeSydhTfbsHelas3Stat1Ifng30StdPk
wgEncodeSydhTfbsK562Stat1Ifna30StdPk
wgEncodeSydhTfbsK562Stat1Ifna6hStdPk
wgEncodeSydhTfbsK562Stat1Ifng30StdPk
wgEncodeSydhTfbsK562Stat1Ifng6hStdPk
STAT2 Stat2 wgEncodeSydhTfbsK562Stat2Ifna30StdPk
wgEncodeSydhTfbsK562Stat2Ifna6hStdPk
STAT3 Stat3 wgEncodeSydhTfbsGm12878Stat3IggmusPk
wgEncodeSydhTfbsHelas3Stat3IggrabPk
wgEncodeSydhTfbsMcf10aesStat3Etoh01StdPk
wgEncodeSydhTfbsMcf10aesStat3Etoh01bStdPk
wgEncodeSydhTfbsMcf10aesStat3Etoh01cStdPk
wgEncodeSydhTfbsMcf10aesStat3Tam112hHvdPk
wgEncodeSydhTfbsMcf10aesStat3TamStdPk
STAT5A Stat5asc74442 wgEncodeHaibTfbsGm12878Stat5asc74442V0422111PkRep1
wgEncodeHaibTfbsGm12878Stat5asc74442V0422111PkRep2
wgEncodeHaibTfbsK562Stat5asc74442V0422111PkRep1
wgEncodeHaibTfbsK562Stat5asc74442V0422111PkRep2
TBP Tbp wgEncodeSydhTfbsGm12878TbpIggmusPk
wgEncodeSydhTfbsH1hescTbpIggrabPk
wgEncodeSydhTfbsHelas3TbpIggrabPk
wgEncodeSydhTfbsHepg2TbpIggrabPk
wgEncodeSydhTfbsK562TbpIggmusPk
TCF3 Tcf3 wgEncodeHaibTfbsGm12878Tcf3Pcr1xPkRep1
wgEncodeHaibTfbsGm12878Tcf3Pcr1xPkRep2
USF1 Usf1 wgEncodeHaibTfbsA549Usf1Pcr1xDex100nmPkRep1
wgEncodeHaibTfbsA549Usf1Pcr1xDex100nmPkRep2
wgEncodeHaibTfbsA549Usf1Pcr1xEtoh02PkRep1
wgEncodeHaibTfbsA549Usf1Pcr1xEtoh02PkRep2
wgEncodeHaibTfbsA549Usf1V0422111Etoh02PkRep1
wgEncodeHaibTfbsA549Usf1V0422111Etoh02PkRep2
wgEncodeHaibTfbsEcc1Usf1V0422111PkRep1
wgEncodeHaibTfbsEcc1Usf1V0422111PkRep2
wgEncodeHaibTfbsGm12878Usf1Pcr2xPkRep1
wgEncodeHaibTfbsGm12878Usf1Pcr2xPkRep2
wgEncodeHaibTfbsH1hescUsf1Pcr1xPkRep1
wgEncodeHaibTfbsH1hescUsf1Pcr1xPkRep2
wgEncodeHaibTfbsHct116Usf1V0422111PkRep1
wgEncodeHaibTfbsHct116Usf1V0422111PkRep2
wgEncodeHaibTfbsHepg2Usf1Pcr1xPkRep1
wgEncodeHaibTfbsHepg2Usf1Pcr1xPkRep2
wgEncodeHaibTfbsK562Usf1V0416101PkRep1
wgEncodeHaibTfbsK562Usf1V0416101PkRep2
wgEncodeHaibTfbsSknshUsf1V0422111PkRep1
wgEncodeHaibTfbsSknshUsf1V0422111PkRep2
Usf1sc8983 wgEncodeHaibTfbsSknshraUsf1sc8983V0416102PkRep1
wgEncodeHaibTfbsSknshraUsf1sc8983V0416102PkRep2
YY1 Yy1 wgEncodeHaibTfbsGm12892Yy1V0416101PkRep1
wgEncodeHaibTfbsGm12892Yy1V0416101PkRep2
wgEncodeHaibTfbsK562Yy1V0416101PkRep1
wgEncodeHaibTfbsK562Yy1V0416101PkRep2
wgEncodeHaibTfbsK562Yy1V0416102PkRep1
wgEncodeHaibTfbsK562Yy1V0416102PkRep2
wgEncodeSydhTfbsGm12878Yy1StdPk
wgEncodeSydhTfbsK562Yy1UcdPk
wgEncodeSydhTfbsNt2d1Yy1UcdPk
Yy1sc281 wgEncodeHaibTfbsEcc1Yy1sc281V0422111PkRep1
wgEncodeHaibTfbsEcc1Yy1sc281V0422111PkRep2
wgEncodeHaibTfbsGm12878Yy1sc281Pcr1xPkRep1
wgEncodeHaibTfbsGm12878Yy1sc281Pcr1xPkRep2
wgEncodeHaibTfbsGm12891Yy1sc281V0416101PkRep1
wgEncodeHaibTfbsGm12891Yy1sc281V0416101PkRep2
wgEncodeHaibTfbsH1hescYy1sc281V0416102PkRep1
wgEncodeHaibTfbsH1hescYy1sc281V0416102PkRep2
wgEncodeHaibTfbsHct116Yy1sc281V0416101PkRep1
wgEncodeHaibTfbsHct116Yy1sc281V0416101PkRep2
wgEncodeHaibTfbsHepg2Yy1sc281V0416101PkRep1
wgEncodeHaibTfbsHepg2Yy1sc281V0416101PkRep2
wgEncodeHaibTfbsK562Yy1sc281V0416101PkRep1
wgEncodeHaibTfbsK562Yy1sc281V0416101PkRep2
wgEncodeHaibTfbsSknshYy1sc281V0422111PkRep1
wgEncodeHaibTfbsSknshYy1sc281V0422111PkRep2
wgEncodeHaibTfbsSknshraYy1sc281V0416102PkRep1
wgEncodeHaibTfbsSknshraYy1sc281V0416102PkRep2
ZEB1 Zeb1 wgEncodeHaibTfbsHepg2Zeb1V0422111PkRep1
wgEncodeHaibTfbsHepg2Zeb1V0422111PkRep2
Zeb1sc25388 wgEncodeHaibTfbsGm12878Zeb1sc25388V0416102PkRep1
wgEncodeHaibTfbsGm12878Zeb1sc25388V0416102PkRep2
Table 2.3.12: Mouse ENCODE Tracks Used for Validating TF Models
TF Antibody Track
Cebpb Cebpb wgEncodeCaltechTfbsC2c12CebpbFCntrl50bE2p60hPcr1xPkRep1
wgEncodeCaltechTfbsC2c12CebpbFCntrl50bE2p60hPcr1xPkRep2
wgEncodeCaltechTfbsC2c12CebpbFCntrl50bPcr1xPkRep1
Ets1 Ets1 wgEncodeSydhTfbsCh12Ets1IggrabPk
wgEncodeSydhTfbsMelEts1IggrabPk
Gata1 Gata1 wgEncodeSydhTfbsMelGata1Dm2p5dStdPk
wgEncodeSydhTfbsMelGata1IggratPk
Jun Cjun wgEncodeSydhTfbsCh12CjunIggrabPk
Jund Jund wgEncodeSydhTfbsCh12JundIggrabPk
wgEncodeSydhTfbsMelJundIggrabPk
Mafk Mafk wgEncodeSydhTfbsEse14MafkStdPk
wgEncodeSydhTfbsMelMafkDm2p5dStdPk
Mafkab50322 wgEncodeSydhTfbsCh12Mafkab50322IggrabPk
wgEncodeSydhTfbsMelMafkab50322IggrabPk
Myb Cmybh141 wgEncodeSydhTfbsMelCmybh141IggrabPk
Cmybsc7874 wgEncodeSydhTfbsMelCmybsc7874IggrabPk
Myod1 Sc32758 wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bE2p24hPcr2xPkRep1
wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bE2p60hPcr2xPkRep1
wgEncodeCaltechTfbsC2c12Sc32758FCntrl32bPcr2xPkRep1
wgEncodeCaltechTfbsC2c12Sc32758FCntrl50bE2p7dPcr1xPkRep1
Myog Sc12732 wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bE2p24hPcr2xPkRep1
wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bE2p60hPcr2xPkRep1
wgEncodeCaltechTfbsC2c12Sc12732FCntrl32bPcr2xPkRep1
wgEncodeCaltechTfbsC2c12Sc12732FCntrl50bE2p7dPcr1xPkRep1
Pax5 Pax5c wgEncodePsuTfbsCh12Pax5cFImmortal2a4bInputPk
Srf Srf wgEncodeCaltechTfbsC2c12SrfFCntrl32bE2p24hPcr2xPkRep1
Tcf3 Tcf3 wgEncodeCaltechTfbsC2c12Tcf3FCntrl32bE2p5dPcr2xPkRep1
Usf1 Usf1 wgEncodeCaltechTfbsC2c12Usf1FCntrl50bE2p60hPcr1xPkRep1
wgEncodeCaltechTfbsC2c12Usf1FCntrl50bPcr1xPkRep1
Evaluating a model based on the top-1000 hits is analogous to evaluating a search engine based on the top-1000 documents. Therefore, we used average precision [84] to score a model. This performance measure is widely used in the information retrieval community and is defined as $\sum_{k=1}^{1000} P(k) \times tp(k) / c$. P(k) gives the precision based on the top k hits (the fraction of the top k hits that are true positives). The indicator tp(k) is 1 if hit k is a true positive and 0 otherwise. The denominator c is the portion of upstream regions covered by peaks, in bases, and was computed based on all the ChIP-seq experiments used to validate the model. We also scored each model by accuracy, which is equivalent to P(1000).
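As an illustration, the two scores defined above can be sketched in Python. This is a minimal sketch and not the code used in this work: the function names `average_precision` and `accuracy`, the `top_n` parameter, and the toy input are hypothetical, and the denominator c is assumed to be computed elsewhere and passed in by the caller.

```python
from typing import Sequence


def average_precision(tp: Sequence[bool], c: float, top_n: int = 1000) -> float:
    """Return sum_{k=1}^{top_n} P(k) * tp(k) / c for a ranked list of hits.

    tp[k-1] is True when hit k (1-based rank) is a true positive.
    c is the normalising denominator described in the text; in this
    sketch it is simply supplied by the caller (an assumption).
    """
    if c <= 0:
        raise ValueError("c must be positive")
    total = 0.0
    true_positives_so_far = 0
    for k, is_tp in enumerate(tp[:top_n], start=1):
        if is_tp:
            true_positives_so_far += 1
            # P(k) = true positives among the top k hits, divided by k;
            # hits with tp(k) = 0 contribute nothing to the sum.
            total += true_positives_so_far / k
    return total / c


def accuracy(tp: Sequence[bool], top_n: int = 1000) -> float:
    """Return the fraction of the top_n hits that are true positives.

    This equals P(top_n) when at least top_n hits are available.
    """
    top = tp[:top_n]
    return sum(1 for is_tp in top if is_tp) / len(top) if top else 0.0


# Toy ranked list (hypothetical data, not taken from the dissertation).
ranked = [True, False, True, True]
print(average_precision(ranked, c=2.0))
print(accuracy(ranked))
```

Only true-positive hits contribute to the sum, so a model that ranks its true positives early receives a higher score than one with the same accuracy but later true positives.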
The performance of LASAGNA-Search on a TF was measured by the average score of the associated models, and likewise for MAPPER2. Results for LASAGNA-Search are listed in Tables 2.3.13 and 2.3.14, while results for MAPPER2 are listed in Tables 2.3.15 and 2.3.16. Average precision and accuracy are given in individual columns. Each row presents the performance of a model in predicting the binding sites of a TF. Figure 2.3.4 shows the comparison between LASAGNA-Search and MAPPER2 in terms of average precision. The same comparison in terms of accuracy is shown in Figure 2.3.5.
Table 2.3.13: Performance of LASAGNA-Search TF Models Validated by ENCODE Human ChIP-seq Data
TF Collection Model ID Model Name Accuracy Ave Prec1
ATF1 TRANSFAC (PWM-based) M00017 ATF 0.339 5.973
ATF1 TRANSFAC (TFBS-based) T00051 ATF 0.374 4.895
ATF3 TRANSFAC (PWM-based) M00017 ATF 0.279 2.554
ATF3 TRANSFAC (TFBS-based) T00051 ATF 0.302 2.184
CEBPB TRANSFAC (PWM-based) M00117 C/EBPbeta 0.484 2.079
CEBPB TRANSFAC (PWM-based) M00109 C/EBPbeta 0.364 1.196
CEBPB TRANSFAC (TFBS-based) T00581 NF-IL6-2 0.12 0.098
CREB1 TRANSFAC (PWM-based) M00178 CREB 0.499 3.007
CREB1 TRANSFAC (PWM-based) M00113 CREB 0.409 2.301
CREB1 TRANSFAC (PWM-based) M00115 Tax/CREB 0.095 2.301
CREB1 TRANSFAC (PWM-based) M00114 Tax/CREB 0.222 2.006
CREB1 TRANSFAC (PWM-based) M00177 CREB 0.443 1.925
CREB1 JASPAR CORE MA0018.1 CREB1 0.309 1.250
CREB1 TRANSFAC (PWM-based) M00039 CREB 0.482 1.019
CREB1 JASPAR CORE MA0018.2 CREB1 0.482 0.517
CREB1 TRANSFAC (TFBS-based) T00163 CREB 0.348 0.106
ATF2 TRANSFAC (PWM-based) M00179 CRE-BP1 0.399 3.058
ATF2 TRANSFAC (PWM-based) M00017 ATF 0.327 3.046
ATF2 TRANSFAC (TFBS-based) T00051 ATF 0.371 2.345
ATF2 TRANSFAC (TFBS-based) T00167 ATF-2 0.444 2.136
ATF2 TRANSFAC (PWM-based) M00041 CRE-BP1:c-Jun 0.379 1.822
ATF2 TRANSFAC (PWM-based) M00040 CRE-BP1 0.124 0.198
CUX1 TRANSFAC (PWM-based) M00095 CDP 0.054 0.883
CUX1 TRANSFAC (PWM-based) M00104 CDP CR1 0.193 0.526
CUX1 TRANSFAC (PWM-based) M00102 CDP 0.149 0.412
CUX1 TRANSFAC (PWM-based) M00105 CDP CR3 0.034 0.083
CUX1 TRANSFAC (PWM-based) M00106 CDP CR3+HD 0.151 0.028
E2F1 TRANSFAC (TFBS-based) T01542 E2F-1 0.324 2.167
E2F1 TRANSFAC (PWM-based) M00024 E2F 0.185 2.122
E2F1 TRANSFAC (TFBS-based) T00221 E2F 0.326 1.878
E2F1 TRANSFAC (PWM-based) M00516 E2F 0.312 1.493
E2F1 TRANSFAC (PWM-based) M00050 E2F 0.317 1.493
E2F1 JASPAR CORE MA0024.1 E2F1 0.317 0.758
E2F4 TRANSFAC (PWM-based) M00024 E2F 0.221 2.186
E2F4 TRANSFAC (TFBS-based) T00221 E2F 0.366 1.780
E2F4 TRANSFAC (PWM-based) M00516 E2F 0.33 1.295
E2F4 TRANSFAC (PWM-based) M00050 E2F 0.325 0.932
ELK1 JASPAR CORE MA0028.1 ELK1 0.174 2.392
ELK1 TRANSFAC (TFBS-based) T00250 Elk-1 0.152 0.946
ELK1 TRANSFAC (PWM-based) M00007 Elk-1 0.157 0.587
ELK1 TRANSFAC (PWM-based) M00025 Elk-1 0.346 0.501
ELK4 JASPAR CORE MA0076.1 ELK4 0.322 6.723
EP300 TRANSFAC (PWM-based) M00033 p300 0.15 0.150
ESR1 TRANSFAC (TFBS-based) T00261 ER-alpha 0.057 9.634
ESR1 JASPAR CORE MA0112.1 ESR1 0.148 3.845
ESR1 JASPAR CORE MA0112.2 ESR1 0.264 2.198
ESR1 TRANSFAC (PWM-based) M00191 ER 0.124 0.707
ETS1 JASPAR CORE MA0098.1 ETS1 0.06 0.063
FOXM1 ORegAnno ORegAnno 9606 32 FOXM1 (HNF3) 0.068 0.064
FOS TRANSFAC (TFBS-based) T00123 c-Fos 0.475 4.708
FOS UniPROBE UP00425.Jun+Fos Heterodimer.primary Jun/Fos Heterodimer 0.649 3.756
FOS TRANSFAC (PWM-based) M00188 AP-1 0.4 2.827
FOS TRANSFAC (TFBS-based) T00029 AP-1 0.502 2.807
FOS TRANSFAC (PWM-based) M00172 AP-1 0.379 2.720
FOS TRANSFAC (PWM-based) M00517 AP-1 0.603 2.603
FOS TRANSFAC (PWM-based) M00173 AP-1 0.377 1.814
FOS TRANSFAC (PWM-based) M00199 AP-1 0.509 1.664
FOS JASPAR CORE MA0099.2 AP1 0.365 1.497
FOS TRANSFAC (PWM-based) M00174 AP-1 0.52 1.399
GABPA JASPAR CORE MA0062.1 GABPA 0.539 3.685
GATA1 TRANSFAC (PWM-based) M00346 GATA-1 0.056 0.213
GATA1 TRANSFAC (PWM-based) M00126 GATA-1 0.05 0.158
GATA1 TRANSFAC (PWM-based) M00128 GATA-1 0.073 0.103
GATA1 TRANSFAC (PWM-based) M00203 GATA-X 0.091 0.092
GATA1 TRANSFAC (TFBS-based) T00306 GATA-1 0.064 0.072
GATA1 TRANSFAC (PWM-based) M00127 GATA-1 0.032 0.025
GATA3 TRANSFAC (PWM-based) M00203 GATA-X 0.104 0.275
GATA3 TRANSFAC (PWM-based) M00077 GATA-3 0.155 0.127
GATA3 TRANSFAC (TFBS-based) T00311 GATA-3 0.068 0.095
GATA3 JASPAR CORE MA0037.1 GATA3 0.095 0.063
NR3C1 JASPAR CORE MA0113.1 NR3C1 0.125 1.158
NR3C1 TRANSFAC (TFBS-based) T01920 GR-beta 0.147 1.158
NR3C1 TRANSFAC (TFBS-based) T00337 GR-alpha 0.147 1.007
NR3C1 TRANSFAC (PWM-based) M00192 GR 0.106 0.541
NR3C1 TRANSFAC (PWM-based) M00205 GR 0.093 0.409
HNF4A JASPAR CORE MA0114.1 HNF4A 0.216 2.664
HNF4A ORegAnno ORegAnno 9606 8 HNF4A — HNF4A (HNF4) 0.073 2.477
HNF4A TRANSFAC (PWM-based) M00158 COUP-TF, HNF-4 0.169 1.750
HNF4A TRANSFAC (PWM-based) M00134 HNF-4 0.222 0.490
HSF1 TRANSFAC (PWM-based) M00146 HSF1 0.005 0.045
IRF1 JASPAR CORE MA0050.1 IRF1 0.223 1.186
IRF1 TRANSFAC (TFBS-based) T00423 IRF-1 0.24 1.181
IRF1 TRANSFAC (PWM-based) M00062 IRF-1 0.149 0.250
JUN UniPROBE UP00425.Jun+Fos Heterodimer.primary Jun/Fos Heterodimer 0.387 2.923
JUN TRANSFAC (PWM-based) M00517 AP-1 0.3 2.225
JUN TRANSFAC (PWM-based) M00188 AP-1 0.186 1.196
JUN TRANSFAC (TFBS-based) T00133 c-Jun 0.125 0.842
JUN TRANSFAC (TFBS-based) T00029 AP-1 0.2 0.771
JUN TRANSFAC (PWM-based) M00172 AP-1 0.167 0.542
JUN TRANSFAC (PWM-based) M00199 AP-1 0.247 0.487
JUN TRANSFAC (PWM-based) M00041 CRE-BP1:c-Jun 0.481 0.447
JUN ORegAnno ORegAnno 9606 103 AP1 (Jun family) CRE — Ap1(JUN) — JUN — AP1 — AP-1 0.188 0.447
JUN JASPAR CORE MA0099.2 AP1 0.188 0.408
JUN TRANSFAC (PWM-based) M00173 AP-1 0.169 0.360
JUN TRANSFAC (PWM-based) M00174 AP-1 0.268 0.244
JUNB TRANSFAC (PWM-based) M00517 AP-1 0.26 1.291
JUND TRANSFAC (PWM-based) M00517 AP-1 0.627 1.688
MAX JASPAR CORE MA0058.1 MAX 0.241 1.409
MAX TRANSFAC (PWM-based) M00118 c-Myc:Max 0.508 1.205
MAX JASPAR CORE MA0059.1 MYC::MAX 0.568 0.883
MAX TRANSFAC (PWM-based) M00615 c-Myc:Max 0.384 0.760
MAX TRANSFAC (PWM-based) M00123 c-Myc:Max 0.427 0.626
MAX TRANSFAC (PWM-based) M00119 Max 0.452 0.281
MAZ TRANSFAC (TFBS-based) T00490 MAZ 0.267 0.489
MEF2A TRANSFAC (PWM-based) M00006 MEF-2 0.055 2.325
MEF2A TRANSFAC (PWM-based) M00231 MEF-2 0.159 1.889
MEF2A TRANSFAC (PWM-based) M00026 RSRFC4 0.137 1.705
MEF2A TRANSFAC (PWM-based) M00233 MEF-2 0.057 0.565
MEF2A TRANSFAC (PWM-based) M00232 MEF-2 0.146 0.374
MEF2A JASPAR CORE MA0052.1 MEF2A 0.091 0.324
MEF2A TRANSFAC (TFBS-based) T01005 MEF-2A 0.059 0.266
MYC TRANSFAC (TFBS-based) T00140 c-Myc 0.314 0.602
MYC TRANSFAC (PWM-based) M00123 c-Myc:Max 0.272 0.551
MYC TRANSFAC (PWM-based) M00118 c-Myc:Max 0.302 0.432
MYC TRANSFAC (PWM-based) M00615 c-Myc:Max 0.242 0.352
NFATC1 TRANSFAC (TFBS-based) T01945 NF-AT2 0.023 0.023
NFATC1 TRANSFAC (PWM-based) M00302 NF-AT 0.033 0.012
NFE2 TRANSFAC (PWM-based) M00037 NF-E2 0.095 3.073
NFIC TRANSFAC (PWM-based) M00193 NF-1 0.245 1.204
NFIC JASPAR CORE MA0119.1 TLX1::NFIC 0.345 1.044
NFIC TRANSFAC (PWM-based) M00056 myogenin / NF-1 0.139 0.641
NFIC TRANSFAC (TFBS-based) T00539 NF-1 0.326 0.547
NFIC ORegAnno ORegAnno 9606 34 NFIC (NF-I) — nf1 — NFI — NFIC 0.242 0.440
NFIC TRANSFAC (TFBS-based) T00174 CTF 0.207 0.211
NFIC JASPAR CORE MA0161.1 NFIC 0.077 0.048
NFYA TRANSFAC (TFBS-based) T00150 NF-Y 0.09 9.976
NFYA TRANSFAC (PWM-based) M00287 NF-Y 0.502 6.903
NFYA TRANSFAC (PWM-based) M00185 NF-Y 0.223 2.700
NFYA TRANSFAC (PWM-based) M00209 NF-Y 0.27 1.914
NFYA JASPAR CORE MA0060.1 NFYA 0.428 0.416
NFYB TRANSFAC (TFBS-based) T00150 NF-Y 0.124 13.874
NFYB TRANSFAC (PWM-based) M00287 NF-Y 0.643 4.967
NFYB TRANSFAC (PWM-based) M00209 NF-Y 0.393 3.195
NFYB TRANSFAC (PWM-based) M00185 NF-Y 0.315 0.946
PAX5 TRANSFAC (PWM-based) M00144 BSAP 0.077 0.103
PAX5 TRANSFAC (PWM-based) M00143 BSAP 0.041 0.022
POU2F2 TRANSFAC (TFBS-based) T00647 POU2F2 0.1 0.909
POU2F2 TRANSFAC (PWM-based) M00210 OCT-x 0.226 0.211
POU2F2 TRANSFAC (TFBS-based) T00646 POU2F2 (Oct-2.1) 0.032 0.014
RELA TRANSFAC (PWM-based) M00052 NF-kappaB (p65) 0.135 0.905
RELA TRANSFAC (TFBS-based) T00590 NF-kappaB 0.235 0.627
RELA TRANSFAC (PWM-based) M00194 NF-kappaB 0.2 0.540
RELA TRANSFAC (PWM-based) M00054 NF-kappaB 0.191 0.526
RELA TRANSFAC (PWM-based) M00208 NF-kappaB 0.176 0.500
RELA JASPAR CORE MA0107.1 RELA 0.195 0.344
RXRA JASPAR CORE MA0065.1 PPARG::RXRA 0.076 0.328
RXRA JASPAR CORE MA0159.1 RXR::RAR DR5 0.07 0.285
RXRA JASPAR CORE MA0115.1 NR1H2::RXRA 0.069 0.254
RXRA JASPAR CORE MA0074.1 RXRA::VDR 0.034 0.058
164 RXRA TRANSFAC T01345 RXR-alpha 0.028 0.033
(TFBS-based) SP1 TRANSFAC T00759 Sp1 0.647 3.648
(TFBS-based) SP1 JASPAR CORE MA0079.2 SP1 0.045 1.716
SP1 ORegAnno ORegAnno 9606 2 Sp1 — SP1 (Sp-1) — 0.172 0.233
ENSG00000185591 — SP-1 Continued on next page Table 2.3.13 – continued from previous page
TF Collection Model ID Model Name AccuracyAve Prec1
SP1 JASPAR CORE MA0079.1 SP1 0.028 0.028
SP1 TRANSFAC M00196 Sp1 0.459 0.018
(PWM-based) SP1 TRANSFAC M00008 Sp1 0.061 0.006
165 (PWM-based) SPI1 TRANSFAC T02068 PU.1 0.421 3.008
(TFBS-based) SPI1 JASPAR CORE MA0080.2 SPI1 0.262 1.038
SPI1 JASPAR CORE MA0080.1 SPI1 0.164 0.407
SREBF1 TRANSFAC M00220 SREBP-1 0.027 0.060
(PWM-based) Continued on next page Table 2.3.13 – continued from previous page
TF Collection Model ID Model Name AccuracyAve Prec1
SREBF1 TRANSFAC M00221 SREBP-1 0.008 0.005
(PWM-based) SRF JASPAR CORE MA0083.1 SRF 0.247 3.486
SRF TRANSFAC M00152 SRF 0.248 2.705
166 (PWM-based) SRF TRANSFAC T00764 SRF 0.282 2.684
(TFBS-based) SRF TRANSFAC M00186 SRF 0.236 2.481
(PWM-based) SRF TRANSFAC M00215 SRF 0.274 2.053
(PWM-based) Continued on next page Table 2.3.13 – continued from previous page
TF Collection Model ID Model Name AccuracyAve Prec1
STAT1 ORegAnno ORegAnno 9606 77 STAT1 0.218 3.255
STAT1 TRANSFAC M00223 STATx 0.227 1.220
(PWM-based) STAT1 TRANSFAC M00258 ISRE 0.081 1.182
167 (PWM-based) STAT1 JASPAR CORE MA0137.1 STAT1 0.121 1.101
STAT1 JASPAR CORE MA0137.2 STAT1 0.22 0.421
STAT1 TRANSFAC M00224 STAT1 0.393 0.345
(PWM-based) STAT1 TRANSFAC M00496 STAT1 0.147 0.186
(PWM-based) Continued on next page Table 2.3.13 – continued from previous page
TF Collection Model ID Model Name AccuracyAve Prec1
STAT2 TRANSFAC M00223 STATx 0.014 0.968
(PWM-based) STAT2 TRANSFAC M00258 ISRE 0.027 0.173
(PWM-based) STAT3 TRANSFAC M00223 STATx 0.304 1.353 168
(PWM-based) STAT3 TRANSFAC M00497 STAT3 0.16 1.170
(PWM-based) STAT3 TRANSFAC M00225 STAT3 0.354 0.274
(PWM-based) STAT5A TRANSFAC M00460 STAT5A (homotetramer) 0.102 0.157
(PWM-based) Continued on next page Table 2.3.13 – continued from previous page
TF Collection Model ID Model Name AccuracyAve Prec1
STAT5A TRANSFAC M00457 STAT5A (homodimer) 0.111 0.154
(PWM-based) STAT5A TRANSFAC M00499 STAT5A 0.049 0.029
(PWM-based) TAL1 JASPAR CORE MA0091.1 TAL1::TCF3 0.037 0.062 169
TAL1 TRANSFAC M00065 Tal-1beta:E47 0.036 0.055
(PWM-based) TAL1 TRANSFAC M00066 Tal-1alpha:E47 0.033 0.044
(PWM-based) TAL1 TRANSFAC M00070 Tal-1beta:ITF-2 0.046 0.040
(PWM-based) Continued on next page Table 2.3.13 – continued from previous page
TF Collection Model ID Model Name AccuracyAve Prec1
TBP TRANSFAC T00820 TFIID 0.045 0.244
(TFBS-based) TBP TRANSFAC M00252 TATA 0.156 0.181
(PWM-based) TBP TRANSFAC M00216 TATA 0.063 0.039 170
(PWM-based) TBP TRANSFAC T00794 TBP 0.072 0.030
(TFBS-based) TBP TRANSFAC M00471 TBP 0.159 0.020
(PWM-based) TCF3 TRANSFAC M00065 Tal-1beta:E47 0.023 0.281
(PWM-based) Continued on next page Table 2.3.13 – continued from previous page
TF Collection Model ID Model Name AccuracyAve Prec1
TCF3 TRANSFAC M00071 E47 0.024 0.140
(PWM-based) TCF3 TRANSFAC T00204 E12 0.098 0.025
(TFBS-based) TCF3 TRANSFAC M00222 Hand1:E47 0.006 0.025 171
(PWM-based) TCF3 JASPAR CORE MA0091.1 TAL1::TCF3 0.03 0.021
TCF3 TRANSFAC M00002 E47 0.07 0.013
(PWM-based) TCF3 TRANSFAC M00066 Tal-1alpha:E47 0.023 0.001
(PWM-based) Continued on next page Table 2.3.13 – continued from previous page
TF Collection Model ID Model Name AccuracyAve Prec1
ZEB1 TRANSFAC M00414 AREB6 0.046 0.175
(PWM-based) ZEB1 TRANSFAC M00412 AREB6 0.023 0.121
(PWM-based) ZEB1 TRANSFAC T00625 ZEB (1124 AA) 0.075 0.058 172
(TFBS-based) ZEB1 TRANSFAC M00413 AREB6 0.044 0.025
(PWM-based) ZEB1 TRANSFAC M00415 AREB6 0.012 0.004
(PWM-based) NR2F2 TRANSFAC T00045 COUP-TF2 0.107 0.131
(TFBS-based) Continued on next page Table 2.3.13 – continued from previous page
TF Collection Model ID Model Name AccuracyAve Prec1
NR2F2 TRANSFAC M00155 ARP-1 0.053 0.053
(PWM-based) USF1 TRANSFAC M00187 USF 0.756 8.974
(PWM-based) USF1 TRANSFAC T00874 USF1 0.543 5.127 173
(TFBS-based) USF1 TRANSFAC M00217 USF 0.395 2.301
(PWM-based) USF1 JASPAR CORE MA0093.1 USF1 0.395 2.300
USF1 TRANSFAC M00122 USF 0.409 2.300
(PWM-based) Continued on next page Table 2.3.13 – continued from previous page
TF Collection Model ID Model Name AccuracyAve Prec1
USF1 TRANSFAC M00121 USF 0.345 1.678
(PWM-based) YY1 TRANSFAC M00069 YY1 0.288 0.755
(PWM-based) YY1 TRANSFAC T00915 YY1 0.249 0.566 174
(TFBS-based) YY1 TRANSFAC M00059 YY1 0.15 0.194
(PWM-based) YY1 JASPAR CORE MA0095.1 YY1 0.158 0.189
FOSL1 TRANSFAC M00517 AP-1 0.226 3.039
(PWM-based) Continued on next page Table 2.3.13 – continued from previous page
TF Collection Model ID Model Name AccuracyAve Prec1
CTCF ORegAnno ORegAnno 9606 74 CTCF 0.989 6.907
CTCF JASPAR CORE MA0139.1 CTCF 0.767 4.733
¹ Average Precision (10⁻⁵)

Table 2.3.14: Performance of LASAGNA-Search TF Models Validated by ENCODE Mouse ChIP-seq Data
TF  Collection  Model ID  Model Name  Accuracy  Ave Prec¹
Cebpb  TRANSFAC (TFBS-based)  T00017  C/EBPbeta(p35)  0.063  4.707
Cebpb  TRANSFAC (PWM-based)  M00117  C/EBPbeta  0.151  2.453
Cebpb  TRANSFAC (PWM-based)  M00109  C/EBPbeta  0.084  1.260
Fli1  UniPROBE  UP00416.Fli1.primary  Fli1  0.205  7.963
Gata1  JASPAR CORE  MA0035.2  Gata1  0.159  0.566
Gata1  TRANSFAC (PWM-based)  M00346  GATA-1  0.147  0.527
Gata1  JASPAR CORE  MA0140.1  Tal1::Gata1  0.154  0.502
Gata1  JASPAR CORE  MA0035.1  Gata1  0.067  0.477
Gata1  TRANSFAC (TFBS-based)  T00305  GATA-1  0.162  0.441
Gata1  TRANSFAC (PWM-based)  M00128  GATA-1  0.124  0.318
Gata1  TRANSFAC (PWM-based)  M00203  GATA-X  0.171  0.123
Gata1  TRANSFAC (PWM-based)  M00126  GATA-1  0.084  0.095
Gata1  TRANSFAC (PWM-based)  M00127  GATA-1  0.054  0.083
Gata1  TRANSFAC (PWM-based)  M00075  GATA-1  0.075  0.050
Jun  TRANSFAC (PWM-based)  M00174  AP-1  0.022  1.884
Jun  TRANSFAC (TFBS-based)  T00032  AP-1  0.027  0.262
Jun  TRANSFAC (PWM-based)  M00172  AP-1  0.027  0.225
Jun  TRANSFAC (PWM-based)  M00188  AP-1  0.028  0.120
Jun  TRANSFAC (PWM-based)  M00517  AP-1  0.049  0.111
Jun  TRANSFAC (PWM-based)  M00199  AP-1  0.047  0.088
Jun  TRANSFAC (PWM-based)  M00041  CRE-BP1:c-Jun  0.136  0.085
Jun  TRANSFAC (PWM-based)  M00173  AP-1  0.035  0.060
Jund  TRANSFAC (PWM-based)  M00517  AP-1  0.053  0.283
Mafk  UniPROBE  UP00044.Mafk.primary  Mafk  0.181  9.666
Mafk  TRANSFAC (PWM-based)  M00037  NF-E2  0.449  2.187
Mafk  UniPROBE  UP00044.Mafk.secondary  Mafk  0.023  0.959
Mafk  TRANSFAC (TFBS-based)  T00557  NF-E2  0.158  0.028
Myb  TRANSFAC (TFBS-based)  T00138  c-Myb  0.011  0.112
Myb  UniPROBE  UP00092.Myb.secondary  Myb  0.018  0.109
Myb  TRANSFAC (PWM-based)  M00183  c-Myb  0.002  0.093
Myb  UniPROBE  UP00092.Myb.primary  Myb  0.023  0.023
Myb  TRANSFAC (PWM-based)  M00004  c-Myb  0.009  0.014
Myb  JASPAR CORE  MA0110.1  ATHB-5  0.002  0.001
Myb  JASPAR CORE  MA0100.1  Myb  0.009  0.001
Myod1  TRANSFAC (PWM-based)  M00001  MyoD  0.096  0.675
Myod1  TRANSFAC (TFBS-based)  T00526  MyoD  0.124  0.398
Myod1  TRANSFAC (PWM-based)  M00184  MyoD  0.098  0.332
Myog  TRANSFAC (TFBS-based)  T00528  myogenin  0.159  0.859
Pax5  TRANSFAC (PWM-based)  M00144  BSAP  0.001  0.066
Pax5  JASPAR CORE  MA0014.1  Pax5  0.004  0.024
Pax5  TRANSFAC (PWM-based)  M00143  BSAP  0  0.000
Srf  TRANSFAC (TFBS-based)  T00765  SRF (504 AA)  0.076  1.888
Srf  TRANSFAC (PWM-based)  M00215  SRF  0.078  1.817
Srf  TRANSFAC (PWM-based)  M00152  SRF  0.077  1.348
Srf  UniPROBE  UP00077.Srf.primary  Srf  0.055  1.245
Srf  TRANSFAC (PWM-based)  M00186  SRF  0.079  0.557
Srf  UniPROBE  UP00077.Srf.secondary  Srf  0.002  0.001
Tbp  UniPROBE  UP00029.Tbp.primary  Tbp  0.06  0.187
Tbp  UniPROBE  UP00029.Tbp.secondary  Tbp  0.096  0.168
Tbp  TRANSFAC (PWM-based)  M00252  TATA  0.085  0.108
Tbp  TRANSFAC (PWM-based)  M00471  TBP  0.13  0.058
Tbp  TRANSFAC (PWM-based)  M00216  TATA  0.074  0.058
Tcf3  JASPAR CORE  MA0092.1  Hand1::Tcfe2a  0.004  0.015
Tcf3  UniPROBE  UP00046.Tcfe2a.secondary  Tcfe2a  0.007  0.010
Tcf3  UniPROBE  UP00046.Tcfe2a.primary  Tcfe2a  0.008  0.006
Usf1  TRANSFAC (PWM-based)  M00122  USF  0.23  16.555
Usf1  TRANSFAC (PWM-based)  M00217  USF  0.182  8.406
Usf1  TRANSFAC (PWM-based)  M00187  USF  0.391  5.904
Usf1  TRANSFAC (PWM-based)  M00121  USF  0.276  3.613
Ets1  TRANSFAC (TFBS-based)  T00111  c-Ets-1  0.134  0.458
Ets1  UniPROBE  UP00414.Ets1.primary  Ets1  0.113  0.266
Ets1  TRANSFAC (PWM-based)  M00032  c-Ets-1(p54)  0.131  0.260
¹ Average Precision (10⁻⁵)

Table 2.3.15: Performance of MAPPER2 TF Models Validated by ENCODE Human ChIP-seq Data
TF  Collection  Model ID  Model Name  Accuracy  Ave Prec¹
ATF1  MAPPER  T00968  ATF-1  0.036  5.129
ATF1  TRANSFAC  M00981  CREB, ATF  0.204  4.308
ATF1  TRANSFAC  M00017  ATF  0.343  1.504
ATF1  TRANSFAC  M00691  ATF1  0.404  0.300
ATF3  TRANSFAC  M00513  ATF3  0.354  4.616
ATF3  MAPPER  T01313  ATF3  0.08  2.167
ATF3  TRANSFAC  M00981  CREB, ATF  0.168  0.692
ATF3  TRANSFAC  M00017  ATF  0.296  0.259
CEBPB  TRANSFAC  M00117  C/EBPbeta  0.091  2.370
CEBPB  TRANSFAC  M00109  C/EBPbeta  0.342  0.899
CEBPB  MAPPER  T00581  C/EBPbeta  0.267  0.556
CEBPB  TRANSFAC  M00912  C/EBP  0.599  0.223
CREB1  TRANSFAC  M00917  CREB  0.688  5.561
CREB1  TRANSFAC  M00916  CREB  0.59  3.957
CREB1  MAPPER  T00163  CREB  0.323  1.774
CREB1  TRANSFAC  M00115  Tax/CREB  0.144  1.139
CREB1  TRANSFAC  M00114  Tax/CREB  0.226  0.559
CREB1  TRANSFAC  M00981  CREB, ATF  0.23  0.524
CREB1  TRANSFAC  M00113  CREB  0.418  0.266
ATF2  TRANSFAC  M00017  ATF  0.325  5.336
ATF2  MAPPER  T00167  ATF-2-xbb4  0.599  1.656
ATF2  TRANSFAC  M00981  CREB, ATF  0.198  0.583
CUX1  TRANSFAC  M00106  CDP CR3+HD  0.066  0.609
CUX1  TRANSFAC  M00104  CDP CR1  0.174100719  0.146
CUX1  TRANSFAC  M00095  CDP  0.038  0.036
CUX1  TRANSFAC  M00105  CDP CR3  0.026  0.014
E2F1  TRANSFAC  M00938  E2F-1  0.066  4.711
E2F1  MAPPER  T05208  pRb:E2F-1:DP-1  0.193  3.887
E2F1  TRANSFAC  M00940  E2F-1  0.446  2.875
E2F1  TRANSFAC  M00920  E2F  0.492  1.502
E2F1  TRANSFAC  M00024  E2F  0.209  1.468
E2F1  MAPPER  T00221  E2F:DP  0.391  1.455
E2F1  MAPPER  T05205  E2F-1:DP-2  0.168  1.312
E2F1  TRANSFAC  M00919  E2F  0.283  1.140
E2F1  TRANSFAC  M00740  Rb:E2F-1:DP-1  0.285  0.627
E2F1  TRANSFAC  M00050  E2F  0.305  0.424
E2F1  MAPPER  T01542  E2F-1  0.243902439  0.051
E2F4  MAPPER  T05206  E2F-4:DP-1  0.286  3.600
E2F4  TRANSFAC  M00920  E2F  0.469  2.754
E2F4  TRANSFAC  M00024  E2F  0.254  2.099
E2F4  MAPPER  T05207  E2F-4:DP-2  0.163  1.803
E2F4  MAPPER  T00221  E2F:DP  0.425  1.352
E2F4  TRANSFAC  M00739  E2F-4:DP-2  0.395  1.324
E2F4  TRANSFAC  M00738  E2F-4:DP-1  0.297  1.318
E2F4  TRANSFAC  M00050  E2F  0.313  1.186
E2F4  TRANSFAC  M00919  E2F  0.304  0.527
ELK1  TRANSFAC  M00007  Elk-1  0.222  5.686
ELK1  TRANSFAC  M00025  Elk-1  0.567  1.062
ELK1  MAPPER  T00250  Elk-1  0.017  0.005
ELK4  MAPPER  T00737  SAP-1a  0.004  0.064
EP300  TRANSFAC  M00033  p300  0.147  0.123
ESR1  MAPPER  T00261  ER-alpha  0.017  0.237
ESR1  TRANSFAC  M00959  ER  0.048  0.086
ETS1  MAPPER  T00112  c-Ets-1  0.033  0.045
FOXM1  TRANSFAC  M00791  HNF3  0.029  0.015
FOXM1  MAPPER  T01104  HNF-3  0.017  0.004
FOS  TRANSFAC  M00924  AP-1  0.118  0.896
FOS  MAPPER  T00029  AP-1  0.29  0.180
FOS  MAPPER  T00123  c-Fos  0.116  0.144
GABPA  MAPPER  T01390  GABP-alpha  0.097  0.121
GATA3  MAPPER  T00311  GATA-3 isoform-1  0.06  0.040
NR3C1  MAPPER  T05076  GR  0.065  0.411
NR3C1  MAPPER  T00337  GR-alpha  0.1  0.314
NR3C1  TRANSFAC  M00960  PR, GR  0.001  0.000
HNF4A  TRANSFAC  M00762  PPAR, HNF-4, COUP, RAR  0.143  1.678
HNF4A  TRANSFAC  M00638  HNF4alpha  0.14  1.535
HNF4A  TRANSFAC  M00158  COUP-TF, HNF-4  0.179  1.260
HNF4A  TRANSFAC  M00967  HNF4, COUP  0.03  1.093
HNF4A  MAPPER  T02758  HNF-4  0.03  0.089
HNF4A  TRANSFAC  M00764  HNF4 direct repeat 1  0.186  0.043
HSF1  MAPPER  T00383  HSF  0.002  0.252
HSF1  TRANSFAC  M00641  HSF  0.014  0.034
IRF1  TRANSFAC  M00772  IRF  0.266  1.543
IRF1  TRANSFAC  M00972  IRF  0.228  0.808
IRF1  TRANSFAC  M00062  IRF-1  0.237  0.693
IRF1  MAPPER  T00423  IRF-1  0.032  0.011
JUN  MAPPER  T00133  c-Jun  0.132  0.280
JUN  TRANSFAC  M00924  AP-1  0.058  0.195
JUN  MAPPER  T00029  AP-1  0.121  0.051
JUNB  TRANSFAC  M00924  AP-1  0.05  0.065
JUND  TRANSFAC  M00924  AP-1  0.141  0.127
MAX  MAPPER  T05056  Max  0.36  0.506
MEF2A  TRANSFAC  M00407  RSRFC4  0.012  0.326
MEF2A  TRANSFAC  M00941  MEF-2  0.014  0.324
MEF2A  TRANSFAC  M00026  RSRFC4  0.018  0.157
MEF2A  TRANSFAC  M00006  MEF-2  0.053  0.138
MEF2A  MAPPER  T01005  MEF-2A  0.046  0.051
MEF2A  TRANSFAC  M00406  MEF-2  0.049  0.050
MEF2A  MAPPER  T01009  RSRFC4  0.005  0.032
MEF2A  MAPPER  T01006  aMEF-2  0.122580645  0.002
MYC  MAPPER  T00140  c-Myc  0.124  0.097
NFATC1  TRANSFAC  M00935  NF-AT  0.023  0.011
NFE2  TRANSFAC  M00037  NF-E2  0.143  4.432
NFIC  TRANSFAC  M00056  myogenin / NF-1  0.185  0.361
NFYA  TRANSFAC  M00687  alpha-CP1  0.166  1.502
NFYA  MAPPER  T00150  NF-Y  0.176  1.416
NFYA  TRANSFAC  M00775  NF-Y  0.205  1.370
NFYB  TRANSFAC  M00687  alpha-CP1  0.23  2.955
NFYB  MAPPER  T00150  NF-Y  0.251  2.304
NFYB  TRANSFAC  M00775  NF-Y  0.312  2.132
PAX5  TRANSFAC  M00143  Pax-5  0.017  0.201
PAX5  TRANSFAC  M00144  Pax-5  0.115  0.043
POU2F2  MAPPER  T00647  10/02/2012  0.009  0.193
POU2F2  TRANSFAC  M00795  Octamer  0.115  0.002
RELA  TRANSFAC  M00054  NF-kappaB  0.016251354  1.035
RELA  TRANSFAC  M00052  NF-kappaB (p65)  0.256  0.433
RELA  MAPPER  T00590  NF-kappaB  0.148  0.221
RELA  MAPPER  T00594  RelA-p65  0.128  0.005
RXRA  MAPPER  T05325  LXR-beta:RXR-alpha  0.007  0.291
RXRA  TRANSFAC  M00631  FXR/RXR-alpha  0.03  0.242
RXRA  MAPPER  T05313  FXR:RXR-alpha  0.043  0.199
RXRA  TRANSFAC  M00647  LXR  0.062  0.115
RXRA  TRANSFAC  M00767  FXR inverted repeat 1  0.048  0.111
RXRA  MAPPER  T01345  RXR-alpha  0.024  0.078
RXRA  MAPPER  T05324  LXR-alpha:RXR-alpha  0.043  0.074
RXRA  TRANSFAC  M00766  LXR direct repeat 4  0.046  0.060
RXRA  TRANSFAC  M00963  T3R  0.017  0.052
RXRA  TRANSFAC  M00518  PPARalpha:RXRalpha  0.043  0.029
RXRA  TRANSFAC  M00966  VDR, CAR, PXR  0.031  0.013
RXRA  TRANSFAC  M00964  PXR, CAR, LXR, FXR  0.009  0.003
SP1  TRANSFAC  M00933  Sp1  0.392  3.309
SP1  TRANSFAC  M00932  Sp1  0.105  2.256
SP1  TRANSFAC  M00931  Sp1  0.648  1.211
SP1  TRANSFAC  M00008  Sp1  0.535  0.087
SPI1  TRANSFAC  M00658  PU.1  0.185  0.518
SREBF1  MAPPER  T01556  SREBP-1a  0.036023055  0.175
SREBF1  TRANSFAC  M00776  SREBP  0.011  0.009
SRF  TRANSFAC  M00810  SRF  0.18  1.674
SRF  TRANSFAC  M00922  SRF  0.197  1.503
SRF  MAPPER  T00764  SRF  0.05  0.074
STAT1  MAPPER  T00428  ISGF-3  0.001  1.822
STAT1  MAPPER  T04759  STAT1  0.206  1.108
STAT1  TRANSFAC  M00972  IRF  0.071  0.111
STAT1  TRANSFAC  M00223  STATx  0.31  0.001
STAT2  MAPPER  T00428  ISGF-3  0.002  0.336
STAT2  TRANSFAC  M00972  IRF  0.011  0.077
STAT2  TRANSFAC  M00223  STATx  0.024  0.033
STAT3  MAPPER  T01493  STAT3  0.175  1.349
STAT3  TRANSFAC  M00223  STATx  0.35  0.392
STAT5A  TRANSFAC  M00457  STAT5A (homodimer)  0.006  0.072
STAT5A  TRANSFAC  M00460  STAT5A (homotetramer)  0.074  0.001
TBP  MAPPER  T00794  TBP  0.036  0.009
TCF3  MAPPER  T00207  E47  0.04  0.537
TCF3  TRANSFAC  M00804  E2A  0.011  0.506
TCF3  TRANSFAC  M00693  E12  0.134  0.155
TCF3  TRANSFAC  M00002  E47  0.134  0.068
TCF3  MAPPER  T00204  E12  0.014  0.005
TCF3  TRANSFAC  M00929  MyoD  0.076  0.004
ZEB1  MAPPER  T00625  ZEB (1124 AA)  0.003  0.139
ZEB1  TRANSFAC  M00414  AREB6  0.014  0.014
ZEB1  TRANSFAC  M00413  AREB6  0.069  0.012
ZEB1  TRANSFAC  M00412  AREB6  0.022  0.005
NR2F2  TRANSFAC  M00765  COUP direct repeat 1  0.284  1.258
NR2F2  TRANSFAC  M00762  PPAR, HNF-4, COUP, RAR  0.238  0.788
NR2F2  TRANSFAC  M00155  ARP-1 (COUP-TF2)  0.207  0.630
NR2F2  MAPPER  T00045  COUP-TF2  0.124156545  0.139
NR2F2  TRANSFAC  M00967  HNF4, COUP  0.063  0.041
USF1  MAPPER  T00874  USF1  0.057  6.019
USF1  TRANSFAC  M00796  USF  0.634  0.053
YY1  TRANSFAC  M00059  YY1  0.076  0.915
YY1  TRANSFAC  M00069  YY1  0.328  0.128
YY1  MAPPER  T00915  YY1  0.093  0.068
YY1  TRANSFAC  M00793  YY1  0.13  0.057
FOSL1  TRANSFAC  M00924  AP-1  0.03  0.075
¹ Average Precision (10⁻⁵)

Table 2.3.16: Performance of MAPPER2 TF Models Validated by ENCODE Mouse ChIP-seq Data
TF  Collection  Model ID  Model Name  Accuracy  Ave Prec¹
Cebpb  TRANSFAC  M00912  C/EBP  0.005  4.611
Cebpb  MAPPER  T00017  C/EBPbeta(p35)  0.016  0.254
Cebpb  TRANSFAC  M00109  C/EBPbeta  0.04  0.041
Cebpb  TRANSFAC  M00117  C/EBPbeta  0.143  0.005
Gata1  MAPPER  T00305  GATA-1  0.043  0.037
Jun  MAPPER  T00131  c-Jun  0.004  0.028
Jun  TRANSFAC  M00924  AP-1  0.016  0.019
Jun  MAPPER  T00032  AP-1  0.013  0.002
Jund  MAPPER  T00437  JunD  0.007782101  0.034
Jund  TRANSFAC  M00924  AP-1  0.02  0.002
Mafk  TRANSFAC  M00037  NF-E2  0.581  14.487
Myb  MAPPER  T00138  c-Myb  0.006  0.009
Myod1  TRANSFAC  M00804  E2A  0.036  0.263
Myod1  TRANSFAC  M00929  MyoD  0.08  0.054
Myog  TRANSFAC  M00804  E2A  0.036  0.274
Myog  TRANSFAC  M00929  MyoD  0.096  0.265
Myog  TRANSFAC  M00712  myogenin  0.103  0.041
Pax5  TRANSFAC  M00144  Pax-5  0.001  0.118
Pax5  TRANSFAC  M00143  Pax-5  0.002  0.013
Srf  TRANSFAC  M00922  SRF  0.064  1.266
Srf  TRANSFAC  M00810  SRF  0.065  1.078
Srf  MAPPER  T00765  SRF-L  0.062  0.761
Tcf3  TRANSFAC  M00804  E2A  0.012  0.058
Tcf3  TRANSFAC  M00929  MyoD  0.016  0.038
Usf1  TRANSFAC  M00796  USF  0.294  10.486
Usf1  MAPPER  T00877  USF1  0.013  0.017
Ets1  TRANSFAC  M00032  c-Ets-1(p54)  0.162  0.475
Ets1  MAPPER  T00111  c-Ets-1  0.059  0.060
¹ Average Precision (10⁻⁵)

An outlier corresponding to Mafk is seen in Figures 2.3.4 and 2.3.5. Four LASAGNA-Search models and one MAPPER2 model were used to predict Mafk binding sites (see Tables 2.3.14 and 2.3.16). Interestingly, the best model of each tool is based on the same TRANSFAC matrix, M00037. The LASAGNA-Search model is a PWM, which carries no position dependence information. The MAPPER2 model, however, is an HMM, which considers position dependence. The use of position dependence gave the MAPPER2 model an edge over the LASAGNA-Search model. The other three LASAGNA-Search models performed much worse than the best one, resulting in poor average performance on Mafk. While it is difficult to draw conclusions about mouse TFs from only 13 TFs, the results on human TFs indicate that LASAGNA-Search models are significantly better. Overall, we observe that LASAGNA-Search significantly outperforms MAPPER2, indicating that models in LASAGNA-Search more accurately predict TFBSs.
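The per-TF averaging behind the Mafk outlier can be retraced with a short calculation. The sketch below is only an illustration: it copies the Mafk rows from Tables 2.3.14 and 2.3.16 and averages them per tool, which is how the figure captions describe each plotted point. The function name `per_tf_average` and the dictionary layout are assumptions made here, not part of either tool.

```python
from statistics import mean

# (accuracy, average precision in units of 1e-5) for the Mafk models,
# copied from Tables 2.3.14 (LASAGNA-Search) and 2.3.16 (MAPPER2).
lasagna_mafk = {
    "UP00044.Mafk.primary": (0.181, 9.666),
    "M00037": (0.449, 2.187),
    "UP00044.Mafk.secondary": (0.023, 0.959),
    "T00557": (0.158, 0.028),
}
mapper2_mafk = {"M00037": (0.581, 14.487)}


def per_tf_average(models):
    """Average accuracy and average precision across one TF's models."""
    accuracies = [accuracy for accuracy, _ in models.values()]
    precisions = [precision for _, precision in models.values()]
    return mean(accuracies), mean(precisions)


lasagna_acc, lasagna_ap = per_tf_average(lasagna_mafk)
mapper2_acc, mapper2_ap = per_tf_average(mapper2_mafk)
print(f"LASAGNA-Search: accuracy {lasagna_acc:.3f}, average precision {lasagna_ap:.3f}e-5")
print(f"MAPPER2:        accuracy {mapper2_acc:.3f}, average precision {mapper2_ap:.3f}e-5")
```

With these rows, the three weaker LASAGNA-Search models pull the tool's mean average precision on Mafk well below the single MAPPER2 value, which is consistent with the position of the outlier in the plots.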
2.4 Future Directions
We plan to improve LASAGNA-Search in two respects: expanding the content and incorporating useful features. More species will be supported in automatic promoter retrieval and visualization in the UCSC Genome Browser. To expand our TF model collections, more sources of TFBSs and PWMs, such as the PAZAR database [64] and ChIP-seq data, will be considered. The GBP score [21] is based on multiple evidence sources, including evolutionary conservation, and has been shown to improve prediction of binding sites. Integration of the GBP scores with the search module will be investigated.

Using a cluster of TF models to scan a sequence for binding sites has been shown to outperform using the best model in the cluster [62]. This strategy will benefit from our large collections of TF models and improve the TFBS search performance of LASAGNA-Search. Finally, we will enable the search for two-block motifs [8, 38], that is, binding sites composed of two half sites separated by variable-length gaps. While plenty of work has been devoted to de novo two-block motif discovery [8, 56, 54], searching for two-block motif instances is straightforward. Using two TF models with or without a gap penalty [8] will be investigated.
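To make the two-block idea concrete, the following sketch scans a sequence with two half-site models separated by a variable-length gap. It is a hypothetical illustration, not the planned LASAGNA-Search feature: the toy models (+1 for a consensus match, -1 otherwise), the linear gap penalty, and all function names are assumptions introduced here.

```python
import math


def toy_model(consensus):
    """Hypothetical half-site model: +1 for the consensus base, -1 otherwise."""
    return [{base: (1.0 if base == c else -1.0) for base in "ACGT"} for c in consensus]


def window_score(window, model):
    """Sum the per-position scores of a window under a position-wise model."""
    return sum(column[base] for base, column in zip(window, model))


def scan_two_block(sequence, left, right, min_gap, max_gap, gap_penalty=0.0):
    """Return (best_score, start, gap) over all placements of the two half sites."""
    best = (-math.inf, None, None)
    n_left, n_right = len(left), len(right)
    for start in range(len(sequence) - n_left + 1):
        left_score = window_score(sequence[start:start + n_left], left)
        for gap in range(min_gap, max_gap + 1):
            right_start = start + n_left + gap
            if right_start + n_right > len(sequence):
                break  # larger gaps would also run past the end of the sequence
            right_score = window_score(sequence[right_start:right_start + n_right], right)
            # Linear penalty on gap length beyond the minimum (an assumption).
            total = left_score + right_score - gap_penalty * (gap - min_gap)
            if total > best[0]:
                best = (total, start, gap)
    return best


sequence = "CCTGAGGGGTCACC"
left, right = toy_model("TGA"), toy_model("TCA")
print(scan_two_block(sequence, left, right, min_gap=0, max_gap=6))
print(scan_two_block(sequence, left, right, min_gap=0, max_gap=6, gap_penalty=0.5))
```

In this toy sequence the two half sites sit four nucleotides apart, so both calls report the placement starting at position 2 (0-based) with a gap of 4; the penalty only lowers the reported score.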
[Figure 2.3.4 plot: three scatter panels with LASAGNA-Search on the x-axis and MAPPER2 on the y-axis. (A) Homo sapiens: 51 TFs, p-value 0.00017. (B) Mus musculus: 13 TFs, p-value 0.02, outlier Mafk marked. (C) Overall: 64 TFs, p-value 2.9e-05.]
Figure 2.3.4: Comparison of Precomputed TF Models by Average Precision
Plots of performance of MAPPER2 against that of LASAGNA-Search for (A) human TFs, (B) mouse TFs and (C) all the TFs, respectively. Each point in a plot corresponds to a TF, whose binding sites can be predicted by more than one model. Each model is scored by average precision and the average score across the models is used to plot the point. The outlier, Mafk, in (B) is marked in red. The number of TFs and the p-value of a Wilcoxon signed-rank test are shown for each plot.

[Figure 2.3.5 plot: three scatter panels with LASAGNA-Search on the x-axis and MAPPER2 on the y-axis, both ranging from 0.0 to 0.6. (A) Homo sapiens: 51 TFs, p-value 7e-06. (B) Mus musculus: 13 TFs, p-value 0.034, outlier Mafk marked. (C) Overall: 64 TFs, p-value 1.4e-06.]
Figure 2.3.5: Comparison of Precomputed TF Models by Accuracy
Plots of performance of MAPPER2 against that of LASAGNA-Search for (A) human TFs, (B) mouse TFs and (C) all the TFs, respectively. Each point in a plot corresponds to a TF, whose binding sites can be predicted by more than one model. Each model is scored by accuracy and the average score across the models is used to plot the point. The outlier, Mafk, in (B) is marked in red. The number of TFs and the p-value of a Wilcoxon signed-rank test are shown for each plot.

Chapter 3 Searching for Transcription Factor Binding Sites in Vector Spaces
3.1 Background
In this chapter, we describe searching for transcription factor binding sites in vector spaces. Specifically, l-mers are placed in Euclidean space such that each l-mer corresponds to a vector in the space. With known binding sites of a TF, we construct a profile vector for the TF. This profile vector can then be used as a query vector to search for unknown binding sites in the space, given a similarity measure between two vectors.

The vector space model has long been used in information retrieval (IR) [70, 53]. Under this model, each document in a collection is embedded in a t-dimensional space. That is, each document is represented by a t-element vector, where t is the number of distinct terms present in the document collection, or corpus. To search for documents on a particular topic, a query composed of terms relevant to the topic is constructed. The query can be similarly embedded in the t-dimensional space. Similarity between the query and a document can then be measured by the similarity between the two corresponding vectors. In the TFBS search problem, the entire genome or the collection of promoter region sequences corresponds to the corpus, whereas an l-mer is analogous to a document in IR. Likewise, a TF is analogous to a topic, while a TF representation is the analog of a query for the topic.

In this framework, we propose two novel approaches to constructing a query vector for a TF of interest: the negative-to-positive vector (NPV) method and the optimal discriminating vector (ODV) method. We compare the proposed methods to a state-of-the-art method, the ULPB method, as well as the widely used PSSM method. Performance of a method is assessed by cross-validation experiments on two data sets collected from RegulonDB [27] and JASPAR [65], respectively. Based on the NPV and ODV methods, we investigate motif subtype identification and show that, consistent with previous studies [35, 28], identification of motif subtypes improves TFBS search performance. Independent validation on human ChIP-seq data gives further insights into the proposed methods. Finally, we discuss the advantages of searching for TF binding sites in the proposed framework.
3.2 Methods
3.2.1 Data Sets
To understand the compared methods in this chapter, we experimented on prokaryotic as well as eukaryotic transcription factors. The known prokaryotic TF binding sites were collected from RegulonDB [27] release 6.8. This data source, also considered in [69], contains binding sites of TFs in the E. coli K-12 genome. We considered a data set of 26 TFs with 17 or more known binding sites. The filtering criterion ensures that, for each TF, we have enough examples to learn from. Similar filtering criteria were used in [69]. This data set is summarized in Table 3.2.17.

Table 3.2.17: Statistics of the E. coli TFs in RegulonDB
Name   Length  # TFBSs    Name   Length  # TFBSs
MetJ   8       29         Lrp    12      62
SoxS   18      19         H-NS   15      37
FlhDC  16      20         AraC   18      20
Fis    15      206        ArcA   15      93
IHF    13      101        OmpR   20      22
PhoB   20      17         GlpR   20      23
OxyR   17      41         CpxR   15      37
NarL   7       90         CRP    22      249
TyrR   18      19         NarP   7       20
Fur    19      81         LexA   20      40
NtrC   17      17         FNR    14      87
MalT   10      20         PhoP   17      21
ArgR   18      32         NsrR   11      37

The known eukaryotic TF binding sites were collected from the JASPAR CORE database (the 4th release) [65]. TFs of Homo sapiens and Mus musculus were filtered by two criteria. A TF was kept only if it has at least 20 known binding sites and the length of its binding sites is at least 6 nucleotides. The length criterion, arbitrarily chosen, ensures that a TF under consideration is specific enough. This data set is summarized in Table 3.2.18.
3.2.2 Notation
For clarity, we list and define functions and variables used throughout this chapter.

• $f_i(u)$ denotes the probability of observing letter $u$ at position $i$ of a TFBS, where $u \in \{A, C, G, T\}$.

• $f_{i,j}(u, v)$ denotes the probability of observing letters $u$ and $v$ at positions $i$ and $j$, respectively, where $i < j$ and $u, v \in \{A, C, G, T\}$.

• $f_i(v \mid u)$ denotes the position-specific conditional probability of observing $v$ at position $i + 1$ given $u$ has been seen at position $i$, where $u, v \in \{A, C, G, T\}$.

• $f(v \mid u)$ denotes the background conditional probability of observing $v$ given $u$ has been observed at the previous position, where $u, v \in \{A, C, G, T\}$.

• $I_u(\cdot)$ is the indicator function given by
\[ I_u(v) = \begin{cases} 1 & \text{if } v = u, \\ 0 & \text{otherwise,} \end{cases} \tag{3.2.1} \]
where $u, v \in \{A, C, G, T\}$.

• $I_{u_1 u_2}(\cdot)$ is similarly defined as follows:
\[ I_{u_1 u_2}(v_1 v_2) = \begin{cases} 1 & \text{if } v_1 = u_1 \text{ and } v_2 = u_2, \\ 0 & \text{otherwise,} \end{cases} \tag{3.2.2} \]
where $u_1, u_2, v_1, v_2 \in \{A, C, G, T\}$.

• $IC_i$ denotes the information content at position $i$ of a binding site. Information content is closely related to entropy, a measure of uncertainty in information theory. The entropy at position $i$ is given by $E_i = -\sum_{u \in \{A, C, G, T\}} f_i(u) \log_2 f_i(u)$. When $f_i(u) = \frac{1}{4}$ for all $u \in \{A, C, G, T\}$, $E_i$ attains the maximal entropy of 2 and we are most uncertain about the letter at position $i$. $IC_i$ is simply defined as
\[ IC_i = 2 - E_i = 2 + \sum_{u \in \{A, C, G, T\}} f_i(u) \log_2 f_i(u). \tag{3.2.3} \]

• $IC_{i,j}$ denotes the information content of the position pair $(i, j)$ of a binding site. Similarly,
\[ IC_{i,j} = 4 + \sum_{u, v \in \{A, C, G, T\}} f_{i,j}(u, v) \log_2 f_{i,j}(u, v), \tag{3.2.4} \]
where the maximal entropy of 4 is attained when $f_{i,j}(u, v) = \frac{1}{16}$ for all $u, v \in \{A, C, G, T\}$.
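As a small worked example of Equations (3.2.3) and (3.2.4), the sketch below estimates the frequencies from a handful of aligned sites and computes the information content of single positions and of one position pair. The four sites are made up for illustration, positions are 0-based, no pseudocounts are applied, and terms with zero frequency are taken to contribute zero; all of these are assumptions of this sketch.

```python
import math

NUCLEOTIDES = "ACGT"


def position_frequencies(sites, i):
    """Estimate f_i(u) from aligned, equal-length sites (no pseudocounts)."""
    column = [site[i] for site in sites]
    return {u: column.count(u) / len(column) for u in NUCLEOTIDES}


def pair_frequencies(sites, i, j):
    """Estimate f_{i,j}(u, v) for the position pair (i, j)."""
    pairs = [site[i] + site[j] for site in sites]
    return {u + v: pairs.count(u + v) / len(pairs)
            for u in NUCLEOTIDES for v in NUCLEOTIDES}


def information_content(freqs, max_entropy):
    """IC = max_entropy + sum of p * log2(p), with 0 * log2(0) taken as 0."""
    return max_entropy + sum(p * math.log2(p) for p in freqs.values() if p > 0)


sites = ["ACGT", "ACGA", "ACTT", "AGGC"]  # hypothetical aligned binding sites
ic = [information_content(position_frequencies(sites, i), 2.0) for i in range(4)]
ic_01 = information_content(pair_frequencies(sites, 0, 1), 4.0)
print([round(value, 3) for value in ic], round(ic_01, 3))
```

Position 0 is fully conserved and reaches the maximum of 2 bits, whereas position 3, with frequencies 1/2, 1/4 and 1/4, has an information content of 0.5 bits.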
3.2.3 Embedding Short Sequences in Vector Spaces
We describe how a short sequence of $l$ nucleotides, or an $l$-mer, is placed in a vector space. Let $s$ be an $l$-mer and $s_i$ denote its $i$-th nucleotide. Each nucleotide in $s$ is converted to 4 variables; that is, $s_i$ is converted to $w_i I_A(s_i)$, $w_i I_C(s_i)$, $w_i I_G(s_i)$ and $w_i I_T(s_i)$ for $i = 1, 2, \ldots, l$. Hence, $s$ is converted to $4l$ variables, placing $s$ in $\mathbb{R}^{4l}$. Figure 3.2.1 illustrates the conversion of each nucleotide in an $l$-mer to 4 variables when $w_i = 1$ for $i = 1, 2, \ldots, l$.

AGTG……CTCT
1000001000010010……0100000101000001

Figure 3.2.1: Illustration of Embedding a Short Sequence in Vector Space
Each nucleotide in the sequence is converted to 4 indicator variables.

We further consider nucleotide pairs $(s_i, s_j)$, where $i < j$. Only pairs in close proximity are considered in this study. We consider $(s_i, s_j)$ only if $j - i = 1$ or $2$, i.e., a pair of nucleotides is considered only if they are adjacent or separated by one nucleotide. Nucleotide pair $(s_i, s_j)$ is similarly converted to 16 variables, $w_{i,j} I_{AA}(s_i s_j)$, $w_{i,j} I_{AC}(s_i s_j)$, $\ldots$, $w_{i,j} I_{TT}(s_i s_j)$, as there are 16 possible nucleotide pairs, $\{AA, AC, \ldots, TT\}$. We use $32l - 48$ additional variables to encode the pairs since there are $l - 1$ adjacent pairs and $l - 2$ pairs separated by one nucleotide. Consequently, considering individual nucleotides and nucleotide pairs, each $l$-mer is converted to a $(36l - 48)$-element vector.

In this study, we consider two choices of the $w_i$'s and $w_{i,j}$'s. For the first choice, all the nucleotides and nucleotide pairs are given the same weight, i.e., $w_i = 1$ and $w_{i,j} = 1$ for all $i$ and $j$. The second one assigns weight to the $i$-th nucleotide according to the information content at position $i$. Similarly, it assigns weight to pair $(i, j)$ according to the information content at this pair of positions. Specifically, $w_i = \sqrt{IC_i}$ and $w_{i,j} = \sqrt{IC_{i,j}}$ for all $i$ and $j$.
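The embedding can be sketched in a few lines. The code below is an illustrative reading of this section, not the dissertation's implementation: the ordering of the blocks (nucleotides, then adjacent pairs, then pairs separated by one nucleotide), the 0-based positions, and the names `embed`, `w` and `w_pair` are assumptions, and the information content values used for the weighted example are made up.

```python
import math

NUCLEOTIDES = "ACGT"
PAIRS = [u + v for u in NUCLEOTIDES for v in NUCLEOTIDES]


def embed(lmer, w=None, w_pair=None):
    """Map an l-mer to a (36l - 48)-element vector.

    w[i] weights position i and w_pair[(i, j)] weights the pair (i, j)
    with j - i in {1, 2}; positions are 0-based and missing weights are 1.
    """
    l = len(lmer)
    w = w or [1.0] * l
    w_pair = w_pair or {}
    vector = []
    # 4 indicator variables per nucleotide.
    for i, base in enumerate(lmer):
        vector.extend(w[i] * (base == u) for u in NUCLEOTIDES)
    # 16 indicator variables per adjacent pair, then per pair one apart.
    for distance in (1, 2):
        for i in range(l - distance):
            j = i + distance
            weight = w_pair.get((i, j), 1.0)
            pair = lmer[i] + lmer[j]
            vector.extend(weight * (pair == p) for p in PAIRS)
    return vector


unit = embed("AGTG")
print(len(unit), "".join(str(int(x)) for x in unit[:16]))

# Second weighting scheme, with hypothetical per-position information content.
weights = [math.sqrt(ic) for ic in [2.0, 1.0, 0.5, 2.0]]
weighted = embed("AGTG", w=weights)
```

For the 4-mer AGTG the vector has 36 · 4 − 48 = 96 elements, and with unit weights its first 16 entries reproduce the bit pattern shown for AGTG in Figure 3.2.1.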
3.2.4 Searching for TFBSs in Vector Spaces
Given a query vector $\mathbf{t}$ in the space, we score an $l$-mer $s$ as follows:
\[ \mathrm{Score}(s) = \mathbf{s}^T \mathbf{t}, \tag{3.2.1} \]
where $\mathbf{s}$ denotes the vector corresponding to $s$. In other words, the score of $s$ is obtained by taking the dot product between $\mathbf{s}$ and $\mathbf{t}$. It can be seen that $\mathrm{Score}(s)$ measures the similarity between $\mathbf{s}$ and $\mathbf{t}$. Assuming that $\mathbf{t}$ corresponds to an $l$-mer $t$, $\mathrm{Score}(s)$ counts the number of nucleotides and nucleotide pairs shared between $s$ and $t$ when $w_i = 1$ and $w_{i,j} = 1$ for all $i$ and $j$. However, we note that $\mathbf{t}$ can be any vector in the space and does not necessarily correspond to an $l$-mer.

As described above, an $l$-mer is converted to a $(36l - 48)$-element vector. Hence, we use $\mathbf{t}$ to search for binding sites in $\mathbb{R}^{36l-48}$. Our approach offers great flexibility in that it easily allows searching for binding sites in a lower dimensional subspace. By setting all but the first $4l$ elements in $\mathbf{t}$ to zero, we are essentially searching for binding sites in $\mathbb{R}^{4l}$. In this chapter, we exploit this advantage and simultaneously search for transcription factor binding sites in three subspaces. Two of them are $\mathbb{R}^{4l}$ and $\mathbb{R}^{36l-48}$. The third one is $\mathbb{R}^{16l-12}$. This subspace is obtained from considering only the first nucleotide and the $l - 1$ adjacent nucleotide pairs, as in a first-order Markov chain.

3.2.5 The NPV Method
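The counting interpretation of the score can be checked on a toy case. The sketch below uses a compact unit-weight embedding, with the block ordering assumed here for illustration, and takes the query to be the embedding of the 4-mer ACGT; the function names are hypothetical.

```python
NUCLEOTIDES = "ACGT"
PAIRS = [u + v for u in NUCLEOTIDES for v in NUCLEOTIDES]


def embed(lmer):
    """Unit-weight embedding: nucleotides, adjacent pairs, pairs one apart."""
    vector = [float(base == u) for base in lmer for u in NUCLEOTIDES]
    for distance in (1, 2):
        for i in range(len(lmer) - distance):
            pair = lmer[i] + lmer[i + distance]
            vector.extend(float(pair == p) for p in PAIRS)
    return vector


def score(lmer, query):
    """Dot product of the embedded l-mer with the query vector."""
    return sum(x * y for x, y in zip(embed(lmer), query))


query = embed("ACGT")
print(score("ACGT", query))
print(score("ACGA", query))
```

Scoring ACGT against itself gives 9, counting 4 nucleotides, 3 adjacent pairs and 2 pairs separated by one nucleotide. ACGA scores 6, since it shares 3 nucleotides (A, C, G), 2 adjacent pairs (AC, CG) and 1 separated pair (A·G) with the query.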
We first introduce a simple approach to constructing a query vector. Let $P$ be the set of $n_+$ binding sites and $N$ be the set of $n_-$ non-binding sites of a particular transcription factor. We embed all the $l$-mers in $P$ and $N$ in $\mathbb{R}^{36l-48}$. We then find the mean binding site vector

$$\mu_+ = \frac{1}{n_+} \sum_{s \in P} \mathbf{s}$$

as well as the mean non-binding site vector

$$\mu_- = \frac{1}{n_-} \sum_{s \in N} \mathbf{s}.$$

The query vector $\mathbf{t}$ is found by subtracting $\mu_-$ from $\mu_+$, that is, $\mathbf{t} = \mu_+ - \mu_-$. The query vector $\mathbf{t}$ can be seen as the vector pointing from the center of the non-binding site vectors to the center of the binding site vectors. Hence, we call it the negative-to-positive vector (NPV) method. Figure 3.2.2 illustrates the idea.
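The construction of the NPV query vector can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used in this study: the inputs are already-embedded vectors given as plain Python lists, and the helper names (`mean_vector`, `npv_query`, `score`, `restrict_to_prefix`) are hypothetical. The last helper shows the subspace idea from Section 3.2.4 for the $\mathbb{R}^{4l}$ case only.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def npv_query(pos_vectors, neg_vectors):
    """Negative-to-positive vector t = mu_plus - mu_minus."""
    mu_pos = mean_vector(pos_vectors)
    mu_neg = mean_vector(neg_vectors)
    return [a - b for a, b in zip(mu_pos, mu_neg)]


def score(s_vec, t):
    """Score(s) = s^T t, the dot product of an embedded l-mer and the query."""
    return sum(a * b for a, b in zip(s_vec, t))


def restrict_to_prefix(t, k):
    """Zero all but the first k elements of t (k = 4l gives the R^{4l} subspace)."""
    return [x if idx < k else 0.0 for idx, x in enumerate(t)]
```

For instance, with toy two-dimensional vectors `[[1, 0], [3, 0]]` as positives and `[[0, 1], [0, 3]]` as negatives, the query vector is `[2.0, -2.0]`, so a vector near the positive mean scores higher than one near the negative mean.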
[Figure 3.2.2: scatter plot of TFBS and non-TFBS vectors on Axis 1 (horizontal) and Axis 2 (vertical), with $\mu_-$ and $\mu_+$ marked.]

Figure 3.2.2: Illustration of the NPV Method

The solid arrow represents the negative-to-positive vector $\mu_+ - \mu_-$, pointing from $\mu_-$ to $\mu_+$. The hollow triangles denote the known binding sites, whereas the circles represent the known non-binding sites. The center of the binding site vectors is marked by the solid triangle, while the center of the non-binding site vectors is marked by the solid circle.
The score of an $l$-mer $s$ given by the NPV method is therefore

$$\mathrm{Score}(s) = \mathbf{s}^T (\mu_+ - \mu_-) = \mathbf{s}^T \mu_+ - \mathbf{s}^T \mu_-. \qquad (3.2.1)$$
We can see that it computes the similarity between $\mathbf{s}$ and the mean binding site vector as well as the similarity between $\mathbf{s}$ and the mean non-binding site vector. It then scores $s$ by the difference of the two similarity scores. The more similar $\mathbf{s}$ is to the mean binding site vector, the higher the score. The less similar $\mathbf{s}$ is to the mean non-binding site vector, the higher the score.

From the perspective of geometry, we note that $\mathrm{Score}(s)$ in (3.2.1) is proportional to $\mathrm{Score}(s)/\|\mathbf{t}\|$, where $\|\mathbf{t}\|$ is the length of the query vector $\mathbf{t}$. Moreover, by virtue of the equality

$$\mathbf{s}^T \mathbf{t} = \|\mathbf{s}\| \, \|\mathbf{t}\| \cos\theta,$$

we know $\mathrm{Score}(s)/\|\mathbf{t}\| = \|\mathbf{s}\| \cos\theta$ equals the length of the orthogonal projection of $\mathbf{s}$ onto $\mathbf{t}$, where $\theta$ is the angle formed by vectors $\mathbf{s}$ and $\mathbf{t}$ (see Figure 3.2.3 for an illustration). Up to the positive factor $\|\mathbf{t}\|$, the computation of $\mathrm{Score}(s)$ is therefore equivalent to computation of the orthogonal projection of $\mathbf{s}$ onto $\mathbf{t}$. Similarly, the computation of $\mathrm{Score}(s)$ in (3.2.1) is equivalent to computation of the orthogonal projection of $\mathbf{s}$ onto $\mu_+ - \mu_-$. In Figure 3.2.2, we observe that vector $\mu_+ - \mu_-$ is pointing to the left and, projected onto this vector, most of the binding sites are on the left of the non-binding sites. This implies that most of the binding sites have a higher score than the non-binding sites.
[Figure 3.2.3: vectors $\mathbf{s}$ and $\mathbf{t}$, with $\mathbf{s}$ projected onto $\mathbf{t}$.]

Figure 3.2.3: The Orthogonal Projection of s onto t

It can be seen that the projection of $\mathbf{s}$ onto $\mathbf{t}$ is equal to $\mathrm{Score}(s)/\|\mathbf{t}\| \propto \mathrm{Score}(s)$.

3.2.6 The ODV Method

We have described the NPV method, which offers a heuristic way of constructing a query vector. We now introduce a way of finding an optimal query vector $\beta \in \mathbb{R}^{36l-48}$. Suppose that $|P| = n_+$ and $|N| = n_-$, that is, there are $n_+$ binding sites and $n_-$ non-binding sites for a particular TF. Let $P = \{s_{(1)}, s_{(2)}, \ldots, s_{(n_+)}\}$ and $N = \{s_{(n_+ + 1)}, s_{(n_+ + 2)}, \ldots, s_{(n)}\}$, where $s_{(i)}$ denotes the $i$-th $l$-mer in the union of the two sets and $n = n_+ + n_-$. We find the optimal $\beta$ by solving the following minimization problem:
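$$\min_{\beta, b, \xi} \ \frac{1}{2}\|\beta\|^2 + \frac{C}{n_+} \sum_{i=1}^{n_+} \xi_i + \frac{C}{n_-} \sum_{i=n_+ + 1}^{n} \xi_i \qquad (3.2.1)$$

subject to

$$\frac{\mathrm{Score}(s_{(i)})}{\|\beta\|} \geq \frac{b + 1 - \xi_i}{\|\beta\|} \quad \text{for } s_{(i)} \in P, \qquad (3.2.2)$$

$$\frac{\mathrm{Score}(s_{(i)})}{\|\beta\|} \leq \frac{b - 1 + \xi_i}{\|\beta\|} \quad \text{for } s_{(i)} \in N, \qquad (3.2.3)$$

$$\xi_i \geq 0 \quad \forall i. \qquad (3.2.4)$$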
The constraint in (3.2.2) ensures that the projection of a TFBS $\mathbf{s}_{(i)}$ onto the vector $\beta$, $\mathrm{Score}(s_{(i)})/\|\beta\|$, is at least the threshold $(b + 1)/\|\beta\|$. On the other hand, the constraint in (3.2.3) ensures that the projection of a non-TFBS $\mathbf{s}_{(i)}$ onto $\beta$ stays at or below the threshold $(b - 1)/\|\beta\|$. Flexibility is given to the thresholds by introducing $\xi_i$'s with cost captured by the last two terms in (3.2.1). Finally, to clearly distinguish TFBSs from non-TFBSs, the squared difference between the two thresholds ($(b + 1)/\|\beta\|$ and $(b - 1)/\|\beta\|$) is made as large as possible. The two thresholds differ by $2/\|\beta\|$, so this amounts to maximizing $4/\|\beta\|^2$ or, equivalently, minimizing $\frac{1}{2}\|\beta\|^2$, which is the first term in (3.2.1). We call this approach the optimal discriminating vector (ODV) method.
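To see why the constraints are linear in the unknowns, one can multiply (3.2.2) and (3.2.3) by $\|\beta\| > 0$ and substitute $\mathrm{Score}(s_{(i)}) = \mathbf{s}_{(i)}^T \beta$. The labels $y_i$ below are introduced only for this rewriting and do not appear elsewhere in the chapter.

```latex
% Multiply (3.2.2) and (3.2.3) by \|\beta\| > 0 and use Score(s_(i)) = s_(i)^T \beta.
\begin{align*}
\mathbf{s}_{(i)}^T \beta - b &\geq 1 - \xi_i && \text{for } s_{(i)} \in P, \\
\mathbf{s}_{(i)}^T \beta - b &\leq -(1 - \xi_i) && \text{for } s_{(i)} \in N, \\
\text{i.e.}\quad y_i \left( \mathbf{s}_{(i)}^T \beta - b \right) &\geq 1 - \xi_i
  && \text{with } y_i = +1 \text{ on } P,\ y_i = -1 \text{ on } N.
\end{align*}
```

Written this way, the problem has the form of a soft-margin linear support vector machine with class-dependent costs $C/n_+$ and $C/n_-$.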
The optimization problem in (3.2.1) is known as a quadratic programming problem with linear inequality constraints specified in (3.2.2), (3.2.3) and (3.2.4). There are $p + n + 1$ variables and $2n$ constraints, where $p = 36l - 48$ is the dimension of $\beta$. We can see that (3.2.2) and (3.2.3) specify $n$ constraints whereas (3.2.4) imposes $n$ constraints on the variables. Quadratic programming [7] is well-studied and hence general solvers are available, e.g., the OpenOpt framework [43]. To solve this problem, the parameter $C (> 0)$ is first arbitrarily chosen. A solver then searches for values of $\beta = (\beta_1, \ldots, \beta_p)^T$, $b$ and $\xi = (\xi_1, \ldots, \xi_n)^T$ such that the objective function in (3.2.1) is minimized while the constraints in (3.2.2), (3.2.3) and (3.2.4) are satisfied simultaneously. It can be seen that an optimal solution to (3.2.1) always exists since the search space of $\{\beta, b, \xi\}$ is never empty and the objective function is bounded below by zero. To find a feasible solution, one can arbitrarily pick $\beta \neq 0 \in \mathbb{R}^p$ and $b \in \mathbb{R}$. For $s_{(i)} \in P$, one can pick a sufficiently large $\xi_i \geq 0$ such that the constraint in (3.2.2) is satisfied. Similarly, for $s_{(i)} \in N$, one can pick a sufficiently large $\xi_i \geq 0$ such that the constraint in (3.2.3) is met. We can then compute the value of the objective function as the values of all the variables are known. One way to choose the parameter $C$ in (3.2.1) is to search for $C$ in a range by cross-validation. The parameter is TF-dependent in general, but experiments showed that a small $C = 2^{-6}$ will usually suffice and hence we set $C = 2^{-6}$ for all the ODV experiments in this study.
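For readers who want to experiment without a quadratic programming solver, the following sketch approximates the ODV solution by subgradient descent. This is a swapped-in technique for illustration only, not the solver-based procedure used in this study. It relies on the observation that, at an optimum, each slack equals $\xi_i = \max(0, 1 - y_i(\mathbf{s}_{(i)}^T \beta - b))$ with $y_i = +1$ for TFBSs and $y_i = -1$ for non-TFBSs, so the constrained problem can be written as an unconstrained one. The function name `train_odv`, the iteration count and the step size are assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_odv(pos, neg, C=2.0 ** -6, iters=500, lr=0.5):
    """Approximate (beta, b) for the ODV objective by subgradient descent.

    Minimizes 0.5*||beta||^2 + (C/n+) * sum_P xi_i + (C/n-) * sum_N xi_i with
    xi_i = max(0, 1 - y_i * (s_i . beta - b)), y_i = +1 on P and -1 on N.
    iters and lr are arbitrary illustrative settings.
    """
    p = len(pos[0])
    data = [(x, 1.0, C / len(pos)) for x in pos]
    data += [(x, -1.0, C / len(neg)) for x in neg]
    beta, b = [0.0] * p, 0.0
    for k in range(1, iters + 1):
        g_beta, g_b = list(beta), 0.0          # gradient of 0.5*||beta||^2 is beta
        for x, y, c in data:
            if y * (dot(x, beta) - b) < 1.0:   # slack is active for this example
                g_beta = [g - c * y * xk for g, xk in zip(g_beta, x)]
                g_b += c * y
        step = lr / k ** 0.5                   # diminishing step size
        beta = [w - step * g for w, g in zip(beta, g_beta)]
        b -= step * g_b
    return beta, b
```

On a toy set with positives `[[2.0, 0.0], [3.0, 1.0]]` and negatives `[[0.0, 0.0], [-1.0, 1.0]]`, the returned $\beta$ has a positive first component, so every positive receives a higher score $\mathbf{s}^T \beta$ than every negative.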
3.2.7 The PSSM and ULPB Methods
We briefly describe the ungapped likelihood under positional background (ULPB) method proposed in [69] and the position-specific scoring matrix (PSSM) method compared therein. We refer readers to Section 3.2.2 for functions and variables used here. Consider a specific TF with binding sites of length $l$. The PSSM method scores an $l$-mer $s$ by

$$\sum_{i=1}^{l} \log f_i(s_i), \qquad (3.2.1)$$

where $s_i$ denotes the $i$-th letter in $s$. We note that usually the ratio $f_i(s_i)/f(s_i)$ is used in place of $f_i(s_i)$, where $f(s_i)$ is the background probability of $s_i$. The simpler form in (3.2.1) was compared in [69] and hence it serves as a baseline method in this study.

The ULPB method models a TFBS by a first-order Markov chain and models the background by another first-order Markov chain. The background transition probabilities are estimated using the entire genome of a species and hence the ULPB method uses negative examples implicitly. It scores an $l$-mer $s$ by

$$\log f_1(s_1) + \sum_{i=1}^{l-1} \log\left( \frac{f_i(s_{i+1} \mid s_i)}{f(s_{i+1} \mid s_i)} \right). \qquad (3.2.2)$$

Although ULPB does not consider background probability in the first term of (3.2.2), the score is approximately the log-likelihood ratio of the two Markov chains.

The main difference between the PSSM method and the NPV, ODV and ULPB methods is that the PSSM method does not score nucleotide pairs nor does it utilize a background distribution. The NPV and ODV methods explicitly take advantage of negative binding sites, while the ULPB method does it implicitly by using a background distribution. The flexibility of the proposed framework allows the NPV and ODV methods to easily search in subspaces, further distinguishing the PSSM and ULPB methods from the proposed ones.
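The two baseline scores in (3.2.1) and (3.2.2) can be sketched as below. The probabilities themselves are defined in Section 3.2.2 and are taken here as given inputs; the container layout (lists of dictionaries keyed by nucleotide or by nucleotide pair) and the function names are assumptions made for the example, and all probabilities are assumed to be strictly positive.

```python
import math


def pssm_score(s, f):
    """Sum of log f_i(s_i), as in (3.2.1); f[i][a] is the frequency of a at position i."""
    return sum(math.log(f[i][a]) for i, a in enumerate(s))


def ulpb_score(s, f1, trans, bg):
    """log f_1(s_1) plus summed log-ratios of TFBS to background transitions, as in (3.2.2).

    trans[i][(a, c)] stands for f_i(c | a) and bg[(a, c)] for f(c | a).
    """
    total = math.log(f1[s[0]])
    for i in range(len(s) - 1):
        key = (s[i], s[i + 1])
        total += math.log(trans[i][key] / bg[key])
    return total
```

When the TFBS transition probabilities equal the background ones, every ratio in (3.2.2) is 1 and the ULPB score reduces to $\log f_1(s_1)$.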
3.3 Results and Discussion
3.3.1 Performance Assessment and Evaluation Metrics
The performance of a TFBS search method is evaluated by $\nu$-fold cross-validation (CV). Consider a TF with $n_+$ TFBSs of length $l$ with flanking regions on both sides. A set of negative examples, $N_{test}$, called the test negatives is constructed from the TFBSs of the other TFs with filtering as in [63]. Another set of negative examples, $N_{train}$, called the training negatives is collected from sequences embedding the $n_+$ binding sites. It comprises all the $l$-mers except for the TFBSs and two neighboring $l$-mers of each TFBS.

The $n_+$ TFBSs are first divided into $\nu$ sets, each of which contains $\lfloor n_+/\nu \rfloor$ or $\lfloor n_+/\nu \rfloor + 1$ TFBSs. At each iteration of $\nu$-fold CV, one of the $\nu$ TFBS sets, called the test TFBS set $P_{test}$, is left out. The rest of the TFBSs are therefore called the training TFBSs. A scoring function is obtained using the training TFBSs and non-TFBSs randomly sampled from the training negatives, where the ratio of numbers of non-TFBSs to TFBSs is set to 10. The test TFBSs in $P_{test}$ along with the non-TFBSs in $N_{test}$ are then scored by the scoring function. To score a test sequence, both the forward and reverse strands are scored and, in case the test sequence is longer or shorter than $l$, the $l$-mer producing the highest score is used. For each test TFBS $t \in P_{test}$, we find its rank relative to all the non-TFBSs in $N_{test}$. Formally, the rank of $t$ equals $1 + |\{s \in N_{test} \mid \mathrm{Score}(s) \geq \mathrm{Score}(t)\}|$.

After the $\nu$-fold CV, we end up with $n_+$ ranks, each of which corresponds to a TFBS. To allow comparison of methods, we use the area under the ROC curve (AUC) to gauge the performance of a method on the TF. The ROC curve is a plot of true positive rate (TPR) against false positive rate (FPR), displaying the trade-off between TPR and FPR. We refer readers to [23] for an introduction to this metric. In this study, $\nu = 10$ for all the CV experiments.
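The rank definition above translates directly into code. The AUC helper that accompanies it is an assumption: the text does not spell out how the AUC is computed from the ranks, so the sketch uses the standard pairwise (Mann-Whitney) form, in which ties count one half. The function names are hypothetical.

```python
def rank_of(t_score, neg_scores):
    """Rank of a test TFBS: 1 + number of test negatives scoring at least as high."""
    return 1 + sum(1 for s in neg_scores if s >= t_score)


def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count one half (Mann-Whitney form); this choice is an assumption.
    """
    wins = 0.0
    for t in pos_scores:
        for s in neg_scores:
            wins += 1.0 if t > s else 0.5 if t == s else 0.0
    return wins / (len(pos_scores) * len(neg_scores))
```

For example, against negative scores `[0.1, 0.4, 0.35]`, a TFBS scoring 0.5 has rank 1 and a TFBS scoring 0.35 has rank 3, since two negatives score at least 0.35.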
For the NPV and ODV methods, the best weight and subspace combination is obtained at each iteration of the $\nu$-fold CV. Specifically, another $(\nu - 1)$-fold CV is performed on the $\nu - 1$ sets of TFBSs to search for the best combination.
3.3.2 Prokaryotic Transcription Factor Binding Sites
To understand the behavior of search methods on prokaryotic TF binding sites, we conducted 10-fold cross-validation experiments on the 26-TF RegulonDB data set. The proposed NPV and ODV methods were compared to the ULPB method [69]. The PSSM method, considered in [69], was also included for comparison since it served as a simple baseline method.

Figure 3.3.1a shows the plot of area under the ROC curve (AUC) across the 26 TFs for each method. We can see that the ODV method has the best AUC on 12 out of 26 TFs and the NPV method has the best AUC on 9 out of 26 TFs whereas the ULPB and PSSM methods have the best AUC on 1 and 4 TFs, respectively. To gauge the relative performance between two methods, statistical tests [87] were performed on all the 6 pairs of methods. Figure 3.3.1b shows the p-values of the pairwise comparisons. We first notice that, consistent with the results in [69], ULPB outperformed PSSM, although the p-value of 0.0679 is slightly above the usual 0.05 significance cut-off. As seen in Figure 3.3.1b, the NPV and ODV methods are significantly better than the PSSM and ULPB methods. The ODV method has the best AUC on more TFs than the NPV method, which suggests that it benefited from optimization, although the difference between the two is not significant (p-value: 0.291) and minimizing the objective function in (3.2.1) does not guarantee maximization of the AUC.
3.3.3 Eukaryotic Transcription Factor Binding Sites
Here we compare the proposed NPV and ODV methods to the ULPB and PSSM methods on eukaryotic TF binding sites. As in the previous section, we conducted 10-fold cross-validation experiments on the 28-TF JASPAR data set. Figure 3.3.2a shows the plot of AUC across the 28 TFs for each method. We can see that both the ODV and NPV methods have the best AUC on 13 out of 28 TFs while the ULPB and PSSM methods have the best AUC on 6 and 4 TFs, respectively. All the methods have the best AUC of 1 on MA0149.1 and MA0115, while the ODV, NPV and PSSM methods have the best AUC of 0.999 on MA0137.2.

Similarly, statistical tests [87] were performed on all the 6 pairs of methods. Figure 3.3.2b shows that the NPV and ODV methods are significantly better than the PSSM and ULPB methods. ULPB is significantly better than PSSM, which is again consistent with the results reported in [69]. Overall, the relative performance of the four methods remains unchanged as we shift from prokaryotic transcription factors to eukaryotic ones. This suggests that a TFBS search method effective on prokaryotic transcription factors is likely to be effective on eukaryotic transcription factors and vice versa.
3.3.4 Motif Subtype Identification in Vector Spaces
It has been shown that the binding sites of a TF can be better represented by two motif subtypes than by a single motif [35, 28]. When searching for new binding sites, two position-specific scoring matrices are used to score an $l$-mer and the higher score of the two is assigned to this $l$-mer. Searching with two PSSMs was shown to be superior to searching with a single PSSM by cross-species conservation statistics in these studies.
We demonstrate that motif subtypes can be readily identified once we embed $l$-mers in a vector space. The purpose here, however, is not to compare motif subtype identification algorithms. We adopted a slightly different approach to motif subtype identification from those in previous work [35, 28], while the idea is similar. As usual, all the $l$-mers were first embedded in a vector space. The known binding sites of a TF were clustered into two subtypes by the k-means algorithm [17]. Immediately, we have a variant of the NPV method called the kNPV method, where $k = 2$ denotes the number of motif subtypes. The kNPV method first computes the mean vectors of these two subtypes, $\mu_{+1}$ and $\mu_{+2}$, and scores an $l$-mer $s$ by

$$\mathrm{Score}(s) = \max\left\{ \mathbf{s}^T (\mu_{+1} - \mu_-),\ \mathbf{s}^T (\mu_{+2} - \mu_-) \right\},$$

where $\mu_-$ is the mean vector of the non-binding sites. Figure 3.3.3 illustrates the kNPV method.

Similarly, the kODV method scores an $l$-mer $s$ by

$$\mathrm{Score}(s) = \max\left\{ \mathbf{s}^T \beta_{+1} / \|\beta_{+1}\|,\ \mathbf{s}^T \beta_{+2} / \|\beta_{+2}\| \right\},$$

where $\beta_{+i}$ is obtained using TFBSs in cluster $i$, $i = 1, 2$. Unlike the kNPV method, the lengths of $\beta_{+i}$'s may be very different and hence $\beta_{+i}$'s are scaled to unit vectors so as not to bias the scoring function. We note that the choice of $k = 2$ came from previous studies [35, 28]. Generally, $k$ can be greater than 2 or even automatically selected [37]. This however is beyond the scope of this study and may be investigated in the future.
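The two subtype scoring rules can be sketched as follows. The sketch assumes the subtype mean vectors and the per-cluster $\beta_{+i}$ vectors have already been obtained (the clustering step is not shown), and the function names are hypothetical. It is written for any number of subtypes, of which $k = 2$ is the case used in this study.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def knpv_score(s_vec, mu_subtypes, mu_neg):
    """Maximum over subtypes of s^T (mu_{+i} - mu_-)."""
    return max(
        dot(s_vec, [a - b for a, b in zip(mu, mu_neg)]) for mu in mu_subtypes
    )


def kodv_score(s_vec, betas):
    """Maximum over subtypes of s^T beta_{+i} / ||beta_{+i}|| (unit-length betas)."""
    return max(dot(s_vec, beta) / math.sqrt(dot(beta, beta)) for beta in betas)
```

The normalization in `kodv_score` matters: with `betas = [[3.0, 4.0], [10.0, 0.0]]` and `s_vec = [1.0, 0.0]`, the two normalized scores are 0.6 and 1.0, whereas the raw dot products would be 3 and 10.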
We assessed the kNPV and kODV methods by 10-fold cross-validation on both the RegulonDB and JASPAR data sets. Figure 3.3.4 shows the results in terms of AUC. We observe in Figure 3.3.4a that overall introducing motif subtypes into the NPV and ODV methods improves the search performance (p-values: $6.41 \times 10^{-7}$ and $8.31 \times 10^{-5}$, respectively). Results in Figure 3.3.4b also support this observation (p-values: $1.61 \times 10^{-3}$ and $3.04 \times 10^{-3}$, respectively). The kNPV and kODV methods are comparable on both the RegulonDB and JASPAR data sets (p-values: 0.197 and 0.47, respectively). These results are consistent with those reported in [35, 28].
3.3.5 Independent validation on ChIP-seq Data
To evaluate the proposed NPV and ODV methods on the whole genome scale, we built TF models using TFBSs in the JASPAR database to scan all the human (build hg19) 1000-base promoter sequences obtained from the UCSC Genome Browser database [26]. ChIP-seq peaks from the ENCODE project were also retrieved [68]. Specifically, the wgEncodeRegTfbsClusteredV2 table of build hg19 was obtained. We checked TFs in Table 3.2.18 against the annotations and found 14 JASPAR TFs, recognized by 17 antibodies present in the ENCODE annotations. The mapping is listed in the first 3 columns of Table 3.3.19.
Table 3.2.18: Statistics of TFs in the JASPAR Database

Mus musculus
ID          Name            Length   # TFBSs
MA0039.2    Klf4            10       4336
MA0047.2    Foxa2           12       809
MA0062.2    GABPA           11       87
MA0065.2    PPARG::RXRA     15       839
MA0104.2    Mycn            26       85
MA0141.1    Esrrb           12       3613
MA0142.1    Pou5f1          15       1332
MA0143.1    Sox2            15       666
MA0144.1    Stat3           19       830
MA0145.1    Tcfcp2l1        14       3931
MA0146.1    Zfx             20       477
MA0147.1    Myc             10       682
MA0154.1    EBF1            10       21

Homo sapiens
ID          Name            Length   # TFBSs
MA0037      GATA3           6        20
MA0052      MEF2A           10       31
MA0077      SOX9            9        45
MA0080.2    SPI1            7        35
MA0083      SRF             12       26
MA0112.2    ESR1            20       472
MA0115      NR1H2::RXRA     17       22
MA0137.2    STAT1           15       2082
MA0138      REST            19       22
MA0138.2    REST            11       871
MA0139.1    CTCF            11       944
MA0148.1    FOXA1           11       896
MA0149.1    EWSR1-FLI1      17       101
MA0159.1    RXR::RAR DR5    17       23
MA0258.1    ESR2            18       356
[Figure 3.3.1: (a) AUC (vertical axis) of the NPV, ODV, ULPB and PSSM methods for each of the 26 RegulonDB TFs (horizontal axis). (b) Matrix of p-values from pairwise comparisons:]

        PSSM        ULPB        NPV         ODV
PSSM    0.5         0.0679      7.24e-05    0.000802
ULPB    0.0679      0.5         0.00817     0.000802
NPV     7.24e-05    0.00817     0.5         0.291
ODV     0.000802    0.000802    0.291       0.5

Figure 3.3.1: Comparison of the PSSM, ULPB, NPV and ODV Methods on the RegulonDB Data Set

(a) Plot of AUC values across the 26 prokaryotic TFs for each method. (b) Matrix of p-values from pairwise comparisons. A red solid circle in row i and column j indicates that method i outperformed method j, while a blue one in row i and column j indicates that method i is inferior to method j. The size and darkness of a circle imply the significance of the relationship between two methods. The larger and darker a circle, the more significant the relationship. White background indicates exceeding the usual 0.05 significance cut-off, while gray background indicates the opposite.
[Figure 3.3.2: (a) AUC (vertical axis) of the NPV, ODV, ULPB and PSSM methods for each of the 28 JASPAR TFs (horizontal axis). (b) Matrix of p-values from pairwise comparisons:]

        PSSM        ULPB        NPV         ODV
PSSM    0.5         0.0468      2.42e-05    7.71e-05
ULPB    0.0468      0.5         0.0154      0.00189
NPV     2.42e-05    0.0154      0.5         0.39
ODV     7.71e-05    0.00189     0.39        0.5

Figure 3.3.2: Comparison of the PSSM, ULPB, NPV and ODV Methods on the JASPAR Data Set

(a) Plot of AUC values across the 28 eukaryotic TFs for each method. (b) Matrix of p-values from pairwise comparisons. A red solid circle in row i and column j indicates that method i outperformed method j, while a blue one in row i and column j indicates that method i is inferior to method j. The size and darkness of a circle imply the significance of the relationship between two methods. The larger and darker a circle, the more significant the relationship. White background indicates exceeding the usual 0.05 significance cut-off, while gray background indicates the opposite.
[Figure 3.3.3: scatter plot of TFBS and non-TFBS vectors on Axis 1 (horizontal) and Axis 2 (vertical), with $\mu_-$, $\mu_{+1}$ and $\mu_{+2}$ marked.]

Figure 3.3.3: Illustration of the kNPV Method

The solid arrows represent the negative-to-positive vectors $\mu_{+1} - \mu_-$ and $\mu_{+2} - \mu_-$, pointing from $\mu_-$ to $\mu_{+1}$ and $\mu_{+2}$, respectively. The hollow triangles denote the known binding sites, whereas the circles represent the known non-binding sites. The centers of the binding site vectors are marked by the solid triangles, while the center of the non-binding site vectors is marked by the solid circle.
[Figure 3.3.4: AUC (vertical axis) of the ODV, kODV, NPV and kNPV methods for (a) each of the 26 RegulonDB TFs and (b) each of the 28 JASPAR TFs (horizontal axis).]

Figure 3.3.4: The kNPV (kODV) Method Versus the NPV (ODV) Method

The number of motif subtypes k is set to 2. (a) Plot of AUC values across the 26 prokaryotic TFs for each method. (b) Plot of AUC values across the 28 eukaryotic TFs for each method.

Table 3.3.19: Results of Independent Validation on ChIP-seq Data
ENCODE               JASPAR      Name     PSSM      ULPB      NPV       S(1)  IC   ODV       S(1)  IC
GATA3 (SC-268)       MA0037      GATA3    0.48922   0.46841   0.50963   1     Y    0.51441   1     Y
MEF2A                MA0052      MEF2A    0.42566   0.45955   0.35283   3     Y    0.34807   3     N
PU.1                 MA0080.2    SPI1     0.50631   0.49267   0.57575   3     Y    0.58014   3     N
SRF                  MA0083      SRF      0.34299   0.38457   0.43920   2     N    0.43183   3     N
NRSF                 MA0138      REST     0.50615   0.46371   0.46603   1     N    0.47956   2     N
NRSF                 MA0138.2    REST     0.48031   0.48299   0.49070   3     Y    0.49522   3     N
ERalpha a            MA0112.2    ESR1     0.53980   0.49058   0.52414   3     N    0.52146   1     N
STAT1                MA0137.2    STAT1    0.55348   0.58555   0.61733   1     N    0.62338   1     Y
CTCF                 MA0139.1    CTCF     0.60370   0.60377   0.63785   2     Y    0.64769   2     Y
CTCF (C-20)          MA0139.1    CTCF     0.44108   0.44696   0.53181   2     Y    0.54306   2     Y
CTCF (SC-5916)       MA0139.1    CTCF     0.46729   0.47047   0.54097   2     Y    0.55028   2     Y
FOXA1 (C-20)         MA0148.1    FOXA1    0.48083   0.48698   0.48994   3     Y    0.49853   3     N
FOXA1 (SC-101058)    MA0148.1    FOXA1    0.48897   0.48326   0.49945   3     Y    0.50986   3     N
EBF                  MA0154.1    EBF1     0.50011   0.51202   0.56084   3     Y    0.59172   3     N
EBF1 (C-8)           MA0154.1    EBF1     0.42214   0.43705   0.52067   3     Y    0.53207   3     N
FOXA2 (SC-6554)      MA0047.2    Foxa2    0.48328   0.39496   0.45500   3     Y    0.47906   3     N
STAT3                MA0144.1    Stat3    0.39145   0.33052   0.38094   3     Y    0.43807   3     Y
POU5F1 (SC-9081)     MA0142.1    Pou5f1   0.42151   0.42793   0.40855   3     N    0.45449   3     N

(1) Subspace: 1 = $\mathbb{R}^{4l}$; 2 = $\mathbb{R}^{16l-12}$; 3 = $\mathbb{R}^{36l-48}$.

For the NPV and ODV methods, the best weight and subspace combination was found by 5-fold cross-validation on the JASPAR TFBSs, while flanking genomic sequences of the TFBSs were the sources of negative binding sites.
To assess the 4 compared methods, we considered the part of a ROC curve where FPR is at most 0.01 and calculated the AUC scaled to between 0 and 1. This is nearly equivalent to allowing at most 10 false positive hits per promoter on average. As a peak spans about 200 bases, it is considered recalled when it fully contains a predicted binding site. Similarly, a predicted binding site must be fully covered by a peak to be a true positive hit.
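The containment rule for matching predicted binding sites to ChIP-seq peaks can be sketched as below. The coordinate convention (half-open `(start, end)` intervals on the same chromosome) and the function names are assumptions made for the example; the text does not specify how coordinates were represented.

```python
def contained(site, peak):
    """True if the interval site = (start, end) lies entirely inside peak."""
    return peak[0] <= site[0] and site[1] <= peak[1]


def evaluate_hits(sites, peaks):
    """Return (true positive sites, recalled peaks) under the full-containment rule.

    A predicted site is a true positive if some peak fully covers it; a peak is
    recalled if it fully contains some predicted site.
    """
    true_pos = [s for s in sites if any(contained(s, p) for p in peaks)]
    recalled = [p for p in peaks if any(contained(s, p) for s in sites)]
    return true_pos, recalled
```

With peaks `[(100, 300), (500, 700)]` and predicted sites `[(120, 131), (295, 306), (900, 911)]`, only the first site is a true positive and only the first peak is recalled: the second site extends past the end of the peak it overlaps.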
In Table 3.3.19, we observe that ODV, NPV, ULPB and PSSM produced the best AUC on 13, 1, 1 and 3 out of 18 tests, respectively. Statistical tests showed that ODV significantly outperformed the other 3 methods (p-values $\leq$ 0.0028), NPV significantly outperformed ULPB and PSSM (p-values $\leq$ 0.0449), and ULPB and PSSM are comparable (p-value: 0.433). We notice that both NPV and ODV performed worse than the other two methods on MEF2A. As NPV and ODV both sample negative examples from flanking sequences of TFBSs, we suspect that this is one example where the flanking sequences do not represent well the entire promoters. ODV performed consistently across tests corresponding to the same JASPAR ID such as the three for CTCF. Examining the best weight and subspace selected for NPV and for ODV, we can see that the subspace agrees on 11 out of 14 TF models, while the weight agrees on only 7 of them. The latter may be because ODV optimizes the $\beta$ vector and hence is less sensitive to the weight used to embed an $l$-mer.
3.4 Conclusions
In this chapter, we proposed to search for transcription factor binding sites in vector spaces. The novel NPV and ODV methods were introduced to construct a query vector to search for binding sites of a TF. We compared our methods to a state-of-the-art method, the ULPB method, and the widely-used PSSM method. Cross-validation experiments revealed that the NPV and ODV methods significantly outperformed the ULPB and PSSM methods on prokaryotic as well as eukaryotic TF binding sites. Independent validation on human ChIP-seq data further verified that the NPV and ODV methods are significantly better than the other compared methods.

One of the advantages of our framework is that it allows one to easily search for binding sites in various subspaces. Hence, one can search in the best subspace for each individual TF since one can hardly find an optimal subspace for all the TFs. Another advantage is that under the proposed framework one can readily identify motif subtypes for a TF. Hence, to exploit this advantage, we introduced the kNPV and kODV methods, immediate variants of the NPV and ODV methods. We demonstrated that, consistent with results in previous studies, kNPV (kODV) significantly improved on NPV (ODV) on the two data sets.

Our future work aims to extend our proposed methods to handle known binding sites of variable lengths. We will seek to approach this problem without resorting to multiple sequence alignment, which is notoriously time-consuming. In the meantime, we will also seek to identify additional promising subspaces in which to search for TF binding sites.
Bibliography
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1st edition, May 1999.
[2] T. L. Bailey, M. Boden, F. A. Buske, M. Frith, C. E. Grant, L. Clementi, J. Ren, W. W. Li, and W. S. Noble. Meme suite: tools for motif discovery and searching. Nucleic Acids Research, 37(suppl 2):W202–W208, 2009.
[3] T. L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pages 28–36, 1994.
[4] S. Balla, V. Thapar, S. Verma, T. Luong, T. Faghri, C.-H. Huang, S. Rajasekaran, J. J. del Campo, J. H. Shinn, W. A. Mohler, M. W. Maciejewski, M. R. Gryk, B. Piccirillo, S. R. Schiller, and M. R. Schiller. Minimotif miner: a tool for investigating protein function. Nature methods, 3(3):175–177, March 2006.
[5] Y. Barash, G. Bejerano, and N. Friedman. A simple hyper-geometric approach for discovering putative transcription factor binding sites. In WABI ’01: Proceedings of the First International Workshop on Algorithms in Bioinformatics, pages 278–293, London, UK, 2001. Springer-Verlag.
[6] M. F. Berger and M. L. Bulyk. Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nature protocols, 4(3):393–411, 2009.
[7] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.
[8] C. Bi, J. Leeder, and C. Vyhlidal. A comparative study on computational two-block motif detection: algorithms and applications. Mol Pharm, 5(1):3–16, 2008.
[9] S. G. Blumenthal, G. Aichele, T. Wirth, A. P. Czernilofsky, A. Nordheim, and J. Dittmer. Regulation of the human interleukin-5 promoter by ets transcription factors: Ets1 and ets2, but not elf-1, cooperate with gata3 and htlv-i tax1. Journal of Biological Chemistry, 274(18):12910–12916, 1999.
[10] J. C. Bryne, E. Valen, M.-H. E. Tang, T. Marstrand, O. Winther, I. da Piedade, A. Krogh, B. Lenhard, and A. Sandelin. Jaspar, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Research, 36(suppl 1):D102–D106, 2008.
[11] J. Buhler and M. Tompa. Finding motifs using random projections. In RECOMB ’01: Proceedings of the fifth annual international conference on Computational biology, pages 69–76, New York, NY, USA, 2001. ACM.
[12] K. Cartharius, K. Frech, K. Grote, B. Klocke, M. Haltmeier, A. Klingenhoff, M. Frisch, M. Bayerlein, and T. Werner. Matinspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics, 21(13):2933–2942, 2005.
[13] D. S. Chekmenev, C. Haid, and A. E. Kel. P-match: transcription factor binding site search by combining patterns and weight matrices. Nucleic Acids Research, 33(suppl 2):W432–W437, 2005.
[14] T. E. P. Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414):57–74, Sept. 2012.
[15] G. E. Crooks, G. Hon, J.-M. Chandonia, and S. E. Brenner. Weblogo: A sequence logo generator. Genome Research, 14(6):1188–1190, June 2004.
[16] E. V. Davydov, D. L. Goode, M. Sirota, G. M. Cooper, A. Sidow, and S. Batzoglou. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS computational biology, 6(12):e1001025+, Dec. 2010.
[17] M. J. de Hoon, S. Imoto, J. Nolan, and S. Miyano. Open source clustering software. Bioinformatics, 20(9):1453–1454, June 2004.
[18] H. T. Do and D. Wang. Overlap-based similarity metrics for motif search in dna sequences. In ICONIP ’09: Proceedings of the 16th International Conference on Neural Information Processing, pages 465–474, Berlin, Heidelberg, 2009. Springer-Verlag.
[19] T. R. Dreszer, D. Karolchik, A. S. Zweig, A. S. Hinrichs, B. J. Raney, R. M. Kuhn, L. R. Meyer, M. Wong, C. A. Sloan, K. R. Rosenbloom, G. Roe, B. Rhead, A. Pohl, V. S. Malladi, C. H. Li, K. Learned, V. Kirkup, F. Hsu, R. A. Harte, L. Guruvadoo, M. Goldman, B. M. Giardine, P. A. Fujita, M. Diekhans, M. S. Cline, H. Clawson, G. P. Barber, D. Haussler, and W. James Kent. The ucsc genome browser database: extensions and updates 2011. Nucleic Acids Research, 40(D1):D918–D923, 2012.
[20] S. R. Eddy. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS computational biology, 4(5):e1000069+, May 2008.
[21] J. Ernst, H. L. Plasterer, I. Simon, and Z. Bar-Joseph. Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Research, 20(4):526–536, 2010.
[22] P. J. Farnham. Insights from genomic profiling of transcription factors. Nat Rev Genet, 10(9):605–616, 2009.
[23] T. Fawcett. An introduction to roc analysis. Pattern Recogn. Lett., 27:861–874, June 2006.
[24] E. Fazius, V. Shelest, and E. Shelest. Sitar: a novel tool for transcription factor binding site prediction. Bioinformatics, 27(20):2806–2811, 2011.
[25] P. M. Fordyce, D. Pincus, P. Kimmig, C. S. Nelson, H. El-Samad, P. Walter, and J. L. DeRisi. Basic leucine zipper transcription factor hac1 binds dna in two distinct modes as revealed by microfluidic analyses. Proceedings of the National Academy of Sciences, 109(45):E3084–E3093, 2012.
[26] P. A. Fujita, B. Rhead, A. S. Zweig, A. S. Hinrichs, D. Karolchik, M. S. Cline, M. Goldman, G. P. Barber, H. Clawson, A. Coelho, M. Diekhans, T. R. Dreszer, B. M. Giardine, R. A. Harte, J. Hillman-Jackson, F. Hsu, V. Kirkup, R. M. Kuhn, K. Learned, C. H. Li, L. R. Meyer, A. Pohl, B. J. Raney, K. R. Rosenbloom, K. E. Smith, D. Haussler, and W. J. Kent. The ucsc genome browser database: update 2011. Nucleic Acids Research, 39(suppl 1):D876–D882, 2011.
[27] S. Gama-Castro, V. Jiménez-Jacinto, M. Peralta-Gil, A. Santos-Zavaleta, M. I. Peñaloza Spinola, B. Contreras-Moreira, J. Segura-Salazar, L. Muñiz Rascado, I. Martínez-Flores, H. Salgado, C. Bonavides-Martínez, C. Abreu-Goodger, C. Rodríguez-Penagos, J. Miranda-Ríos, E. Morett, E. Merino, A. M. Huerta, L. Treviño Quintanilla, and J. Collado-Vides. RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Research, 36(suppl 1):D120–D124, 2008.
[28] B. Georgi and A. Schliep. Context-specific independence mixture modeling for positional weight matrices. Bioinformatics, 22(14):e166–e173, 2006.
[29] S. Georgiev, A. Boyle, K. Jayasurya, X. Ding, S. Mukherjee, and U. Ohler. Evidence-ranked motif identification. Genome Biology, 11(2):R19, 2010.
[30] D. G. Gilbert. eugenes: a eukaryote genome information system. Nucleic Acids Research, 30(1):145–148, 2002.
[31] G. E. Gonye, P. Chakravarthula, J. S. Schwaber, and R. Vadigepalli. From promoter analysis to transcriptional regulatory network prediction using paint. 408, August 2007.
[32] O. L. Griffith, S. B. Montgomery, B. Bernier, B. Chu, K. Kasaian, S. Aerts, S. Mahony, M. C. Sleumer, M. Bilenky, M. Haeussler, M. Griffith, S. M. Gallo, B. Giardine, B. Hooghe, P. Van Loo, E. Blanco, A. Ticoll, S. Lithwick, E. Portales-Casamar, I. J. Donaldson, G. Robertson, C. Wadelius, P. De Bleser, D. Vlieghe, M. S. Halfon, W. Wasserman, R. Hardison, C. M. Bergman, S. J. Jones, and T. O. R. A. Consortium. Oreganno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Research, 36(suppl 1):D107–D113, 2008.
[33] S. Gupta, J. Stamatoyannopoulos, T. Bailey, and W. Noble. Quantifying similarity between motifs. Genome Biology, 8(2):R24, 2007.
[34] S. Hannenhalli. Eukaryotic transcription factor binding sites–modeling and integrative search methods. Bioinformatics, 24(11):1325–1331, 2008.
[35] S. Hannenhalli and L.-S. Wang. Enhanced position weight matrices using mixture models. Bioinformatics, 21(suppl 1):i204–212, 2005.
[36] G. Z. Hertz and G. D. Stormo. Identifying dna and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7):563–577, 1999.
[37] A. K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8):651–666, June 2010.
[38] D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold. Genome-wide mapping of in vivo protein-dna interactions. Science, 316(5830):1497–1502, 2007.
[39] I. T. Jolliffe. Principal Component Analysis. Springer, second edition, October 2002.
[40] A. Kel, E. Gößling, I. Reuter, E. Cheremushkin, O. Kel-Margoulis, and E. Wingender. Match™: a tool for searching transcription factor binding sites in dna sequences. Nucleic Acids Research, 31(13):3576–3579, 2003.
[41] S. M. Kiełbasa, H. Klein, H. G. Roider, M. Vingron, and N. Blüthgen. Transfind–predicting transcriptional regulators for gene sets. Nucleic Acids Research, 38(suppl 2):W275–W280, 2010.
[42] T. Kozuka, M. Sugita, S. Shetzline, A. M. Gewirtz, and Y. Nakata. c-myb and gata-3 cooperatively regulate il-13 expression via conserved gata-3 response element and recruit mixed lineage leukemia (mll) for histone modification of the il-13 locus. The Journal of Immunology, 187(11):5974–5982, 2011.
[43] D. L. Kroshko. OpenOpt 0.36. http://openopt.org/, Sept. 2011.
[44] A. Kumar and L. Cowen. Recognition of beta-structural motifs using hidden markov models trained with simulated evolution. Bioinformatics, 26(12):i287–i293, 2010.
[45] M. Larkin, G. Blackshields, N. Brown, R. Chenna, P. McGettigan, H. McWilliam, F. Valentin, I. Wallace, A. Wilm, R. Lopez, J. Thompson, T. Gibson, and D. Higgins. Clustal w and clustal x version 2.0. Bioinformatics, 23(21):2947–2948, 2007.
[46] C. Lee, A. Abdool, and C.-H. Huang. Pca-based population structure inference with generic clustering algorithms. BMC Bioinformatics, 10(S-1):S73, 2009.
[47] C. Lee and C.-H. Huang. LASAGNA-Search 2.0: integrated transcription factor binding site search and visualization in a browser. submitted.
[48] C. Lee and C.-H. Huang. Geometric visualization of transcription factor binding sites in context. In Proceedings of the Second ACM International Conference on Bioinformatics and Computational Biology, BCB ’11, pages 457–461, New York, NY, USA, 2011. ACM.
[49] C. Lee and C.-H. Huang. Searching for transcription factor binding sites in vector spaces. BMC Bioinformatics, 13(1):215, 2012.
[50] C. Lee and C.-H. Huang. LASAGNA: A novel algorithm for transcription factor binding site alignment. BMC Bioinformatics, 14:108, 2013.
[51] C. Lee and C.-H. Huang. LASAGNA-Search: an integrated webtool for transcription factor binding site search and visualization. BioTechniques, 54(3):141–153, 2013.
[52] C. Lee, I. Mandoiu, and C. Nelson. Inferring ethnicity from mitochondrial dna sequence. BMC Proceedings, 5(Suppl 2):S11, 2011.
[53] D. L. Lee, H. Chuang, and K. Seamons. Document ranking and the vector-space model. IEEE Softw., 14:67–75, March 1997.
[54] L. Li. GADEM: A Genetic Algorithm Guided Formation of Spaced Dyads Coupled with an EM Algorithm for Motif Discovery. Journal of Computational Biology, 16(2):317–329, Feb. 2009.
[55] N. Li and M. Tompa. Analysis of computational approaches for motif discovery. Algorithms for Molecular Biology, 1(1):8, 2006.
[56] X. Liu, D. L. Brutlag, and J. S. Liu. Bioprospector: Discovering conserved dna motifs in upstream regulatory regions of co-expressed genes. In Pac. Symp. Biocomput, pages 127–138, 2001.
[57] C. T. Lopes, M. Franz, F. Kazi, S. L. Donaldson, Q. Morris, and G. D. Bader. Cytoscape web: an interactive web-based network browser. Bioinformatics, 26(18):2347–2348, 2010.
[58] V. D. Marinescu, I. S. Kohane, and A. Riva. The mapper database: a multi-genome catalog of putative transcription factor binding sites. Nucleic Acids Research, 33(suppl 1):D91–D97, 2005.
[59] V. Matys, O. V. Kel-Margoulis, E. Fricke, I. Liebich, S. Land, A. Barre-Dirrie, I. Reuter, D. Chekmenev, M. Krull, K. Hornischer, N. Voss, P. Stegmaier, B. Lewicki-Potapov, H. Saxel, A. E. Kel, and E. Wingender. TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes. Nucleic Acids Research, 34(suppl 1):D108–D110, 2006.
[60] D. E. Newburger and M. L. Bulyk. Uniprobe: an online database of protein binding microarray data on protein–DNA interactions. Nucleic Acids Research, 37(suppl 1):D77–D82, 2009.
[61] C. Notredame. Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol, 3(8):e123, 08 2007.
[62] Y. M. Oh, J. K. Kim, S. Choi, and J.-Y. Yoo. Identification of co-occurring transcription factor binding sites from dna sequence using clustered position weight matrices. Nucleic Acids Research, 40(5):e38, 2012.
[63] R. Osada, E. Zaslavsky, and M. Singh. Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics, 20(18):3516–3525, 2004.
[64] E. Portales-Casamar, D. Arenillas, J. Lim, M. I. Swanson, S. Jiang, A. McCallum, S. Kirov, and W. W. Wasserman. The pazar database of gene regulatory information coupled to the orca toolkit for the study of regulatory sequences. Nucleic Acids Research, 37(suppl 1):D54–D60, 2009.
[65] E. Portales-Casamar, S. Thongjuea, A. T. Kwon, D. Arenillas, X. Zhao, E. Valen, D. Yusuf, B. Lenhard, W. W. Wasserman, and A. Sandelin. Jaspar 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Research, 38(suppl 1):D105–D110, 2010.
[66] S. Rajasekaran, S. Balla, and C.-H. Huang. Exact algorithms for planted motif problems. Journal of Computational Biology, 12(8):1117–1128, 2005.
[67] A. Riva. The MAPPER2 Database: a multi-genome catalog of putative transcription factor binding sites. Nucleic Acids Research, 40(D1):D155–D161, Jan. 2012.
[68] K. R. Rosenbloom, T. R. Dreszer, M. Pheasant, G. P. Barber, L. R. Meyer, A. Pohl, B. J. Raney, T. Wang, A. S. Hinrichs, A. S. Zweig, P. A. Fujita, K. Learned, B. Rhead, K. E. Smith, R. M. Kuhn, D. Karolchik, D. Haussler, and W. J. Kent. Encode whole-genome data in the ucsc genome browser. Nucleic Acids Research, 38(suppl 1):D620–D625, 2010.
[69] R. A. Salama and D. J. Stekel. Inclusion of neighboring base interdependencies substantially improves genome-wide prokaryotic transcription factor binding site prediction. Nucl. Acids Res., 38(12):e135, 2010.
[70] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18:613–620, November 1975.
[71] A. Sandelin, W. W. Wasserman, and B. Lenhard. Consite: web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Research, 32(suppl 2):W249–W252, 2004.
[72] G. Sandve, O. Abul, V. Walseng, and F. Drablos. Improved benchmarks for computational motif discovery. BMC Bioinformatics, 8(1):193, 2007.
[73] T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht. Information content of binding sites on nucleotide sequences. Journal of Molecular Biology, 188(3):415–431, 1986.
[74] J. Schug. Using TESS to predict transcription factor binding sites in DNA sequence. In A. D. Baxevanis, editor, Current Protocols in Bioinformatics. J. Wiley and Sons, 2003.
[75] D. Shalon, S. J. Smith, and P. O. Brown. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Research, 6(7):639–645, July 1996.
[76] S. Sinha. Discriminative motifs. In RECOMB ’02: Proceedings of the sixth annual international conference on Computational biology, pages 291–298, New York, NY, USA, 2002. ACM.
[77] R. Staden. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Research, 12(1Part2):505–519, 1984.
[78] S. Struckmann, D. Esch, H. Scholer, and G. Fuellen. Visualization and exploration of conserved regulatory modules using rexspecies 2. BMC Evolutionary Biology, 11(1):267, 2011.
[79] K. T. Takusagawa and D. K. Gifford. Negative information for motif discovery. In Pacific Symposium on Biocomputing, pages 360–371, 2004.
[80] M. Thomas-Chollier, C. Herrmann, M. Defrance, O. Sand, D. Thieffry, and J. van Helden. Rsat peak-motifs: motif analysis in full-size chip-seq datasets. Nucleic Acids Research, 40(4):e31, 2012.
[81] J. D. Thompson, D. G. Higgins, and T. J. Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22(22):4673–4680, 1994.
[82] B. Tokovenko, R. Golda, O. Protas, M. Obolenskaya, and A. El’skaya. Cotrasif: conservation-aided transcription-factor-binding site finder. Nucleic Acids Research, 37(7):e49, 2009.
[83] J.-V. V. Turatsinze, M. Thomas-Chollier, M. Defrance, and J. van Helden. Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nature protocols, 3(10):1578–1588, Sept. 2008.
[84] A. Turpin and F. Scholer. User performance versus precision measures for simple search tasks. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’06, pages 11–18, New York, NY, USA, 2006. ACM.
[85] J. Vilo, A. Brazma, I. Jonassen, A. Robinson, and E. Ukkonen. Mining for putative regulatory elements in the yeast genome using gene expression data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 384–394. AAAI Press, 2000.
[86] J. Wang, M. F. Shannon, and I. G. Young. A role for ets1, synergizing with ap-1 and gata-3 in the regulation of il-5 transcription in mouse th2 lymphocytes. International Immunology, 18(2):313–323, February 2006.
[87] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
[88] C. Wong, Y. Li, C. Lee, and C.-H. Huang. Ensemble learning algorithms for classification of mtdna into haplogroups. Briefings in Bioinformatics, 12(1):1–9, 2011.
[89] C. Yanover, M. Singh, and E. Zaslavsky. M are better than one: an ensemble-based motif finder and its application to regulatory element prediction. Bioinformatics, 25(7):868–874, 2009.
[90] F. Zambelli, G. Pesole, and G. Pavesi. Pscan: finding over-represented transcription factor binding site motifs in sequences from co-regulated or co-expressed genes. Nucleic Acids Research, 37(suppl 2):W247–W252, 2009.
[91] E. Zaslavsky and M. Singh. A combinatorial optimization approach for diverse motif finding applications. Algorithms for Molecular Biology, 1(1):13, 2006.
[92] Y. Zhao and G. D. Stormo. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nature biotechnology, 29(6):480–483, June 2011.
[93] J. Zhu and M. Q. Zhang. Scpd: a promoter database of the yeast saccharomyces cerevisiae. Bioinformatics, 15(7):607–611, 1999.